MQSeries.net :: View topic - converting a UTF-8 Numeric Character Reference to ISO CP

converting a UTF-8 Numeric Character Reference to ISO CP

seshank

Posted: Thu Jul 21, 2011 3:18 am Post subject: converting a UTF-8 Numeric Character Reference to ISO CP

Newbie

Joined: 21 Jul 2011
Posts: 2

We have an XML message with a code page of 819 that contains some UTF-8 characters. These characters are represented using numeric character references (NCRs).

When the XML field containing the NCR is mapped to another XML structure, we get the error below when the message is written to the MQ output queue.

InputMessage :
<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<Employee><ID>12313213131</ID><Address>9-10-&#....33</Address></Employee>
In the above XML, .... stands for 8722;

ESQL CODE :

Code:

   CALL CopyMessageHeaders();

   DECLARE destRef REFERENCE TO OutputRoot;
   CREATE LASTCHILD OF OutputRoot AS destRef DOMAIN 'XMLNSC' NAME 'XMLNSC';
   SET destRef.(XMLNSC.XmlDeclaration)*.(XMLNSC.Attribute)Version = '1.0';
   SET destRef.(XMLNSC.XmlDeclaration)*.(XMLNSC.Attribute)Encoding = 'IS0-8859-1';

   SET destRef.Emp = InputRoot.XMLNSC.Employee;
   RETURN TRUE;

Error :
Source character ''1222'' in field ''3c003f0078006d006c002000760065007200730069006f006e003d0022003
1002e0030002200200065006e0063006f00640069006e0067003d00220049
00530030002d0038003800350039002d00310022003f003e003c0045006d00
70003e003c00490044003e003100320033003100330032003100330031003
30031003c002f00490044003e003c0041006400640072006500730073003e0
039002d00310030002d001222330033003c002f0041006400640072006500
730073003e003c002f0045006d0070003e00'' cannot be converted from Unicode to codepage '819'.

The source character is an invalid code point within the given codepage.

Quote:

I was expecting the same NCR to be copied to the output XML. Any idea why the error occurs?

skoobee

Posted: Thu Jul 21, 2011 4:12 am Post subject:

Acolyte

Joined: 26 Nov 2010
Posts: 52

Quote:

Source character ''1222'' in field '...' cannot be converted from Unicode to codepage '819'.

The source character is an invalid code point within the given codepage.

Seems pretty clear. Which part don't you understand?

seshank

Posted: Thu Jul 21, 2011 4:37 am Post subject:

Newbie

Joined: 21 Jul 2011
Posts: 2

The source character 1222 (ሢ) is not in the input message. I don't understand why &#8722 is getting changed to character 1222.

fjb_saper

Posted: Thu Jul 21, 2011 7:18 am Post subject:

Grand High Poobah

Joined: 18 Nov 2003
Posts: 20771
Location: LI,NY

seshank wrote:

The source character 1222 (ሢ) is not in the input message. I don't understand why &#8722 is getting changed to character 1222.

Most probably because &#8722 is not valid XML...

Quote:

All "&" chars need to be escaped in XML, so at best we should be seeing something like

Quote:

Also for non existing chars in the target system you may get substitution chars...

My advice: keep it in UTF-8 !

_________________
MQ & Broker admin

rekarm01

Posted: Thu Jul 21, 2011 6:27 pm Post subject: Re: converting a UTF-8 Numeric Character Reference to ISO CP

Grand Master

Joined: 25 Jun 2008
Posts: 1415

seshank wrote:

We have an XML message with a code page of 819 that contains some UTF-8 characters. These characters are represented using numeric character references (NCRs).

The XML parsers can include numeric character references when parsing input messages, but they don't use them when writing output messages. The broker routines for character conversion are not specific to XML, so they throw an exception for an invalid code point, rather than generate an XML character reference.
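The same behaviour can be reproduced outside the broker. The following is only a sketch in Python (not ESQL, and not the broker's own conversion routines): it assumes the input from the first post with the reference written out in full, and uses Python's `xmlcharrefreplace` error handler as a stand-in for an XML-aware writer.

```python
import xml.etree.ElementTree as ET

# Input as in the original post, with the reference written out in full.
source = (
    b'<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>'
    b'<Employee><ID>12313213131</ID>'
    b'<Address>9-10-&#8722;33</Address></Employee>'
)

# Parsing resolves the character reference to the real character U+2212.
address = ET.fromstring(source).findtext("Address")

# A plain code page conversion knows nothing about XML, so it fails
# on a character that ISO-8859-1 (code page 819) does not define.
try:
    address.encode("iso-8859-1")
    plain_conversion_ok = True
except UnicodeEncodeError:
    plain_conversion_ok = False

# An XML-aware fallback writes the reference back out instead.
as_latin1 = address.encode("iso-8859-1", errors="xmlcharrefreplace")

# UTF-8 can represent the character directly, so no fallback is needed.
as_utf8 = address.encode("utf-8")

print(plain_conversion_ok, as_latin1, as_utf8)
```

The point of the sketch is that the reference is gone once the message has been parsed; whether it comes back depends entirely on what the writer does with characters the target code page lacks.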

The easiest option is to use an output encoding that doesn't require character references, such as

fjb_saper wrote:

UTF-8 !

More difficult options require careful use of either opaque parsing or field type constants such as XMLNSC.AsisElementContent. Consult the documentation for more details before considering either of these alternatives.

seshank wrote:

I don't understand why &#8722 is getting changed to character 1222.

'&#8722;' (character reference) = U+2212 (Unicode) = X'12 22' (UCS-2 little endian)
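That arithmetic can be checked with a few lines of Python. This is a sketch only; the hex fragment is copied from the part of the error message that surrounds the Address value, and UTF-16-LE is used here as the equivalent of UCS-2 little endian for this character.

```python
# The character that the reference &#8722; points to (8722 is decimal).
minus = chr(8722)

# The same code point in the usual hexadecimal U+ notation.
code_point = f"U+{ord(minus):04X}"

# Its bytes in little-endian 16-bit form, as shown in the error message.
utf16le_hex = minus.encode("utf-16-le").hex()

# Fragment of the hex dump from the error, around the Address value.
fragment = "39002d00310030002d001222330033003c00"
decoded = bytes.fromhex(fragment).decode("utf-16-le")

print(code_point, utf16le_hex, decoded)
```

So the "1222" in the error is not a different character; it is the byte-swapped form of 2212.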

seshank wrote:

InputMessage:
<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<Employee><ID>12313213131</ID><Address>9-10-−33</Address></Employee>

U+2212 is a mathematical operator ("MINUS SIGN"). Mathematical operators probably don't belong in an element called <Address>.

smdavies99

Posted: Thu Jul 21, 2011 8:43 pm Post subject: Re: converting a UTF-8 Numeric Character Reference to ISO CP

Jedi Council

Joined: 10 Feb 2003
Posts: 6076
Location: Somewhere over the Rainbow this side of Never-never land.

rekarm01 wrote:

U+2212 is a mathematical operator ("MINUS SIGN"). Mathematical operators probably don't belong in an element called <Address>.

Yes they do.

For example,
17-21 Basingstoke Road

Perfectly valid.... Mostly used for business addresses.

I see this problem all the time where people cut/paste addresses from MS-Word docs (cp 850) into a web page (unknown CP, could be UTF or 817). Everything is fine until you want to parse the resulting XML.
So you have to scan the unparsed XML and replace the errant stuff.

The Euro sign is also a problem. 8859-1 won't hack it. You need 8859-15.
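The Euro sign point is easy to confirm. A small Python sketch, where `can_encode` is a hypothetical helper written for this illustration and the codec names are Python's, not the broker's CCSIDs:

```python
euro = "\u20ac"  # EURO SIGN


def can_encode(text, codec):
    """Return True if every character of text exists in the codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


in_8859_1 = can_encode(euro, "iso-8859-1")
in_8859_15 = can_encode(euro, "iso-8859-15")

# In ISO-8859-15 the Euro sign takes the slot that ISO-8859-1
# uses for the generic currency sign.
euro_byte = euro.encode("iso-8859-15")

print(in_8859_1, in_8859_15, euro_byte)
```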
_________________
WMQ User since 1999
MQSI/WBI/WMB/'Thingy' User since 2002
Linux user since 1995

Every time you reinvent the wheel the more square it gets (anon). If in doubt think and investigate before you ask silly questions.

fjb_saper

Posted: Fri Jul 22, 2011 12:06 am Post subject: Re: converting a UTF-8 Numeric Character Reference to ISO CP

Grand High Poobah

Joined: 18 Nov 2003
Posts: 20771
Location: LI,NY

smdavies99 wrote:

rekarm01 wrote:

U+2212 is a mathematical operator ("MINUS SIGN"). Mathematical operators probably don't belong in an element called <Address>.

Still it should not be a "minus" sign but a "-" dash sign...
I believe that even though they may have the same graphical representation, those are 2 different characters in UTF-8...

So yes just cut and paste can give you a different response than typing it again...

Have fun


rekarm01

Posted: Sun Jul 24, 2011 9:44 am Post subject: Re: converting a UTF-8 Numeric Character Reference to ISO CP

Grand Master

Joined: 25 Jun 2008
Posts: 1415

smdavies99 wrote:

rekarm01 wrote:

U+2212 is a mathematical operator ("MINUS SIGN"). Mathematical operators probably don't belong in an element called <Address>.

Yes they do.

No, they don't. Although some non‐ASCII characters may look similar (hyphens, dashes, minus signs, even ancient symbols for Roman weights and measures), they serve specific purposes, and are not interchangeable. The multipurpose ASCII '-' ("HYPHEN-MINUS") often works better, when specificity is not important and the implied meaning is clear. For example:

smdavies99 wrote:

17-21 Basingstoke Road

That uses the ASCII '-', which is usually OK, as opposed to the minus sign, which would change its meaning. Compare:

'17‑21 Basingstoke Road' ("NON-BREAKING HYPHEN"; word join: '"17‑21" Basingstoke Road')
'17‒21 Basingstoke Road' ("FIGURE DASH"; implied meaning)
'17–21 Basingstoke Road' ("EN DASH"; range: 'from 17 Basingstoke Road to 21 Basingstoke Road')
'17—21 Basingstoke Road' ("EM DASH"; interrupt: '17—no, wait—21!—21 Basingstoke Road')
'17−21 Basingstoke Road' ("MINUS SIGN"; subtraction: equivalent to '−4 Basingstoke Road')

Applications don't just display strings, particularly from XML messages. The wrong character can also affect layout, portability, searches, queries, comparisons, sorts, conversions, translations, transformations, parsing, interpreting, processing, etc. How a character looks is probably the least important thing about it.
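To make the "not interchangeable" point concrete, here is a short Python sketch. The two address strings are made up for illustration (one typed with the ASCII character, one containing the minus sign); the character names come from the standard `unicodedata` module.

```python
import unicodedata

# The lookalike characters from the list above, plus the ASCII one.
candidates = ["-", "\u2011", "\u2012", "\u2013", "\u2014", "\u2212"]
names = {f"U+{ord(c):04X}": unicodedata.name(c) for c in candidates}

typed = "17-21 Basingstoke Road"        # ASCII HYPHEN-MINUS
pasted = "17\u221221 Basingstoke Road"  # MINUS SIGN

# The strings look alike on screen but do not compare equal.
same = typed == pasted

# Splitting the house number range on '-' only works for the typed one.
typed_parts = typed.split(" ")[0].split("-")
pasted_parts = pasted.split(" ")[0].split("-")

for code, name in names.items():
    print(code, name)
print(same, typed_parts, pasted_parts)
```

Any search, comparison, or parsing step that expects the ASCII character will quietly miss the pasted variant.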

smdavies99 wrote:

I see this problem all the time where people cut/paste addresses from MS-Word docs...

That's a different problem. Using the default "AutoCorrect/AutoFormat" settings, MS Word replaces ASCII hyphens with dashes, not with the minus sign. None of the more common single‐byte code pages define the minus sign. Any use of the minus sign is almost certainly deliberate.

smdavies99 wrote:

... (cp 850) into a web page (unknown CP, could be UTF or 817).

MS Word is more likely to use a Windows code page (such as 1252), rather than a DOS (OEM) code page (such as 850). There is no 817 code page, (nor "UTF").
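The claim about single-byte code pages can be spot-checked for the code pages mentioned in this thread. This is a sketch using Python's codec names as stand-ins for the code pages; `encodable` is a hypothetical helper, and the check covers only the four code pages listed, not every single-byte code page in existence.

```python
dashes = {
    "EN DASH": "\u2013",
    "EM DASH": "\u2014",
    "MINUS SIGN": "\u2212",
}
code_pages = ["iso-8859-1", "iso-8859-15", "cp1252", "cp850"]


def encodable(ch, codec):
    """Return True if the codec defines a byte for this character."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# For each character, list the code pages that can represent it.
coverage = {
    name: [cp for cp in code_pages if encodable(ch, cp)]
    for name, ch in dashes.items()
}

for name, pages in coverage.items():
    print(name, pages)
```

Of these four, only Windows 1252 has the dashes that Word's AutoFormat produces, and none has the minus sign.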

smdavies99

Posted: Sun Jul 24, 2011 12:45 pm Post subject:

Jedi Council

Joined: 10 Feb 2003
Posts: 6076
Location: Somewhere over the Rainbow this side of Never-never land.

Whilst your points might be correct, do you honestly think that someone entering an address cares if they use any of these?

Quote:

'17‑21 Basingstoke Road' ("NON-BREAKING HYPHEN"; word join: '"17‑21" Basingstoke Road')
'17‒21 Basingstoke Road' ("FIGURE DASH"; implied meaning)
'17–21 Basingstoke Road' ("EN DASH"; range: 'from 17 Basingstoke Road to 21 Basingstoke Road')
'17—21 Basingstoke Road' ("EM DASH"; interrupt: '17—no, wait—21!—21 Basingstoke Road')
'17−21 Basingstoke Road' ("MINUS SIGN"; subtraction: equivalent to '−4 Basingstoke Road')

In my experience, they don't. With homage to GWTW, they don't give a damn.

I should have said 819, not 817. My mistake.

Re the use of 1252 or 850: I see data mismatches with 850 20 times as often as with 1252. You are right that MS-Word would produce 1252. However, there are a good number of other 'business applications' that have been around since the DOS days that still use 850 and even 866 (Cyrillic). Some of these apps are still sold today. Their devs have not bothered to fix their character set problems.
Our support people also have to 'fix' a wide range of other character set issues originating from cut/paste. The languages include Polish, Serbian, Lithuanian, Turkish and a variety of Cyrillic languages. Such are the joys of living in a multicultural society.
Thankfully, all the salesforce have been issued with iPads and the old cut/paste problems are a thing of the past.
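For anyone who has not seen an 850 versus 1252 mismatch, this is roughly what it looks like. A Python sketch with a made-up sample word; it assumes the data was written by a cp 850 application and read back as 1252, which is only one of the possible combinations.

```python
original = "Caf\u00e9"  # sample text with one accented letter

# A DOS-era application stores the text in code page 850.
stored = original.encode("cp850")

# A Windows application reads the same bytes as code page 1252.
misread = stored.decode("cp1252")

# Reversing the wrong step recovers the text, as long as every byte
# was defined in 1252 and nothing was altered in between.
repaired = misread.encode("cp1252").decode("cp850")

print(stored, misread, repaired)
```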

rekarm01

Posted: Mon Jul 25, 2011 3:00 am Post subject:

Grand Master

Joined: 25 Jun 2008
Posts: 1415

smdavies99 wrote:

do you honestly think that someone entering an address cares

That's a different question. Most of the time, users would just use the ASCII '-' on their keyboard (available in most countries), and be done with it. And most of the time, that would be the right thing to do. No worries. Problem solved.

But someone who goes to the extra trouble to hunt down an obscure Unicode character to use instead might also care about picking the right one.

fjb_saper

Posted: Mon Jul 25, 2011 4:57 am Post subject:

Grand High Poobah

Joined: 18 Nov 2003
Posts: 20771
Location: LI,NY

rekarm01 wrote:

smdavies99 wrote:

do you honestly think that someone entering an address cares

I think it would be adequate to say that most of the addresses get entered through cut and paste from somewhere else. Once you cut and paste nobody cares whether it should be a dash or any other of the signs that look like it...

