Author |
Message
|
seshank |
Posted: Thu Jul 21, 2011 3:18 am Post subject: converting a UTF-8 Numeric Character Reference to ISO CP |
|
|
Newbie
Joined: 21 Jul 2011 Posts: 2
|
We have an XML message that has a code page of 819 but contains some UTF-8 characters. These UTF-8 characters are represented using Numeric Character References (NCRs).
When the XML field containing the NCR is mapped to another XML structure, we get the error below when the message is written to the MQ output queue.
InputMessage :
<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<Employee><ID>12313213131</ID><Address>9-10-&#....33</Address></Employee>
In the above xml .... is 8722;
ESQL CODE :
Code: |
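CALL CopyMessageHeaders();
-- Create the XMLNSC output tree and point destRef at it
DECLARE destRef REFERENCE TO OutputRoot;
CREATE LASTCHILD OF OutputRoot AS destRef DOMAIN 'XMLNSC' NAME 'XMLNSC';
-- Write the XML declaration
SET destRef.(XMLNSC.XmlDeclaration)*.(XMLNSC.Attribute)Version = '1.0';
-- Note: 'IS0' is typed with a digit zero here (the error dump shows the same); the encoding name is 'ISO-8859-1'
SET destRef.(XMLNSC.XmlDeclaration)*.(XMLNSC.Attribute)Encoding = 'IS0-8859-1';
-- Copy the input Employee subtree, including the parsed Address value, to Emp
SET destRef.Emp = InputRoot.XMLNSC.Employee;
RETURN TRUE;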
|
Error :
Source character ''1222'' in field ''3c003f0078006d006c002000760065007200730069006f006e003d0022003
1002e0030002200200065006e0063006f00640069006e0067003d00220049
00530030002d0038003800350039002d00310022003f003e003c0045006d00
70003e003c00490044003e003100320033003100330032003100330031003
30031003c002f00490044003e003c0041006400640072006500730073003e0
039002d00310030002d001222330033003c002f0041006400640072006500
730073003e003c002f0045006d0070003e00'' cannot be converted from Unicode to codepage '819'.
The source character is an invalid code point within the given codepage.
Quote: |
I was expecting the same NCR to be copied to the output XML. Any idea why the error occurs? |
|
|
skoobee |
Posted: Thu Jul 21, 2011 4:12 am Post subject: |
|
|
Acolyte
Joined: 26 Nov 2010 Posts: 52
|
[quote]Source character ''1222'' in field '...'
cannot be converted from Unicode to codepage '819'.
The source character is an invalid code point within the given codepage.
[/quote]
Seems pretty clear. Which part don't you understand? |
|
seshank |
Posted: Thu Jul 21, 2011 4:37 am Post subject: |
|
|
Newbie
Joined: 21 Jul 2011 Posts: 2
|
The source character 1222 (ሢ) is not in the input message. I don't understand why &#8722; is getting changed to character 1222. |
|
fjb_saper |
Posted: Thu Jul 21, 2011 7:18 am Post subject: |
|
|
 Grand High Poobah
Joined: 18 Nov 2003 Posts: 20756 Location: LI,NY
|
seshank wrote: |
The source character 1222 (ሢ) is not in the input message. I don't understand why &#8722; is getting changed to character 1222. |
Most probably because &#8722; is not valid XML...
Quote: |
<Address>9-10-&#....33</Address> |
All "&" chars need to be escaped in XML, so at best we should be seeing something like
Quote: |
<Address>9-10-&amp;#....33</Address> |
Also, for chars that don't exist in the target system, you may get substitution chars...
My advice: keep it in UTF-8!
_________________
MQ & Broker admin |
|
rekarm01 |
Posted: Thu Jul 21, 2011 6:27 pm Post subject: Re: converting a UTF-8 Numeric Character Reference to ISO CP |
|
|
Grand Master
Joined: 25 Jun 2008 Posts: 1415
|
seshank wrote: |
We have an XML message that has a code page of 819 but contains some UTF-8 characters. These UTF-8 characters are represented using Numeric Character References (NCRs). |
The XML parsers can include numeric character references when parsing input messages, but they don't use them when writing output messages. The broker routines for character conversion are not specific to XML, so they throw an exception for an invalid code point, rather than generate an XML character reference.
The easiest option is to use an output encoding that doesn't require character references, such as
More difficult options require careful use of either opaque parsing or field type constants such as XMLNSC.AsisElementContent. Consult the documentation for more details before considering either of these alternatives.
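To make the behaviour above concrete, here is a rough sketch in Python rather than ESQL. Python's codecs only stand in for the broker's conversion routines, and UTF-8 is assumed as one example of an encoding that needs no character references. The helper `try_encode` is hypothetical and exists only for this illustration.

```python
# Illustration only: Python codecs stand in for the broker's conversion routines.
text = "9-10-\u221233"  # the Address value once the parser has resolved &#8722;


def try_encode(value: str, codec: str) -> str:
    """Return the encoded bytes as hex, or a short error description."""
    try:
        return value.encode(codec).hex()
    except UnicodeEncodeError as exc:
        return f"error: {exc.reason}"


# Code page 819 (ISO-8859-1) has no slot for U+2212, so conversion fails.
latin1_result = try_encode(text, "iso-8859-1")

# UTF-8 (assumed here as the alternative output encoding) covers all of Unicode.
utf8_result = try_encode(text, "utf-8")

# What the original poster expected: fall back to a character reference.
# A generic converter does not do this unless it is asked to.
with_ncr = text.encode("iso-8859-1", errors="xmlcharrefreplace")

print(latin1_result)
print(utf8_result)
print(with_ncr)
```

The last call shows that emitting a character reference is an XML-aware choice made at encoding time, not something a general-purpose code page conversion does on its own.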
seshank wrote: |
I don't understand why &#8722; is getting changed to character 1222. |
'&#8722;' (character reference) = U+2212 (Unicode) = X'12 22' (UCS-2 little endian)
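The byte-order point can be checked with a few lines of Python. This is only a sketch: it uses Python's UTF-16 codecs, which give the same bytes as UCS-2 for characters in the Basic Multilingual Plane such as this one.

```python
import unicodedata

minus = chr(8722)  # the character named by the reference &#8722;

# Decimal 8722 is hex 2212, so this is U+2212.
code_point = f"U+{ord(minus):04X}"

big_endian = minus.encode("utf-16-be").hex()     # bytes in the order 22 12
little_endian = minus.encode("utf-16-le").hex()  # bytes in the order 12 22

# Reading the little-endian bytes '1222' as if they were a code point
# gives a different, unrelated character.
misread = "\u1222"

print(code_point, unicodedata.name(minus), big_endian, little_endian)
print(unicodedata.name(misread))
```

So the '1222' in the error text is the same minus sign with its two bytes swapped, not a new character that appeared during mapping.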
seshank wrote: |
InputMessage:
<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<Employee><ID>12313213131</ID><Address>9-10-&#8722;33</Address></Employee> |
U+2212 is a mathematical operator ("MINUS SIGN"). Mathematical operators probably don't belong in an element called <Address>. |
|
smdavies99 |
Posted: Thu Jul 21, 2011 8:43 pm Post subject: Re: converting a UTF-8 Numeric Character Reference to ISO CP |
|
|
 Jedi Council
Joined: 10 Feb 2003 Posts: 6076 Location: Somewhere over the Rainbow this side of Never-never land.
|
rekarm01 wrote: |
U+2212 is a mathematical operator ("MINUS SIGN"). Mathematical operators probably don't belong in an element called <Address>. |
Yes they do.
For example,
17-21 Basingstoke Road
Perfectly valid.... Mostly used for business addresses.
I see this problem all the time where people cut/paste addresses from MS-Word docs (cp 850) into a web page(unknown CP, could be UTF or 817). Everything is fine until you want to parse the resulting XML.
So you have to scan the unparsed XML and replace the errant stuff.
The Euro sign is also a problem. 8859-1 won't hack it; you need 8859-15.
_________________
WMQ User since 1999
MQSI/WBI/WMB/'Thingy' User since 2002
Linux user since 1995
Every time you reinvent the wheel the more square it gets (anon). If in doubt think and investigate before you ask silly questions. |
|
fjb_saper |
Posted: Fri Jul 22, 2011 12:06 am Post subject: Re: converting a UTF-8 Numeric Character Reference to ISO CP |
|
|
 Grand High Poobah
Joined: 18 Nov 2003 Posts: 20756 Location: LI,NY
|
smdavies99 wrote: |
rekarm01 wrote: |
U+2212 is a mathematical operator ("MINUS SIGN"). Mathematical operators probably don't belong in an element called <Address>. |
Yes they do.
For example,
17-21 Basingstoke Road
Perfectly valid.... Mostly used for business addresses.
I see this problem all the time where people cut/paste addresses from MS-Word docs (cp 850) into a web page(unknown CP, could be UTF or 817). Everything is fine until you want to parse the resulting XML.
So you have to scan the unparsed XML and replace the errant stuff.
the Euro sign is also a problem. 8859-1 won't hack it. you need 8859-15. |
Still, it should not be a "minus" sign but a "-" dash sign...
I believe that even though they may have the same graphical representation, those are 2 different characters in UTF-8...
So yes, a cut and paste can give you a different result than typing it again...
Have fun
_________________
MQ & Broker admin |
|
rekarm01 |
Posted: Sun Jul 24, 2011 9:44 am Post subject: Re: converting a UTF-8 Numeric Character Reference to ISO CP |
|
|
Grand Master
Joined: 25 Jun 2008 Posts: 1415
|
smdavies99 wrote: |
rekarm01 wrote: |
U+2212 is a mathematical operator ("MINUS SIGN"). Mathematical operators probably don't belong in an element called <Address>. |
Yes they do. |
No, they don't. Although some non‐ASCII characters may look similar—hyphens, dashes, minus signs, even ancient symbols for Roman weights and measures—they serve specific purposes, and are not interchangeable. The multipurpose ASCII '-' ("HYPHEN-MINUS") often works better, when specificity is not important and the implied meaning is clear. For example:
smdavies99 wrote: |
17-21 Basingstoke Road |
That uses the ASCII '-', which is usually OK, as opposed to the minus sign, which would change its meaning. Compare:
- '17‑21 Basingstoke Road' ("NON-BREAKING HYPHEN"; word join: '"17‑21" Basingstoke Road')
- '17‒21 Basingstoke Road' ("FIGURE DASH"; implied meaning)
- '17–21 Basingstoke Road' ("EN DASH"; range: 'from 17 Basingstoke Road to 21 Basingstoke Road')
- '17—21 Basingstoke Road' ("EM DASH"; interrupt: '17—no, wait—21!—21 Basingstoke Road')
- '17−21 Basingstoke Road' ("MINUS SIGN"; subtraction: equivalent to '−4 Basingstoke Road')
Applications don't just display strings, particularly from XML messages. The wrong character can also affect layout, portability, searches, queries, comparisons, sorts, conversions, translations, transformations, parsing, interpreting, processing, etc. How a character looks is probably the least important thing about it.
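A small Python sketch of the comparison and search problem, using only the standard `unicodedata` module. The address strings are the example from this thread; nothing here is broker code.

```python
import unicodedata

# The ASCII hyphen-minus plus the five lookalikes listed above.
lookalikes = ["\u002d", "\u2011", "\u2012", "\u2013", "\u2014", "\u2212"]
names = {ch: unicodedata.name(ch) for ch in lookalikes}

for ch, name in names.items():
    print(f"U+{ord(ch):04X} {name}")

ascii_address = "17-21 Basingstoke Road"       # typed with the keyboard '-'
minus_address = "17\u221221 Basingstoke Road"  # same text with U+2212 instead

# The two strings look alike on screen but are not equal,
# and a search for the ASCII form does not find the other.
same = ascii_address == minus_address
found = "17-21" in minus_address
print(same, found)
```

Both checks come out false, which is the kind of silent mismatch that shows up later in searches, sorts and conversions.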
smdavies99 wrote: |
I see this problem all the time where people cut/paste addresses from MS-Word docs— |
That's a different problem. Using the default "AutoCorrect/AutoFormat" settings, MS Word replaces ASCII hyphens with dashes, not with the minus sign. None of the more common single‐byte code pages define the minus sign. Any use of the minus sign is almost certainly deliberate.
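As a rough check of the code page point, the Python sketch below asks which of a few single-byte code pages can represent the dashes, the minus sign and the Euro sign mentioned earlier in the thread. The list of code pages is my own selection for illustration, and `can_encode` is a hypothetical helper; Python's codec tables stand in for the real conversion tables.

```python
def can_encode(ch: str, codec: str) -> bool:
    """Return True if the single character exists in the given code page."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Code pages named in this thread (819 is ISO-8859-1).
codecs_checked = ["iso-8859-1", "iso-8859-15", "cp1252", "cp850"]

chars = {
    "EN DASH": "\u2013",
    "EM DASH": "\u2014",
    "MINUS SIGN": "\u2212",
    "EURO SIGN": "\u20ac",
}

# For each character, list the code pages that can represent it.
coverage = {
    name: [c for c in codecs_checked if can_encode(ch, c)]
    for name, ch in chars.items()
}

for name, pages in coverage.items():
    print(name, pages)
```

With these four code pages, the dashes are only available in 1252 and the minus sign in none of them, which fits the idea that a pasted Word dash and a deliberate minus sign are separate problems.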
smdavies99 wrote: |
—(cp 850) into a web page (unknown CP, could be UTF or 817). |
MS Word is more likely to use a Windows code page (such as 1252), rather than a DOS (OEM) code page (such as 850). There is no 817 code page, (nor "UTF"). |
|
smdavies99 |
Posted: Sun Jul 24, 2011 12:45 pm Post subject: |
|
|
 Jedi Council
Joined: 10 Feb 2003 Posts: 6076 Location: Somewhere over the Rainbow this side of Never-never land.
|
Whilst your points might be correct, do you honestly think that someone entering an address cares if they use any of these?
Quote: |
'17‑21 Basingstoke Road' ("NON-BREAKING HYPHEN"; word join: '"17‑21" Basingstoke Road')
'17‒21 Basingstoke Road' ("FIGURE DASH"; implied meaning)
'17–21 Basingstoke Road' ("EN DASH"; range: 'from 17 Basingstoke Road to 21 Basingstoke Road')
'17—21 Basingstoke Road' ("EM DASH"; interrupt: '17—no, wait—21!—21 Basingstoke Road')
'17−21 Basingstoke Road' ("MINUS SIGN"; subtraction: equivalent to '−4 Basingstoke Road')
|
In my experience they don't; (with homage to GWTW) they don't give a damn.
I should have said 819, not 817. My mistake.
Re the use of 1252 or 850: I see data mismatches with 850 20 times as often as with 1252. You are right that MS-Word would produce 1252. However, there are a good number of other 'business applications' that have been around since the DOS days that still use 850 and even 866 (Cyrillic). Some of these apps are still sold today. Their devs have not bothered to fix their character set problems.
Our support people also have to 'fix' a wide range of other character set issues originating from cut/paste. The languages include Polish, Serbian, Lithuanian, Turkish and a variety of Cyrillic languages. Such are the joys of living in a multicultural society.
Thankfully, all the salesforce have been issued with iPads and the old cut/paste problems are a thing of the past.
_________________
WMQ User since 1999
MQSI/WBI/WMB/'Thingy' User since 2002
Linux user since 1995
Every time you reinvent the wheel the more square it gets (anon). If in doubt think and investigate before you ask silly questions. |
|
rekarm01 |
Posted: Mon Jul 25, 2011 3:00 am Post subject: |
|
|
Grand Master
Joined: 25 Jun 2008 Posts: 1415
|
smdavies99 wrote: |
do you honestly think that someone entering an address cares |
That's a different question. Most of the time, users would just use the ASCII '-' on their keyboard (available in most countries), and be done with it. And most of the time, that would be the right thing to do. No worries. Problem solved.
But someone who goes to the extra trouble to hunt down an obscure Unicode character to use instead might also care about picking the right one. |
|
fjb_saper |
Posted: Mon Jul 25, 2011 4:57 am Post subject: |
|
|
 Grand High Poobah
Joined: 18 Nov 2003 Posts: 20756 Location: LI,NY
|
rekarm01 wrote: |
smdavies99 wrote: |
do you honestly think that someone entering an address cares |
That's a different question. Most of the time, users would just use the ASCII '-' on their keyboard (available in most countries), and be done with it. And most of the time, that would be the right thing to do. No worries. Problem solved.
But someone who goes to the extra trouble to hunt down an obscure Unicode character to use instead might also care about picking the right one. |
I think it would be fair to say that most addresses get entered through cut and paste from somewhere else. Once you cut and paste, nobody cares whether it should be a dash or any of the other signs that look like it...
_________________
MQ & Broker admin |
|