converting a UTF-8 Numeric Character Reference to ISO CP
seshank
Posted: Thu Jul 21, 2011 3:18 am    Post subject: converting a UTF-8 Numeric Character Reference to ISO CP

Newbie

Joined: 21 Jul 2011
Posts: 2

We have an XML message declared with code page 819 (ISO-8859-1) that contains some Unicode characters outside that code page. These characters are represented using numeric character references (NCRs).

When the XML field containing the NCR is mapped to another XML structure, we get the error below when the message is written to the MQ output queue.

InputMessage :
<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<Employee><ID>12313213131</ID><Address>9-10-&#....33</Address></Employee>
In the XML above, .... stands for 8722;

ESQL CODE :
Code:
      CALL CopyMessageHeaders();

      -- Create the output XMLNSC tree and keep a reference to it
      DECLARE destRef REFERENCE TO OutputRoot;
      CREATE LASTCHILD OF OutputRoot AS destRef DOMAIN 'XMLNSC' NAME 'XMLNSC';
      -- XML declaration for the output message
      SET destRef.(XMLNSC.XmlDeclaration)*.(XMLNSC.Attribute)Version = '1.0';
      SET destRef.(XMLNSC.XmlDeclaration)*.(XMLNSC.Attribute)Encoding = 'ISO-8859-1';

      -- Copy the whole Employee subtree under the new element Emp
      SET destRef.Emp = InputRoot.XMLNSC.Employee;
      RETURN TRUE;

Error :
Source character ''1222'' in field ''3c003f0078006d006c002000760065007200730069006f006e003d0022003
1002e0030002200200065006e0063006f00640069006e0067003d00220049
00530030002d0038003800350039002d00310022003f003e003c0045006d00
70003e003c00490044003e003100320033003100330032003100330031003
30031003c002f00490044003e003c0041006400640072006500730073003e0
039002d00310030002d001222330033003c002f0041006400640072006500
730073003e003c002f0045006d0070003e00'' cannot be converted from Unicode to codepage '819'.

The source character is an invalid code point within the given codepage.

Quote:
I was expecting the same NCR to be copied to the output XML. Any idea why the error occurs?
skoobee
Posted: Thu Jul 21, 2011 4:12 am

Acolyte

Joined: 26 Nov 2010
Posts: 52

Quote:
Source character ''1222'' in field '...'
cannot be converted from Unicode to codepage '819'.

The source character is an invalid code point within the given codepage.

Seems pretty clear. Which part don't you understand?
seshank
Posted: Thu Jul 21, 2011 4:37 am

Newbie

Joined: 21 Jul 2011
Posts: 2

The source character 1222 (ሢ) is not in the input message. I don't understand why &#8722 is being changed to character 1222.
fjb_saper
Posted: Thu Jul 21, 2011 7:18 am

Grand High Poobah

Joined: 18 Nov 2003
Posts: 20756
Location: LI,NY

seshank wrote:
The source character 1222 (ሢ) is not in the input message. I don't understand why &#8722 is being changed to character 1222.

Most probably because &#8722 is not valid XML...
Quote:
<Address>9-10-&#....33</Address>

All "&" chars need to be escaped in XML, so at best we should be seeing something like
Quote:
<Address>9-10-&amp;#....33</Address>

Also, for characters that do not exist in the target code page, you may get substitution characters...

My advice: keep it in UTF-8 !
_________________
MQ & Broker admin
rekarm01
Posted: Thu Jul 21, 2011 6:27 pm    Post subject: Re: converting a UTF-8 Numeric Character Reference to ISO CP

Grand Master

Joined: 25 Jun 2008
Posts: 1415

seshank wrote:
We have an XML message declared with code page 819 (ISO-8859-1) that contains some Unicode characters outside that code page. These characters are represented using numeric character references (NCRs).

The XML parsers can include numeric character references when parsing input messages, but they don't use them when writing output messages. The broker routines for character conversion are not specific to XML, so they throw an exception for an invalid code point, rather than generate an XML character reference.
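The behaviour described above can be reproduced outside the broker. The following is a minimal Python sketch, not broker code: Python's standard XML parser stands in for XMLNSC and `str.encode` stands in for the broker's conversion routine (both are assumptions made for illustration). The parser resolves the character reference to U+2212, and a strict conversion to ISO-8859-1 then has nothing to map it to.

```python
import xml.etree.ElementTree as ET

# The input message from the first post, with the character reference spelled out.
raw = (b'<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>'
       b'<Employee><ID>12313213131</ID><Address>9-10-&#8722;33</Address></Employee>')

root = ET.fromstring(raw)
address = root.findtext("Address")

# After parsing, the reference is gone; the tree holds the real character.
resolved = "U+%04X" % ord(address[5])

# A strict conversion that knows nothing about XML has no byte for U+2212.
try:
    address.encode("iso-8859-1")
    conversion_error = None
except UnicodeEncodeError as exc:
    conversion_error = exc.reason

print(resolved, "->", conversion_error)
```

The point is that the reference only exists in the serialized input; once the message is parsed, the tree contains the character itself.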

The easiest option is to use an output encoding that doesn't require character references, such as

fjb_saper wrote:
UTF-8 !

More difficult options require careful use of either opaque parsing or field type constants such as XMLNSC.AsisElementContent. Consult the documentation for more details before considering either of these alternatives.
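For comparison, here is what the two outcomes look like at the byte level, again as a Python sketch rather than broker code. The `xmlcharrefreplace` error handler is a Python feature used here only as a stand-in for "write a character reference when the code page has no byte for the character"; the broker's converter does not do this by itself.

```python
address = "9-10-\u221233"  # element value after parsing; U+2212 is MINUS SIGN

# Option 1: write the message as UTF-8; every Unicode character has an encoding.
as_utf8 = address.encode("utf-8")

# Option 2 (what the original poster expected): fall back to a numeric
# character reference for anything ISO-8859-1 cannot represent.
as_latin1_with_ncr = address.encode("iso-8859-1", errors="xmlcharrefreplace")

print(as_utf8)
print(as_latin1_with_ncr)
```

With UTF-8 the minus sign becomes the three bytes E2 88 92; with the fallback handler the output contains the same `&#8722;` that was in the input.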

seshank wrote:
I don't understand why &#8722 is being changed to character 1222.

'&#8722;' (character reference) = U+2212 (Unicode) = X'12 22' (UCS-2 little endian)
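The equation above can be checked against the dump in the error message. A small Python sketch (an illustration only) that takes the `<Address>` part of the hex dump and decodes it as two-byte little-endian characters:

```python
# Fragment of the hex dump from the error message (the Address element).
fragment_hex = (
    "3c0041006400640072006500730073003e00"      # <Address>
    "39002d00310030002d00122233003300"          # element content
    "3c002f0041006400640072006500730073003e00"  # </Address>
)

# Each character is two bytes, low byte first (UCS-2 / UTF-16 little endian).
text = bytes.fromhex(fragment_hex).decode("utf-16-le")

# Pick out the only non-ASCII character in the fragment.
offending = next(ch for ch in text if ord(ch) > 0x7F)

print(text)
print("U+%04X" % ord(offending), ord(offending))
```

The byte pair 12 22, read low byte first, is code point U+2212, which is 8722 in decimal. The ''1222'' in the error text appears to be those same two bytes shown in memory order, not code point U+1222.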

seshank wrote:
InputMessage:
<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<Employee><ID>12313213131</ID><Address>9-10-&#8722;33</Address></Employee>

U+2212 is a mathematical operator ("MINUS SIGN"). Mathematical operators probably don't belong in an element called <Address>.
smdavies99
Posted: Thu Jul 21, 2011 8:43 pm    Post subject: Re: converting a UTF-8 Numeric Character Reference to ISO CP

Jedi Council

Joined: 10 Feb 2003
Posts: 6076
Location: Somewhere over the Rainbow this side of Never-never land.

rekarm01 wrote:

U+2212 is a mathematical operator ("MINUS SIGN"). Mathematical operators probably don't belong in an element called <Address>.


Yes they do.

For example,
17-21 Basingstoke Road

Perfectly valid.... Mostly used for business addresses.

I see this problem all the time where people cut/paste addresses from MS-Word docs (cp 850) into a web page(unknown CP, could be UTF or 817). Everything is fine until you want to parse the resulting XML.
So you have to scan the unparsed XML and replace the errant stuff.
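One way to "scan the unparsed XML" is sketched below in Python. It is a hypothetical helper, not something from the broker: the function name `normalize_dashes` and the set of code points to replace are assumptions, and whether replacing them with the ASCII hyphen is acceptable depends on what the data means.

```python
import re

# Hypothetical policy: map these dash-like characters to ASCII HYPHEN-MINUS.
DASH_LIKE = {0x2010, 0x2011, 0x2012, 0x2013, 0x2014, 0x2212}

# Matches decimal (&#8722;) and hexadecimal (&#x2212;) character references.
CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")

def normalize_dashes(xml_text):
    """Replace dash-like characters, literal or as references, with '-'."""
    def replace_ref(match):
        code = int(match.group(1), 16) if match.group(1) else int(match.group(2))
        # Leave every other reference exactly as it was written.
        return "-" if code in DASH_LIKE else match.group(0)

    # First the character references, then any literal occurrences.
    without_refs = CHAR_REF.sub(replace_ref, xml_text)
    return without_refs.translate({cp: "-" for cp in DASH_LIKE})

print(normalize_dashes("<Address>9-10-&#8722;33</Address>"))
```

Because it works on the raw text, the replacement happens before any parser has turned the reference into a character that the output code page cannot hold.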

The Euro sign is also a problem. ISO-8859-1 won't hack it; you need ISO-8859-15.
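The Euro sign case is easy to check with Python's codec tables (used here only as a convenient reference for the code pages, not as anything the broker runs):

```python
euro = "\u20ac"  # EURO SIGN

# Record the encoded bytes per code page, or None where the character is missing.
results = {}
for codec in ("iso-8859-1", "iso-8859-15", "cp1252", "utf-8"):
    try:
        results[codec] = euro.encode(codec)
    except UnicodeEncodeError:
        results[codec] = None

for codec, encoded in results.items():
    print(codec, encoded)
```

ISO-8859-1 has no Euro sign at all, ISO-8859-15 puts it at 0xA4, and Windows-1252 puts it at 0x80, so the same byte means different things depending on which code page the receiver assumes.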
_________________
WMQ User since 1999
MQSI/WBI/WMB/'Thingy' User since 2002
Linux user since 1995

Every time you reinvent the wheel the more square it gets (anon). If in doubt think and investigate before you ask silly questions.
fjb_saper
Posted: Fri Jul 22, 2011 12:06 am    Post subject: Re: converting a UTF-8 Numeric Character Reference to ISO CP

Grand High Poobah

Joined: 18 Nov 2003
Posts: 20756
Location: LI,NY

smdavies99 wrote:
rekarm01 wrote:

U+2212 is a mathematical operator ("MINUS SIGN"). Mathematical operators probably don't belong in an element called <Address>.


Yes they do.

For example,
17-21 Basingstoke Road

Perfectly valid.... Mostly used for business addresses.

I see this problem all the time where people cut/paste addresses from MS-Word docs (cp 850) into a web page(unknown CP, could be UTF or 817). Everything is fine until you want to parse the resulting XML.
So you have to scan the unparsed XML and replace the errant stuff.

The Euro sign is also a problem. ISO-8859-1 won't hack it; you need ISO-8859-15.


Still, it should not be a "minus" sign but a "-" dash sign...
I believe that even though they may have the same graphical representation, those are two different characters in Unicode...

So yes just cut and paste can give you a different response than typing it again...

Have fun
rekarm01
Posted: Sun Jul 24, 2011 9:44 am    Post subject: Re: converting a UTF-8 Numeric Character Reference to ISO CP

Grand Master

Joined: 25 Jun 2008
Posts: 1415

smdavies99 wrote:
rekarm01 wrote:
U+2212 is a mathematical operator ("MINUS SIGN"). Mathematical operators probably don't belong in an element called <Address>.

Yes they do.

No, they don't. Although some non‐ASCII characters may look similar—hyphens, dashes, minus signs, even ancient symbols for Roman weights and measures—they serve specific purposes, and are not interchangeable. The multipurpose ASCII '-' ("HYPHEN-MINUS") often works better, when specificity is not important and the implied meaning is clear. For example:

smdavies99 wrote:
17-21 Basingstoke Road

That uses the ASCII '-', which is usually OK, as opposed to the minus sign, which would change its meaning. Compare:
  • '17‑21 Basingstoke Road' ("NON-BREAKING HYPHEN"; word join: '"17‑21" Basingstoke Road')
  • '17‒21 Basingstoke Road' ("FIGURE DASH"; implied meaning)
  • '17–21 Basingstoke Road' ("EN DASH"; range: 'from 17 Basingstoke Road to 21 Basingstoke Road')
  • '17—21 Basingstoke Road' ("EM DASH"; interrupt: '17—no, wait—21!—21 Basingstoke Road')
  • '17−21 Basingstoke Road' ("MINUS SIGN"; subtraction: equivalent to '−4 Basingstoke Road')
Applications don't just display strings, particularly from XML messages. The wrong character can also affect layout, portability, searches, queries, comparisons, sorts, conversions, translations, transformations, parsing, interpreting, processing, etc. How a character looks is probably the least important thing about it.
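The characters in the list above are distinct code points, which is why comparisons and searches treat them differently. A short Python sketch using the standard `unicodedata` module to name each one and compare the resulting string with the plain ASCII form:

```python
import unicodedata

# ASCII hyphen-minus first, then the five characters from the list above.
candidates = ["\u002d", "\u2011", "\u2012", "\u2013", "\u2014", "\u2212"]

for ch in candidates:
    street = "17" + ch + "21 Basingstoke Road"
    same_as_ascii = street == "17-21 Basingstoke Road"
    print("U+%04X %-20s equals ASCII form: %s"
          % (ord(ch), unicodedata.name(ch), same_as_ascii))
```

Only the first candidate compares equal, so a lookup keyed on the ASCII spelling would miss an address entered with any of the other five.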

smdavies99 wrote:
I see this problem all the time where people cut/paste addresses from MS-Word docs—

That's a different problem. Using the default "AutoCorrect/AutoFormat" settings, MS Word replaces ASCII hyphens with dashes, not with the minus sign. None of the more common single‐byte code pages define the minus sign. Any use of the minus sign is almost certainly deliberate.
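The claim about single-byte code pages can be spot-checked for the code pages mentioned in this thread. This Python sketch relies on Python's codec tables and covers only these four code pages, so it is a sample, not a proof for every code page:

```python
CODE_PAGES = ("iso-8859-1", "iso-8859-15", "cp850", "cp1252")
CHARS = {"EN DASH": "\u2013", "EM DASH": "\u2014", "MINUS SIGN": "\u2212"}

def encodable(char, codec):
    """Return True if the code page has a byte for the character."""
    try:
        char.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

# For each character, list the code pages that can represent it.
support = {name: [cp for cp in CODE_PAGES if encodable(ch, cp)]
           for name, ch in CHARS.items()}

for name, pages in support.items():
    print(name, pages)
```

Of these four, only Windows-1252 has the en dash and em dash that MS Word tends to insert, and none has the minus sign.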

smdavies99 wrote:
—(cp 850) into a web page (unknown CP, could be UTF or 817).

MS Word is more likely to use a Windows code page (such as 1252), rather than a DOS (OEM) code page (such as 850). There is no 817 code page, (nor "UTF").
smdavies99
Posted: Sun Jul 24, 2011 12:45 pm

Jedi Council

Joined: 10 Feb 2003
Posts: 6076
Location: Somewhere over the Rainbow this side of Never-never land.

Whilst your points might be correct, do you honestly think that someone entering an address cares if they use any of these?

Quote:

'17‑21 Basingstoke Road' ("NON-BREAKING HYPHEN"; word join: '"17‑21" Basingstoke Road')
'17‒21 Basingstoke Road' ("FIGURE DASH"; implied meaning)
'17–21 Basingstoke Road' ("EN DASH"; range: 'from 17 Basingstoke Road to 21 Basingstoke Road')
'17—21 Basingstoke Road' ("EM DASH"; interrupt: '17—no, wait—21!—21 Basingstoke Road')
'17−21 Basingstoke Road' ("MINUS SIGN"; subtraction: equivalent to '−4 Basingstoke Road')

In my experience, they don't. With homage to GWTW, they don't give a damn.

I should have said 819, not 817. My mistake.

Re the use of 1252 or 850: I see data mismatches with 850 twenty times as often as with 1252. You are right that MS-Word would produce 1252. However, there are a good number of other 'business applications' that have been around since the DOS days and still use 850 and even 866 (Cyrillic). Some of these apps are still sold today. Their devs have not bothered to fix their character set problems.
Our support people also have to 'fix' a wide range of other character set issues originating from cut/paste. The languages include Polish, Serbian, Lithuanian, Turkish and a variety of Cyrillic languages. Such are the joys of living in a multicultural society.
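The kind of 850/1252 mismatch described above is easy to demonstrate. In this Python sketch the sample word is an assumption chosen for illustration; the same thing happens to any accented letter that sits at different byte values in the two code pages.

```python
original = "Caf\u00e9"  # "Café", hypothetical sample text

# Written by an application that uses DOS code page 850 ...
stored = original.encode("cp850")

# ... and read by one that assumes Windows code page 1252.
misread = stored.decode("cp1252")

# The opposite direction goes wrong as well.
misread_reverse = original.encode("cp1252").decode("cp850")

print(stored, repr(misread), repr(misread_reverse))
```

No error is raised in either direction; the text is silently wrong, which is why these problems tend to surface only when someone looks at the data.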
Thankfully, all the salesforce have been issued with iPads and the old cut/paste problems are a thing of the past.
rekarm01
Posted: Mon Jul 25, 2011 3:00 am

Grand Master

Joined: 25 Jun 2008
Posts: 1415

smdavies99 wrote:
do you honestly think that someone entering an address cares

That's a different question. Most of the time, users would just use the ASCII '-' on their keyboard (available in most countries), and be done with it. And most of the time, that would be the right thing to do. No worries. Problem solved.

But someone who goes to the extra trouble to hunt down an obscure Unicode character to use instead might also care about picking the right one.
fjb_saper
Posted: Mon Jul 25, 2011 4:57 am

Grand High Poobah

Joined: 18 Nov 2003
Posts: 20756
Location: LI,NY

rekarm01 wrote:
smdavies99 wrote:
do you honestly think that someone entering an address cares

That's a different question. Most of the time, users would just use the ASCII '-' on their keyboard (available in most countries), and be done with it. And most of the time, that would be the right thing to do. No worries. Problem solved.

But someone who goes to the extra trouble to hunt down an obscure Unicode character to use instead might also care about picking the right one.

I think it would be adequate to say that most of the addresses get entered through cut and paste from somewhere else. Once you cut and paste nobody cares whether it should be a dash or any other of the signs that look like it...