Author |
Message
|
murdeep |
Posted: Thu Feb 11, 2010 8:04 am Post subject: Invalid XML characters |
|
|
Master
Joined: 03 Nov 2004 Posts: 211
|
Hello, this is probably a lame question but here goes.
Running WMB 6.1.0.5 and WMQ V7 on W2K.
We have a message flow that receives XML. Every once in a while we receive a message that the XMLNSC parser rejects because of an invalid XML character, in one case a 0x7F.
I informed the sending application team, and they claim that the message is valid and that our parser is not configured correctly. I asked whether they validate the XML before putting it on the wire, and they said yes. So I said: great, send me your schemas and I will use them to validate what you send when I receive the message.
I imported the schemas, created a message set, and built a simple flow that uses the MRM parser and does a Trace ${Root} to force parsing of the entire message. As I expected, the message is rejected for an invalid XML character.
The app team is positive they validate.
Is it possible to have a 0x7F character in an XML document and be a valid XML document? Is there some option when configuring WMB to have it not reject XML that has these odd characters?
Thanks in advance. |
|
Back to top |
|
 |
mqjeff |
Posted: Thu Feb 11, 2010 8:27 am Post subject: |
|
|
Grand Master
Joined: 25 Jun 2008 Posts: 17447
|
Try validating with XMLNSC instead of MRM-XML... |
|
|
 |
Vitor |
Posted: Thu Feb 11, 2010 8:30 am Post subject: Re: Invalid XML characters |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
murdeep wrote: |
Is it possible to have a 0x7F character in an XML document and be a valid XML document? Is there some option when configuring WMB to have it not reject XML that has these odd characters?
|
Isn't that a DEL character? Hardly the sort of thing you'd have in XML, I think!
The XML spec is rather nicely ambiguous on this:
Quote: |
Document authors are encouraged to avoid "compatibility characters" |
but that value is in the range of allowable XML characters.
2 questions occur to me:
1) Why do they have a DEL in the XML
2) What are they using to validate the XML? Is it a commercial tool or homebrew? Does it open in IE?
How WMB can be convinced to handle this I'll leave to better minds than mine.  _________________ Honesty is the best policy.
Insanity is the best defence. |
|
|
 |
murdeep |
Posted: Thu Feb 11, 2010 9:38 am Post subject: |
|
|
Master
Joined: 03 Nov 2004 Posts: 211
|
mqjeff wrote: |
Try validating with XMLNSC instead of MRM-XML... |
Ok, tried this and it fails with a slightly different error message but result is same - message rejected for invalid xml character.
Vitor wrote: |
2 questions occur to me:
1) Why do they have a DEL in the XML
2) What are they using to validate the XML? Is it a commercial tool or homebrew? Does it open in IE?
|
1) Good question, haven't got a good explanation of why they are sending us this char
2) they are an ASP and are using VB code XMLReader class I believe to validate. We are looking at if perhaps they are only validating structure and not content. |
|
|
 |
Vitor |
Posted: Thu Feb 11, 2010 10:34 am Post subject: |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
murdeep wrote: |
2) they are an ASP and are using VB code XMLReader class I believe to validate. We are looking at if perhaps they are only validating structure and not content. |
Ok, not my strong suit but can you validate the structure of an XML document without parsing the content? Even in VB?
Weird, but I've seen weirder.
Have you tried opening one of these documents with IE or XMLSpy? IIRC WMB uses (or used to use) a variation of the Xerces parser which is fairly standard. Again, someone who knows more about this will be along in a minute. |
|
|
 |
murdeep |
Posted: Thu Feb 11, 2010 10:41 am Post subject: |
|
|
Master
Joined: 03 Nov 2004 Posts: 211
|
Vitor wrote: |
Have you tried opening one of these documents with IE or XMLSpy? IIRC WMB uses (or used to use) a variation of the Xerces parser which is fairly standard. Again, someone who knows more about this will be along in a minute. |
Saved the message to a .xml file and then opened with IE and the 0x7F appears as a little box. ---> hadnt <---
Not sure if IE is as strict as the WMB parsers. |
|
|
 |
Vitor |
Posted: Thu Feb 11, 2010 12:00 pm Post subject: |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
murdeep wrote: |
Not sure if IE is as strict as the WMB parsers. |
It's still odd that that value would appear in that position - you would expect a ' given the context.
Be that as it may, the answer will only come from someone with a detailed knowledge of the WMB parser. Who isn't me. |
|
|
 |
murdeep |
Posted: Thu Feb 11, 2010 2:40 pm Post subject: |
|
|
Master
Joined: 03 Nov 2004 Posts: 211
|
Ok, we may be getting to the bottom of this. Some more info.
The sending app is a VB client on Windows connecting to an MQServer on Windows.
The MQServer qmgr CCSID is 437.
The client VB app sets its Message.CCSID = 437 and then builds its payload. In the payload they set a character to ASCII 180, which is an apostrophe. I'm thinking that ASCII 180 is not a valid character in CCSID 437. I'm thinking that the client should be setting the Message.CCSID to 1208 for UTF-8 so that it supports the ASCII 180 character before sending the message.
Can anyone comment on this? Thanks |
|
|
 |
kimbert |
Posted: Thu Feb 11, 2010 3:15 pm Post subject: |
|
|
 Jedi Council
Joined: 29 Jul 2003 Posts: 5542 Location: Southampton
|
If you have not yet referred to this document, then you should do so: http://www.w3.org/TR/2006/REC-xml-20060816/#charsets
I would advise you to quote it in all conversations with other departments about the well-formedness of XML
0x7F is a legal Unicode code point in an XML document.
In your XML documents, the byte value '0x7F' might or might not represent a legal XML code point - that will depend on the encoding ( the CCSID ) of your XML document.
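A minimal Python sketch of that distinction, for illustration only. The function names are made up here and are not part of WMB or any XML library. The ranges come from the `Char` production in section 2.2 of the XML 1.0 specification linked above. Python's `cp437` and `utf-8` codecs are assumed to stand in for CCSIDs 437 and 1208, and their tables may differ from the conversion tables MQ uses.

```python
def is_xml10_char(code_point: int) -> bool:
    """Return True if the code point matches the XML 1.0 'Char' production."""
    return (
        code_point in (0x9, 0xA, 0xD)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )


def byte_is_xml10_char(byte_value: int, encoding: str) -> bool:
    """Decode one byte under the given encoding, then test the resulting code point."""
    try:
        text = bytes([byte_value]).decode(encoding)
    except UnicodeDecodeError:
        # The byte is not a complete character on its own in this encoding.
        return False
    return len(text) == 1 and is_xml10_char(ord(text))


if __name__ == "__main__":
    print("U+007F allowed:", is_xml10_char(0x7F))
    print("U+001A allowed:", is_xml10_char(0x1A))
    print("byte 0xB4 alone in utf-8:", byte_is_xml10_char(0xB4, "utf-8"))
    print("byte 0xB4 in cp437:", byte_is_xml10_char(0xB4, "cp437"))
```

The point of the second function is that the same byte can pass or fail depending on the encoding it is decoded with, which is why the byte value alone does not settle the argument.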
The issue below seems different from the 0x7F question, btw.
Quote: |
The client VB app sets its Message.CCSID = 437 and then builds its payload. In the payload they set a character to ASCII 180, which is an apostrophe. I'm thinking that ASCII 180 is not a valid character in CCSID 437. I'm thinking that the client should be setting the Message.CCSID to 1208 for UTF-8 so that it supports the ASCII 180 character before sending the message. |
Apostrophe is not represented by a byte value of 180. It has a byte value of 039 ( 0x27 ) in ASCII. You're right about the mapping to code page 437, though. There's no apostrophe in 437.
And yes, I think UTF-8 would be a great choice of encoding. |
|
|
 |
murdeep |
Posted: Thu Feb 11, 2010 3:42 pm Post subject: |
|
|
Master
Joined: 03 Nov 2004 Posts: 211
|
kimbert wrote: |
Apostrophe is not represented by a byte value of 180. It has a byte value of 039 ( 0x27 ) in ASCII. You're right about the mapping to code page 437, though. There's no apostrophe in 437.
And yes, I think UTF-8 would be a great choice of encoding. |
Isn't 0x27 technically a single quotation mark? and ASCII 180 an apostrophe?
No apostrophe in 437.
I think I get this now. The developer claims his XML is valid because his validation step succeeds. They then serialize the document to a string, which is written to the message buffer with a WriteString method where the CCSID is set to 437. The ASCII 180 somehow becomes a 0x7F and fails parsing at WMB.
I'll get the developer to set his CCSID to 1208 and then see what happens when the message arrives at WMB.
One last point: this message ultimately ends up at a z/OS node. Will there be any issues converting from 1208 to the z/OS CCSID, whatever that may be? |
|
|
 |
kimbert |
Posted: Thu Feb 11, 2010 4:43 pm Post subject: |
|
|
 Jedi Council
Joined: 29 Jul 2003 Posts: 5542 Location: Southampton
|
Quote: |
One last point: this message ultimately ends up at a z/OS node. Will there be any issues converting from 1208 to the z/OS CCSID, whatever that may be? |
z/OS can handle UTF-8 data ( 1208 ). If 'whatever that may be' is not a Unicode code page, then yes, there may well be problems. But that's always the problem when you use non-Unicode code pages. |
|
|
 |
rekarm01 |
Posted: Thu Feb 11, 2010 6:48 pm Post subject: Re: Invalid XML characters |
|
|
Grand Master
Joined: 25 Jun 2008 Posts: 1415
|
murdeep wrote: |
we receive a message that gets rejected by the XMLNSC parser due to invalid XML character, in one case a 0x7F. |
What was the actual error message?
0x7F is a byte, not a character. Whether it maps to a valid XML character or not depends on the ccsid that maps it.
murdeep wrote: |
Isn't 0x27 technically a single quotation mark? and ASCII 180 an apostrophe? |
ASCII is a 7-bit character set; there is no ASCII 180. 0x27 is a byte, not a character; without a ccsid, it means nothing:
- ccsid=437 (MS-DOS):
  0x27 --> U+0027 = <APOSTROPHE> = (')
  0x7F --> U+001A = <SUBSTITUTION> = (not a valid XML character)
  0x7F --> U+2302 = <HOUSE> = (⌂)
  0xB4 --> U+2524 = <BOX_DRAWINGS_LIGHT_VERTICAL_AND_LEFT> = (┤)
- ccsid=819 (ISO 8859-1):
  0x27 --> U+0027 = <APOSTROPHE> = (')
  0xB4 --> U+00B4 = <ACUTE ACCENT> = (´)
- ccsid=1208 (UTF-8):
  0x27 --> U+0027 = <APOSTROPHE> = (')
  0xC2B4 --> U+00B4 = <ACUTE ACCENT> = (´)
For ccsid=437, 0x7F is ambiguous; it can either be interpreted as a control character or a graphic character, depending on the context. This, among other reasons, makes it a poor choice for representing non-ASCII character data.
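Most of the mappings listed above can be spot-checked with a few lines of Python. This is only a sketch: the `CODECS` table is an assumption that Python's `cp437`, `latin-1` and `utf-8` codecs correspond to CCSIDs 437, 819 and 1208, and `decode_as` and `describe` are illustrative helper names. As far as I know, Python's `cp437` table decodes 0x7F as plain U+007F, so it does not model either of the two readings of 0x7F listed for ccsid=437.

```python
import unicodedata

# Assumed correspondence between CCSIDs and Python codec names.
CODECS = {437: "cp437", 819: "latin-1", 1208: "utf-8"}


def decode_as(raw: bytes, ccsid: int) -> str:
    """Interpret the raw bytes under the codec assumed to match the CCSID."""
    return raw.decode(CODECS[ccsid])


def describe(raw: bytes, ccsid: int) -> str:
    """Render one 'bytes -> code point and name' line for the given CCSID."""
    text = decode_as(raw, ccsid)
    names = ", ".join(
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<no name>')}" for ch in text
    )
    return f"ccsid={ccsid} 0x{raw.hex().upper()} -> {names}"


if __name__ == "__main__":
    samples = [
        (437, b"\x27"),
        (437, b"\xb4"),
        (819, b"\xb4"),
        (1208, b"\xc2\xb4"),
    ]
    for ccsid, raw in samples:
        print(describe(raw, ccsid))
```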
Most likely, the sender used a non-ASCII acute accent instead of an ASCII apostrophe in the XML element "...hadn´t..."; the sender then converted the message to ccsid=437, which does not have an acute accent, so the conversion replaced the acute accent with a substitution character. The substitution character is not a valid XML character.
The sender corrupted the message before it sent it; even if the broker were to accept it, the receiver would still get a bad message.
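A rough Python simulation of that failure sequence, under stated assumptions. `convert_with_substitution` is a hypothetical helper, not the sender's VB code or an MQ API. The choice of 0x7F as the substitute byte is taken from what was observed in this thread, not from MQ documentation. The parsing step uses Python's `xml.etree.ElementTree`, not the WMB parser, and reading the substitute byte as U+001A follows the ccsid=437 mapping given above.

```python
import xml.etree.ElementTree as ET

# Assumption: the byte the sender's converter emits for unmappable characters.
SUBSTITUTE = b"\x7f"


def convert_with_substitution(text: str, codec: str) -> bytes:
    """Encode character by character, replacing anything unmappable with SUBSTITUTE."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(codec)
        except UnicodeEncodeError:
            out += SUBSTITUTE
    return bytes(out)


def parses(document: str) -> bool:
    """Return True if ElementTree accepts the document as well-formed XML."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    original = "<msg>hadn\u00b4t</msg>"           # acute accent, not an apostrophe
    wire = convert_with_substitution(original, "cp437")
    print(wire)                                   # the accent is gone from the byte stream

    # Model the receiver treating the substitute byte as U+001A (SUBSTITUTION).
    received = wire.decode("cp437").replace("\x7f", "\x1a")
    print("original parses:", parses(original))
    print("received parses:", parses(received))
```

Note that the original character cannot be recovered from the converted bytes, which is why changing the parser settings on the receiving side would not repair the message.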
murdeep wrote: |
I'll get the developer to set his CCSID to 1208 and then see what happens when the message arrives at WMB. |
Unicode is a good choice, but it won't fix Garbage In, Garbage Out.
murdeep wrote: |
One last point: this message ultimately ends up at a z/OS node. Will there be any issues converting from 1208 to the z/OS CCSID, whatever that may be? |
Unicode is best used end-to-end. If the source message contains characters not supported by the target ccsid, there will be issues. |
|
|
 |
murdeep |
Posted: Fri Feb 12, 2010 8:03 am Post subject: |
|
|
Master
Joined: 03 Nov 2004 Posts: 211
|
Ok, had the developer change the CCSID specified prior to the WriteString method call to CCSID 852, which supports the ASCII 180, and WMB validates the XML with no problem. But as I suspected, when the message arrives at the z/OS qmgr, which is CCSID 37, we run into issues.
So we will try CCSID 1208. My question is: when the z/OS app does a get with convert, do they have to do anything prior to doing the MQGet? I ask because by default won't the conversion be 1208->37? Won't this fail as well? |
|
|
 |
rekarm01 |
Posted: Wed Feb 17, 2010 2:17 am Post subject: Re: Invalid XML characters |
|
|
Grand Master
Joined: 25 Jun 2008 Posts: 1415
|
Using an acute accent character in "...hadn´t..." is a typographical error. It might be worthwhile to have the sender fix that.
murdeep wrote: |
Ok, had the developer change the CCSID specified prior to the WriteString method call to CCSID 852 which supports the ASCII 180 ... |
ASCII is still a 7-bit character set; there is still no ASCII 180. (For ccsid=852, the acute accent actually maps to #239, not #180.)
MS-DOS ccsids are useful for backwards compatibility with DOS applications, but they don't always convert well to/from non-DOS ccsids. Try to avoid them, unless the source or target application is actually running under DOS:
- 437 MS-DOS Latin US
- 737 MS-DOS Greek
- 775 MS-DOS Baltic Rim
- 850 MS-DOS Latin 1
- 851 MS-DOS Greek 1
- 852 MS-DOS Latin 2
- 855 MS-DOS Cyrillic
- 857 MS-DOS Turkish
- 860 MS-DOS Portuguese
- 861 MS-DOS Icelandic
- 862 MS-DOS Hebrew
- 863 MS-DOS French Canada
- 865 MS-DOS Nordic
- 866 MS-DOS Cyrillic CIS 1
- 869 MS-DOS Greek 2
murdeep wrote: |
But as I suspected when the message arrives at the z/OS qmgr which is CCSID 37 we run into issues.
So we will try CCSID 1208. |
Converting to Unicode is much less useful, when it's undone by a subsequent conversion from Unicode. If the target ccsid is 37, why not have the developer set ccsid=37 directly? That way, the broker and target application don't need to convert at all. Any conversion issues can be detected at the source, and dealt with there.
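One way to detect such issues at the source, sketched in Python rather than the sender's VB. `unmappable_characters` is an illustrative name, and Python's `cp037` and `cp437` codecs are assumed to approximate CCSIDs 37 and 437; the real MQ conversion tables may differ.

```python
def unmappable_characters(text: str, codec: str) -> list:
    """List (index, code point) pairs for characters the target codec cannot represent."""
    problems = []
    for index, ch in enumerate(text):
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            problems.append((index, f"U+{ord(ch):04X}"))
    return problems


if __name__ == "__main__":
    payload = "<msg>hadn\u00b4t</msg>"
    for codec in ("cp437", "cp037"):
        issues = unmappable_characters(payload, codec)
        if issues:
            print(f"{codec}: cannot represent {issues}")
        else:
            print(f"{codec}: every character is representable")
```

Running a check like this before the put lets the sender decide what to do with an unmappable character (fix the typo, reject the input, pick another encoding) instead of letting a converter substitute it silently.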
murdeep wrote: |
My question is when the z/OS app does a get with convert do they have to do anything prior to doing the MQGet. |
What could any application possibly do to a message prior to the MQGet? Some issues can only be fixed by the sender. |
|
|
 |
|