mattynorm |
Posted: Fri Sep 25, 2009 10:25 pm Post subject: Different behaviour when reformatting xml |
|
|
Acolyte
Joined: 06 Jun 2003 Posts: 52
|
I am working on a quick fix to a problem with a file being generated by a 3rd party application.
Basically they are sending an xml file in double byte format (i.e. there is a hex '00' between each readable character), with a CCSID of 1200 (UTF-16), but an xml declaration in the file of UTF-8. We have told them this is rubbish, but they can't provide a fix until Jan at the earliest. So I need an interim solution (the flow is failing the first time it tries to access an element in the tree).
I wrote a very simple flow, consisting of MQInput, Compute Node and MQOutput, taking the message in as a BLOB. The compute node had the following ESQL (from memory, so please excuse typos):
CALL CopyMessageHeaders();
SET OutputRoot.BLOB.BLOB = TRANSLATE(InputRoot.BLOB.BLOB, x'00');
SET OutputRoot.MQMD.CodedCharSetId = 1208;
Setting the queue name of the MQOutput node to the input queue of the original flow, and placing the failing message on my Input queue has solved the problem.
However, it has now been decreed that it should all be done in the same flow, so between the MQInput node (now set to BLOB) and the first compute node, I included another compute node (containing the C&P'd code above), and a ResetContentDescriptor to set the domain back to XMLNSC. This flow now fails and the message backs out, and I don't understand what the difference is between the 2 scenarios, or why it should fail in the single-flow solution.
Any ideas? |
|
Vitor |
Posted: Sat Sep 26, 2009 3:28 am Post subject: |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
AFAIK the broker uses the CCSID of the message to parse it rather than the XML declaration; what CCSID is the message sent with and have you tried simply overriding that to be 1200 (so it matches the message content) rather than this edit? _________________ Honesty is the best policy.
Insanity is the best defence. |
|
mattynorm |
Posted: Sat Sep 26, 2009 4:20 am Post subject: |
|
|
Acolyte
Joined: 06 Jun 2003 Posts: 52
|
The message is sent with a CCSID of 1200, which is (according to Google) UTF-16. But the message is sent with an xml declaration of utf-8. The broker doesn't seem to like reading the tree with this combination of declaration and CCSID; it backs out with an unspecified xml parsing error on the first read of the message tree. I am trying to reformat it into single byte utf-8 xml by removing all hex '00's, and this approach works when it is in a separate flow, but not if the nodes are in the same flow. The difference in behaviour seems to be the difference between doing an MQPUT then an MQGET (the GET being on the next flow's input node, which specifies the XMLNSC parser, thus having the effect of changing the parser from BLOB to XMLNSC), and doing this within the same flow using the RCD node.
I never really like manipulating xml strings using BLOB, so any other suggestions for a neater solution also welcome! |
|
Vitor |
Posted: Sat Sep 26, 2009 10:21 am Post subject: |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
mattynorm wrote: |
any other suggestions for a neater solution also welcome! |
Post the actual parsing error, preferably as part of a user trace. There are those among us who are good with this sort of thing. _________________ Honesty is the best policy.
Insanity is the best defence. |
|
fjb_saper |
Posted: Sun Sep 27, 2009 12:06 am Post subject: |
|
|
 Grand High Poobah
Joined: 18 Nov 2003 Posts: 20756 Location: LI,NY
|
I would have thought that you'd try a read with convert? (input node)...
Not always advisable especially if the qmgr's ccsid does not cover all the chars in the input message...
However your way of manipulating the BLOB is something I'd want to avoid...
Have you tried the CAST as char CCSID method? You can then recast as BLOB in a different CCSID (1208) and pass it with that CCSID to the content reset node...
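For anyone who wants to see what the cast-and-recast idea amounts to at the byte level, here is a rough Python sketch. It is not broker code: `recode` is a made-up helper, and Python's `utf-16-le` codec is only assumed to correspond to the byte order the message really uses (the trace posted later in the thread suggests little-endian).

```python
def recode(payload: bytes,
           source_encoding: str = "utf-16-le",
           target_encoding: str = "utf-8") -> bytes:
    """Decode the raw bytes with the encoding they were really written in,
    then re-encode them in the encoding the XML declaration claims.

    Hypothetical helper for illustration; the default source encoding
    is an assumption, not something the sender has confirmed.
    """
    text = payload.decode(source_encoding)   # bytes -> characters
    return text.encode(target_encoding)      # characters -> bytes


# A sample document in the shape described in the thread:
# UTF-16 bytes carrying a declaration that says UTF-8.
sample = '<?xml version="1.0" encoding="UTF-8"?><a>é</a>'
wire_bytes = sample.encode("utf-16-le")

fixed = recode(wire_bytes)
print(fixed)
```

After the recode, the bytes and the declaration agree with each other, which is the point of going through characters rather than editing the bytes directly.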
Have fun  _________________ MQ & Broker admin |
|
rekarm01 |
Posted: Sun Sep 27, 2009 1:34 am Post subject: Re: Different behaviour when reformatting xml |
|
|
Grand Master
Joined: 25 Jun 2008 Posts: 1415
|
mattynorm wrote: |
Basically they are sending an xml file in double byte format (i.e. there is a hex '00' between each readable character), with a CCSID of 1200 (UTF-16), but an xml declaration in the file of UTF-8. |
As Vitor pointed out, the xml parsers use the CCSID to parse the message; they don't use the xml declaration. If there's a parsing problem, it's a good guess that the sending CCSID is wrong. There are different UTF-16 encoding schemes, such as big-endian (ccsid=1200), little-endian (ccsid=1202), or byte-order-mark-endian (ccsid=1204).
mattynorm wrote: |
Code: |
SET OutputRoot.BLOB.BLOB = TRANSLATE(InputRoot.BLOB.BLOB, x'00');
SET OutputRoot.MQMD.CodedCharSetId = 1208; |
|
This won't work for any messages with non-ASCII characters.
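To illustrate the point, a small Python sketch (not ESQL; `strip_nuls` is a hypothetical stand-in for the TRANSLATE call, and little-endian UTF-16 input is an assumption) showing that dropping every x'00' byte only happens to work for ASCII-only content:

```python
def strip_nuls(data: bytes) -> bytes:
    """Stand-in for the TRANSLATE approach: drop every 0x00 byte."""
    return data.replace(b"\x00", b"")


ascii_text = "<a>cafe</a>"
accented_text = "<a>café €5</a>"

ascii_stripped = strip_nuls(ascii_text.encode("utf-16-le"))
accented_stripped = strip_nuls(accented_text.encode("utf-16-le"))

# ASCII-only input: the result happens to equal the UTF-8 encoding.
print(ascii_stripped == ascii_text.encode("utf-8"))

# Non-ASCII input: the result is not the UTF-8 encoding of the text.
print(accented_stripped == accented_text.encode("utf-8"))

# Check whether the stripped bytes are even decodable as UTF-8.
try:
    accented_stripped.decode("utf-8")
    stripped_is_valid_utf8 = True
except UnicodeDecodeError:
    stripped_is_valid_utf8 = False
print(stripped_is_valid_utf8)
```

Characters outside ASCII either keep a non-zero high byte or need a multi-byte UTF-8 sequence, so removing zero bytes cannot produce valid UTF-8 for them.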
Determine what's actually wrong with the message first, before deciding how to fix it; an exception or usertrace would help. If the CCSID is wrong and can't be fixed by the sender, the message flow needs to fix it before invoking any XML parsers.
Edit: Corrected last sentence; it's a little more complicated than just changing the CCSID directly.
Last edited by rekarm01 on Mon Sep 28, 2009 6:22 pm; edited 1 time in total |
|
mattynorm |
Posted: Mon Sep 28, 2009 6:46 am Post subject: |
|
|
Acolyte
Joined: 06 Jun 2003 Posts: 52
|
Here is the relevant bit from the trace, when run by the normal process with no attempted conversion:
Quote: |
2009-09-28 15:33:25.981244 60 ParserException BIP5004E: An XML parsing error ''An invalid XML character (Unicode: 0x3c00) was found in the prolog of the document.'' occurred on line 1 column 1 when parsing element ''/Root/XMLNSC''. Internal error codes are '1504' and '2'.
This error was reported by the generic XML parser, and is usually the result of a badly formed XML message.
Check that the input XML message is a well-formed XML message that adheres to the XML specification. The line number and column number that are quoted in the message give the position where the parser discovered the problem. However, the actual error might be earlier in the message.
Other possible causes are:
1. A character that is not supported by XML occurs in the instance message data.
XML supports only a subset of control characters; therefore, ensure that no unsupported characters, such as X'00', appear in the document.
2. The Coded Character Set ID that is defined in the message header does not reflect the contents of the instance message.
If the XML document has an XML prologue, the WebSphere MQ CodedCharSetId should be consistent with the XML Encoding field.
3. A reserved XML character appears in the instance message data.
Characters that might be recognized as XML markup - for example, < and & - should be replaced with the corresponding XML entities - &lt; and &amp;. |
|
|
Vitor |
Posted: Mon Sep 28, 2009 7:32 am Post subject: |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
Quote: |
2. The Coded Character Set ID that is defined in the message header does not reflect the contents of the instance message.
If the XML document has an XML prologue, the WebSphere MQ CodedCharSetId should be consistent with the XML Encoding field |
Well that's something I wasn't expecting to see!
Clearly the broker is using the encoding in the document, which is not something I've experienced personally. I've seen single byte documents with "utf-16" encoding parse correctly under WMBv6. _________________ Honesty is the best policy.
Insanity is the best defence. |
|
mqjeff |
Posted: Mon Sep 28, 2009 8:01 am Post subject: |
|
|
Grand Master
Joined: 25 Jun 2008 Posts: 17447
|
mattynorm wrote: |
2009-09-28 15:33:25.981244 60 ParserException BIP5004E: An XML parsing error ''An invalid XML character (Unicode: 0x3c00) was found in the prolog of the document.'' occurred on line 1 column 1 |
Line 1 Column 1.
It hasn't gotten to the <?xml declaration yet.
Is there a BOM? |
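A quick way to answer that from a hex dump or a saved copy of the payload is to look at the first two or three bytes. A minimal Python sketch, assuming the raw message bytes are available; `describe_bom` is a hypothetical helper and it deliberately ignores UTF-32:

```python
import codecs


def describe_bom(data: bytes) -> str:
    """Report which byte-order mark, if any, the payload starts with.

    Illustrative helper only; UTF-32 marks are not distinguished.
    """
    if data.startswith(codecs.BOM_UTF8):        # EF BB BF
        return "UTF-8 BOM"
    if data.startswith(codecs.BOM_UTF16_BE):    # FE FF
        return "UTF-16 big-endian BOM"
    if data.startswith(codecs.BOM_UTF16_LE):    # FF FE
        return "UTF-16 little-endian BOM"
    return "no BOM"


# Payload starting straight away with '<' in little-endian UTF-16.
print(describe_bom(b"\x3c\x00\x3f\x00"))
# Same payload with a little-endian byte-order mark in front.
print(describe_bom(b"\xff\xfe\x3c\x00"))
```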
|
Vitor |
Posted: Mon Sep 28, 2009 8:35 am Post subject: |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
mqjeff wrote: |
mattynorm wrote: |
2009-09-28 15:33:25.981244 60 ParserException BIP5004E: An XML parsing error ''An invalid XML character (Unicode: 0x3c00) was found in the prolog of the document.'' occurred on line 1 column 1 |
Line 1 Column 1.
It hasn't gotten to the <?xml declaration yet.
|
Doh!  _________________ Honesty is the best policy.
Insanity is the best defence. |
|
mattynorm |
Posted: Mon Sep 28, 2009 11:39 am Post subject: |
|
|
Acolyte
Joined: 06 Jun 2003 Posts: 52
|
Yes, it's falling because it's not expecting the x'00' byte I think. Had a bit more of a play with it, and the message continues if I set the CCSID using rfhutil to 1202 before sticking the message on the queue. However, if I have a compute node where I reset the CodedCharSet to 1202, it falls over before there with the parsing error. I've also tried checking the convert option on the MQInput node, and setting the CCSID to 1202 there, but it still falls over. Hmmm. |
|
Vitor |
Posted: Mon Sep 28, 2009 11:42 am Post subject: |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
mattynorm wrote: |
I've also tried checking the convert option on the MQInput node, and setting the CCSID to 1202 there, but it still falls over. Hmmm. |
Remember that the values in the MQInput node are only defaults; if there's an RFH2 header in the message then WMB will use that CCSID not the node's value. _________________ Honesty is the best policy.
Insanity is the best defence. |
|
mattynorm |
Posted: Mon Sep 28, 2009 11:48 am Post subject: |
|
|
Acolyte
Joined: 06 Jun 2003 Posts: 52
|
There's no rfh header on the message. And I don't think there's any BOM at the start, though I'll check this again tomorrow.
Thanks for all your help so far |
|
kimbert |
Posted: Mon Sep 28, 2009 12:40 pm Post subject: |
|
|
 Jedi Council
Joined: 29 Jul 2003 Posts: 5542 Location: Southampton
|
I was interested in the wording of this statement:
Quote: |
However, if I have a compute node where I reset the CodedCharSet to 1202, it falls over before there with the parsing error |
I was assuming that you would set the Domain on the MQInput node to 'BLOB'. So I'm puzzled as to how you could get a parsing error *before* the compute node where you reset the CCSID ( BLOB domain doesn't issue parsing errors ). |
|
rekarm01 |
Posted: Mon Sep 28, 2009 8:00 pm Post subject: Re: Different behaviour when reformatting xml |
|
|
Grand Master
Joined: 25 Jun 2008 Posts: 1415
|
mattynorm wrote: |
Yes, it's falling because it's not expecting the x'00' byte I think. |
It's expecting the x'00', but in a different place:
Code: |
CCSID=1200: '<?xml ...' = x'003c 003f 0078 006d 006c ...' (big-endian)
CCSID=1202: '<?xml ...' = x'3c00 3f00 7800 6d00 6c00 ...' (little-endian) |
The problem is that the MQ input header is wrong, causing the xml parser to misread the data:
Quote: |
BIP5004E: An XML parsing error ''An invalid XML character (Unicode: 0x3c00) was found in the prolog of the document.'' occurred on line 1 column 1 |
(Unicode: 0x3c00) is a CJK unified Han ideograph; it's not a good way to start an XML prolog.
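The same misreading can be reproduced outside the broker. This is only a Python sketch; it assumes Python's `utf-16-be` and `utf-16-le` codecs match the byte layouts listed above for CCSID 1200 and 1202.

```python
prolog = "<?xml"

# Byte layouts corresponding to the two lines in the code box above.
big_endian = prolog.encode("utf-16-be")
little_endian = prolog.encode("utf-16-le")

print(big_endian.hex(" ", 2))
print(little_endian.hex(" ", 2))

# Read little-endian bytes as if they were big-endian, which is what
# happens when the header claims one byte order and the data uses the other.
misread = little_endian.decode("utf-16-be")
first_code_point = ord(misread[0])
print(hex(first_code_point))
```

The first character the decoder sees is U+3C00 rather than '<', which matches the character quoted in the BIP5004E text.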
mattynorm wrote: |
Had a bit more of a play with it, and the message continues if I set the CCSID using rfhutil to 1202 before sticking the message on the queue. |
That's the best thing to do. Do that.
mattynorm wrote: |
However, if I have a compute node where I reset the CodedCharSet to 1202, it falls over before there with the parsing error. I've also tried checking the convert option on the MQInput node, and setting the CCSID to 1202 there, but it still falls over. Hmmm. |
The problem here is that any attempt to convert on MQInput or to modify the output CCSID assumes that the input CCSID is correct to begin with.
If the sender can't fix the input CCSID, then the message flow could try to work around it, before invoking the xml parser. There are (at least) 2 ways to do this:
- MQInput (domain=BLOB)
  -> Compute "SET OutputRoot.Properties.CodedCharSetId = 1202;"
  -> RCD (domain=XMLNSC)
  -> ...
- MQInput (domain=BLOB)
  -> Compute "CREATE LASTCHILD OF OutputRoot DOMAIN 'XMLNSC' PARSE(InputRoot.BLOB.BLOB CCSID 1202);"
  -> ...
Both of these approaches ignore the input CCSID, and hard-code CCSID=1202 instead. That is, at best, a temporary work-around, to be removed as soon as the sender can fix the input message. |
|
|