
MQSeries.net Forum Index » WebSphere Message Broker (ACE) Support » Problem parsing XML with FileInputNode.

Problem parsing XML with FileInputNode.
jborella
Posted: Fri Oct 22, 2010 5:12 am    Post subject: Problem parsing XML with FileInputNode.

Apprentice

Joined: 04 Jun 2009
Posts: 26

According to the XML 1.0 standard (http://www.w3.org/TR/REC-xml/#NT-EncName), the XML declaration at the top of an XML document isn't required to contain an encoding. If we ignore UTF-16 for now, UTF-8 is assumed when no encoding attribute is provided.

For instance:
<?xml version="1.0"?>
<FixedValue>
<someValue>æøå</someValue>
</FixedValue>

is a valid XML document; UTF-8 is assumed.

<?xml version="1.0" encoding="Windows-1252"?>
<FixedValue>
<someValue>æøå</someValue>
</FixedValue>

is a valid XML document if encoded in Windows-1252. So far so good.
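To make the difference between the two documents concrete, here is a small Python sketch (Python is used purely for illustration and is not part of WMB). It shows that the same three characters are stored as different bytes in UTF-8 and Windows-1252, which is why a parser told to use the wrong CCSID either fails or produces garbage.

```python
# The three Danish letters from the example documents, written as escapes.
text = "\u00e6\u00f8\u00e5"

utf8_bytes = text.encode("utf-8")      # two bytes per character
cp1252_bytes = text.encode("cp1252")   # one byte per character

print(utf8_bytes)     # b'\xc3\xa6\xc3\xb8\xc3\xa5'
print(cp1252_bytes)   # b'\xe6\xf8\xe5'

# Reading Windows-1252 bytes as UTF-8 is a hard error.
try:
    cp1252_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# Reading UTF-8 bytes as Windows-1252 "succeeds" but yields the wrong text.
mojibake = utf8_bytes.decode("cp1252")
print(decoded_ok, mojibake != text)   # False True
```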

When I try to read these two examples in WMB using a FileInput node, parsing fails.

Relevant configured properties are:
'Input message parsing':
Message domain: XMLNSC
Message coded character set ID: Broker System Default

I guess the problem is because I have to provide the CCSID. This shouldn't be necessary, since the CCSID MUST be deduced from the content of the file.

Is this an error or am I doing something wrong?
kimbert
Posted: Fri Oct 22, 2010 7:54 am    Post subject:

Jedi Council

Joined: 29 Jul 2003
Posts: 5542
Location: Southampton

Quote:
I guess the problem is because I have to provide the CCSID. This shouldn't be necessary, since the CCSID MUST be deduced from the content of the file.
That's not strictly true. In that same section, the XML specification says this:
Quote:
In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8. Note that since ASCII is a subset of UTF-8, ordinary ASCII entities do not strictly need an encoding declaration.
So if the XML processor can get the encoding from the transport ( or in your case, the File node's properties ) then it does not have to use the encoding in the XML declaration.
Quote:
According to the XML 1.0 standard (http://www.w3.org/TR/REC-xml/#NT-EncName), the XML declaration at the top of an XML document isn't required to contain an encoding. If we ignore UTF-16 for now, UTF-8 is assumed when no encoding attribute is provided.
Please read the remainder of that section - your interpretation is not correct.
jborella
Posted: Sun Oct 24, 2010 10:01 pm    Post subject:

Apprentice

Joined: 04 Jun 2009
Posts: 26

kimbert wrote:
Quote:
In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8. Note that since ASCII is a subset of UTF-8, ordinary ASCII entities do not strictly need an encoding declaration.
So if the XML processor can get the encoding from the transport ( or in your case, the File node's properties ) then it does not have to use the encoding in the XML declaration.

I don't consider some integration logic in the broker as an external transport protocol, and the examples provided don't suggest that logic in the code is an external transport protocol.

Let me rephrase the question. How do I get the broker to accept the two legal examples of XML I have provided, using one FileInput node, when no external transport protocol is present? I'm satisfied if I can fill in the CCSID with UTF-8 and have the broker fall back to that when no encoding is provided in the XML header.
rekarm01
Posted: Mon Oct 25, 2010 11:24 pm    Post subject: Re: Problem parsing XML with FileInput node

Grand Master

Joined: 25 Jun 2008
Posts: 1415

jborella wrote:
I don't consider some integration logic in the broker as an external transport protocol

Nevertheless, the xml parsers use the ccsid, not the xml encoding, to parse the input message.

jborella wrote:
Let me rephrase the question. How do I get the broker to accept the two legal examples of XML I have provided

One approach is to parse the input message as a BLOB, provide some means to derive the ccsid from the xml encoding, and then re-parse the input message as xml.
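The BLOB-first approach could look roughly like the sketch below, written in Python only to show the logic; in a real flow the same steps would live in a compute node, and the function names `declared_encoding` and `reparse_as_text` are made up for this illustration. It assumes an ASCII-compatible encoding with no byte order mark, so UTF-16 and EBCDIC input are out of scope here.

```python
import re
from typing import Optional

# Matches the encoding pseudo-attribute of an XML declaration in raw bytes.
# [^>]*? cannot run past the end of the declaration.
_DECL = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def declared_encoding(blob: bytes) -> Optional[str]:
    """First pass: treat the message as a BLOB and read only the declaration."""
    match = _DECL.match(blob)
    return match.group(1).decode("ascii") if match else None

def reparse_as_text(blob: bytes) -> str:
    """Second pass: decode with the declared encoding, or UTF-8 if none."""
    return blob.decode(declared_encoding(blob) or "utf-8")
```

In the broker, the second pass would be a re-parse of the BLOB with the derived CCSID rather than a plain decode; translating the encoding name into a CCSID number is a separate step.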
jborella
Posted: Tue Oct 26, 2010 11:48 pm    Post subject: Re: Problem parsing XML with FileInput node

Apprentice

Joined: 04 Jun 2009
Posts: 26

rekarm01 wrote:
Nevertheless, the xml parsers use the ccsid, not the xml encoding, to parse the input message.

That confirms my fears. I'll have to raise this as an issue with IBM, since that makes the handling of XML unnecessarily difficult.

rekarm01 wrote:
One approach is to parse the input message as a BLOB, provide some means to derive the ccsid from the xml encoding, and then re-parse the input message as xml.

Thank you for the advice. I'll work with that; I'd had the exact same idea. The only problem I've found so far is that I'll have to translate encoding names to numbers (e.g. UTF-8 to 1208, Windows-1252 to ??? and so on). The system I'm working on receives all kinds of XML, so I can run into a situation where I haven't anticipated some encoding.
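One way to handle the name-to-number translation is a small lookup table that normalises aliases first and fails loudly on anything not anticipated. The sketch below is in Python for illustration only. The table `CCSID_BY_CODEC` and the function `encoding_to_ccsid` are hypothetical, and the CCSID values (1208, 1200, 819, 1252, 367) are the commonly documented ones, which should be checked against IBM's CCSID reference before use.

```python
import codecs

# Assumed mapping from Python's canonical codec names to IBM CCSIDs.
CCSID_BY_CODEC = {
    "utf-8": 1208,
    "utf-16": 1200,
    "iso8859-1": 819,
    "cp1252": 1252,
    "ascii": 367,
}

def encoding_to_ccsid(xml_encoding: str) -> int:
    """Translate an XML encoding name into a CCSID, or raise ValueError."""
    try:
        # codecs.lookup folds aliases such as "utf8" or "latin1" together.
        canonical = codecs.lookup(xml_encoding).name
    except LookupError:
        raise ValueError(f"not a recognised encoding name: {xml_encoding!r}") from None
    if canonical not in CCSID_BY_CODEC:
        # An unanticipated encoding: surface it instead of guessing.
        raise ValueError(f"no CCSID configured for encoding: {xml_encoding!r}")
    return CCSID_BY_CODEC[canonical]
```

Raising an error for an unknown name means an unanticipated encoding shows up as a clear failure that can be routed to an exception path, instead of being parsed with the wrong CCSID.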

Furthermore, somebody must have had this problem before me. Or do all of you in here always know the encoding beforehand?
smdavies99
Posted: Wed Oct 27, 2010 3:52 am    Post subject: Re: Problem parsing XML with FileInput node

Jedi Council

Joined: 10 Feb 2003
Posts: 6076
Location: Somewhere over the Rainbow this side of Never-never land.

jborella wrote:
Or do all of you in here always know the encoding beforehand?


It is a good idea to try and standardise on a single CCSID wherever possible.
My first choice is 1208 (UTF-8). Then people have to persuade me why this is not a good idea.

There are times (as shown by the plethora of posts on this topic here) when the sending system gets it all wrong. I've even seen messages with UTF-8 attributes sent as CCSID 819....
_________________
WMQ User since 1999
MQSI/WBI/WMB/'Thingy' User since 2002
Linux user since 1995

Every time you reinvent the wheel the more square it gets (anon). If in doubt think and investigate before you ask silly questions.
Vitor
Posted: Wed Oct 27, 2010 4:35 am    Post subject: Re: Problem parsing XML with FileInput node

Grand High Poobah

Joined: 11 Nov 2005
Posts: 26093
Location: Texas, USA

jborella wrote:
Furthermore, somebody must have had this problem before me. Or do all of you in here always know the encoding beforehand?


We either know, or beat the information out of the senders when the input information fails to parse....

As my worthy associate points out, the number of posts in this forum on this topic shows the size of the problem. Again, it's not just not knowing how the input is encoded (and remember most file input is non-XML with no processing instruction), it's data encoded one way but described as being another.
_________________
Honesty is the best policy.
Insanity is the best defence.
jborella
Posted: Mon Nov 01, 2010 1:03 am    Post subject: Re: Problem parsing XML with FileInput node

Apprentice

Joined: 04 Jun 2009
Posts: 26

smdavies99 wrote:
It is a good idea to try and standardise on a single CCSID wherever possible.
My first choice is 1208 (UTF-8). Then people have to persuade me why this is not a good idea.

There are times (as shown by the plethora of posts on this topic here) when the sending system gets it all wrong. I've even seen messages with UTF-8 attributes sent as CCSID 819....

I agree that it would be great always to go with UTF-8, but there are many cases where this solution isn't feasible. In this case, we have to integrate a standard system into our corporation. Of course I could go to the developers of the system and ask them to change the XML encoding to UTF-8. There are several reasons why we don't do that:

    * We would have to pay for the changes.
    * It would take time to get such a change.
    * The system is conforming to the XML standard.
    * It wouldn't give us any revenue to have the change made.

In this case I think it's the broker that's wrongly made. If it can't read an XML document without several fixes, which won't work for all encodings, something is wrong. I'll raise an issue with IBM and keep you posted in here.
smdavies99
Posted: Mon Nov 01, 2010 1:52 am    Post subject: Re: Problem parsing XML with FileInput node

Jedi Council

Joined: 10 Feb 2003
Posts: 6076
Location: Somewhere over the Rainbow this side of Never-never land.

jborella wrote:

In this case I think it's the broker that's wrongly made. If it can't read an XML document without several fixes, which won't work for all encodings, something is wrong. I'll raise an issue with IBM and keep you posted in here.


In my humble experience it is often these external systems that are sending incorrect XML formatted messages. Many of these external suppliers have absolutely no idea of the XML Standards & Rules that you have to observe in order to get a 100% compliant message. 95% character coverage is not enough.

What do you expect broker to do when you get an XML message that is described as being UTF-8 but bits of it are actually ISO8859-1? For a lot of the time you won't notice this error but there are times when you will fail the parsing because of the error in the original data.
Vitor
Posted: Mon Nov 01, 2010 4:42 am    Post subject: Re: Problem parsing XML with FileInput node

Grand High Poobah

Joined: 11 Nov 2005
Posts: 26093
Location: Texas, USA

jborella wrote:
I'll raise an issue with IBM and keep you posted in here.


I for one look forward to your future postings on this issue. Though given some of the contributors to this thread you may already have the substance of the response....

Looking back at your original problem, is it your assertion that, for file-based XML documents arriving in WMB without encoding information, WMB should assume UTF-8 irrespective of the default CCSID specified on the FileInput node or that in use by WMB? I'm just trying to get a handle on what you think in broker is "wrongly made".

I'm also ensuring this problem is clearly separated from the more common problem; where the XML (or indeed the CCSID of an inbound WMQ message) is set to one value but the message is encoded in another.
jborella
Posted: Mon Nov 01, 2010 5:50 am    Post subject: Re: Problem parsing XML with FileInput node

Apprentice

Joined: 04 Jun 2009
Posts: 26

Vitor wrote:
is it your assertion that, for file-based XML documents arriving in WMB without encoding information, WMB should assume UTF-8 irrespective of the default CCSID specified on the FileInput node or that in use by WMB?

I would like the message broker to follow three steps, in order, to find the encoding of messages identified as XML (for instance when the XMLNSC parser is used):
1. If an encoding is provided as part of an external header, this encoding must be used.
2. Otherwise, if an encoding is provided in the XML header, use this encoding.
3. Otherwise, assume UTF-8/16.

My impression is that the broker conforms to 1. If I made this example using MQ as the transport protocol, the broker would use the MQMD header to determine the character encoding. My rules two and three are never used. The broker always uses the CCSID that is hardcoded in the flow logic.
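The three-step order requested above could be sketched as follows. This is a Python illustration of the requested precedence only, not of what the broker does; `resolve_encoding` is a hypothetical name, and the declaration check assumes an ASCII-compatible encoding.

```python
import re
from typing import Optional

# Encoding pseudo-attribute of an XML declaration, matched on raw bytes.
_DECL = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def resolve_encoding(external: Optional[str], data: bytes) -> str:
    # Step 1: an encoding supplied by an external header always wins.
    if external:
        return external
    # Step 2: otherwise use the encoding named in the XML declaration.
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii")
    # Step 3: otherwise UTF-16 if a UTF-16 byte order mark is present,
    # else UTF-8.
    if data[:2] in (b"\xff\xfe", b"\xfe\xff"):
        return "UTF-16"
    return "UTF-8"
```

With a FileInput node there is no external header, so under this scheme `external` would be empty and steps 2 and 3 would decide the encoding.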
mqjeff
Posted: Mon Nov 01, 2010 5:58 am    Post subject:

Grand Master

Joined: 25 Jun 2008
Posts: 17447

How can broker read the XML header if it doesn't know what the encoding is?
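For reference, the XML 1.0 specification addresses this bootstrap problem in its non-normative Appendix F: because a declaration must begin with `<?xml`, the first few bytes are enough to identify the encoding family, after which the declaration itself can be read. Below is a rough Python sketch of that idea. The function name `guess_encoding_family` and the returned labels are made up for illustration, and the UCS-4 cases from the appendix are left out for brevity.

```python
def guess_encoding_family(data: bytes) -> str:
    """Guess the encoding family from the first bytes of an XML entity."""
    head = data[:4]
    # Byte order marks come first.
    if head.startswith(b"\xef\xbb\xbf"):
        return "UTF-8 (BOM)"
    if head[:2] == b"\xfe\xff":
        return "UTF-16BE (BOM)"
    if head[:2] == b"\xff\xfe":
        # Assumption: UCS-4 little-endian (FF FE 00 00) is not handled here.
        return "UTF-16LE (BOM)"
    # No BOM: look at how the characters "<?xm" are laid out.
    if head == b"\x00<\x00?":
        return "UTF-16BE"
    if head == b"<\x00?\x00":
        return "UTF-16LE"
    if head == b"<?xm":
        # ASCII, UTF-8, ISO 8859-x, Windows-1252 and similar encodings:
        # the declaration can now be read as ASCII to get the exact name.
        return "ASCII-compatible"
    if head == b"\x4c\x6f\xa7\x94":
        return "EBCDIC"
    # Nothing recognisable: the specification's default applies.
    return "UTF-8 (default)"
```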
jborella
Posted: Mon Nov 01, 2010 6:09 am    Post subject: Re: Problem parsing XML with FileInput node

Apprentice

Joined: 04 Jun 2009
Posts: 26

smdavies99 wrote:
In my humble experience it is often these external systems that are sending incorrect XML formatted messages. Many of these external suppliers have absolutely no idea of the XML Standards & Rules that you have to observe in order to get a 100% compliant message. 95% character coverage is not enough.

Although I've experienced a lot of what you're describing, that isn't the case here. All of the XML received thus far has been valid.

smdavies99 wrote:
What do you expect broker to do when you get an XML message that is described as being UTF-8 but bits of it are actually ISO8859-1?

I would expect the parser to fail, as described in the XML 1.0 standard.
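As an illustration of that expectation, here is how one strict parser behaves. The sketch uses Python's standard `xml.etree.ElementTree` (not the broker's XMLNSC parser) on three documents that all declare UTF-8: one pure ASCII, one whose content is really ISO-8859-1, and one that is genuinely UTF-8. The helper name `parses` is made up for this example.

```python
import xml.etree.ElementTree as ET

DECL = b'<?xml version="1.0" encoding="UTF-8"?>'
BODY = "<someValue>\u00e6\u00f8\u00e5</someValue>"  # the thread's sample letters

def parses(document: bytes) -> bool:
    """Return True if the document is accepted, False on a parse error."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

ascii_only = DECL + b"<someValue>abc</someValue>"
mislabeled = DECL + BODY.encode("iso-8859-1")   # declared UTF-8, is not
correct = DECL + BODY.encode("utf-8")

print(parses(ascii_only), parses(mislabeled), parses(correct))
# True False True
```

The ASCII-only document passes regardless, which matches the earlier observation that such mislabelling often goes unnoticed until a non-ASCII character turns up.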
kimbert
Posted: Mon Nov 01, 2010 6:53 am    Post subject:

Jedi Council

Joined: 29 Jul 2003
Posts: 5542
Location: Southampton

If I have understood this correctly:
- the XML has an XML declaration which accurately describes the encoding of the XML document.
- the FileInput node is not reading the XML declaration to determine the encoding. That may be a deliberate design decision aimed at making the FileInput node behave consistently with other nodes. On the other hand, it might be a defect.

I think we should wait and see what the response to the PMR is.
rekarm01
Posted: Mon Nov 01, 2010 4:10 pm    Post subject: Re: Problem parsing XML with FileInput node

Grand Master

Joined: 25 Jun 2008
Posts: 1415

jborella wrote:
In this case I think it's the broker that's wrongly made.

Nevertheless, the broker's implementation is consistent with the XML standard.

jborella wrote:
I would like the message broker to follow three steps, in order, to find the encoding of messages identified as XML (for instance when the XMLNSC parser is used):
  1. If an encoding is provided as part of an external header, this encoding must be used.
  2. Otherwise, if an encoding is provided in the XML header, use this encoding.
  3. Otherwise, assume UTF-8/16.
My impression is that the broker conforms to 1.

The XML parsers expect an external header to provide the character encoding, overriding any XML declaration in the message body or any other default value.

jborella wrote:
If I made this example using MQ as the transport protocol, the broker would use the MQMD header to determine the character encoding.

No. The "external transport protocol" is between the logical message tree and the parser, not between MQ and the broker.