MQSeries.net :: View topic - File Input node : how to read UTF-8 aswell as ISO-8851


File Input node : how to read UTF-8 aswell as ISO-8851


Laurens

Posted: Wed Apr 10, 2013 1:11 am Post subject: File Input node : how to read UTF-8 aswell as ISO-8851

Apprentice

Joined: 01 Oct 2009
Posts: 35

Hi all,

I'm trying to let a message flow read files - through File Input Node - that may be either in UTF-8 encoding or ISO-8851 encoding.
The files are XML files and in the prolog one can find the encoding specified.

I thought the FileInput node would be clever enough to use that prolog information to set the CodedCharSetId in the message properties. However, that does not seem to be the case.

Is there an easy way to handle this? I'm probably missing something.

Kind regards
Laurens

kimbert

Posted: Wed Apr 10, 2013 1:26 am Post subject:

Jedi Council

Joined: 29 Jul 2003
Posts: 5543
Location: Southampton

The FileInput node does not read the XML prolog. The reasons are complex, and are explained on this thread: http://www.mqseries.net/phpBB/viewtopic.php?p=289723&sid=849b0c5db0fd7a2f34ea92901fff7bb0

There is a workaround detailed in the same thread.

Laurens

Posted: Wed Apr 10, 2013 4:07 am Post subject:

Apprentice

Joined: 01 Oct 2009
Posts: 35

Thanks, kimbert!

Looks good.

In the meantime I had created something similar:
BLOB -> FlowOrder
(first) -> ResetContentDescriptor BLOB to XMLNSC -> take XMLEncoding
(second) -> set OutputRoot.Properties.CodedCharSetId -> ResetContentDescriptor BLOB to XMLNSC

Is my solution more expensive? Am I correct in assuming that in the (first) branch of the FlowOrder the broker is not parsing the complete message?

smdavies99

Posted: Wed Apr 10, 2013 4:17 am Post subject:

Jedi Council

Joined: 10 Feb 2003
Posts: 6076
Location: Somewhere over the Rainbow this side of Never-never land.

If both your message types have a correct prologue, then you could always just extract that (or compare the first 'n' bytes of the BLOB with a known blobbified string of a valid prologue).

If you have large messages, this may well be more performant than converting the whole BLOB to XMLNSC and then deciding upon the CCSID.

Some experimentation will prove which way takes less time.
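The "look at the first 'n' bytes" idea can be sketched outside the broker. The following is a minimal Python illustration, not broker code: the function name `sniff_declared_encoding` and the 200-byte window are assumptions made for this example, and it only works when the file is in an ASCII-compatible encoding (UTF-8, ISO 8859-x), so that the prolog itself is readable as ASCII bytes.

```python
import re

# Matches encoding="..." (or '...') inside an XML declaration that starts
# at the very first byte of the data.
_PROLOG_ENCODING = re.compile(
    rb'^<\?xml[^>]*?\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_declared_encoding(blob: bytes, max_bytes: int = 200):
    """Return the encoding name declared in the XML prolog, or None.

    Only the first max_bytes bytes are inspected, so the cost does not
    depend on the size of the file.
    """
    head = blob[:max_bytes]
    # Skip a UTF-8 byte order mark if one is present.
    if head.startswith(b"\xef\xbb\xbf"):
        head = head[3:]
    match = _PROLOG_ENCODING.match(head)
    return match.group(1).decode("ascii") if match else None
```

For example, `sniff_declared_encoding(b"<?xml version='1.0' encoding='ISO-8859-1'?><a/>")` returns the declared name as a string, and a document without a declaration returns `None`, in which case the caller has to fall back to a default.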
_________________
WMQ User since 1999
MQSI/WBI/WMB/'Thingy' User since 2002
Linux user since 1995

Every time you reinvent the wheel the more square it gets (anon). If in doubt think and investigate before you ask silly questions.

rekarm01

Posted: Wed Apr 10, 2013 4:30 pm Post subject: Re: File Input node : how to read UTF-8 as well as ISO-8851

Grand Master

Joined: 25 Jun 2008
Posts: 1415

Laurens wrote:

I'm trying to let a message flow read files - through File Input Node - that may be either in UTF-8 encoding or ISO-8851 encoding.

ISO 8851 specifies routine methods for determining the moisture content, non-fat solids content, and fat content of butter. WMB does not currently support that.

... or was that a typo?

Laurens wrote:

In the meantime I had created something similar with BLOB -> FlowOrder -> (first) ResetContentDescriptor BLOB to XMLNSC -> take XMLEncoding

This requires an initial guess for the ccsid, close enough to the actual character encoding to be able to read the XML prolog correctly. If the initial guess were an ASCII-based ccsid, for example, then this would only work for ASCII-based files (UTF-8, ISO 8859, etc.), but not for other files (UTF-16, EBCDIC, etc.). If that's a problem, then a more general BLOB-based solution is necessary.
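One way to make that initial guess from the raw bytes is the byte-pattern check described in the XML 1.0 specification (Appendix F). The sketch below is a simplified Python illustration of that idea, not broker code: the function name and the returned labels are made up for this example, and UTF-32 is deliberately left out.

```python
def guess_encoding_family(blob: bytes) -> str:
    """Guess the encoding family from the first four bytes of an XML file.

    The result is only good enough to pick a ccsid for reading the prolog;
    the declared encoding should still be taken from the prolog itself.
    """
    head = blob[:4]
    # Byte order marks first.
    if head.startswith(b"\xef\xbb\xbf"):
        return "UTF-8"
    if head.startswith(b"\xfe\xff"):
        return "UTF-16BE"
    if head.startswith(b"\xff\xfe"):
        return "UTF-16LE"
    # No byte order mark: look at how "<?xm" or "<?" is laid out.
    if head == b"\x00<\x00?":
        return "UTF-16BE"
    if head == b"<\x00?\x00":
        return "UTF-16LE"
    if head == b"<?xm":
        return "ASCII-compatible"  # UTF-8, ISO 8859-x, and similar
    if head == b"\x4c\x6f\xa7\x94":
        return "EBCDIC"  # "<?xm" in an EBCDIC code page
    return "unknown"
```

With the family known, the prolog can be read with a matching ccsid and the declared encoding extracted from it.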

With on-demand parsing to read the prolog, the FlowOrder node is not that expensive. But the FlowOrder node only propagates its input message to its first and second output terminals. Any changes made to the output message on the first output terminal are not propagated through the second output terminal. So the message flow will have to save the ccsid derived in the first part some other way (such as in the Environment tree) in order to set it in the second part.
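The hand-over between the two branches can be modelled in a few lines. This is a plain Python sketch of the idea, not ESQL and not the broker API: the `environment` dict stands in for the Environment tree, `DetectedCcsid` is a hypothetical field name, and the table only lists a few well-known IBM CCSIDs with UTF-8 assumed as the default.

```python
# A few well-known CCSIDs; extend as needed for the files you receive.
CCSID_BY_ENCODING = {
    "UTF-8": 1208,
    "UTF-16": 1200,
    "ISO-8859-1": 819,
    "ISO-8859-15": 923,
}


def ccsid_for(encoding_name: str, default: int = 1208) -> int:
    """Map an encoding name from the XML prolog to a numeric CCSID."""
    return CCSID_BY_ENCODING.get(encoding_name.strip().upper(), default)


def first_branch(declared_encoding: str, environment: dict) -> None:
    # First branch: remember the derived ccsid in the shared environment,
    # because changes to the message itself do not reach the second branch.
    environment["DetectedCcsid"] = ccsid_for(declared_encoding)


def second_branch(properties: dict, environment: dict) -> None:
    # Second branch: start from the unchanged input message and apply
    # the ccsid saved by the first branch before re-parsing.
    properties["CodedCharSetId"] = environment["DetectedCcsid"]
```

The point of the sketch is only the data flow: the second branch sees the original message properties, so the ccsid has to travel through something both branches share.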

smdavies99

Posted: Wed Apr 10, 2013 11:04 pm Post subject:

Jedi Council

Joined: 10 Feb 2003
Posts: 6076
Location: Somewhere over the Rainbow this side of Never-never land.

You also have to remember that there are a good number of variants to the ISO-8859 Character Sets.

ISO-8859-1 is Western European but does not include the Euro symbol, and so on for the other variants.
If you can guarantee that the XML prologue is correct (and often it is not, due to lazy programmers who don't know the difference between 8859-1 and 8859-15), then by all means go ahead and determine the CCSID from the prologue.
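The difference between 8859-1 and 8859-15 is easy to see with a single byte. The short Python check below is only an illustration of why a wrong prologue matters; the variable names are arbitrary.

```python
raw = b"\xa4"

# The same byte decodes to different characters in the two code pages.
as_latin1 = raw.decode("iso-8859-1")   # U+00A4, the generic currency sign
as_latin9 = raw.decode("iso-8859-15")  # U+20AC, the Euro sign

# The Euro sign cannot be encoded in ISO-8859-1 at all.
try:
    "\u20ac".encode("iso-8859-1")
    euro_fits_latin1 = True
except UnicodeEncodeError:
    euro_fits_latin1 = False
```

A file written as 8859-15 but declared as 8859-1 will therefore parse without error and silently turn every Euro sign into a currency sign.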

My 2p worth is that it would be better to try to get everything UTF-8 but I am aware that this might not be possible.
