Author |
Message
|
Laurens |
Posted: Wed Apr 10, 2013 1:11 am Post subject: File Input node : how to read UTF-8 as well as ISO-8851 |
|
|
Apprentice
Joined: 01 Oct 2009 Posts: 35
|
Hi all,
I'm trying to let a message flow read files - through File Input Node - that may be either in UTF-8 encoding or ISO-8851 encoding.
The files are XML files and in the prolog one can find the encoding specified.
I assumed the FileInput node would be clever enough to use that prolog information to set the CodedCharSetId in the message properties. However, that does not seem to be the case.
Is there an easy way of handling this? Probably I'm missing something.
Kind regards
Laurens |
|
|
 |
kimbert |
Posted: Wed Apr 10, 2013 1:26 am Post subject: |
|
|
 Jedi Council
Joined: 29 Jul 2003 Posts: 5542 Location: Southampton
|
|
|
 |
Laurens |
Posted: Wed Apr 10, 2013 4:07 am Post subject: |
|
|
Apprentice
Joined: 01 Oct 2009 Posts: 35
|
Thanks Kimbert !
Looks good.
In the meantime I had created something similar with BLOB -> FlowOrder -> (first) ResetContentDescriptor BLOB to XMLNSC -> take XMLEncoding
(second) -> set OutputRoot.Properties.CodedCharSetId -> ResetContentDescriptor BLOB to XMLNSC
Is my solution more expensive? Am I correct in assuming that in the (first) branch of the FlowOrder the broker does not parse the complete message? |
|
|
 |
smdavies99 |
Posted: Wed Apr 10, 2013 4:17 am Post subject: |
|
|
 Jedi Council
Joined: 10 Feb 2003 Posts: 6076 Location: Somewhere over the Rainbow this side of Never-never land.
|
If both your message types have a correct prologue then you could always just extract that (or compare the first 'n' bytes of the BLOB with a known blobbified string of a valid prologue).
If you have large messages this may well be more performant than converting the whole BLOB to XMLNSC and then deciding upon the CCSID.
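To make the "first 'n' bytes" idea concrete, here is a minimal sketch in Python rather than ESQL. It only illustrates the logic and is not broker code. The function name, the 200-byte prefix and the small CCSID table (1208 for UTF-8, 819 for ISO-8859-1, 923 for ISO-8859-15) are assumptions for illustration and should be checked against your own environment. It also assumes the file is in an ASCII-compatible encoding, so that the prolog itself can be read as plain bytes.

```python
import re

# Matches an XML declaration at the start of the data and captures the
# value of the optional encoding pseudo-attribute.
_XML_DECL = re.compile(
    rb'^<\?xml\s+version\s*=\s*["\'][^"\']*["\']'
    rb'(?:\s+encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\'])?'
)

# Illustrative mapping only (assumption); extend it for the encodings
# you actually expect to receive.
CCSID_BY_ENCODING = {"UTF-8": 1208, "ISO-8859-1": 819, "ISO-8859-15": 923}


def ccsid_from_prolog(blob: bytes, prefix_len: int = 200, default: int = 1208) -> int:
    """Return a CCSID based on the XML declaration in the first bytes of blob."""
    head = blob[:prefix_len]
    if head.startswith(b"\xef\xbb\xbf"):  # skip a UTF-8 byte order mark
        head = head[3:]
    match = _XML_DECL.match(head)
    if match is None or match.group(1) is None:
        # No declaration or no encoding attribute: fall back to the default.
        return default
    name = match.group(1).decode("ascii").upper()
    if name not in CCSID_BY_ENCODING:
        raise ValueError(f"no CCSID configured for encoding {name!r}")
    return CCSID_BY_ENCODING[name]
```

Only the prefix is inspected, so the cost does not grow with the size of the file. Raising on an unknown encoding name is a deliberate choice here: guessing silently would hide exactly the kind of mislabelled file this check is meant to catch.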
Some experimentation will prove which way takes less time.
_________________
WMQ User since 1999
MQSI/WBI/WMB/'Thingy' User since 2002
Linux user since 1995
Every time you reinvent the wheel the more square it gets (anon). If in doubt think and investigate before you ask silly questions. |
|
|
 |
rekarm01 |
Posted: Wed Apr 10, 2013 4:30 pm Post subject: Re: File Input node : how to read UTF-8 as well as ISO-8851 |
|
|
Grand Master
Joined: 25 Jun 2008 Posts: 1415
|
Laurens wrote: |
I'm trying to let a message flow read files - through File Input Node - that may be either in UTF-8 encoding or ISO-8851 encoding. |
ISO 8851 specifies routine methods for determining the moisture content, non-fat solids content, and fat content of butter. WMB does not currently support that.
... or was that a typo?
Laurens wrote: |
In the meantime I had created something similar with BLOB -> FlowOrder -> (first) ResetContentDescriptor BLOB to XMLNSC -> take XMLEncoding |
This requires an initial guess for the CCSID that is close enough to the actual character encoding to read the XML prolog correctly. If the initial guess were an ASCII-based CCSID, for example, then this would only work for ASCII-based files (UTF-8, ISO 8859, etc.), but not for other files (UTF-16, EBCDIC, etc.). If that's a problem, then a more general BLOB-based solution is necessary.
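As a rough illustration of what a more general BLOB-based check could look like, the sketch below follows the byte patterns described in the non-normative Appendix F of the XML 1.0 specification. It looks at the first four bytes to decide which encoding family the file belongs to before anything tries to read the prolog. It is written in Python purely to show the logic. The function name and the returned labels are hypothetical, and only the most common cases are covered.

```python
def guess_encoding_family(blob: bytes) -> str:
    """Classify the encoding family of an XML document from its first bytes."""
    head = blob[:4]

    # Byte order marks. UTF-32 is tested before UTF-16 because the
    # UTF-16LE mark (FF FE) is a prefix of the UTF-32LE mark (FF FE 00 00).
    if head.startswith(b"\xef\xbb\xbf"):
        return "UTF-8"
    if head in (b"\x00\x00\xfe\xff", b"\xff\xfe\x00\x00"):
        return "UTF-32"
    if head[:2] in (b"\xfe\xff", b"\xff\xfe"):
        return "UTF-16"

    # No byte order mark: look at how the leading "<?" or "<?xm" is encoded.
    if head == b"\x00\x3c\x00\x3f":
        return "UTF-16BE"
    if head == b"\x3c\x00\x3f\x00":
        return "UTF-16LE"
    if head == b"\x4c\x6f\xa7\x94":
        return "EBCDIC"
    if head == b"\x3c\x3f\x78\x6d":
        return "ASCII-compatible"  # UTF-8, ISO 8859-x, etc.
    return "unknown"
```

Once the family is known, a matching initial CCSID can be chosen to read the encoding declaration in the prolog, which then gives the exact CCSID for the real parse.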
With on-demand parsing to read the prolog, the FlowOrder node is not that expensive. But the FlowOrder node only propagates its input message to its first and second output terminals: any changes made to the output message on the first branch are not carried over to the second branch. So the message flow will have to save the CCSID derived in the first branch some other way (such as in the Environment tree) in order to set it in the second branch. |
|
|
 |
smdavies99 |
Posted: Wed Apr 10, 2013 11:04 pm Post subject: |
|
|
 Jedi Council
Joined: 10 Feb 2003 Posts: 6076 Location: Somewhere over the Rainbow this side of Never-never land.
|
You also have to remember that there are a good number of variants to the ISO-8859 Character Sets.
ISO-8859-1 is Western European but does not include the 'Euro' Symbol.
etc
etc
etc.
If you can guarantee that the XML prologue is correct (and often it is not, due to lazy programmers who don't know the difference between 8859-1 and 8859-15), then by all means go ahead and determine the CCSID from the prologue.
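A small Python illustration of why a mislabelled prologue matters: the same byte decodes to a different character in the two character sets, and the Euro sign cannot be encoded in ISO-8859-1 at all. The variable names are only for illustration.

```python
raw = bytes([0xA4])

# The same byte means different things in the two character sets.
as_latin1 = raw.decode("iso-8859-1")   # generic currency sign, U+00A4
as_latin9 = raw.decode("iso-8859-15")  # Euro sign, U+20AC

# ISO-8859-1 has no Euro sign, so encoding it fails.
try:
    "\u20ac".encode("iso-8859-1")
    euro_fits_in_latin1 = True
except UnicodeEncodeError:
    euro_fits_in_latin1 = False
```

A file that really is 8859-15 but declares 8859-1 will therefore parse without any error and silently turn every Euro sign into a currency sign.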
My 2p worth is that it would be better to try to get everything UTF-8 but I am aware that this might not be possible.
|
|
 |
|