MRM Converting Text to Binary |
goffinf |
Posted: Thu Mar 14, 2013 11:11 am Post subject: MRM Converting Text to Binary |
|
|
Chevalier
Joined: 05 Nov 2005 Posts: 401
|
version: 6.1.0.10
I am experimenting with converting between physical formats in MRM.
Part of my flow looks like this :-
Code: |
HTTPInput --> Compute --> HTTPReply
MRM           (ConvertToBinary)
Format = Text1
Inside the Compute node (ConvertToBinary) :-
SET OutputRoot.Properties.MessageFormat = 'Binary1';
|
I am using one of the sample MessageSets, which has a structure where the first field is called 'firstname'. In the Text1 definition this is FIXED LENGTH (12) and the units are CHARACTER.
The Binary1 format for 'firstname' is FIXED LENGTH (12) and the units are BYTES.
If I send thru this for firstname everything works quite happily :-
Bob#########
However, if I change the 'B' to '£' and send :-
£ob########
Whilst the HTTPInput parses this correctly, I get an error when the flow reaches the HTTPReply.
Here's the user trace that shows that it successfully got thru the HTTPInput :-
Code: |
UserTrace BIP5494I: The logical tree is now being matched to the message model.
UserTrace BIP5564I: Item ''customer'' from the logical tree has matched with the message model as ''[MESSAGE]_Party/customer(1 of unbounded)''.
UserTrace BIP5564I: Item ''firstname'' from the logical tree has matched with the message model as ''[MESSAGE]_Party/customer(1 of unbounded)/firstname''.
UserTrace BIP5564I: Item ''lastname'' from the logical tree has matched with the message model as ''[MESSAGE]_Party/customer(1 of unbounded)/lastname''.
UserTrace BIP5564I: Item ''streetaddress'' from the logical tree has matched with the message model as ''[MESSAGE]_Party/customer(1 of unbounded)/streetaddress''.
UserTrace BIP5564I: Item ''cityname'' from the logical tree has matched with the message model as ''[MESSAGE]_Party/customer(1 of unbounded)/cityname''.
UserTrace BIP5564I: Item ''statecode'' from the logical tree has matched with the message model as ''[MESSAGE]_Party/customer(1 of unbounded)/statecode''.
UserTrace BIP5564I: Item ''postcode'' from the logical tree has matched with the message model as ''[MESSAGE]_Party/customer(1 of unbounded)/postcode''.
UserTrace BIP5564I: Item ''referencecode'' from the logical tree has matched with the message model as ''[MESSAGE]_Party/customer(1 of unbounded)/referencecode''.
|
and here the part of the user trace which shows that there's a problem with firstname :-
Code: |
BIP2230E: Error detected whilst processing a message in node 'com.mycompany.messagebroker.unittest.http.ConvertFormatHTTP.HTTP Reply'.
...
ParserException BIP5286E: Writing errors have occurred.
Message set name: 'mbunittestMessageSet'
Message format: 'Binary1'
Message type path: '/Party'
Review other error messages to find the cause of the errors.
ParserException BIP5167E: A Custom Wire Format error occurred during the parsing or writing of message ''Party''.
See the following messages for further details. Contact your IBM support center if you cannot resolve the error.
ParserException BIP5350E: There was a Custom Wire Format error when writing the message ''Party''.
The error occurred during or after the writing of element ''/Party/customer/firstname''.
Check that the message has been built correctly and conforms to the MRM model.
Review other error messages for more details.
ParserException BIP5168E: A Custom Wire Format writing error occurred involving an incorrect data conversion.
Element ''firstname'' is either too long, or is out of range for the physical data type of ''fixed-length string''.
While the logical tree was being written to the bit stream, a data conversion error occurred.
Change the definition of the element so that it can store the data safely. Alternatively change the message that is being written so that the value is in the correct range for the element.
...
|
You should know that the CodedCharSetId is 1208 and Encoding 546.
Of course I chose the £ character deliberately because in UTF-8 (1208) it occupies 2 bytes (c2a3), and although it's probably hard to see above, the second input message has one fewer # to account for that (hence it was OK on the way IN).
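To make the character/byte difference concrete, here is a small sketch in Python (not ESQL, and not something the broker runs) that measures the two sample values above. The variable names are illustrative only.

```python
# Character count versus UTF-8 byte count for the two sample 'firstname' values.
plain = "Bob" + "#" * 9   # 12 characters, all single-byte in UTF-8
pound = "£ob" + "#" * 8   # 11 characters, one '#' fewer than the first sample

# '£' is U+00A3, which UTF-8 encodes as the two bytes C2 A3.
pound_sign_hex = "£".encode("utf-8").hex()

for value in (plain, pound):
    # Prints the character count followed by the UTF-8 byte count.
    print(len(value), len(value.encode("utf-8")))
```

Both values occupy 12 bytes in CCSID 1208 even though the second one holds only 11 characters.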
If I look at the logical message tree AFTER the HTTPInput but BEFORE the Compute I see this (from a Trace node) :-
Code: |
(0x01000021:Name+):MRM = ( ['mrm' : 0xb604b78]
(0x01000013:Name+):customer = (
(0x0300000B:NameValue+):firstname = '£ob#########' (CHARACTER)
|
and AFTER the Compute :-
Code: |
(0x01000021:Name+):MRM = ( ['mrm' : 0xb604b78]
(0x01000013:Name+):customer = (
(0x0300000B:NameValue+):firstname = '£ob#########' (CHARACTER)
|
They are both the same but they both have the FULL 12 CHARACTERS (as of course they should !)
So ... it seems that when serialization of the message tree is attempted by the HTTPReply node which is now using the 'Binary1' format the exception occurs because on converting the '£' to bytes (c2a3) and then all the other characters, we end up with one byte too many !
OK ... but how do I get around this problem (given that the input text could contain ANY character permissible in UTF-8, which of course is a multi-byte encoding using 1-4 bytes per character) ???
Regards
Fraser. |
|
Back to top |
|
 |
kimbert |
Posted: Thu Mar 14, 2013 12:51 pm Post subject: |
|
|
 Jedi Council
Joined: 29 Jul 2003 Posts: 5542 Location: Southampton
|
Quote: |
So ... it seems that when serialization of the message tree is attempted by the HTTPReply node which is now using the 'Binary1' format the exception occurs because on converting the '£' to bytes (c2a3) and then all the other characters, we end up with one byte too many ! |
Yes, that is exactly what is happening.
Quote: |
OK ... but how do I get around this problem (given that the input text could contain ANY character permissible in UTF-8 which of course is a multi-byte encoding 1-4) ??? |
Well....you could switch to using DFDL ( which requires v8, of course ). DFDL allows you to explicitly declare that your output field is a fixed number of bytes AND that it contains characters. It also allows you to set a property that explicitly allows truncation.
The MRM parser has no such facility, so it has no option but to throw an exception and force you ( the integration developer ) to code around the problem. |
|
goffinf |
Posted: Thu Mar 14, 2013 12:51 pm Post subject: |
|
|
Chevalier
Joined: 05 Nov 2005 Posts: 401
|
Interesting, I am always looking for opportunities to play with DFDL (we will hopefully move across this year).
For v6.1 I guess I'll have to map each field across individually taking the leftmost number of characters according to the definition. |
|
goffinf |
Posted: Fri Mar 15, 2013 3:49 am Post subject: |
|
|
Chevalier
Joined: 05 Nov 2005 Posts: 401
|
goffinf wrote: |
For v6.1 I guess I'll have to map each field across individually taking the leftmost number of characters according to the definition. |
Of course that's not going to be a very reliable approach is it ?
What I have done for now is to CAST each input field to BLOB (with the correct output CodedCharSetId and Encoding), take the leftmost number of bytes according to the output definition, then CAST back to CHAR (repeat for all 'records'). Then set the OutputRoot.Properties.MessageFormat to 'Binary1'.
That also isn't going to be reliable, since I guess it's possible that for a multi-byte encoding (like UTF-8) one or more of the bytes that relate to a given character could be truncated, thus changing that character when it is serialized.
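A quick way to see the split-character risk, sketched in Python rather than ESQL. The 12-byte width comes from the Binary1 definition above; the sample value is made up so that the two-byte '£' straddles the field boundary.

```python
# Truncating at a byte boundary can cut a multi-byte UTF-8 character in half.
FIELD_BYTES = 12                     # fixed length of 'firstname' in Binary1

value = "#" * 11 + "£"               # 12 characters (hypothetical sample)
encoded = value.encode("utf-8")      # 13 bytes: 11 x '#' + C2 A3
truncated = encoded[:FIELD_BYTES]    # keeps C2 but drops A3

try:
    truncated.decode("utf-8")
    split_detected = False
except UnicodeDecodeError:
    # The trailing lone C2 byte is not a complete character.
    split_detected = True

print(truncated.hex(), split_detected)
```

The truncated buffer is the right length but no longer valid UTF-8, which is the failure mode described above.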
So ... probably the only safe way is to process each individual character until you can be sure that all fit into the available fixed length or, if they don't, only include the ones that do and fill the rest with the designated padding character (so that no character gets 'split' and the output only contains the complete set of bytes for any character).
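The per-character approach just described could look roughly like this. It is a Python sketch of the logic only, not ESQL for the Compute node; `fit_to_bytes` and its parameters are hypothetical names, and it assumes a single-byte padding character and an encoding that adds no byte-order mark (UTF-8 here).

```python
def fit_to_bytes(value: str, width: int, pad: str = "#",
                 encoding: str = "utf-8") -> bytes:
    """Keep only whole characters that fit in `width` bytes, then pad."""
    pad_bytes = pad.encode(encoding)
    if len(pad_bytes) != 1:
        # Assumption: a single-byte pad, so the field can always be filled exactly.
        raise ValueError("padding character must encode to one byte")

    out = b""
    for ch in value:
        ch_bytes = ch.encode(encoding)
        if len(out) + len(ch_bytes) > width:
            break                      # this character would overflow: stop here
        out += ch_bytes

    # Fill whatever is left with the padding byte.
    return out + pad_bytes * (width - len(out))


# The 12-character tree value from the trace needs 13 bytes, so the last '#' is dropped.
fitted = fit_to_bytes("£ob" + "#" * 9, 12)
# A two-byte character that straddles the boundary is dropped whole and replaced by padding.
edge = fit_to_bytes("#" * 11 + "£", 12)
print(fitted.decode("utf-8"), edge.decode("utf-8"))
```

Either way the result is exactly 12 bytes and always decodes cleanly, at the cost of silently losing trailing characters.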
Seems like an awful lot of work .... is there an easier way in v6.1 ??
.. and does DFDL guarantee to do the right thing ??
Regards
Fraser. |
|
kimbert |
Posted: Fri Mar 15, 2013 7:36 am Post subject: |
|
|
 Jedi Council
Joined: 29 Jul 2003 Posts: 5542 Location: Southampton
|
Quote: |
That also isn't going to be reliable since I guess its possible that for multi-byte encoding like UTF-8 one or more of the bytes that relate to a given character could be truncated thus changing that character when it is serialized. |
Once again, you are 100% correct. This is the essential point that I was making on this thread: http://www.mqseries.net/phpBB2/viewtopic.php?t=63656&sid=5671e12568f5fd4ff28d6d61f68bb522
Quote: |
Your COBOL application was probably originally designed for single-byte EBCDIC characters, in which case the distinction between characters and bytes would not matter. It is now being expected to handle UTF-8 data, and it is breaking. This is not IBM's problem - it is a problem that crops up continually all over the world when programmers fail to take into account the facts explained here: http://www.joelonsoftware.com/articles/Unicode.html |
Quote: |
Seems like an awful lot of work .... is there an easier way in v6.1 ?? |
It is work that exists because a data format that uses fixed-length fields is being expected to cope with UTF-8 data. The cheapest fix in the long term is to move to a different data format. In the meantime, your fix is going to throw away the minimum number of characters, so it's probably the best you're going to get.
Quote: |
and does DFDL guarantee to do the right thing ?? |
When parsing, DFDL guarantees not to complain about malformed partial characters at the end of a field if the field's length is a fixed number of bytes. When writing, DFDL guarantees to fill any unusable bytes at the end of a fixed-length byte buffer with a specified fill byte value.
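For anyone who wants to picture the parsing half of that, here is a rough Python emulation. It is not DFDL and makes no claim about how DFDL is implemented; it assumes the tolerant behaviour amounts to decoding the complete characters and ignoring an incomplete one at the end of the field. `read_fixed_bytes_field` is a hypothetical helper name.

```python
import codecs


def read_fixed_bytes_field(data: bytes, encoding: str = "utf-8") -> str:
    """Decode a fixed-byte-length field, ignoring a partial trailing character."""
    decoder = codecs.getincrementaldecoder(encoding)()
    # final=False makes the decoder hold back an incomplete trailing sequence
    # instead of raising an error for it.
    return decoder.decode(data, final=False)


# A 12-byte field whose last byte is the first half of '£' (C2 without A3).
field = ("#" * 11 + "£").encode("utf-8")[:12]
text = read_fixed_bytes_field(field)
print(repr(text))
```

A strict `field.decode("utf-8")` on the same bytes would raise an error, which is the complaint the fixed-byte-length handling avoids.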
Not sure whether that qualifies as the 'right thing' in your book  |
|
longng |
Posted: Fri Mar 15, 2013 9:55 am Post subject: |
|
|
Apprentice
Joined: 22 Feb 2013 Posts: 42
|
I was about to refer to the thread http://www.mqseries.net/phpBB2/viewtopic.php?t=63656&sid=5671e12568f5fd4ff28d6d61f68bb522 but Kimbert was there first!
Albeit, my reference is from a different perspective. In the case I refer to in the other thread, MRM simply overwrites subsequent field(s) so as to be able to accommodate a preceding field, which contains UTF-8 characters that are longer than the field's length.
In your case described here, at least there's an error and exception generated, in my case there are no errors!
 |
|
kimbert |
Posted: Fri Mar 15, 2013 1:26 pm Post subject: |
|
|
 Jedi Council
Joined: 29 Jul 2003 Posts: 5542 Location: Southampton
|
Quote: |
MRM simply overwrites subsequent field(s) as to be able to accommodate a preceding field, which contains UTF-8 characters that are longer than the field's length.
|
Sorry to hijack this thread, but I cannot let that pass without comment.
In longng's scenario, he was setting the 'length units' property to characters and the length to 40. The MRM parser correctly serialized the 40 characters of UTF-8 data, which sometimes occupied more than 40 bytes. The downstream application, which was designed in the days when 1 character was always the same as 1 byte, then read exactly 40 bytes from the message. Then it tried to read the next field, which started with the overflow from the previous 40 byte field.
In other words, the MRM parser was doing exactly what longng told it to, and the fault was in the system design. There is no conceivable error that the MRM parser could or should have emitted. |
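The scenario can be reproduced outside the broker with a few lines of Python. This is a sketch only: the field contents and the 4-byte second field are invented for illustration; only the 40-character length comes from the description above.

```python
# A 40-CHARACTER field written as UTF-8, then read back by a consumer
# that assumes 40 BYTES per field.
FIELD_LEN = 40

first_field = "£" + "x" * 39          # 40 characters, but 41 bytes in UTF-8
second_field = "NEXT"                 # hypothetical following field

wire = (first_field + second_field).encode("utf-8")

# The downstream application slices by bytes, not characters.
read_first = wire[:FIELD_LEN]
read_second = wire[FIELD_LEN:FIELD_LEN + len(second_field)]

print(len(first_field.encode("utf-8")), read_second)
```

The writer did nothing wrong: every character is present on the wire. The second field looks corrupted only because the reader's byte offset is one short.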
|
goffinf |
Posted: Fri Mar 15, 2013 2:52 pm Post subject: |
|
|
Chevalier
Joined: 05 Nov 2005 Posts: 401
|
kimbert wrote: |
Quote: |
MRM simply overwrites subsequent field(s) as to be able to accommodate a preceding field, which contains UTF-8 characters that are longer than the field's length.
|
Sorry to hijack this thread, but I cannot let that pass without comment.
In longng's scenario, he was setting the 'length units' property to characters and the length to 40. The MRM parser correctly serialized the 40 characters of UTF-8 data, which sometimes occupied more than 40 bytes. The downstream application, which was designed in the days when 1 character was always the same as 1 byte, then read exactly 40 bytes from the message. Then it tried to read the next field, which started with the overflow from the previous 40 byte field.
In other words, the MRM parser was doing exactly what longng told it to, and the fault was in the system design. There is no conceivable error that the MRM parser could or should have emitted. |
That's OK. That was also my reading of that thread and I was also tempted to comment likewise, but thought I'd let sleeping dogs ... |
|