MRM Support for UTF8 and UTF16 |
Author |
Message
|
goffinf |
Posted: Mon Mar 04, 2013 1:11 pm Post subject: MRM Support for UTF8 and UTF16 |
|
|
Chevalier
Joined: 05 Nov 2005 Posts: 401
|
v6.1.0.10
I haven't done much MRM work, so forgive my ignorance on this subject.
I have been playing with the CSV message set sample. It defines a simple customer structure with multiple physical formats. The one that I'm interested in (and slightly confused about) is the Binary format. This is Fixed Length String and the Length Unit is Bytes. So for example the 'firstname' field is fixed as 12 bytes.
So I created a simple flow with an HTTPRequest node partway through. I set the MRM properties for the response parsing and created a simple implementation that returns a response in UTF8, padded as needed. Ran the flow, and everything was sweet.
Now the bit that I'm not understanding. I changed the HTTP endpoint to return the response as UTF16. Now I know that the UTF8 and UTF16 encodings are different, so I have been very careful to ensure that the number of bytes returned fits neatly into the fixed length fields declared in the message definition.
Well it throws a parsing error, in fact a number of different ones depending how I adjust the response.
So, a couple of things ... first, in my case a BOM is part of the response (FEFF). Not sure what the message set does with that or whether I'm supposed to take care of it in the Byte Alignment or Leading Skip Count, or somewhere else ??
Next, I'm not sure how Broker knows how to differentiate between UTF8 and UTF16 (other than the presence/absence of the BOM). I thought it might pay attention to the response HTTP Content-Type header, so I made sure this included a charset=UTF-16.
What I ideally want is the ability for this flow to support EITHER UTF8 or UTF16 (or any other character encoding). Now that may be too ambitious (but I do have a reason ... it's to do with the unit test framework we use and HTTP mock support, but I won't bore you with more detail).
I'm certain there's something simple here which I'm not understanding .... I notice that this message set has a Byte Alignment of 1 byte, and that made me wonder, since clearly UTF16 is two. Is that what's tripping me up here ?? But if I changed it to 2 bytes, what happens when I want it to parse UTF8, which has a mix of single and multi-byte character encoding ?
I guess (?) it is more normal for message definitions to be defined for one encoding, but I still don't get why the UTF16 response failed to parse when I had made sure that the right number of bytes were returned with the right encoding for the characters.
Roll on v8 when I can use DFDL (maybe next year ... and 'yes' we do have extended support for 6.1 after September).
Kind regards
Fraser. |
|
kimbert |
Posted: Tue Mar 05, 2013 2:39 am Post subject: |
|
|
 Jedi Council
Joined: 29 Jul 2003 Posts: 5542 Location: Southampton
|
Nice problem description!
Quote: |
So, a couple of things ... first, in my case a BOM is part of the response (FEFF). Not sure what the message set does with that or whether I'm supposed to take care of it in the Byte Alignment or Leading Skip Count, or somewhere else ?? |
MRM does nothing at all with a BOM. If it's there in the data (and the CCSID has been set correctly) then it will be interpreted as a Zero Width No-Break Space character, U+FEFF (because that's what the byte sequence for a BOM really is). If you know for sure that the BOM will *always* be present, you could remove it using Leading Skip, but bear in mind that a UTF-8 BOM is three bytes whereas a UTF-16 BOM is two bytes.
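To illustrate the byte counts above, here is a minimal sketch in Python (chosen only for demonstration; this is not broker code). It uses the standard `codecs` constants, and `skip_count` is a hypothetical helper, not anything MRM provides. It also shows that a decoder told only "this is UTF-16BE" keeps the BOM as the character U+FEFF instead of dropping it:

```python
import codecs

# BOM byte sequences from the Python standard library.
utf8_bom = codecs.BOM_UTF8          # b'\xef\xbb\xbf' (three bytes)
utf16_be_bom = codecs.BOM_UTF16_BE  # b'\xfe\xff'     (two bytes)
utf16_le_bom = codecs.BOM_UTF16_LE  # b'\xff\xfe'     (two bytes)

# A decoder with a fixed byte order treats the BOM bytes as ordinary data,
# so the first decoded character is U+FEFF.
decoded = (utf16_be_bom + "Fraser".encode("utf-16-be")).decode("utf-16-be")
has_leading_feff = decoded[0] == "\ufeff"


def skip_count(data: bytes) -> int:
    """Hypothetical helper: how many leading bytes a skip would have to drop."""
    for bom in (utf8_bom, utf16_be_bom, utf16_le_bom):
        if data.startswith(bom):
            return len(bom)
    return 0
```

The differing lengths are the reason a single fixed Leading Skip value cannot serve both encodings.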
Quote: |
Next, I'm not sure how Broker knows how to differentiate between UTF8 and UTF16 (other than the presence/absence of the BOM). I thought it might pay attention to the response HTTP Content-Type header, so I made sure this included a charset=UTF-16.
|
All broker parsers work in the same way. They take the text encoding from InputRoot.Properties.CodedCharSetId and the numeric encoding from InputRoot.Properties.Encoding. Those fields in the Properties folder are set from the MQ transport if you use an MQInput node. The HTTPInput node also passes these values to the body parser (but not via the Properties folder). Additionally, the contents of the HTTP headers are stored in the local environment, as described here : http://publib.boulder.ibm.com/infocenter/wmbhelp/v8r0m0/topic/com.ibm.etools.mft.doc/ac00477_.htm
Quote: |
What I ideally want is the ability for this flow to support EITHER UTF8 or UTF16 (or any other character encoding). Now that may be too ambitious (but I do have a reason ... it's to do with the unit test framework we use and HTTP mock support, but I won't bore you with more detail). |
It may be ambitious, but it's not an unreasonable thing to want.
We need to find out what CCSID the MRM parser is using - and the best way to find out is to take a debug-level user trace.
Quote: |
I guess (?) it is more normal for message definitions to be defined for one encoding, but I still don't get why the UTF16 response failed to parse when I had made sure that the right number of bytes were returned with the right encoding for the characters. |
Actually, being flexible about encodings is supported by WMB, and is a scenario that we always test out.
Quote: |
I notice that this message set has a Byte Alignment of 1 byte, and that made me wonder, since clearly UTF16 is two. Is that what's tripping me up here ?? But if I changed it to 2 bytes, what happens when I want it to parse UTF8, which has a mix of single and multi-byte character encoding ? |
Best not to play with the byte alignment stuff - that's not the problem. |
|
goffinf |
Posted: Wed Mar 06, 2013 11:30 am Post subject: |
|
|
Chevalier
Joined: 05 Nov 2005 Posts: 401
|
Thanks Tim.
My goodness, I haven't spent this much time reading specs and articles for ages. Worth it though; I feel much more confident about this subject now.
For the benefit of others, my problem was resolved (as is frequently suggested for these types of questions) by thinking about the character encoding of the data that you are asking your flow to deal with and making sure that the CodedCharSetId and Encoding values match that data.
Unsurprisingly, I hadn't done so (at least not carefully enough). In my case I had forgotten about endian-ness. I had supplied 1200 for the CCSID thinking that was correct for UTF-16, whereas, since I'm running on Intel, I needed 1202 because that's the right value for UTF-16 LE. Now everything parses nicely.
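The effect of a byte-order mismatch on a fixed-length field can be sketched as follows. This is a Python illustration only: the codec names `utf-16-le` and `utf-16-be` stand in for the little-endian and big-endian CCSIDs by analogy, `pad_field` is a hypothetical helper, and the 12-byte width is taken from the 'firstname' field described earlier in the thread.

```python
FIELD_BYTES = 12  # fixed byte length of 'firstname' in the sample's Binary format


def pad_field(text: str, codec: str, width: int = FIELD_BYTES) -> bytes:
    """Encode text and pad it with encoded spaces to an exact byte width."""
    raw = text.encode(codec)
    space = " ".encode(codec)  # 1 byte in UTF-8, 2 bytes in UTF-16
    if len(raw) > width or (width - len(raw)) % len(space):
        raise ValueError("text does not fit the field exactly")
    return raw + space * ((width - len(raw)) // len(space))


le_field = pad_field("Tim", "utf-16-le")  # what a little-endian endpoint might send
right = le_field.decode("utf-16-le")      # declared byte order matches the data
wrong = le_field.decode("utf-16-be")      # declared byte order does not match
```

Both decodes consume exactly 12 bytes, but only `right` contains the expected characters. Getting the byte count right is therefore not enough on its own.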
I left the Encoding value as 546. I couldn't find anything in the docs that told me what values should be used for numerical data encoding. I have read plenty of posts here where other values are used but ... Well, can anyone tell me how to find which values mean what (like the 'Supported Code Pages' section in the InfoCentre, but for Encoding)?
Cheers
Fraser. |
|
fjb_saper |
Posted: Thu Mar 07, 2013 5:52 am Post subject: |
|
|
 Grand High Poobah
Joined: 18 Nov 2003 Posts: 20756 Location: LI,NY
|
Relatively easy. There are some constant values (look them up).
What you need to set there is the numeric format of the data: big- or little-endian integers (the byte representation differs between the two), the floating point format, etc...  _________________ MQ & Broker admin |
|
goffinf |
Posted: Thu Mar 07, 2013 8:19 am Post subject: |
|
|
Chevalier
Joined: 05 Nov 2005 Posts: 401
|
Hmmm ... I assume you mean these which I found in the InfoCentre (thanks for the clue) :-
$mq:MQENC_WINDOWS (546)
$mq:MQENC_UNIX (273)
$mq:MQENC_390 (785)
and these :-
MQENC_NATIVE (0x00000222L)
MQENC_INTEGER_NORMAL (0x00000001L)
MQENC_INTEGER_REVERSED (0x00000002L)
MQENC_DECIMAL_NORMAL (0x00000010L)
MQENC_DECIMAL_REVERSED (0x00000020L)
MQENC_FLOAT_IEEE_NORMAL (0x00000100L)
MQENC_FLOAT_IEEE_REVERSED (0x00000200L)
MQENC_FLOAT_S390 (0x00000300L)
I'm not at all sure how/when to use any from the second set, care to elaborate ? (note: in many cases I won't be using MQ when communicating with different platforms - mostly HTTP).
Fraser. |
|
fjb_saper |
Posted: Thu Mar 07, 2013 4:18 pm Post subject: |
|
|
 Grand High Poobah
Joined: 18 Nov 2003 Posts: 20756 Location: LI,NY
|
Well, imagine you're on a Unix platform but the message was created on a Linux platform. It is a mix of text and numeric data, and the numeric data doesn't have the same endian-ness as on your platform (for example, big-endian Unix hardware versus Linux on Intel). This would be when you set the encoding to match the data in the message.
Have fun  _________________ MQ & Broker admin |
|
rekarm01 |
Posted: Sat Mar 09, 2013 8:37 pm Post subject: |
|
|
Grand Master
Joined: 25 Jun 2008 Posts: 1415
|
goffinf wrote: |
I'm not at all sure how/when to use any from the second set, care to elaborate? |
They might be useful in rare cases, for constructing or modifying a custom Encoding value using bit-wise operations, rather than using the standard native values for various platforms. They're also useful for just figuring out what the Encoding value means; for example:
Code: |
MQENC_UNIX = 273 = 0x111 = MQENC_FLOAT_IEEE_NORMAL + MQENC_DECIMAL_NORMAL + MQENC_INTEGER_NORMAL
MQENC_WINDOWS = 546 = 0x222 = MQENC_FLOAT_IEEE_REVERSED + MQENC_DECIMAL_REVERSED + MQENC_INTEGER_REVERSED
MQENC_390 = 785 = 0x311 = MQENC_FLOAT_S390 + MQENC_DECIMAL_NORMAL + MQENC_INTEGER_NORMAL |
For UTF-16, the MQENC_INTEGER_* value can describe the byte-endianness. MQENC_INTEGER_NORMAL is big-endian, and MQENC_INTEGER_REVERSED is little-endian. |
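The breakdown in the code block above can be reproduced with bit masks. This is a small Python sketch under the assumption that the three sub-fields occupy one hex digit each, as the values in the thread suggest. The mask names mirror the MQ `MQENC_*_MASK` constants as I understand them, and `describe_encoding` is a hypothetical helper.

```python
# One hex digit per sub-field, matching the layout shown in the thread.
MQENC_INTEGER_MASK = 0x00F
MQENC_DECIMAL_MASK = 0x0F0
MQENC_FLOAT_MASK = 0xF00

INTEGER = {0x001: "MQENC_INTEGER_NORMAL", 0x002: "MQENC_INTEGER_REVERSED"}
DECIMAL = {0x010: "MQENC_DECIMAL_NORMAL", 0x020: "MQENC_DECIMAL_REVERSED"}
FLOAT = {
    0x100: "MQENC_FLOAT_IEEE_NORMAL",
    0x200: "MQENC_FLOAT_IEEE_REVERSED",
    0x300: "MQENC_FLOAT_S390",
}


def describe_encoding(value: int) -> tuple:
    """Split an Encoding value into its (float, decimal, integer) parts."""
    return (
        FLOAT.get(value & MQENC_FLOAT_MASK, "undefined"),
        DECIMAL.get(value & MQENC_DECIMAL_MASK, "undefined"),
        INTEGER.get(value & MQENC_INTEGER_MASK, "undefined"),
    )
```

For example, `describe_encoding(546)` returns the three `*_REVERSED` names, which matches the MQENC_WINDOWS line above.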
|
goffinf |
Posted: Mon Mar 11, 2013 12:15 pm Post subject: |
|
|
Chevalier
Joined: 05 Nov 2005 Posts: 401
|
Thx rekarm01, that certainly makes sense.
There are a few more 'native' values which those following this thread may want to view :-
http://publib.boulder.ibm.com/infocenter/wmqv7/v7r0/index.jsp?topic=%2Fcom.ibm.mq.csqzaq.doc%2Ffc_MQENC_.htm
The other thing that I learnt is that it's always a good idea, and sometimes essential, to provide explicit metadata for character encoding. So that means it's a good idea to *always* include an XML declaration with the encoding attribute value that matches your data; likewise, if you are using HTTP and call out to some endpoint, *always* include the charset value with the Content-Type header on the request and response. If you are using Java (same for XML and HTTP) and UTF you can use values such as UTF-16LE, UTF-16BE and plain UTF-16 (BOM will determine endian-ness).
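The "BOM will determine endian-ness" behaviour of the plain UTF-16 label can be sketched like this. Python is used only as an illustration; its codec names are not the Java charset names mentioned above, though the plain `utf-16` decoder behaves analogously when a BOM is present.

```python
import codecs

text = "Fraser"

# The same text in both byte orders, each prefixed with the matching BOM.
le_with_bom = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
be_with_bom = codecs.BOM_UTF16_BE + text.encode("utf-16-be")

# The plain "utf-16" decoder reads the BOM, picks the byte order, and drops it.
decoded_le = le_with_bom.decode("utf-16")
decoded_be = be_with_bom.decode("utf-16")

# An explicit byte-order label needs no BOM, but the label must match the data.
decoded_explicit = text.encode("utf-16-le").decode("utf-16-le")
```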
HTH
Fraser |
|
rekarm01 |
Posted: Wed Mar 20, 2013 12:31 am Post subject: |
|
|
Grand Master
Joined: 25 Jun 2008 Posts: 1415
|
goffinf wrote: |
The other thing that I learnt is that it's always a good idea, and sometimes essential, to provide explicit metadata for character encoding. So that means it's a good idea to *always* include an XML declaration with the encoding attribute value that matches your data; likewise, if you are using HTTP and call out to some endpoint, *always* include the charset value with the Content-Type header on the request and response. If you are using Java (same for XML and HTTP) and UTF you can use values such as UTF-16LE, UTF-16BE and plain UTF-16 (BOM will determine endian-ness). |
It's always a good idea to explicitly specify the charset/encoding, particularly if it's different from the default value. The default charset for HTTP 1.1 is "ISO-8859-1", and the default encoding for XML (in the absence of information provided by an external transport protocol or BOM) is "UTF-8". |
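A receiver that honours the declared charset and falls back to the HTTP default might look like the sketch below. It is written in Python with the standard `email.message.Message` header parser. `charset_of` is a hypothetical helper, and the fallback value is simply the HTTP 1.1 default quoted in the post above.

```python
from email.message import Message

HTTP_DEFAULT_CHARSET = "iso-8859-1"  # the HTTP 1.1 default quoted in the post


def charset_of(content_type: str) -> str:
    """Return the charset parameter of a Content-Type value, or the default."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lower-cased charset, or None if absent.
    return msg.get_content_charset() or HTTP_DEFAULT_CHARSET


body = "Fraser".encode("utf-16-le")
text = body.decode(charset_of("text/plain; charset=UTF-16LE"))
```

If the header carries no charset parameter, this sketch decodes the body as ISO-8859-1. That fallback is exactly the kind of silent mismatch that explicit metadata avoids.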
|