The MQMD says its CCSID 1208 - is it? |
PeterPotkay |
Posted: Tue Sep 29, 2015 2:01 pm Post subject: The MQMD says its CCSID 1208 - is it? |
|
|
 Poobah
Joined: 15 May 2001 Posts: 7722
|
This is the first line of the MQ message data in rfhutilc, where I selected "Both" to show hex and Character.
00000000 HEAD ERRECPL0 20202020 48454144 45525245 43504C30
The 00000000 in italics is the rfhutil line number.
Then follows the characters HEAD ERRECPL0
Then follows the hex
How do I know if I am looking at "real" UTF-8, or just some other ASCII Code Page masquerading as 1208?
It's my understanding that for 1208 (UTF-8), the standard 256 ASCII characters are the same single bytes in, say, 819, 437, and 1208. If true, then I guess it doesn't matter if the app produced the data as 437 but called it 1208.
But is there any way to know it's really 1208 without having a character that would take 2 bytes?
I suspect the app may be calling it 1208 without really producing UTF-8, and getting by through sheer luck because so far they've just been sending A-Z, a-z, 0-9, etc. _________________ Peter Potkay
Keep Calm and MQ On |
fjb_saper |
Posted: Tue Sep 29, 2015 4:15 pm Post subject: |
|
|
 Grand High Poobah
Joined: 18 Nov 2003 Posts: 20756 Location: LI,NY
|
If there is no difference in the mapping for the characters sent, between the effective CCSID and CCSID 1208, who's to say which one is false?
You will only find out once there is a character requiring a true 1208 CCSID like copyright or EURO symbols  _________________ MQ & Broker admin |
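A minimal sketch of the point made in this reply, assuming Python's codec names `cp437`, `latin-1`, and `utf-8` as stand-ins for CCSIDs 437, 819, and 1208 (the helper name `encodings_agree` is made up for illustration). It shows that a payload of plain ASCII characters produces identical bytes in all three, while a single non-ASCII character such as the pound sign does not:

```python
# Python codec names standing in for the CCSIDs discussed in the thread
# (an assumption for illustration): cp437 ~ 437, latin-1 ~ 819, utf-8 ~ 1208.
CODECS = ("cp437", "latin-1", "utf-8")


def encodings_agree(text: str) -> bool:
    """Return True if the text encodes to the same bytes in every codec."""
    encoded = set()
    for codec in CODECS:
        try:
            encoded.add(text.encode(codec))
        except UnicodeEncodeError:
            return False  # not representable in at least one code page
    return len(encoded) == 1


# Plain ASCII: the bytes on the wire cannot reveal which CCSID was intended.
print(encodings_agree("    HEADERRECPL0"))  # True

# Pound sign: x'9C' in 437, x'A3' in 819, x'C2A3' in 1208.
print(encodings_agree("\u00a3"))  # False
```

As long as the application sends only ASCII, the 1208 label and the real code page are indistinguishable on the wire.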
PeterPotkay |
Posted: Wed Sep 30, 2015 9:21 am Post subject: |
|
|
 Poobah
Joined: 15 May 2001 Posts: 7722
|
Thanks F.J., that's what I'm thinking too. _________________ Peter Potkay
Keep Calm and MQ On |
tczielke |
Posted: Fri Oct 02, 2015 5:10 am Post subject: |
|
|
Guardian
Joined: 08 Jul 2010 Posts: 941 Location: Illinois, USA
|
I believe you can check if a file contains valid UTF-8 data on Unix systems by using the iconv command.
For example, if I have a simple ASCII text file called file.txt that contains the data:
The quick brown fox jumped over the slow lazy dog!
and then run iconv on that file as follows, I get a successful run of the program:
> iconv -f utf-8 file.txt
The quick brown fox jumped over the slow lazy dog!
Now if I binary edit the file and put an illegal utf-8 byte like x'C0' in it after the "fox" text and run it again, I get the following:
> iconv -f utf-8 file.txt
The quick brown foxiconv: illegal input sequence at position 19
Using a tool like iconv would give you some assurance that the file is at least valid UTF-8.
But whether the bytes in the file were intended to be in utf-8 can be much more difficult to figure out. _________________ Working with MQ since 2010. |
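The same kind of well-formedness check can be sketched without iconv. The example below is only an illustration, not what iconv does internally; it relies on Python's built-in strict `utf-8` decoder, and the function name `first_invalid_utf8_offset` is hypothetical. It reproduces the test from this post by inserting x'C0' after "fox":

```python
def first_invalid_utf8_offset(data: bytes):
    """Return the byte offset of the first ill-formed sequence, or None."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        return exc.start  # offset of the first byte the decoder rejected
    return None


good = b"The quick brown fox jumped over the slow lazy dog!"
bad = good[:19] + b"\xc0" + good[19:]  # insert x'C0' right after "fox"

print(first_invalid_utf8_offset(good))  # None
print(first_invalid_utf8_offset(bad))   # 19
```

Like the iconv run, this only tells you the bytes are well-formed UTF-8, not that the sender intended them as UTF-8.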
tczielke |
Posted: Fri Oct 02, 2015 1:21 pm Post subject: |
|
|
Guardian
Joined: 08 Jul 2010 Posts: 941 Location: Illinois, USA
|
I find this code page stuff fascinating (my wife is a lucky woman), so I thought I would post something else interesting that I found with MQ and UTF-8.
From what I have read, the hex bytes 'C0' and 'C1' are invalid in UTF-8, and should not be used. In the previous example, iconv does give an error if you have x'C0' in the byte stream and you say the byte stream is UTF-8.
However, when I use a program I wrote (mqcpcnvt) that can let you play with byte conversions between code pages in MQ, MQ seems to allow the x'C0' and x'C1' bytes. And it also does some odd things with it in the conversion.
For example, here is MQ converting an MQSTR formatted message with x'C061'.
> mqcpcnvt C061 1208 819 TCZ.TEST1
mqcpcnvt start
target queue is TCZ.TEST1
The following byte string was PUT with CCSID 1208
<C061>
The following byte string was returned with converted GET with CCSID 819:
<21>
This is the returned byte string printed as chars (a non isprint char will be shown as '.').
<!>
Here MQ is converting 'C061' in 1208 to x'21' in 819, which is an exclamation point. That doesn't look right to me. An exclamation point maps to an exclamation point and vice versa between 1208 and 819.
> mqcpcnvt 21 819 1208 TCZ.TEST1
mqcpcnvt start
target queue is TCZ.TEST1
The following byte string was PUT with CCSID 819
<21>
The following byte string was returned with converted GET with CCSID 1208:
<21>
This is the returned byte string printed as chars (a non isprint char will be shown as '.').
<!>
> mqcpcnvt 21 1208 819 TCZ.TEST1
mqcpcnvt start
target queue is TCZ.TEST1
The following byte string was PUT with CCSID 1208
<21>
The following byte string was returned with converted GET with CCSID 819:
<21>
This is the returned byte string printed as chars (a non isprint char will be shown as '.').
<!>
Very odd behavior. I was expecting MQ to error if you said the message had x'C0' or x'C1' in it and was 1208, and you asked to convert it to another code page. _________________ Working with MQ since 2010. |
fjb_saper |
Posted: Sat Oct 03, 2015 4:44 am Post subject: |
|
|
 Grand High Poobah
Joined: 18 Nov 2003 Posts: 20756 Location: LI,NY
|
Subtle point, but you are not passing C0 or C1 as hex chars in UTF-8.
You are passing a code point... and Unicode C061 is valid (hex ec81a1)
Did you check that the hex value on the queue was actually C061 ?
 _________________ MQ & Broker admin |
tczielke |
Posted: Sat Oct 03, 2015 5:52 am Post subject: |
|
|
Guardian
Joined: 08 Jul 2010 Posts: 941 Location: Illinois, USA
|
It is very possible I am misunderstanding something, but when I read some of the UTF-8 documentation on the internet, it says that the hex byte x'C0' (or x'C1') is an invalid byte that should not appear in a UTF-8 byte stream. Using the iconv command on Unix seems to confirm that.
I confirmed through tracing that for this run of the program:
> mqcpcnvt C061 1208 819 TCZ.TEST1
The following was in the trace:
Message is PUT with Format=MQSTR, CCSID=1208, Encoding=273 (Normal) and has a length of 2 with x'C061' as the passed in message data.
Message is GET with Format=MQSTR, CCSID=819, Encoding=273 (Normal) and has a length of 1 with x'21' as the returned message data.
It really does not make sense to me (and seems wrong) that x'C061' in 1208 would convert to a common ASCII character like the exclamation point (! = x'21') in 819.
Another thing I noticed is that when I run the program like this (i.e. putting a message of length 1 that contains x'c0' in 1208, and asking for the message to be converted to 819), I get the following result:
> mqcpcnvt C0 1208 819 TCZ.TEST1
mqcpcnvt start
target queue is TCZ.TEST1
MQGET ended with reason code 2150
Ending Program!
mqcpcnvt end
2150 = MQRC_DBCS_ERROR
This implies to me that MQ is interpreting x'C0' in 1208 as the first byte in a 1208 double byte character encoding (code point). However, the UTF-8 literature I have read says that x'C2' - x'DF' are the valid beginning bytes for a 2 byte character encoding in UTF-8.
I probably need to open a PMR to better understand this. I was just curious if anyone on this forum had some insight into what I am seeing. Again, maybe I am misunderstanding something. _________________ Working with MQ since 2010. |
fjb_saper |
Posted: Sat Oct 03, 2015 1:37 pm Post subject: |
|
|
 Grand High Poobah
Joined: 18 Nov 2003 Posts: 20756 Location: LI,NY
|
Looks to me that it is interpreting it by default to something else... You also might want to look into what the substitution character is i.e. what is being returned when there is no mapping between the CCSIDs or when the specific character / code point has no mapping in the target CCSID...  _________________ MQ & Broker admin |
tczielke |
Posted: Sun Oct 04, 2015 3:35 pm Post subject: |
|
|
Guardian
Joined: 08 Jul 2010 Posts: 941 Location: Illinois, USA
|
Since Unicode has more documentation on the internet, I changed to doing the conversions between 1208 (UTF-8) and 1200 (UCS-2). Still seeing odd results:
Here is x'c041' or the "invalid" x'c0' byte followed by the upper case A from 1208 to 1200.
Code: |
>mqcpcnvtc c041 1208 1200 TCZ.TEST1 MQT1
mqcpcnvt start
target queue is TCZ.TEST1
The following byte string was PUT with CCSID 1208
<C041>
The following byte string was returned with converted GET with CCSID 1200:
<0001>
This is the returned byte string printed as chars (a non isprint char will be shown as '.').
<..>
mqcpcnvt end
|
This gets mapped to a control code of x'0001' in 1200.
Here is x'C0' followed by x'7e' (tilde) being converted from 1208 to 1200.
Code: |
> mqcpcnvtc c07e 1208 1200 TCZ.TEST1 MQT1
mqcpcnvt start
target queue is TCZ.TEST1
The following byte string was PUT with CCSID 1208
<C07E>
The following byte string was returned with converted GET with CCSID 1200:
<003E>
This is the returned byte string printed as chars (a non isprint char will be shown as '.').
<.>>
|
Here x'c07e' in 1208 is getting mapped to a greater than sign in 1200. Of course the reverse mapping x'003E' from 1200 to 1208 just goes to x'3E' (i.e. greater than sign to greater than sign).
Anyway, I will open a PMR to try and understand this better. I found 3 different sites that exhaustively list the UTF-8 code points, and all 3 seem to leave out x'c0' and x'c1'.
Another interesting thing I found in my investigation is that z/OS seems to support surrogate pair conversion between 1208 and 1200 in both directions, at least for the several test cases that I tried.
The surrogate pair conversion between 1208 <-> 1200 failed on distributed platforms, which matches what the MQ manual says: surrogate pairs in UTF-16 are not supported -> http://www-01.ibm.com/support/knowledgecenter/SSFKSJ_8.0.0/com.ibm.mq.ref.dev.doc/q104590_.htm?lang=en
Basically, the MQ support for Unicode is limited to the BMP (Basic Multilingual Plane) in UCS-2. _________________ Working with MQ since 2010. |
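To make the BMP limitation concrete, here is a small sketch (plain Python, no MQ involved; the helper name `describe` is made up) comparing a BMP character with a supplementary-plane character. The latter needs four bytes in UTF-8 and a surrogate pair, i.e. two 16-bit units, in UTF-16, which is the case a UCS-2-only converter cannot represent:

```python
def describe(ch: str) -> dict:
    """Show how one character is laid out in UTF-8 and big-endian UTF-16."""
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")
    return {
        "code_point": f"U+{ord(ch):04X}",
        "utf8_hex": utf8.hex().upper(),
        "utf16_hex": utf16.hex().upper(),
        "in_bmp": ord(ch) <= 0xFFFF,
        "utf16_units": len(utf16) // 2,  # 2 units means a surrogate pair
    }


print(describe("\u00a3"))      # pound sign: inside the BMP, one UTF-16 unit
print(describe("\U0001F600"))  # U+1F600: outside the BMP, surrogate pair
```

For U+1F600 the UTF-8 form is x'F09F9880' and the UTF-16 form is the surrogate pair x'D83DDE00'.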
rekarm01 |
Posted: Tue Oct 06, 2015 4:28 am Post subject: Re: The MQMD says its CCSID 1208 - is it? |
|
|
Grand Master
Joined: 25 Jun 2008 Posts: 1415
|
PeterPotkay wrote: |
It's my understanding that for 1208 (UTF-8), the standard 256 ASCII characters are the same single bytes in, say, 819, 437, and 1208. |
ASCII is 7-bit; it only defines 128 characters, not 256. For ccsid=819 and ccsid=1208, the first 128 ASCII characters are encoded the same way, as single bytes. For ccsid=437, the control character and delete character code points (X'00'-X'1F', X'7F') might not be the same.
tczielke wrote: |
I believe you can check if a file contains valid UTF-8 data on Unix systems by using the iconv command. |
iconv may not be the best command for this. It converts character data, but does not necessarily validate it. It can return errors when the input contains valid (well-formed) UTF-8 data, and it can fail to report invalid (ill-formed) UTF-8 data.
tczielke wrote: |
Here MQ is converting 'C061' in 1208 to x'21' in 819, which is an exclamation point. That doesn't look right to me. |
It's not right. X'C061' is not well-formed UTF-8. It looks like MQ is trying to follow the UTF-8 rule for converting the two byte sequence to a code point, without actually validating that the byte sequence is well-formed:
X'C061' = B'1100 0000 0110 0001' --> B'000 0010 0001' = U+21 = '!'
X'C041' = B'1100 0000 0100 0001' --> B'000 0000 0001' = U+01 = <control>
X'C07E' = B'1100 0000 0111 1110' --> B'000 0011 1110' = U+3E = '>'
Garbage in, garbage out. |
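The bit arithmetic above can be sketched as follows. This is only an illustration of the two-byte UTF-8 layout, not MQ's actual conversion code; both function names are hypothetical. The naive version applies the layout blindly, while the strict version first checks the ranges that well-formed two-byte UTF-8 requires (lead byte x'C2'-x'DF', continuation byte x'80'-x'BF'):

```python
def naive_two_byte_decode(data: bytes) -> int:
    """Apply the 2-byte UTF-8 bit layout with no validation at all."""
    lead, trail = data[0], data[1]
    # Keep the low 5 bits of the lead byte and the low 6 bits of the trail byte.
    return ((lead & 0x1F) << 6) | (trail & 0x3F)


def strict_two_byte_decode(data: bytes) -> int:
    """Same layout, but reject ill-formed input first."""
    lead, trail = data[0], data[1]
    if not 0xC2 <= lead <= 0xDF:
        raise ValueError(f"invalid lead byte x'{lead:02X}'")
    if not 0x80 <= trail <= 0xBF:
        raise ValueError(f"invalid continuation byte x'{trail:02X}'")
    return ((lead & 0x1F) << 6) | (trail & 0x3F)


for hex_text in ("C061", "C041", "C07E", "C2A3"):
    raw = bytes.fromhex(hex_text)
    print(hex_text, "-> naive", f"U+{naive_two_byte_decode(raw):04X}")
```

The naive decoder yields U+0021, U+0001, and U+003E for the three ill-formed inputs, matching the results reported in the thread, and U+00A3 for the well-formed x'C2A3'. The strict decoder raises an error for the first three.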
tczielke |
Posted: Tue Oct 06, 2015 5:16 am Post subject: |
|
|
Guardian
Joined: 08 Jul 2010 Posts: 941 Location: Illinois, USA
|
Hi rekarm01,
Thank you for the post. That was very helpful!
I followed the algorithm that you gave for converting 1208 -> 819 for a valid two byte 1208 code point like the pound sign (x'c2a3'), and it does work.
x'c2a3' (Pound Sign in 1208) = B'11000010 10100011' --> B'000 1010 0011' = x'a3' (Pound Sign in 819)
Do you happen to know where these types of code page deciphering algorithms are documented?
Also, are there any tools (preferably free) that you would recommend instead of iconv for validating utf-8 data?
I am still under the thinking that MQ should have given a conversion error and not falsely converted a malformed utf-8 string like x'c061', so I am still going down the PMR route to see what IBM says about this behavior. _________________ Working with MQ since 2010. |
rekarm01 |
Posted: Mon Oct 12, 2015 2:41 am Post subject: |
|
|
Grand Master
Joined: 25 Jun 2008 Posts: 1415
|
tczielke wrote: |
Do you happen to know where these types of code page deciphering algorithms are documented? |
The Unicode standard defines the UTF-8 encoding, and it is also documented as an Internet RFC (RFC 3629), but the Wikipedia entry is probably the easiest to read.
tczielke wrote: |
Also, are there any tools (preferably free) that you would recommend instead of iconv for validating utf-8 data? |
I tested iconv on different servers: some correctly identified ill-formed UTF-8, and some didn't. iconv itself is just a front end for calling underlying library converter routines, which are to some extent configurable, so at least some versions of iconv are strict enough for validating utf-8 ... just check first before assuming it will work.
There is also the ICU converter (WMB/IIB uses ICU libraries for conversion to/from Unicode), but I don't know how easy it is to download/install it separately. There are also plenty of source code snippets posted around the internet; it might be easier to use that as a starting point for writing your own validation tool.
tczielke wrote: |
I am still under the thinking that MQ should have given a conversion error and not falsely converted a malformed utf-8 string like x'c061', so I am still going down the PMR route to see what IBM says about this behavior. |
At least for some platforms, MQ relies on the underlying iconv library routines for conversion, which are separately installed. A PMR is probably the best way to find out more about that ... |
tczielke |
Posted: Tue Oct 13, 2015 12:05 pm Post subject: |
|
|
Guardian
Joined: 08 Jul 2010 Posts: 941 Location: Illinois, USA
|
Thank you for the testing results and information. That is helpful.
The PMR is still open, but one thing that did come out of it is that MQ does use different conversion software/rules per platform. For example, the x'C0' invalid byte test for 1208 messages does throw a warning and a nonzero reason code on platforms like AIX. This, of course, matches what you said. You seem to have internal knowledge of MQ.
Going back to one of Peter's original comments:
"But is there any way to know it's really 1208 without having a character that would take 2 bytes?"
I would say yes, but only if the message contains bytes outside the ASCII range (the ASCII range being x'00' - x'7F').
One thing I noticed after doing more reading on UTF-8 (1208) is that it does have a unique layout:
1. Single byte data is only x'00' - x'7F' or ASCII. Also, these ASCII bytes never appear in the multi-byte encodings.
2. The leading byte of a multi-byte encoding tells how many bytes will follow (i.e. leading byte B'110xxxxx' means two byte encoding, leading byte B'1110xxxx' means three byte encoding, etc.) and continuation bytes have to start with B'10xxxxxx'.
This unique layout also allows you to distinguish it from most other code pages, such as single byte code pages like 437 and 819, and UTF-16 (1200), when the message uses bytes outside the ASCII range.
I wrote some C code that can analyze a message and see how it conforms to the UTF-8 rules, tracking things like how many ASCII bytes did it have, how many multi-bytes did it have, invalid codings, etc.
With a tool like this, you should be able to see if the developer was "lying" and misrepresented the CCSID as UTF-8, if the message includes some bytes outside the ASCII range.
Anyway, I currently have it working as an enhancement to amqsbcg, and I think I will put this tool into a future release of the MH06 SupportPac. We have run into issues with applications not handling UTF-8 messages properly, so a tool like this would be helpful for problem determination. _________________ Working with MQ since 2010. |
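The layout rules listed in this post can be turned into a rough analyzer along the lines described. The sketch below is not the C tool mentioned above and has nothing to do with MH06 or amqsbcg; it is a hypothetical Python illustration (the name `analyze_utf8` is made up) that counts ASCII bytes, well-formed multi-byte sequences, and invalid bytes. The second-byte ranges for x'E0', x'ED', x'F0', and x'F4' follow the well-formed byte sequence table in the Unicode standard and exclude overlong forms and surrogates:

```python
def analyze_utf8(data: bytes) -> dict:
    """Count ASCII bytes, well-formed multi-byte sequences, and invalid bytes."""
    stats = {"ascii": 0, "multibyte": 0, "invalid": 0}
    i = 0
    while i < len(data):
        b = data[i]
        if b <= 0x7F:  # single-byte (ASCII) range
            stats["ascii"] += 1
            i += 1
            continue
        # Pick the sequence length and the allowed range of the second byte.
        if 0xC2 <= b <= 0xDF:
            length, lo, hi = 2, 0x80, 0xBF
        elif b == 0xE0:
            length, lo, hi = 3, 0xA0, 0xBF  # excludes overlong forms
        elif b == 0xED:
            length, lo, hi = 3, 0x80, 0x9F  # excludes surrogates
        elif 0xE1 <= b <= 0xEF:
            length, lo, hi = 3, 0x80, 0xBF
        elif b == 0xF0:
            length, lo, hi = 4, 0x90, 0xBF  # excludes overlong forms
        elif 0xF1 <= b <= 0xF3:
            length, lo, hi = 4, 0x80, 0xBF
        elif b == 0xF4:
            length, lo, hi = 4, 0x80, 0x8F  # caps at U+10FFFF
        else:
            # x'80'-x'C1' and x'F5'-x'FF' can never start a sequence.
            stats["invalid"] += 1
            i += 1
            continue
        seq = data[i + 1:i + length]
        ok = (len(seq) == length - 1
              and lo <= seq[0] <= hi
              and all(0x80 <= c <= 0xBF for c in seq[1:]))
        if ok:
            stats["multibyte"] += 1
            i += length
        else:
            stats["invalid"] += 1
            i += 1
    return stats


print(analyze_utf8(b"HEADERRECPL0"))             # ASCII only: inconclusive
print(analyze_utf8("\u00a31".encode("utf-8")))   # real UTF-8 pound sign
print(analyze_utf8("\u00a31".encode("latin-1"))) # 819 data labelled as 1208
```

A message with a nonzero `invalid` count is not UTF-8 whatever the MQMD claims. A message with only `ascii` bytes remains ambiguous, which is the situation in the original question.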