single/double-bytes questions and conversion problem |
Author |
Message
|
lium |
Posted: Tue Sep 28, 2010 7:30 am Post subject: single/double-bytes questions and conversion problem |
|
|
Disciple
Joined: 17 Jul 2002 Posts: 184
|
I received customer data that I was told is in CCSID 1252 (ANSI Latin 1).
When I open it with Notepad++, everything looks normal. For example, the data at the very beginning reads "# InvoiceNumber", with a tab character between # and InvoiceNumber. However, when I dump it with the DOS debug command, the output is:
FF FE 23 00 09 00 49 00 6E 00 76 00-6F 00 69 00 63 00 65 00
The FF FE is not used; 23 00 is #, 09 00 is tab, 49 00 is I, 6E 00 is n, 76 00 is v, etc.
Each 00 is displayed as "." in the character column.
So it looks like double-byte data to me. I also found http://en.wikipedia.org/wiki/Windows-1252, and from the chart there it really seems that Windows-1252 (CP1252) is double-byte, i.e. the 8 is coded as 0038 (I think that, written from low-order byte to high-order byte, it would be 3800).
However, when I googled further, everything says that ANSI 1252 is single-byte. For example,
http://resources.arcgis.com/content/kbase?fa=articleShow&d=21158 says:
Single Byte Character Set (SBCS):
1250: Windows Latin 2 (Central Europe)
1251: Windows Cyrillic
1252: Windows Latin 1 (ANSI)
1253: Windows Greek
1254: Windows Latin 5 (Turkish)
1255: Windows Hebrew
1256: Windows Arabic
1257: Windows Baltic
1258: Windows Vietnamese
874: Windows Thai
Also, http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/readme.txt says:
932, 936, 949 and 950 are all double byte code pages. The remainder are single byte code pages
Now, let me state the problem I have with the customer data:
I created a message set in the MRM domain to parse it, and I want to generate output based on the customer input. For example, I coded a simple message flow: an MQInput node parses the message and is wired directly to a Compute node containing this ESQL:
SET OutputRoot.Properties.CodedCharSetId = 1208;
SET OutputRoot.MQMD.CodedCharSetId = 1208;
SET OutputRoot.MQMD.Format = 'MQSTR ';
SET OutputRoot.XMLNS.Message = InputRoot.MRM.Title.T_InvoiceNo;
The Compute node is wired straight to an MQOutput node.
I used rfhutil to load the customer data, setting both the MQMD code page and the RFH code page to 1252. In the debugger I confirmed that InputRoot.Properties.CodedCharSetId, InputRoot.MQMD.CodedCharSetId and InputRoot.MQRFH2.CodedCharSetId all had the value 1252. However, when I browsed the output message, it showed:
C:\temp>amqsbcg0 QOUT|more
AMQSBCG0 - starts here
**********************
MQOPEN - 'QOUT'
MQGET of message number 1
****Message descriptor****
StrucId : 'MD ' Version : 2
Report : 0 MsgType : 8
Expiry : -1 Feedback : 0
Encoding : 546 CodedCharSetId : 1208
Format : 'MQSTR '
Priority : 0 Persistence : 1
MsgId : X'414D5120494453514D20202020202020BABCA04C20007B19'
CorrelId : X'000000000000000000000000000000000000000000000000'
BackoutCount : 0
ReplyToQ : ' '
ReplyToQMgr : 'IDSQM '
** Identity Context
UserIdentifier : 'mliu '
AccountingToken :
X'1601051500000010DDFDB0D2B59FF58CE8AFE8ED03000000000000000000000B'
ApplIdentityData : ' '
** Origin Context
PutApplType : '11'
PutApplName : 'ium\mqtools\ih03\rfhutil.exe'
PutDate : '20100928' PutTime : '15065409'
ApplOriginData : ' '
GroupId : X'000000000000000000000000000000000000000000000000'
MsgSeqNumber : '1'
Offset : '0'
MsgFlags : '0'
OriginalLength : '-1'
**** Message ****
length - 48 bytes
00000000: 3C4D 6573 7361 6765 3E00 4900 6E00 7600 '<Message>.I.n.v.'
00000010: 6F00 6900 6300 6500 4E00 7500 6D00 6200 'o.i.c.e.N.u.m.b.'
00000020: 6500 7200 3200 3C2F 4D65 7373 6167 653E 'e.r.2.</Message>'
So I have some questions:
1. Is CCSID 1252 (ANSI Latin 1) really single-byte or double-byte?
2. Does the code page alone decide single-byte versus double-byte, or does that also depend on other factors such as the operating system?
3. How do I remove the 0x00 bytes from the output so that it reads:
<Message>InvoiceNumber2</Message> rather than
<Message>.I.n.v.o.i.c.e.N.u.m.b.e.r.2.</Message>
Any help is really appreciated! |
|
Vitor |
Posted: Tue Sep 28, 2010 7:55 am Post subject: Re: single/double-bytes questions and conversion problem |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
lium wrote: |
I got customer data, which was told as CCSID 1252 ANSI Latin 1. |
Erm....
lium wrote: |
Encoding : 546 CodedCharSetId : 1208
|
This message isn't in 1252, it's in 1208. Despite everything.
_________________
Honesty is the best policy.
Insanity is the best defence. |
|
mqjeff |
Posted: Tue Sep 28, 2010 8:27 am Post subject: Re: single/double-bytes questions and conversion problem |
|
|
Grand Master
Joined: 25 Jun 2008 Posts: 17447
|
Vitor wrote: |
lium wrote: |
I got customer data, which was told as CCSID 1252 ANSI Latin 1. |
Erm....
lium wrote: |
Encoding : 546 CodedCharSetId : 1208
|
This message isn't in 1252, it's in 1208. |
Vitor wrote: |
Despite everything. |
Despite working as designed, you mean?
Code: |
SET OutputRoot.Properties.CodedCharSetId = 1208; |
Is pretty authoritative...
A quibble: you are using OutputRoot.XMLNS instead of OutputRoot.XMLNSC.
MRM will use the CCSID and Encoding information in the MQMD to convert the message data into logical data in 1200. Telling MQOutput to turn that logical message data into 1208 is then easy as pie. |
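One way to see why the NUL bytes survive into the output is to replay the two conversions outside the broker. The sketch below is not broker code and is only a plausible reconstruction: it assumes Python's `cp1252` and `utf-8` codecs behave like CCSIDs 1252 and 1208 for these characters, that the model splits fields on a single-byte tab, and that the record looks like the made-up sample here (the `Amount` column is hypothetical).

```python
# Hypothetical record: tab-separated column names, actually stored as UTF-16LE.
record = "#\tInvoiceNumber2\tAmount".encode("utf-16-le")

# Step 1: the input is (wrongly) declared as code page 1252, so every byte,
# including each 0x00, becomes one logical character.
logical = record.decode("cp1252")

# Step 2: split on the tab character, as a tab-delimited model would.
# The tab's own high byte (0x00) ends up at the start of the next field.
fields = logical.split("\t")
invoice_field = fields[1]

# Step 3: serialize the logical value as UTF-8 (CCSID 1208) inside the XML tag.
output = ("<Message>" + invoice_field + "</Message>").encode("utf-8")

print(len(output))  # total length in bytes
print(output)
```

With this sample the result is 48 bytes, starting `<Message>` followed by `00 49 00 6E`, which lines up with the `length - 48 bytes` dump in the original post. The conversion to 1208 did exactly what it was told; the NULs were already in the logical data.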
|
Vitor |
Posted: Tue Sep 28, 2010 8:30 am Post subject: Re: single/double-bytes questions and conversion problem |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
mqjeff wrote: |
Vitor wrote: |
Despite everything. |
Despite working as designed, you mean?
|
Shush...you'll spoil the surprise... |
|
zonko |
Posted: Tue Sep 28, 2010 8:31 am Post subject: |
|
|
Voyager
Joined: 04 Nov 2009 Posts: 78
|
Quote: |
UTF-16
In UTF-16, a BOM (U+FEFF) is placed as the first character of a file or character stream to indicate the endianness (byte order) of all the 16-bit code units of the file or stream.
If the 16-bit units are represented in big-endian byte order, this BOM character will appear in the sequence of bytes as 0xFE followed by 0xFF (where "0x" indicates hexadecimal);
if the 16-bit units use little-endian order, the sequence of bytes will have 0xFF followed by 0xFE.
The Unicode value U+FFFE is guaranteed never to be assigned as a Unicode character; this implies that in a Unicode context the 0xFF, 0xFE byte pattern can only be interpreted as the U+FEFF character expressed in little-endian byte order (since it could not be a U+FFFE character expressed in big-endian byte order).
|
The data is UCS-2, or UTF-16, in code page 1200 not 1208. The above quote is from the Wikipedia page on the Byte-Order mark used to prefix data. |
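The BOM rule quoted above can be checked against the bytes from the original post. This is a minimal sketch using Python's standard `codecs` constants; the helper name `sniff_utf16_bom` is made up for illustration, and it deliberately ignores UTF-32, whose little-endian BOM also begins with FF FE.

```python
import codecs

# First bytes of the customer file, copied from the debug dump in the thread.
dump = bytes.fromhex("FF FE 23 00 09 00 49 00 6E 00 76 00 6F 00 69 00 63 00 65 00")


def sniff_utf16_bom(data: bytes):
    """Return the Python codec name implied by a UTF-16 BOM, or None."""
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "utf-16-be"
    return None


codec = sniff_utf16_bom(dump)
# Skip the two BOM bytes and decode the rest with the detected byte order.
text = dump[2:].decode(codec) if codec else None
print(codec, repr(text))
```

For the dump in the thread this reports little-endian UTF-16 and decodes to `#`, a tab, and `Invoice`, which is what Notepad++ displayed.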
|
mqjeff |
Posted: Tue Sep 28, 2010 8:35 am Post subject: |
|
|
Grand Master
Joined: 25 Jun 2008 Posts: 17447
|
zonko wrote: |
The data is UCS-2, or UTF-16, in code page 1200 not 1208. |
Which data... ? The data that is explicitly set to be made into 1208, and has no BOM at all?
Or the data that starts with FFFE and not FEFF? |
|
lium |
Posted: Tue Sep 28, 2010 9:23 am Post subject: |
|
|
Disciple
Joined: 17 Jul 2002 Posts: 184
|
Thanks for your replies.
Vitor:
I tried 1208 at the very beginning; the result was exactly the same as with 1252.
mqjeff/zonko:
When I used 1200, I got something interesting. When I put 1200 in both the MQMD and the RFH, I got a parsing exception for the MQRFH2, which makes sense.
Then I put 1208 in the MQMD and 1200 in the RFH, and I got an exception for the message body. What interests me is that it failed at line level rather than at header level (the header comes before the lines).
So I am going to look into it and keep you updated.
BTW, FFFE is deemed "Not Used"; please check this website:
http://msdn.microsoft.com/en-us/library/cc195054.aspx |
|
mqjeff |
Posted: Tue Sep 28, 2010 9:38 am Post subject: |
|
|
Grand Master
Joined: 25 Jun 2008 Posts: 17447
|
You need to keep straight the difference between the CCSID on the message coming in (in both the MQMD and the MQRFH2) and the CCSID on the message going out.
They are entirely separate and distinct things.
If you are able to successfully parse the input message using your MRM TDS model, then the CCSID fields on the input message are in fact correct. At that point, everything exists to take the parsed message (now in CCSID 1200 in a logical message tree!) and construct whatever output message you need.
If you are not able to parse the input message, then you need to make sure that the CCSID fields correctly describe the data that is in the input message.
You also need to keep in mind exactly what data is described by the MQMD CCSID field and what data is described by the MQRFH2 CCSID field. Again, these are separate and distinct things. |
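The separation between input and output CCSID can be pictured as two independent steps: decode with the codec that truly describes the incoming bytes, then encode with whatever the consumer wants. The sketch below is an analogy in Python, not broker behaviour; `transcode` is a hypothetical helper, and the Python codec names stand in for CCSIDs by assumption.

```python
def transcode(data: bytes, input_codec: str, output_codec: str) -> bytes:
    """Decode with the codec describing the input, then encode for output."""
    logical = data.decode(input_codec)   # input side: must match the real bytes
    return logical.encode(output_codec)  # output side: a free choice


# What the sender really sent: UTF-16 little-endian text.
incoming = "InvoiceNumber2".encode("utf-16-le")

good = transcode(incoming, "utf-16-le", "utf-8")  # input described correctly
bad = transcode(incoming, "cp1252", "utf-8")      # input described as 1252

print(good)
print(bad)
```

Changing only the output codec never repairs `bad`; the stray 0x00 bytes come from the input description, which is the point made above.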
|
lium |
Posted: Tue Sep 28, 2010 10:45 am Post subject: |
|
|
Disciple
Joined: 17 Jul 2002 Posts: 184
|
Thanks again, everybody.
I think my problem has been solved by setting the code page to 1200 in the RFH.
Also, the "FF FE" I posted is 0xFEFF: FF is the low-order byte and FE is the high-order byte. The debug program prints bytes from low to high.
Regarding zonko's reply:
The data is UCS-2, or UTF-16, in code page 1200 not 1208
Can I understand this as: code page 1200 is a single-byte code page, and whether it is single-byte or double-byte depends on whether it is UTF-8 or UTF-16? |
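The single-byte versus double-byte question can be tested directly by encoding the same characters in each scheme and counting bytes. A rough sketch follows, assuming Python's `cp1252`, `utf-8` and `utf-16-le` codecs as stand-ins for CCSID 1252, CCSID 1208 and the UTF-16 family; the helper `byte_lengths` and the sample characters are illustrative choices.

```python
# Sample characters: digit eight, e-acute, and the euro sign.
samples = ["8", "\u00e9", "\u20ac"]
codecs_to_try = ["cp1252", "utf-8", "utf-16-le"]


def byte_lengths(ch: str) -> dict:
    """Map each codec name to the number of bytes it uses for one character."""
    return {name: len(ch.encode(name)) for name in codecs_to_try}


for ch in samples:
    # ascii() keeps the printout safe on consoles that cannot show the glyph.
    print(ascii(ch), byte_lengths(ch))
```

Code page 1252 uses one byte for every character it can represent, UTF-16 uses two bytes for each of these characters, and UTF-8 varies between one and three bytes here. The width is a property of the encoding, not of the operating system.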
|
kimbert |
Posted: Tue Sep 28, 2010 10:53 am Post subject: |
|
|
 Jedi Council
Joined: 29 Jul 2003 Posts: 5542 Location: Southampton
|
First, I suggest that you read this: http://www.joelonsoftware.com/articles/Unicode.html
mqjeff said:
Quote: |
If you are able to successfully parse the input message using your MRM TDS model, then the CCSID fields on the input message are in fact correct. |
Hmmm. Not sure that's true in all cases. For example, you could incorrectly specify US-ASCII ( code page 437 ) when the sender is using UTF-8 ( code page 1208 ). If the test message does not happen to contain any multi-byte characters then it will parse correctly. If the test message does contain multi-byte characters then you *might* get a parsing error. Or maybe your logical data will be incorrect.
My point is never try to find the code page using trial and error. Find out which code page the sender is using, and make sure that your message flow is using exactly that code page. Not just any 'double-byte code page' - you could pick the wrong one, and it would appear to work in many cases.
Quote: |
The FF FE is not used |
The FF FE is almost certainly a UTF-16LE BOM, as zonko correctly pointed out. Your input document is declaring itself to be in UTF-16, little-endian.
You seem to be missing a very important point here. If you don't know the character encoding of this input message, you cannot possibly interpret it correctly. In that comment, you were assuming that the data is in code page 1252. I think that is probably a wrong assumption.
But the first 128 code points in code page 1252 have the same values in Unicode, which is why it appears to work. See my first paragraph.
Quote: |
I got customer data, which was told as CCSID 1252 ANSI Latin 1 |
I suggest that you approach whoever told you that, and ask them to explain why the data has a Unicode BOM on the front of it. Only the Unicode encodings ( UTF-8, UTF-16 and UTF-32 ) can have a BOM.
Quote: |
Then I put 1208 for MQMD, and 1200 for RFH, I got exception for message body. what interests me is it failed at line level rather than header level(header goes before line). |
Yes - do keep trying with UTF-16. It is quite possible that your message model is not expecting the message to have a BOM. Or maybe the parsing error was a genuine error - it's hard to tell without knowing what the error was. If you do post again, please quote any errors in full. |
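The trial-and-error trap described above is easy to reproduce. In this sketch UTF-8 bytes are decoded with code page 437, the wrong code page from the example; `decode_as` and both sample strings are hypothetical, and Python's codecs are only an approximation of what a parser would do.

```python
def decode_as(data: bytes, declared: str) -> str:
    """Decode bytes using whatever code page the receiver was told to use."""
    return data.decode(declared)


# The sender actually uses UTF-8 for both messages.
ascii_only = "Invoice 42".encode("utf-8")
with_accent = "caf\u00e9".encode("utf-8")  # "cafe" with an e-acute

# Declared as code page 437: the ASCII-only message looks perfectly fine...
print(decode_as(ascii_only, "cp437"))

# ...but the accented message decodes without any error into wrong characters.
garbled = decode_as(with_accent, "cp437")
print(ascii(garbled), len(garbled))
```

The first message parses "correctly" under the wrong code page and the second fails silently, so a passing test message proves nothing about the declared code page.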
|
lium |
Posted: Wed Sep 29, 2010 10:49 am Post subject: |
|
|
Disciple
Joined: 17 Jul 2002 Posts: 184
|
Thanks, kimbert and everybody.
Another question:
The input message could not be parsed correctly until I adjusted its byte order.
For example, for the bit stream
49006E0076006F00690063006500
I had to change it to
0049006E0076006F006900630065 before it could be parsed as "Invoice".
It seems that Message Broker does not recognize the BOM, and the developer might have to determine the endianness and adjust the byte order before parsing. Is this correct? |
|
fjb_saper |
Posted: Wed Sep 29, 2010 11:36 am Post subject: |
|
|
 Grand High Poobah
Joined: 18 Nov 2003 Posts: 20756 Location: LI,NY
|
lium wrote: |
Thanks Kimbert and everybody.
Another question is:
The input message could not get parsed correctly until I adjust its order.
For example, for bit stream
49006E0076006F00690063006500
I have change it to
0049006E0076006F006900630065 before it can be parsed as "Invoice"
It seems like the message broker does not recognize BOM and developer might have to judge the endian and adjust the byte order before it is parsed, Is this correct? |
This data looks to be UTF-16. Now, IIRC, you have 3 CCSIDs to handle that:
- 1200 -- no BOM
- 1201 -- BOM one way (I don't remember whether that is little- or big-endian; test it)
- 1202 -- BOM the other way
Have fun
_________________
MQ & Broker admin |
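The byte swapping described in the question is what choosing the other byte order does implicitly. The sketch below decodes the bit stream from the post both ways; `swap_byte_pairs` is a hypothetical helper, and treating Python's `utf-16-le` and `utf-16-be` codecs as equivalents of particular CCSIDs is an assumption rather than something the thread confirms.

```python
# The bit stream exactly as it arrived, copied from the post.
original = bytes.fromhex("49006E0076006F00690063006500")


def swap_byte_pairs(data: bytes) -> bytes:
    """Swap the two bytes of every 16-bit unit; the length must be even."""
    if len(data) % 2:
        raise ValueError("UTF-16 data must have an even number of bytes")
    swapped = bytearray(data)
    # Even positions take the old odd bytes, and vice versa.
    swapped[0::2], swapped[1::2] = data[1::2], data[0::2]
    return bytes(swapped)


# Declared as little-endian, the untouched bytes already decode correctly.
print(original.decode("utf-16-le"))

# Manual swapping is only needed if the data is declared as big-endian.
manual = swap_byte_pairs(original)
print(manual.hex().upper(), manual.decode("utf-16-be"))
```

Both routes yield `Invoice`, so declaring the right byte order up front removes the need to touch the bytes at all.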
|
Vitor |
Posted: Wed Sep 29, 2010 12:03 pm Post subject: |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
lium wrote: |
It seems like the message broker does not recognize BOM and developer might have to judge the endian and adjust the byte order before it is parsed, Is this correct? |
No.
a) WMB can handle double byte character sets perfectly well
b) The developer (whoever that is) does not have to judge anything. The developer at the sending end knows what character set he's using. The developer at the WMB end knows (or should know) what character set the message uses because he was told by the sending developer. Either by coding it in the message, or sending an email, or by gasping the details out between trout blows.
I repeat the advice of my most worthy associate:
kimbert wrote: |
My point is never try to find the code page using trial and error. Find out which code page the sender is using, and make sure that your message flow is using exactly that code page |
Do not use judgement. Use facts. |
|
fjb_saper |
Posted: Wed Sep 29, 2010 12:34 pm Post subject: |
|
|
 Grand High Poobah
Joined: 18 Nov 2003 Posts: 20756 Location: LI,NY
|
This is the right way to go! |
|
lium |
Posted: Wed Sep 29, 2010 1:24 pm Post subject: |
|
|
Disciple
Joined: 17 Jul 2002 Posts: 184
|
Great, thanks guys, especially fjb_saper.
Code page 1202 worked perfectly, in that all messages parsed correctly.
The customer passed us wrong information. So you see, sometimes God is not that knowledgeable. |
|