MQSeries.net :: View topic - single/double-bytes questions and conversion problem

MQSeries.net Forum Index » WebSphere Message Broker (ACE) Support » single/double-bytes questions and conversion problem

single/double-bytes questions and conversion problem

lium

Posted: Tue Sep 28, 2010 7:30 am Post subject: single/double-bytes questions and conversion problem

Disciple

Joined: 17 Jul 2002
Posts: 184

I got customer data, which was told as CCSID 1252 ANSI Latin 1.
When I open it with Notepad++, everything looks normal. For example, the data at the very beginning reads "# InvoiceNumber", with a tab character between # and InvoiceNumber. However, when I display it with the DOS debug command, the output is:
FF FE 23 00 09 00 49 00 6E 00 76 00-6F 00 69 00 63 00 65 00

The FF FE is not used, 23 00 is #, 09 00 is tab, 49 is I, 6E is n, 76 is v, etc.
Each 00 is displayed as a dot (.).
So it looks double-byte to me. The chart at http://en.wikipedia.org/wiki/Windows-1252 also seems to suggest that Windows-1252 (CP1252) is double-byte: for example, the digit 8 is listed as 0038 (written low-order byte first, I think that would be 38 00).

However, other sources I found say that ANSI 1252 is single-byte. For example,
http://resources.arcgis.com/content/kbase?fa=articleShow&d=21158 says:

Single Byte Character Set (SBCS):
1250: Windows Latin 2 (Central Europe)
1251: Windows Cyrillic
1252: Windows Latin 1 (ANSI)
1253: Windows Greek
1254: Windows Latin 5 (Turkish)
1255: Windows Hebrew
1256: Windows Arabic
1257: Windows Baltic
1258: Windows Vietnamese
874: Windows Thai

Also from http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/readme.txt, it says:

932, 936, 949 and 950 are all double byte code pages. The remainder are single byte code pages

Now, let me state the problem I have with customer data:

I created a message set in the MRM domain to parse it, and I want to generate output based on the customer's input. As a test I built a simple message flow: an MQInput node parses the message and is connected directly to a Compute node containing this ESQL:

SET OutputRoot.Properties.CodedCharSetId = 1208;
SET OutputRoot.MQMD.CodedCharSetId = 1208;
SET OutputRoot.MQMD.Format = 'MQSTR ';
SET OutputRoot.XMLNS.Message = InputRoot.MRM.Title.T_InvoiceNo;

The Compute node is wired straight to an MQOutput node.
I used rfhutil to load the customer data, and I set both the MQMD code page and the RFH code page to 1252. In the debugger I confirmed that InputRoot.Properties.CodedCharSetId, InputRoot.MQMD.CodedCharSetId and InputRoot.MQRFH2.CodedCharSetId all had the value 1252. However, when I browsed the output message, it showed:

C:\temp>amqsbcg0 QOUT|more

AMQSBCG0 - starts here
**********************

MQOPEN - 'QOUT'

MQGET of message number 1
****Message descriptor****

StrucId : 'MD ' Version : 2
Report : 0 MsgType : 8
Expiry : -1 Feedback : 0
Encoding : 546 CodedCharSetId : 1208
Format : 'MQSTR '
Priority : 0 Persistence : 1
MsgId : X'414D5120494453514D20202020202020BABCA04C20007B19'
CorrelId : X'000000000000000000000000000000000000000000000000'
BackoutCount : 0
ReplyToQ : ' '
ReplyToQMgr : 'IDSQM '
** Identity Context
UserIdentifier : 'mliu '
AccountingToken :
X'1601051500000010DDFDB0D2B59FF58CE8AFE8ED03000000000000000000000B'
ApplIdentityData : ' '
** Origin Context
PutApplType : '11'
PutApplName : 'ium\mqtools\ih03\rfhutil.exe'
PutDate : '20100928' PutTime : '15065409'
ApplOriginData : ' '

GroupId : X'000000000000000000000000000000000000000000000000'
MsgSeqNumber : '1'
Offset : '0'
MsgFlags : '0'
OriginalLength : '-1'

**** Message ****

length - 48 bytes

00000000: 3C4D 6573 7361 6765 3E00 4900 6E00 7600 '<Message>.I.n.v.'
00000010: 6F00 6900 6300 6500 4E00 7500 6D00 6200 'o.i.c.e.N.u.m.b.'
00000020: 6500 7200 3200 3C2F 4D65 7373 6167 653E 'e.r.2.</Message>'

So I have questions:
1. Is CCSID 1252 ANSI Latin 1 really single-byte or double-byte?
2. Does the code page alone decide whether data is single-byte or double-byte, or does that also depend on other factors such as the operating system?
3. How do I remove the 0x00 bytes from the output so that it reads:
<Message>InvoiceNumber2</Message> rather than
<Message>.I.n.v.o.i.c.e.N.u.m.b.e.r.2.</Message>

Any help is really appreciated!

Vitor

Posted: Tue Sep 28, 2010 7:55 am Post subject: Re: single/double-bytes questions and conversion problem

Grand High Poobah

Joined: 11 Nov 2005
Posts: 26093
Location: Texas, USA

lium wrote:

I got customer data, which was told as CCSID 1252 ANSI Latin 1.

Erm....

lium wrote:

Encoding : 546 CodedCharSetId : 1208

This message isn't in 1252, it's in 1208. Despite everything.
_________________
Honesty is the best policy.
Insanity is the best defence.

mqjeff

Posted: Tue Sep 28, 2010 8:27 am Post subject: Re: single/double-bytes questions and conversion problem

Grand Master

Joined: 25 Jun 2008
Posts: 17447

Vitor wrote:

lium wrote:

I got customer data, which was told as CCSID 1252 ANSI Latin 1.

Erm....

lium wrote:

Encoding : 546 CodedCharSetId : 1208

This message isn't in 1252, it's in 1208.

Vitor wrote:

Despite everything.

Despite working as designed, you mean?

Code:

SET OutputRoot.Properties.CodedCharSetId = 1208;

Is pretty authoritative...

Quibbles on using OutputRoot.XMLNS instead of OutputRoot.XMLNSC.

MRM will use MQMD information on CCSID and Encoding to convert message data into logical data in 1200. Telling MQOutput to turn logical message data into 1208 then is pie.
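
To make that concrete, here is a minimal Python sketch (not broker code) of how the 0x00 bytes end up in the 1208 output. It assumes the body really is UTF-16 little-endian while the headers declare 1252; Python's cp1252 and utf-8 codecs stand in for the broker's converters, and the sample bytes are the "Invoice" part of the dump in the first post.

```python
# Body bytes as they appear in the customer's file (UTF-16LE, BOM left out).
raw = bytes.fromhex("49006E0076006F00690063006500")

# Declaring the input as CCSID 1252 treats every byte as one character,
# so each 0x00 byte becomes a NUL character in the parsed text.
misread = raw.decode("cp1252")

# Writing that text out as UTF-8 (CCSID 1208) keeps every NUL as a 0x00 byte.
out_1208 = misread.encode("utf-8")

# Declaring the input as UTF-16LE gives clean text and clean UTF-8 output.
correct = raw.decode("utf-16-le")
fixed_1208 = correct.encode("utf-8")

print(out_1208.hex())   # still has 00 after every letter
print(fixed_1208)       # plain ASCII-range bytes, no 00
```

In this sketch the output conversion does exactly what it was asked to do; the stray bytes come from the input side being described as 1252.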

Vitor

Posted: Tue Sep 28, 2010 8:30 am Post subject: Re: single/double-bytes questions and conversion problem

Grand High Poobah

Joined: 11 Nov 2005
Posts: 26093
Location: Texas, USA

mqjeff wrote:

Vitor wrote:

Despite everything.

Despite working as designed, you mean?

Shush...you'll spoil the surprise...


zonko

Posted: Tue Sep 28, 2010 8:31 am Post subject:

Voyager

Joined: 04 Nov 2009
Posts: 78

Quote:

UTF-16
In UTF-16, a BOM (U+FEFF) is placed as the first character of a file or character stream to indicate the endianness (byte order) of all the 16-bit code units of the file or stream.

If the 16-bit units are represented in big-endian byte order, this BOM character will appear in the sequence of bytes as 0xFE followed by 0xFF (where "0x" indicates hexadecimal);
if the 16-bit units use little-endian order, the sequence of bytes will have 0xFF followed by 0xFE.
The Unicode value U+FFFE is guaranteed never to be assigned as a Unicode character; this implies that in a Unicode context the 0xFF, 0xFE byte pattern can only be interpreted as the U+FEFF character expressed in little-endian byte order (since it could not be a U+FFFE character expressed in big-endian byte order).

The data is UCS-2, or UTF-16, in code page 1200 not 1208. The above quote is from the Wikipedia page on the Byte-Order mark used to prefix data.
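
As a rough illustration of the quote, the following Python sketch checks the first two bytes of the dump from the original post for a UTF-16 BOM and decodes the rest accordingly. The helper name sniff_utf16_bom is made up for this example, and it deliberately ignores UTF-8 and UTF-32 BOMs.

```python
import codecs

def sniff_utf16_bom(data: bytes):
    """Return (codec_name, bom_length) for a leading UTF-16 BOM, else (None, 0).

    Hypothetical helper for illustration only; UTF-8 and UTF-32 BOMs are not handled.
    """
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "utf-16-le", 2
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "utf-16-be", 2
    return None, 0

# The bytes shown by the debug command in the first post.
dump = bytes.fromhex("FFFE2300090049006E0076006F00690063006500")

codec, skip = sniff_utf16_bom(dump)
text = dump[skip:].decode(codec)

print(codec)        # which byte order the BOM announced
print(repr(text))   # '#', a tab, then the start of "InvoiceNumber"
```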

mqjeff

Posted: Tue Sep 28, 2010 8:35 am Post subject:

Grand Master

Joined: 25 Jun 2008
Posts: 17447

zonko wrote:

The data is UCS-2, or UTF-16, in code page 1200 not 1208.

Which data... ? The data that is explicitly set to be made into 1208, and has no BOM at all?

Or the data that starts with FFFE and not FEFF?

lium

Posted: Tue Sep 28, 2010 9:23 am Post subject:

Disciple

Joined: 17 Jul 2002
Posts: 184

Thanks for your replies.

Vitor:
I tried 1208 at the very beginning; the result was exactly the same as with 1252.

mqjeff/zonko:
When I used 1200, I got something interesting. With 1200 set for both the MQMD and the RFH, I got a parsing exception for the MQRFH2, which makes sense.
Then I put 1208 for MQMD, and 1200 for RFH, I got exception for message body. what interests me is it failed at line level rather than header level(header goes before line).
So I am going to look into it and keep you updated.

BTW, the FFFE is deemed "Not Used", please check this website:
http://msdn.microsoft.com/en-us/library/cc195054.aspx

mqjeff

Posted: Tue Sep 28, 2010 9:38 am Post subject:

Grand Master

Joined: 25 Jun 2008
Posts: 17447

You need to keep separate and keep straight the difference between the CCSID on the message coming in (in both the MQMD and the MQRFH2) and the CCSID on the message going out.

They are entirely separate and distinct things.

If you are able to successfully parse the input message using your MRM TDS model, then the CCSID fields on the input message are in fact correct. At that point, everything exists to take the parsed message (now in CCSID 1200 in a logical message tree!) and do what's needed to construct an output message as needed.

If you are not able to parse the input message, then you need to make sure that the CCSID fields correctly describe the data that is in the input message.

You also need to keep in mind exactly what data is being described by the MQMD CCSID field and what data is being described by the MQRFH2 CCSID field - these are, again, separate and distinct things.

lium

Posted: Tue Sep 28, 2010 10:45 am Post subject:

Disciple

Joined: 17 Jul 2002
Posts: 184

Thanks again, everybody.
I think my problem has been solved by setting code page 1200 in the RFH.
Also, the "FF FE" I posted is 0xFEFF: FF is the low-order byte and FE is the high-order byte. The debug program prints bytes from low to high.

Regarding zonko's reply:

The data is UCS-2, or UTF-16, in code page 1200 not 1208

Can I take this to mean that code page 1200 is a single-byte code page, and that whether the data is single-byte or double-byte depends on whether it is UTF-8 or UTF-16?

kimbert

Posted: Tue Sep 28, 2010 10:53 am Post subject:

Jedi Council

Joined: 29 Jul 2003
Posts: 5543
Location: Southampton

First, I suggest that you read this: http://www.joelonsoftware.com/articles/Unicode.html

mqjeff said:

Quote:

If you are able to successfully parse the input message using your MRM TDS model, then the CCSID fields on the input message are in fact correct.

Hmmm. Not sure that's true in all cases. For example, you could incorrectly specify US-ASCII ( code page 437 ) when the sender is using UTF-8 ( code page 1208 ). If the test message does not happen to contain any multi-byte characters then it will parse correctly. If the test message does contain multi-byte characters then you *might* get a parsing error. Or maybe your logical data will be incorrect.

My point is never try to find the code page using trial and error. Find out which code page the sender is using, and make sure that your message flow is using exactly that code page. Not just any 'double-byte code page' - you could pick the wrong one, and it would appear to work in many cases.
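
A small Python sketch of that trap, assuming Python's cp437 and utf-8 codecs as stand-ins for CCSIDs 437 and 1208 (the sample strings are made up):

```python
# Two messages as sent by a UTF-8 sender.
ascii_only = "InvoiceNumber2".encode("utf-8")
with_accent = "Café".encode("utf-8")   # é is sent as two bytes: C3 A9

# ASCII-only test data: the wrong code page happens to give the right answer.
same = ascii_only.decode("cp437") == ascii_only.decode("utf-8")

# Data with a multi-byte character: no exception here, just wrong logical data.
wrong = with_accent.decode("cp437")
right = with_accent.decode("utf-8")

print(same)                     # the ASCII-only message cannot tell the two apart
print(len(right), len(wrong))   # the mis-declared message gains an extra character
```

A test message without multi-byte characters cannot reveal this kind of mistake, which is why the sender's actual code page has to be confirmed rather than guessed.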

Quote:

The FF FE is not used

The FFFE is almost certainly a UTF16-LE BOM, as zonko correctly pointed out. Your input document is declaring itself to be in UTF-16, little-endian.

Quote:

BTW, the FFFE is deemed "Not Used", please check this website:
http://msdn.microsoft.com/en-us/library/cc195054.aspx

You seem to be missing a very important point here. If you don't know the character encoding of this input message, you cannot possibly interpret it correctly. In that comment, you were assuming that the data is in code page 1252. I think that is probably a wrong assumption.
But...the first 128 code points in code page 1252 have the same values as in Unicode (and therefore in UTF-16), which is why it appears to work. See my first paragraph.
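
To illustrate both points (1252 is one byte per character, and it lines up with UTF-16 only for the first 128 code points), here is a short Python sketch. It uses Python's cp1252 and utf-16-le codecs as stand-ins for the CCSIDs; the euro sign is just a convenient example of a character above 127.

```python
text = "Invoice"

as_1252 = text.encode("cp1252")       # one byte per character
as_utf16le = text.encode("utf-16-le") # two bytes per character here

print(len(text), len(as_1252), len(as_utf16le))

# Below code point 128, the low byte of each UTF-16LE unit equals the
# cp1252 byte and the high byte is 00, hence the "I.n.v.o.i.c.e." look.
low_bytes = as_utf16le[0::2]
high_bytes = as_utf16le[1::2]

# Above that range the two diverge: the euro sign is the single byte 0x80
# in cp1252 but code point U+20AC in Unicode.
euro_1252 = "€".encode("cp1252")
euro_utf16le = "€".encode("utf-16-le")
print(euro_1252.hex(), euro_utf16le.hex())
```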

Quote:

I got customer data, which was told as CCSID 1252 ANSI Latin 1

I suggest that you approach whoever told you that, and ask them to explain why the data has a Unicode BOM on the front of it. Only the Unicode encodings ( UTF-8, UTF-16 and UTF-32 ) can have a BOM.

Quote:

Then I put 1208 for MQMD, and 1200 for RFH, I got exception for message body. what interests me is it failed at line level rather than header level(header goes before line).

Yes - do keep trying with UTF-16. It is quite possible that your message model is not expecting the message to have a BOM. Or maybe the parsing error was a genuine error - it's hard to tell without knowing what the error was. If you do post again, please quote any errors in full.

lium

Posted: Wed Sep 29, 2010 10:49 am Post subject:

Disciple

Joined: 17 Jul 2002
Posts: 184

Thanks Kimbert and everybody.
Another question is:
The input message could not be parsed correctly until I adjusted its byte order.
For example, I had to change the bit stream
49006E0076006F00690063006500

to

0049006E0076006F006900630065 before it could be parsed as "Invoice".

It seems like the message broker does not recognize BOM and developer might have to judge the endian and adjust the byte order before it is parsed, Is this correct?
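
For reference, the two bit streams above can be compared with a few lines of Python. This is only a sketch of the byte-order difference, using Python's utf-16-le and utf-16-be codecs; it does not model how the broker picks a byte order from the CCSID and Encoding declared on the message.

```python
le_bytes = bytes.fromhex("49006E0076006F00690063006500")  # as received
be_bytes = bytes.fromhex("0049006E0076006F006900630065")  # after manual reordering

# Each stream decodes to the same text once the matching byte order is named.
from_le = le_bytes.decode("utf-16-le")
from_be = be_bytes.decode("utf-16-be")

# Swapping every pair of bytes converts one stream into the other.
swapped = b"".join(
    le_bytes[i + 1:i + 2] + le_bytes[i:i + 1]
    for i in range(0, len(le_bytes), 2)
)

# Reading the received bytes with the wrong byte order does not fail;
# it silently produces different characters.
misread = le_bytes.decode("utf-16-be")

print(from_le, from_be, swapped == be_bytes, misread == from_le)
```

So the reordering is equivalent to declaring the other byte order for the same data.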

fjb_saper

Posted: Wed Sep 29, 2010 11:36 am Post subject:

Grand High Poobah

Joined: 18 Nov 2003
Posts: 20763
Location: LI,NY

lium wrote:

This data looks to be UTF-16. Now you have 3 CCSIDs to handle that IIRC:

1200 -- no BOM
1201 -- BOM one way (don't remember if little or big endian test it)
1202 -- BOM the other way

Have fun

_________________
MQ & Broker admin

Vitor

Posted: Wed Sep 29, 2010 12:03 pm Post subject:

Grand High Poobah

Joined: 11 Nov 2005
Posts: 26093
Location: Texas, USA

lium wrote:

It seems like the message broker does not recognize BOM and developer might have to judge the endian and adjust the byte order before it is parsed, Is this correct?

No.

a) WMB can handle double byte character sets perfectly well
b) The developer (whoever that is) does not have to judge anything. The developer at the sending end knows what character set he's using. The developer at the WMB end knows (or should know) what character set the message uses because he was told by the sending developer. Either by coding it in the message, or sending an email, or by gasping the details out between trout blows.

I repeat the advice of my most worthy associate:

kimbert wrote:

My point is never try to find the code page using trial and error. Find out which code page the sender is using, and make sure that your message flow is using exactly that code page

Do not use judgement. Use facts.

fjb_saper

Posted: Wed Sep 29, 2010 12:34 pm Post subject:

Grand High Poobah

Joined: 18 Nov 2003
Posts: 20763
Location: LI,NY

This is the right way to go!

lium

Posted: Wed Sep 29, 2010 1:24 pm Post subject:

Disciple

Joined: 17 Jul 2002
Posts: 184

Many thanks, guys, especially fjb_saper.
Code page 1202 worked perfectly: all messages parsed correctly.
The customer gave us the wrong information. So you see, sometimes God is not that knowledgeable.
