MQSeries.net :: View topic - Convert double-byte to single-byte characters inside Broker

Convert double-byte to single-byte characters inside Broker


sunny_30

Posted: Fri Jul 31, 2009 11:16 am Post subject: Convert double-byte to single-byte characters inside Broker

Master

Joined: 03 Oct 2005
Posts: 258

Hi all,

I'm trying to find a solution to a Message Broker IDoc parsing problem. I opened a PMR with IBM, but it has not been much help yet.

Environment: AIX-5.3, MessageBroker 6005, MQ-6022
Message-Domain: SAP IDocs from the WebSphere MQ Link for R3

The IDoc flat-file messages arrive with special characters such as: â€ Ñ � etc. The MQMD.Format is set to NULL, but the CCSID is set to 1208.
The flat-file IDoc goes through some pre-processing, such as removing CRLF characters and padding with spaces, to fit the SAP R/3 format.
Thousands of messages have been processed successfully, but a handful of messages that carry special characters fail in the RCD node.
Exception-details:
<Severity>3</Severity>
<Number>6118</Number>
<Text>Invalid buffer parameters</Text>

The application (F2Q) that posts the flat-file IDoc to the queue runs on Windows. SAP, which generates the data in Unicode format, is the source of the special characters.

The issue seems to be that SAP generates special characters as double bytes alongside single-byte characters inside the IDoc message.
For example, the special character Ñ is hex 'D1' in single-byte form, but the inbound IDoc arrives with the double-byte representation, hex 'C391'.
Is DBCS data (one-byte characters mixed with two-byte characters) a problem in the first place for IDocs?

I tried converting the BLOB to CCSID 1200; that doesn't help.
Changing the format to MQFMT_STRING doesn't help either.
When I tried converting to CCSID 437, the IDoc parsed fine but the characters showed up differently in SAP.

I have not had problems handling special characters inside XML UTF-8 messages where all characters are SBCS (single-byte) and the CCSID was set to 1208, but DBCS seems to be an issue here.
Is there a way to convert all the double-byte characters inside the BLOB data to single-byte (SBCS) using Message Broker?

Has anybody here faced a similar issue using the IDoc parser (or any other parser) with special characters?
Please help.

Thanks,
Sunny

fjb_saper

Posted: Fri Jul 31, 2009 12:15 pm Post subject:

Grand High Poobah

Joined: 18 Nov 2003
Posts: 20763
Location: LI,NY

This should not be a problem.
You need to make sure when you ingest the SAP segments that all Strings are declared as char and not as byte....
This can be somewhat challenging.
See the relevant section in the manual

Another approach would be not to parse the IDOC in UTF-8 but in some single-byte code page. What is the native code page for SAP? You could ask for the file in that charset and then parse with it.

You could also have some kind of input blob to output blob where the CCSID is translated in between...

Have fun.

_________________
MQ & Broker admin

kimbert

Posted: Sat Aug 01, 2009 7:31 am Post subject:

Jedi Council

Joined: 29 Jul 2003
Posts: 5543
Location: Southampton

You are the latest in a long train of users who are struggling with code page issues. Please search in this forum for 'code page Unicode' to find recent posts which contain useful links on this subject.

Quote:

The issue seems to be that SAP generates special characters as double bytes alongside single-byte characters inside the IDoc message.
For example, the special character Ñ is hex 'D1' in single-byte form, but the inbound IDoc arrives with the double-byte representation, hex 'C391'.
Is DBCS data (one-byte characters mixed with two-byte characters) a problem in the first place for IDocs?

C391 is the correct UTF-8 (code page 1208) representation of that character. At risk of repeating myself, you need to do some reading.
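
To make those byte values concrete, here is a small Python sketch. Python is not part of this thread and is used only as a neutral way to inspect bytes. It assumes Python's `cp1252` codec is an adequate stand-in for the Windows Latin-1 code page.

```python
# The character Ñ (U+00D1) in a single-byte code page versus UTF-8 (CCSID 1208).
ch = "\u00d1"  # Ñ

single_byte = ch.encode("cp1252")  # Windows Latin-1: one byte
utf8 = ch.encode("utf-8")          # UTF-8: two bytes

print(single_byte.hex().upper())   # D1
print(utf8.hex().upper())          # C391

# Reading the UTF-8 bytes as if they were cp1252 yields two unrelated
# characters, which is how the "Ã‘" style of garbling arises.
misread = utf8.decode("cp1252")
print(len(misread))                # 2
```

Both byte strings describe the same character; what differs is the code page used to write it down.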

Quote:

I tried converting the BLOB to CCSID 1200; that doesn't help.
Changing the format to MQFMT_STRING doesn't help either.
When I tried converting to CCSID 437, the IDoc parsed fine but the characters showed up differently in SAP.

In other words, you have given up trying to understand the problem, and you have started to twiddle with settings. That's the wrong approach.

Suggested next steps:
a) Do some research into the basics of code pages, as suggested
b) Take a debug-level user trace and make sure that you understand which code page the Idoc parser is using
c) Check that the input bit stream contains only properly encoded UTF-8 characters. If not, the source application is providing false information about the data which it is supplying, and there's nothing your message flow can do about it.
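
Step (c) can be tried outside the broker on a captured copy of the message bytes. The following is a minimal Python sketch, not a broker feature; the function name `inspect_utf8` and the sample payloads are made up for illustration.

```python
def inspect_utf8(payload: bytes) -> dict:
    """Report whether payload is well-formed UTF-8 and where U+FFFD occurs."""
    try:
        text = payload.decode("utf-8")  # strict decoding, raises on bad bytes
    except UnicodeDecodeError as exc:
        return {"valid": False,
                "bad_byte_offset": exc.start,
                "replacement_positions": []}
    # Well-formed, but U+FFFD may still reveal an earlier lossy conversion.
    positions = [i for i, c in enumerate(text) if c == "\ufffd"]
    return {"valid": True,
            "bad_byte_offset": None,
            "replacement_positions": positions}


good = inspect_utf8(b"NI\xc3\x91O")      # "NIÑO" in UTF-8
lossy = inspect_utf8(b"A\xef\xbf\xbdB")  # well-formed, but contains U+FFFD
broken = inspect_utf8(b"NI\xd1O")        # a lone cp1252 byte D1, not UTF-8

print(good)
print(lossy)
print(broken)
```

A payload can pass the strict decode and still contain U+FFFD, which usually means information was already lost before the message reached the flow.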

sunny_30

Posted: Sun Aug 02, 2009 8:14 am Post subject:

Master

Joined: 03 Oct 2005
Posts: 258

Thanks for your responses.

The IDoc message set was set to encoding CP1252.
I changed the encoding to UTF-8.
For all IDoc segments in the message definition files (including the standard Idoc.mxsd), I changed the "Length Units" from Bytes to Characters.
(I opened the mxsd in a text editor capable of multi-line replace and did a "replace all".)
Modifying each segment definition manually would have been impossible; there are 1465 definitions to change altogether!
Despite all this, it didn't help. I must have missed something, because I get the same parser error as earlier: "Invalid buffer parameters"

BUT it finally worked in a different way:
I left the IDoc message set on CP1252, with the "Length Units" still set to "Bytes".
Before the message reaches the IDoc parser (RCD node), I changed the CCSID of the incoming message from 1208 (UTF-8, multibyte) to 5348, which is a single-byte representation of the Windows CP1252 encoding.
It parsed fine. In a later step, where the message gets transformed to .xsd (SAP WBIA format), I changed it back to UTF-8 (CCSID 1208) before the XML hits SAP.
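
For anyone who wants to see what that conversion does to the bytes, here is a rough Python sketch of the same round trip. It runs outside the broker, the helper `transcode` is hypothetical, and Python's `cp1252` codec is used only as an approximation of CCSID 5348.

```python
def transcode(payload: bytes, source: str, target: str) -> bytes:
    """Re-encode payload from one code page to another."""
    return payload.decode(source).encode(target)


text = "PE\u00d1A 10\u20ac"                     # "PEÑA 10€", 8 characters
utf8_in = text.encode("utf-8")                   # as it arrives with CCSID 1208
single = transcode(utf8_in, "utf-8", "cp1252")   # before the byte-based parser
back = transcode(single, "cp1252", "utf-8")      # before the XML goes to SAP

print(len(text), len(utf8_in), len(single))      # 8 11 8
print(back == utf8_in)                           # True
```

In the single-byte form one character is one byte, so field lengths counted in bytes line up again. The round trip is lossless only while every character exists in the single-byte code page.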

Thanks again for all the help.

fjb_saper

Posted: Sun Aug 02, 2009 2:04 pm Post subject:

Grand High Poobah

Joined: 18 Nov 2003
Posts: 20763
Location: LI,NY

Just changing the offending field from byte to char would probably have gotten you around the bend...

_________________
MQ & Broker admin

sunny_30

Posted: Sun Aug 02, 2009 5:22 pm Post subject:

Master

Joined: 03 Oct 2005
Posts: 258

Hi Saper,

Quote:

Just changing the offending field from byte to char would probably have gotten you around the bend

It is the first thing I tried (including changing the message-set encoding to UTF-8), but the message still failed to parse....

fjb_saper

Posted: Sun Aug 02, 2009 7:33 pm Post subject:

Grand High Poobah

Joined: 18 Nov 2003
Posts: 20763
Location: LI,NY

sunny_30 wrote:

Hi Saper,

Quote:

Just changing the offending field from byte to char would probably have gotten you around the bend

It is the first thing I tried (including changing the message-set encoding to UTF-8), but the message still failed to parse....

I hope you checked out the reference I gave you and used characters and not character units. Character Units would be defeating the purpose for UTF-8...

Anyways happy you solved it. We did something similar. The origin system being the MF with CCSID 500 with a failover copy in Oracle UTF-8... We change back the failover blob to ccsid 500 before parsing...

_________________
MQ & Broker admin

sunny_30

Posted: Mon Aug 03, 2009 1:31 pm Post subject:

Master

Joined: 03 Oct 2005
Posts: 258

I ran into a different issue after the recent change.
It doesn't look that simple to process multi-byte characters using the IDOC parser.

Here is the current problem:
This is the setup I recently modified to:
IDOC-flat-file: UTF-8(CCSID-1208- multibyte) File-to-Queue ->
SAP-IDOC-parser :CP1252 (CCSID-5348- singlebyte) Message-Broker ->
mySap.com Adapter UTF-8(CCSID-1208- multibyte) Feeds into SAP

This setup works for most of the special characters: Ñ, â€ etc
source: http://www.io.com/~jdawson/cp1252.html

But in the conversion from UTF-8 -> CP1252 -> UTF-8, I'm having a problem with the character: �
Its UTF-8 hex representation is 'EFBFBD'.
The character gets converted to the single byte hex '1A' (it looks like this is the substitution byte used when no conversion rule is found!).
So, on conversion back to UTF-8, '1A' doesn't get converted back to 'EFBFBD'.

Fails with the following error in the adapter downstream after Message-Broker processing:
[MsgID: 0] [Mesg: An invalid XML character (Unicode: 0x1a) was found in the element content of the document.]
same '1A' problem here:
http://www.mqseries.net/phpBB2/viewtopic.php?t=40397&postdays=0&postorder=asc&start=0
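
The lossy round trip described above can be reproduced in a few lines of Python. This is only a sketch: the error handler name `sub1a` is made up, and it imitates the reported behaviour (an unmappable character becomes X'1A'), not the broker's actual conversion code. Python's own default would raise an error instead.

```python
import codecs


def sub_1a(exc):
    """Imitate a converter that emits SUB (0x1A) for unmappable characters."""
    if isinstance(exc, UnicodeEncodeError):
        return ("\x1a" * (exc.end - exc.start), exc.end)
    raise exc


codecs.register_error("sub1a", sub_1a)  # hypothetical handler name

original = "A\ufffdB".encode("utf-8")    # 41 EF BF BD 42
single = original.decode("utf-8").encode("cp1252", errors="sub1a")
back = single.decode("cp1252").encode("utf-8")

print(original.hex().upper())            # 41EFBFBD42
print(single.hex().upper())              # 411A42
print(back.hex().upper())                # 411A42
```

Once the character has become X'1A' there is nothing left to say what it used to be, so the conversion back to UTF-8 carries the control character through to the XML.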

So I'm stuck with only one choice: get the IDOC parser to process UTF-8 multi-byte messages successfully, so that I don't lose anything in a midway conversion to single bytes.

As I explained earlier, although I tried changing the length units for all the message-set fields from Bytes to Characters (note: not 'Character Units'), it still doesn't seem to work.
Before resorting to converting all fields, I first tested changing just the field in question, but it didn't help.
I think that even when I changed all fields, Message Broker is still using the standard Idoc.mxsd internally to parse as Bytes only.
I have one more message-set field: Byte Alignment, set to "1 byte".
What is the significance of this when Length Units is set to "Characters"?

I appreciate any suggestions to resolve this

Thanks

fjb_saper

Posted: Mon Aug 03, 2009 2:51 pm Post subject:

Grand High Poobah

Joined: 18 Nov 2003
Posts: 20763
Location: LI,NY

You could also try to translate to CCSID 1200 and parse with that.
This way all characters take 2 bytes... Again, you would have to change your message set to parse in characters and not in bytes.

However, it would give you 2 bytes per character far more uniformly.
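
A quick way to compare byte widths is the Python sketch below. It assumes CCSID 1200 can be approximated by Python's `utf-16-be` codec, and the sample characters are chosen only for illustration.

```python
# Bytes per character in UTF-8 (CCSID 1208) versus UTF-16.
samples = ["A", "\u00d1", "\u20ac", "\ufffd", "\U0001F600"]

widths = {
    s: (len(s.encode("utf-8")), len(s.encode("utf-16-be")))
    for s in samples
}

for s, (u8, u16) in widths.items():
    print(f"U+{ord(s):04X}: utf-8={u8} bytes, utf-16={u16} bytes")
```

Every character in the Basic Multilingual Plane, including Ñ, € and U+FFFD, takes exactly 2 bytes in UTF-16. Characters beyond it (the last sample) take 4, so the width is uniform only if such characters never appear in the data.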

I used ccsid 500 because it was the ccsid of origin and it is single byte for the stuff I had 2 bytes in utf-8 (500 = ebcdic international).

You found out that it is extremely difficult to go to a single byte CCSID if it does not cater for all the characters you have in your multibyte bitstream...

Wishing you the best....

_________________
MQ & Broker admin

rekarm01

Posted: Tue Aug 04, 2009 4:59 pm Post subject: Re: Convert double-byte to single-byte characters in Broker

Grand Master

Joined: 25 Jun 2008
Posts: 1415

sunny_30 wrote:

I ran into a different issue after the recent change.
It doesn't look that simple to process multi-byte characters using the IDOC parser.

The IDOC domain is deprecated. Perhaps issues like this are one reason why.

sunny_30 wrote:

But in the conversion from UTF-8 -> CP1252 -> UTF-8, I'm having a problem with the character: �
Its UTF-8 hex representation is 'EFBFBD'.
The character gets converted to the single byte hex '1A' (it looks like this is the substitution byte used when no conversion rule is found!).
So, on conversion back to UTF-8, '1A' doesn't get converted back to 'EFBFBD'.

'�' (<U+FFFD>) is a replacement character; it serves a similar purpose as the ASCII substitution character. While it is possible that SAP is generating this intentionally, it is much more likely introduced accidentally, due to an improper conversion upstream. During the process of conversion from ??? -> UTF-8 -> CP1252 -> UTF-8, the ??? is wrong.

One hypothetical scenario involves two variants of the Windows code page 1252:

ccsid=1252: MS Windows Latin-1, Version 1 (X'80'=undefined)
ccsid=5348: MS Windows Latin-1, Version 2 (X'80'='€')

If the original message encoded '€' using ccsid=5348, but the initial conversion is from ccsid=1252 to ccsid=1208 (UTF-8), then the '€' character will be interpreted as undefined, and replaced with the Unicode '�' character, rather than the Unicode '€' character, causing additional problems downstream. Information is lost; the broker can't fix this.
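
The scenario can be imitated in Python, with some caveats. Python has no codec for the older version of code page 1252, so the function `decode_cp1252_v1` below is a hypothetical stand-in that simply treats X'80' as undefined. Python's own `cp1252` codec, which maps X'80' to '€', plays the part of ccsid=5348.

```python
def decode_cp1252_v1(payload: bytes) -> str:
    """Hypothetical stand-in for ccsid=1252 (version 1): X'80' is undefined."""
    return "".join(
        "\ufffd" if b == 0x80 else bytes([b]).decode("cp1252", errors="replace")
        for b in payload
    )


original = b"100 \x80"  # "100 €" as written under ccsid=5348

right = original.decode("cp1252").encode("utf-8")   # correct source code page
wrong = decode_cp1252_v1(original).encode("utf-8")  # wrong source code page

print(right.hex().upper())  # 31303020E282AC
print(wrong.hex().upper())  # 31303020EFBFBD
```

The second result is valid UTF-8, so nothing downstream can tell that the '�' was once a '€'.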

If the MQ message has multiple headers, (for example: MQMD header, MQSAPH header, IDOC body, ...), then it's important to examine the CodedCharSetId and Format for each header; these fields normally describe only the next item in the chain.

Whether or not the message set needs to count by bytes, or by characters, depends on whether SAP padded the fields to a fixed number of bytes, or a fixed number of characters.
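
The difference between the two padding conventions shows up as soon as a field contains a multi-byte character. Below is a small Python sketch with an invented 8-unit field; the helper names `pad_chars` and `pad_bytes` are hypothetical and only illustrate the two cases.

```python
def pad_chars(value: str, width: int) -> bytes:
    """Pad to a fixed number of characters, then encode as UTF-8."""
    return value.ljust(width).encode("utf-8")


def pad_bytes(value: str, width: int) -> bytes:
    """Encode as UTF-8, then pad to a fixed number of bytes."""
    return value.encode("utf-8").ljust(width, b" ")


name = "MU\u00d1OZ"             # "MUÑOZ": 5 characters, 6 bytes in UTF-8

by_chars = pad_chars(name, 8)   # 8 characters, 9 bytes
by_bytes = pad_bytes(name, 8)   # 8 bytes, 7 characters

print(len(by_chars), len(by_bytes))  # 9 8

# A parser that cuts 8 bytes from the character-padded field leaves one
# byte over, and that byte shifts every field that follows.
leftover = by_chars[8:]
print(leftover)                      # b' '
```

If the sender pads by characters and the message set counts bytes (or the other way round), each multi-byte character moves all later field boundaries by one or more bytes.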
