MQSeries.net :: View topic - Mis-encoded case


Mis-encoded case - Why no error is thrown?

ghoshly

Posted: Thu Mar 06, 2014 5:01 am Post subject: Mis-encoded case - Why no error is thrown?

Partisan

Joined: 10 Jan 2008
Posts: 333

If the source system specifies a character set / code page but sends a character in the message data that is not in that code page, the broker still does not generate any exception.

Message Broker represents that character as '?'. We can see this in the Trace node output and in the MQ output.

Is there any specific reason why Message Broker does not generate an exception in such cases?

Example: the source system specifies ISO-8859-1 and sends the character ฿ (Thai currency symbol Baht, U+0E3F), which ends up represented as '?'.

ghoshly

Posted: Thu Mar 06, 2014 5:03 am Post subject: WMB 8.0.0.2 WMQ 7.5.1 AIX 7.1

Partisan

Joined: 10 Jan 2008
Posts: 333

Environment details - WMB 8.0.0.2 WMQ 7.5.1 AIX 7.1

Tibor

Posted: Thu Mar 06, 2014 5:06 am Post subject: Re: Mis-encoded case - Why no error is thrown?

Grand Master

Joined: 20 May 2001
Posts: 1033
Location: Hungary

ghoshly wrote:

Source system specifies ISO-8859-1 and sends the character ฿ (Thai currency symbol Baht, U+0E3F), which ends up represented as '?'

It is a little bit strange, because ISO-8859-1 / Latin-1 has no representation for this character.

fjb_saper

Posted: Thu Mar 06, 2014 6:20 am Post subject:

Grand High Poobah

Joined: 18 Nov 2003
Posts: 20763
Location: LI,NY

You may see the representation as a "?". I would make sure to check the hex value, which could well be something else. Code page translation sometimes changes a character it has no representation for into a different character... (It's in the rules). The sad thing is that this may lead to an invalid XML document, as the replacement character is not always valid XML...

Are you sure the broker's representation is wrong? Remember the broker uses UTF internally. It might be that the output is wrong because that character is not supported in the output CCSID... or it just might be that your display program cannot display the output correctly...
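A quick way to check the hex value instead of trusting a viewer is to dump the raw bytes. The following is only a minimal sketch in Python, not a broker facility; `hex_dump` and the sample payload are hypothetical names made up for illustration.

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Return offset, hex bytes and printable ASCII ('.' for anything else)."""
    pad = width * 3 - 1  # width of the hex column when a row is full
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02X}" for b in chunk)
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08X}  {hex_part:<{pad}}  {text_part}")
    return "\n".join(lines)


# Hypothetical payload: the Baht sign (U+0E3F) encoded as UTF-8.
payload = "Price: \u0e3f100".encode("utf-8")
print(hex_dump(payload))
```

If the dump shows `E0 B8 BF`, the character survived and only the display is wrong; if it shows `3F`, the data really contains a question mark.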

_________________
MQ & Broker admin

smdavies99

Posted: Thu Mar 06, 2014 6:39 am Post subject:

Jedi Council

Joined: 10 Feb 2003
Posts: 6076
Location: Somewhere over the Rainbow this side of Never-never land.

fjb_saper wrote:

Are you sure the broker's representation is wrong? Remember the broker uses UTF internally.

Shouldn't that be UNICODE?

_________________
WMQ User since 1999
MQSI/WBI/WMB/'Thingy' User since 2002
Linux user since 1995

Every time you reinvent the wheel the more square it gets (anon). If in doubt think and investigate before you ask silly questions.

zpat

Posted: Thu Mar 06, 2014 6:44 am Post subject:

Jedi Council

Joined: 19 May 2001
Posts: 5867
Location: UK

UTF-16 is Unicode, so is UTF-8 for that matter.
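To illustrate: Unicode assigns the code point, and UTF-8, UTF-16 and UTF-32 are different byte encodings of that same code point. A small Python sketch, using the Baht sign from the original question:

```python
baht = "\u0e3f"  # THAI CURRENCY SYMBOL BAHT

code_point = f"U+{ord(baht):04X}"
utf8_hex = baht.encode("utf-8").hex().upper()
utf16_hex = baht.encode("utf-16-be").hex().upper()
utf32_hex = baht.encode("utf-32-be").hex().upper()

print(code_point)  # U+0E3F
print(utf8_hex)    # E0B8BF
print(utf16_hex)   # 0E3F
print(utf32_hex)   # 00000E3F

# Every one of these byte sequences decodes back to the same character.
assert bytes.fromhex(utf8_hex).decode("utf-8") == baht
assert bytes.fromhex(utf16_hex).decode("utf-16-be") == baht
```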
_________________
Well, I don't think there is any question about it. It can only be attributable to human error. This sort of thing has cropped up before, and it has always been due to human error.

ghoshly

Posted: Thu Mar 06, 2014 6:45 am Post subject: Actual value / Representation

Partisan

Joined: 10 Jan 2008
Posts: 333

My thinking is along the same lines, and that is why I mentioned representation.

The question is: in a mis-encoded scenario, i.e. the source system sends characters from one character set but declares a different one (other than Unicode UTF-8 or UTF-16), should the broker throw an exception when it receives the message through an input node or writes it to an output node? We do copy the Properties folder from input to output.

We do get an XML writing exception when we do not set the character set values in the output properties.

In a previous thread we heard from kimbert that the broker uses UTF-16 internally.

ghoshly

Posted: Thu Mar 06, 2014 8:07 am Post subject: Its not just representation :-(

Partisan

Joined: 10 Jan 2008
Posts: 333

Hello... I checked the output with a File Output node as well.

If UTF-8 is used by the source system, then we get the correct hex value in the file, i.e. E0B8BF.

However, if an incorrect character set is declared on the input, we get only 3F, which is the hex value of the '?' character.
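Both hex values can be reproduced outside the broker. The sketch below uses Python's own codecs, so it only shows the general mechanism (a converter that substitutes '?' for a character the target code page lacks); it is an assumption, not a confirmed description, that the conversion in this flow applies the same kind of substitution.

```python
baht = "\u0e3f"  # THAI CURRENCY SYMBOL BAHT

# UTF-8 can represent the character: three bytes.
utf8_hex = baht.encode("utf-8").hex().upper()
print(utf8_hex)  # E0B8BF

# A strict conversion to ISO-8859-1 refuses, because the character is not there.
try:
    baht.encode("iso-8859-1")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
print(strict_failed)  # True

# A lenient conversion substitutes '?' (0x3F) instead of raising an error.
replaced_hex = baht.encode("iso-8859-1", errors="replace").hex().upper()
print(replaced_hex)  # 3F

# After substitution the original character cannot be recovered.
assert bytes.fromhex(replaced_hex).decode("iso-8859-1") == "?"
```

Once the byte is 3F, no downstream component can tell it apart from a genuine question mark, which is why no exception appears later in the flow.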

kimbert

Posted: Thu Mar 06, 2014 12:12 pm Post subject:

Jedi Council

Joined: 29 Jul 2003
Posts: 5543
Location: Southampton

Help! You guys really need to learn the basics about character encodings - and it is not hard! zpat is the only person to make a 100% correct statement so far on this thread.

Quote:

If the source system specifies a character set / code page but sends a character in the message data that is not in that code page, the broker still does not generate any exception.

Please explain exactly why you expected an exception. Please use Google to research ISO8859-1 and UTF-8 before you reply.
_________________
Before you criticize someone, walk a mile in their shoes. That way you're a mile away, and you have their shoes too.

fatherjack

Posted: Fri Mar 07, 2014 3:22 am Post subject:

Knight

Joined: 14 Apr 2010
Posts: 522
Location: Craggy Island

kimbert wrote:

You guys really need to learn the basics about character encodings - and it is not hard!

Maybe not for someone with your experience and expertise in the subject, but given the number of threads that have continually appeared on this forum over the years, maybe it's a bit harder to grasp than you think. Is there a "Character Encoding for Dummies" anywhere?
_________________
Never let the facts get in the way of a good theory.

ghoshly

Posted: Fri Mar 07, 2014 4:20 am Post subject: Conversion

Partisan

Joined: 10 Jan 2008
Posts: 333

I can understand how we get the hex values in UTF-8. What I do not understand is how a character that is not present in the incoming character set is transformed / converted to '?'.

For example, I have tried sending the ฿ character with both Shift-JIS and ISO-8859-1.

Is '?' the default character in such cases?

I apologize for my limited knowledge.

Gralgrathor

Posted: Fri Mar 07, 2014 7:18 am Post subject: Re: Conversion

Master

Joined: 23 Jul 2009
Posts: 297

ghoshly wrote:

Question: *IS* it converted to '?', or is that just the way your viewer displays the actual character? Does a hexdump of the bitstream show you the Unicode hex for '?'?
_________________
A measure of wheat for a penny, and three measures of barley for a penny; and see thou hurt not the oil and the wine.

Vitor

Posted: Fri Mar 07, 2014 7:37 am Post subject: Re: Conversion

Grand High Poobah

Joined: 11 Nov 2005
Posts: 26093
Location: Texas, USA

ghoshly wrote:

Is '?' the default character in such cases?

Either that or '.', depending on the software being used to view the data.

I echo the comments of others that how the data is displayed does not affect the underlying hex stream.
_________________
Honesty is the best policy.
Insanity is the best defence.

kimbert

Posted: Fri Mar 07, 2014 9:14 am Post subject:

Jedi Council

Joined: 29 Jul 2003
Posts: 5543
Location: Southampton

Quote:

Is there a "Character Encoding for Dummies" anywhere?

At risk of being accused of being a fanboy for this fella: http://www.joelonsoftware.com/articles/Unicode.html

Plus of course, Wikipedia, which has very good pages on Unicode, encodings and character sets.

The facts are:
- ISO-8859-1 is a single-byte encoding, and every byte value is a valid character. It is therefore impossible to get an 'invalid character' when *reading* ISO-8859-1. You might get unexpected characters, though, especially if the bytes are actually representing UTF-8 and not ISO-8859-1!
- ISO-8859-1 can represent exactly 256 characters. Unicode has room for over a million (1,114,112 code points). So it's very easy to get 'Unconvertable character' errors when *writing* ISO-8859-1.

- UTF-8 can represent any character in the Unicode character set using between one and four bytes. It is therefore impossible to get an 'Unconvertable character' error when *writing* UTF-8.
- UTF-8 is exactly the same as ASCII (and ISO-8859-*) for the first 128 values (0 to 127). After that, characters are encoded as sequences of two or more bytes, and those sequences *must* conform to the UTF-8 specification. So it's very easy to get 'Unconvertable character' errors when *reading* UTF-8, especially if the input bytes are actually representing ISO-8859-1 characters in the 128-255 range. You still might get lucky if the sequence of bytes happens to match a valid UTF-8 byte sequence; in that case you will just get the wrong characters.

It should be clear from the above that knowing the correct encoding ( same as CCSID ) is absolutely essential. It's quite possible to get incorrect results without realizing it.
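The four facts above can be checked with a few lines of Python. This is a sketch using Python's codecs rather than the broker's converters, and `can_decode` / `can_encode` are hypothetical helper names, not part of any product API.

```python
def can_decode(data: bytes, charset: str) -> bool:
    """True if the bytes are a valid sequence in the given charset."""
    try:
        data.decode(charset)
        return True
    except UnicodeDecodeError:
        return False


def can_encode(text: str, charset: str) -> bool:
    """True if every character of the text exists in the given charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


# Reading ISO-8859-1 never fails: all 256 byte values map to a character.
print(can_decode(bytes(range(256)), "iso-8859-1"))  # True

# ...but UTF-8 bytes read as ISO-8859-1 silently give the wrong characters.
mojibake = "\u00e9".encode("utf-8").decode("iso-8859-1")
print(mojibake == "\u00c3\u00a9")  # True: e-acute turned into two characters

# Writing ISO-8859-1 fails for anything outside its 256 characters.
print(can_encode("\u0e3f", "iso-8859-1"))  # False

# Writing UTF-8 works for the Baht sign and for a four-byte character alike.
print(can_encode("\u0e3f \U0001F600", "utf-8"))  # True

# Reading ISO-8859-1 bytes in the 128-255 range as UTF-8 usually fails...
latin1_cafe = "caf\u00e9".encode("iso-8859-1")  # b'caf\xe9'
print(can_decode(latin1_cafe, "utf-8"))  # False

# ...unless the bytes happen to form a valid UTF-8 sequence (the "lucky" case).
lucky = "\u00c3\u00a9".encode("iso-8859-1")  # b'\xc3\xa9'
print(lucky.decode("utf-8") == "\u00e9")  # True: no error, different text
```

Only two of the six cases raise an error; the rest succeed, and two of those quietly produce text the sender never intended.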
_________________
Before you criticize someone, walk a mile in their shoes. That way you're a mile away, and you have their shoes too.

ghoshly

Posted: Mon Mar 10, 2014 1:15 am Post subject: Thanks..

Partisan

Joined: 10 Jan 2008
Posts: 333

Thanks a lot, kimbert and all, for your responses.

I do see the hex code '3F' wherever I see '?', and that is why I mentioned conversion.

I use Notepad++ or EditPlus to view these messages. From your work experience, can you suggest any tool that handles different encodings and character sets better?
