Author |
Message
|
ghoshly |
Posted: Thu Mar 06, 2014 5:01 am Post subject: Mis-encoded case - Why no error is thrown? |
|
|
Partisan
Joined: 10 Jan 2008 Posts: 333
|
If the source system specifies a character set / code page but sends a character in the message data that is not in that code page, the broker still does not generate any exception.
Message Broker represents the character as '?'. We can see that in the Trace node output or the MQ output.
Is there any specific reason why Message Broker does not generate an exception in such cases?
Example: the source system specifies ISO-8859-1 and sends the character ฿ (Thai currency symbol Baht, U+0E3F), which ends up represented as '?' |
|
|
 |
ghoshly |
Posted: Thu Mar 06, 2014 5:03 am Post subject: WMB 8.0.0.2 WMQ 7.5.1 AIX 7.1 |
|
|
Partisan
Joined: 10 Jan 2008 Posts: 333
|
Environment details - WMB 8.0.0.2 WMQ 7.5.1 AIX 7.1 |
|
|
 |
Tibor |
Posted: Thu Mar 06, 2014 5:06 am Post subject: Re: Mis-encoded case - Why no error is thrown? |
|
|
 Grand Master
Joined: 20 May 2001 Posts: 1033 Location: Hungary
|
ghoshly wrote: |
Source System specifies ISO-8859-1 and sends character ฿ (Thai Currency symbol Baht) U+0E3F which gets the representation as '?' |
That is a little strange, because ISO-8859-1 (Latin-1) has no representation for this character. |
|
|
 |
fjb_saper |
Posted: Thu Mar 06, 2014 6:20 am Post subject: |
|
|
 Grand High Poobah
Joined: 18 Nov 2003 Posts: 20756 Location: LI,NY
|
You may see the representation as a "?". I would check the hex value, which could well be something else. Code page translation sometimes changes a character it has no representation for into a different character... (It's in the rules). The sad thing is that this may lead to an invalid XML document, as the replacement character is not always valid XML...
Are you sure the broker's representation is wrong? Remember the broker uses UTF internally. It might be that the output is wrong because that character is not supported in the output CCSID... or it just might be that your display program cannot display the output correctly...  _________________ MQ & Broker admin |
|
|
 |
smdavies99 |
Posted: Thu Mar 06, 2014 6:39 am Post subject: |
|
|
 Jedi Council
Joined: 10 Feb 2003 Posts: 6076 Location: Somewhere over the Rainbow this side of Never-never land.
|
fjb_saper wrote: |
Are you sure the broker's representation is wrong? Remember the broker uses UTF internally. |
Shouldn't that be UNICODE?
 _________________ WMQ User since 1999
MQSI/WBI/WMB/'Thingy' User since 2002
Linux user since 1995
Every time you reinvent the wheel the more square it gets (anon). If in doubt think and investigate before you ask silly questions. |
|
|
 |
zpat |
Posted: Thu Mar 06, 2014 6:44 am Post subject: |
|
|
 Jedi Council
Joined: 19 May 2001 Posts: 5866 Location: UK
|
UTF-16 is Unicode, and so is UTF-8 for that matter. _________________ Well, I don't think there is any question about it. It can only be attributable to human error. This sort of thing has cropped up before, and it has always been due to human error. |
|
|
 |
ghoshly |
Posted: Thu Mar 06, 2014 6:45 am Post subject: Actual value / Representation |
|
|
Partisan
Joined: 10 Jan 2008 Posts: 333
|
My thinking is along the same lines, which is why I mentioned representation.
The question is: in a mis-encoded scenario, i.e. the source system sends characters from one character set but declares a different one (other than Unicode UTF-8 or UTF-16), should the broker throw an exception when it receives the message through an input node or writes it to an output node? We do copy the Properties folder from input to output.
We do receive an XML writing exception when we do not put character set values in the output properties.
In a previous thread we heard from kimbert that the broker internally uses UTF-16. |
|
|
 |
ghoshly |
Posted: Thu Mar 06, 2014 8:07 am Post subject: Its not just representation :-( |
|
|
Partisan
Joined: 10 Jan 2008 Posts: 333
|
Hello... I checked the output with a File Output node as well.
If the source system uses UTF-8, we get the correct hex value in the file, i.e. E0B8BF.
However, if an improper character set is declared on the input, we get only 3F, which is the hex value of the '?' character. |
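The two hex values reported above can be reproduced outside the broker. The following is a minimal Python sketch, not a description of how Message Broker or the sending application converts data: it assumes the conversion step uses a "replace unmappable characters" policy, for which Python's `errors="replace"` stands in. The variable names are illustrative only.

```python
BAHT = "\u0e3f"  # THAI CURRENCY SYMBOL BAHT

# Encoding to UTF-8 gives the three bytes seen in the file output.
utf8_bytes = BAHT.encode("utf-8")
print(utf8_bytes.hex().upper())  # E0B8BF

# A strict encoder refuses: the character is not in ISO-8859-1.
strict_error = None
try:
    BAHT.encode("iso-8859-1")
except UnicodeEncodeError as exc:
    strict_error = exc
    print("strict encode failed:", exc.reason)

# A lenient encoder (assumed policy) substitutes '?', i.e. byte 3F.
latin1_replaced = BAHT.encode("iso-8859-1", errors="replace")
sjis_replaced = BAHT.encode("shift_jis", errors="replace")
print(latin1_replaced.hex().upper())  # 3F
print(sjis_replaced.hex().upper())    # 3F
```

Once the substitution has happened, the original character cannot be recovered from the 3F byte, so any later component sees a perfectly valid '?'.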
|
|
 |
kimbert |
Posted: Thu Mar 06, 2014 12:12 pm Post subject: |
|
|
 Jedi Council
Joined: 29 Jul 2003 Posts: 5542 Location: Southampton
|
Help! You guys really need to learn the basics about character encodings - and it is not hard! zpat is the only person to make a 100% correct statement so far on this thread.
Quote: |
If the source system specifies a character set / code page but sends a character in the message data that is not in that code page, the broker still does not generate any exception. |
Please explain exactly why you expected an exception. Please use Google to research ISO-8859-1 and UTF-8 before you reply. _________________ Before you criticize someone, walk a mile in their shoes. That way you're a mile away, and you have their shoes too. |
|
|
 |
fatherjack |
Posted: Fri Mar 07, 2014 3:22 am Post subject: |
|
|
 Knight
Joined: 14 Apr 2010 Posts: 522 Location: Craggy Island
|
kimbert wrote: |
You guys really need to learn the basics about character encodings - and it is not hard! |
Maybe not for someone with your experience and expertise in the subject, but given the number of threads that have continually appeared on this forum over the years, maybe it's a bit harder to grasp than you think. Is there a "Character Encoding for Dummies" anywhere? _________________ Never let the facts get in the way of a good theory. |
|
|
 |
ghoshly |
Posted: Fri Mar 07, 2014 4:20 am Post subject: Conversion |
|
|
Partisan
Joined: 10 Jan 2008 Posts: 333
|
I can understand how we get the hex values with UTF-8. What I do not understand is how a character that is not present in the incoming character set gets transformed / converted to '?'.
For example, I have tried sending the ฿ character with Shift-JIS and with ISO-8859-1.
Is '?' the default character in such cases?
I apologize for my limited knowledge. |
|
|
 |
Gralgrathor |
Posted: Fri Mar 07, 2014 7:18 am Post subject: Re: Conversion |
|
|
Master
Joined: 23 Jul 2009 Posts: 297
|
ghoshly wrote: |
I can understand how we get the hex values with UTF-8. What I do not understand is how a character that is not present in the incoming character set gets transformed / converted to '?'.
For example, I have tried sending the ฿ character with Shift-JIS and with ISO-8859-1.
Is '?' the default character in such cases?
I apologize for my limited knowledge. |
Question: *IS* it converted to '?', or is that just the way your viewer displays the actual character? Does a hexdump of the bitstream show you the Unicode hex for '?'? _________________ A measure of wheat for a penny, and three measures of barley for a penny; and see thou hurt not the oil and the wine. |
|
|
 |
Vitor |
Posted: Fri Mar 07, 2014 7:37 am Post subject: Re: Conversion |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
ghoshly wrote: |
Is '?' is default character in such cases? |
Either that or '.', depending on the software being used to view the data.
I echo comments of others that how the data is represented does not affect the underlying hex stream. _________________ Honesty is the best policy.
Insanity is the best defence. |
|
|
 |
kimbert |
Posted: Fri Mar 07, 2014 9:14 am Post subject: |
|
|
 Jedi Council
Joined: 29 Jul 2003 Posts: 5542 Location: Southampton
|
Quote: |
Is there a "Character Encoding for Dummies" anywhere? |
At risk of being accused of being a fanboy for this fella: http://www.joelonsoftware.com/articles/Unicode.html
Plus of course, Wikipedia, which has very good pages on Unicode, encodings and character sets.
The facts are:
- ISO-8859-1 is a single-byte encoding, and every byte value is a valid character. It is therefore impossible to get an 'invalid character' when *reading* ISO-8859-1. You might get unexpected characters, though. Especially if the bytes are actually representing UTF-8 and not ISO-8859-1!
- ISO-8859-1 can represent exactly 256 characters. Unicode has room for over a million (1,114,112 code points). So it's very easy to get 'Unconvertable character' errors when *writing* ISO-8859-1.
- UTF-8 can represent any character in the Unicode character set using between one and four bytes. It is therefore impossible to get an 'Unconvertable character' error when *writing* UTF-8.
- UTF-8 is exactly the same as ASCII ( and ISO-8859-* ) for the first 128 values (0-127). After that, characters are encoded as sequences of two or more bytes and those sequences *must* conform to the UTF-8 specification. So it's very easy to get 'Unconvertable character' errors when *reading* UTF-8 - especially if the input bytes are actually representing ISO-8859-1 characters in the 128-255 range. Although you still might get lucky if the sequence of bytes happens to match a valid UTF-8 byte sequence. In that case you will just get the wrong characters.
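The four facts above can be checked with a few lines of Python. This is only a sketch using Python's built-in codecs, which are assumed here to behave like any strict converter; it says nothing about the broker's own conversion code. The helper names `can_decode` and `can_encode` are made up for this example.

```python
def can_decode(data: bytes, encoding: str) -> bool:
    """Return True if `data` is a valid byte sequence in `encoding`."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


BAHT = "\u0e3f"  # Thai Baht sign

# Fact 1: reading ISO-8859-1 never fails, but may give the wrong characters.
every_byte = bytes(range(256))
reads_ok = can_decode(every_byte, "iso-8859-1")
mojibake = BAHT.encode("utf-8").decode("iso-8859-1")  # UTF-8 bytes misread

# Fact 2: writing ISO-8859-1 fails for characters outside its 256.
writes_baht_latin1 = can_encode(BAHT, "iso-8859-1")

# Fact 3: writing UTF-8 works for any Unicode character, up to 4 bytes.
highest = chr(0x10FFFF).encode("utf-8")

# Fact 4: reading UTF-8 fails on ISO-8859-1 bytes in the 128-255 range ...
e_acute_latin1 = "\u00e9".encode("iso-8859-1")  # single byte E9
reads_e9_as_utf8 = can_decode(e_acute_latin1, "utf-8")
# ... unless the bytes happen to form a valid UTF-8 sequence.
lucky = "\u00c3\u00a9".encode("iso-8859-1").decode("utf-8")

print(reads_ok, repr(mojibake), writes_baht_latin1, len(highest),
      reads_e9_as_utf8, repr(lucky))
```

Here `mojibake` comes out as three Latin-1 characters (one per UTF-8 byte) with no error raised, and `lucky` decodes to a single, different character than the two that were written.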
It should be clear from the above that knowing the correct encoding ( same as CCSID ) is absolutely essential. It's quite possible to get incorrect results without realizing it. _________________ Before you criticize someone, walk a mile in their shoes. That way you're a mile away, and you have their shoes too. |
|
|
 |
ghoshly |
Posted: Mon Mar 10, 2014 1:15 am Post subject: Thanks.. |
|
|
Partisan
Joined: 10 Jan 2008 Posts: 333
|
Thanks a lot kimbert & all for your responses.
I do see the hex code '3F' wherever I see '?', and that is why I asked about conversion.
I use Notepad++ or EditPlus to view these messages. Can you suggest any helpful tool from your work experience that copes better with different encodings and character sets? |
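No specific tool was recommended in the thread. As a stopgap, a raw byte view avoids any guessing by the editor about the encoding. The sketch below is a hypothetical standard-library-only helper (`hexdump` is a name chosen for this example), and the sample payload is invented for illustration.

```python
def hexdump(data: bytes, width: int = 16) -> str:
    """Format bytes as offset, hex values, and printable ASCII ('.' otherwise)."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02X}" for b in chunk)
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        # Pad the hex column so the text column lines up on short rows.
        lines.append(f"{offset:08X}  {hex_part:<{width * 3 - 1}}  {text_part}")
    return "\n".join(lines)


# Invented sample: a small element containing the Baht sign, encoded as UTF-8.
sample = "<amt>\u0e3f</amt>".encode("utf-8")
print(hexdump(sample))
```

In the output, a correctly encoded Baht sign shows up as `E0 B8 BF`, while a substituted character shows up as a single `3F`, regardless of what glyph a text editor would draw.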
|
|
 |
|