Author |
Message
|
er_pankajgupta84 |
Posted: Mon Jul 27, 2009 5:42 pm Post subject: ISO-8859-1 vs UTF-8 encoding |
|
|
 Master
Joined: 14 Nov 2008 Posts: 203 Location: charlotte,NC, USA
|
Hi, I have a question regarding encoding:
Can we use UTF-8 (CCSID 1208) in every place where we can use ISO-8859-1 (CCSID 819)?
I am asking because, if UTF-8 can handle all possible characters, why do we insist on using other encodings? Is it just to save memory, since UTF-8 will take 2 bytes and other encodings like ISO-8859 will take 1 byte?
Please share your views.
Any link that explains this encoding stuff would be helpful. |
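On the byte counts mentioned in the question: UTF-8 is a variable-width encoding, so the cost per character is not fixed at 2 bytes. A minimal Python sketch, with the sample characters chosen purely for illustration, shows the widths:

```python
def utf8_width(ch: str) -> int:
    """Return how many bytes UTF-8 needs for one character."""
    return len(ch.encode("utf-8"))


# Illustrative samples: ASCII letter, Latin-1 letter (E acute), euro sign, an emoji.
samples = ["A", "\u00c9", "\u20ac", "\U0001F600"]
utf8_widths = [utf8_width(ch) for ch in samples]

# ISO-8859-1 always uses one byte, but it only covers 256 code points,
# so only the first two samples can be encoded with it at all.
latin1_widths = [len(ch.encode("iso-8859-1")) for ch in samples[:2]]

print(utf8_widths)
print(latin1_widths)
```

Plain ASCII text therefore costs the same in both encodings; only characters outside ASCII take more than one byte in UTF-8.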
|
smdavies99 |
Posted: Mon Jul 27, 2009 9:45 pm Post subject: |
|
|
 Jedi Council
Joined: 10 Feb 2003 Posts: 6076 Location: Somewhere over the Rainbow this side of Never-never land.
|
There have been many discussions about this and Kimbert (who really knows about this stuff) has in one of his recent posts given some websites that may/will help explain it all to you.
I struggled with this sort of thing and had to take some time to get it clear in my head.
Are your messages that big or the links between your systems so slow that you want to save a byte per character?
Remember that Broker uses Unicode internally, so economising in this way outside is possibly a 'false economy'.
That said, you really need to look at the flow of data from end to end and map the character set used and, more importantly, the contents of the data. This will probably show up some, if not many, misunderstandings, or even blindness to the problem that is coming their way.
Then (if possible and the PHBs allow it) decide on a standard character set encoding everywhere. Then your problem is done and dusted before it becomes critical.
_________________
WMQ User since 1999
MQSI/WBI/WMB/'Thingy' User since 2002
Linux user since 1995
Every time you reinvent the wheel the more square it gets (anon). If in doubt think and investigate before you ask silly questions. |
|
er_pankajgupta84 |
Posted: Tue Jul 28, 2009 5:17 am Post subject: |
|
|
 Master
Joined: 14 Nov 2008 Posts: 203 Location: charlotte,NC, USA
|
Thanks for your reply, but I don't get anything from it.
I asked one simple question:
Is ISO-8859-1 a subset of UTF-8? If yes, then CCSID 1208 should also be able to parse CCSID 819 characters.
If ISO-8859-1 is not a subset of UTF-8 then that's fine, because then we may have to use CCSID 819 to parse some Latin data.
Can anybody comment on this? |
|
smdavies99 |
Posted: Tue Jul 28, 2009 5:34 am Post subject: |
|
|
 Jedi Council
Joined: 10 Feb 2003 Posts: 6076 Location: Somewhere over the Rainbow this side of Never-never land.
|
There is a way to answer your question exactly.
You have to compare the character set maps for both character sets.
If every character in the ISO set is in UTF-8 including accented characters then the answer will be yes.
I have a set of printouts of a large variety of character sets which I use for this purpose. Sadly, the site I obtained them from 10+ years ago no longer exists. I originally used them for working out the mapping for Kazakh characters in Russian & Greek sets. |
|
WMBDEV1 |
Posted: Tue Jul 28, 2009 5:46 am Post subject: |
|
|
Sentinel
Joined: 05 Mar 2009 Posts: 888 Location: UK
|
er_pankajgupta84 wrote: |
Is ISO-8859-1 a subset of UTF-8?
|
The characters contained in ISO-8859-1 are also in UTF-8 so yes, in that sense it is a subset.
Quote: |
If yes, then CCSID 1208 should also be able to parse CCSID 819 characters.
|
No, for the same reason I gave you in your other thread. Although they contain the same characters, they have different physical representations for the extended ASCII set. Look at how a pound symbol is represented in ISO-8859-1 and then in UTF-8. |
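The pound symbol comparison can be sketched in a few lines of Python. This is only an illustration using Python's built-in codecs, not anything the broker itself runs:

```python
pound = "\u00a3"  # POUND SIGN, Unicode code point U+00A3

# The same character, written out as bytes in each encoding.
latin1_bytes = pound.encode("iso-8859-1")
utf8_bytes = pound.encode("utf-8")

# Plain ASCII characters are byte-for-byte identical in both encodings.
ascii_same = "A".encode("iso-8859-1") == "A".encode("utf-8")

print(latin1_bytes.hex(), utf8_bytes.hex(), ascii_same)
```

ISO-8859-1 stores the pound sign as the single byte A3, while UTF-8 stores it as the two bytes C2 A3, so a parser told to expect one encoding cannot read the other for anything outside ASCII.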
|
er_pankajgupta84 |
Posted: Tue Jul 28, 2009 5:54 am Post subject: |
|
|
 Master
Joined: 14 Nov 2008 Posts: 203 Location: charlotte,NC, USA
|
Thanks, your reply makes sense, as this is what I was expecting.
Now, in our system we are getting data from SAP, where the encoding defaults to UTF-8, and the default encoding in our system is also UTF-8, i.e. 1208. But when they send a message with Latin characters, those characters get converted into some other characters in the input queue.
Here is a pictorial representation:
SAP-PI (UTF-8) - MB - Queue (CCSID-1208)
We are able to see the correct message in SAP-PI, but by the time it reaches the queue it has changed.
Now if we set the CCSID in SAP-PI to 819, then we receive the proper message in the queue.
So when we have UTF-8 encoding in both systems, why do we need 819 to represent some characters? 1208 should cover it.
Any comments? |
|
WMBDEV1 |
Posted: Tue Jul 28, 2009 6:01 am Post subject: |
|
|
Sentinel
Joined: 05 Mar 2009 Posts: 888 Location: UK
|
So it sounds like SAP aren't really sending the message in UTF-8 despite claiming they are (by setting the CCSID to 1208 on the message).
Get the byte value(s) that are used to represent the £ symbol to confirm this. You can do this using a Trace node in the broker, or RFHUtil to read directly from the queue.
Quote: |
So when we have UTF-8 encoding in both systems, why do we need 819 to represent some characters? 1208 should cover it. |
Quite. If SAP really do write the message in UTF-8 you shouldn't need to change the CCSID to 819. As long as the correct CCSID is used all should be well. |
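Once the raw bytes have been captured, a quick offline check is to see whether they decode as strict UTF-8. The sketch below assumes the payload has already been saved as bytes; the two sample payloads and the helper names are hypothetical and stand in for whatever was read from the queue:

```python
def is_valid_utf8(payload: bytes) -> bool:
    """Return True if the bytes decode cleanly as strict UTF-8."""
    try:
        payload.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def hex_dump(payload: bytes) -> str:
    """Render the payload as space-separated hex byte values."""
    return " ".join(f"{b:02x}" for b in payload)


# Hypothetical captures of the text "£100" from two different senders.
from_utf8_sender = b"\xc2\xa3100"   # pound sign written as C2 A3
from_latin1_sender = b"\xa3100"     # pound sign written as a lone A3

print(hex_dump(from_utf8_sender), is_valid_utf8(from_utf8_sender))
print(hex_dump(from_latin1_sender), is_valid_utf8(from_latin1_sender))
```

A lone A3 byte is not valid UTF-8, so a message labelled 1208 that contains it was not really written in UTF-8. Note that the check only works in one direction: a payload of pure ASCII passes in both encodings and proves nothing either way.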
|
er_pankajgupta84 |
Posted: Tue Jul 28, 2009 6:05 am Post subject: |
|
|
 Master
Joined: 14 Nov 2008 Posts: 203 Location: charlotte,NC, USA
|
Thanks again, you are providing a great deal of info.
One more confirmation:
If the CCSID is 819 on the broker side and SAP has UTF-8 (1208), then when they send the message there would be an error, as SAP is sending it as UTF-8 but we are receiving it as ISO-8859-1.
Please correct me if I am wrong.
One more question: is it a good idea to set the default encoding (CCSID) to 1208 on the broker? |
|
WMBDEV1 |
Posted: Tue Jul 28, 2009 6:58 am Post subject: |
|
|
Sentinel
Joined: 05 Mar 2009 Posts: 888 Location: UK
|
er_pankajgupta84 wrote: |
If the CCSID is 819 on the broker side and SAP has UTF-8 (1208), then when they send the message there would be an error, as SAP is sending it as UTF-8 but we are receiving it as ISO-8859-1.
|
No, don't worry about the defaults used by the QM. Just use the CCSID on the message and the broker can receive it OK (provided it's actually been written correctly).
If SAP say they are sending it in one set but are actually writing it in another, you will have issues. The broker will be able to convert either set fine, as long as it's really written in the code page advertised (which it sounds like it may not be in this case).
Quote: |
One more question: is it a good idea to set the default encoding (CCSID) to 1208 on the broker? |
I guess it depends on the character set for your OS. UTF-8 sounds like a good default for *nix systems. |
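The point about the advertised code page can be modelled roughly in Python. This is a sketch of the general idea only: the `CCSID_TO_CODEC` table and the `convert` helper are hypothetical names for illustration, and real WMQ/WMB conversion is not implemented this way.

```python
# Hypothetical lookup from the two CCSIDs in this thread to Python codec names.
CCSID_TO_CODEC = {819: "iso-8859-1", 1208: "utf-8"}


def convert(payload: bytes, advertised_ccsid: int, target_ccsid: int) -> bytes:
    """Decode using the CCSID the message claims, then re-encode for the target."""
    text = payload.decode(CCSID_TO_CODEC[advertised_ccsid])
    return text.encode(CCSID_TO_CODEC[target_ccsid])


# Correctly labelled: UTF-8 bytes advertised as 1208 convert cleanly to 819.
good = convert(b"\xc2\xa3100", 1208, 819)

# Mislabelled: ISO-8859-1 bytes advertised as 1208 cannot be decoded.
try:
    convert(b"\xa3100", 1208, 819)
    mislabelled_ok = True
except UnicodeDecodeError:
    mislabelled_ok = False

# Mislabelled the other way: UTF-8 bytes advertised as 819 convert without
# any error, but each non-ASCII character turns into two wrong characters.
double_encoded = convert(b"\xc2\xa3100", 819, 1208)

print(good, mislabelled_ok, double_encoded)
```

The last case produces no error at all, only garbage characters, which would be consistent with characters being "converted into some other characters" somewhere between sender and queue.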
|
kimbert |
Posted: Tue Jul 28, 2009 8:50 am Post subject: |
|
|
 Jedi Council
Joined: 29 Jul 2003 Posts: 5542 Location: Southampton
|
Go back and read some more articles about
- Unicode and its encodings ( UTF-8, UTF-16, UTF-32 )
- Code pages and their history
Your questions show that you are still very confused. You will not understand the answers below until you understand the basics. |
|
er_pankajgupta84 |
Posted: Tue Jul 28, 2009 10:42 am Post subject: |
|
|
 Master
Joined: 14 Nov 2008 Posts: 203 Location: charlotte,NC, USA
|
Kimbert, learning is a continuous process. I have done a great deal of study on encoding and will continue to do so. Thanks for your comments and suggestions.
I conducted one small PoC:
I created two input files with the same contents:
1. with UTF-8 as encoding
2. with ISO-8859-1 as encoding.
My QM and Broker are running on AIX, so the CCSID of the QM defaults to 819.
When I passed the 2nd file it was parsed successfully, but when I sent the 1st file it failed.
So if the QM / Broker is capable of converting various coded character sets to its own encoding, and the character has a representation in both encodings, why did it fail when I sent the same data in UTF-8 encoding?
This is the string that is causing the problem: CAFÉ
The letter "É" has a representation in both 819 and 1208.
Any comments / suggestions would be helpful. |
|
kimbert |
Posted: Tue Jul 28, 2009 12:29 pm Post subject: |
|
|
 Jedi Council
Joined: 29 Jul 2003 Posts: 5542 Location: Southampton
|
Learning about Unicode is *not* a continuous process. It is a task which must be completed before any of the knowledge can be applied. You cannot stop half way through the 'process' and start experimenting - that will result in a lot of wasted time.
Quote: |
I created two input files with the same contents:
1. with UTF-8 as encoding
2. with ISO-8859-1 as encoding. |
Wrong! If you want to represent the same characters, the file contents will have to be different.
Quote: |
My QM and Broker are running on AIX, so the CCSID of the QM defaults to 819. |
The platform and its default CCSID are not relevant to this discussion - as you should understand by now.
Quote: |
When I passed the 2nd file it was parsed successfully, but when I sent the 1st file it failed. |
Exactly what I would expect if your input file contains ISO-8859-1 characters.
Quote: |
The letter "É" has a representation in both 819 and 1208. |
It does have a representation in code page 819.
Every character supported by Unicode has a representation in UTF-8.
Key question: What is the representation of É in code page ISO-8859-1?
What is its representation in UTF-8?
http://www.fileformat.info/info/unicode/char/00c8/index.htm
http://en.wikipedia.org/wiki/ISO/IEC_8859-1#Codepage_layout |
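The key question can also be checked directly with Python's built-in codecs. The sketch below is illustrative only; the variable names are made up for the example:

```python
e_acute = "\u00c9"  # LATIN CAPITAL LETTER E WITH ACUTE

code_point = ord(e_acute)                # the Unicode code point
latin1 = e_acute.encode("iso-8859-1")    # representation in code page 819
utf8 = e_acute.encode("utf-8")           # representation in CCSID 1208

# The same word saved as two files would give two different byte sequences.
word = "CAF" + e_acute
latin1_file = word.encode("iso-8859-1")
utf8_file = word.encode("utf-8")

print(hex(code_point), latin1.hex(), utf8.hex())
print(len(latin1_file), len(utf8_file), latin1_file == utf8_file)
```

The character is one byte (C9) in ISO-8859-1 and two bytes (C3 89) in UTF-8, so two files that hold the same text in the two encodings cannot have the same contents.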
|
WMBDEV1 |
Posted: Wed Jul 29, 2009 12:39 am Post subject: |
|
|
Sentinel
Joined: 05 Mar 2009 Posts: 888 Location: UK
|
A question I've posed a couple of times now but to no avail! |
|
er_pankajgupta84 |
Posted: Wed Jul 29, 2009 9:49 am Post subject: |
|
|
 Master
Joined: 14 Nov 2008 Posts: 203 Location: charlotte,NC, USA
|
Great deal of information, thanks.
Can someone throw some light on these questions:
1. When the CCSID of our QManager is set to 1208 (UTF-8), it should handle all languages. If not, what would the exception cases be?
2. If my QManager CCSID is 1208 and my source system encoding is different, then when it sends a message to our QManager, automatic conversion from that encoding to UTF-8 should take place.
3. If the CCSID is 819 on our QManager and the source system has 1208 (UTF-8) as its encoding, then when it sends a message to our QM, automatic conversion from UTF-8 to ISO-8859-1 (819) should also take place.
If someone can throw light on how this encoding and CCSID stuff works in the broker, it would clear up a lot of confusion.
One more question: we also have an "Encoding" property in the MQMD folder of each message. Does it have some significance, or will the CCSID simply determine the encoding? |
|
Vitor |
Posted: Wed Jul 29, 2009 10:15 am Post subject: |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
er_pankajgupta84 wrote: |
One more question: we also have an "Encoding" property in the MQMD folder of each message. Does it have some significance, or will the CCSID simply determine the encoding? |
Encoding refers to the handling of packed numbers (little endian / big endian) rather than character sets.
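As a rough illustration of the byte-order aspect only, the following Python sketch packs the same integer both ways. It uses the standard `struct` module and does not model the MQMD Encoding field itself:

```python
import struct

value = 1

# The same 32-bit integer laid out in the two byte orders.
big_endian = struct.pack(">i", value)
little_endian = struct.pack("<i", value)

# Reading big-endian bytes as if they were little-endian gives a wrong number,
# which is why the receiver needs to know how the sender laid out its numbers.
misread = struct.unpack("<i", big_endian)[0]

print(big_endian.hex(), little_endian.hex(), misread)
```

This is a separate concern from the CCSID: the character set says how text bytes map to characters, while the byte order says how numeric fields are laid out.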
As to your other questions:
1. It depends on the running server being able to handle the conversion, as in all cases.
2. Conversion is not automatic but has to be requested and is reliant on the message meeting the criteria.
3. See above.
All of this is documented. Remember that WMB uses WMQ like any application would.
_________________
Honesty is the best policy.
Insanity is the best defence. |
|