Author |
Message
|
catshout |
Posted: Thu Jan 28, 2016 5:40 am Post subject: Charset recognition |
|
|
Acolyte
Joined: 15 May 2012 Posts: 57
|
Dear community,
we're receiving EDIFACT messages from external partners. The encoding declaration (the charset, e.g. UNOC) is sometimes misleading, so we need to detect the correct encoding of the whole message within an IIB flow.
Is there any built-in capability in IIB that can automatically determine the encoding of incoming data when the descriptive metadata is missing or misleading, and then set the CCSID of the message based on that detection? Or is there an elegant programmatic way, either in ESQL or Java?
Best regards
- Gerald |
|
fjb_saper |
Posted: Thu Jan 28, 2016 5:59 am Post subject: |
|
|
 Grand High Poobah
Joined: 18 Nov 2003 Posts: 20756 Location: LI,NY
|
If the sender has declared the correct CCSID when sending the message,
you should be able to find the CCSID in
InputRoot.Properties.CodedCharSetId (from memory)
Otherwise have the senders fix their code!
Have fun  _________________ MQ & Broker admin |
|
Vitor |
Posted: Thu Jan 28, 2016 6:15 am Post subject: Re: Charset recognition |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
catshout wrote: |
Is there any built-in capability in IIB that can automatically determine the encoding of incoming data when the descriptive metadata is missing or misleading |
If the sender has not supplied the CCSID (or just allowed it to default to some value) or has set it to an incorrect or misleading value there's no easy (elegant) way for IIB to determine it's been had.
If the sender has not supplied the CCSID (or just allowed it to default to some value) or has set it to an incorrect or misleading value this is known by most professionals as "doing it wrong" and the sending code needs to be corrected. _________________ Honesty is the best policy.
Insanity is the best defence. |
|
catshout |
Posted: Thu Jan 28, 2016 6:17 am Post subject: |
|
|
Acolyte
Joined: 15 May 2012 Posts: 57
|
That's the problem. The sender sends the plain data as it is and we pick it up. Neither a CCSID nor any other information about the character encoding is supplied. |
|
Vitor |
Posted: Thu Jan 28, 2016 6:30 am Post subject: |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
catshout wrote: |
That's the problem. |
Are you paying these people, or in some other form of monetary relationship? Is there any way they can be encouraged to get their act in gear, as this is the simplest solution to your problem.
catshout wrote: |
The sender sends the plain data as it is and we pick it up. |
Sends it how? MQ? Web? File? Carrier pigeon? If it's the latter, could the pigeon look up the encoding before it flies off? _________________ Honesty is the best policy.
Insanity is the best defence. |
|
catshout |
Posted: Thu Jan 28, 2016 6:42 am Post subject: |
|
|
Acolyte
Joined: 15 May 2012 Posts: 57
|
Quote: |
Are you paying these people, or in some other form of monetary relationship? Is there any way they can be encouraged to get their act in gear, as this is the simplest solution to your problem. |
These people are sending EDIFACT ORDERS messages to our client, who supplies goods to all these partners. An EDIFACT ORDERS message has a header portion that describes the encoding of the part that follows, and this is set incorrectly in several cases. Sure, our client may ask his partners to send the correct encoding metadata, but if they don't, we have no other option.
Really cool idea
The data arrives over AS2/HTTP through a DataPower Gateway. IIB receives the data over MQ. Maybe DataPower is able to set the encoding based on content.
I've looked around. Some Java libraries, such as Apache Tika, provide charset detection. Maybe that's a way forward if we can't convince the sending party to set the metadata correctly. |
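Before trying any content-based detection, it may help to read what the interchange header actually declares, so that the declared value can be logged and compared with whatever a detector later reports. The following is a minimal sketch using only the JDK, not an IIB or Tika API. The class name `UnbSyntaxId` and its methods are hypothetical, and the identifier-to-charset table reflects my understanding of ISO 9735 and should be checked against the standard before use.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical helper: reads the syntax identifier (e.g. "UNOC") from the
// start of an EDIFACT interchange without trusting it.
class UnbSyntaxId {

    // Assumed mapping from syntax identifier to Java charset name
    // (verify against ISO 9735 before relying on it).
    static final Map<String, String> DECLARED = Map.of(
            "UNOA", "US-ASCII",
            "UNOB", "US-ASCII",
            "UNOC", "ISO-8859-1",
            "UNOD", "ISO-8859-2",
            "UNOE", "ISO-8859-5",
            "UNOF", "ISO-8859-7",
            "UNOW", "UTF-8");

    // Returns the syntax identifier, or null if no UNB segment is found.
    static String syntaxId(byte[] message) {
        // The header uses ASCII characters only, so a byte-preserving
        // ISO-8859-1 decode of the first bytes is safe for parsing.
        String head = new String(message, 0, Math.min(message.length, 64),
                StandardCharsets.ISO_8859_1);
        char component = ':';
        char element = '+';
        int pos = 0;
        if (head.startsWith("UNA") && head.length() >= 9) {
            // An optional UNA segment overrides the default separators.
            component = head.charAt(3);
            element = head.charAt(4);
            pos = 9;
            while (pos < head.length() && Character.isWhitespace(head.charAt(pos))) {
                pos++; // tolerate line breaks between segments
            }
        }
        if (!head.startsWith("UNB" + element, pos)) {
            return null;
        }
        int start = pos + 4;
        int end = start;
        while (end < head.length()
                && head.charAt(end) != component
                && head.charAt(end) != element) {
            end++;
        }
        return head.substring(start, end);
    }

    // Returns the declared charset name, or null if unknown.
    static String declaredCharsetName(byte[] message) {
        String id = syntaxId(message);
        return id == null ? null : DECLARED.get(id);
    }

    public static void main(String[] args) {
        byte[] sample = "UNB+UNOC:3+SENDER+RECEIVER"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(syntaxId(sample) + " -> " + declaredCharsetName(sample));
    }
}
```

This only reports what the sender claims. Deciding whether the claim is true still needs a content check or, better, a fix on the sending side.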
|
mqjeff |
Posted: Thu Jan 28, 2016 6:53 am Post subject: |
|
|
Grand Master
Joined: 25 Jun 2008 Posts: 17447
|
If the sending team isn't able to set the metadata right, then they are not producing correct messages.
And you should explain that you cannot accept incorrect messages because, you know, they're incorrect. _________________ chmod -R ugo-wx / |
|
timber |
Posted: Fri Jan 29, 2016 1:25 am Post subject: |
|
|
 Grand Master
Joined: 25 Aug 2015 Posts: 1292
|
Just to back up what others have said: there is no *generally reliable* method of auto-detecting the character encoding/CCSID. There are some heuristics. Some encodings (like UTF-8) are designed to be auto-detected, and the heuristics can be quite reliable. EBCDIC was not designed to be auto-detected, and any attempt to do so is likely to be fragile.
IF you understand exactly which encodings the sender might use,
AND you understand how those encodings differ
AND you think you could write an algorithm to reliably detect each one
THEN you could write yourself a CCSID-detection algorithm.
I wouldn't bother. I would go back to the sender and tell them to fix their software. |
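For readers who still want to try the heuristic route, here is a minimal sketch of the one check described above as fairly reliable: strict UTF-8 validation. It uses only the JDK. The class name `CharsetGuess` and the idea of passing in a fallback charset are assumptions for illustration. The sketch cannot tell one single-byte encoding from another (ISO-8859-1 versus ISO-8859-2, or any EBCDIC code page), so the fallback has to come from knowledge of the sender.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper illustrating a "strict UTF-8, else fallback" heuristic.
class CharsetGuess {

    // True if every byte is in the 7-bit ASCII range.
    static boolean isPureAscii(byte[] data) {
        for (byte b : data) {
            if (b < 0) {
                return false; // high bit set, so not ASCII
            }
        }
        return true;
    }

    // True if the bytes decode as UTF-8 with no malformed sequences.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Guess order: ASCII, then UTF-8, then whatever the caller assumes.
    static Charset guess(byte[] data, Charset fallback) {
        if (isPureAscii(data)) {
            return StandardCharsets.US_ASCII;
        }
        if (isValidUtf8(data)) {
            return StandardCharsets.UTF_8;
        }
        return fallback;
    }

    public static void main(String[] args) {
        byte[] latin1 = "M\u00fcller".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(guess(latin1, StandardCharsets.ISO_8859_1));
    }
}
```

Text in a single-byte encoding can occasionally happen to form valid UTF-8 sequences, so even this check is a probability, not a proof. Inside a flow, the result would still have to be translated into a CCSID before parsing.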
|