Author |
Message
|
venkat_chekka |
Posted: Tue Dec 14, 2010 12:13 pm Post subject: Creating XML message from PDF file input |
|
|
Apprentice
Joined: 14 Apr 2006 Posts: 37
|
Has anyone implemented converting a PDF message to XML using Message Broker? Is it possible?
I am trying to do a POC. Any input will be greatly appreciated. |
|
|
 |
bsiggers |
Posted: Tue Dec 14, 2010 12:21 pm Post subject: What format? |
|
|
Acolyte
Joined: 09 Dec 2010 Posts: 53 Location: Vancouver, BC
|
Anything is possible - but it is not clear what you are trying to accomplish.
You could just encode the PDF file using Base64 and stick it in some XML field, for example - done, your PDF file is now in an XML message. |
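A minimal sketch of the Base64 idea above, in plain Java. The class name `PdfXmlEnvelope`, the element names and the `encoding`/`mediaType` attributes are all assumptions made for illustration; nothing here comes from a broker API, and the same logic could equally sit inside a Java-based node.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

/** Hypothetical helper: carries raw PDF bytes inside a small XML envelope. */
class PdfXmlEnvelope {

    /** Builds an XML document holding the PDF as Base64 text (element names are illustrative). */
    static String wrap(byte[] pdfBytes, String fileName) {
        String encoded = Base64.getEncoder().encodeToString(pdfBytes);
        return "<Document>"
             + "<FileName>" + escape(fileName) + "</FileName>"
             + "<Content encoding=\"base64\" mediaType=\"application/pdf\">" + encoded + "</Content>"
             + "</Document>";
    }

    /** Recovers the original bytes from an envelope produced by wrap(). */
    static byte[] unwrap(String xml) {
        int start = xml.indexOf('>', xml.indexOf("<Content")) + 1;
        int end = xml.indexOf("</Content>");
        return Base64.getDecoder().decode(xml.substring(start, end));
    }

    /** Escapes the XML-special characters that may appear in a file name. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        // Stand-in bytes; a real flow would pass the message body here.
        byte[] fakePdf = "%PDF-1.4 sample".getBytes(StandardCharsets.ISO_8859_1);
        String xml = wrap(fakePdf, "policy.pdf");
        System.out.println(xml);
        System.out.println("Round trip intact: " + Arrays.equals(fakePdf, unwrap(xml)));
    }
}
```

Note that this only transports the PDF unchanged; it does not expose anything inside the document as XML fields.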
|
|
 |
smdavies99 |
Posted: Tue Dec 14, 2010 12:35 pm Post subject: |
|
|
 Jedi Council
Joined: 10 Feb 2003 Posts: 6076 Location: Somewhere over the Rainbow this side of Never-never land.
|
Slightly OT but multi-part MIME Messages are ideal for this sort of thing. The properties of the elements describe the type of data in the BLOB part. _________________ WMQ User since 1999
MQSI/WBI/WMB/'Thingy' User since 2002
Linux user since 1995
Every time you reinvent the wheel the more square it gets (anon). If in doubt think and investigate before you ask silly questions. |
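For illustration, the multi-part MIME suggestion could look roughly like the sketch below: one part carries XML metadata, the other carries the PDF as a Base64 body, and each part's `Content-Type` header describes what it holds. The class name `MimePdfMessage`, the boundary string and the sample metadata are assumptions, and this is hand-built text rather than output from any broker parser.

```java
import java.util.Base64;

/** Hypothetical builder for a two-part MIME message: XML metadata plus a PDF attachment. */
class MimePdfMessage {
    static final String CRLF = "\r\n";

    /** The boundary must not occur inside either part; the caller picks it. */
    static String build(String xmlMetadata, byte[] pdfBytes, String boundary) {
        StringBuilder sb = new StringBuilder();
        sb.append("MIME-Version: 1.0").append(CRLF);
        sb.append("Content-Type: multipart/mixed; boundary=\"").append(boundary).append('"').append(CRLF);
        sb.append(CRLF);

        // Part 1: the XML that describes the document.
        sb.append("--").append(boundary).append(CRLF);
        sb.append("Content-Type: text/xml; charset=UTF-8").append(CRLF);
        sb.append(CRLF);
        sb.append(xmlMetadata).append(CRLF);

        // Part 2: the PDF itself, Base64-encoded with MIME line wrapping.
        sb.append("--").append(boundary).append(CRLF);
        sb.append("Content-Type: application/pdf").append(CRLF);
        sb.append("Content-Transfer-Encoding: base64").append(CRLF);
        sb.append(CRLF);
        sb.append(Base64.getMimeEncoder().encodeToString(pdfBytes)).append(CRLF);

        // Closing delimiter.
        sb.append("--").append(boundary).append("--").append(CRLF);
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] fakePdf = {0x25, 0x50, 0x44, 0x46}; // the bytes of "%PDF"
        System.out.print(build("<Policy><Source>scan</Source></Policy>", fakePdf, "example-boundary"));
    }
}
```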
|
|
 |
venkat_chekka |
Posted: Tue Dec 14, 2010 12:35 pm Post subject: |
|
|
Apprentice
Joined: 14 Apr 2006 Posts: 37
|
I want to read specific data from the PDF message and populate an XML message with that information.
Is that possible in Message Broker? |
|
|
 |
smdavies99 |
Posted: Tue Dec 14, 2010 12:42 pm Post subject: |
|
|
 Jedi Council
Joined: 10 Feb 2003 Posts: 6076 Location: Somewhere over the Rainbow this side of Never-never land.
|
Get yourself a PostScript interpreter and away you go.
You need to convert the PDF to text. Then you can manipulate the text within.
Look at the features of Ghostscript. It might do what you want. I know it can combine PDFs. |
|
|
 |
Vitor |
Posted: Tue Dec 14, 2010 12:42 pm Post subject: |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
venkat_chekka wrote: |
Is it possible in Message broker?? |
How would you achieve this using another application language (like C#)? Code the same thing in WMB. _________________ Honesty is the best policy.
Insanity is the best defence. |
|
|
 |
bsiggers |
Posted: Tue Dec 14, 2010 12:50 pm Post subject: |
|
|
Acolyte
Joined: 09 Dec 2010 Posts: 53 Location: Vancouver, BC
|
Google is your friend, as usual. Searching for 'Java PDF' came up with this as the first hit:
http://pdfbox.apache.org/
Java, open source - sounds like it would be worth a try at least; it has the ability to extract content from PDF files. |
|
|
 |
venkat_chekka |
Posted: Tue Dec 14, 2010 1:14 pm Post subject: |
|
|
Apprentice
Joined: 14 Apr 2006 Posts: 37
|
Can we implement this using the Message Broker ESQL language? |
|
|
 |
mqjeff |
Posted: Tue Dec 14, 2010 1:34 pm Post subject: |
|
|
Grand Master
Joined: 25 Jun 2008 Posts: 17447
|
Implement WHAT?
What do you *need* to DO with the PDF document?
Do you need to parse, extract, and understand it?
Or do you just need to deal with it as a chunk?
Regardless, you should strongly consider asking for help locally, for example by discussing this with your team lead. |
|
|
 |
Vitor |
Posted: Tue Dec 14, 2010 1:44 pm Post subject: |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
mqjeff wrote: |
What do you *need* to DO with the PDF document?
|
venkat_chekka wrote: |
I want to read some particular data from pdf message and need to populate xml message using that pdf information |
|
|
 |
kimbert |
Posted: Tue Dec 14, 2010 2:06 pm Post subject: |
|
|
 Jedi Council
Joined: 29 Jul 2003 Posts: 5542 Location: Southampton
|
There is no built-in parser for PDF documents in WMB. However, you can parse a PDF document using a third-party Java library and build a message tree from the extracted information.
Quote: |
Can we implement this using Message Broker ESQL language? |
Technically, ESQL can do anything you like, so the answer is 'yes'. But why would you bother, when there is at least one ready-made solution available in Java? |
|
|
 |
venkat_chekka |
Posted: Tue Dec 14, 2010 8:16 pm Post subject: |
|
|
Apprentice
Joined: 14 Apr 2006 Posts: 37
|
Actually, I cannot use any third-party Java libraries, so I am looking for ESQL-only code to implement this.
Here is the complete information about my POC.
The PDF file contains various information, but I want to extract only one text value from it.
Example: the PDF message contains the information below.
Policy Number:xxxxx1234
I want to take only the Policy Number value from the PDF file.
Is there any way to extract this information using ESQL code? |
|
|
 |
smdavies99 |
Posted: Tue Dec 14, 2010 10:29 pm Post subject: |
|
|
 Jedi Council
Joined: 10 Feb 2003 Posts: 6076 Location: Somewhere over the Rainbow this side of Never-never land.
|
venkat_chekka wrote: |
Actually I can not use any third party Java libraries and looking for only ESQL code to implement this.
|
Ah, the good old 'Not invented here, and whose backside are we going to kick if it goes wrong' excuse.
Then get hold of as many examples of the PDF as you can and scan them, looking for the particular bits of PostScript that hold the data you are looking for.
If there is a clear pattern, then just extract that and pull out the bit of data you want. Remember that the PDF will be in BLOB form, so your substring search will need to look for the hex representation of a series of characters.
Then convert the extracted part to a CHAR, and finally parse it to get the data you need.
If there is NO CLEAR PATTERN in the PostScript, then you will need to use a third-party Java library, unless you really, really want to put yourself through pain and torture and write your own parser.
The deep joys of Systems Integration and the real world of PHBs and their bright ideas... |
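The byte-pattern approach described above can be sketched as follows, in Java using only the standard library. It rests on strong assumptions that should be checked against real sample files: that the label and its value sit next to each other as plain single-byte text in the file, and that the value consists only of letters and digits. Many PDFs compress their content streams or split text into fragments, and a scanned image contains no text at all, so in those cases this search finds nothing. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical scanner: looks for a text label inside a raw byte array. */
class PolicyNumberScanner {

    /** Returns the index of the first occurrence of needle in haystack, or -1. */
    static int indexOf(byte[] haystack, byte[] needle) {
        outer:
        for (int i = 0; i <= haystack.length - needle.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    /**
     * Returns the run of letters and digits that directly follows the label,
     * or null when the label is not present in the bytes.
     */
    static String extractAfter(byte[] blob, String label) {
        byte[] needle = label.getBytes(StandardCharsets.ISO_8859_1);
        int at = indexOf(blob, needle);
        if (at < 0) {
            return null;
        }
        int start = at + needle.length;
        int end = start;
        while (end < blob.length && Character.isLetterOrDigit(blob[end] & 0xFF)) {
            end++;
        }
        // Convert only the extracted slice from bytes to characters.
        return new String(blob, start, end - start, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Stand-in for an uncompressed PDF text fragment; real files may look different.
        byte[] blob = "BT /F1 12 Tf (Policy Number:xxxxx1234) Tj ET"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(extractAfter(blob, "Policy Number:"));
    }
}
```

The same three steps (locate the byte pattern, cut out the slice, convert it to characters) map onto ESQL's `POSITION`, `SUBSTRING` and `CAST ... AS CHARACTER CCSID ...` on a BLOB, although that variant is not shown here.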
|
|
 |
venkat_chekka |
Posted: Thu Dec 16, 2010 1:29 pm Post subject: |
|
|
Apprentice
Joined: 14 Apr 2006 Posts: 37
|
My data requires optical character recognition; that is, the data is part of an image in the PDF message.
So can I access this kind of image data from the PDF message in Message Broker? |
|
|
 |
mqjeff |
Posted: Thu Dec 16, 2010 2:08 pm Post subject: |
|
|
Grand Master
Joined: 25 Jun 2008 Posts: 17447
|
Yes, you can access this kind of data from Message Broker.
You can read the PDF document as a string of bytes, and then write any code you want to process those bytes and extract meaning.
I expect that it would take a competent and well-trained programmer at least six months, more likely a year, to complete a meaningful and robust OCR system purely in ESQL.
Your management is asking you to do something VERY VERY HARD. You need to tell them that you need to use either a third-party Java library or a third-party PHP library to process and parse the PDF document, and then a third-party image library to perform OCR on the data.
If you are a strong C programmer, you can create your own User Defined Node to do this, as well - probably again using a third-party library.
Or you need to quit your job and get a new one with better or smarter management. |
|
|
 |
|