anild (Novice, Joined: 19 Sep 2007, Posts: 13)
Posted: Tue Apr 07, 2009 11:14 pm
Post subject: creating xml messages with data from multiple files
I have data in two different files and need to construct XML messages from them. The scenario is similar to constructing an XML message from two database tables. We can't use buffers because the files are large (e.g. 50MB each).
Can anyone suggest how to handle this kind of scenario?
Thanks in advance.
Vitor (Grand High Poobah, Joined: 11 Nov 2005, Posts: 26093, Location: Texas, USA)
Posted: Tue Apr 07, 2009 11:54 pm
Post subject: Re: creating xml messages with data from multiple files
anild wrote:
    we can't use buffers because the file sizes are large (ex: files size 50MB).

That's not a big file.
_________________
Honesty is the best policy.
Insanity is the best defence.
anild (Novice, Joined: 19 Sep 2007, Posts: 13)
Posted: Wed Apr 08, 2009 1:23 am
Post subject: Re: creating xml messages with data from multiple files
Vitor wrote:
    anild wrote:
        we can't use buffers because the file sizes are large (ex: files size 50MB).
    That's not a big file.

Invoking the MRM parser will take a very long time, because each file contains around 900,000+ records; storing them in system memory (setting them into an environment variable) consumes a great deal of memory.
The scenario is like this:
we receive 2 .txt files via FTP, each around 50MB with 900,000+ rows per file.
We need to construct XML messages from the 2 files, combining m*n data rows into individual XML messages.
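[The m*n requirement described above is a cartesian product of the two files' rows. A minimal sketch (Python used purely as illustration; the actual platform is WMB, and record formats are hypothetical one-line-per-record text) shows how the pairs can be generated one at a time instead of being built up in memory:]

```python
def stream_pairs(path_a, path_b):
    """Yield one (row_a, row_b) pair at a time instead of
    materialising all m*n combinations in memory."""
    with open(path_a, encoding="utf-8") as fa:
        for row_a in fa:
            # Re-scan file B for each row of A, so only two records
            # are ever held in memory at once.
            with open(path_b, encoding="utf-8") as fb:
                for row_b in fb:
                    yield row_a.rstrip("\n"), row_b.rstrip("\n")
```

With two 900,000-row inputs this still yields an enormous number of pairs, which is why the replies below question the cartesian product itself; the generator only avoids holding them all at once.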
WMBDEV1 (Sentinel, Joined: 05 Mar 2009, Posts: 888, Location: UK)
Posted: Wed Apr 08, 2009 1:25 am
Post subject: Re: creating xml messages with data from multiple files
Vitor (Grand High Poobah, Joined: 11 Nov 2005, Posts: 26093, Location: Texas, USA)
Posted: Wed Apr 08, 2009 1:33 am
Post subject: Re: creating xml messages with data from multiple files
anild wrote:
    storing into system memory (setting into environment variable) its very system memory consuming.

Is there no possibility that you could construct an interim message from File A (parsing MRM -> XML), then augment that message with the contents of File B?
WMBDEV1 (Sentinel, Joined: 05 Mar 2009, Posts: 888, Location: UK)
Posted: Wed Apr 08, 2009 1:33 am
Post subject: Re: creating xml messages with data from multiple files
anild wrote:
    The scenario will be like this:
    we are receiving 2 .txt files from FTP, each file size around 50MB and around 900,000+ rows in each file.
    We need to construct xml message using 2 files with m*n data rows into individual xml message.

Saw this after my other post.
This will be difficult. A quick sum of 50 * 50 (although, as the input is not XML, this number could be even bigger) gives you an output message/file of 2.5GB, and that doesn't include the memory required by the parsers. I think you're going to need some sort of streaming solution, especially on the output of the data, and use the link before for help.
Why does the output require a cartesian product? That sounds inefficient to me. Can it be changed or broken down into smaller chunks?
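[The quick sum above can be written out explicitly, along with the raw pair count it glosses over. Figures are taken from the thread and are purely illustrative back-of-envelope arithmetic:]

```python
# The post's quick sum: two 50MB inputs combined m*n-wise give an
# output on the order of 50 * 50 = 2500MB.
quick_estimate_gb = 50 * 50 / 1000  # 2.5

# The pair count itself is far larger: with ~900,000 rows per file,
# a full cartesian product yields 8.1e11 row pairs, which is why the
# post warns the real number "could be even bigger".
rows_per_file = 900_000
pair_count = rows_per_file * rows_per_file  # 810,000,000,000
```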
Vitor (Grand High Poobah, Joined: 11 Nov 2005, Posts: 26093, Location: Texas, USA)
Posted: Wed Apr 08, 2009 1:36 am
You could also cheat a little: use a feeder flow to put both files on queues, read (and parse) File A from its queue, use an MQGet to read (and parse) File B from its queue, then build the output message from the 2 trees.
This would leave the door open for the files to be processed on a record-by-record basis later. Which the users may one day want, and you might actually need!
Just a thought, untested, willing to be shot at by anyone on this; no liability accepted for loss, damage or headaches resulting.
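[The feeder-flow idea above can be sketched as plain code. This is a Python stand-in for the WMB plumbing (the real flow would use FileInput, MQOutput and MQGet nodes), pairing one record from each queue per output message so the record-by-record option stays open; the record and field names are invented for the sketch:]

```python
from queue import Queue

def merge_from_queues(queue_a, queue_b):
    """Stand-in for the feeder-flow pattern: File A's records arrive
    on one queue, File B's on another (the MQGet step), and the flow
    combines one record from each into an output message."""
    merged = []
    while not queue_a.empty() and not queue_b.empty():
        rec_a = queue_a.get()  # "read (and parse) File A from its queue"
        rec_b = queue_b.get()  # "use an MQGet to read File B from its queue"
        merged.append({"a": rec_a, "b": rec_b})  # manipulate the 2 trees
    return merged
```

The 1:1 pairing here is just the simplest case; the same shape extends to whatever combination logic the flow actually needs.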
WMBDEV1 (Sentinel, Joined: 05 Mar 2009, Posts: 888, Location: UK)
Posted: Wed Apr 08, 2009 1:44 am
Vitor wrote:
    You could also cheat a little, use a feeder flow to put both files in queues, read (and parse) File A from its queue, use an MQGet to read (and parse) File B from its queue then manipulate the output message from the 2 trees.
    This would leave the door open for the files to be processed on a record by record basis later.

Sounds doable, but you're still going to need to propagate chunks of the output a bit at a time, or else you will exhaust the heap.
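[The point about propagating chunks of output can be illustrated with a streaming writer: the document goes to disk one record element at a time, so the full multi-gigabyte output never sits in memory. Python again stands in for the WMB flow, and the element names are hypothetical placeholders:]

```python
from xml.sax.saxutils import escape

def write_pairs_as_xml(pairs, out_path, root="records"):
    """Stream the output XML document to disk one record element at a
    time, so the complete file never has to be held in memory."""
    with open(out_path, "w", encoding="utf-8") as out:
        out.write(f"<{root}>\n")
        for row_a, row_b in pairs:
            # Escape each field and flush one small element per pair.
            out.write(
                f"  <record><a>{escape(row_a)}</a>"
                f"<b>{escape(row_b)}</b></record>\n"
            )
        out.write(f"</{root}>\n")
```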
Vitor (Grand High Poobah, Joined: 11 Nov 2005, Posts: 26093, Location: Texas, USA)
Posted: Wed Apr 08, 2009 1:52 am
WMBDEV1 wrote:
    Vitor wrote:
        You could also cheat a little, use a feeder flow to put both files in queues, read (and parse) File A from its queue, use an MQGet to read (and parse) File B from its queue then manipulate the output message from the 2 trees.
        This would leave the door open for the files to be processed on a record by record basis later.
    Sounds doable but you're still going to need to propagate chunks of the output a bit at a time else you will exhaust the heap.

Even if you're not using Java nodes?
There's a central design issue here in that there are 2 files that apparently need to be processed simultaneously in memory, hence my question about augmentation.
There's an even more central question around why a site with WMB/WMQ is processing files, not messages, and in large chunks to boot.
This ties back to your point that, irrespective of memory, you'd be "better" to take the files and propagate them into individual messages for downstream processing. I sense the spectre of affinity rising up, where the records need to be combined and processed in order...
WMBDEV1 (Sentinel, Joined: 05 Mar 2009, Posts: 888, Location: UK)
Posted: Wed Apr 08, 2009 2:01 am
Vitor wrote:
    Even if you're not using Java nodes?

Absolutely; you're really going to struggle to allocate the 2.5GB (estimated) output without streaming bits of it out.

Vitor wrote:
    There's a central design issue here in that there are 2 files that apparently need to be processed simultaneously in memory, hence my question about augmentation.

Sure, it's not nice, but this is probably not the hardest thing to overcome in this case.

Quote:
    There's an even more central question around why a site with WMB/WMQ is processing files not messages, and in large chunks to boot.

Not one I can answer.

Quote:
    This ties back to your point that irrespective of memory you'd be "better" to take the files and propagate them into individual messages for downstream processing. I sense the spectre of affinity rising up, where the records need to be combined and processed in order...

Agree again; message segmentation or grouping may help.
The big issue for me remains the large size of the output, though.
Vitor (Grand High Poobah, Joined: 11 Nov 2005, Posts: 26093, Location: Texas, USA)
Posted: Wed Apr 08, 2009 2:07 am
WMBDEV1 wrote:
    Vitor wrote:
        Even if you're not using Java nodes?
    Absolutely, you're really going to struggle to allocate the 2.5GB (estimated) output without streaming bits of it out.

I didn't think so, and would welcome input from others here.

WMBDEV1 wrote:
    Vitor wrote:
        There's a central design issue here in that there are 2 files that apparently need to be processed simultaneously in memory, hence my question about augmentation.
    Sure, it's not nice but this is probably not the hardest thing to overcome in this case.

No, but it's a question the poster would find value in thinking about.

WMBDEV1 wrote:
    Quote:
        There's an even more central question around why a site with WMB/WMQ is processing files not messages, and in large chunks to boot.
    Not one I can answer.

Nor me - I was being rhetorical! Also trying to provoke thought in the poster.

WMBDEV1 wrote:
    Quote:
        This ties back to your point that irrespective of memory you'd be "better" to take the files and propagate them into individual messages for downstream processing. I sense the spectre of affinity rising up, where the records need to be combined and processed in order...
    Agree again, message segmentation may help.
    The big issue for me remains the large size of the output though.

I remain unconvinced it's that much of an issue, but I still agree with your point that propagation is the better way forward. Again, the poster needs to think about this, especially how to break the affinity I suspect exists.
mqpaul (Acolyte, Joined: 14 Jan 2008, Posts: 66, Location: Hursley, UK)
Posted: Wed Apr 08, 2009 5:03 am
Post subject: How about a good old two-file match?
This may be well off the mark, so my apologies in advance if it's useless. Also very sorry if this is teaching grandmother to suck eggs.
I'm assuming you're not using the Broker V6.1 file nodes. You might be able to use one file node to start your flow and process one record at a time, but I can't think of a way you could have two file nodes in the same flow, as Broker only provides a FileInput node, not a FileRead node. So I guess you're using Java.
One approach already mentioned is to copy one or both files' records to queues and then merge them, but for the purposes of this response that's just changing the plumbing for reading records.
I presume the application is similar in structure to a ledger update, with a master file holding account records and a transaction file holding 0 or more records for each account, possibly including transactions for new accounts. (Substitute other key fields for "account" if your application is not financial.)
The trick is always to sort the input (I don't have suggestions on how to do that from Broker) into account number/transaction date sequence. Then you read one record from each file. While both records have the same account number, apply the update (e.g. increment the ledger balance by the transaction amount) and read the next transaction record. When the ledger account number is lower, emit the ledger record you just processed and read the next ledger record. If the transaction account number is lower, you have a new account, so build a new ledger record and repeat the matching process for that account's transactions. The nice bit is you only need storage for three records: the ledger record you're working on, the next one you've read, and the transaction record you're working on.
You can find lots of stuff about this in old Data Processing textbooks - in particular Michael Jackson (not the moonwalker) and Structured Programming.
_________________
Paul
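[The classic two-file match described above can be sketched directly. This is a minimal Python illustration (not Broker code) assuming both inputs are already sorted by account key and records are hypothetical (account, amount) tuples; only the current record from each file is ever needed:]

```python
def two_file_match(ledger_rows, txn_rows):
    """Sequential master/transaction match over two inputs sorted by
    account key. Ledger rows are (account, balance); transaction rows
    are (account, amount)."""
    out = []
    ledgers = iter(ledger_rows)
    txns = iter(txn_rows)
    led = next(ledgers, None)
    txn = next(txns, None)
    while led is not None or txn is not None:
        if txn is None or (led is not None and led[0] < txn[0]):
            # No (more) transactions for this account: emit it as-is.
            out.append(led)
            led = next(ledgers, None)
        else:
            if led is not None and led[0] == txn[0]:
                account, balance = led        # matching ledger record
                led = next(ledgers, None)
            else:
                account, balance = txn[0], 0  # new account from a transaction
            while txn is not None and txn[0] == account:
                balance += txn[1]             # apply each transaction
                txn = next(txns, None)
            out.append((account, balance))    # emit the updated record
    return out
```

For example, a ledger of [("A", 100), ("C", 50)] matched against transactions [("A", 10), ("B", 5), ("C", -2)] emits A updated, B created, and C updated, in key order.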