catshout
Posted: Tue Oct 01, 2013 12:27 am Post subject: Batch processing of incoming files with delimited records
Acolyte
Joined: 15 May 2012 Posts: 57
Hi folks,
I'm required to do the following:
1. A file with a large number of delimited records (up to 1.2 million) is copied to an input directory.
2. The file contains a certain number of header records and a certain number of trailer records.
3. The header and trailer records need to be captured first.
4. Each line of the file (except the header and trailer lines) has to be enriched with the previously captured header and trailer, and sent to a message queue for further processing.
My ideas are the following:
1. An "input" flow recognizes the file and reads it with a FileInput node set to delimited record processing. Only the End of Data terminal is wired. The output here gives me the value of InputLocalEnvironment.File.Record, i.e. the number of records (see the sketch at the end of this post). This message is sent to a trigger queue, and the file is moved to the mqsiarchive directory after reading.
2. A "split" flow, triggered by the message in the trigger queue, then knows the number of header rows, the number of trailer rows, and the number of records in the file when it starts. It can access the file in the mqsiarchive directory.
My questions are the following:
1. When a large file is being copied to the input directory, the "input" flow starts immediately and sends the wrong number of records to the trigger queue. Is there an option on the FileInput node to wait until the file copy/write has finished? Or do I need to copy the file and rename it once the copy is complete?
2. For the "split" flow I'm looking for ideas on how to capture the header and trailer records first without reading the whole file into memory. For example, does the FileRead node allow capturing specific records from anywhere in a file?
Any comments/ideas are highly appreciated.
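For illustration, a minimal sketch of the Compute node behind the End of Data terminal — the module and element names are made up, and XMLNSC is assumed as the domain of the trigger message:
Code:
-- Hypothetical Compute node wired to the FileInput node's End of Data terminal
CREATE COMPUTE MODULE InputFlow_BuildTrigger
  CREATE FUNCTION Main() RETURNS BOOLEAN
  BEGIN
    -- Keep the headers of the end-of-data message
    SET OutputRoot = InputRoot;
    -- On End of Data, InputLocalEnvironment.File.Record holds the
    -- number of the last record read, i.e. the record count of the file
    SET OutputRoot.XMLNSC.File.RecordCount = InputLocalEnvironment.File.Record;
    -- Carry the file name along so the "split" flow can find it in mqsiarchive
    SET OutputRoot.XMLNSC.File.Name = InputLocalEnvironment.File.Name;
    RETURN TRUE;
  END;
END MODULE;

An MQOutput node wired after this Compute node would then put the trigger message on the trigger queue.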
smdavies99
Posted: Tue Oct 01, 2013 1:03 am Post subject:
Jedi Council
Joined: 10 Feb 2003 Posts: 6076 Location: Somewhere over the Rainbow this side of Never-never land.
Most of your questions have been asked and answered before.
For example, writing a large file takes time, so to stop the broker from picking it up before the write is complete, write the file to a DIFFERENT directory or with a different file pattern. When the write is done, simply rename/move the file to what you are expecting. The rename takes far less time and ensures that the broker does not pick up a half-written file.
Please search this forum and I am sure you will get most of your questions answered.
_________________
WMQ User since 1999
MQSI/WBI/WMB/'Thingy' User since 2002
Linux user since 1995
Every time you reinvent the wheel the more square it gets (anon). If in doubt think and investigate before you ask silly questions.
Simbu
Posted: Tue Oct 01, 2013 1:18 am Post subject:
Master
Joined: 17 Jun 2011 Posts: 289 Location: Tamil Nadu, India
smdavies99 wrote:
Most of your questions have been asked and answered before.
For example, writing a large file takes time, so to stop the broker from picking it up before the write is complete, write the file to a DIFFERENT directory or with a different file pattern. When the write is done, simply rename/move the file to what you are expecting. The rename takes far less time and ensures that the broker does not pick up a half-written file.
Please search this forum and I am sure you will get most of your questions answered.

Hi smdavies99, do we really need to write to a different directory or with a different file pattern and then rename/move the file when dealing with a large file? I read that:
Quote:
While a file is being processed, the file system is used to lock the file. As a result, other programs (including other execution groups) are prevented from reading, writing, or deleting the file while it is being processed by the file nodes.

So if the file system locks the file while it is being written into the directory that the FileInput node listens on, will the node still pick the file up? I need to try this option and see how it behaves.
catshout
Posted: Tue Oct 01, 2013 2:01 am Post subject:
Acolyte
Joined: 15 May 2012 Posts: 57
Thanks both for the reply. For now I'll use a 'mv' after the file has been copied, so the file only becomes visible once the copy process has finished.
Still looking for ideas on the other part ..
kimbert
Posted: Tue Oct 01, 2013 3:51 am Post subject:
Jedi Council
Joined: 29 Jul 2003 Posts: 5542 Location: Southampton
Quote:
3. The header records and trailer records need to be captured first.

So you will need to parse the entire, huge file twice: once to capture the header and trailer, and once again to enrich each record.
Quote:
does the FileRead node allow capturing specific records from anywhere in a file?

No.
Your best option is to use a technique similar to this one:
http://www.ibm.com/developerworks/websphere/library/techarticles/0505_storey/0505_storey.html
This keeps memory usage to a minimum by parsing only one record at a time.
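Roughly, the heart of that pattern in ESQL — a sketch only, updated for XMLNSC, with illustrative element names (Batch, Record) rather than the article's exact code:
Code:
-- Rough sketch of the record-at-a-time pattern; element names are illustrative
CREATE COMPUTE MODULE SplitLargeMessage
  CREATE FUNCTION Main() RETURNS BOOLEAN
  BEGIN
    DECLARE rec REFERENCE TO InputRoot.XMLNSC.Batch.Record[1];
    WHILE LASTMOVE(rec) DO
      -- Build one small output message per record
      SET OutputRoot.Properties = InputRoot.Properties;  -- plus any headers (MQMD etc.)
      SET OutputRoot.XMLNSC.Record = rec;
      PROPAGATE TO TERMINAL 'out' DELETE NONE;
      MOVE rec NEXTSIBLING REPEAT TYPE NAME;  -- step to the next record
    END WHILE;
    RETURN FALSE;  -- everything has already been propagated
  END;
END MODULE;

Since InputRoot is read-only in a Compute node, the full pattern also copies the message somewhere mutable (the Environment tree, for example) so that each record can be deleted after it has been propagated and memory stays bounded; the article covers that part.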
Bear in mind, though, that this article is very old: it uses the XML domain and message sets, whereas you will be using XMLNSC and DFDL.
_________________
Before you criticize someone, walk a mile in their shoes. That way you're a mile away, and you have their shoes too.
catshout
Posted: Tue Oct 01, 2013 5:49 am Post subject:
Acolyte
Joined: 15 May 2012 Posts: 57
Many thanks for the reply.
I was just trying the following approach:
1. The "input" flow reads the file record by record and appends the offset of each record to a shared string variable. The records themselves are not processed any further here. The string variable becomes part of the outgoing trigger message, which finally looks like:
Code:
<File>
  <RecordCount>7803</RecordCount>
  <RecordOffsets>|0|59|462|866</RecordOffsets>
</File>
The example above corresponds to a file whose content looks like:
Code:
030020001.201201 2012010120120131201202131544
1:30020004:0:Credit Voice:...<to be continued till position 865>
9000000005898
2. The "split" flow can pick a specific row by looking at this 'RecordOffsets' string: for the last row it picks the offset value 866, for the row before that the value 462, and so on. The offset can be set before the FileRead node at OutputLocalEnvironment.Destination.File.Offset; the FileRead node will then pick exactly the requested row (see the sketch below).
The problem is that the RecordOffsets string could become very large, depending on the number of rows. In real life there are fewer than 10 header rows and 10 trailer rows, so it might be better to store only the first 10 and the last 10 offset values in this string.
The file needs to be parsed twice anyway, but the flow would never consume much memory, as the whole file content is never held in memory at once.
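For illustration, a minimal sketch of a Compute node that could sit directly in front of the FileRead node and set that offset — the names are made up, and the node's Compute mode is assumed to include LocalEnvironment:
Code:
-- Hypothetical Compute node placed just before the FileRead node
CREATE COMPUTE MODULE SplitFlow_SetReadOffset
  CREATE FUNCTION Main() RETURNS BOOLEAN
  BEGIN
    -- Pass the trigger message and local environment through unchanged
    SET OutputRoot = InputRoot;
    SET OutputLocalEnvironment = InputLocalEnvironment;
    -- Override the read position; 866 is the last row's offset
    -- in the example above
    SET OutputLocalEnvironment.Destination.File.Offset = 866;
    RETURN TRUE;
  END;
END MODULE;

In a real flow the 866 would of course be taken from the RecordOffsets string rather than hard-coded.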
Looking forward to any feedback.
mqjeff
Posted: Tue Oct 01, 2013 6:36 am Post subject:
Grand Master
Joined: 25 Jun 2008 Posts: 17447
I thought the techniques here had been moved into product samples on large messages and file handling?
catshout
Posted: Tue Oct 01, 2013 10:04 pm Post subject:
Acolyte
Joined: 15 May 2012 Posts: 57
I've created the "input" flow as follows:
The flow captures the file with the FileInput node, record by record (delimited). For each record it checks the offset, and at the end it creates a structure with the record counter and the last 10 offsets, like:
Code:
<File>
  <RecordCount>859466</RecordCount>
  <RecordTrailerOffsets>
    <row>242205755</row>
    <row>242205474</row>
    <row>242205194</row>
    <row>242204912</row>
    <row>242204631</row>
    <row>242204351</row>
    <row>242204069</row>
    <row>242203772</row>
    <row>242203471</row>
    <row>242203191</row>
  </RecordTrailerOffsets>
</File>
This structure is sent as a trigger message to the next flow, which can now create the trailer first, using the FileRead node with the offset set before reading. The header isn't a big deal, as the first n records are captured as the header anyway.
Each further record can then be enriched with the header and trailer while reading the file sequentially, even with another FileRead node, and propagated to the output queue for per-record transaction processing. A sketch of the offset bookkeeping in the "input" flow follows below.
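For what it's worth, the per-record offset bookkeeping in the "input" flow could look roughly like this — a sketch only, with made-up names; a SHARED row variable is used to keep state across the records of one file:
Code:
-- Hypothetical per-record Compute node in the "input" flow; it keeps a
-- rolling window of the last 10 record offsets in a SHARED row variable
CREATE COMPUTE MODULE InputFlow_TrackOffsets
  DECLARE lastOffsets SHARED ROW;
  CREATE FUNCTION Main() RETURNS BOOLEAN
  BEGIN
    BEGIN ATOMIC
      -- Remember this record's offset ...
      CREATE LASTCHILD OF lastOffsets.Data NAME 'row'
        VALUE InputLocalEnvironment.File.Offset;
      -- ... and keep only the newest 10
      IF CARDINALITY(lastOffsets.Data.row[]) > 10 THEN
        DELETE FIELD lastOffsets.Data.row[1];
      END IF;
    END ATOMIC;
    RETURN FALSE;  -- individual records are not propagated here
  END;
END MODULE;

The End of Data branch can then copy lastOffsets.Data into the <RecordTrailerOffsets> element of the trigger message and clear the shared variable for the next file.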
One word on performance: the "input" flow needed approx. 6 minutes for the 859,466 records (approx. 2,400 records per second). This is absolutely fine, as the whole process is batch processing without any special performance requirement.