catshout
Posted: Tue Oct 01, 2013 12:27 am Post subject: Batch processing of incoming files with delimited records
Acolyte
Joined: 15 May 2012 Posts: 57
Hi folks,
I'm required to do the following:
1. A file with a large number of delimited records (up to 1.2 million) is copied to an input directory.
2. The file contains a certain number of header records and a certain number of trailer records.
3. The header and trailer records need to be captured first.
4. Each line of the file (except the header and trailer lines) has to be enriched with the previously captured header and trailer, and sent to a message queue for further processing.
My ideas are the following:
1. An "input" flow recognizes the file and reads it with a FileInput node set to delimited record processing. Only the End of Data terminal is wired. The output here gives me the value of InputLocalEnvironment.File.Record, i.e. the number of records (see the sketch at the end of this post). This message is sent to a trigger queue, and the file is moved to the mqsiarchive directory after reading.
2. A "split" flow, triggered by the message in the trigger queue, then knows the number of header rows, the number of trailer rows, and the number of records in the file when it starts. It can access the file in the mqsiarchive directory.
My questions are the following:
1. When a large file is being copied to the input directory, the "input" flow starts immediately and sends the wrong number of records to the trigger queue. Is there an option on the FileInput node to wait until the file copy/write has finished? Or do I need to copy the file and rename it once the copy is complete?
2. For the "split" flow I'm looking for ideas on how to capture the header and trailer records first without reading the whole file into memory. For example, does the FileRead node allow capturing specific records from anywhere in a file?
Any comments/ideas are highly appreciated.
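For illustration, a minimal sketch of the Compute node behind the End of Data terminal — the module and element names are made up, and XMLNSC is assumed as the domain of the trigger message:
Code:
-- Hypothetical Compute node wired to the FileInput node's End of Data terminal
CREATE COMPUTE MODULE InputFlow_BuildTrigger
  CREATE FUNCTION Main() RETURNS BOOLEAN
  BEGIN
    -- Keep the headers of the end-of-data message
    SET OutputRoot = InputRoot;
    -- On End of Data, InputLocalEnvironment.File.Record holds the
    -- number of the last record read, i.e. the record count of the file
    SET OutputRoot.XMLNSC.File.RecordCount = InputLocalEnvironment.File.Record;
    -- Carry the file name along so the "split" flow can find it in mqsiarchive
    SET OutputRoot.XMLNSC.File.Name = InputLocalEnvironment.File.Name;
    RETURN TRUE;
  END;
END MODULE;

An MQOutput node wired after this Compute node would then put the trigger message on the trigger queue.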
smdavies99
Posted: Tue Oct 01, 2013 1:03 am Post subject:
Jedi Council
Joined: 10 Feb 2003 Posts: 6076 Location: Somewhere over the Rainbow this side of Never-never land.
Most of your questions have been asked and answered before.
For example, writing a large file takes time, so to stop the broker from picking it up before the write is complete, write the file to a DIFFERENT directory or with a different file pattern. When the write is done, simply rename/move the file to what you are expecting. The rename takes far less time and ensures that the broker does not pick up a half-written file.
Please search this forum and I am sure you will get most of your questions answered.
_________________
WMQ User since 1999
MQSI/WBI/WMB/'Thingy' User since 2002
Linux user since 1995
Every time you reinvent the wheel the more square it gets (anon). If in doubt think and investigate before you ask silly questions.
Simbu
Posted: Tue Oct 01, 2013 1:18 am Post subject:
Master
Joined: 17 Jun 2011 Posts: 289 Location: Tamil Nadu, India
smdavies99 wrote:
Most of your questions have been asked and answered before.
For example, writing a large file takes time, so to stop the broker from picking it up before the write is complete, write the file to a DIFFERENT directory or with a different file pattern. When the write is done, simply rename/move the file to what you are expecting. The rename takes far less time and ensures that the broker does not pick up a half-written file.
Please search this forum and I am sure you will get most of your questions answered.

Hi smdavies99, do we really need to write to a different directory or with a different file pattern and then rename/move the file when dealing with a large file? I read that:
Quote:
While a file is being processed, the file system is used to lock the file. As a result, other programs (including other execution groups) are prevented from reading, writing, or deleting the file while it is being processed by the file nodes.

So if the file system locks the file while it is being written into the directory that the FileInput node listens on, will the node still pick the file up? I need to try this option and see how it behaves.
catshout
Posted: Tue Oct 01, 2013 2:01 am Post subject:
Acolyte
Joined: 15 May 2012 Posts: 57
Thanks both for the reply. For now I'll use a 'mv' after the file has been copied, so the file only becomes visible once the copy process has finished.
Still looking for ideas on the other part ..
kimbert
Posted: Tue Oct 01, 2013 3:51 am Post subject:
Jedi Council
Joined: 29 Jul 2003 Posts: 5542 Location: Southampton
Quote:
3. The header records and trailer records need to be captured first.

So you will need to parse the entire, huge file twice: once to capture the header and trailer, and once again to enrich each record.
Quote:
does the FileRead node allow capturing specific records from anywhere in a file?

No.
Your best option is to use a technique similar to this one:
http://www.ibm.com/developerworks/websphere/library/techarticles/0505_storey/0505_storey.html
This keeps memory usage to a minimum by parsing only one record at a time.
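Roughly, the heart of that pattern in ESQL — a sketch only, updated for XMLNSC, with illustrative element names (Batch, Record) rather than the article's exact code:
Code:
-- Rough sketch of the record-at-a-time pattern; element names are illustrative
CREATE COMPUTE MODULE SplitLargeMessage
  CREATE FUNCTION Main() RETURNS BOOLEAN
  BEGIN
    DECLARE rec REFERENCE TO InputRoot.XMLNSC.Batch.Record[1];
    WHILE LASTMOVE(rec) DO
      -- Build one small output message per record
      SET OutputRoot.Properties = InputRoot.Properties;  -- plus any headers (MQMD etc.)
      SET OutputRoot.XMLNSC.Record = rec;
      PROPAGATE TO TERMINAL 'out' DELETE NONE;
      MOVE rec NEXTSIBLING REPEAT TYPE NAME;  -- step to the next record
    END WHILE;
    RETURN FALSE;  -- everything has already been propagated
  END;
END MODULE;

Since InputRoot is read-only in a Compute node, the full pattern also copies the message somewhere mutable (the Environment tree, for example) so that each record can be deleted after it has been propagated and memory stays bounded; the article covers that part.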
Bear in mind, though, that this article is very old: it uses the XML domain and message sets, whereas you will be using XMLNSC and DFDL.
_________________
Before you criticize someone, walk a mile in their shoes. That way you're a mile away, and you have their shoes too.
catshout
Posted: Tue Oct 01, 2013 5:49 am Post subject:
Acolyte
Joined: 15 May 2012 Posts: 57
Many thanks for the reply.
I was just trying the following approach:
1. The "input" flow reads the file record by record and appends the offset of each record to a shared string variable. The records themselves are not processed any further here. The string variable becomes part of the outgoing trigger message, which finally looks like:
Code:
<File>
  <RecordCount>7803</RecordCount>
  <RecordOffsets>|0|59|462|866</RecordOffsets>
</File>
The example above corresponds to a file whose content looks like:
Code:
030020001.201201 2012010120120131201202131544
1:30020004:0:Credit Voice:...<to be continued till position 865>
9000000005898
2. The "split" flow can pick a specific row by looking at this 'RecordOffsets' string: for the last row it picks the offset value 866, for the row before that the value 462, and so on. The offset can be set before the FileRead node at OutputLocalEnvironment.Destination.File.Offset; the FileRead node will then pick exactly the requested row (see the sketch below).
The problem is that the RecordOffsets string could become very large, depending on the number of rows. In real life there are fewer than 10 header rows and 10 trailer rows, so it might be better to store only the first 10 and the last 10 offset values in this string.
The file needs to be parsed twice anyway, but the flow would never consume much memory, as the whole file content is never held in memory at once.
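For illustration, a minimal sketch of a Compute node that could sit directly in front of the FileRead node and set that offset — the names are made up, and the node's Compute mode is assumed to include LocalEnvironment:
Code:
-- Hypothetical Compute node placed just before the FileRead node
CREATE COMPUTE MODULE SplitFlow_SetReadOffset
  CREATE FUNCTION Main() RETURNS BOOLEAN
  BEGIN
    -- Pass the trigger message and local environment through unchanged
    SET OutputRoot = InputRoot;
    SET OutputLocalEnvironment = InputLocalEnvironment;
    -- Override the read position; 866 is the last row's offset
    -- in the example above
    SET OutputLocalEnvironment.Destination.File.Offset = 866;
    RETURN TRUE;
  END;
END MODULE;

In a real flow the 866 would of course be taken from the RecordOffsets string rather than hard-coded.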
Looking forward to any feedback.
mqjeff
Posted: Tue Oct 01, 2013 6:36 am Post subject:
Grand Master
Joined: 25 Jun 2008 Posts: 17447
I thought the techniques here had been moved into product samples on large messages and file handling?
catshout
Posted: Tue Oct 01, 2013 10:04 pm Post subject:
Acolyte
Joined: 15 May 2012 Posts: 57
I've created the "input" flow as follows:
The flow captures the file with the FileInput node, record by record (delimited). For each record it checks the offset, and at the end it creates a structure with the record counter and the last 10 offsets, like:
Code:
<File>
  <RecordCount>859466</RecordCount>
  <RecordTrailerOffsets>
    <row>242205755</row>
    <row>242205474</row>
    <row>242205194</row>
    <row>242204912</row>
    <row>242204631</row>
    <row>242204351</row>
    <row>242204069</row>
    <row>242203772</row>
    <row>242203471</row>
    <row>242203191</row>
  </RecordTrailerOffsets>
</File>
This structure is sent as a trigger message to the next flow, which can now create the trailer first, using the FileRead node with the offset set before reading. The header isn't a big deal, as the first n records are captured as the header anyway.
Each further record can then be enriched with the header and trailer while reading the file sequentially, even with another FileRead node, and propagated to the output queue for per-record transaction processing. A sketch of the offset bookkeeping in the "input" flow follows below.
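For what it's worth, the per-record offset bookkeeping in the "input" flow could look roughly like this — a sketch only, with made-up names; a SHARED row variable is used to keep state across the records of one file:
Code:
-- Hypothetical per-record Compute node in the "input" flow; it keeps a
-- rolling window of the last 10 record offsets in a SHARED row variable
CREATE COMPUTE MODULE InputFlow_TrackOffsets
  DECLARE lastOffsets SHARED ROW;
  CREATE FUNCTION Main() RETURNS BOOLEAN
  BEGIN
    BEGIN ATOMIC
      -- Remember this record's offset ...
      CREATE LASTCHILD OF lastOffsets.Data NAME 'row'
        VALUE InputLocalEnvironment.File.Offset;
      -- ... and keep only the newest 10
      IF CARDINALITY(lastOffsets.Data.row[]) > 10 THEN
        DELETE FIELD lastOffsets.Data.row[1];
      END IF;
    END ATOMIC;
    RETURN FALSE;  -- individual records are not propagated here
  END;
END MODULE;

The End of Data branch can then copy lastOffsets.Data into the <RecordTrailerOffsets> element of the trigger message and clear the shared variable for the next file.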
One word on performance: the "input" flow needed approx. 6 minutes for the 859,466 records (approx. 2,400 records per second). This is absolutely fine, as the whole process is batch processing without any special performance requirement.