sumit
Posted: Sun Mar 01, 2015 12:16 pm Post subject: Parsed Record Sequence with Large message
Partisan
Joined: 19 Jan 2006 Posts: 398
Hi All,
OS - Windows (for now, have to move it to Linux)
IIB (v9)
Developed a message flow to process a structured large flat file. The file has a header, multiple data records (no specific count) and a trailer.
Prepared a basic DFDL model with a header (optional), data (max count set to 20) and a trailer (optional).
I am using this DFDL in my FileInput node with 'Parsed Record Sequence'. The input flat file can have any number of data records. The intention is to let the FileInput node take 20 of the available data records at once, with the help of the DFDL model, and send them through the flow for processing.
I have tested this scenario with a smaller file and it works well. Even in debug mode (I know the behaviour of debug doesn't imitate the real processing), I can see that the flow picks 20 records in one go, processes them and sends them to the output file (FileOutput node, default settings). I can see each set of 20 data records going into the output file before the flow picks the next set of 20.
However, when I tested this flow with a large file (50 MB), it started taking a lot of time. I can see the output file getting created in the transit folder, but with a 0 KB size; I cannot see the file size growing.
I ran a trace and I can see the offset value changing in it, which shows that the flow is picking 20 records in one go, but I don't see them going into the output file after each set is processed.
I understand that parsed record sequence is an expensive way of dealing with a file. However, I am struggling to understand why the output file isn't being appended after each set of 20 data records is processed.
Had this worked, I was planning to test it with a 1 GB and then a 2 GB input message. _________________ Regards
Sumit
Vitor
Posted: Mon Mar 02, 2015 6:16 am Post subject:
Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
Without looking at your DFDL schema (which you've not posted), one possible theory is that it's picking up 20 data records, then parsing the rest of the 50 MB file looking for something it recognises (which it doesn't find). The output file never gets any data because the flow doesn't commit anything.
If I was coding this, I'd build a DFDL model that correctly described the data (optional header, 1-n data records, optional trailer) and put a Collector node as the next one in sequence after the FileInput with a collection size of 20.
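Something along these lines, as a rough and untested sketch (the element names, initiators and the occursCountKind setting are illustrative assumptions, not taken from your actual schema):
Code:
<!-- Sketch only: optional header, 1-n data records, optional trailer.
     occursCountKind='parsed' lets the parser discover the record count. -->
<xsd:complexType name="File_msg">
  <xsd:sequence dfdl:separator="">
    <xsd:element name="header" type="xsd:string" minOccurs="0"
                 dfdl:initiator="HEADER" dfdl:terminator="%LF;"/>
    <xsd:element name="data" type="xsd:string" maxOccurs="unbounded"
                 dfdl:occursCountKind="parsed" dfdl:terminator="%LF;"/>
    <xsd:element name="trailer" type="xsd:string" minOccurs="0"
                 dfdl:initiator="TRAILER"/>
  </xsd:sequence>
</xsd:complexType>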
You could of course perform the same collection of records into groups of 20 with a shared variable, database, global cache (if you're on that version) or other mechanism of your choice. I'd use a Collector, but the key point is to shred the file with DFDL and group for processing with code. _________________ Honesty is the best policy.
Insanity is the best defence.
sumit
Posted: Mon Mar 02, 2015 7:52 am Post subject:
Partisan
Joined: 19 Jan 2006 Posts: 398
Vitor wrote:
Without looking at your DFDL schema (which you've not posted)
The input data is EDI data with multiple 5000s and 5990s in it. Each 5000-5990 pair represents a record. The DFDL model is designed to pick 20 such records in one go. Here is the DFDL structure:
Code:
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/" xmlns:fmt="http://www.ibm.com/dfdl/GeneralPurposeFormat" xmlns:ibmDfdlExtn="http://www.ibm.com/dfdl/extensions" xmlns:ibmSchExtn="http://www.ibm.com/schema/extensions" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:import namespace="http://www.ibm.com/dfdl/GeneralPurposeFormat" schemaLocation="../IBMdefined/GeneralPurposeFormat.xsd"/>
  <xsd:element ibmSchExtn:docRoot="true" name="Claims_split" type="Claims_msg"/>
  <xsd:complexType name="body_msg">
    <xsd:sequence dfdl:separator="" dfdl:terminator="">
      <xsd:element dfdl:initiator="5000" dfdl:terminator="%LF;5990" ibmDfdlExtn:sampleValue="" name="body5000" type="xsd:string"/>
      <xsd:element dfdl:terminator="%LF;" ibmDfdlExtn:sampleValue="" name="body5990" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:element name="Claims_test" type="Claims_msg"/>
  <xsd:complexType name="Claims_msg">
    <xsd:sequence dfdl:separator="">
      <xsd:element dfdl:initiator="HEADER" dfdl:occursCountKind="implicit" dfdl:terminator="%LF;" ibmDfdlExtn:sampleValue="" minOccurs="0" name="header" type="xsd:string"/>
      <xsd:element dfdl:occursCountKind="implicit" dfdl:terminator="" maxOccurs="20" name="body" type="body_msg"/>
      <xsd:element dfdl:initiator="TRAILER" dfdl:occursCountKind="implicit" ibmDfdlExtn:sampleValue="" minOccurs="0" name="trailer" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:annotation>
    <xsd:appinfo source="http://www.ogf.org/dfdl/">
      <dfdl:format ref="fmt:GeneralPurposeFormat"/>
    </xsd:appinfo>
  </xsd:annotation>
</xsd:schema>
I hope this way of presentation is fine.
Vitor wrote:
...it's picking up 20 data records then parsing the rest of the 50 MB file looking for something it recognises (which it doesn't find). The output file never gets any data because the flow doesn't commit anything.
Hmm... I have tested the DFDL on a smaller file, and I could see the flow writing the output file in debug mode for each iteration. As you mentioned, it may be because the 'commit' behaviour is different in debug mode compared to the actual flow processing.
Vitor wrote:
If I was coding this, I'd build a DFDL model that correctly described the data (optional header, 1-n data records, optional trailer) and put a Collector node as the next one in sequence after the FileInput with a collection size of 20.
You could of course perform the same collection of records into groups of 20 with a shared variable, database, global cache (if you're on that version) or other mechanism of your choice. I'd use a Collector, but the key point is to shred the file with DFDL and group for processing with code.
Thanks for the suggestion. I will try using the Collector node. _________________ Regards
Sumit
Vitor
Posted: Mon Mar 02, 2015 8:45 am Post subject:
Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
sumit wrote:
Vitor wrote:
Without looking at your DFDL schema (which you've not posted)
The input data is EDI data with multiple 5000s and 5990s in it. Each 5000-5990 pair represents a record. The DFDL model is designed to pick 20 such records in one go
I'm no @kimbert, but I think that's telling DFDL there is a maximum of only 20 occurrences of that structure. I don't see where you describe that the group of 1-20 repeats through the file.
sumit wrote:
I hope this way of presentation is fine.
This way of presentation is ideal, and I thank you for using those tags.
sumit wrote:
Vitor wrote:
...it's picking up 20 data records then parsing the rest of the 50 MB file looking for something it recognises (which it doesn't find). The output file never gets any data because the flow doesn't commit anything.
Hmm... I have tested the DFDL on a smaller file, and I could see the flow writing the output file in debug mode for each iteration. As you mentioned, it may be because the 'commit' behaviour is different in debug mode compared to the actual flow processing.
It is.
sumit wrote:
Vitor wrote:
If I was coding this, I'd build a DFDL model that correctly described the data (optional header, 1-n data records, optional trailer) and put a Collector node as the next one in sequence after the FileInput with a collection size of 20.
You could of course perform the same collection of records into groups of 20 with a shared variable, database, global cache (if you're on that version) or other mechanism of your choice. I'd use a Collector, but the key point is to shred the file with DFDL and group for processing with code.
Thanks for the suggestion. I will try using the Collector node.
Please post again (on a new thread if appropriate) if you continue to experience issues. _________________ Honesty is the best policy.
Insanity is the best defence.
sumit
Posted: Mon Mar 02, 2015 10:22 am Post subject:
Partisan
Joined: 19 Jan 2006 Posts: 398
Vitor wrote:
I'm no @kimbert, but I think that's telling DFDL there is a maximum of only 20 occurrences of that structure. I don't see where you describe that the group of 1-20 repeats through the file.
My understanding is that using 'Parsed Record Sequence' at the FileInput node will ensure that the flow picks only the first 20 records for one set of processing.
I must mention here that the sample flow has just a FileInput node and a FileOutput node. _________________ Regards
Sumit
sumit
Posted: Mon Mar 02, 2015 11:09 am Post subject:
Partisan
Joined: 19 Jan 2006 Posts: 398
Maybe I am all wrong in my understanding. I am testing the flow with various scenarios now. _________________ Regards
Sumit
Vitor
Posted: Mon Mar 02, 2015 11:13 am Post subject:
Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
sumit wrote:
Vitor wrote:
I'm no @kimbert, but I think that's telling DFDL there is a maximum of only 20 occurrences of that structure. I don't see where you describe that the group of 1-20 repeats through the file.
My understanding is that using 'Parsed Record Sequence' at the FileInput node will ensure that the flow picks only the first 20 records for one set of processing.
So your assertion is that because the record definition allows for a maximum of 20 records, the file input node will pick 20 records?
Fair enough; my assertion is that the file input node will use the record definition to identify the first 20 records of data, then trawl in confusion through the rest of the file.
One of us is right, and I'm not that certain it's me (I go to a lot of trouble to avoid parsed record sequence due to the cost involved, so my experience is limited). If you're right, then I don't know what the issue is with your flow. _________________ Honesty is the best policy.
Insanity is the best defence.
Vitor
Posted: Mon Mar 02, 2015 11:21 am Post subject:
Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
sumit wrote:
Maybe I am all wrong in my understanding.
As I said, I wouldn't immediately assume that. I think I've used parsed record sequence once or maybe twice in the 14 years I've been using whatever-the-product-is-called-now, and certainly not recently.
"Will Mr Kimbert please answer a thread on the white phone. Mr Kimbert to the white phone please." _________________ Honesty is the best policy.
Insanity is the best defence.
mqjeff
Posted: Mon Mar 02, 2015 11:29 am Post subject:
Grand Master
Joined: 25 Jun 2008 Posts: 17447
The DFDL as posted says the following:
There is a message. It consists of three parts: a header, some body stuff, and a trailer.
There is at most one (optional) header.
There are up to, but no more than, 20 body records.
There is at most one (optional) trailer.
Is that actually how your data is organized? That for every 20 records, there is a separate header and trailer?
Or is it one header and one trailer in the entire file, and a whole lot of body records? (See the sample layout below.)
Either way, you need to figure out how to associate 'parsed record sequence' with the fact that you have a header and a trailer.
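As a purely hypothetical illustration (record contents invented; only the initiators come from the posted schema), that second organization would look like this:
Code:
HEADER...
5000<claim record 1>
5990<claim summary 1>
5000<claim record 2>
5990<claim summary 2>
... many more 5000/5990 pairs ...
TRAILER...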
sumit
Posted: Mon Mar 02, 2015 12:25 pm Post subject:
Partisan
Joined: 19 Jan 2006 Posts: 398
mqjeff wrote:
Is that actually how your data is organized? That for every 20 records, there is a separate header and trailer?
Or is it one header and one trailer in the entire file, and a whole lot of body records?
The input file has one header and one trailer. In between, there can be any number of body records.
I've tested my flow with a smaller file as well: a file with 1 header, 8 data records and 1 trailer. I've updated my DFDL to pick 2 records at a time (max count 2). All other properties of the FileInput node are the same. I have also placed a Collector node with Quantity set to 1 and a timeout of 10 secs. The Event coordination property is set to 'Disabled'. The Compute nodes just copy InputRoot to OutputRoot, and then there is an MQOutput node.
Code:
FileInput -> Compute (OutputRoot = InputRoot) -> Collector -> Compute (OutputRoot = InputRoot) -> MQOutput
When I run this flow, I can see 4 messages in my output queue:
First - 1 header, a body with 2 records
Second - a body with 2 records
Third - a body with 2 records
Fourth - a body with 2 records and 1 trailer
But when I do the same thing with a large max count in the DFDL (20) and run the interface on a large file, nothing comes out for a long time. I am going to run a trace on my new setup to see what it suggests. _________________ Regards
Sumit
kimbert
Posted: Tue Mar 03, 2015 5:32 pm Post subject:
Jedi Council
Joined: 29 Jul 2003 Posts: 5542 Location: Southampton
The DFDL model needs to be a repeating choice of header/body/trailer. You cannot set maxOccurs=unbounded on a choice group, so you must create a model with an element that repeats unbounded. The element contains the choice of header/body/trailer.
If you want 'body' to represent (up to) 20 body records, then you must create a sequence group that contains an element with minOccurs=0 and maxOccurs=20 on that branch of the choice.
Easier to show it like this:
Code:
Message
  complex type
    element name='record' maxOccurs='unbounded'
      choice group
        element name='header'
        sequence group
          element name='body' minOccurs=0 maxOccurs=20
        element name='trailer'
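A rough XSD rendering of that outline (untested; the initiators, terminators and occursCountKind values are assumptions carried over from the schema posted earlier, and the discriminators mentioned below are omitted):
Code:
<!-- Sketch only: one repeating 'record' element whose content is a choice
     of a header, a run of up to 20 body records, or a trailer. -->
<xsd:element name="Claims_split">
  <xsd:complexType>
    <xsd:sequence dfdl:separator="">
      <xsd:element name="record" maxOccurs="unbounded"
                   dfdl:occursCountKind="parsed">
        <xsd:complexType>
          <xsd:choice>
            <xsd:element name="header" type="xsd:string"
                         dfdl:initiator="HEADER" dfdl:terminator="%LF;"/>
            <xsd:sequence dfdl:separator="">
              <xsd:element name="body" type="body_msg" minOccurs="0"
                           maxOccurs="20" dfdl:occursCountKind="implicit"/>
            </xsd:sequence>
            <xsd:element name="trailer" type="xsd:string"
                         dfdl:initiator="TRAILER"/>
          </xsd:choice>
        </xsd:complexType>
      </xsd:element>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>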
You may need to add discriminators to the header, body and trailer elements to ensure that the DFDL parser always resolves the choice correctly. Remember that the DFDL trace is your best diagnostic tool. _________________ Before you criticize someone, walk a mile in their shoes. That way you're a mile away, and you have their shoes too.
sumit
Posted: Fri Mar 06, 2015 9:29 am Post subject:
Partisan
Joined: 19 Jan 2006 Posts: 398
Thanks for the suggestion, Kimbert. I'll try it out.
After my last post, I dropped the idea of using 'Parsed Record Sequence' and built a flow with 'Records and Elements' set to 'Delimited' at my FileInput node.
I used a Collector node to collect 20/50 records at a time, then built a logical message with the header and trailer and sent it to an output queue. This is along the same lines as what 'wbi_telecom' mentioned in this post.
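In flow terms, the arrangement described is roughly this (node names and properties abbreviated; the header/trailer rebuild in the Compute node is only sketched, not the actual code):
Code:
FileInput (Delimited) -> Collector (20/50 records) -> Compute (re-add header/trailer) -> MQOutput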
Tested the flow on a 2 GB file and it processed the whole file in close to 2 minutes 20 seconds. _________________ Regards
Sumit