DFDL, variable length records with only record initiators?

AlanV (Newbie, joined 23 Nov 2022, posts: 3)
Posted: Wed Nov 23, 2022 3:39 am
Hi, as I'm currently a bit of a novice with DFDL, I'm struggling to work out how to model the following data format:
HEA001xxxxxxxxxyyyyyzz
REC002xxxxyyyyzzzzxxx
REC002xxxxyyyyzz
zz
REC002xxxxyyyyzzzz
REC002xxxxyy
yyzzzz
xxx
FOO003xxxxxxxxxyyyyyzz
In the above, there is 1 header record, identified by its starting keyword: HEA001.
Then there are 4 variable-length records, each identified by its starting keyword: REC002. A REC002 record might contain zero to N line feeds.
Then there is 1 footer record, identified by its starting keyword: FOO003.
As the REC002 records might contain zero or more line feeds (CR LF), the line feed cannot be used as a record delimiter. As the records are variable length, fixed-length record parsing doesn't work either. And as there is no terminator for REC002 records, a terminator cannot be used to signal to DFDL which record should be parsed next.
Somehow the record initiator (REC002) should be used both to initiate and to terminate REC002 record parsing, but I'm struggling to accomplish this: if I use it as a terminator, the DFDL parser eats it and it's no longer available to be used as an initiator. So some other DFDL approach is required (length kind: pattern?).
Any ideas how this DFDL parsing challenge could be solved?
Thank you,
Alan

timber (Grand Master, joined 25 Aug 2015, posts: 1290)
Posted: Wed Nov 23, 2022 8:42 am
That's a very good description of the problem. It's unusual to be able to propose a solution without asking further questions, so thanks for that.
You're correct that this is a challenging format to model using DFDL, but I think it's possible. One approach would be to define the length of REC002 using a regular expression. This can be done using lengthKind="pattern". The regular expression (pattern) will be a little tricky to compose, though.
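To give a feel for what such a pattern has to express, here is a rough Python sketch. It is not DFDL: DFDL has its own regular-expression dialect, so treat the Python syntax as a stand-in. The CR LF line ends, the `DATA` sample layout and the helper name `split_records` are assumptions made for the illustration. The key idea is a lookahead that stops a record just before a line break followed by a record keyword, without consuming that keyword.

```python
import re

# Sample from the first post, assuming CR LF line ends.
DATA = (
    "HEA001xxxxxxxxxyyyyyzz\r\n"
    "REC002xxxxyyyyzzzzxxx\r\n"
    "REC002xxxxyyyyzz\r\nzz\r\n"
    "REC002xxxxyyyyzzzz\r\n"
    "REC002xxxxyy\r\nyyzzzz\r\nxxx\r\n"
    "FOO003xxxxxxxxxyyyyyzz"
)

# Lazily consume characters (line breaks included) until the next line
# starts with a record keyword, or the data ends. The lookahead does not
# consume the keyword, so it is still there to start the next record.
RECORD = re.compile(r"(?s).*?(?=\r\n(?:HEA001|REC002|FOO003)|\Z)")


def split_records(data):
    """Hypothetical helper: return the records as a list of strings."""
    records, pos = [], 0
    while pos < len(data):
        match = RECORD.match(data, pos)
        records.append(match.group(0))
        pos = match.end()
        # Skip the line break that separated this record from the next one.
        if data.startswith("\r\n", pos):
            pos += 2
    return records


records = split_records(DATA)
```

One caveat of any pattern like this: if a record payload can itself contain a line break followed by one of the keywords, the record is split in the wrong place.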
A simpler alternative would be as follows (not tested):
Code:
element name="root"
  complexType terminator="%NL;FOO003" separator="%NL;REC"
    element name="HEA001" initiator="HEA001"
    element name="REC002" initiator="002" maxOccurs="4"
    <more content here>
    element name="FOO003" initiator="FOO003"
This should work because the terminator "FOO003" will be in scope and will terminate the 4th occurrence of REC002, even though the separator "%NL;REC" will not do so.
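As a plain string-level picture of that idea (not a simulation of the DFDL parser), the Python sketch below cuts the sample at the terminator and then splits on the separator. The CR LF line ends and the helper name `slice_sequence` are assumptions. It shows why the REC002 initiator is just "002": the separator has already consumed the "REC" part.

```python
NL = "\r\n"
SEPARATOR = NL + "REC"      # stands in for separator="%NL;REC"
TERMINATOR = NL + "FOO003"  # stands in for terminator="%NL;FOO003"

# Sample from the first post, assuming CR LF line ends.
DATA = (
    "HEA001xxxxxxxxxyyyyyzz\r\n"
    "REC002xxxxyyyyzzzzxxx\r\n"
    "REC002xxxxyyyyzz\r\nzz\r\n"
    "REC002xxxxyyyyzzzz\r\n"
    "REC002xxxxyy\r\nyyzzzz\r\nxxx\r\n"
    "FOO003xxxxxxxxxyyyyyzz"
)


def slice_sequence(data):
    """Hypothetical helper: return (items before the terminator, text after it)."""
    body, found, rest = data.partition(TERMINATOR)
    if not found:
        raise ValueError("terminator not found")
    # Every item after the first has lost its leading "REC" to the separator.
    return body.split(SEPARATOR), rest


items, rest = slice_sequence(DATA)
```

Here `items` holds the header followed by the four record bodies, each starting with "002", and `rest` is whatever follows the terminator. Note that the terminator has consumed the "FOO003" keyword as well, so `rest` no longer starts with it.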

AlanV
Posted: Wed Nov 23, 2022 10:06 am
Thank you, this simpler alternative looks good!
The only alteration I made was moving the FOO003 footer into its own sequence, outside the sequence containing one HEA001 and unbounded REC002s. When FOO003 was in the same sequence as the others, its parsing failed. I think this was due to terminator="%NL;FOO003" eating the FOO003, and to the last (footer) row not starting with %NL;REC. I still have to check the DFDL trace to see whether I understood the outcome correctly.
But anyway, even with that small alteration, the DFDL parsing works nicely. It is a much simpler and better solution than the regular expression pattern, which would not have been good performance-wise.
In the real use case, these REC002 records (read from a batch file) will number from 50k to 2000k. Next I need to work out how to alter the DFDL structure so that I can read a bigger batch file in smaller pieces (e.g. 1k-5k records at a time) to keep the memory usage in check (FileInput, Message domain: DFDL, Record detection: Parsed Record Sequence).

timber
Posted: Thu Nov 24, 2022 3:51 am
Quote:
In the real use case, these REC002 records (read from a batch file) will number from 50k to 2000k. Next I need to work out how to alter the DFDL structure so that I can read a bigger batch file in smaller pieces (e.g. 1k-5k records at a time) to keep the memory usage in check
Actually, that is a really important consideration. The design of your flow will be strongly influenced by this non-functional requirement.
There is a well-documented pattern for reading/processing a huge file, one record at a time: https://www.ibm.com/docs/en/integration-bus/9.0.0?topic=SSMKHH_9.0.0/com.ibm.etools.mft.samples.largemessaging.doc/doc/introduction.html
The main ideas are:
- Design your model so that every record is the same type. This will require a Choice in your model with 3 members (Header, Record, Footer)
- Use Record detection = Parsed Record Sequence to ensure that one record at a time gets propagated from the input node (see why the model needs to treat all types of records the same?)
- Use the techniques described in the linked topic to DELETE the message tree after processing each record. You are using DFDL and not XMLNSC but the same technique applies.
You could decide to implement the split-into-records flow and the process-records flow as separate message flows linked by a queue. I think you can implement a scalable solution without doing that, though.
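The memory argument behind the points above can be pictured outside the broker with a small Python generator. This is only an analogy, not how the FileInput node is implemented. The keyword tuple, the CR LF line ends and the helper name `iter_records` are assumptions. Only the record currently being assembled is held in memory, however large the input is.

```python
import io

# Every record, whatever its type, starts with one of these keywords.
KEYWORDS = ("HEA001", "REC002", "FOO003")


def iter_records(lines):
    """Hypothetical helper: yield one complete record at a time.

    `lines` is any iterable of text lines, for example an open file.
    """
    current = None
    for line in lines:
        if line.startswith(KEYWORDS):
            # A new record begins, so the previous one is complete.
            if current is not None:
                yield current.rstrip("\r\n")
            current = line
        elif current is not None:
            # Continuation line of a record that contains line breaks.
            current += line
    if current is not None:
        yield current.rstrip("\r\n")


# Sample from the first post, assuming CR LF line ends.
DATA = (
    "HEA001xxxxxxxxxyyyyyzz\r\n"
    "REC002xxxxyyyyzzzzxxx\r\n"
    "REC002xxxxyyyyzz\r\nzz\r\n"
    "REC002xxxxyyyyzzzz\r\n"
    "REC002xxxxyy\r\nyyzzzz\r\nxxx\r\n"
    "FOO003xxxxxxxxxyyyyyzz"
)

records = list(iter_records(io.StringIO(DATA)))
```

In a real run the caller would process each record inside a `for` loop and let it go out of scope, instead of collecting everything into a list as done here for checking.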

AlanV
Posted: Fri Nov 25, 2022 2:43 am
Well, this turned out to be a journey into the world of DFDL.
But I have perhaps accomplished the goal (it still needs further testing) by reading the DFDL documentation (pulling hairs out of my head) and mostly by trial and error in DFDL modelling attempts.
And thank you again for your guidance. Without it (the comment that this can be accomplished in DFDL), I would most likely have tried to solve this with some low-level Java code reading and parsing the file for ACE, which would not have been such a sound solution.
I built and am currently testing the following DFDL, which seems to work for this use case:
Code:
element name="root"
  sequence
    element name="record" maxOccurs="2"
      sequence
        element name="keyPrefix" lengthKind="pattern" lengthPattern="(HEA)" minOccurs="0"
        element name="keySuffix" lengthKind="explicit" length="3"
        choice choiceDispatchKey="{./keySuffix}"
          element name="HEA001" choiceBranchKey="001" terminator="%NL;REC %NL;FOO"
          element name="REC002" choiceBranchKey="002" terminator="%NL;REC %NL;FOO"
          element name="FOO003" choiceBranchKey="003"
* keyPrefix takes care of ("eats") the HEA word on the first (header) line.
* keySuffix is the record-type key (001, 002 or 003) used by the choice.
* For the 001 header and the 002 record there are two terminators, either newline+REC or newline+FOO. I learned that you can have multiple terminator strings, not just one.
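The dispatch logic in these three points can be sketched in Python as follows. This is only an illustration of the idea, not of the DFDL runtime. The `BRANCHES` table and the function name `dispatch` are hypothetical.

```python
# Maps the three-character key (keySuffix) to the choice branch it selects.
BRANCHES = {"001": "HEA001", "002": "REC002", "003": "FOO003"}


def dispatch(record_text):
    """Hypothetical helper: return (branch name, remaining payload)."""
    # keyPrefix: only the first record still carries the literal "HEA",
    # because later keywords were consumed by the previous terminator.
    if record_text.startswith("HEA"):
        record_text = record_text[3:]
    # keySuffix: the next three characters select the branch.
    key, payload = record_text[:3], record_text[3:]
    if key not in BRANCHES:
        raise ValueError(f"no choice branch for key {key!r}")
    return BRANCHES[key], payload


header = dispatch("HEA001xxxxxxxxxyyyyyzz")
record = dispatch("002xxxxyyyyzz\r\nzz")
```

Selecting the branch from an already parsed key avoids trying each branch in turn, which is the usual reason for preferring a dispatch key on a choice.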
Based on a very small sample, this DFDL seems to handle the variable-length REC002 records, which in certain cases might be split across multiple lines (a problem in the legacy application producing the data, which won't be fixed).
When I tested this DFDL with FileInput (Message domain: DFDL, Record detection: Parsed Record Sequence), I observed from a Trace node that the FileInput node processed 2 records at a time. For the real use case I will of course raise the record maxOccurs from the current 2 to a greater value; this was just for initial DFDL testing.

timber
Posted: Fri Nov 25, 2022 7:58 am
I think you're almost there, but there may be a couple of bumps in the road to avoid.
The most important point is that Parsed Record Sequence assumes that the root element in the DFDL schema is a repeating record. In your model, I think the repeating choice is one level below the root. I think that explains why the FileInput node is propagating more than one record at a time.
My suggestion:
Code:
element name="root"
  choice maxOccurs="unbounded" initiatedContent="yes"
    element name="HEA001" initiator="HEA001" terminator="%NL;REC %NL;FOO"
    element name="REC002" initiator="002" terminator="%NL;REC %NL;FOO"
    element name="FOO003" initiator="003" terminator="%NL;"
Based on your ability to read and apply the DFDL specification, I think you can safely call yourself an expert from now on! I think dfdl:initiatedContent is the best choice here because every branch has an initiator...but you can revert to using dfdl:choiceDispatchKey if you hit problems.