MQSeries.net Forum: WebSphere Message Broker (ACE) Support
Topic: DFDL, variable length records with only record initiators?

AlanV (Newbie), posted Wed Nov 23, 2022 3:39 am

Hi, as I’m currently a bit of a novice with DFDL, I’m struggling to work out how to model the following data format:

HEA001xxxxxxxxxyyyyyzz
REC002xxxxyyyyzzzzxxx
REC002xxxxyyyyzz
zz
REC002xxxxyyyyzzzz
REC002xxxxyy
yyzzzz
xxx
FOO003xxxxxxxxxyyyyyzz


In the sample above, there is 1 header record, identified by its starting keyword: HEA001.

Then there are 4 variable length records, each identified by the starting keyword REC002. A REC002 record might contain zero to N line feeds.

Then there is 1 footer record, identified by its starting keyword: FOO003.


As the REC002 records might contain zero or more line feeds (CR LF), the line feed cannot be used as a record delimiter. Also, as the records are variable length, fixed length record parsing doesn’t work. And as there is no terminator for REC002 records, a terminator cannot be used to signal to DFDL that it should evaluate which record to parse next.

Somehow the record initiator (REC002) should be used both to initiate and to terminate REC002 record parsing, but I’m struggling to accomplish this: if I use it as a terminator, the DFDL parser eats it and it’s no longer available to be used as an initiator. So some other DFDL approach is required (lengthKind="pattern"?).

Any ideas how this DFDL parsing challenge could be solved?

Thank you,
Alan

timber (Grand Master), posted Wed Nov 23, 2022 8:42 am

That's a very good description of the problem. It's unusual to be able to propose a solution without asking further questions, so thanks for that.

You're correct that this is a challenging format to model using DFDL, but I think it's possible. One approach would be to define the length of REC002 using a regular expression. This can be done using lengthKind="pattern". The regular expression (pattern) will be a little tricky to compose, though.
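
As a rough illustration of the pattern idea outside DFDL (this sketch is not from the thread, and the sample data and the assumption of CR LF line ends are mine), a regular expression with a lookahead can find record boundaries without consuming the keyword that starts the next record:

```python
import re

# Hypothetical sample mirroring the thread's layout; CR LF line ends are assumed.
SAMPLE = (
    "HEA001xxxxxxxxxyyyyyzz\r\n"
    "REC002xxxxyyyyzzzzxxx\r\n"
    "REC002xxxxyyyyzz\r\nzz\r\n"
    "REC002xxxxyyyyzzzz\r\n"
    "REC002xxxxyy\r\nyyzzzz\r\nxxx\r\n"
    "FOO003xxxxxxxxxyyyyyzz\r\n"
)

# A record boundary is a line break that is followed by a known keyword.
# The lookahead (?=...) checks for the keyword without consuming it, so the
# keyword is still there as the start of the next record.
BOUNDARY = re.compile(r"\r?\n(?=HEA001|REC002|FOO003)")


def split_records(text: str) -> list[str]:
    """Split text into records; embedded line breaks stay inside a record."""
    return [r for r in BOUNDARY.split(text.rstrip("\r\n")) if r]


records = split_records(SAMPLE)
```

Like any keyword-based approach, this would mis-split a record whose data happens to contain a line break followed by one of the keywords.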

A simpler alternative would be as follows (not tested):
Code:

element name="root"
  complexType terminator="%NL;FOO003" separator="%NL;REC"
    element name="HEA001" initiator="HEA001"
    element name="REC002" initiator="002" maxOccurs="4"
      <more content here>
    element name="FOO003" initiator="FOO003"

This should work because the terminator "FOO003" will be in scope and will terminate the 4th occurrence of REC002, even though the separator "%NL;REC" will not do so.
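
To see why the REC002 initiator in this sketch is "002" rather than "REC002", here is a loose Python analogy (plain string splitting, not DFDL semantics; the sample data and CR LF line ends are assumptions). The separator swallows the "REC" prefix, and the terminator swallows the footer keyword:

```python
# Hypothetical sample in the thread's layout; CR LF line ends are assumed.
SAMPLE = (
    "HEA001aaaa\r\n"
    "REC002bbbb\r\ncc\r\n"
    "REC002dddd\r\n"
    "FOO003eeee\r\n"
)

SEPARATOR = "\r\nREC"      # plays the role of separator="%NL;REC"
TERMINATOR = "\r\nFOO003"  # plays the role of terminator="%NL;FOO003"


def split_like_separator(text: str) -> tuple[str, list[str], str]:
    """Return (header, records, whatever follows the terminator)."""
    # Everything before the terminator is the separated content.
    body, _, rest = text.partition(TERMINATOR)
    # The first piece is the header; the others lost their "REC" prefix.
    header, *records = body.split(SEPARATOR)
    return header, records, rest


header, records, rest = split_like_separator(SAMPLE)
```

In this analogy each record starts with "002", and the text left after the terminator no longer carries the FOO003 keyword.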

AlanV (Newbie), posted Wed Nov 23, 2022 10:06 am

Thank you, this simpler alternative looks good!

The only alteration I made was moving the FOO003 footer to its own sequence, outside of the sequence containing 1 HEA001 and unbounded REC002s. If FOO003 was in the same sequence as the others, its parsing failed. I think this was due to terminator="%NL;FOO003" eating the FOO003, and the last footer row not starting with %NL;REC. I have to check the DFDL trace to see whether I understood the outcome correctly.

But anyway, even with the small alteration, the DFDL parsing works nicely. It is a much simpler and better solution than the regular expression pattern, which would not have been good performance-wise.


In the real use case, these REC002 records (read from a batch file) will number from 50k to 2000k. Next I need to look at how to alter the DFDL structure so that I can use it to read a bigger batch file in smaller pieces (e.g. 1k-5k records at a time), to keep the memory usage in check (FileInput, Message domain: DFDL, Record detection: Parsed Record Sequence).

timber (Grand Master), posted Thu Nov 24, 2022 3:51 am

Quote:
In the real use case, these REC002 records (read from a batch file) will number from 50k to 2000k. Next I need to look at how to alter the DFDL structure so that I can use it to read a bigger batch file in smaller pieces (e.g. 1k-5k records at a time), to keep the memory usage in check
Actually, that is a really important consideration. The design of your flow will be strongly influenced by this non-functional requirement.

There is a well-documented pattern for reading/processing a huge file, one record at a time: https://www.ibm.com/docs/en/integration-bus/9.0.0?topic=SSMKHH_9.0.0/com.ibm.etools.mft.samples.largemessaging.doc/doc/introduction.html

The main ideas are:
- Design your model so that every record is the same type. This will require a Choice in your model with 3 members (Header, Record, Footer)
- Use Record detection = Parsed Record Sequence to ensure that one record at a time gets propagated from the input node (see why the model needs to treat all types of records the same?)
- Use the techniques described in the linked topic to DELETE the message tree after processing each record. You are using DFDL and not XMLNSC but the same technique applies.

You could decide to implement the split-into-records flow and the process-records flow as separate message flows linked by a queue. I think you can implement a scalable solution without doing that, though.
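
The one-record-at-a-time idea can be sketched outside ACE as a Python generator (an analogy only, not the FileInput node's implementation; the keyword list, chunk size and function name are assumptions). Only the unfinished tail of the input is held in memory, and each record can be dropped as soon as it has been processed:

```python
import io
import re
from typing import Iterator, TextIO

# A boundary is a line break followed by a known keyword (not consumed).
BOUNDARY = re.compile(r"\r?\n(?=HEA001|REC002|FOO003)")


def iter_records(stream: TextIO, chunk_size: int = 65536) -> Iterator[str]:
    """Yield one record at a time, keeping only the unfinished tail buffered."""
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        parts = BOUNDARY.split(buffer)
        # The last part may be incomplete, so hold it back for the next chunk.
        buffer = parts.pop()
        yield from parts
    tail = buffer.rstrip("\r\n")
    if tail:
        yield tail


# Hypothetical usage with an in-memory stream and a deliberately tiny chunk size.
demo = io.StringIO("HEA001aa\r\nREC002bb\r\ncc\r\nREC002dd\r\nFOO003ee\r\n")
streamed = list(iter_records(demo, chunk_size=5))
```

A caller would normally loop over `iter_records(...)` and process each record in turn rather than collecting them in a list, which is done here only to show the result.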

AlanV (Newbie), posted Fri Nov 25, 2022 2:43 am

Well, this turned out to be a journey into the world of DFDL.

But I have perhaps accomplished the goal (it still needs further testing), through reading the DFDL documentation (pulling hairs out of my head) and mostly by trial and error in DFDL modelling attempts.

And thank you again for your guidance. Without it (the comment that this can be accomplished in DFDL), I would most likely have tried to solve this with some low-level Java code reading and parsing the file for ACE, which would not have been such a sound solution.

I wrote and am currently testing the following DFDL, which seems to work for this use case:

Code:
element name="root"
  sequence
    element name="record" maxOccurs="2"
      sequence
        element name="keyPrefix" lengthKind="pattern" lengthPattern="(HEA)" minOccurs="0"
        element name="keySuffix" lengthKind="explicit" length="3"
        choice choiceDispatchKey="{./keySuffix}"
          element name="HEA001" choiceBranchKey="001" terminator="%NL;REC %NL;FOO"
          element name="REC002" choiceBranchKey="002" terminator="%NL;REC %NL;FOO"
          element name="FOO003" choiceBranchKey="003"


    * keyPrefix takes care of ("eats") the HEA word on the first, header line.
    * keySuffix is the record type identifying key (001, 002 or 003) for the choice.
    * For the 001 header and the 002 record there are two terminators, either newLine+REC or newLine+FOO. I learned that you can have multiple terminator strings, not just one.
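
The dispatch logic described in these points can be mimicked in a few lines of Python (a sketch of the idea only, not of how the DFDL parser works; the function name and sample inputs are hypothetical):

```python
# Maps the 3-character key suffix to the branch name, like choiceBranchKey.
BRANCHES = {"001": "HEA001", "002": "REC002", "003": "FOO003"}


def classify(record: str) -> tuple[str, str]:
    """Return (record type, payload) based on the 3-character key suffix."""
    # Optional prefix: only the first record still carries "HEA", because
    # the previous record's terminator has already consumed "REC" or "FOO".
    body = record[3:] if record.startswith("HEA") else record
    key, payload = body[:3], body[3:]
    return BRANCHES[key], payload


first = classify("HEA001abc")
middle = classify("002xy\r\nz")
last = classify("003end")
```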



Based on a very small sample, this DFDL seems to take care of variable length REC002 records that, in certain cases, might be split across multiple lines (a problem in the legacy app producing the data, which won't be fixed).

And when I tested this DFDL with FileInput, Message domain: DFDL, Record detection: Parsed Record Sequence, I observed from a Trace node that the FileInput node processed 2 records at a time. For the real use case I will of course raise the record maxOccurs from the current 2 to a greater value; this was just for this initial DFDL testing.

timber (Grand Master), posted Fri Nov 25, 2022 7:58 am

I think you're almost there, but there may be a couple of bumps in the road to avoid.

The most important point is that Parsed Record Sequence assumes that the root element in the DFDL schema is a repeating record. In your model, I think the repeating choice is one level below the root. I think that explains why the FileInput node is propagating more than one record at a time.

My suggestion:
Code:
element name="root"
  choice maxOccurs="unbounded" initiatedContent="yes"
      element name="HEA001" initiator="HEA001" terminator="%NL;REC %NL;FOO"
      element name="REC002" initiator="002" terminator="%NL;REC %NL;FOO"
      element name="FOO003" initiator="003" terminator="%NL;"


Based on your ability to read and apply the DFDL specification, I think you can safely call yourself an expert from now on! I think dfdl:initiatedContent is the best choice here because every branch has an initiator...but you can revert to using dfdl:choiceDispatchKey if you hit problems.