visasimbu
Posted: Thu Sep 07, 2017 11:14 pm    Post subject: Need to parse more than 500 fields in CSV file using DFDL
Disciple
Joined: 06 Nov 2009    Posts: 171
I'm looking for an effective way to parse 500 fields in a CSV file using DFDL.

My input file has 500 fields per record, but my mapping uses only 30 of them; the rest of the fields are never used in the mapping. I'm looking for the best way to parse the incoming fields using DFDL, and at the moment I am creating a DFDL model with all 500 fields.

After going through the documentation, I found the parse timing option, which can be set to 'On Demand', so that a field is parsed only when it is used in code. Is this the only way to avoid parsing the rest of the fields? Or are there any other things I should take care of to make the best use of memory and processing?

Note: I can't ask the source system to send only the 30 fields that I need in IIB.
mqjeff
Posted: Fri Sep 08, 2017 3:04 am
Grand Master
Joined: 25 Jun 2008    Posts: 17447
You can model the rest of the fields as blobs.

If they are all at the end, you can model them as one big blob. If the fields are always filled with spaces, or are otherwise of a constant length, you can read the input as a blob, truncate it, and then parse that. A sketch of the constant-length case is below.

If they are at the front, you can model them as one big blob if you know how to find the first one you need. Likewise, if they are all fixed lengths, you can truncate the blob.

If they are mixed in, you can model the mixed-in elements as blobs, and then use your current on-demand parsing.

But note that On Demand parsing will parse everything *up to* the record you want - but only once. I think. Unless I'm wrong.
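For example, a minimal sketch of the constant-length tail modelled as one element (the element name and the length of 940 characters are invented for illustration, and the usual xs/dfdl namespace prefixes are assumed):

Code:
<!-- consumes the whole constant-length tail in a single element -->
<xs:element name="unusedTail" type="xs:string"
    dfdl:lengthKind="explicit" dfdl:length="940"
    dfdl:lengthUnits="characters"/>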
timber
Posted: Fri Sep 08, 2017 5:02 am
Grand Master
Joined: 25 Aug 2015    Posts: 1292
On Demand parsing will not help you. That applies to the entire message, not to individual records within the message.

Something like this should work:
1. Model the 30 fields using the CSV wizard. This will define the comma as a separator and the line-end character(s) as a terminator for each record.
2. Open the model in the DFDL editor.
3. Edit the generated model as follows:
- wrap a new sequence group around the existing sequence group that contains the 30 fields.
- remove the terminator from the original (now the inner) sequence group and put it onto the new, outer sequence group.
- within the *outer* sequence group, add one more string field. Call it 'remainingFields' or something like that. Set its lengthKind property to 'delimited'.
A sketch of the resulting structure is below.

Personally, I find it easiest to do this kind of structural change using an XSD editor, but you may prefer to do it using the DFDL editor.
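To make that concrete, here is a rough sketch of the edited structure (element names and the terminator value are illustrative, the usual xs/dfdl namespace prefixes are assumed, and a wizard-generated model will carry more properties via its default format):

Code:
<xs:element name="record">
  <xs:complexType>
    <!-- new outer sequence group: now owns the record terminator -->
    <xs:sequence dfdl:terminator="%CR;%LF; %LF;">
      <!-- original inner sequence group: comma-separated, terminator removed -->
      <xs:sequence dfdl:separator=",">
        <xs:element name="field1"  type="xs:string" dfdl:lengthKind="delimited"/>
        <!-- ... fields 2 to 29 ... -->
        <xs:element name="field30" type="xs:string" dfdl:lengthKind="delimited"/>
      </xs:sequence>
      <!-- swallows everything after field30, up to the record terminator -->
      <xs:element name="remainingFields" type="xs:string"
          dfdl:lengthKind="delimited"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>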
mqjeff
Posted: Fri Sep 08, 2017 6:18 am
Grand Master
Joined: 25 Jun 2008    Posts: 17447
timber wrote:
1. Model the 30 fields using the CSV wizard. This will define the comma as a separator and the line-end character(s) as a terminator for each record.

Does this assume that the 30 fields are located next to each other, and not dispersed across the entire record?

I.e.: <bunch of unneeded fields or none><30 fields><bunch of unneeded fields or none>
instead of
<bunch of unneeded fields or none><field X><bunch of unneeded fields><field Y> ... etc.?
timber
Posted: Fri Sep 08, 2017 8:18 am
Grand Master
Joined: 25 Aug 2015    Posts: 1292
@mqjeff: That's a very good point.

@visasimbu: You will need to model all fields up to and including the last field that you need to map. If you're lucky, that will be the 30th field. If you're very unlucky, it will be the 500th.

For each unmapped field that you need to model, you can refer to a single global string element 'unmappedField'. If there is a sequence of N unmapped fields, then you can set maxOccurs=N to consume them all. A sketch is below.
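A minimal sketch of that pattern (the run of 5 unmapped fields and the field name are invented for illustration; occursCountKind 'fixed' is one way to pin the count when minOccurs equals maxOccurs):

Code:
<!-- one global element, reused for every run of unmapped fields -->
<xs:element name="unmappedField" type="xs:string" dfdl:lengthKind="delimited"/>

<!-- inside the record's sequence: consume 5 unmapped fields, then map the next one -->
<xs:element ref="unmappedField" minOccurs="5" maxOccurs="5"
    dfdl:occursCountKind="fixed"/>
<xs:element name="mappedField6" type="xs:string" dfdl:lengthKind="delimited"/>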
urufberg
Posted: Fri Sep 08, 2017 11:18 am
Apprentice
Joined: 08 Sep 2017    Posts: 28
@visasimbu:
I think @timber's and @mqjeff's answers are pretty much the best way to approach your situation.

I just want to add that the worst-case scenario would be if you have one (or more) unneeded fields in between each of the fields you're going to use. In that case you would end up with around 60 fields in the model (I know it's a lot, but it's way better than 500).

I've found myself in this situation before, and what I always do is insert a dummyField (1, 2, 3, ...) and set its min and max occurrences to the specific number I need. With this solution you avoid creating more than one dummy field for each run of fields you're not going to use.

Hope it works for you.
rekarm01
Posted: Fri Sep 08, 2017 4:30 pm    Post subject: Re: Need to parse more than 500 fields in CSV file using DFDL
Grand Master
Joined: 25 Jun 2008    Posts: 1415
visasimbu wrote:
I'm looking for an effective way to parse 500 fields in a CSV file using DFDL.

What is the actual goal here? To improve performance, by reducing the memory, CPU, or other resources required to parse a message? Or just to simplify the message model?

If the goal is to improve performance, then simplifying the message model alone probably won't help; the parser still has to count delimiters to get to the last field of interest. On Demand parsing might help, though, if the rest of the message has a significant number of fields that don't need parsing.