Author |
Message
|
KJCB |
Posted: Thu Feb 01, 2018 2:38 am Post subject: Parsing a CSV file with variable number of columns |
|
|
Apprentice
Joined: 09 Dec 2017 Posts: 26
|
Hi all,
I need to parse and process a CSV File downloaded from an external system which contains a variable number of columns, in order to send the information to another system through a web service.
As mentioned, it has a variable number of columns, and I just need to send some of them to the service. Moreover, they are allocated in different positions.
Let me explain myself. For example, the first time I download the file it has the following appearance:
"Column 1","Column 2", "Column 3"
And I just need to send Columns 1 and 3 through the web service.
The second time I download the file it has the following appearance:
"Column 1","Column 3","Column 4".
In this case, I don't need column 4 information for anything.
Any idea on how to archive this requirement? It's seems that creating a DFDL model for CSV files is not the best option...
[/u] |
|
Back to top |
|
 |
souciance |
Posted: Thu Feb 01, 2018 3:24 am Post subject: |
|
|
Disciple
Joined: 29 Jun 2010 Posts: 169
|
Why wouldn't DFDL work? It has support for managing variable number of columns if you there is a proper delimiter. Then just extract the columns you want and send those.
Off course, how will you keep track of a file has been sent or not and then make a decision next time a file has come? |
|
Back to top |
|
 |
KJCB |
Posted: Thu Feb 01, 2018 5:57 am Post subject: |
|
|
Apprentice
Joined: 09 Dec 2017 Posts: 26
|
souciance wrote: |
Why wouldn't DFDL work? It has support for managing variable number of columns if you there is a proper delimiter. Then just extract the columns you want and send those.
Off course, how will you keep track of a file has been sent or not and then make a decision next time a file has come? |
Hi,
I'm using IIB v9.0 and when creating the DFDL it asks me for the number of columns. I can't see how to fit this to my requirement.
Regards, |
|
Back to top |
|
 |
Vitor |
Posted: Thu Feb 01, 2018 6:04 am Post subject: |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
KJCB wrote: |
I'm using IIB v9.0 and when creating the DFDL it asks me for the number of columns. |
So the CSV wizard expects a fixed number of columns, because that's the most common case for a CSV file.
This has nothing to do with DFDL's ability to model such a file, it just means the wizard isn't appropriate and you'll have to build it yourself. Using proper delimiters and (probably) discriminators. _________________ Honesty is the best policy.
Insanity is the best defence. |
|
Back to top |
|
 |
KJCB |
Posted: Thu Feb 01, 2018 6:17 am Post subject: |
|
|
Apprentice
Joined: 09 Dec 2017 Posts: 26
|
Vitor wrote: |
KJCB wrote: |
I'm using IIB v9.0 and when creating the DFDL it asks me for the number of columns. |
So the CSV wizard expects a fixed number of columns, because that's the most common case for a CSV file.
This has nothing to do with DFDL's ability to model such a file, it just means the wizard isn't appropriate and you'll have to build it yourself. Using proper delimiters and (probably) discriminators. |
Hi,
I understand I must create a new empty DFDL schema...or how?
I'm lost  |
|
Back to top |
|
 |
Vitor |
Posted: Thu Feb 01, 2018 6:38 am Post subject: |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
KJCB wrote: |
I understand I must create a new empty DFDL schema...or how? |
It's one of the other options to "use a wizard"
KJCB wrote: |
I'm lost  |
Once you've used the wizard, you are dropped into the DFDL editor with a pre-populated DFDL model based on the answers you gave the wizard. If you don't use that, you are dropped into the DFDL editor with a blank model.
One possible option (and I don't claim this to be the best one) would be to have the wizard create a CSV model with the maximum number of columns, and then edit it the model to deal with the optional columns. This really depends how variable the file layout is; if it's highly variable I'd be inclined to build the model long hand.
But that's just me. _________________ Honesty is the best policy.
Insanity is the best defence. |
|
Back to top |
|
 |
fjb_saper |
Posted: Thu Feb 01, 2018 7:07 am Post subject: |
|
|
 Grand High Poobah
Joined: 18 Nov 2003 Posts: 20756 Location: LI,NY
|
An other alternative would be to look at the actual layout. If you always have at a minimum 3 columns ....
What you really need here is a first line or "header" row describing the column content. You can then choose the content that is relevant for you and only extract the columns for which that content is relevant....  _________________ MQ & Broker admin |
|
Back to top |
|
 |
KJCB |
Posted: Thu Feb 01, 2018 7:30 am Post subject: |
|
|
Apprentice
Joined: 09 Dec 2017 Posts: 26
|
fjb_saper wrote: |
What you really need here is a first line or "header" row describing the column content. |
I can't configure the header row in my DFDL as I can't predict the column order.
fjb_saper wrote: |
You can then choose the content that is relevant for you and only extract the columns for which that content is relevant....  |
How can I archive this? |
|
Back to top |
|
 |
Vitor |
Posted: Thu Feb 01, 2018 8:08 am Post subject: |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
KJCB wrote: |
How can I archive this? |
I suggested achieving this with discriminators a bit further up.
Other solutions are undoubtedly possible and may be better. _________________ Honesty is the best policy.
Insanity is the best defence. |
|
Back to top |
|
 |
Vitor |
Posted: Thu Feb 01, 2018 8:14 am Post subject: |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
One other tack you could take is asking whoever's producing this file what controls which columns are included and their order (seriously - "Column 1" can be either before or after "Column 3??) and if there's any way this can be determined or included - perhaps a prefix to the file name?
Another tack would be to look at it as a human - when you look at the file, how do you determine which column is where and which ones you need? From that, how would you encode that into a DFDL model?
One final point is to remember that the DFDL is not the silver bullet of data parsing. There may come a point where DFDL can only go so far and to finally achieve your end you may need to introspect the message with code. But I would urge you to go as far as you can with DFDL; it's far and away the best place to do this, and it sounds to me like DFDL can handle it, albeit with a fairly extensive model. _________________ Honesty is the best policy.
Insanity is the best defence. |
|
Back to top |
|
 |
KJCB |
Posted: Thu Feb 01, 2018 8:50 am Post subject: |
|
|
Apprentice
Joined: 09 Dec 2017 Posts: 26
|
Vitor wrote: |
One other tack you could take is asking whoever's producing this file what controls which columns are included and their order (seriously - "Column 1" can be either before or after "Column 3??) and if there's any way this can be determined or included - perhaps a prefix to the file name?
Another tack would be to look at it as a human - when you look at the file, how do you determine which column is where and which ones you need? From that, how would you encode that into a DFDL model?
One final point is to remember that the DFDL is not the silver bullet of data parsing. There may come a point where DFDL can only go so far and to finally achieve your end you may need to introspect the message with code. But I would urge you to go as far as you can with DFDL; it's far and away the best place to do this, and it sounds to me like DFDL can handle it, albeit with a fairly extensive model. |
Hi All,
Thanks for your help.
Just an scenario:
- I have these three columns:
ProjectTitle, TargetDepartment and TargetCompany
Column "TargetDepartment" can be present or not. If not present, it doesn't appear as empty in the CSV (it is removed when exported, so the other two columns are together).
Question 1: How can I configure my DFDL to include this scenario? I guess I must set property minOccurs to 0, and which other property do I have to change?
Question 2: I only want to extract columns ProjectTitle and TargetCompany. The others are not important for me. Is it possible to model this scenario too? How?
Thanks again for your help. |
|
Back to top |
|
 |
fjb_saper |
Posted: Thu Feb 01, 2018 9:25 am Post subject: |
|
|
 Grand High Poobah
Joined: 18 Nov 2003 Posts: 20756 Location: LI,NY
|
If the order of the columns is not predictable, your header row should just have something like
column_name , occurs 1 to n for the fields.
So you would need to read the header row and analyze it to find which columns you want to extract...
Have fun  _________________ MQ & Broker admin |
|
Back to top |
|
 |
timber |
Posted: Fri Feb 02, 2018 9:30 am Post subject: |
|
|
 Grand Master
Joined: 25 Aug 2015 Posts: 1292
|
Quote: |
- I have these three columns:
ProjectTitle, TargetDepartment and TargetCompany
Column "TargetDepartment" can be present or not. If not present, it doesn't appear as empty in the CSV (it is removed when exported, so the other two columns are together). |
If you or some other engineer looked at a random example of this file, how you you know whether the TargetDepartment field was present or not? If you can answer that question, you will know how to solve the problem.
Based on what you've said so far, one option would be to parse each line as an array of columns and take the first and last columns from each row. That would be simple enough in DFDL and in ESQL. |
|
Back to top |
|
 |
Vitor |
Posted: Fri Feb 02, 2018 9:44 am Post subject: |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
timber wrote: |
Quote: |
- I have these three columns:
ProjectTitle, TargetDepartment and TargetCompany
Column "TargetDepartment" can be present or not. If not present, it doesn't appear as empty in the CSV (it is removed when exported, so the other two columns are together). |
If you or some other engineer looked at a random example of this file, how you you know whether the TargetDepartment field was present or not? If you can answer that question, you will know how to solve the problem. |
Vitor wrote: |
Another tack would be to look at it as a human - when you look at the file, how do you determine which column is where and which ones you need? |
_________________ Honesty is the best policy.
Insanity is the best defence. |
|
Back to top |
|
 |
KJCB |
Posted: Thu Feb 15, 2018 8:50 am Post subject: |
|
|
Apprentice
Joined: 09 Dec 2017 Posts: 26
|
Hi All,
Finally we applied another approach: a map/reduce implementation that extracts the information we want from the "unpredictable" CSV file received, generating a fixed XML as output with the information we want, which is configured dynamically based in a properties file.
That way we gained independence from the system generating the CSV files and the Integration Bus got its closed schema.
Thank you. |
|
Back to top |
|
 |
|