Author |
Message
|
BHill |
Posted: Thu Aug 18, 2016 8:16 pm Post subject: Parsing select columns from a CSV |
|
|
Newbie
Joined: 18 Aug 2016 Posts: 7
|
I've search through the forums and I don't think I've seen anyone ask about this. If this is a repeat, I apologize.
I get a CSV file with a header that needs to be inserted into DB2. If the column names are fixed, I've got a solution using a DFDL schema and then loading that with some simple eSQL. Unfortunately, I don't control the incoming CSV. The incoming file will be small (about 800 rows), but the number of columns can vary. I only need a subset of the columns, and I am guaranteed that the expected columns will be there. As I mentioned, the CSV comes with a header, and I need to locate the desired subset by column names. I spent the afternoon exploring how I could use DFDL to suck in the header as an array of items in a hidden reference and then try to find the indices of the desired columns to set variables. Even if that could work, it looks like IIB doesn't implement the need parts of DFDL for that.
I'm wondering what the cleanest means of doing this is. One idea is to pull it in as a blob, and I could just write my own JavaComputeNode to parse it out. Another idea was that most DB's are also capable of sucking in raw CSV and parsing them. It seemed unlikely, but could a blob be handed to DB2 but tell DB2 not to treat it as a blob but to parse it as it would a file? Is there another route to do this? I'm new to IIB, so I'm just looking for some guidance on the IIB way of handling such a situation.
Any advice is appreciated. |
|
Back to top |
|
 |
mqjeff |
Posted: Fri Aug 19, 2016 4:02 am Post subject: |
|
|
Grand Master
Joined: 25 Jun 2008 Posts: 17447
|
Use DFDL to parse the entire CSV, using the header row if possible to identify the columns by name. Pretty sure that DFDL does that, and by default with the CSV base mdoel... (paging timber... )
You should then be able to simply refer to the specific columns of each row by name. The rest of the columns will be parsed, but ignored by your code. _________________ chmod -R ugo-wx / |
|
Back to top |
|
 |
BHill |
Posted: Tue Aug 23, 2016 7:23 am Post subject: |
|
|
Newbie
Joined: 18 Aug 2016 Posts: 7
|
Ok, thanks. I'll give DFDL another go. I'll try importing the CommaSeparatedFormat.xsd schema after reading the header. Doing so beforehand gives me complaints about using an occursCountKind of 'parsed', since it is incompatible when considering an ordered sequence.
Any other guidance from DFDL experts would be appreciated. |
|
Back to top |
|
 |
BHill |
Posted: Tue Aug 23, 2016 2:49 pm Post subject: |
|
|
Newbie
Joined: 18 Aug 2016 Posts: 7
|
So far, I've had no luck in figuring out how to get DFDL to identify columns by names.
The default IBM example for ingesting CSV files with headers expect a fixed number of columns and then just assigns the values to head_field1, head_field2, etc.
The daffodil simpleCSV example gets closer. It just creates an array element all with the same name:
<title>Hdr1</title>
<title>Hdr2</title>
.etc
And then it just does the same with each field in a record:
<item>Foo</item>
<item>Bar</item>
etc.
Your suggested route would involve two things I haven't found examples of:
1. Ingesting non-fixed number of elements and naming the element based on the value.
2. Using column-named elements to specify which items to grab out of a record-line.
I might be able to work around #1 be coding up the header parsing as a choices with a discriminators that look for expected values. Even if I have that though, I'm not sure I've seen any DFDL examples that can parse out values by either a column-name or index value #2.
I'm glad to work through this, I'm just not finding any examples that get me on the right path. I appreciate people's time in suggesting places to look.
As additional constraint, I'm using IIB v9, so that also reduces a few DFDL features, but it doesn't seem by much.
Thanks! |
|
Back to top |
|
 |
fjb_saper |
Posted: Tue Aug 23, 2016 6:29 pm Post subject: |
|
|
 Grand High Poobah
Joined: 18 Nov 2003 Posts: 20756 Location: LI,NY
|
BHill wrote: |
So far, I've had no luck in figuring out how to get DFDL to identify columns by names.
The default IBM example for ingesting CSV files with headers expect a fixed number of columns and then just assigns the values to head_field1, head_field2, etc.
The daffodil simpleCSV example gets closer. It just creates an array element all with the same name:
<title>Hdr1</title>
<title>Hdr2</title>
.etc
And then it just does the same with each field in a record:
<item>Foo</item>
<item>Bar</item>
etc.
Your suggested route would involve two things I haven't found examples of:
1. Ingesting non-fixed number of elements and naming the element based on the value.
2. Using column-named elements to specify which items to grab out of a record-line.
I might be able to work around #1 be coding up the header parsing as a choices with a discriminators that look for expected values. Even if I have that though, I'm not sure I've seen any DFDL examples that can parse out values by either a column-name or index value #2.
I'm glad to work through this, I'm just not finding any examples that get me on the right path. I appreciate people's time in suggesting places to look.
As additional constraint, I'm using IIB v9, so that also reduces a few DFDL features, but it doesn't seem by much.
Thanks! |
You need to do this dynamically. You parse the header line and keep it handy. Then you parse your record. Next thing you go through both your header line and the record data keeping place. The value of the header line field should match the value of your data label and when they match what you need you can extract them...
Have fun  _________________ MQ & Broker admin |
|
Back to top |
|
 |
BHill |
Posted: Tue Aug 23, 2016 6:40 pm Post subject: |
|
|
Newbie
Joined: 18 Aug 2016 Posts: 7
|
I'm currently looking at just ingesting everything as <title> and <item> arrays, and then sorting out the desired columns and data in eSQL. Maybe this is what you meant.
There may be more elegant means through DFDL, but I'm not finding enough example code to work out anything but trivial examples or clever things that require features that IBM hasn't implemented. |
|
Back to top |
|
 |
shanson |
Posted: Thu Aug 25, 2016 2:27 am Post subject: |
|
|
 Partisan
Joined: 17 Oct 2003 Posts: 344 Location: IBM Hursley
|
DFDL can not assign element names dynamically. It always uses the element name declared in the schema. |
|
Back to top |
|
 |
BHill |
Posted: Thu Aug 25, 2016 10:13 am Post subject: |
|
|
Newbie
Joined: 18 Aug 2016 Posts: 7
|
Thanks shanson. That's what it looked like to me. Given that I know what text I am looking for in the header items, I think I can setup a choice that matches the value to an expected string. That would parse the header values into similarly named elements.
Of course, that isn't quite the end goal. What matters is having a means to pull the correct row values out and insert them into a row record element that contains elements from all the desired fields. So my header parsing is really being done in the hopes that doing so also opens a mechanism to inform the parser which fields to chose out of each row (e.g. save column indices, or establish column names). That part I'm not seeing a mechanism for.
Thanks |
|
Back to top |
|
 |
mqjeff |
Posted: Thu Aug 25, 2016 10:19 am Post subject: |
|
|
Grand Master
Joined: 25 Jun 2008 Posts: 17447
|
Are you trying to deal with a case where the header row has more columns in it than the records?
That's an unsolvable problem, unless the number of field separators in each record matches the number of columns in the header
If there are the same number of separators, then you can model the record row with the right column name as the name of each element. Then make sure that any fields that are not populated are treated as null/nil/whatever.
Then your ESQL/etc. can just access the fields it needs to by name.
Model the header row so that it appears as a header, and not as the first row in the records. Likewise with any trailers. _________________ chmod -R ugo-wx / |
|
Back to top |
|
 |
BHill |
Posted: Thu Aug 25, 2016 10:37 am Post subject: |
|
|
Newbie
Joined: 18 Aug 2016 Posts: 7
|
No, the header has the same column count as the rows. No nightmare there.
I just need to sift through the header to find the desired columns, and then translate my knowledge of which columns contain the desired data to the row parsing.
mqjeff The point I'm lost on is how to transfer/use that knowledge in row parsing. Is there an example I can look at? Something that directs element "foo" to pull its value from the column that had a header name "foo".
Thanks |
|
Back to top |
|
 |
mqjeff |
Posted: Thu Aug 25, 2016 11:02 am Post subject: |
|
|
Grand Master
Joined: 25 Jun 2008 Posts: 17447
|
No, you can't really do that.
If it's a CSV file, the columns have to be in the same order as the headers. Well, they are assumed to be in the same order. The value of field ten is always taken as the value of the field named by header field 10.
So if header 10 says "foo", then field 10 in the record model should be named 'foo'. You can't tell the parser that field 10 in the model should use the name of field 10 in the header.
Although perhaps shanson will correct me. _________________ chmod -R ugo-wx / |
|
Back to top |
|
 |
BHill |
Posted: Fri Aug 26, 2016 2:05 pm Post subject: |
|
|
Newbie
Joined: 18 Aug 2016 Posts: 7
|
Ok, thanks mqjeff, and everyone. I appreciate it.
It looks like I should just parse it all out as arrays (header arrays, and row arrays) and sort out the columns I want in eSQL or possibly a compute node. |
|
Back to top |
|
 |
|