MQSeries.net :: View topic - Parsing select columns from a CSV

MQSeries.net

Tech Exchange

Education

Certifications

Library

Info Center

SupportPacs

FAQÂ Â

Usergroups

RSS Feed - WebSphere MQ Support

RSS Feed - Message Broker Support

MQSeries.net Forum Index » WebSphere Message Broker (ACE) Support » Parsing select columns from a CSV

Parsing select columns from a CSV

« View previous topic :: View next topic »

Author

Message

BHill

Posted: Thu Aug 18, 2016 8:16 pm Post subject: Parsing select columns from a CSV

Newbie

Joined: 18 Aug 2016
Posts: 7

I've search through the forums and I don't think I've seen anyone ask about this. If this is a repeat, I apologize.

I get a CSV file with a header that needs to be inserted into DB2. If the column names are fixed, I've got a solution using a DFDL schema and then loading that with some simple eSQL. Unfortunately, I don't control the incoming CSV. The incoming file will be small (about 800 rows), but the number of columns can vary. I only need a subset of the columns, and I am guaranteed that the expected columns will be there. As I mentioned, the CSV comes with a header, and I need to locate the desired subset by column names. I spent the afternoon exploring how I could use DFDL to suck in the header as an array of items in a hidden reference and then try to find the indices of the desired columns to set variables. Even if that could work, it looks like IIB doesn't implement the need parts of DFDL for that.

I'm wondering what the cleanest means of doing this is. One idea is to pull it in as a blob, and I could just write my own JavaComputeNode to parse it out. Another idea was that most DB's are also capable of sucking in raw CSV and parsing them. It seemed unlikely, but could a blob be handed to DB2 but tell DB2 not to treat it as a blob but to parse it as it would a file? Is there another route to do this? I'm new to IIB, so I'm just looking for some guidance on the IIB way of handling such a situation.

Any advice is appreciated.

mqjeff

Posted: Fri Aug 19, 2016 4:02 am Post subject:

Grand Master

Joined: 25 Jun 2008
Posts: 17447

Use DFDL to parse the entire CSV, using the header row if possible to identify the columns by name. Pretty sure that DFDL does that, and by default with the CSV base mdoel... (paging timber... )

You should then be able to simply refer to the specific columns of each row by name. The rest of the columns will be parsed, but ignored by your code.
_________________
chmod -R ugo-wx /

BHill

Posted: Tue Aug 23, 2016 7:23 am Post subject:

Newbie

Joined: 18 Aug 2016
Posts: 7

Ok, thanks. I'll give DFDL another go. I'll try importing the CommaSeparatedFormat.xsd schema after reading the header. Doing so beforehand gives me complaints about using an occursCountKind of 'parsed', since it is incompatible when considering an ordered sequence.

Any other guidance from DFDL experts would be appreciated.

BHill

Posted: Tue Aug 23, 2016 2:49 pm Post subject:

Newbie

Joined: 18 Aug 2016
Posts: 7

So far, I've had no luck in figuring out how to get DFDL to identify columns by names.

The default IBM example for ingesting CSV files with headers expect a fixed number of columns and then just assigns the values to head_field1, head_field2, etc.

The daffodil simpleCSV example gets closer. It just creates an array element all with the same name:
<title>Hdr1</title>
<title>Hdr2</title>
.etc

And then it just does the same with each field in a record:
<item>Foo</item>
<item>Bar</item>
etc.

Your suggested route would involve two things I haven't found examples of:
1. Ingesting non-fixed number of elements and naming the element based on the value.
2. Using column-named elements to specify which items to grab out of a record-line.

I might be able to work around #1 be coding up the header parsing as a choices with a discriminators that look for expected values. Even if I have that though, I'm not sure I've seen any DFDL examples that can parse out values by either a column-name or index value #2.

I'm glad to work through this, I'm just not finding any examples that get me on the right path. I appreciate people's time in suggesting places to look.

As additional constraint, I'm using IIB v9, so that also reduces a few DFDL features, but it doesn't seem by much.

Thanks!

fjb_saper

Posted: Tue Aug 23, 2016 6:29 pm Post subject:

Grand High Poobah

Joined: 18 Nov 2003
Posts: 20763
Location: LI,NY

BHill wrote:

You need to do this dynamically. You parse the header line and keep it handy. Then you parse your record. Next thing you go through both your header line and the record data keeping place. The value of the header line field should match the value of your data label and when they match what you need you can extract them...
Have fun

_________________
MQ & Broker admin

BHill

Posted: Tue Aug 23, 2016 6:40 pm Post subject:

Newbie

Joined: 18 Aug 2016
Posts: 7

I'm currently looking at just ingesting everything as <title> and <item> arrays, and then sorting out the desired columns and data in eSQL. Maybe this is what you meant.

There may be more elegant means through DFDL, but I'm not finding enough example code to work out anything but trivial examples or clever things that require features that IBM hasn't implemented.

shanson

Posted: Thu Aug 25, 2016 2:27 am Post subject:

Partisan

Joined: 17 Oct 2003
Posts: 344
Location: IBM Hursley

DFDL can not assign element names dynamically. It always uses the element name declared in the schema.

BHill

Posted: Thu Aug 25, 2016 10:13 am Post subject:

Newbie

Joined: 18 Aug 2016
Posts: 7

Thanks shanson. That's what it looked like to me. Given that I know what text I am looking for in the header items, I think I can setup a choice that matches the value to an expected string. That would parse the header values into similarly named elements.

Of course, that isn't quite the end goal. What matters is having a means to pull the correct row values out and insert them into a row record element that contains elements from all the desired fields. So my header parsing is really being done in the hopes that doing so also opens a mechanism to inform the parser which fields to chose out of each row (e.g. save column indices, or establish column names). That part I'm not seeing a mechanism for.

Thanks

mqjeff

Posted: Thu Aug 25, 2016 10:19 am Post subject:

Grand Master

Joined: 25 Jun 2008
Posts: 17447

Are you trying to deal with a case where the header row has more columns in it than the records?

That's an unsolvable problem, unless the number of field separators in each record matches the number of columns in the header

If there are the same number of separators, then you can model the record row with the right column name as the name of each element. Then make sure that any fields that are not populated are treated as null/nil/whatever.

Then your ESQL/etc. can just access the fields it needs to by name.

Model the header row so that it appears as a header, and not as the first row in the records. Likewise with any trailers.
_________________
chmod -R ugo-wx /

BHill

Posted: Thu Aug 25, 2016 10:37 am Post subject:

Newbie

Joined: 18 Aug 2016
Posts: 7

No, the header has the same column count as the rows. No nightmare there.

I just need to sift through the header to find the desired columns, and then translate my knowledge of which columns contain the desired data to the row parsing.

mqjeff The point I'm lost on is how to transfer/use that knowledge in row parsing. Is there an example I can look at? Something that directs element "foo" to pull its value from the column that had a header name "foo".

Thanks

mqjeff

Posted: Thu Aug 25, 2016 11:02 am Post subject:

Grand Master

Joined: 25 Jun 2008
Posts: 17447

No, you can't really do that.

If it's a CSV file, the columns have to be in the same order as the headers. Well, they are assumed to be in the same order. The value of field ten is always taken as the value of the field named by header field 10.

So if header 10 says "foo", then field 10 in the record model should be named 'foo'. You can't tell the parser that field 10 in the model should use the name of field 10 in the header.

Although perhaps shanson will correct me.
_________________
chmod -R ugo-wx /

BHill

Posted: Fri Aug 26, 2016 2:05 pm Post subject:

Newbie

Joined: 18 Aug 2016
Posts: 7

Ok, thanks mqjeff, and everyone. I appreciate it.

It looks like I should just parse it all out as arrays (header arrays, and row arrays) and sort out the columns I want in eSQL or possibly a compute node.

Display posts from previous:

Page 1 of 1

MQSeries.net Forum Index » WebSphere Message Broker (ACE) Support » Parsing select columns from a CSV

Jump to:

You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum

Protected by Anti-Spam ACP