Validating CSV File
whydieanut
PostPosted: Tue Feb 07, 2012 4:29 am    Post subject: Validating CSV File

Hi all,

I need to read a large file with multiple lines of CSV data (each line separated by CR LF), using the FileInput node.
I need to pass the entire file's content as a single message, with each line mapping to a row of the output format (SAP Data Record segments).

Example:
row1_field1,row1_field2,row1_field3,row1_field4,row1_field5
row2_field1,row2_field2,row2_field3,row2_field4,row2_field5
row3_field1,row3_field2,row3_field3,row3_field4,row3_field5
row4_field1,row4_field2,row4_field3,row4_field4,row4_field5



I have a Message Type Order, which is a complex type, comma delimited, to represent each line of the CSV data.

I have another Message Type Order_List, which is a complex type consisting of repeating occurrences of type Order, with a CR LF delimiter.

My input file (sample above) successfully gets processed with this Message Set.

I am now required to validate my input message for the values of certain fields (integer checking, date format, etc.).
In case of any validation errors, I need to write out an error log with details of all lines (and possibly which field) that had an error, and then reject the whole file.
Otherwise I need to process it.



Given these requirements, validation at the FileInput node isn't giving me the required result.
What is a good approach to take for such a requirement?
Or is it asking too much of WMB?

Any advice/suggestions...?
Vitor
PostPosted: Tue Feb 07, 2012 5:53 am    Post subject: Re: Validating CSV File

whydieanut wrote:
What is a good approach to take for such a requirement?


Use the CSV sample as a model for your actual file layout, identifying those fields with specific formats that you need to check. Use that for validation.

whydieanut wrote:
Or is it asking too much of WMB?




whydieanut wrote:
Any advice/suggestions...?


I'm sure @kimbert will be along in a moment with a much better idea.
_________________
Honesty is the best policy.
Insanity is the best defence.
smdavies99
PostPosted: Tue Feb 07, 2012 6:44 am

Quote:
I am now required to validate my input message for the values of certain fields (integer checking, date format, etc.).
In case of any validation errors, I need to write out an error log with details of all lines (and possibly which field) that had an error, and then reject the whole file.


You could add lots of enumerations to the message model, but that may not satisfy your requirement.

If you really, really have to tell the admins that field 36B failed on row 1034, then there is no real alternative except to do the value validation in ESQL or Java. Then you can control what happens when you get an error, etc. It will slow down the flow considerably.

I'd also make the manual validation configurable (just in case), so that when management decides that their terrible data is 'not the cause of the slow system', you can turn the validation off without making code changes.
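One way to make that switch is a user-defined property. A minimal sketch, assuming a UDP named VALIDATION_ENABLED and a hypothetical helper procedure ValidateAllRows():

Code:

-- Minimal sketch: a module-level EXTERNAL variable becomes a user-defined
-- property, so validation can be toggled in the BAR file without code changes.
CREATE COMPUTE MODULE ValidateCsv_Compute
    DECLARE VALIDATION_ENABLED EXTERNAL CHARACTER 'Y';  -- assumed UDP name

    CREATE FUNCTION Main() RETURNS BOOLEAN
    BEGIN
        SET OutputRoot = InputRoot;
        IF VALIDATION_ENABLED = 'Y' THEN
            CALL ValidateAllRows();  -- hypothetical: the manual field checks
        END IF;
        RETURN TRUE;
    END;
END MODULE;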

If you do go down the Enumeration and exception route, you will become a master at traversing the ExceptionList and extracting the information you need for the error log.
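The usual traversal pattern looks roughly like this (a sketch; in a Compute node the correlation name is InputExceptionList, and the useful child fields vary by exception type):

Code:

-- Descend to the deepest (most specific) exception, keeping the last
-- message text seen on the way down for the error log.
DECLARE exRef REFERENCE TO InputExceptionList.*[1];
DECLARE errText CHARACTER '';
WHILE LASTMOVE(exRef) DO
    IF exRef.Text IS NOT NULL THEN
        SET errText = exRef.Text;
    END IF;
    MOVE exRef LASTCHILD;  -- LASTMOVE() becomes FALSE at the bottom
END WHILE;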
_________________
WMQ User since 1999
MQSI/WBI/WMB/'Thingy' User since 2002
Linux user since 1995

Every time you reinvent the wheel the more square it gets (anon). If in doubt think and investigate before you ask silly questions.
kimbert
PostPosted: Tue Feb 07, 2012 7:32 am

You're using the FileInput node, right? So you can split the input into records before they arrive in the flow. You can use the MRM's validation rules (i.e. XML Schema) to validate the field values in each individual record. If that fails, you can write that single failing record to a failure file somewhere.
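The record splitting is node configuration rather than code. A sketch of the relevant FileInput properties (values chosen for this CR LF case; check the names for your version):

Code:

FileInput node (configuration sketch):
  Records and elements -> Record detection : Delimited
  Records and elements -> Delimiter        : DOS or UNIX Line End
  Input message parsing -> Message domain  : MRM
  Input message parsing -> Message type    : Order   (the per-line type)

With that, each propagated message is one CSV line, already parsed (and optionally validated) against the Order definition.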

Now all you have to do is work out how to start a new failure file each time the input file changes. Over to somebody else for that.
mqsiuser
PostPosted: Tue Feb 07, 2012 7:46 am

smdavies99 wrote:
It will slow down the flow considerably.


Consider that throwing exceptions is expensive... so if you want to collect all validation errors, instead of stopping at the first one (by throwing an exception), write your validation (of e.g. dates and enumerations) in ESQL code. In particular, do not throw and catch exceptions repeatedly! Parse your fields in as strings ("CHAR") and write your validation ESQL against those strings. (For a DATE, probably just check the format and not whether it really is an accurate date... validating a date manually and with full accuracy is problematic, I must admit.)

Experienced people will agree that CHAR/String is a reasonable choice in the message set (over trying to put in a lot of type information - "INT", DateTime formatting, etc. - at that point; it is too early there). The truth is that it is better to do some work (e.g. CASTs) in code than to get dubious parsing errors.

smdavies99 wrote:
If you really, really have to tell the admins that field 36B failed on row 1034, then there is no real alternative except to do the value validation in ESQL or Java. Then you can control what happens when you get an error, etc.


Write the validation in ESQL! I usually combine validation and transformation in one flow. Sounds awkward, but I think practitioners can/will confirm this approach.

If you use e.g. XSD validation, you are constrained to what XSD is capable of (smdavies99 says "[...] but that may not satisfy your requirement"). Well... probably XSD is capable of everything you want to do... but validation errors will throw exceptions (perhaps you can turn this off?! - I guess not), and most importantly you cannot really react to validation errors as you like (as smdavies99 states). Ideally you write some utility procedures which help you validate things. Sorry, but I am serious about that.

I am providing a couple of very basic functions within my project which can also be used for validation: isAlphanumeric(), isAlphaOrNumeric(), isIdentifier(), isNumeric(), containsCharsOnly() and countCharInChar().

These return TRUE/FALSE/INT, which you can then use to do whatever you like: e.g. throw an exception (well, probably not), just write a log entry, or collect all errors and FINALLY throw an exception and/or return the list of validation issues to the sender.
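To give a flavour of what such helpers can look like, here is a sketch (mine, not the project's actual code) that scans characters instead of throwing:

Code:

-- Sketch of an isNumeric()-style helper: TRUE only if the CHAR field is
-- non-empty and contains digits only. No exceptions are thrown.
CREATE FUNCTION isNumeric(IN fieldValue CHARACTER) RETURNS BOOLEAN
BEGIN
    DECLARE i INTEGER 1;
    IF fieldValue IS NULL OR LENGTH(fieldValue) = 0 THEN
        RETURN FALSE;
    END IF;
    WHILE i <= LENGTH(fieldValue) DO
        IF SUBSTRING(fieldValue FROM i FOR 1) < '0'
        OR SUBSTRING(fieldValue FROM i FOR 1) > '9' THEN
            RETURN FALSE;
        END IF;
        SET i = i + 1;
    END WHILE;
    RETURN TRUE;
END;

-- Format-only date check, as suggested above: shape, not calendar accuracy.
CREATE FUNCTION looksLikeDate(IN s CHARACTER) RETURNS BOOLEAN
BEGIN
    RETURN LENGTH(s) = 8 AND isNumeric(s);  -- assumes a YYYYMMDD layout
END;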
_________________
Just use REFERENCEs


whydieanut
PostPosted: Tue Feb 07, 2012 9:38 pm

Whoa! All the biggies are here...! I must choose my next words carefully...

Vitor wrote:

Use the CSV sample as a model for your actual file layout, identifying those fields with specific formats that you need to check. Use that for validation.

Used just that to get my MsgSet working...

smdavies99 wrote:

You could add lots of enumerations to the message model, but that may not satisfy your requirement.

If you really, really have to tell the admins that field 36B failed on row 1034, then there is no real alternative except to do the value validation in ESQL or Java. Then you can control what happens when you get an error, etc. It will slow down the flow considerably.

I'd also make the manual validation configurable (just in case), so that when management decides that their terrible data is 'not the cause of the slow system', you can turn the validation off without making code changes.

If you do go down the enumeration and exception route, you will become a master at traversing the ExceptionList and extracting the information you need for the error log.


So manual validation using ESQL IS a valid option, though it may considerably slow down the processing?
Is it then possible to do some kind of validation using a Message Type (in my case Order), which would have the field type restrictions specified, for each line?
For example, I could loop through all line items and just call a function ValidateOrderLineItem(LineItemReference), which would return TRUE or FALSE (or better, a list of fields with errors).
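That loop could look something like this (a sketch; it assumes the repeating element under the MRM root is named Order, and ValidateOrderLineItem is the function described above):

Code:

-- Walk every Order line with a reference, collecting failures rather than
-- stopping at the first one.
DECLARE lineRef REFERENCE TO InputRoot.MRM.Order[1];
DECLARE rowNum INTEGER 1;
WHILE LASTMOVE(lineRef) DO
    IF NOT ValidateOrderLineItem(lineRef) THEN
        -- remember which row failed, for the error log
        CREATE LASTCHILD OF Environment.Errors NAME 'Row' VALUE rowNum;
    END IF;
    MOVE lineRef NEXTSIBLING REPEATNAME;
    SET rowNum = rowNum + 1;
END WHILE;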


kimbert wrote:

You're using the FileInput node, right? So you can split the input into records before they arrive in the flow. You can use the MRM's validation rules (i.e. XML Schema) to validate the field values in each individual record. If that fails, you can write that single failing record to a failure file somewhere.

Now all you have to do is work out how to start a new failure file each time the input file changes. Over to somebody else for that.


This approach would be best for me. Also, the FileInput node gives you the record index if the file is split into multiple messages; I could just use that.
I have a requirement that either the file be processed completely (i.e. no errors at all) or rejected completely (i.e. if even a single line has an error).
The moment I split the file, I have issues enforcing that rule, because the FileInput node is non-transactional.
Or am I missing something very basic?
whydieanut
PostPosted: Tue Feb 07, 2012 9:54 pm

mqsiuser wrote:

Consider that throwing exceptions is expensive... so if you want to collect all validation errors, instead of stopping at the first one (by throwing an exception), write your validation (of e.g. dates and enumerations) in ESQL code. In particular, do not throw and catch exceptions repeatedly! Parse your fields in as strings ("CHAR") and write your validation ESQL against those strings. (For a DATE, probably just check the format and not whether it really is an accurate date... validating a date manually and with full accuracy is problematic, I must admit.)

Experienced people will agree that CHAR/String is a reasonable choice in the message set (over trying to put in a lot of type information - "INT", DateTime formatting, etc. - at that point; it is too early there). The truth is that it is better to do some work (e.g. CASTs) in code than to get dubious parsing errors.

Yes, currently my message set has logical properties set to decimal, integer and so on, which causes the message not to be parsed at all, even though validation isn't enabled. The parsing itself fails when it encounters an integer field with a character value in the input message. So I figure it's better to leave the field types as CHAR and do the validation in ESQL.
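One detail that may help with the CHAR-then-CAST approach: if your Broker level supports the DEFAULT clause on CAST, the cast itself need not throw. A sketch (Quantity is an invented field name; lineRef is a reference to the current line):

Code:

-- DEFAULT substitutes a sentinel instead of raising an exception when the
-- text is not a valid integer, so errors can be collected, not thrown.
DECLARE qty INTEGER CAST(lineRef.Quantity AS INTEGER DEFAULT -1);
IF qty = -1 THEN
    CREATE LASTCHILD OF Environment.Errors NAME 'Field' VALUE 'Quantity';
END IF;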



mqsiuser wrote:

but validation errors will throw exceptions (perhaps you can turn this off?! - I guess not), and most importantly you cannot really react to validation errors as you like (as smdavies99 states). Ideally you write some utility procedures which help you validate things. Sorry, but I am serious about that.

I am providing a couple of very basic functions within my project which can also be used for validation: isAlphanumeric(), isAlphaOrNumeric(), isIdentifier(), isNumeric(), containsCharsOnly() and countCharInChar().

These return TRUE/FALSE/INT, which you can then use to do whatever you like: e.g. throw an exception (well, probably not), just write a log entry, or collect all errors and FINALLY throw an exception and/or return the list of validation issues to the sender.

I really wanted to avoid having to manually validate each field with functions like the above, and instead use some kind of built-in validation technique; maybe use the CREATE ... PARSE combination with the Validate option set, to check each line of the CSV data against the Order Message Type (like the ValidateOrderLineItem() function I mentioned above).
I have too many different types of messages, with different fields, to keep this manageable in code.
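For what it's worth, the CREATE ... PARSE idea might look roughly like this. A sketch only: the SET/TYPE/FORMAT values are placeholders for the real message set, and the exact behaviour of the validation options should be checked against the documentation for your Broker level:

Code:

-- Re-serialize one line and re-parse it with validation on; an EXIT
-- handler turns the validation exception into a TRUE/FALSE result.
DECLARE ok BOOLEAN TRUE;
BEGIN
    DECLARE EXIT HANDLER FOR SQLSTATE LIKE '%' BEGIN
        SET ok = FALSE;  -- validation (or parsing) failed for this line
    END;
    DECLARE lineBits BLOB ASBITSTREAM(lineRef OPTIONS FolderBitStream
                  CCSID 1208 SET 'ORDER_SET' TYPE 'Order' FORMAT 'CSV1');
    CREATE LASTCHILD OF Environment.Check DOMAIN('MRM')
        PARSE(lineBits CCSID 1208 SET 'ORDER_SET' TYPE 'Order' FORMAT 'CSV1'
              OPTIONS BITOR(ValidateContentAndValue, ValidateException));
END;
-- ok is now FALSE if this line failed validation

Note that catching an exception per bad line is exactly the "exceptions are expensive" cost mqsiuser warned about, so this trades convenience for speed.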
whydieanut
PostPosted: Tue Feb 07, 2012 9:59 pm    Post subject: Re: Validating CSV File

Vitor wrote:

whydieanut wrote:
Or is it asking too much of WMB?





What I meant was: just because something can be done in the middleware layer doesn't mean that it has to be done here.
Can I just go ahead and tell the functional guys that this is not the responsibility of middleware (considering that the only feasible option seems to impact performance)?
mqsiuser
PostPosted: Tue Feb 07, 2012 11:26 pm    Post subject: Re: Validating CSV File

whydieanut wrote:
What I meant was: just because something can be done in the middleware layer doesn't mean that it has to be done here.
Can I just go ahead and tell the functional guys that this is not the responsibility of middleware (considering that the only feasible option seems to impact performance)?


Yes, a lot of people argue like that. You push the responsibility for valid (field) contents onto the partners, which is the alternative to having to validate and check everything yourself (which is a lot of work and responsibility for you).

You have a challenging structural transformation to do already (as always with SAP)!

I do not see that performance is that much of an issue (how many orders can a company process?!).
_________________
Just use REFERENCEs
whydieanut
PostPosted: Wed Feb 08, 2012 12:36 am

I am expecting a peak load of about 10,000 lines of CSV data in the file...
The SAP guys want the entire contents of the file in a single IDoc.
mqsiuser
PostPosted: Wed Feb 08, 2012 1:09 am

whydieanut wrote:
I am expecting a peak load of about 10,000 lines of CSV data in the file...
The SAP guys want the entire contents of the file in a single IDoc.


There are a lot of ways to ruin performance... but I do not consider 10,000 lines to be much for Broker.

The problem is that the execution group abends when it exceeds 100 MB of memory usage (Broker 6.0)... Did this change in other Broker releases (6.1, 7, 8)?

You have to consider this! You have a list of Orders... you need to find out your Broker's limit and ask the sending partner to send e.g. 200 orders at most within one file at any given time. Or of course you take care of this issue yourself:

If you try to parse a single file with too many records, the execution group abends while parsing the in-message (you will never get a chance to do anything in code to rescue the situation). FileInput parses each line one by one... but you still have to construct a big out-message, and be careful when doing so. You could probably append each order to the (one) output file (though I have never done it like that); then your flow just processes single orders and still writes one file. Though there is an issue with reading such files in: you'd have to do the writes and then, at the end, move ("mv") the file when you are ready. You can likely not process single lines from input to output, because of message structure issues (!).

They (SAP) can very likely live with receiving e.g. 5 files/messages (with 200 orders each) instead of 1 file (with 1000 orders). Be safe: e.g. 100 orders per file might be OK. SAP just does not like getting a file/message for each order.
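On the append-and-then-move idea: the FileOutput node supports something close to it out of the box - records are appended one per message, and the file is only completed when a message reaches the node's Finish File input terminal. A configuration sketch (check the property names for your version):

Code:

FileOutput node (configuration sketch):
  Records and elements -> Record definition : Record is Delimited data
  Records and elements -> Delimiter         : Broker System Line End

Each message propagated to the In terminal appends one record; a message sent to the Finish File terminal closes the file, so downstream readers never see it half-written.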
_________________
Just use REFERENCEs


kimbert
PostPosted: Wed Feb 08, 2012 1:32 am

mqsiuser said:
Quote:
I do not consider 10,000 lines to be much for Broker.

The problem is that the execution group abends when it exceeds 100 MB of memory usage (Broker 6.0)... Did this change in other Broker releases (6.1, 7, 8)?

You have to consider this! You have a list of Orders... you need to find out your Broker's limit and ask them to send (or you take care of this) e.g. 200 orders at most within one file at any given time.
The best way to process a large file of repeating records is to process it one record at a time. Once you have done that, you can process files of arbitrary size.
I'm not sure why mqsiuser is recommending splitting up the input file, but I may be missing something. It seems to me that controlling the transactional behaviour of the flow is one of the key challenges.
mqsiuser
PostPosted: Wed Feb 08, 2012 1:42 am

kimbert wrote:
The best way to process a large file of repeating records is to process it one record at a time. Once you have done that, you can process files of arbitrary size.


It would be great if you could always just process one record/line at a time, like "in -> process -> out". I do not think that works for SAP systems.

Explanation with XML as output: to populate a (single) XML out-message (SAP order) you first have to read in all flat-file records/lines that belong to that order. An XML (out-)message has only one root element (...that is so trivial, I must say), so you need to load everything (into your flow) that you want to write out into a single XML message, e.g. 1 order. This also holds true for e.g. 200 orders: you have to read in all lines/records to create one out XML message! This is just a simple (probably weak) argument, but I guess you write XML "at once" and not "start, append, end" (across several flow executions). Some adapter might be able to do that for you, but I would not recommend writing out an XML message across several flow executions. A more proper explanation concerns the structural transformation (from in- to out-message):

Explanation (even when using flat-file output): if you have different record types/lines (e.g. an order header, order lines and an order trailer for each order), then it is very likely not possible to process each record/line independently. You likely have to read at least each whole order (its "order header, lines and trailer") fully into your flow (that will be e.g. 20 lines/records). Otherwise you will not be able to mediate all possible structural differences. This means you will probably not be able to properly populate the fields in the IDoc (flat-file) output message.

kimbert wrote:
I'm not sure why mqsiuser is recommending splitting up the input file, but I may be missing something.


This just offloads responsibility to the sending application. I guess reading in line by line is fair enough. Finish flow processing after each order (which will be a couple of lines/records), or after a batch of orders (e.g. 10 or 100). Note that this requirement is somewhat created by Broker, not the other end (SAP) - and it would be a bad idea to impose requirements from the receiving system onto the sending system.

Unfortunately, I think the OP has a file with all order headers and a file with all order lines (and a file with order batches?!) and needs to correlate these?!

In that case the sending application should split the file(s)/message(s), because Broker can't read in files that are too large (and reading everything in is likely required to do the orderHeader-orderLines correlation).

I guess you send XML IDocs (and not flat-file IDocs) to the SAP system?!
_________________
Just use REFERENCEs
fjb_saper
PostPosted: Wed Feb 08, 2012 7:00 pm

kimbert wrote:
mqsiuser said:
Quote:
I do not consider 10,000 lines to be much for Broker.

The problem is that the execution group abends when it exceeds 100 MB of memory usage (Broker 6.0)... Did this change in other Broker releases (6.1, 7, 8)?

You have to consider this! You have a list of Orders... you need to find out your Broker's limit and ask them to send (or you take care of this) e.g. 200 orders at most within one file at any given time.
The best way to process a large file of repeating records is to process it one record at a time. Once you have done that, you can process files of arbitrary size.
I'm not sure why mqsiuser is recommending splitting up the input file, but I may be missing something. It seems to me that controlling the transactional behaviour of the flow is one of the key challenges.


The best way to handle this seems to me to be to create an output file and have that file uploaded to SAP.
SAP handles abort and retry on such IDoc uploads from files pretty well by itself
(it marks the current place in the file, and the file name, in the DB at each commit).

Have fun
_________________
MQ & Broker admin
whydieanut
PostPosted: Wed Feb 08, 2012 9:36 pm

kimbert wrote:

It seems to me that controlling the transactional behaviour of the flow is one of the key challenges.

Yes, it is. I am explaining the file structure and the requirements in detail below.

mqsiuser wrote:

I guess you send XML IDocs (and not flat-file IDocs) to the SAP system?!

Yes

fjb_saper wrote:

The best way to handle this seems to me to be to create an output file and have that file uploaded to SAP.
SAP handles abort and retry on such IDoc uploads from files pretty well by itself
(it marks the current place in the file, and the file name, in the DB at each commit).

Technically speaking, SAP doesn't need WMB to read from a file at all.


Coming to the exact requirements, my input CSV file is in the following format:
Code:

DOCNO_001,DR,10000,...,ABC
DOCNO_001,CR,5000,...,DEF
DOCNO_001,CR,5000,...,GHI
DOCNO_002,DR,2000,...,ABC
DOCNO_002,CR,2000,...,XYZ
DOCNO_003,DR,15000,...,LMN
DOCNO_003,CR,4000,...,XYZ
DOCNO_003,CR,2000,...,ABC
DOCNO_003,CR,9000,...,UVW

(The SAP IDoc has a similar structure with each line of CSV translating to a segment in the Data Record of the IDoc.)

This example consists of 3 separate transactions, identified by the value of the first field (DOCNO_001, DOCNO_002, DOCNO_003).
Each transaction consists of a variable number of line items.

The requirement is to send only complete transactions to SAP; so if one line item of a transaction has an error, the entire transaction needs to be rejected (all of its line items).

In addition, since it's a batch, the sending application wants us to reject the whole file if even a single transaction has an error; they don't want the file to be sent partially (i.e. sending only the valid transactions), since reprocessing only the failed transactions would be difficult for them.

Also, SAP wants us to send in the whole batch as a single IDoc.

To add to this, the overall design consists of 2 flows - one reading in the input CSV files, converting each into an IDoc and putting it on a queue, and another common flow reading from this queue and sending the IDoc to the SAP adapter (without any further processing).


Given all this, managing transactions is becoming tricky.
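For the grouping itself, something along these lines might work as a first pass (a sketch: DocNo stands for the first CSV column, and rows are assumed to arrive already sorted by document number, as in the sample):

Code:

-- Index the rows by transaction in Environment, so a whole DOCNO group can
-- be accepted or rejected as a unit before the single IDoc is built.
DECLARE rowRef REFERENCE TO InputRoot.MRM.Order[1];
DECLARE currentDoc CHARACTER '';
WHILE LASTMOVE(rowRef) DO
    IF rowRef.DocNo <> currentDoc THEN
        SET currentDoc = rowRef.DocNo;
        CREATE LASTCHILD OF Environment.Txns NAME 'Txn';  -- new transaction
    END IF;
    CREATE LASTCHILD OF Environment.Txns.Txn[<] NAME 'Line';
    SET Environment.Txns.Txn[<].Line[<] = rowRef;  -- subtree copy of the row
    MOVE rowRef NEXTSIBLING REPEATNAME;
END WHILE;

If any line fails validation, the whole file can then be rejected before the IDoc is ever built, so the common second flow never sees a partial batch.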