Validating CSV File
whydieanut
Posted: Tue Feb 07, 2012 4:29 am    Post subject: Validating CSV File
Disciple
Joined: 02 Apr 2010    Posts: 186
Hi all,
I need to read a large file with multiple lines of CSV data (each line separated by CR LF), using the File Input Node.
I need to pass the entire file's content on as a single message, with each line mapping to a row of the output format (SAP Data Record segments).
Example:
row1_field1,row1_field2,row1_field3,row1_field4,row1_field5
row2_field1,row2_field2,row2_field3,row2_field4,row2_field5
row3_field1,row3_field2,row3_field3,row3_field4,row3_field5
row4_field1,row4_field2,row4_field3,row4_field4,row4_field5
I have a Message Type Order, which is a complex type, comma delimited, to represent each line of the CSV data.
I have another Message Type Order_List, which is a complex type consisting of repeating occurrences of type Order, with a CR LF delimiter.
My input file (sample above) successfully gets processed with this Message Set.
I am now required to validate my input message for values of certain fields (Integer Checking, Date format, etc...).
In case of any errors in validation, I need to write out an error log with details of all lines (and possibly which field) that had an error; and then reject processing of the file.
Else I need to process it.
Given these requirements, validation at the FileInput node isn't giving me the required result.
What is a good approach to take for such a requirement?
Or is it asking for too much from WMB?
Any advice/suggestions...?
Vitor
Posted: Tue Feb 07, 2012 5:53 am    Post subject: Re: Validating CSV File
Grand High Poobah
Joined: 11 Nov 2005    Posts: 26093    Location: Texas, USA
whydieanut wrote:
What is a good approach to take for such a requirement?
Use the CSV sample as a model for your actual file layout, identifying those fields with specific formats that you need to check. Use that for validation.
whydieanut wrote:
Or is it asking for too much from WMB?
whydieanut wrote:
Any advice/suggestions...?
I'm sure @kimbert will be along in a moment with a much better idea.
_________________
Honesty is the best policy.
Insanity is the best defence.
smdavies99
Posted: Tue Feb 07, 2012 6:44 am    Post subject:
Jedi Council
Joined: 10 Feb 2003    Posts: 6076    Location: Somewhere over the Rainbow this side of Never-never land.
Quote:
I am now required to validate my input message for values of certain fields (Integer Checking, Date format, etc...).
In case of any errors in validation, I need to write out an error log with details of all lines (and possibly which field) that had an error; and then reject processing of the file.
You could add lots of enumerations to the message model, but that may not satisfy your requirement.
If you really, really have to tell the admins that field 36B failed on row 1034, then there is no real alternative except to do the value validation in ESQL or Java. Then you can control what happens when you get an error, etc. It will slow down the flow considerably.
I'd also make the manual validation configurable (just in case), so that when management decides that their terrible data is 'not the cause of the slow system', you can turn the validation off without making code changes. A minimal sketch of what I mean is below.
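Something like this rough sketch, inside a Compute node (the UDP name ENABLE_VALIDATION and the field name Amount are made up for illustration; set the UDP in the BAR file):
Code:
-- Sketch only: a UDP toggle around hand-written validation.
-- 'ENABLE_VALIDATION' is an assumed property name; 'Amount' an assumed field.
DECLARE ENABLE_VALIDATION EXTERNAL CHARACTER 'true';

CREATE COMPUTE MODULE ValidateCsv_Compute
    CREATE FUNCTION Main() RETURNS BOOLEAN
    BEGIN
        SET OutputRoot = InputRoot;
        IF ENABLE_VALIDATION = 'true' THEN
            DECLARE errs CHARACTER '';
            DECLARE i INTEGER 1;
            FOR rec AS InputRoot.MRM.Order[] DO
                -- TRANSLATE deletes digits; anything left over is non-numeric
                IF LENGTH(TRANSLATE(COALESCE(rec.Amount, 'x'), '0123456789', '')) > 0 THEN
                    SET errs = errs || 'row ' || CAST(i AS CHARACTER) || ': Amount not numeric; ';
                END IF;
                SET i = i + 1;
            END FOR;
            -- downstream logic can log this and reject the file if non-empty
            SET Environment.Variables.ValidationErrors = errs;
        END IF;
        RETURN TRUE;
    END;
END MODULE;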
If you do go down the enumeration and exception route, you will become a master at traversing the ExceptionList and extracting the information you need for the error log.
_________________
WMQ User since 1999
MQSI/WBI/WMB/'Thingy' User since 2002
Linux user since 1995
Every time you reinvent the wheel the more square it gets (anon). If in doubt think and investigate before you ask silly questions.
kimbert
Posted: Tue Feb 07, 2012 7:32 am    Post subject:
Jedi Council
Joined: 29 Jul 2003    Posts: 5542    Location: Southampton
You're using the FileInput node, right? So you can split the input into records before it arrives in the flow. You can use the MRM's validation rules (i.e. XML Schema) to validate the field values in each individual record. If that fails, then you can write that single failing record to a failure file somewhere.
Now all you have to do is work out how to start a new failure file each time the input file changes. Over to somebody else for that.
mqsiuser
Posted: Tue Feb 07, 2012 7:46 am    Post subject:
Yatiri
Joined: 15 Apr 2008    Posts: 637    Location: Germany
smdavies99 wrote:
It will slow down the flow considerably.
Consider that throwing exceptions is expensive... so if you want to collect all validation errors, instead of stopping at the first one (by throwing an exception), write your validation (of e.g. dates and enumerations) in ESQL code. In particular, do not throw and catch exceptions repeatedly! Parse your fields in as strings (CHAR) and write validation (ESQL) code against those strings. (For a DATE, probably just check the format and not whether it really is an accurate date... validating a date manually and with full accuracy is problematic, I must admit.)
Experienced people will agree that CHAR/String is a reasonable choice in the message set (over trying to put in a lot of type information like INT or DateTime formatting at that point - it is too early there). The truth is that it is better to do some work (e.g. CASTs) in code than to get dubious parsing errors.
smdavies99 wrote:
If you really, really have to tell the admins that field 36B failed on row 1034, then there is no real alternative except to do the value validation in ESQL or Java. Then you can control what happens when you get an error, etc.
Write the validation in ESQL! I usually combine validation and transformation in one flow. Sounds awkward, but I think practitioners can/will confirm this approach.
If you use e.g. XSD validation, you are constrained to what XSD is capable of (smdavies99 says "[...] but that may not satisfy your requirement"). Well... probably XSD is capable of everything you want to do... but validation errors will throw exceptions (probably you can turn this off?! - I guess NO), and most importantly you cannot really react to validation errors as you like (as smdavies99 states). Ideally you write some utility procedures which help you validate stuff. Sorry, but I am serious about that.
I am providing a couple of very basic functions within my project which can also be used for validation: isAlphanumeric(), isAlphaOrNumeric(), isIdentifier(), isNumeric(), containsCharsOnly() and countCharInChar(). These return TRUE/FALSE/INT, which you can then use to do whatever you like: e.g. throw an exception (well, probably not), or just write a log entry, or collect all errors and FINALLY throw an exception and/or return the list of validation issues to the sender. A sketch of one such helper is below.
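For example, isNumeric() looks roughly like this (a sketch only; my real version has a few more checks):
Code:
-- Sketch of a basic ESQL validation helper.
-- Returns TRUE when the string is non-empty and contains only the digits 0-9.
CREATE FUNCTION isNumeric(IN str CHARACTER) RETURNS BOOLEAN
BEGIN
    IF str IS NULL OR LENGTH(str) = 0 THEN
        RETURN FALSE;
    END IF;
    -- TRANSLATE deletes every digit; anything left over is not numeric
    RETURN LENGTH(TRANSLATE(str, '0123456789', '')) = 0;
END;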
_________________
Just use REFERENCEs
Last edited by mqsiuser on Tue Feb 07, 2012 11:04 pm; edited 1 time in total
whydieanut
Posted: Tue Feb 07, 2012 9:38 pm    Post subject:
Disciple
Joined: 02 Apr 2010    Posts: 186
Whoa! All the biggies are here...! I must choose my next words carefully...
Vitor wrote:
Use the CSV sample as a model for your actual file layout, identifying those fields with specific formats that you need to check. Use that for validation.
Used just that to get my MsgSet working...
smdavies99 wrote:
You could add lots of enumerations to the message model, but that may not satisfy your requirement.
If you really, really have to tell the admins that field 36B failed on row 1034, then there is no real alternative except to do the value validation in ESQL or Java. Then you can control what happens when you get an error, etc. It will slow down the flow considerably.
I'd also make the manual validation configurable (just in case), so that when management decides that their terrible data is 'not the cause of the slow system', you can turn the validation off without making code changes.
If you do go down the enumeration and exception route, you will become a master at traversing the ExceptionList and extracting the information you need for the error log.
So manual validation using ESQL IS a valid option, though it may considerably slow down the processing?
Is it then possible to do some kind of validation using a Message Type (in my case Order), which would have the field type restrictions specified, for each line?
For example, I could loop through all line items and just call a function ValidateOrderLineItem(LineItemReference), which would return TRUE or FALSE (or better, a list of fields with errors). Something like the sketch below is what I have in mind.
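A rough sketch of that loop, inside a Compute node's Main() (ValidateOrderLineItem() is hypothetical, so I have inlined a sample check; 'Qty' is a placeholder field name):
Code:
-- Hypothetical sketch of the validation loop; field names are placeholders.
DECLARE errorList CHARACTER '';
DECLARE i INTEGER 1;
FOR item AS InputRoot.MRM.Order[] DO
    -- inline check standing in for ValidateOrderLineItem(item)
    IF LENGTH(TRANSLATE(COALESCE(item.Qty, 'x'), '0123456789', '')) > 0 THEN
        SET errorList = errorList || 'line ' || CAST(i AS CHARACTER) || ': Qty is not an integer; ';
    END IF;
    SET i = i + 1;
END FOR;
IF LENGTH(errorList) > 0 THEN
    -- reject the whole file: the flow's catch path writes errorList to the error log
    THROW USER EXCEPTION MESSAGE 2951 VALUES(errorList);
END IF;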
kimbert wrote:
You're using the FileInput node, right? So you can split the input into records before it arrives in the flow. You can use the MRM's validation rules (i.e. XML Schema) to validate the field values in each individual record. If that fails, then you can write that single failing record to a failure file somewhere.
Now all you have to do is work out how to start a new failure file each time the input file changes. Over to somebody else for that.
This approach would be best for me. Also, the FileInput node gives you the record index if the file is split into multiple messages; I could just use that.
I have a requirement that either the file be processed completely (i.e. no errors at all) or rejected completely (i.e. if even a single line has an error).
The moment I split the file, I have issues enforcing the above rule, because the FileInput node is non-transactional.
Or am I missing something very basic?
whydieanut
Posted: Tue Feb 07, 2012 9:54 pm    Post subject:
Disciple
Joined: 02 Apr 2010    Posts: 186
mqsiuser wrote:
Consider that throwing exceptions is expensive... so if you want to collect all validation errors, instead of stopping at the first one (by throwing an exception), write your validation (of e.g. dates and enumerations) in ESQL code. In particular, do not throw and catch exceptions repeatedly! Parse your fields in as strings (CHAR) and write validation (ESQL) code against those strings. (For a DATE, probably just check the format and not whether it really is an accurate date... validating a date manually and with full accuracy is problematic, I must admit.)
Experienced people will agree that CHAR/String is a reasonable choice in the message set (over trying to put in a lot of type information like INT or DateTime formatting at that point - it is too early there). The truth is that it is better to do some work (e.g. CASTs) in code than to get dubious parsing errors.
Yes, currently my message set has logical properties set to decimal, integer and so on, which causes the message not to be parsed properly at all, even though no validation is set. The parsing itself fails when it encounters an integer field with a character value in the input message. So I figure it's better to leave the field types as CHAR and do the validation in ESQL, along the lines of the sketch below.
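For instance, with everything modelled as CHAR, a test conversion in ESQL would look roughly like this (a sketch; 'OrderDate' and the format string are made up, and note mqsiuser's caveat that every failed CAST costs an internal exception):
Code:
-- Sketch: keep the MRM field as CHAR and test-convert it in code.
-- 'OrderDate' is an assumed field name; 'yyyyMMdd' an assumed format.
DECLARE ok BOOLEAN TRUE;
DECLARE d DATE;
BEGIN
    -- if the CAST fails, the handler swallows the exception and clears the flag
    DECLARE CONTINUE HANDLER FOR SQLEXCEPTION SET ok = FALSE;
    SET d = CAST(InputRoot.MRM.Order[1].OrderDate AS DATE FORMAT 'yyyyMMdd');
END;
IF NOT ok THEN
    SET Environment.Variables.Errors = COALESCE(Environment.Variables.Errors, '')
        || 'row 1: OrderDate is not a valid date; ';
END IF;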
mqsiuser wrote:
...but validation errors will throw exceptions (probably you can turn this off?! - I guess NO), and most importantly you cannot really react to validation errors as you like (as smdavies99 states). Ideally you write some utility procedures which help you validate stuff. Sorry, but I am serious about that.
I am providing a couple of very basic functions within my project which can also be used for validation: isAlphanumeric(), isAlphaOrNumeric(), isIdentifier(), isNumeric(), containsCharsOnly() and countCharInChar(). These return TRUE/FALSE/INT, which you can then use to do whatever you like: e.g. throw an exception (well, probably not), or just write a log entry, or collect all errors and FINALLY throw an exception and/or return the list of validation issues to the sender.
I really wanted to avoid having to manually validate each field with functions like the above, and instead use some kind of built-in validation technique; maybe use the CREATE ... PARSE combination with the Validate option set to check each line of the CSV data against the Order message type (like the ValidateOrderLineItem() function I mentioned above). Something like the sketch below.
I have too many different types of messages, with different fields, to keep this manageable in code.
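The kind of thing I mean, as a sketch only (the set/type/format names 'ORDERSET', 'Order' and 'CSV1' are placeholders, and I am not yet sure these parse options are exactly right):
Code:
-- Sketch: serialise one Order subtree, then re-parse it with validation on.
-- Set/type/format names below are placeholders for the real message set.
DECLARE opts INTEGER BITOR(RootBitStream, ValidateContentAndValue, ValidateException);
DECLARE raw BLOB ASBITSTREAM(InputRoot.MRM.Order[1]
    SET 'ORDERSET' TYPE 'Order' FORMAT 'CSV1');
DECLARE lineValid BOOLEAN TRUE;
BEGIN
    -- a validation failure raises an exception; the handler just records it
    DECLARE CONTINUE HANDLER FOR SQLEXCEPTION SET lineValid = FALSE;
    CREATE LASTCHILD OF Environment.Variables.Check DOMAIN('MRM')
        PARSE(raw SET 'ORDERSET' TYPE 'Order' FORMAT 'CSV1' OPTIONS opts);
END;
-- lineValid now reports whether this line passed the message-set restrictions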
whydieanut
Posted: Tue Feb 07, 2012 9:59 pm    Post subject: Re: Validating CSV File
Disciple
Joined: 02 Apr 2010    Posts: 186
Vitor wrote:
whydieanut wrote:
Or is it asking for too much from WMB?
What I meant was: just because something can be done in the middleware layer doesn't mean that it has to be done there.
Can I just go ahead and tell the functional guys that this is not the responsibility of middleware (considering that the only feasible option seems to impact performance)?
mqsiuser
Posted: Tue Feb 07, 2012 11:26 pm    Post subject: Re: Validating CSV File
Yatiri
Joined: 15 Apr 2008    Posts: 637    Location: Germany
whydieanut wrote:
What I meant was: just because something can be done in the middleware layer doesn't mean that it has to be done there.
Can I just go ahead and tell the functional guys that this is not the responsibility of middleware (considering that the only feasible option seems to impact performance)?
Yes, a lot of people argue like that. You push the responsibility for valid (field) contents to the partners, which is the alternative to having to validate and check it yourself (a lot of work and responsibility for you).
You have a challenging structural transformation to do already (as always with SAP)!
I do not see that performance is that much of an issue (how many orders can a company process?!)
_________________
Just use REFERENCEs
whydieanut
Posted: Wed Feb 08, 2012 12:36 am    Post subject:
Disciple
Joined: 02 Apr 2010    Posts: 186
I am expecting a peak load of about 10,000 lines of CSV data in the file...
The SAP guys want the entire contents of the file in a single IDoc.
mqsiuser
Posted: Wed Feb 08, 2012 1:09 am    Post subject:
Yatiri
Joined: 15 Apr 2008    Posts: 637    Location: Germany
whydieanut wrote:
I am expecting a peak load of about 10,000 lines of CSV data in the file...
The SAP guys want the entire contents of the file in a single IDoc.
There are a lot of ways to ruin performance... but I do not consider 10,000 lines to be much for Broker.
The problem is that the execution group abends when it exceeds 100 MB of memory usage (Broker 6.0)... Did this change in later Broker releases (6.1, 7, 8)?
You have to consider this! You have a list of Orders... you need to find out your Broker's limit and ask the sending partner to send e.g. 200 orders at most within one file at any given time. Or of course you take care of the issue yourself:
If you try to parse a single file with too many records, the execution group abends while parsing the in-message (you will never get a chance to rescue the situation in code). The FileInput node parses each line one by one... but you still have to construct a big out-message, and be careful when doing so. You could probably append each order to the (one) output file (though I have never done it like that); then your flow just processes single orders and still writes one file (a sketch of that record-at-a-time propagation is below). There is an issue with reading such files in, though: you'd have to do the writes and then move ("mv") the file at the end, when it is complete. You can likely not process single lines from input to output, because of message-structure issues (!).
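For illustration, record-at-a-time propagation from a Compute node looks roughly like this (a sketch; tree paths and names are examples, and the Properties folder may need its MessageType adjusted for a single Order):
Code:
-- Sketch: propagate one out-message per order instead of one big message.
-- 'Order' and the module name are examples; adjust paths to your message set.
CREATE COMPUTE MODULE SplitOrders_Compute
    CREATE FUNCTION Main() RETURNS BOOLEAN
    BEGIN
        DECLARE i INTEGER 1;
        DECLARE n INTEGER CARDINALITY(InputRoot.MRM.Order[]);
        WHILE i <= n DO
            SET OutputRoot.Properties = InputRoot.Properties;
            SET OutputRoot.MRM.Order = InputRoot.MRM.Order[i];
            -- hand each order on down the flow; DELETE NONE keeps the input tree
            PROPAGATE TO TERMINAL 'out' DELETE NONE;
            SET i = i + 1;
        END WHILE;
        RETURN FALSE; -- everything has already been propagated above
    END;
END MODULE;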
They (SAP) can very likely live with receiving e.g. 5 files/messages (with 200 orders each) instead of 1 file (with 1,000 orders). To be safe, e.g. 100 orders per file might be OK. SAP just does not like to have a file/message for each order.
_________________
Just use REFERENCEs
Last edited by mqsiuser on Wed Feb 08, 2012 1:32 am; edited 1 time in total
kimbert
Posted: Wed Feb 08, 2012 1:32 am    Post subject:
Jedi Council
Joined: 29 Jul 2003    Posts: 5542    Location: Southampton
mqsiuser said:
Quote:
I do not consider 10,000 lines to be much for Broker.
The problem is that the execution group abends when it exceeds 100 MB of memory usage (Broker 6.0)... Did this change in later Broker releases (6.1, 7, 8)?
You have to consider this! You have a list of Orders... you need to find out your Broker's limit and ask them to send (or you take care of this) e.g. 200 orders at most within one file at any given time.
The best way to process a large file of repeating records is to process it one record at a time. Once you have done that, you can process files of arbitrary size.
I'm not sure why mqsiuser is recommending splitting up the input file, but I may be missing something. It seems to me that controlling the transactional behaviour of the flow is one of the key challenges.
mqsiuser
Posted: Wed Feb 08, 2012 1:42 am    Post subject:
Yatiri
Joined: 15 Apr 2008    Posts: 637    Location: Germany
kimbert wrote:
The best way to process a large file of repeating records is to process it one record at a time. Once you have done that, you can process files of arbitrary size.
It would be great if you could always just process one record/line at a time, like "in -> process -> out". I do not think that works for SAP systems.
Explanation with XML as output: to populate a (single) XML out-message (SAP order), you first have to read in all the flat-file records/lines that belong to that order. An XML (out-)message has only one root element (which is trivial, I must say), so you need to load everything into your flow that you want to write out into a single XML message, e.g. 1 order. This also holds true for e.g. 200 orders: you have to read in all the lines/records to create the one out-XML message! This is just a simple (probably weak) argument, but I guess you write XML "at once" and not "start, append, end" (across several flow executions). Some adapter might be able to do that for you, but I would not recommend writing out an XML message across several flow executions. A more proper explanation concerns the structural transformation (from in- to out-message):
Explanation (even when using flat-file output): if you have different record types/lines (e.g. an order header, order lines and an order trailer for each order), then it is very likely not possible to process each record/line independently. You likely have to read at least each whole order (its header, lines and trailer) into your flow (that will be e.g. 20 lines/records). Otherwise you will not be able to mediate all possible structural differences, which means you will probably not be able to properly populate the fields in the IDoc (flat-file) output message.
kimbert wrote:
I'm not sure why mqsiuser is recommending splitting up the input file, but I may be missing something.
This just offloads responsibility to the sending application. I guess reading in line by line is fair enough. Finish flow processing after each order (which will be a couple of lines/records), or after a batch of orders (e.g. 10 or 100). Note that this requirement is somewhat created by Broker, not the other end (SAP) - and it would be a bad idea to impose the receiving system's requirements on the sending system.
Unfortunately, I think the OP has a file with all order headers and a file with all order lines (and a file with order batches?!) and needs to correlate these?!
In that case the sending application should split the file(s)/message(s), because Broker can't read in overly large files (which would likely be required to do the order-header/order-lines correlation).
I guess you send XML IDocs (and not flat-file IDocs) to the SAP system?!
_________________
Just use REFERENCEs
fjb_saper
Posted: Wed Feb 08, 2012 7:00 pm    Post subject:
Grand High Poobah
Joined: 18 Nov 2003    Posts: 20756    Location: LI,NY
kimbert wrote:
mqsiuser said:
Quote:
I do not consider 10,000 lines to be much for Broker.
The problem is that the execution group abends when it exceeds 100 MB of memory usage (Broker 6.0)... Did this change in later Broker releases (6.1, 7, 8)?
You have to consider this! You have a list of Orders... you need to find out your Broker's limit and ask them to send (or you take care of this) e.g. 200 orders at most within one file at any given time.
The best way to process a large file of repeating records is to process it one record at a time. Once you have done that, you can process files of arbitrary size.
I'm not sure why mqsiuser is recommending splitting up the input file, but I may be missing something. It seems to me that controlling the transactional behaviour of the flow is one of the key challenges.
The best way to handle this seems to me to be to create an output file and have that output file uploaded to SAP.
SAP handles abort and retry on such IDoc uploads from files pretty well by itself
(it marks the current place in the file, and the file name, in the DB at each commit).
Have fun
_________________
MQ & Broker admin
whydieanut
Posted: Wed Feb 08, 2012 9:36 pm    Post subject:
Disciple
Joined: 02 Apr 2010    Posts: 186
kimbert wrote:
It seems to me that controlling the transactional behaviour of the flow is one of the key challenges.
Yes, it is. I am explaining the file structure and the requirements in detail below.
mqsiuser wrote:
I guess you send XML IDocs (and not flat-file IDocs) to the SAP system?!
Yes
fjb_saper wrote:
The best way to handle this seems to me to be to create an output file and have that output file uploaded to SAP.
SAP handles abort and retry on such IDoc uploads from files pretty well by itself
(it marks the current place in the file, and the file name, in the DB at each commit).
Technically speaking, SAP doesn't need WMB for reading from a file at all.
Coming to the exact requirements, my input CSV file is in the following format:
Code:
DOCNO_001,DR,10000,...,ABC
DOCNO_001,CR,5000,...,DEF
DOCNO_001,CR,5000,...,GHI
DOCNO_002,DR,2000,...,ABC
DOCNO_002,CR,2000,...,XYZ
DOCNO_003,DR,15000,...,LMN
DOCNO_003,CR,4000,...,XYZ
DOCNO_003,CR,2000,...,ABC
DOCNO_003,CR,9000,...,UVW
(The SAP IDoc has a similar structure, with each line of CSV translating to a segment in the Data Record of the IDoc.)
This example consists of 3 separate transactions, identified by the value of the first field (DOCNO_001, DOCNO_002, DOCNO_003).
Each transaction consists of a variable number of line items.
The requirement is to send a complete transaction to SAP; so if 1 line item of a transaction has an error, the entire transaction needs to be rejected (all of its line items).
In addition, since it's a batch, the sending application wants us to reject the whole file in case of an error in even a single transaction; they don't want the file to be sent partially (i.e. sending only the valid transactions), since reprocessing only the failed transactions would be difficult for them.
Also, SAP wants us to send in the whole batch as a single IDoc.
To add to this, the overall design consists of 2 flows - one reading in the input CSV files, converting them into IDocs and putting them on a queue, and another common flow reading from this queue and sending the IDocs into the SAP Adapter (without any further processing).
Given all this, managing transactions is becoming tricky.
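One way to express the all-or-nothing rule in ESQL, as a rough sketch inside a Compute node's Main() (the field names 'DocNo' and 'Amount' are placeholders for the real CSV columns):
Code:
-- Sketch: validate every line first; only build the IDoc if the file is clean.
-- 'DocNo' and 'Amount' are placeholder field names.
DECLARE errs CHARACTER '';
DECLARE i INTEGER 1;
FOR rec AS InputRoot.MRM.Order[] DO
    IF LENGTH(TRANSLATE(COALESCE(rec.Amount, 'x'), '0123456789', '')) > 0 THEN
        SET errs = errs || COALESCE(rec.DocNo, '?') || ' line ' || CAST(i AS CHARACTER) || ': bad Amount; ';
    END IF;
    SET i = i + 1;
END FOR;
IF LENGTH(errs) > 0 THEN
    -- reject the whole file; the flow's catch path writes errs to the error log
    THROW USER EXCEPTION MESSAGE 2951 VALUES(errs);
END IF;
-- otherwise fall through and build the single IDoc from all the lines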