Author |
Message
|
matuwe |
Posted: Tue Oct 25, 2011 1:44 pm Post subject: Parse data at runtime,the delimeter only known at runtime |
|
|
 Master
Joined: 05 Dec 2007 Posts: 296
|
Hi
Please advice, I am not sure how to handle this flow. I am retriving data from a DB, which is a BLOB. I need to Convert to CHAR and the data then looks like
Quote: |
Name|Surname|age
help|Test|23
|
When retrieving the data part of the data will give me the record delimiter and the row delimiter. So I need to take all this information and do an insert into another database.
The delimiter can be anything, a comma, pipe
I wrote a JAVA compute and used Stringtoken to find rows and columns, but it just takes for ever to do 50 000 records. Is there anyway that I can do this parsing? |
|
Back to top |
|
 |
lancelotlinc |
Posted: Tue Oct 25, 2011 1:54 pm Post subject: |
|
|
 Jedi Knight
Joined: 22 Mar 2010 Posts: 4941 Location: Bloomington, IL USA
|
If you are on V7, the DatabaseInput node may work for you. By handling the db rows as they come in rather than in batch mode, you may be able to tolerate the longer latency for parsing and conversion. _________________ http://leanpub.com/IIB_Tips_and_Tricks
Save $20: Coupon Code: MQSERIES_READER |
|
Back to top |
|
 |
matuwe |
Posted: Tue Oct 25, 2011 2:06 pm Post subject: |
|
|
 Master
Joined: 05 Dec 2007 Posts: 296
|
Owww thanks a million for the quick response.
I am on V7. Can the DBInput accept delimeters at runtime . My other issue is the database that I writing to only accepts JDBC and want me to call their Stored Proc and insert record at the time.
I am not sure I understand correctly, can the the db insert accept SQL statement at runtime too?
Thanking you in advance |
|
Back to top |
|
 |
lancelotlinc |
Posted: Tue Oct 25, 2011 2:18 pm Post subject: |
|
|
 Jedi Knight
Joined: 22 Mar 2010 Posts: 4941 Location: Bloomington, IL USA
|
mqjeff, kimbert or Vitor may be better resources for this situation. You can do all those things you mentioned, but the parsing part your still needing to be done in a compute node, either ESQL or Java. Probably Java would be more flexible, but the three people aforementioned could offer you some more accurate advice to parse out the blob the way you want in the fastest way possible.
DatabaseInput node starts processing when it senses a new record in the table it is monitoring. _________________ http://leanpub.com/IIB_Tips_and_Tricks
Save $20: Coupon Code: MQSERIES_READER |
|
Back to top |
|
 |
matuwe |
Posted: Tue Oct 25, 2011 2:24 pm Post subject: |
|
|
 Master
Joined: 05 Dec 2007 Posts: 296
|
Owww cool thanks a million. I shall wait for more feedback hopefully some one will give me a more robust solution to handle this.. |
|
Back to top |
|
 |
kimbert |
Posted: Tue Oct 25, 2011 2:40 pm Post subject: |
|
|
 Jedi Council
Joined: 29 Jul 2003 Posts: 5542 Location: Southampton
|
Seems to be "parser week" on this forum...
Quote: |
When retrieving the data part of the data will give me the record delimiter and the row delimiter |
The MRM parser *can* cope with varying delimiters, but it is not easy, and there are limits to what it can handle.
However, if you answer the following questions about the data format I can give some specific advice:
Q1: Does your data format have a well-known name. Like 'HL7' or 'EDIFACT' or 'ACORD'?
Q2: How *exactly* are the delimiters specified in the data? Is there some kind of header records with a special field that describes the delimiters?
Q3: You said 'The delimiter can be anything, a comma, pipe'. I guess that in the real world, one or two of those are used in 99% of messages. If you know ( or can find out ) the most common cases then you can optimize your flow to check for the common cases. But that requires a proper understanding of your data format, so I can't tell you how to do that until you answer Q1 and/or Q2. |
|
Back to top |
|
 |
kimbert |
Posted: Tue Oct 25, 2011 2:44 pm Post subject: |
|
|
 Jedi Council
Joined: 29 Jul 2003 Posts: 5542 Location: Southampton
|
Oh...and before somebody else jumps in and says it, the other answer is:
This would be easy if you could use DFDL. Unfortunately WMB v8 isn't available just yet... |
|
Back to top |
|
 |
matuwe |
Posted: Tue Oct 25, 2011 2:55 pm Post subject: |
|
|
 Master
Joined: 05 Dec 2007 Posts: 296
|
Oww thanks for the response..
Currently all my 50 interfaces uses | pipe delimiter and the rows uses \n colon deliminator. All the data will be handles a CHAR datatype.
The first row on my payload tells me what delimiter to use, the second row shows me the insert statement for the destination and the last row is what contains the BLOB with the rows and columns.
I only know at runtime what the delimiter is plus how many columns. This is why I could not work out how I can create a message set to accommodate dynamic number of columns depending on the data.
And Lastly, no I do not have a header row. |
|
Back to top |
|
 |
matuwe |
Posted: Tue Oct 25, 2011 3:18 pm Post subject: |
|
|
 Master
Joined: 05 Dec 2007 Posts: 296
|
Just for more design question..
Is it better to have one java compte node to strip the records from the payload and propagate the record, then have another Java compute to do the actual insert into the database?
Or is it safer to strip all messages and build them into the message tree then propagate to the note that will then insert this data into the DB.
Is there a way of making builk insert??? or is this not recommended? Currently I am building a JDBC component, and I will need to build the ODBC version. So that is why I am trying to split the insert node so I can reuse the parsing on both.
Code looks like this
Code: |
String columnseparator = "|" ; //datasourcename.getNextSibling();
String rowsaperator = "\n" ;
String commitrows = "0" ;//rowsaperator.getNextSibling();
String tokens[] = DBJStringManupilator.tokenizer(data,rowsaperator);
///TODO: - use the row separator from the database.
// Build the output area for the row data
MbElement rowData = outAssembly.getLocalEnvironment().getRootElement().createElementAsLastChild(MbElement.TYPE_NAME, "ROW", "");
int numTokens = tokens.length;
MbElement recordCount = outAssembly.getLocalEnvironment().getRootElement().createElementAsLastChild(MbElement.TYPE_NAME, "RECORDCOUNT", "");
recordCount.createElementAsLastChild(MbElement.TYPE_VALUE, "recordNum", numTokens);
int count = 1;
for (int i = 0; i < numTokens; i++) {
// TODO: Remove the following hardcoded column delimiter - Replace with the
// correct one from the configuration
// Split into separate columns
//String columns[] = StringManupilator.tokenizer(tokens[i],"\\"+columnseparator.getValueAsString());
String columns[] = tokens[i].split("\\"+columnseparator);
int numCols = columns.length;
String batch_id;
// Copy each field into the output area
count=1;
for (int j = 0; j < numCols; j++) {
// if (j==0){
//
// batch_id = String.valueOf(Math.round((float)Math.random()*100000));
//
// }else{
if (columns[j] == null || columns[j].equals("")){
batch_id= "";
}else{
batch_id=columns[j];
}
// }
MbElement field = rowData.createElementAsLastChild(MbElement.TYPE_VALUE, "field"+count, batch_id);
count = count + 1;
}
out.propagate(outAssembly);
rowData.detach();
rowData = outAssembly.getLocalEnvironment().getRootElement().createElementAsLastChild(MbElement.TYPE_NAME, "ROW", "");
}
} finally {
// clear the outMessage
outMessage.clearMessage();
}
|
|
|
Back to top |
|
 |
MrSmith |
Posted: Wed Oct 26, 2011 12:43 am Post subject: |
|
|
 Master
Joined: 20 Mar 2008 Posts: 215
|
Although it would mean a message set with multiple Formats if you know that your delimiter is only ever going to be a handful of those regularly used you could do the parsing in the ESQL and apply the Format part in the next node
i.e Input -------> (Node determines delimiter from 1st row then does inline parse in ESQL defining format at this point but same message set as the elements are still relevant)
It does however mean that you may for the main part have a few redundant wire formats or TDS formats but you could cover most bases with the main ones _________________ -------- *
“Outside of a dog, a book is man's best friend. Inside of a dog it's too dark to read.” |
|
Back to top |
|
 |
mqjeff |
Posted: Wed Oct 26, 2011 1:41 am Post subject: |
|
|
Grand Master
Joined: 25 Jun 2008 Posts: 17447
|
If you know that the delimiter is always going to be unique within the message, i.e. that you don't have issues like having quotes within fields that are delimited by quotes... Then you can create a single message set with a single delimiter value, and then use ESQL to read the first row with the delimiter and REPLACE that value with the value used by the message set and then parse with the message set.
But you appear to have a message structure that requires an unknown number of repeating elements that are not distinguished very well from each other. So it may be a complicated message set to produce. |
|
Back to top |
|
 |
kimbert |
Posted: Wed Oct 26, 2011 7:00 am Post subject: |
|
|
 Jedi Council
Joined: 29 Jul 2003 Posts: 5542 Location: Southampton
|
You've had a lot of advice there. Here's a summary of the options:
A. Create a single message set with two or more TDS physical formats. Use one physical format per 'common' delimiter style. The message definition will consist of a message containing a single repeating record containing a single repeating field. Use ESQL or Java to select the correct physical format for each set of incoming records.
B. Do it all in Java. Write some highly-efficient Java code ( yours doesn't look highly efficient, btw ) to do the parsing of each record. You could use specialized methods for the common delimiting styles - but the Java JIT might make that unnecessary.
I'm not an expert on the database stuff - but it sounds as if there may be some gains available there. As always, you should *measure* to find the bottlenecks before you carry out any performance tuning. |
|
Back to top |
|
 |
mattfarney |
Posted: Wed Oct 26, 2011 8:12 am Post subject: |
|
|
 Disciple
Joined: 17 Jan 2006 Posts: 167 Location: Ohio
|
While I've never done Broker into a DB, the basic insert command looks like vanilla SQL to me.
As another suggestion, instead of truly parsing the input, you could simply convert the input records directly into the insert statement.
If you store off the two delimiters, your insert statement would look like
insert into <sometable> (replace(row[2],delimiter,",") VALUES
(replace(row[x],delimiter,",");
You would need to loop through the input (at least the data section) based on delimiter #2.
Just a suggestion
-mf |
|
Back to top |
|
 |
|