Decky |
Posted: Wed Nov 23, 2005 7:52 am Post subject: Solved: TDS Data Pattern Problem |
|
|
Novice
Joined: 16 May 2005 Posts: 16 Location: London UK
|
Hi,
I'm trying to create a message set in WBIMB to replace a NEON format. I think I'm on the right track but can't quite get it. The data comes in from a file that has been split into separate messages for each line with return characters removed. Basically there are two types of record:
HEADER - can consist of data such as 'START-OF-FILE', 'PROGRAMNAME=getdata', 'DATEFORMAT=yyyymmdd', '# Security Description', 'ANY_FIELD_NAME', 'FIELDNAME' - to sum up it's a non-fixed length, non-delimited field
DATA RECORD (2 types) - Pipe-delimited data record, i.e. Field1|Field2|...lastField|
I am using my main compound type with Composition set to 'Choice' and in the TDS layer section Data Element Separation is set to 'Use Data Pattern'. Underneath this I have two elements - a simple string element for the header and a data element with a repeating child delimited by '|'.
I have made this work by using several different elements, one for each type of header, each with a different data pattern, i.e. 'START_OF.*', '.*=.*', '#.*'. But when I try to generalise these into one element and one regex, it all starts going wrong. Depending on the change, sometimes the data records parse, sometimes the header records parse, and sometimes the data records end up in the header element. I can't get the two to live in harmony. The main difference between the header and data records is that data records all contain the word 'Equity' and use pipes '|' as delimiters, with a closing pipe at the end of the record.
At the moment I have ([A-Za-z_ -#=]+[^\|]$) as my data pattern for the header and (.*Equity.*) for the records - I have also tried '.*\|$' and various other combinations. I'm guessing that as the parser has to choose between the elements it tries to match them in the order they appear? And then if they don't match it will just try to parse with the last choice regardless? Correct me if I'm wrong, I'm not 100% sure how it works. Hopefully one of you can spot something, as regexes aren't my strength.
Cheers
Last edited by Decky on Wed Nov 23, 2005 9:04 am; edited 1 time in total |
|
jefflowrey |
Posted: Wed Nov 23, 2005 7:54 am Post subject: |
|
|
Grand Poobah
Joined: 16 Oct 2002 Posts: 19981
|
Is your header only going to contain one piece of data ("START-OF-FILE","PROGRAMNAME=getdata", etc)? Or more than one?
Well, thinking about it, regardless, you should model it as a group. Then have it either as a choice, if it can contain only one, or as an unordered set, if it can contain many. Then have fields for each type of header, each with its own data pattern. |
|
Decky |
Posted: Wed Nov 23, 2005 8:03 am Post subject: |
|
|
Novice
Joined: 16 May 2005 Posts: 16 Location: London UK
|
Thanks for your reply, I have made it work in a similar way to your suggestion but the client would prefer only 2 elements unfortunately.
A stripped example of the data would be
START-OF-FILE
PROGRAMNAME=getdata
DATEFORMAT=yyyymmdd
START-OF-FIELDS
# Security Description
TICKER
EXCH_CODE
NAME
COUNTRY
CRNCY
SECURITY_TYP
PAR_AMT
EQY_PRIM_EXCH
EQY_PRIM_EXCH_SHRT
# Industry Classification
EQY_SIC_CODE
EQY_SIC_NAME
INDUSTRY_GROUP
INDUSTRY_SUBGROUP
INDUSTRY_SECTOR
END-OF-FIELDS
TIMESTARTED=Tue Mar 1 19:17:10 EST 2005
START-OF-DATA
XXXXX YY Equity|0|field|field|......field|
END-OF-DATA
DATARECORDS=1
TIMEFINISHED=Tue Mar 1 19:51:18 EST 2005
END-OF-FILE
Note: each line is coming in as a separate message, unfortunately that is the architecture the client uses |
|
Decky |
Posted: Wed Nov 23, 2005 8:05 am Post subject: |
|
|
Novice
Joined: 16 May 2005 Posts: 16 Location: London UK
|
I think my main problem is finding distinguishing data patterns that don't overlap |
|
wooda |
Posted: Wed Nov 23, 2005 8:25 am Post subject: |
|
|
 Master
Joined: 21 Nov 2003 Posts: 265 Location: UK
|
Hi Decky,
Quote: |
At the moment I have ([A-Za-z_ -#=]+[^\|]$) as my data pattern for the header and (.*Equity.*) for the records - I have also tried '.*\|$' and various other combinations. I'm guessing that as the parser has to choose between the elements it tries to match them in the order they appear? And then if they don't match it will just try to parse with the last choice regardless? Correct me if I'm wrong, I'm not 100% sure how it works. Hopefully one of you can spot something, as regexes aren't my strength.
|
In a choice, the parser will attempt to match each choice option in turn until it finds a match. So if your HEADER occurs before your DATA RECORD in your choice definition and the header pattern also matches the DATA RECORD, then the record will be parsed as a HEADER.
As you appear to want to match anything as a header unless it specifically matches the DATA RECORD pattern, put the DATA RECORD first in your choice with a suitably unique pattern.
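To illustrate the ordering point, here is a minimal Python sketch of first-match-wins choice resolution. It is only an approximation: Python's `re` module is not the broker's pattern engine, `re.fullmatch` stands in for the whole-input matching, and the function name `resolve_choice` and the two patterns are made up for the example.

```python
import re

def resolve_choice(line, options):
    """Return the name of the first option whose pattern matches the whole line."""
    for name, pattern in options:
        if re.fullmatch(pattern, line):
            return name
    return None  # no option matched

# Hypothetical patterns, for illustration only.
HEADER = ("Header", r".*")             # catch-all
DATA = ("DataRecord", r".*Equity.*")   # more specific

record = "XXXXX YY Equity|0|field|field|"

# Catch-all listed first: it swallows the data record.
header_first = resolve_choice(record, [HEADER, DATA])
# Specific pattern listed first: the data record is recognised.
data_first = resolve_choice(record, [DATA, HEADER])

print(header_first)
print(data_first)
```

With the catch-all first, the data record is classified as a header; with the specific pattern first, it is classified as a data record, and header lines still fall through to the catch-all.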
As an aside, it would appear that you are not fully utilizing the metadata in your message. Were you to build a more specific model using all the metadata you have available, you may be able to avoid using data patterns altogether. For example, START-OF-DATA<CR><LF> could be used as a Tag or Group Indicator for the start of your DATA RECORD group. Use of choice and data patterns implies no order in how the records occur. |
|
Decky |
Posted: Wed Nov 23, 2005 8:30 am Post subject: |
|
|
Novice
Joined: 16 May 2005 Posts: 16 Location: London UK
|
Thanks for the reply, wooda. I'll try swapping the elements around. As for using the metadata/tags, this isn't possible, as each message is a single record and there is no <CR><LF> delimiter, i.e. as far as the flow is concerned, a data record does not exist when it has found a header. |
|
kimbert |
Posted: Wed Nov 23, 2005 8:32 am Post subject: |
|
|
 Jedi Council
Joined: 29 Jul 2003 Posts: 5542 Location: Southampton
|
wooda:
Note that an individual message contains only one line from the example 'message'. Bizarre, but I think that's what Decky said. So his problem is distinguishing between headers and records.
Decky:
As wooda said in his last post, a data pattern which matches a data record but not a header record should do the trick. Make sure it matches *all* of the data record - otherwise you'll end up with bitstream left over, and that will produce a parsing exception. |
|
wooda |
Posted: Wed Nov 23, 2005 8:36 am Post subject: |
|
|
 Master
Joined: 21 Nov 2003 Posts: 265 Location: UK
|
Also your patterns
Quote: |
([A-Za-z_ -#=]+[^\|]$) |
and
appear to be attempting to use $ to anchor the pattern to the end of the sequence.
Message set data patterns follow the XML Schema definition for regular expressions, in which the use of '^' and '$' to anchor an expression to the start/end of the string is not supported.
All patterns are implicitly anchored to the start/end of the bitstream they are parsing against.
So ^ and $ in this context are treated as literal characters. |
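A rough Python sketch of what that means in practice. Python's `re` treats `$` as an anchor, so to mimic the behaviour described above the `$` is escaped by hand as `\$`, and `re.fullmatch` supplies the implicit start/end anchoring. The helper name `xsd_like_match` is made up for the example, and this is an emulation, not the broker's own engine.

```python
import re

def xsd_like_match(pattern, text):
    """Approximate implicit anchoring: the pattern must match the entire text."""
    return re.fullmatch(pattern, text) is not None

record = "XXXXX YY Equity|0|field|"

# '.*\|$' with '$' taken as a literal character: the record would have to
# end in an actual dollar sign, so it does not match.
with_dollar = xsd_like_match(r".*\|\$", record)

# Dropping the '$' is enough, because the end anchor is already implied.
without_dollar = xsd_like_match(r".*\|", record)

print(with_dollar, without_dollar)
```

Under these assumptions, the pattern with the trailing `$` only matches input that really ends in a dollar sign, which is why it never matched the data records.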
|
Decky |
Posted: Wed Nov 23, 2005 9:02 am Post subject: |
|
|
Novice
Joined: 16 May 2005 Posts: 16 Location: London UK
|
Thanks for all your help guys, I've got it nailed. Basically I swapped around the order and put the data record first with a pattern of (.*\|.*\|)
and then it was as simple as using .* for the header pattern.
Thanks Again,
Dec |
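For anyone wanting to sanity-check the two final patterns outside the broker, here is a small Python sketch that runs them, data record first, against a few lines from the sample file earlier in the thread. It assumes `re.fullmatch` is a fair stand-in for the implicit anchoring; the names `CHOICE` and `classify` are made up for the example.

```python
import re

# Order matters: the specific data record pattern comes before the catch-all.
CHOICE = [
    ("DataRecord", r".*\|.*\|"),
    ("Header", r".*"),
]

def classify(line):
    """Return the first choice option whose pattern matches the whole line."""
    for name, pattern in CHOICE:
        if re.fullmatch(pattern, line):
            return name
    return None

samples = [
    "START-OF-FILE",
    "PROGRAMNAME=getdata",
    "# Security Description",
    "TIMESTARTED=Tue Mar 1 19:17:10 EST 2005",
    "XXXXX YY Equity|0|field|field|......field|",
    "END-OF-FILE",
]

results = {line: classify(line) for line in samples}
for line, kind in results.items():
    print(f"{kind:10} {line}")
```

Note that this data record pattern needs at least two pipes, the last one closing the line, so a record with a single field such as `A|` would fall through to the header option.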
|