fjb_saper
Posted: Tue Dec 28, 2010 8:44 pm Post subject:
Grand High Poobah
Joined: 18 Nov 2003  Posts: 20756  Location: LI,NY
Did you try this?
Copy the InputRoot to Environment.Data. Parsing is still on demand.
Now loop over the Environment using a reference and build your output message:
Start loop:
- build the output message (forces parsing of the relevant part of the stream)
- propagate the output message
- delete the output message
- delete the processed part of Environment.Data
End loop.
AFAIK copying the InputRoot to Environment.Data lets you move the input stream from one tree to the other without parsing it. Reading through the Environment forces parsing, and deleting each processed part from the Environment (which is mutable) keeps the memory usage low.
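In ESQL the loop would look something like this (a minimal sketch only, assuming the Message/Body/Sale layout described elsewhere in this thread; the module and terminal names are illustrative):
Code:
CREATE COMPUTE MODULE SplitSales_Compute
  CREATE FUNCTION Main() RETURNS BOOLEAN
  BEGIN
    -- Attach a mutable, parser-owned copy of the input to the Environment;
    -- with on-demand parsing the bitstream should move across unparsed
    CREATE LASTCHILD OF Environment DOMAIN('XMLNSC') NAME 'Data';
    SET Environment.Data = InputRoot.XMLNSC;

    DECLARE saleRef REFERENCE TO Environment.Data;
    MOVE saleRef FIRSTCHILD NAME 'Message';
    MOVE saleRef FIRSTCHILD NAME 'Body';
    MOVE saleRef FIRSTCHILD NAME 'Sale';
    DECLARE doneRef REFERENCE TO saleRef;  -- trails one element behind
    WHILE LASTMOVE(saleRef) DO
      -- Building the output message forces parsing of just this <Sale>
      SET OutputRoot.XMLNSC.Sale = saleRef;
      PROPAGATE TO TERMINAL 'out';  -- output trees are cleared after each propagate
      -- Step past the processed element, then delete it to keep memory low
      MOVE doneRef TO saleRef;
      MOVE saleRef NEXTSIBLING REPEAT TYPE NAME;
      DELETE FIELD doneRef;
    END WHILE;
    RETURN FALSE;  -- everything was already propagated inside the loop
  END;
END MODULE;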
Have fun
_________________
MQ & Broker admin
ydsk
Posted: Tue Dec 28, 2010 8:54 pm Post subject:
Chevalier
Joined: 23 May 2005  Posts: 410
fjb_saper,
I tried it already, but no luck.
I followed the technique precisely, and I understand the concepts: "on demand" parsing, creating a MUTABLE copy of the InputRoot in the Environment, and the importance of deleting the previous sibling (the repeating element).
No luck so far with any approach for huge XML files.
Thanks
ydsk.
paerjohan
Posted: Thu Dec 30, 2010 4:18 am Post subject:
Newbie
Joined: 19 Jan 2010  Posts: 8
ydsk,
Quote:
When I dropped the 70 MB file, the memory usage for the EG increased to about 390 MB, and the CPU usage was close to 40. When the 105 MB file was dropped, both the memory and CPU usage remained at zero.

There is a default maximum length of 100 MB for records (and whole files) in a FileInput node (see http://www-01.ibm.com/support/docview.wss?uid=swg1IC58202). This can be overridden by setting the environment variable MQSI_FILENODES_MAXIMUM_RECORD_LENGTH.
ydsk
Posted: Thu Dec 30, 2010 6:14 am Post subject:
Chevalier
Joined: 23 May 2005  Posts: 410
Thank you very much paerjohan.
I knew there was some kind of limit (of about 100 MB) on the record size (learnt the hard way), but didn't know exactly what it was or how to increase it. The link you posted was for WMB v6.1, but I believe it must already be available in WMB v7.
In my case, since my file is about a GB in size, I am NOT sure I can increase the memory limit to that point.
Here is what I have learned so far:
In the FileInput node, if we specify 'Whole File' as a record, the entire file is read in one shot no matter what parser (XMLNSC, BLOB, etc.) is used. There is NO streaming involved in this case (any comments?). I feel this has NOTHING to do with ON-DEMAND parsing, which is a separate feature of the XML parsers; ON-DEMAND parsing applies to file content that has already been read, I guess.
If we want the FIN to read the input file as a stream, we need to specify an option other than 'Whole File'. If we specify 'Whole File' in a FIN, the node first gauges the size of the file, and if it is found to be more than the max record size (100 MB by default), the FIN simply won't read/process the file, and moves it to the mqsibackout folder.
A FIN can read multiple XML documents (each as a record, using the 'Parsed Record Sequence' option in the FIN), but NOT repeating XML elements inside a single valid XML file whose total size is more than the max record size (100 MB by default).
When multiple XML documents sit in a single file for a FileInput node to read one by one successfully as a stream, together they don't form a valid XML document (for lack of a single root tag), but each individual XML document in the file is (and should be) valid XML.
In my case, a FIN can read the repeating <Sale> elements in a file of any size (several GBs) when 'Parsed Record Sequence' is specified, AND if all the <Sale> elements are in the file by themselves, without their parent tags (<Message>, <Header>, and <Body>).
I'll let the experts comment on my learnings above.
Thanks
ydsk.
Last edited by ydsk on Thu Dec 30, 2010 6:52 am; edited 1 time in total
lancelotlinc
Posted: Thu Dec 30, 2010 6:51 am Post subject:
Jedi Knight
Joined: 22 Mar 2010  Posts: 4941  Location: Bloomington, IL USA
mqjeff
Posted: Thu Dec 30, 2010 6:58 am Post subject:
Grand Master
Joined: 25 Jun 2008  Posts: 17447
I'd suggest you run some more tests on the FileInput node behavior with the XMLNSC parser and a file that is under MQSI_FILENODES_MAXIMUM_RECORD_LENGTH (i.e., try increasing that to, say, 200 MB or even 2 GB and then retest).
It is my understanding that the FileInput node can support a streaming buffer if a streaming parser is in use, such that it would not actually load the entire contents of the file at once, but merely pass a handle to the stream so that the parser can read whatever chunks it needs.
If your tests do not show this, open a PMR.
optimist
Posted: Thu Dec 30, 2010 7:02 am Post subject:
Apprentice
Joined: 18 Nov 2010  Posts: 33
ydsk --
You earlier said you used a SHARED variable approach to split large test files. Can you not do the same here with the XML files?
On the FIN, set 'Record Detection' to 'Delimited', so that each row of the XML file is sent as a record. Then, within the Compute node, wait for the equivalent of 100 Sale complex types to build up (using closing-element detection), propagate that as an MQ message, and delete the contents of the shared variable.
Of course, you need to hold on to the header data and re-initialize the shared variable after each propagation.
Do you think the above approach is feasible? Something like the sketch below.
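A rough ESQL sketch of that idea (my names; it assumes a single flow instance, treats '</Sale>' as a literal match, and leaves out carrying the header block across batches - the CDATA/comment pitfalls raised below are not handled):
Code:
CREATE COMPUTE MODULE AccumulateSales_Compute
  -- SHARED variables persist across messages; with more than one flow
  -- instance, access would need to be wrapped in an ATOMIC block
  DECLARE buffer    SHARED CHARACTER '';
  DECLARE saleCount SHARED INTEGER 0;

  CREATE FUNCTION Main() RETURNS BOOLEAN
  BEGIN
    -- Each delimited record is one row of the XML file
    DECLARE line CHARACTER CAST(InputRoot.BLOB.BLOB AS CHARACTER
        CCSID InputRoot.Properties.CodedCharSetId);
    SET buffer = buffer || line;

    -- Naive closing-element detection: a literal match only
    IF POSITION('</Sale>' IN line) > 0 THEN
      SET saleCount = saleCount + 1;
    END IF;

    -- Every 100 Sales, emit the batch and reset the shared state
    IF saleCount >= 100 THEN
      SET OutputRoot.BLOB.BLOB = CAST(buffer AS BLOB
          CCSID InputRoot.Properties.CodedCharSetId);
      SET buffer = '';
      SET saleCount = 0;
      PROPAGATE TO TERMINAL 'out';
    END IF;
    RETURN FALSE;
  END;
END MODULE;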
ydsk
Posted: Thu Dec 30, 2010 8:23 am Post subject:
Chevalier
Joined: 23 May 2005  Posts: 410
optimist,
If it were that easy, I would have lost my job by now (for not knowing/trying it until now).
What kind of delimiter can you think of for XML files? Tags can have spaces inside them, and tags can appear inside XML comments, CDATA sections, etc. I'd have to write a mini XML parser myself that way.
I already did exactly what you describe for huge TEXT files, using the BLOB parser and a newline delimiter in my FIN, and that msgflow splits huge TEXT files successfully.
Thanks
ydsk.
Last edited by ydsk on Thu Dec 30, 2010 9:55 am; edited 1 time in total
optimist
Posted: Thu Dec 30, 2010 8:49 am Post subject:
Apprentice
Joined: 18 Nov 2010  Posts: 33
ydsk --
Thanks for the clarification.
The assumptions I made above were:
(1) We split the huge XML file into smaller files of less than 100 MB each, and then use the same LARGE MESSAGE processing technique from the WMB 7 samples to process these smaller files from a queue.
(2) We treat the contents of the XML file as TEXT and do the same as you did to split the TEXT file, except that you have to detect when to split the file and prepare the next message's headers.
Are you saying it is difficult to detect, within ESQL, a line that has a closing element tag like </Sale>?
Just trying to understand more about the issues. Thanks in advance.
--optimist
ydsk
Posted: Thu Dec 30, 2010 9:52 am Post subject:
Chevalier
Joined: 23 May 2005  Posts: 410
YES.
Here are some of the possibilities:
</Sale>, </Sale >, <!-- /Sale -->, and similar occurrences inside CDATA sections.
All are possible, and with what you are suggesting I would need to write code to figure out which is the right end tag.
Thanks
ydsk.
ydsk
Posted: Thu Dec 30, 2010 12:37 pm Post subject:
Chevalier
Joined: 23 May 2005  Posts: 410
Thank you lancelotlinc and mqjeff.
mqjeff, I am already trying what you said.
I am thinking of coding the following in the broker, in 2 separate msgflows (a rough sketch of the scanning logic follows this list):
1) Read the huge input XML file as fixed-length BLOB records and locate the <Header> block (by writing a mini XML parser), then save it somewhere (a queue or a file) to be read later by an MQGet or FileRead node, and mark the position in the input file where the </Header> end tag ends. The ESQL code should be able to find the <Header> block within the first couple of records read from the file, though the code should be generic enough to expect any tag anywhere.
2) Once the end of the </Header> tag is found, figure out the beginning of the first <Sale> tag, and from there to the end of the file (skipping the final </Body> and </Message> tags) write the whole BLOB out (the individual fixed-length BLOB records concatenated) to a temporary file.
3) Once the two steps above are completed in a single msgflow, process the huge temporary file, which now contains just the <Sale> elements, in a simple msgflow (FIN --> Compute --> FON), specifying 'Parsed Record Sequence' and a msgset for <Sale> with the XMLNSC parser. I know step 3 is easy, and it should work.
The real work is in coding steps 1 and 2 above, and I hope I'll be successful in the new year 2011.
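For what it's worth, the scan in step 1 might look something like this (a sketch under assumptions: a UTF-8 file, a single flow instance, illustrative names, and no reset of the shared state between files; the found position is reported by propagating a small message):
Code:
CREATE COMPUTE MODULE FindHeaderEnd_Compute
  -- State carried across the fixed-length BLOB records of one file
  DECLARE carry  SHARED BLOB    X'';
  DECLARE found  SHARED BOOLEAN FALSE;
  DECLARE offset SHARED INTEGER 0;  -- bytes of the file consumed so far

  CREATE FUNCTION Main() RETURNS BOOLEAN
  BEGIN
    DECLARE tag   BLOB CAST('</Header>' AS BLOB CCSID 1208);  -- assumes UTF-8
    DECLARE chunk BLOB carry || InputRoot.BLOB.BLOB;
    DECLARE pos   INTEGER POSITION(tag IN chunk);

    IF NOT found AND pos > 0 THEN
      SET found = TRUE;
      -- Absolute byte position in the file where </Header> ends
      SET OutputRoot.XMLNSC.HeaderEnd =
          offset - LENGTH(carry) + pos + LENGTH(tag) - 1;
      PROPAGATE TO TERMINAL 'out';
    END IF;

    SET offset = offset + LENGTH(InputRoot.BLOB.BLOB);
    -- Keep the last LENGTH(tag)-1 bytes in case the tag straddles two records
    IF LENGTH(chunk) >= LENGTH(tag) THEN
      SET carry = SUBSTRING(chunk FROM LENGTH(chunk) - LENGTH(tag) + 2);
    ELSE
      SET carry = chunk;
    END IF;
    RETURN FALSE;
  END;
END MODULE;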
Thank you all.
ydsk.
ydsk
Posted: Sun Jan 02, 2011 9:36 pm Post subject:
Chevalier
Joined: 23 May 2005  Posts: 410
I just tried increasing the max record size by specifying MQSI_FILENODES_MAXIMUM_RECORD_LENGTH = 150000000 (150 MB), and my msgflow could process an XML file of 103 MB successfully (not possible earlier), but it failed to process an XML file of 200 MB (the flow simply moved the file to the mqsibackout directory).
---------------------------------------------------------------
FIN --> Compute --> FON
FIN properties: 'Whole File' record detection, XMLNSC (no msgset), 'On Demand' parse timing.
Compute node properties: copy InputRoot to OutputRoot; set the file/directory names in the OutputLocalEnvironment appropriately; compute mode set to 'All'. (Instead of the root-to-root copy, I also tried copying just the <Header> from InputRoot to OutputRoot, but the behavior of the msgflow was exactly the same: it couldn't process an XML file beyond 200 MB.)
FON properties: 'Whole File' as the record
---------------------------------------------------------------
There is NO streaming involved when a FIN reads an XML file with 'Whole File' specified for the record AND the XMLNSC parser in use, so the entire file is read in one shot.
The 'Large Messaging' sample, too, reads the entire XML message FROM ITS SOURCE (the input queue) in one shot and then stores it in a ROW variable. The message is then parsed as needed from the ROW variable, and each processed XML tag is deleted in a loop. So the technique shown in the sample requires holding the entire input XML in the flow/EG's memory (unparsed, of course); the only memory savings are in parsing the XML, not in reading it from the source. There is NO streaming involved in reading the XML message from the source here either.
The behavior of my msgflow and of the 'Large Messaging' sample seems to confirm what I just stated.
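In outline, that sample pattern is the same reference loop sketched earlier in this thread, but anchored on a ROW variable (my compressed paraphrase, not the sample's literal source; field names follow my file layout):
Code:
-- The ENTIRE unparsed input lands in memory at the first SET;
-- only the parsing (plus the per-element DELETE) is incremental
DECLARE fullMsg ROW;
SET fullMsg.Data = InputRoot.XMLNSC;

DECLARE saleRef REFERENCE TO fullMsg.Data;
MOVE saleRef FIRSTCHILD NAME 'Message';
MOVE saleRef FIRSTCHILD NAME 'Body';
MOVE saleRef FIRSTCHILD NAME 'Sale';
DECLARE doneRef REFERENCE TO saleRef;
WHILE LASTMOVE(saleRef) DO
  SET OutputRoot.XMLNSC.Sale = saleRef;   -- parses only this element
  PROPAGATE TO TERMINAL 'out';
  MOVE doneRef TO saleRef;
  MOVE saleRef NEXTSIBLING REPEAT TYPE NAME;
  DELETE FIELD doneRef;                   -- frees the already-parsed portion
END WHILE;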
I'll let Kimbert or the other Hursley folks comment on this.
Thanks
ydsk.
ydsk
Posted: Wed Jan 05, 2011 6:48 pm Post subject:
Chevalier
Joined: 23 May 2005  Posts: 410
Kimbert / Hursley people on the forum,
Is there anything you can say about the issue of streaming when a FIN reads an input XML file (when the record is 'Whole File')?
All my testing so far has shown that there is NO streaming involved in reading the file, with the parameters I specified and the XML structure I mentioned.
The FileInput node tries to read the whole file in a single attempt, and if the file size is more than MQSI_FILENODES_MAXIMUM_RECORD_LENGTH it simply moves the file to mqsibackout without reading it.
Thanks
ydsk.
blee
Posted: Tue Jan 11, 2011 12:28 pm Post subject:
Newbie
Joined: 28 Feb 2006  Posts: 7
Hi ydsk:
Have you considered using XSLT to 'split' the huge XML file/message?
I have been using XSLT to perform numerous transformations (e.g. XML to XML, keeping specific/desired elements from the document tree), and it works well for situations where you need to extract from or split a large file/message.