fjb_saper
Posted: Tue Dec 28, 2010 8:44 pm Post subject:
Grand High Poobah
Joined: 18 Nov 2003  Posts: 20756  Location: LI,NY
Did you try this?
Copy the InputRoot to Environment.Data. Parsing is still on demand.
Now loop over the Environment using a reference and build your output message:
Start loop:
- build the output message (forces parsing of the relevant part of the stream)
- propagate the output message
- delete the output message
- delete the processed part of Environment.Data
End loop.
AFAIK copying the InputRoot to Environment.Data lets you move the input stream from one tree to the other without parsing it. Reading through the Environment forces parsing, and deleting each processed part from the Environment (which is mutable) keeps the memory usage low.
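In ESQL the loop would look something like this (a minimal sketch only, assuming the Message/Body/Sale layout described elsewhere in this thread; the module and terminal names are illustrative):
Code:
CREATE COMPUTE MODULE SplitSales_Compute
  CREATE FUNCTION Main() RETURNS BOOLEAN
  BEGIN
    -- Attach a mutable, parser-owned copy of the input to the Environment;
    -- with on-demand parsing the bitstream should move across unparsed
    CREATE LASTCHILD OF Environment DOMAIN('XMLNSC') NAME 'Data';
    SET Environment.Data = InputRoot.XMLNSC;

    DECLARE saleRef REFERENCE TO Environment.Data;
    MOVE saleRef FIRSTCHILD NAME 'Message';
    MOVE saleRef FIRSTCHILD NAME 'Body';
    MOVE saleRef FIRSTCHILD NAME 'Sale';
    DECLARE doneRef REFERENCE TO saleRef;  -- trails one element behind
    WHILE LASTMOVE(saleRef) DO
      -- Building the output message forces parsing of just this <Sale>
      SET OutputRoot.XMLNSC.Sale = saleRef;
      PROPAGATE TO TERMINAL 'out';  -- output trees are cleared after each propagate
      -- Step past the processed element, then delete it to keep memory low
      MOVE doneRef TO saleRef;
      MOVE saleRef NEXTSIBLING REPEAT TYPE NAME;
      DELETE FIELD doneRef;
    END WHILE;
    RETURN FALSE;  -- everything was already propagated inside the loop
  END;
END MODULE;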
Have fun
_________________
MQ & Broker admin
ydsk
Posted: Tue Dec 28, 2010 8:54 pm Post subject:
Chevalier
Joined: 23 May 2005  Posts: 410
fjb_saper,
I tried it already, but no luck.
I followed the technique precisely, and I understand the concepts: "on demand" parsing, creating a MUTABLE copy of the InputRoot in the Environment, and the importance of deleting the previous sibling (the repeating element).
No luck so far with any approach for huge XML files.
Thanks
ydsk.
paerjohan
Posted: Thu Dec 30, 2010 4:18 am Post subject:
Newbie
Joined: 19 Jan 2010  Posts: 8
ydsk,
Quote:
When I dropped the 70 MB file, the memory usage for the EG increased to about 390 MB, and the CPU usage was close to 40. When the 105 MB file was dropped, both the memory and CPU usage remained at zero.

There is a default maximum length of 100 MB for records (and whole files) in a FileInput node (see http://www-01.ibm.com/support/docview.wss?uid=swg1IC58202). This can be overridden by setting the environment variable MQSI_FILENODES_MAXIMUM_RECORD_LENGTH.
ydsk
Posted: Thu Dec 30, 2010 6:14 am Post subject:
Chevalier
Joined: 23 May 2005  Posts: 410
Thank you very much paerjohan.
I knew there was some kind of limit (of about 100 MB) on the record size (learnt the hard way), but didn't know exactly what it was or how to increase it. The link you posted was for WMB v6.1, but I believe it must already be available in WMB v7.
In my case, since my file is about a GB in size, I am NOT sure I can increase the memory limit to that point.
Here is what I have learned so far:
In the FileInput node, if we specify 'Whole File' as a record, the entire file is read in one shot no matter what parser (XMLNSC, BLOB, etc.) is used. There is NO streaming involved in this case (any comments?). I feel this has NOTHING to do with ON-DEMAND parsing, which is a separate feature of the XML parsers; ON-DEMAND parsing applies to file content that has already been read, I guess.
If we want the FIN to read the input file as a stream, we need to specify an option other than 'Whole File'. If we specify 'Whole File' in a FIN, the node first gauges the size of the file, and if it is found to be more than the max record size (100 MB by default), the FIN simply won't read/process the file, and moves it to the mqsibackout folder.
A FIN can read multiple XML documents (each as a record, using the 'Parsed Record Sequence' option in the FIN), but NOT repeating XML elements inside a single valid XML file whose total size is more than the max record size (100 MB by default).
When multiple XML documents sit in a single file for a FileInput node to read one by one successfully as a stream, together they don't form a valid XML document (for lack of a single root tag), but each individual XML document in the file is (and should be) valid XML.
In my case, a FIN can read the repeating <Sale> elements in a file of any size (several GBs) when 'Parsed Record Sequence' is specified, AND if all the <Sale> elements are in the file by themselves, without their parent tags (<Message>, <Header>, and <Body>).
I'll let the experts comment on my learnings above.
Thanks
ydsk.
Last edited by ydsk on Thu Dec 30, 2010 6:52 am; edited 1 time in total
lancelotlinc
Posted: Thu Dec 30, 2010 6:51 am Post subject:
Jedi Knight
Joined: 22 Mar 2010  Posts: 4941  Location: Bloomington, IL USA
mqjeff
Posted: Thu Dec 30, 2010 6:58 am Post subject:
Grand Master
Joined: 25 Jun 2008  Posts: 17447
I'd suggest you run some more tests on the FileInput node behavior with the XMLNSC parser and a file that is under MQSI_FILENODES_MAXIMUM_RECORD_LENGTH (i.e., try increasing that to, say, 200 MB or even 2 GB and then retest).
It is my understanding that the FileInput node can support a streaming buffer if a streaming parser is in use, such that it would not actually load the entire contents of the file at once, but merely pass a handle to the stream so that the parser can read whatever chunks it needs.
If your tests do not show this, open a PMR.
optimist
Posted: Thu Dec 30, 2010 7:02 am Post subject:
Apprentice
Joined: 18 Nov 2010  Posts: 33
ydsk --
You earlier said you used a SHARED variable approach to split large test files. Can you not do the same here with the XML files?
On the FIN, set 'Record Detection' to 'Delimited', so that each row of the XML file is sent as a record. Then, within the Compute node, wait for the equivalent of 100 Sale complex types to build up (using closing-element detection), propagate that as an MQ message, and delete the contents of the shared variable.
Of course, you need to hold on to the header data and re-initialize the shared variable after each propagation.
Do you think the above approach is feasible? Something like the sketch below.
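A rough ESQL sketch of that idea (my names; it assumes a single flow instance, treats '</Sale>' as a literal match, and leaves out carrying the header block across batches - the CDATA/comment pitfalls raised below are not handled):
Code:
CREATE COMPUTE MODULE AccumulateSales_Compute
  -- SHARED variables persist across messages; with more than one flow
  -- instance, access would need to be wrapped in an ATOMIC block
  DECLARE buffer    SHARED CHARACTER '';
  DECLARE saleCount SHARED INTEGER 0;

  CREATE FUNCTION Main() RETURNS BOOLEAN
  BEGIN
    -- Each delimited record is one row of the XML file
    DECLARE line CHARACTER CAST(InputRoot.BLOB.BLOB AS CHARACTER
        CCSID InputRoot.Properties.CodedCharSetId);
    SET buffer = buffer || line;

    -- Naive closing-element detection: a literal match only
    IF POSITION('</Sale>' IN line) > 0 THEN
      SET saleCount = saleCount + 1;
    END IF;

    -- Every 100 Sales, emit the batch and reset the shared state
    IF saleCount >= 100 THEN
      SET OutputRoot.BLOB.BLOB = CAST(buffer AS BLOB
          CCSID InputRoot.Properties.CodedCharSetId);
      SET buffer = '';
      SET saleCount = 0;
      PROPAGATE TO TERMINAL 'out';
    END IF;
    RETURN FALSE;
  END;
END MODULE;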
ydsk
Posted: Thu Dec 30, 2010 8:23 am Post subject:
Chevalier
Joined: 23 May 2005  Posts: 410
optimist,
If it were that easy, I would have lost my job by now (for not knowing/trying it until now).
What kind of delimiter can you think of for XML files? Tags can have spaces inside them, and tags can appear inside XML comments, CDATA sections, etc. I'd have to write a mini XML parser myself that way.
I already did exactly what you describe for huge TEXT files, using the BLOB parser and a newline delimiter in my FIN, and that msgflow splits huge TEXT files successfully.
Thanks
ydsk.
Last edited by ydsk on Thu Dec 30, 2010 9:55 am; edited 1 time in total
optimist
Posted: Thu Dec 30, 2010 8:49 am Post subject:
Apprentice
Joined: 18 Nov 2010  Posts: 33
ydsk --
Thanks for the clarification.
The assumptions I made above were:
(1) We split the huge XML file into smaller files of less than 100 MB each, and then use the same LARGE MESSAGE processing technique from the WMB 7 samples to process these smaller files from a queue.
(2) We treat the contents of the XML file as TEXT and do the same as you did to split the TEXT file, except that you have to detect when to split the file and prepare the next message's headers.
Are you saying it is difficult to detect, within ESQL, a line that has a closing element tag like </Sale>?
Just trying to understand more about the issues. Thanks in advance.
--optimist
ydsk
Posted: Thu Dec 30, 2010 9:52 am Post subject:
Chevalier
Joined: 23 May 2005  Posts: 410
YES.
Here are some of the possibilities:
</Sale>, </Sale >, <!-- /Sale -->, and similar occurrences inside CDATA sections.
All are possible, and with what you are suggesting I would need to write code to figure out which is the right end tag.
Thanks
ydsk.
ydsk
Posted: Thu Dec 30, 2010 12:37 pm Post subject:
Chevalier
Joined: 23 May 2005  Posts: 410
Thank you lancelotlinc and mqjeff.
mqjeff, I am already trying what you said.
I am thinking of coding the following in the broker, in 2 separate msgflows (a rough sketch of the scanning logic follows this list):
1) Read the huge input XML file as fixed-length BLOB records and locate the <Header> block (by writing a mini XML parser), then save it somewhere (a queue or a file) to be read later by an MQGet or FileRead node, and mark the position in the input file where the </Header> end tag ends. The ESQL code should be able to find the <Header> block within the first couple of records read from the file, though the code should be generic enough to expect any tag anywhere.
2) Once the end of the </Header> tag is found, figure out the beginning of the first <Sale> tag, and from there to the end of the file (skipping the final </Body> and </Message> tags) write the whole BLOB out (the individual fixed-length BLOB records concatenated) to a temporary file.
3) Once the two steps above are completed in a single msgflow, process the huge temporary file, which now contains just the <Sale> elements, in a simple msgflow (FIN --> Compute --> FON), specifying 'Parsed Record Sequence' and a msgset for <Sale> with the XMLNSC parser. I know step 3 is easy, and it should work.
The real work is in coding steps 1 and 2 above, and I hope I'll be successful in the new year 2011.
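For what it's worth, the scan in step 1 might look something like this (a sketch under assumptions: a UTF-8 file, a single flow instance, illustrative names, and no reset of the shared state between files; the found position is reported by propagating a small message):
Code:
CREATE COMPUTE MODULE FindHeaderEnd_Compute
  -- State carried across the fixed-length BLOB records of one file
  DECLARE carry  SHARED BLOB    X'';
  DECLARE found  SHARED BOOLEAN FALSE;
  DECLARE offset SHARED INTEGER 0;  -- bytes of the file consumed so far

  CREATE FUNCTION Main() RETURNS BOOLEAN
  BEGIN
    DECLARE tag   BLOB CAST('</Header>' AS BLOB CCSID 1208);  -- assumes UTF-8
    DECLARE chunk BLOB carry || InputRoot.BLOB.BLOB;
    DECLARE pos   INTEGER POSITION(tag IN chunk);

    IF NOT found AND pos > 0 THEN
      SET found = TRUE;
      -- Absolute byte position in the file where </Header> ends
      SET OutputRoot.XMLNSC.HeaderEnd =
          offset - LENGTH(carry) + pos + LENGTH(tag) - 1;
      PROPAGATE TO TERMINAL 'out';
    END IF;

    SET offset = offset + LENGTH(InputRoot.BLOB.BLOB);
    -- Keep the last LENGTH(tag)-1 bytes in case the tag straddles two records
    IF LENGTH(chunk) >= LENGTH(tag) THEN
      SET carry = SUBSTRING(chunk FROM LENGTH(chunk) - LENGTH(tag) + 2);
    ELSE
      SET carry = chunk;
    END IF;
    RETURN FALSE;
  END;
END MODULE;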
Thank you all.
ydsk.
ydsk
Posted: Sun Jan 02, 2011 9:36 pm Post subject:
Chevalier
Joined: 23 May 2005  Posts: 410
I just tried increasing the max record size by specifying MQSI_FILENODES_MAXIMUM_RECORD_LENGTH = 150000000 (150 MB), and my msgflow could process an XML file of 103 MB successfully (not possible earlier), but it failed to process an XML file of 200 MB (the flow simply moved the file to the mqsibackout directory).
---------------------------------------------------------------
FIN --> Compute --> FON
FIN properties: 'Whole File' record detection, XMLNSC (no msgset), 'On Demand' parse timing.
Compute node properties: copy InputRoot to OutputRoot; set the file/directory names in the OutputLocalEnvironment appropriately; compute mode set to 'All'. (Instead of the root-to-root copy, I also tried copying just the <Header> from InputRoot to OutputRoot, but the behavior of the msgflow was exactly the same: it couldn't process an XML file beyond 200 MB.)
FON properties: 'Whole File' as the record
---------------------------------------------------------------
There is NO streaming involved when a FIN reads an XML file with 'Whole File' specified for the record AND the XMLNSC parser in use, so the entire file is read in one shot.
The 'Large Messaging' sample, too, reads the entire XML message FROM ITS SOURCE (the input queue) in one shot and then stores it in a ROW variable. The message is then parsed as needed from the ROW variable, and each processed XML tag is deleted in a loop. So the technique shown in the sample requires holding the entire input XML in the flow/EG's memory (unparsed, of course); the only memory savings are in parsing the XML, not in reading it from the source. There is NO streaming involved in reading the XML message from the source here either.
The behavior of my msgflow and of the 'Large Messaging' sample seems to confirm what I just stated.
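In outline, that sample pattern is the same reference loop sketched earlier in this thread, but anchored on a ROW variable (my compressed paraphrase, not the sample's literal source; field names follow my file layout):
Code:
-- The ENTIRE unparsed input lands in memory at the first SET;
-- only the parsing (plus the per-element DELETE) is incremental
DECLARE fullMsg ROW;
SET fullMsg.Data = InputRoot.XMLNSC;

DECLARE saleRef REFERENCE TO fullMsg.Data;
MOVE saleRef FIRSTCHILD NAME 'Message';
MOVE saleRef FIRSTCHILD NAME 'Body';
MOVE saleRef FIRSTCHILD NAME 'Sale';
DECLARE doneRef REFERENCE TO saleRef;
WHILE LASTMOVE(saleRef) DO
  SET OutputRoot.XMLNSC.Sale = saleRef;   -- parses only this element
  PROPAGATE TO TERMINAL 'out';
  MOVE doneRef TO saleRef;
  MOVE saleRef NEXTSIBLING REPEAT TYPE NAME;
  DELETE FIELD doneRef;                   -- frees the already-parsed portion
END WHILE;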
I'll let Kimbert or the other Hursley folks comment on this.
Thanks
ydsk.
ydsk
Posted: Wed Jan 05, 2011 6:48 pm Post subject:
Chevalier
Joined: 23 May 2005  Posts: 410
Kimbert / Hursley people on the forum,
Is there anything you can say about the issue of streaming when a FIN reads an input XML file (when the record is 'Whole File')?
All my testing so far has shown that there is NO streaming involved in reading the file, with the parameters I specified and the XML structure I mentioned.
The FileInput node tries to read the whole file in a single attempt, and if the file size is more than MQSI_FILENODES_MAXIMUM_RECORD_LENGTH it simply moves the file to mqsibackout without reading it.
Thanks
ydsk.
blee
Posted: Tue Jan 11, 2011 12:28 pm Post subject:
Newbie
Joined: 28 Feb 2006  Posts: 7
Hi ydsk:
Have you considered using XSLT to 'split' the huge XML file/message?
I have been using XSLT to perform numerous transformations (e.g. XML to XML, keeping specific/desired elements from the document tree), and it works well for situations where you need to extract from or split a large file/message.