MQSeries.net Forum Index » WebSphere Message Broker Support » splitting large text files (BLOB) into smaller ones in WMBv7

ydsk
PostPosted: Fri Dec 24, 2010 11:14 am    Post subject: splitting large text files (BLOB) into smaller ones in WMBv7

Chevalier

Joined: 23 May 2005
Posts: 409

We use WMQ v7 with WMB v7 on Windows 2008 R2 servers.
We are at MQ v7.0.1, and WMB v7.0.0.1.

We need to process very large XML files (100-200 MB), AND very large text files (100-200 MB) with repeating elements, and split them into smaller files for further processing by existing msgflows.

Until now a Java program did the splitting, but we want to rewrite the code in Message Broker as it seems to have the capability.

I saw the "Large Messaging" sample provided by IBM and it seems to be good for XML files.

I am looking for something similar for text files (BLOB domain). Our text files have a fixed header and a repeating record structure (a line of 154 characters that ends in "carriage return" + "newline", i.e. x'0D0A'). The file size can be between 100 and 200 MB, and we need to split it into smaller files, each with the fixed header and a fixed number (say 1000) of records.
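To make the requirement concrete, here is a minimal ESQL sketch of that split, assuming the whole file arrives as a single BLOB on a FileInput ('Whole File' record) --> Compute --> FileOutput flow. The module name, header length, and batch size are assumptions, not code from the thread; note also that holding a 100-200 MB file as one BLOB is exactly what the memory constraint below argues against, so a record-based FileInput configuration may be preferable in practice.

```esql
CREATE COMPUTE MODULE SplitBlobFile_Compute
    CREATE FUNCTION Main() RETURNS BOOLEAN
    BEGIN
        DECLARE HDR_LEN  INTEGER 156;    -- assumed header length (one 154-char line + x'0D0A')
        DECLARE REC_LEN  INTEGER 156;    -- 154 characters + CR + LF
        DECLARE PER_FILE INTEGER 1000;   -- records per output file
        DECLARE data     BLOB    InputRoot.BLOB.BLOB;
        DECLARE header   BLOB    SUBSTRING(data FROM 1 FOR HDR_LEN);
        DECLARE total    INTEGER LENGTH(data);
        DECLARE pos      INTEGER HDR_LEN + 1;
        DECLARE chunkLen INTEGER;

        WHILE pos <= total DO
            SET chunkLen = PER_FILE * REC_LEN;
            IF pos + chunkLen - 1 > total THEN
                SET chunkLen = total - pos + 1;   -- final, partial chunk
            END IF;
            -- Each output file = the fixed header + up to PER_FILE records
            SET OutputRoot.BLOB.BLOB = header || SUBSTRING(data FROM pos FOR chunkLen);
            PROPAGATE;   -- OutputRoot is cleared again after each propagate
            SET pos = pos + chunkLen;
        END WHILE;
        RETURN FALSE;    -- all output files were already propagated
    END;
END MODULE;
```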

The main constraint is memory. We have a total of 8 GB of memory on our WMB v7 server.

Can someone please suggest any precautions we need to take for very large text (BLOB) files? I have already gone through the relevant links in the WMB v7 Info Center but need more help.

Please answer these questions in particular:

(1) Do we need to use/declare a ROW variable for text files, just like the one given in the 'Large Messaging' sample for XML? The sample can be found in the toolkit help / Info Center. I don't see the need for the ROW variable for a large text file as there is NO parsing overhead involved (unlike XML files). I think we can directly work with the IMMUTABLE InputRoot without the ROW variable, but somebody can correct me if I am wrong. Appreciate any suggestions here.

(2) Do we need to increase the heap size (not the JVM heap size - we are NOT using Java, just ESQL) available to an execution group? If so, how? I already read the relevant documentation in the Info Center but it's NOT mentioned anywhere.

(3) The Info Center says we can reduce the JVM heap size if we don't use Java. We are NOT using Java; everything is in ESQL. There is NO XSLTransformation at all. What is the minimum value we can set the JVM heap size to for an execution group on the Windows 2008 R2 64-bit platform?

(4) Do we need to increase the stack size ? If so, how ?

(5) Are there any other settings we need for handling large files in WMB v7? We are changing the queue depth / maximum message length where applicable.

Appreciate any hints/suggestions/ideas.

Thanks in advance.
ydsk.


Last edited by ydsk on Tue Dec 28, 2010 7:38 pm; edited 1 time in total
kimbert
PostPosted: Mon Dec 27, 2010 4:57 am

Jedi Council

Joined: 29 Jul 2003
Posts: 5298
Location: Hursley

Quote:
I am looking for something similar on text files (BLOB domain). Our text files have a fixed header, and a repeating record structure (a line with 154 characters that ends in "carriage return" + "newline" i.e., x'0d0a'). The file size can be between 100 - 200 MB, and we need to split it into smaller files each with the fixed header, and a fixed number(say 1000) of records.
Should be fairly straightforward. The FileInput node has facilities for automatically splitting a file based on length/delimiters. I'm surprised that you didn't mention that option - maybe you looked into it and decided that it cannot handle the header?
I'm not an expert on the heap size/stack size issues, but my instinct is that a good implementation of this flow will reduce your maximum heap requirement to a normal level where no tuning is required.

(1) Sounds as if you understand the issues pretty well. I suggest that you trust your instincts and give it a try.
(2) See my answer above - not sure why this flow would have extraordinary heap requirements if implemented correctly.
(3), (4), (5) I don't know. Hopefully others will. I would guess that the FileInput ( which streams the input to the parser in chunks ) will keep heap requirements to a reasonable maximum regardless of the size of the input.
mqjeff
PostPosted: Mon Dec 27, 2010 5:37 am

Grand Master

Joined: 25 Jun 2008
Posts: 13229

Some percentage of the built-in nodes are written using Java - some unspecified, undocumented percentage.

You can't avoid using any Java at all in WMB7.

The Java heap and stack will only grow as large as is actually needed, it will not ever grow "just because".
ydsk
PostPosted: Mon Dec 27, 2010 7:21 am

Thanks Kimbert/mqjeff.

Actually my scenario changed, it is different now.

I was able to get the "splitting text files larger than 200 MB" working in a msgflow through the use of shared variables.

FileInput --> Compute --> FileOutput.

The msgflow performed very well, using less than 50% CPU most of the time and less than 100 MB of memory.
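For anyone finding this thread later: the shared-variable approach for the text files might look something like the sketch below. This is illustrative only, not the poster's actual code - the module and variable names, the fixed-length (156-byte) record setting assumed on the FileInput node, and the batch size are all assumptions.

```esql
CREATE COMPUTE MODULE AccumulateTextRecords_Compute
    -- Shared variables persist across messages (records) flowing through the node
    DECLARE batchBuf SHARED BLOB;        -- starts NULL
    DECLARE batchCnt SHARED INTEGER 0;

    CREATE FUNCTION Main() RETURNS BOOLEAN
    BEGIN
        DECLARE PER_FILE INTEGER 1000;   -- records per output file
        DECLARE send BOOLEAN FALSE;
        BEGIN ATOMIC                     -- serialize access to the shared variables
            -- NULL || record is NULL, so fall back to just this record on the first pass
            SET batchBuf = COALESCE(batchBuf || InputRoot.BLOB.BLOB, InputRoot.BLOB.BLOB);
            SET batchCnt = batchCnt + 1;
            IF batchCnt >= PER_FILE THEN
                -- Prepend the saved fixed header here if each output file needs one
                SET OutputRoot.BLOB.BLOB = batchBuf;
                SET batchBuf = NULL;
                SET batchCnt = 0;
                SET send = TRUE;
            END IF;
        END ATOMIC;
        RETURN send;                     -- TRUE sends the completed batch to the FileOutput node
    END;
END MODULE;
```

One caveat: a final partial batch is not flushed by this module as written; the FileInput node's End of Data terminal can be wired to drive one last propagate for the remainder.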

But my problem now is with the large XML files (200 MB or bigger). I thought the "Large Messaging" sample provided in the toolkit samples gallery would solve my problem. But it is throwing an exception saying XML parsing errors have occurred because a stream was asked to provide all of its data, but it can only provide it as a stream.

I know the XML is well-formed as I can see it in a browser. There is NO CDATA or anything; it is plain XML.

Does the technique shown in the 'Large Messaging' sample really work? I DO understand the "deleting the previous sibling while working on a repeating XML field" part. But I am doubtful about using the ROW variable, and in particular about the following statement in the sample:

SET rowCachedInputXML.XMLNSC = InputRoot.XMLNSC;

The above line of code seems to be trying to copy the entire input XML message to the ROW variable, and NOT the stream.

Our xml is very similar to the one shown in the sample, and has a repeating structure inside.

Can somebody please share some thoughts on the technique used in the 'Large Messaging' sample, and if there are any better techniques available ?

Thanks.
ydsk.


Last edited by ydsk on Wed Dec 29, 2010 8:10 am; edited 1 time in total
fjb_saper
PostPosted: Mon Dec 27, 2010 9:44 am

Grand Poobah

Joined: 18 Nov 2003
Posts: 16123
Location: LI,NY

Try setting the node to parse on demand...

Use
SET Environment.Data.XMLNSC = InputRoot.XMLNSC;

Then parse as defined in the Large Message Sample.
_________________
MQ & Broker admin
ydsk
PostPosted: Mon Dec 27, 2010 10:02 am

fjb_saper,

The parsing is already the default (set to 'On Demand'), and the parser is XMLNSC.

Whether we copy it to Environment or a ROW variable, the technique should be the same.

My question is about the technique used in the 'Large Messaging' sample. Do you know if it really works ?

Please advise.

I'll give the Environment stuff a try meanwhile.

thanks
ydsk.
kimbert
PostPosted: Mon Dec 27, 2010 10:17 am

Quote:
But my problem now is with the large XML files (200 MB). I thought the "Large Messaging" sample provided in the toolkit samples gallery would solve my problem. But, it is throwing an exception saying XML parsing errors have occurred because a stream was asked to provide all of its data, but it can only provide it as a stream.
Please take a user trace, format it, and post the relevant part. Please enclose the user trace output in [code] tags to make it readable.
Quote:
My question is about the technique used in the 'Large Messaging' sample. Do you know if it really works ?
You could try the scenario in the sample. If it works, then there is nothing wrong with the technique. I expect the sample is based on this developerWorks article, which has been used successfully in dozens of projects: http://www.ibm.com/developerworks/websphere/library/techarticles/0505_storey/0505_storey.html
ydsk
PostPosted: Mon Dec 27, 2010 5:14 pm

An update:

There were some problems with my ESQL, and after fixing them I did some tests with large XML files of varying sizes. I have: FIN --> Compute --> FON.
I specified 'Whole File' as the record on the FileInput node, with XMLNSC and 'On Demand' parsing, and it could only process files smaller than roughly 100 MB.

I kept increasing my input XML file size by adding more and more repeating XML structures inside, starting from a 5 MB file, and my msgflow could split a 70 MB file successfully. Then I increased my input XML file size to 105 MB, and it wasn't processed. The whole input file was simply moved to the mqsibackout directory without processing.

I also came to know our XML files could be a GB in size. So I realized (upon more reading) that my approach of specifying 'Whole File' on my FIN won't work beyond about 100 MB, as the entire file is taken as a single record.

I think I need to specify 'Parsed Record Sequence' on my FileInput node and try that.

I guess I will have to use SHARED variables too (to store and concatenate previously read records), just like in the case of the large TEXT files.

Will do that and then update again.

Thank you Kimbert !

ydsk.
mqjeff
PostPosted: Mon Dec 27, 2010 6:15 pm

ydsk wrote:
I guess I will have to use SHARED variables too (to store and concatenate previous records read), just like in the case of large TEXT files


That's certainly *a* solution.

You could also, for example..., add a sequence or record number of some kind to the output XML (it doesn't have to *exactly* match the input XML for any reason at all).

Then you could use... *something* to ... "re-sequence"... the records at the time that you need to process them in "order".
ydsk
PostPosted: Tue Dec 28, 2010 7:12 am

Thanks Kimbert and mqjeff.

Here is what I tried.

My flow still looks like: FIN --> Compute --> FON

My input XML file is like this (it could be a GB in size):
-----------------------------------------------------------------------------
<Message>
<Header>
. . .
. . .
</Header>
<Body>
<Sale>
. . .
</Sale>
. . .
</Body>
</Message>
------------------- The <Sale> complex element repeats about a MILLION times or more ----------

My requirement is to split the HUGE XML file above into smaller files with the same structure, each with the same header and a fixed number (say 100) of the repeating element <Sale>.

I tried 'Parsed Record Sequence' with XMLNSC ('On Demand' parsing) and a message set where I specified maxOccurs=100 for my repeating XML element <Sale>. I tried with an input XML that had 201 repeating <Sale> elements, but it read the entire file in one shot. I saw this when I added a Trace node just after my FIN in the msgflow.

My question is: how do I read just a fixed number (100) of repeating <Sale> elements AT A TIME from my **HUGE** input XML file that has about a million of the repeating <Sale> elements?

Am I doing the right things ?

Is it possible in WMB v7 to split a huge XML file of the order of a GB ?

Can someone please advise ?

Thank you very much.
ydsk.
kimbert
PostPosted: Tue Dec 28, 2010 2:01 pm

Quote:
Is it possible in WMB v7 to split a huge XML file of the order of a GB ?
Yes. Categorically yes. The FileInput node was purposely designed to allow the processing of huge (multi-GB) files.

Quote:
I tried giving "parsed record sequence" wih XMLNSC (On Demand parsing), and a msgSet where I specified maxOccurs=100 for my repeating xml element <Sale>. I tried with an input xml that had 201 repeating <Sale> elements , but it read the entire file in 1 shot. I saw it when I added a trace node just after my FIN in the msgflow.
The FileInput node will expect the input file to contain multiple XML *documents*. It will also expect the header to be included in each document. You have a single, large document - I don't think the FileInput node allows you to split that document using Parsed Record Sequence.
Setting maxOccurs will not affect the behaviour of the XMLNSC parser at all - unless you set Validation to 'Content And Value', in which case it will trigger validation errors (because you have more than 100 occurrences in your input document).

So...what do you need to do? I suggest this:

- set Parse Timing to 'On demand' ( already done )
- don't bother with any Record-splitting features in the FileInput node
- write your ESQL so that it does the following:

Code:
Save the header in the local environment
Repeat until no data left
  Copy the header into OutputRoot.XMLNSC
  Copy ( up to ) 100 records from InputRoot.XMLNSC using a REFERENCE variable, and checking LASTMOVE after each copy.
  Propagate the message
  Delete OutputRoot.XMLNSC


The FileInput node will *not* read the entire 1 GB file into memory. The XMLNSC parser will only read as far as it needs to in order to satisfy the current line of ESQL. Make sure that you delete data from OutputRoot.XMLNSC each time, or else you will build the entire message tree for the 1 GB file in memory. That *would* require quite a lot of heap.
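A sketch of how that outline might look in ESQL - not the poster's or IBM's actual code. The module name and the element paths (<Message>/<Header>/<Body>/<Sale>) are taken from the sample message earlier in the thread, the batch size of 100 is from the requirement, and the header is cached in Environment rather than LocalEnvironment:

```esql
CREATE COMPUTE MODULE SplitSales_Compute
    CREATE FUNCTION Main() RETURNS BOOLEAN
    BEGIN
        -- Save the header once (Environment survives each PROPAGATE)
        SET Environment.Variables.SavedHeader = InputRoot.XMLNSC.Message.Header;

        DECLARE saleRef REFERENCE TO InputRoot.XMLNSC.Message.Body.Sale[1];
        DECLARE i INTEGER;
        WHILE LASTMOVE(saleRef) DO
            -- Copy the header into the output message
            SET OutputRoot.XMLNSC.Message.Header = Environment.Variables.SavedHeader;
            -- Copy (up to) 100 <Sale> records, checking LASTMOVE after each move
            SET i = 1;
            WHILE i <= 100 AND LASTMOVE(saleRef) DO
                SET OutputRoot.XMLNSC.Message.Body.Sale[i] = saleRef;
                MOVE saleRef NEXTSIBLING REPEAT TYPE NAME;   -- next <Sale>, if any
                SET i = i + 1;
            END WHILE;
            -- Send this chunk downstream; a plain PROPAGATE clears OutputRoot afterwards
            PROPAGATE;
        END WHILE;
        RETURN FALSE;   -- everything has already been propagated
    END;
END MODULE;
```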
ydsk
PostPosted: Tue Dec 28, 2010 4:37 pm

Kimbert,

Thanks for the reply, and I will certainly try it today.

Won't OutputRoot.XMLNSC be deleted automatically after each PROPAGATE? That shouldn't be an issue anyway.

Also, what other options should I specify on my FIN? I think I should use 'Whole File' as the record for the approach you are suggesting.

What you are suggesting seems straightforward, but I have a question regarding the size of the InputRoot tree. Won't it keep growing as more and more repeating <Sale> records are read and propagated?

The size of the input tree was the main issue in the 'Large Messaging' sample. It could process less than about 100 MB in my case, as I mentioned above.

Using the technique shown in the 'Large Messaging' sample, a ROW variable is used as a MUTABLE copy. BUT when 'Whole File' is specified as the record on the FIN, and the entire InputRoot (unparsed, of course) is copied into the ROW variable at the beginning of the msgflow, the entire input message has to be held in memory in the ROW variable (is this understanding correct? correct me here if needed), and that's where the msgflow fails for files larger than about 100 MB.

With the approach you just suggested there is NO mutable copy involved, and the FIN reads directly from the input stream, so I guess it is optimized to read in chunks (and NOT the entire file in one shot). But I think there will still be an issue with the size of the InputRoot as the msgflow progresses, as I mentioned already. Obviously the msgflow's execution group can't hold a GB (a lot more than a GB in fact, due to parsing) in memory at a time. Can you please clarify how it works?

Thanks a lot for your time.
ydsk.


Last edited by ydsk on Tue Dec 28, 2010 6:53 pm; edited 1 time in total
ydsk
PostPosted: Tue Dec 28, 2010 6:26 pm

Kimbert,

I just tried the exact method you suggested, and it exhibited the same behavior again (as with the Large Messaging sample).

The msgflow could split a 70 MB xml file successfully into chunks of 100 repeating records each.
When I dropped a 105 MB xml file, it simply got moved to mqsibackout folder.

When I dropped the 70 MB file, the memory usage for the EG increased to about 390 MB, and the CPU usage was close to 40%. When the 105 MB file was dropped, both memory and CPU usage remained at zero.

There was NO user trace generated when I dropped the 105 MB XML file (the trace file had no info except the first couple of lines, where it says "Timestamps are formatted local time ..."), so I don't really know what exactly happened.

The msgflow looks like:

FIN-->Compute-->FON

FIN properties: 'Whole File', XMLNSC (no message set), 'On Demand' parsing.

Compute node: the same logic you mentioned in your last post, using reference variables, copying the header to Environment.Variables for use in OutputRoot before every propagate, and setting OutputRoot to NULL explicitly after each propagate.

FON properties: 'Whole File' as the record.

I ran out of ideas on how to solve my problem.

Please advise on how to proceed.

Thank you.
ydsk
mqjeff
PostPosted: Tue Dec 28, 2010 7:05 pm

You're not doing the necessary DELETE, I think.
ydsk
PostPosted: Tue Dec 28, 2010 8:00 pm

mqjeff,

I tried adding the line below after each PROPAGATE statement:

DELETE FIELD OutputRoot.XMLNSC;

And the msgflow behavior was exactly the same.
It could split a 70 MB xml file successfully, but failed to split a 105 MB file.

Unless Kimbert or the Hursley guys suggest a different approach that they think would work, I am about to give up on the idea of splitting huge XML files in WMB v7.

thanks
ydsk.
Copyright © MQSeries.net. All rights reserved.