MQSeries.net Forum Index » WebSphere Message Broker (ACE) Support » The output file is one XML record, 400 MB.....alrighty then

PeterPotkay
PostPosted: Tue Aug 26, 2014 5:53 pm    Post subject: The output file is one XML record, 400 MB.....alrighty then

Poobah

Joined: 15 May 2001
Posts: 7717

WMB 8.0.0.3 is the use case.

E-S-Q-L
There. I can spell it. And that concludes my demonstration of my ESQL expertise.

I've been asked to help the message flow developers figure out a problem a new message flow is having, and I'm looking for your help on what I should be looking at.

FileInput Node --> Compute Node --> FileOutput Node

Input is a text file with thousands of records described by a COBOL copybook. There is a Message Set used in the flow.
Output (apparently) needs to be all those input records jammed into one gigantic single XML document.

When the input is thousands of records, it takes minutes.
When the input is tens of thousands of records, it takes hours.
When the input is 100K records, the thing takes 8 hours.
When the input is 200K, well it blows up with out of memory errors, so who knows how long.

As a test, the developers added a timestamp to each element in the output, and we can see that in the beginning hundreds of elements are being added to the XML doc each second. The longer it runs, the slower it gets, to the point where it takes a second or two per element.

Soooo, where to look. I peeked at the ESQL, no nested loops, nothing obvious to my untrained eye.

Is there a way to "sip" in one record at a time on the input, and spit out one element at a time into the output file to avoid the memory hog and to increase performance? I'm guessing not - this gigantic XML doc is really just one "record", so the flow has to build this monstrosity in memory before it burps out 1 record into the file?

CPU pegs at 100% during all this, so the flow is CPU bound. I'm guessing the bigger and bigger this XML doc grows in memory, the harder and harder the ESQL has to work to append elements to the end?

File Input Node fun facts:
Input Message Parsing
Message domain = MRM
Parser Options
Parse timing = On Demand

File Output Node fun facts:
Basic
Mode for writing to file = Stage in mqsitransit
Records and Elements
Record Definition = Record is whole file


So is this an ESQL coding problem, a Broker config problem, or a requirements problem (WMB just can't deal with this scenario of a single 400 MB XML output record)?

There was some focus on the out-of-memory issue when they tried 200K of input. I said don't worry about that - if the thing takes 10 hours before it blows up with out-of-memory, it's irrelevant - this can't move forward performing this slowly. Odds are if we can fix the performance issue, the memory issue will also be resolved.

They were originally testing using FTP for input and output. I had them switch to local files for in and out, saying if it can't perform using local files, why bother with FTP as another variable? Get it working properly first using local files.
_________________
Peter Potkay
Keep Calm and MQ On
fjb_saper
PostPosted: Tue Aug 26, 2014 7:41 pm    Post subject: Re: The output file is one XML record, 400 MB.....alrighty t

Grand High Poobah

Joined: 18 Nov 2003
Posts: 20696
Location: LI,NY

PeterPotkay wrote:
Soooo, where to look. I peeked at the ESQL, no nested loops, nothing obvious to my untrained eye. [...] So is this an ESQL coding problem, a Broker config problem, or a requirements problem (WMB just can't deal with this scenario of a single 400 MB XML output record)?



Ok, so there are some possibilities, but first let's check the obvious ones:
  • Are indexes being used in the ESQL?
    Note: input AND output should be handled with reference variables (one for each). The use of indexes will have you traverse the tree n(n+1)/2 times.
    You really want to traverse the tree only once.
  • Output and memory...
    If you have repeating or non-repeating structures hanging off an XML root,
    like
    Code:
    <xmlroot>
      <structure_n>
        <elt_i>zyz</elt_i>
      </structure_n>
    </xmlroot>
    you could define a message per structure and define the XML root (start and end tags) as a DFDL...
    Result: output the start tag, output n records (one at a time), output the end tag.
    Send the message to the file's complete terminal.
    This would allow you to use the large message technique on the input file,
    and keep the memory footprint at a minimum.
    The limit here is 2 GB (IIB9). Beyond that you'll have to slice the input file by parsed record or some other criteria.
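To illustrate the reference-variable point, an append loop might look something like this (untested sketch - the record and field names here are invented placeholders, not from any real flow):

```esql
-- Sketch only: append each output element with a reference, not an index.
-- CREATE LASTCHILD ... AS repoints the reference at the element it just
-- appended, so each iteration is O(1); an indexed path like root.rec[intX]
-- walks the sibling chain from the start on every SET, giving the
-- n(n+1)/2 behaviour described above.
DECLARE inRef  REFERENCE TO InputRoot.MRM.SOME_RECORD;  -- placeholder name
DECLARE outRef REFERENCE TO OutputRoot.XMLNSC.xmlroot;
DECLARE newRec REFERENCE TO OutputRoot;                 -- will be repointed
WHILE LASTMOVE(inRef) DO
   CREATE LASTCHILD OF outRef AS newRec NAME 'structure_n';
   SET newRec.elt_i = inRef.SOME_FIELD;                 -- placeholder name
   MOVE inRef NEXTSIBLING REPEAT TYPE NAME;
END WHILE;
```

The whole tree is traversed exactly once, no matter how many records there are.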

Hope it helps
Have fun
_________________
MQ & Broker admin
PeterPotkay
PostPosted: Wed Aug 27, 2014 4:41 am    Post subject:

Poobah

Joined: 15 May 2001
Posts: 7717

Thanks F.J., I'll discuss this with the developers.

Quote:

This would allow you to use the large message technique on the input file,
and keep the memory foot print at a minimum
The limit here is 2GB.(IIB9) Beyond that you'll have to slice the input file by parsed record or some other criteria.


But isn't the memory problem on the output side? If the requirement is to have one XML doc, that entire doc needs to be held in memory before it's ejected as a single record once EOF is reached on the input. If the resultant XML doc is 400 MB, I think our hands are tied with the amount of memory that will be consumed - it will be at least 400 MB. I don't think there is a way to insert into the middle of an output file that contains only one ever-growing XML document as a single record.

Quote:
The use of indexes will have you traverse the tree n(n+1)/2 times.

That sure sounds like something that would cause the thing to get slower and slower the bigger the output XML doc gets. I will ask them to focus on this.

First priority is the overall speed of this thing...then we'll tackle the memory issue.
_________________
Peter Potkay
Keep Calm and MQ On
Vitor
PostPosted: Wed Aug 27, 2014 5:38 am    Post subject: Re: The output file is one XML record, 400 MB.....alrighty t

Grand High Poobah

Joined: 11 Nov 2005
Posts: 26093
Location: Texas, USA

PeterPotkay wrote:
Output (apparently) needs to be all those input records jammed into one gigantic single XML document.


Aside from the very valuable advice from my worthy associate (the behaviour you describe sounds a lot like code using indexes rather than reference variables to me), I'd find the designer and ask in what universe a 400 MB XML document with 200K stanzas in it is a good idea in general terms; how, in your specific case, he thinks the FTP of a 400 MB file will be reliable; how the receiving system will deal with the receipt of a 400 MB file over FTP (cue @mqjeff and his "halting problem" rant); and what on the receiving end is going to parse an XML document of that size without hitting memory issues.

(I was trying to think of a joke about this resulting in bad SAX that wouldn't result in me moderating myself)

If you get this working, you're still going to have a gigantic XML document lumbering round your system like a lizard with a legally protected name in Tokyo.
_________________
Honesty is the best policy.
Insanity is the best defence.
fjb_saper
PostPosted: Wed Aug 27, 2014 5:53 am    Post subject:

Grand High Poobah

Joined: 18 Nov 2003
Posts: 20696
Location: LI,NY

PeterPotkay wrote:
But isn't the memory problem on the output side? [...] I don't think there is a way to insert into the middle of an output file that contains only one ever-growing XML document as a single record.

That is why the output is no longer the full file.
If you look at the examples, the structure could be looked at as an XML document in its own right. Now if you don't send an XML declaration, and you send the start and end tags separately, all it takes is outputting each structure, appending to the file as you go. => minimal memory footprint, and it enables the use of the large message technique... (inbound and outbound)

Have fun
_________________
MQ & Broker admin
fjb_saper
PostPosted: Wed Aug 27, 2014 6:01 am    Post subject:

Grand High Poobah

Joined: 18 Nov 2003
Posts: 20696
Location: LI,NY

PeterPotkay wrote:

First priority is the overall speed of this thing...then we'll tackle the memory issue.

You will likely have to tackle both at the same time. Just tackling the index issue will only make the broker run out of memory that much quicker....

Anyway, make sure you use only a streaming parser on the input, and use the large message technique on the input regardless of whether you split up the output or not. (I would.)

Have fun
_________________
MQ & Broker admin
kimbert
PostPosted: Wed Aug 27, 2014 8:36 am    Post subject:

Jedi Council

Joined: 29 Jul 2003
Posts: 5542
Location: Southampton

Quote:
But isn't the memory problem on the output side? If the requirement is to have one XML doc, that entire doc needs to be held in memory before its ejected as a single record once EOF is reached on the input.
No - the FileOutput node can be configured to append to the file one record at a time.
I recommend that you
- Configure the input node to use On Demand parsing
- Set Record Detection to 'Parsed Record Sequence'
- In the 'Document Root' property, specify an element that describes exactly one Record ( not an unbounded sequence of Record ). If the element that describes one record is already a global element then you can just specify that. Otherwise, you will first need to edit the message model and make it global.
- Now write the flow to process each record and propagate the output tree to the FileOutput node each time. Should be fairly simple.

As far as I know, there are no built-in limits to the file sizes that can be handled using this technique.

As others have pointed out, the XML declaration and opening root tag will need to be written before the first record is processed and the closing root tag will need to be written after the final record has been processed. Otherwise the output file will not be a well formed XML document.
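To sketch what that per-record compute node could look like (illustration only - the module and field names are invented placeholders, and as noted above the flow still has to write the root open tag before the first record and the close tag after the last one, e.g. as separate BLOB writes, with the FileOutput node appending records rather than writing the whole file):

```esql
CREATE COMPUTE MODULE PER_RECORD_TO_XML
   CREATE FUNCTION Main() RETURNS BOOLEAN
   BEGIN
      -- With Record Detection = 'Parsed Record Sequence', each invocation
      -- sees exactly one input record, so only that record is in memory.
      DECLARE inRef REFERENCE TO InputRoot.MRM;
      SET OutputRoot.XMLNSC.TheRecord.some_field  = COALESCE(inRef.SOME_FIELD, '');
      SET OutputRoot.XMLNSC.TheRecord.other_field = COALESCE(inRef.OTHER_FIELD, '');
      -- ...remaining fields...
      RETURN TRUE;  -- propagate this single record to the FileOutput node
   END;
END MODULE;
```

Memory use then stays roughly constant at one record's worth, however large the file grows.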
_________________
Before you criticize someone, walk a mile in their shoes. That way you're a mile away, and you have their shoes too.
PeterPotkay
PostPosted: Thu Aug 28, 2014 11:07 am    Post subject:

Poobah

Joined: 15 May 2001
Posts: 7717

I got hold of some sample output XML and the ESQL code. I cleaned both up a bit to remove company-identifying stuff, but here it is.

The input is described above: simple one-line records that map to a COBOL copybook.

The output XML needs to look like this. There are multiple iterations of TheCompany (hundreds), and each company has multiple iterations of TheTransaction (hundreds). The input file luckily has them all in the right order, but you never know how many companies you'll have or how many transactions each company will have. There are a lot more fields, but I think this gets the point across.

Code:

<?xml version="1.0" encoding="UTF-8" ?>
<MyOutput_file xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<schema_version>1.0</schema_version>
  <TheCompany>
    <COMPANY_CODE>111111</COMPANY_CODE>
    <TheTransaction>
        <transaction_type>aaa</transaction_type>
        <company_indicator>N</company_indicator>
        <first_name>BRUCE</first_name>
        <middle_name />
        <last_name>SMITH</last_name>
        <name_suffix />
    </TheTransaction>
    <TheTransaction>
        <transaction_type>zzz</transaction_type>
        <company_indicator>N</company_indicator>
        <first_name>THIS</first_name>
        <middle_name />
        <last_name>GUY</last_name>
        <name_suffix />
    </TheTransaction>
  </TheCompany>
  <TheCompany>
    <COMPANY_CODE>999999</COMPANY_CODE>
    <TheTransaction>
        <transaction_type>zz12</transaction_type>
        <company_indicator>N</company_indicator>
        <first_name>JOHN</first_name>
        <middle_name />
        <last_name>DOE</last_name>
        <name_suffix />
    </TheTransaction>
    <TheTransaction>
        <transaction_type>111</transaction_type>
        <company_indicator>N</company_indicator>
        <first_name>UNO</first_name>
        <middle_name />
        <last_name>WHOH</last_name>
        <name_suffix />
    </TheTransaction>
  </TheCompany>
</MyOutput_file>



So here's the sanitized ESQL. There were a lot more lines in the middle that I deleted since they are irrelevant to the problem. But they are using referencing and an index within an index.

Code:

CREATE COMPUTE MODULE COPY_TO_XML
   CREATE FUNCTION Main() RETURNS BOOLEAN
   BEGIN
      DECLARE intX INT 1;
      DECLARE intCC INT 1;
      DECLARE intTR INT 1;
      DECLARE initCnt INT 1;
      DECLARE inRef REFERENCE TO InputRoot.MRM.CPYBOOK_MY_REPORT_REC;
      SET Environment.Variables.FileName = InputLocalEnvironment.File.Name;
      SET OutputRoot.XMLNSC.(XMLNSC.XmlDeclaration)*.(XMLNSC.Attribute)Version = '1.0';
      SET OutputRoot.XMLNSC.(XMLNSC.XmlDeclaration)*.(XMLNSC.Attribute)Encoding = 'UTF-8';
      SET OutputRoot.XMLNSC.submitted_file.(XMLNSC.NamespaceDecl)xmlns:xsi = 'http://www.w3.org/2001/XMLSchema-instance';
      SET OutputRoot.XMLNSC.submitted_file.schema_version = '1.0';
      DECLARE outRef REFERENCE TO OutputRoot.XMLNSC.submitted_file;
      WHILE LASTMOVE(inRef) DO
         IF Environment.Variables.Comp_Code IS NULL THEN
            SET Environment.Variables.Comp_Code = inRef.CPYBOOK_COMPANY_CODE;
         END IF;
         IF Environment.Variables.Comp_Code <> inRef.CPYBOOK_COMPANY_CODE THEN
            SET intCC = intCC + 1;
            SET intTR = 1;
            SET Environment.Variables.Comp_Code = COALESCE(inRef.CPYBOOK_COMPANY_CODE, '');
         END IF;
         SET outRef.TheCompany[intCC].COMPANY_CODE = COALESCE(inRef.CPYBOOK_COMPANY_CODE, '');
         SET outRef.TheCompany[intCC].TheTransaction[intTR].transaction_type = COALESCE(inRef.CPYBOOK_TRANS_TYPE, '');
         SET outRef.TheCompany[intCC].TheTransaction[intTR].policy.company_indicator = COALESCE(inRef.CPYBOOK_COMP_IND, '');
         SET outRef.TheCompany[intCC].TheTransaction[intTR].policy.first_name = COALESCE(inRef.CPYBOOK_NAME.CPYBOOK_FIRST_NAME, '');
         SET outRef.TheCompany[intCC].TheTransaction[intTR].policy.middle_name = COALESCE(inRef.CPYBOOK_NAME.CPYBOOK_MIDDLE_NAME, '');
         SET outRef.TheCompany[intCC].TheTransaction[intTR].policy.last_name = COALESCE(inRef.CPYBOOK_NAME.CPYBOOK_LAST_NAME, '');
         SET outRef.TheCompany[intCC].TheTransaction[intTR].policy.name_suffix = COALESCE(inRef.CPYBOOK_NAME.CPYBOOK_NAME_SUFFIX, '');
         -- ...and about 50 more just like this, deleted in this sanitized copy
         SET intX = intX + 1;
         SET intTR = intTR + 1;
         MOVE inRef NEXTSIBLING REPEAT TYPE NAME;
      END WHILE;
      SET OutputLocalEnvironment.Wildcard.WildcardMatch = InputLocalEnvironment.Wildcard.WildcardMatch;
      RETURN TRUE;
   END;
END MODULE;


When I reviewed this thread with the developers, they said they can't output one
Code:
<TheTransaction>
.
.
.
</TheTransaction>

at a time to the output file because as soon as they do, the closing tag for /TheCompany will automatically be written, and there will still be additional input records for that company that need to go in there.

Is that true? If yes, that means we are stuck with the huge memory footprint?

And regarding the CPU performance, with a pair of indexes one nested in the other, does that mean the performance is doomed to be poor (run time in the hours)?
_________________
Peter Potkay
Keep Calm and MQ On
mqjeff
PostPosted: Thu Aug 28, 2014 11:23 am    Post subject:

Grand Master

Joined: 25 Jun 2008
Posts: 17447

You can make a global element of TheTransaction. Then you should be able to output one at a time without outputting a closing TheCompany.

And there should be nothing wrong with sending one file for each TheCompany.

You can absolutely adjust the ESQL to use a reference in each case you are using an index.

Instead of incrementing the counter, just create lastchild and move refs as appropriate.
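Applied to the ESQL Peter posted, that might look something like this (untested sketch, using the same inRef/outRef and field names as the posted code; note it also writes COMPANY_CODE once per company instead of re-SETting it on every record):

```esql
-- Sketch: replace the TheCompany[intCC].TheTransaction[intTR] indexing
-- with references that always point at the current company/transaction.
DECLARE compRef REFERENCE TO OutputRoot;  -- repointed by CREATE ... AS
DECLARE tranRef REFERENCE TO OutputRoot;
WHILE LASTMOVE(inRef) DO
   IF Environment.Variables.Comp_Code IS NULL
      OR Environment.Variables.Comp_Code <> inRef.CPYBOOK_COMPANY_CODE THEN
      -- New company: append one TheCompany element and remember it.
      CREATE LASTCHILD OF outRef AS compRef NAME 'TheCompany';
      SET compRef.COMPANY_CODE = COALESCE(inRef.CPYBOOK_COMPANY_CODE, '');
      SET Environment.Variables.Comp_Code = COALESCE(inRef.CPYBOOK_COMPANY_CODE, '');
   END IF;
   -- Append the transaction under the current company: no tree walk.
   CREATE LASTCHILD OF compRef AS tranRef NAME 'TheTransaction';
   SET tranRef.transaction_type = COALESCE(inRef.CPYBOOK_TRANS_TYPE, '');
   -- ...remaining fields via tranRef...
   MOVE inRef NEXTSIBLING REPEAT TYPE NAME;
END WHILE;
```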
PeterPotkay
PostPosted: Thu Aug 28, 2014 4:05 pm    Post subject:

Poobah

Joined: 15 May 2001
Posts: 7717

mqjeff wrote:
And there should be nothing wrong with sending one file for each TheCompany.


Sane minds agree. But the requirement is all of the output in one file, all in one
<MyOutput_file>
</MyOutput_file>



The entity setting this requirement carries some weight and is inflexible on this point, so whatchya gonna do? They only want one file from us. Some monthly true-up batch process or something.

Thanks for the other pointers; I will review them with the developers.

I appreciate you guys taking the time for the hand-holding. My next year's training dollars may need to be spent on a WMB/IIB Developers intro class (somewhere lancelotlinc just felt the urge to smile for an unknown reason).
_________________
Peter Potkay
Keep Calm and MQ On


Last edited by PeterPotkay on Fri Aug 29, 2014 4:26 am; edited 1 time in total
mqsiuser
PostPosted: Fri Aug 29, 2014 1:19 am    Post subject:

Yatiri

Joined: 15 Apr 2008
Posts: 637
Location: Germany

Replacing your indexes with references on the input and output root should bring processing down to seconds and minutes.

Just to be clear: "big XML, so it takes hours" may sound reasonable, but it's not. Seconds and minutes are what you should achieve!

Old Broker devs shiver when they see indexes like the ones in this code.

Thanks, Peter, for all that MQ support, and ... wow, you are asking this now (the most important aspect, imho, of broker development).
_________________
Just use REFERENCEs
Vitor
PostPosted: Fri Aug 29, 2014 4:44 am    Post subject:

Grand High Poobah

Joined: 11 Nov 2005
Posts: 26093
Location: Texas, USA

PeterPotkay wrote:
(somewhere lancelotlinc just felt the urge to smile for an unknown reason).


I told him I'd bring a TV down to the dungeon so he could watch the game.


_________________
Honesty is the best policy.
Insanity is the best defence.
mqjeff
PostPosted: Fri Aug 29, 2014 5:01 am    Post subject:

Grand Master

Joined: 25 Jun 2008
Posts: 17447

PeterPotkay wrote:
mqjeff wrote:
And there should be nothing wrong with sending one file for each TheCompany.


Sane minds agree. But the requirement is all of the output in one file, all in one
<MyOutput_file>
</MyOutput_file>



I've generally learned to give up on reasonable ideas whenever people start talking about using files. It's obvious that all rational thought has left the building...

That said, you can still make this process "better" by treating each TheCompany as a whole record. Even if each TheCompany has a thousand TheTransactions, you're still only processing a thousand records at a time, rather than all 400 MB.

You might get somewhere if you can talk to the team (likely the external entity!) that actually has to receive this beast. They might be more amenable to smaller chunks, and someone on your side has just decided to batch things because they're comfortable with batches.

On the other hand, there could actually be regulatory reasons for this. But it's worth a few minutes checking anyway.
fjb_saper
PostPosted: Fri Aug 29, 2014 1:24 pm    Post subject:

Grand High Poobah

Joined: 18 Nov 2003
Posts: 20696
Location: LI,NY

@Peter,
I know you said your stuff is simplified.
But if you truly have only the company id as a field for the company, I would send all of the company tags outside of XML (as pure blob text) and would send TheTransaction as the base record...

This should allow you to do quite well memory wise...
Have fun
_________________
MQ & Broker admin
usamagdy
PostPosted: Mon Sep 01, 2014 3:30 am    Post subject:

Newbie

Joined: 30 Mar 2013
Posts: 8

First, I agree with the comments below:
mqsiuser wrote:
Replacing your indexes with references on the input and output root should bring processing down to seconds and minutes. [...]


Second, try to merge these two IF statements into one, like:

Code:
IF Environment.Variables.Comp_Code IS NULL THEN
   SET Environment.Variables.Comp_Code = inRef.CPYBOOK_COMPANY_CODE;
ELSEIF Environment.Variables.Comp_Code <> inRef.CPYBOOK_COMPANY_CODE THEN
   SET intCC = intCC + 1;
   SET intTR = 1;
   SET Environment.Variables.Comp_Code = COALESCE(inRef.CPYBOOK_COMPANY_CODE, '');
END IF;
Page 1 of 2