Message flow hung after large volume of message processing
vishnurajnr
PostPosted: Thu Mar 23, 2023 2:09 pm    Post subject: Message flow hung after large volume of message processing

Centurion

Joined: 08 Aug 2011
Posts: 134
Location: Trivandrum

Hi,


WMB V8.0.04 (please excuse posting a query regarding an unsupported version) on Linux

We have a message flow that is triggered every 2 seconds by a TimeoutNotification node. It browses a message from MQ using an MQGet node and then sends the message to a backend system via a TCPIP node, dynamically populating the backend details (IP and port), as there are around 2500 different backend systems. Immediately after sending the message via the TCPIP node, the flow posts the metadata to an MQOutput node; from another MQInput node (within the same flow), the TCPIP three-way-handshake processing is performed, and finally the message is removed from the queue, after successful execution, using another MQGet node. This allows us to process messages to the roughly 2500 backend systems almost in parallel, without waiting for each three-way handshake to complete. The message flow loops/browses a maximum of 6000 times (so that it does not spend too long looping if there is a huge backlog on the queue), and the looping is performed using an ESQL PROPAGATE from a Compute node placed before the MQGet (browse) node; a minimal sketch of this loop follows.
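
Sketch of the driving loop (assumed names: BrowseLoop_Compute and the 'out' terminal wiring are invented for illustration; only the PROPAGATE-based looping and the 6000 cap come from the flow described above):

Code:
CREATE COMPUTE MODULE BrowseLoop_Compute
	CREATE FUNCTION Main() RETURNS BOOLEAN
	BEGIN
		DECLARE i INTEGER 1;
		-- Drive up to 6000 browse attempts per TimeoutNotification tick;
		-- each PROPAGATE fires the downstream MQGet (browse) node once.
		WHILE i <= 6000 DO
			PROPAGATE TO TERMINAL 'out' DELETE NONE;
			-- (the real flow would also break out early once the browse
			-- finds the queue empty)
			SET i = i + 1;
		END WHILE;
		RETURN FALSE;  -- nothing more to emit after the loop
	END;
END MODULE;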

Now the issue details:

This flow processes around 700K~900K messages daily; at times, the flow appears to be stuck and not processing any messages from the queue.

No exceptions are thrown, and after restarting the flow, messages start processing again. So we are in a situation where the flow has to be restarted once or twice a day, or even more.

A flow redesign or a major refactoring of this logic is not pragmatic, for several reasons including the out-of-support product version.

A user trace taken on this flow while it is stuck does not log much detail. Even when the trace is enabled for 2 minutes, the trace output covers only a few milliseconds, so we are unable to figure out what went wrong here.

Could this issue be due to a memory constraint, such as stack memory, since the flow performs repetitive looping? Or a heap memory issue, since the flow processes a large volume of messages iteratively? How can we prove whether this is due to a memory resource issue, and what is the recommendation for increasing stack memory here?

Any suggestions or recommendations to help identify the issue are welcome.

Thanks.
mgk
PostPosted: Fri Mar 24, 2023 6:34 am

Padawan

Joined: 31 Jul 2003
Posts: 1638

OK, so this is not going to be easy to track down. Ideally you need to upgrade to a supported release and then you could raise a PMR if you think it is a problem with the product. There are a couple of pointers I can give, but this problem has too many variables to be sure what the real problem is.

1: You said the MQInput node was "within the same flow thread". This is incorrect. It may well be within the same Flow, but each input node in a Flow has its own thread, so it will be on a different thread.

2: You asked "Could this issue be due to a memory constraint, such as stack memory, since the flow performs repetitive looping?" This seems unlikely to me. Normally, if there is a problem with the stack, you crash with a StackOverflow error, and you do not say that this is happening. Also, your description suggests you are looping using ESQL PROPAGATE, which is good and should not cause you stack problems in most cases.

3: You asked "How can we prove whether this is due to a memory resource issue". What makes you think it is a memory issue? Do you see the memory keep increasing on the DataFlowEngine process? From what you say, it does not seem like a memory problem.

4: When you say "the flow appears to be stuck and not processing any messages from the queue", do you know whether it is only temporarily not processing messages, or does it stay stuck forever? From what you say, I'm wondering if you are consuming all the threads available to the flow (input nodes + additional instances). If each thread were waiting in the loop in your ESQL, propagating to the next nodes and waiting for messages to come back, this could make it seem like the flow is not processing any messages, because there are no threads available to process work. Can you try increasing the additional instances on the flow to see if that makes it perform better? Ideally, I would look to split the second part of the flow out into a separate flow to prevent starvation issues, if possible. Also, when it is "stuck", is the CPU high or low? If it's low, it suggests the flow is waiting for responses...

I hope this helps a little.
_________________
MGK
The postings I make on this site are my own and don't necessarily represent IBM's positions, strategies or opinions.
vishnurajnr
PostPosted: Mon Mar 27, 2023 5:58 am

Centurion

Joined: 08 Aug 2011
Posts: 134
Location: Trivandrum

Thanks a lot, mgk, for the input; it helps.

mgk wrote:

1: You said the MQInput node was "within the same flow thread". This is incorrect. It may well be within the same Flow, but each input node in a Flow has its own thread so it will be on a different thread.

Yes, absolutely. In fact, the MQInput node has additional instances to do parallel processing with multiple backend systems (three-way handshake). This node is incorporated in the same flow so that it can update the shared row/variables used by the upstream TimeoutNotification node thread.

mgk wrote:

2: You asked "Could this issue be due to a memory constraint, such as stack memory, since the flow performs repetitive looping?" This seems unlikely to me. Normally, if there is a problem with the stack, you crash with a StackOverflow error, and you do not say that this is happening. Also, your description suggests you are looping using ESQL PROPAGATE, which is good and should not cause you stack problems in most cases.


Yes, we are not getting a stack overflow error here. I was wondering if such errors might somehow get suppressed, but that seems not to be the case. I could not see any significant increase in memory either; memory consumption looks more or less the same throughout the processing. An increase in CPU usage is observed, but that was expected, as we were continuously processing messages from MQ during a volume spike.

mgk wrote:

4: When you say "the flow appears to be stuck and not processing any messages from the queue", do you know whether it is only temporarily not processing messages, or does it stay stuck forever? From what you say, I'm wondering if you are consuming all the threads available to the flow (input nodes + additional instances). If each thread were waiting in the loop in your ESQL, propagating to the next nodes and waiting for messages to come back, this could make it seem like the flow is not processing any messages, because there are no threads available to process work. Can you try increasing the additional instances on the flow to see if that makes it perform better? Ideally, I would look to split the second part of the flow out into a separate flow to prevent starvation issues, if possible. Also, when it is "stuck", is the CPU high or low? If it's low, it suggests the flow is waiting for responses...

In the flow-stuck situation I explained, we could not see any backlog or uncommitted message count on the MQInput node (three-way-handshake processing), but messages were not being picked up by the MQGet/TimeoutNotification node thread. We use both threads in the same flow to update the shared row/variables controlling this process, as the same backend won't allow us to process 2 messages in parallel; we need to send a message only if the previous one was successful. We can, however, send to different backend systems in parallel.

We could not replicate the issue in any lower environment, as we don't have a like-for-like test environment with the same number of backend systems. We could not spot this issue when only a limited number of backend systems were involved.

We captured user traces from LIVE, and an early indication is that some of the SHARED row entries were not updated to allow the processing of the message. Ideally, the shared ROW variable gets updated after every successful message processing, to allow subsequent messages to be picked up and processed. We could not spot the exact cause, as this works for quite some time under load and then gets into this state. The shared row is explicitly updated to remove the variable after successful or error processing, using the ESQL DELETE FIELD statement; roughly, the bookkeeping looks like the sketch below.
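
(Placeholder names throughout: backendState, InFlight and BackendId are invented for illustration.)

Code:
-- Schema-level shared row, visible to the nodes in this flow.
DECLARE backendState SHARED ROW;

CREATE COMPUTE MODULE ClearBackendFlag_Compute
	CREATE FUNCTION Main() RETURNS BOOLEAN
	BEGIN
		-- key identifying the backend this message was sent to
		DECLARE backendKey CHARACTER InputRoot.MQRFH2.usr.BackendId;

		-- after successful (or error) processing, remove the in-flight
		-- marker so the next message for this backend can be sent
		backendLock : BEGIN ATOMIC
			DELETE FIELD backendState.InFlight.{backendKey};
		END backendLock;

		RETURN TRUE;
	END;
END MODULE;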


I will post further updates based on our analysis of additional user traces on the issue.

Thanks again mgk for your time and inputs.
mgk
PostPosted: Mon Mar 27, 2023 6:11 am

Padawan

Joined: 31 Jul 2003
Posts: 1638

You said "the shared row is explicitly updated to remove the variable after successful or error processing, using the ESQL DELETE FIELD statement". I can see how this could cause the kind of problem you describe, and you should also make sure that all access to shared row variables is inside ATOMIC blocks, to make sure you are thread safe.

"I will post further based on our analysis of further user traces on the issue."
Good luck!

MGK
_________________
MGK
The postings I make on this site are my own and don't necessarily represent IBM's positions, strategies or opinions.
vishnurajnr
PostPosted: Mon Mar 27, 2023 6:38 am

Centurion

Joined: 08 Aug 2011
Posts: 134
Location: Trivandrum

Sorry, I didn't mention it explicitly: all updates to the shared ROW are done within ATOMIC blocks here.
mgk
PostPosted: Mon Mar 27, 2023 6:54 am

Padawan

Joined: 31 Jul 2003
Posts: 1638

"Sorry, i didn't mention it explicitly that all updates to the shared ROW were done within the ATOMIC block here" OK Good. And if the shared row is at the SCHEMA level (so accessible by multiple nodes) are you sure any others nodes in the flow that are accessing the same shared variable are also using named ATOMIC blocks with the same name to access the shared variable?

MGK
_________________
MGK
The postings I make on this site are my own and don't necessarily represent IBM's positions, strategies or opinions.
vishnurajnr
PostPosted: Mon Mar 27, 2023 8:03 am

Centurion

Joined: 08 Aug 2011
Posts: 134
Location: Trivandrum

It's a very good catch.

The shared ROW is declared at the schema level.

It turns out that the ATOMIC block labels have different names in the multiple nodes updating the same shared ROW/variables, so it looks like that is causing the issue here. I will post an update after we modify and release this (most likely in the next 1~2 weeks).

Thanks again mgk for your input on this.
mgk
PostPosted: Mon Mar 27, 2023 8:15 am

Padawan

Joined: 31 Jul 2003
Posts: 1638

Quote:
"It is found that the ATOMIC block labels are with different names in multiple nodes updating the same shared ROW/variables."
Excellent - good to know. For reference, you can think of ATOMIC blocks with the same name as being the same lock, whereas ATOMIC blocks with different names are different locks. So you must use the same named ATOMIC block to access the same shared variable(s), as the sketch below illustrates...
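
For example (placeholder names; the only point here is the labels):

Code:
DECLARE backendState SHARED ROW;  -- placeholder shared variable

CREATE PROCEDURE markBusy(IN backendKey CHARACTER)
BEGIN
	lockA : BEGIN ATOMIC
		SET backendState.InFlight.{backendKey} = TRUE;
	END lockA;
END;

CREATE PROCEDURE clearBusy(IN backendKey CHARACTER)
BEGIN
	-- BUG: lockB is a different lock from lockA, so this block can run
	-- while markBusy is mid-update. Rename both labels to one common
	-- name (say backendLock) so the two blocks share a single lock.
	lockB : BEGIN ATOMIC
		DELETE FIELD backendState.InFlight.{backendKey};
	END lockB;
END;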

MGK
_________________
MGK
The postings I make on this site are my own and don't necessarily represent IBM's positions, strategies or opinions.
zpat
PostPosted: Mon Mar 27, 2023 8:31 am

Jedi Council

Joined: 19 May 2001
Posts: 5849
Location: UK

If using multiple locks - acquire them in the same order to prevent deadlocks.
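
For example (placeholder names, assuming the flow ever holds two named locks at once):

Code:
DECLARE backendState SHARED ROW;  -- placeholder shared variables
DECLARE stats        SHARED ROW;

CREATE PROCEDURE recordSend(IN backendKey CHARACTER)
BEGIN
	-- Every path that needs both locks takes stateLock first, then
	-- statsLock. If another path nested them the other way round, two
	-- threads could each hold one lock and wait forever for the other.
	stateLock : BEGIN ATOMIC
		statsLock : BEGIN ATOMIC
			SET backendState.InFlight.{backendKey} = TRUE;
			SET stats.SentCount = COALESCE(stats.SentCount, 0) + 1;
		END statsLock;
	END stateLock;
END;
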
_________________
Well, I don't think there is any question about it. It can only be attributable to human error. This sort of thing has cropped up before, and it has always been due to human error.
mgk
PostPosted: Tue Mar 28, 2023 1:46 am

Padawan

Joined: 31 Jul 2003
Posts: 1638

Quote:
If using multiple locks - acquire them in the same order to prevent deadlocks.


Yes, this is a good principle to follow.

It's also worth pointing out that there is a technote on ATOMIC blocks at https://www.ibm.com/support/pages/apar/IT38153 which explains that you have the ability to use a READ/WRITE lock around the contents of the block. For a given named ATOMIC block, by default you have a WRITE lock, and only one thread can enter the block at once. However, in many scenarios, reading from a shared variable is done in a different place from writing to it, so you can take advantage of a READ/WRITE lock instead to improve performance.

So if you are only reading data, you should use a named READ ONLY block, which allows many threads to enter the block at the same time, as long as the write lock is not held. But if you are creating, updating, or deleting data, you should use the READ WRITE syntax for the same named block, which limits access to a single thread at a time. If you don't use the READ ONLY syntax, READ WRITE is assumed. A sketch follows.
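
For example, assuming the syntax reads as that page describes (placeholder names; check the APAR for the exact syntax and the releases that carry it):

Code:
DECLARE backendState SHARED ROW;  -- placeholder shared variable

-- Reader: under READ ONLY, many threads may enter at once, provided no
-- thread currently holds the write lock for this label.
CREATE PROCEDURE isBusy(IN backendKey CHARACTER) RETURNS BOOLEAN
BEGIN
	DECLARE busy BOOLEAN FALSE;
	backendLock : BEGIN ATOMIC READ ONLY
		SET busy = FIELDTYPE(backendState.InFlight.{backendKey}) IS NOT NULL;
	END backendLock;
	RETURN busy;
END;

-- Writer: READ WRITE (also the default when no qualifier is given) is
-- exclusive, so only one thread can be inside at a time.
CREATE PROCEDURE releaseBackend(IN backendKey CHARACTER)
BEGIN
	backendLock : BEGIN ATOMIC READ WRITE
		DELETE FIELD backendState.InFlight.{backendKey};
	END backendLock;
END;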

I hope this makes sense.

MGK
_________________
MGK
The postings I make on this site are my own and don't necessarily represent IBM's positions, strategies or opinions.
vishnurajnr
PostPosted: Thu Apr 06, 2023 3:37 am

Centurion

Joined: 08 Aug 2011
Posts: 134
Location: Trivandrum

Hi All,

To update you that the issue is confirmed to be resolved with the fix, i.e., to use the same ATOMIC label across multiple nodes for the same shared variable updates.

Thanks again, mgk and zpat, for your input and assistance.



Regards,
Vishnu
mgk
PostPosted: Thu Apr 06, 2023 6:13 am

Padawan

Joined: 31 Jul 2003
Posts: 1638

Hi Vishnu,

Quote:
To update you that the issue is confirmed to be resolved with the fix


That's really good to hear - thanks for letting us know this resolved the problem.

Quote:
to use the same ATOMIC label across multiple nodes for the same shared variable updates.


Just to make sure: you should also be using the same named ATOMIC block for reading as well as updating, which is why the READ ONLY and READ WRITE syntax I mentioned above can help with performance...

Kind regards,
_________________
MGK
The postings I make on this site are my own and don't necessarily represent IBM's positions, strategies or opinions.