MQSeries.net :: View topic - Unexpected pauses in WMB processing

MQSeries.net

Tech Exchange

Education

Certifications

Library

Info Center

SupportPacs

FAQÂ Â

Usergroups

RSS Feed - WebSphere MQ Support

RSS Feed - Message Broker Support

MQSeries.net Forum Index » WebSphere Message Broker (ACE) Support » Unexpected pauses in WMB processing

Goto page 1, 2 Next

Unexpected pauses in WMB processing

« View previous topic :: View next topic »

Author

Message

Vitor

Posted: Fri Dec 27, 2013 6:26 pm Post subject: Unexpected pauses in WMB processing

Grand High Poobah

Joined: 11 Nov 2005
Posts: 26093
Location: Texas, USA

WMBv7.0.0.3, WMQv7.0.1.9, Sun Solaris OS & a partridge in a pear tree...

The situation is this:

A simple flow, MQInput node at one end, MQOutput node at the other, a number of nodes between them. Same flow on 2 brokers who's queue managers are in a cluster so that the queue the MQInput node reads has the traffic 50/50 between them. Each flow is in it's own EG on the broker & has 39 additional instances defined. Went live early December and chugs away happily, with 15-20 readers on each clustered instance of the queue during business hours and depth zero on the input queues. Joy everywhere.

Since Xmas we've observed spikes where the input queue depth will suddenly build to 40 or 50 messages, stay there for a minute or so then drop back to zero. Stays at zero for an hour or so, then repeats the pattern. It's happening on both brokers but not at the same time, nothing in the broker or queue manager logs, all the messages show as committed and we're still only have 15-20 readers (so we're not maxed out on instances).

In the absence of a sensible idea and because this is a business critical flow we issued an mqsireload for the EG in question on both brokers. Alarmingly this solved the problem.

We have a PMR open, but does anybody have any ideas on where to look for a root cause before I assume the pixies were using UPS to deliver magic dust & we simply ran out?
_________________
Honesty is the best policy.
Insanity is the best defence.

fjb_saper

Posted: Sat Dec 28, 2013 12:32 am Post subject:

Grand High Poobah

Joined: 18 Nov 2003
Posts: 20771
Location: LI,NY

if mqsireload seemed to fix the problem I would look at 2 things

memory consumption of the egs
monitoring
- Just to make sure what you see is not due to a spike from the upstream system.
  I saw spikes from an upstream system that was waiting on a db read to commit. Once the read returned the commit was being done and the messages appeared suddenly at the broker. Any hick-up at DB level generated spikes at the broker level.... (see spike in put rates on the input queue)

Have fun

_________________
MQ & Broker admin

mqjeff

Posted: Sat Dec 28, 2013 5:33 am Post subject:

Grand Master

Joined: 25 Jun 2008
Posts: 17447

And the flow doesn't use shared variables/atomic blocks?

And it's only one flow that stalls, not all flows in all egs?

And you don't have another eg/flow using up lots of cpu/memory at the same time, or many flows using up enough cpu/memory at the same time?

And the MQ Logs aren't being cycled during the freeze?

And the disk admins aren't playing silly buggers behind the scenes because "nobody will notice"?

Vitor

Posted: Mon Dec 30, 2013 7:43 am Post subject:

Grand High Poobah

Joined: 11 Nov 2005
Posts: 26093
Location: Texas, USA

fjb_saper wrote:

if mqsireload seemed to fix the problem I would look at 2 things

memory consumption of the egs
monitoring
- Just to make sure what you see is not due to a spike from the upstream system.
  I saw spikes from an upstream system that was waiting on a db read to commit. Once the read returned the commit was being done and the messages appeared suddenly at the broker. Any hick-up at DB level generated spikes at the broker level.... (see spike in put rates on the input queue)

Have fun

Thanks!
_________________
Honesty is the best policy.
Insanity is the best defence.

Vitor

Posted: Mon Dec 30, 2013 7:48 am Post subject:

Grand High Poobah

Joined: 11 Nov 2005
Posts: 26093
Location: Texas, USA

mqjeff wrote:

And the flow doesn't use shared variables/atomic blocks?

It does use shared variables.

mqjeff wrote:

And it's only one flow that stalls, not all flows in all egs?

It's just that one flow, all other EGs behave normally but this flow has significantly higher throughput than others.

mqjeff wrote:

And you don't have another eg/flow using up lots of cpu/memory at the same time, or many flows using up enough cpu/memory at the same time?

We do not. See above.

mqjeff wrote:

And the MQ Logs aren't being cycled during the freeze?

Apparently not, but we're checking.

mqjeff wrote:

And the disk admins aren't playing silly buggers behind the scenes because "nobody will notice"?

We've not ruled it out.

_________________
Honesty is the best policy.
Insanity is the best defence.

mqjeff

Posted: Mon Dec 30, 2013 8:04 am Post subject:

Grand Master

Joined: 25 Jun 2008
Posts: 17447

Vitor wrote:

mqjeff wrote:

And the flow doesn't use shared variables/atomic blocks?

It does use shared variables

And you are using ATOMIC blocks to update the shared variable, to ensure consistency?

And you've ruled out that the flow is getting stalled because everyone is waiting for everyone else to release the ATOMIC block?

PeterPotkay

Posted: Mon Dec 30, 2013 8:48 am Post subject:

Poobah

Joined: 15 May 2001
Posts: 7723

Double checking: When the q depth is at '40 or 50 for about a minute', you can browse these messages? In other words, the messages are definitely commited by the CLUSRCVR channel?

When the back up occurs, it happens on both brokers in lock step? If yes, focus on the infrastructure resource that both brokers share. There is no way memory or code freaks out on two seperate systems at the exact same time. Maybe a shared database that they both update that slows down for some reason some time?
_________________
Peter Potkay
Keep Calm and MQ On

PeterPotkay

Posted: Mon Dec 30, 2013 8:51 am Post subject: Re: Unexpected pauses in WMB processing

Poobah

Joined: 15 May 2001
Posts: 7723

Vitor wrote:

In the absence of a sensible idea and because this is a business critical flow we issued an mqsireload for the EG in question on both brokers. Alarmingly this solved the problem.

You did the mqsireload on both brokers when the flow was 'hung' on both brokers, and that got things moving?

Or things had returned to normal, and to prevent a reoccurance you reloaded the EG on both brokers and haven't see the problem reoccur since (yet)?

Does the flow read/write from a N A S location?

(Edit: Why does this site not like the acronym for Network Attached Storage?)
_________________
Peter Potkay
Keep Calm and MQ On

Vitor

Posted: Mon Dec 30, 2013 9:56 am Post subject:

Grand High Poobah

Joined: 11 Nov 2005
Posts: 26093
Location: Texas, USA

PeterPotkay wrote:

Double checking: When the q depth is at '40 or 50 for about a minute', you can browse these messages? In other words, the messages are definitely commited by the CLUSRCVR channel?

This was verified by an MQ admin displaying the local queue and checking the UNCOM parameter.

PeterPotkay wrote:

When the back up occurs, it happens on both brokers in lock step? If yes, focus on the infrastructure resource that both brokers share. There is no way memory or code freaks out on two seperate systems at the exact same time. Maybe a shared database that they both update that slows down for some reason some time?

It happens on 1 broker and then the other. As these are Solaris Zones it's possible they're hosted on the same hardware; I'm trying to get a straight answer from the sys admins.
_________________
Honesty is the best policy.
Insanity is the best defence.

Vitor

Posted: Mon Dec 30, 2013 9:59 am Post subject: Re: Unexpected pauses in WMB processing

Grand High Poobah

Joined: 11 Nov 2005
Posts: 26093
Location: Texas, USA

PeterPotkay wrote:

Vitor wrote:

In the absence of a sensible idea and because this is a business critical flow we issued an mqsireload for the EG in question on both brokers. Alarmingly this solved the problem.

When the flow was running on both brokers. No re-occurrence since.

PeterPotkay wrote:

Does the flow read/write from a N A S location?

(Edit: Why does this site not like the acronym for Network Attached Storage?)

[/quote]

Entirely queue based, but the queues are on that kind of shared disc.
_________________
Honesty is the best policy.
Insanity is the best defence.

zpat

Posted: Tue Dec 31, 2013 1:19 am Post subject:

Jedi Council

Joined: 19 May 2001
Posts: 5867
Location: UK

Does the flow contain anything that can cause the process or thread to enter a wait state?

Especially anything that might affect all the instances at once?

Candidates could be access to a database, or web service or invoking some JAR file - anything with a synchronous nature.

You might get better performance from running the flow several times in different EGs, rather than all the instances in one EG.

However this might affect any use of shared variables.
_________________
Well, I don't think there is any question about it. It can only be attributable to human error. This sort of thing has cropped up before, and it has always been due to human error.

PeterPotkay

Posted: Tue Dec 31, 2013 4:12 am Post subject:

Poobah

Joined: 15 May 2001
Posts: 7723

When the problem occurs, is it really all instances that pause? You say 40-50 messages back up for about a minute, but that you have a couple dozen instances of the possible 39 additional instances running already. Seems to me like more than 40-50 messages per minute are needed to drive that many additional instances. Does that mean when the 40-50 were waiting in the queue, others were passing thru the queue?

40-50 for about a minute..... 50 messages, 60 seconds....I keep thinking about that fact that those are the default values for the MQ channel's BATCHSZ and SHORTRTY.

If I remember correctly, your shop has WMQ and WMB admin duties segregated. Do you have any ability to check with your own eyes the MQ error logs for any MQ channel errors on both the sending and receiving ends?

You kinda answered my question about whether you could browse the messages on the queue when the problem was occuring. Kinda, but not really

_________________
Peter Potkay
Keep Calm and MQ On

Vitor

Posted: Thu Jan 02, 2014 5:29 am Post subject:

Grand High Poobah

Joined: 11 Nov 2005
Posts: 26093
Location: Texas, USA

zpat wrote:

Candidates could be access to a database, or web service or invoking some JAR file - anything with a synchronous nature.

There are a number of candidates which we are investigating.
_________________
Honesty is the best policy.
Insanity is the best defence.

Vitor

Posted: Thu Jan 02, 2014 5:40 am Post subject:

Grand High Poobah

Joined: 11 Nov 2005
Posts: 26093
Location: Texas, USA

PeterPotkay wrote:

We lack the granularity needed to make that determination. A combination of widely separated responsibilities and inexperienced support staff.

PeterPotkay wrote:

40-50 for about a minute..... 50 messages, 60 seconds....I keep thinking about that fact that those are the default values for the MQ channel's BATCHSZ and SHORTRTY.

Good point.

PeterPotkay wrote:

If I remember correctly, your shop has WMQ and WMB admin duties segregated. Do you have any ability to check with your own eyes the MQ error logs for any MQ channel errors on both the sending and receiving ends?

You remember correctly, I have no such ability and this incident is another nail in the coffin of this segregation. At least in terms of support where it's now clear we need a more coordinated view.

PeterPotkay wrote:

You kinda answered my question about whether you could browse the messages on the queue when the problem was occuring. Kinda, but not really

Yes, and that's the best answer I can offer. I do not believe but cannot verify that any messages were actually browsed. I believe that only the queue properties were examined.
_________________
Honesty is the best policy.
Insanity is the best defence.

Vitor

Posted: Fri Jan 03, 2014 5:59 am Post subject:

Grand High Poobah

Joined: 11 Nov 2005
Posts: 26093
Location: Texas, USA

Update:

Having got someone to read the logs I don't have access to, there's now a little more information.

The flow in question uses a syncronous WMQ request -> reply to an external party. There's a number of error messages where the flow is attempting to deal with the MQGet timeing out, but failing to serialise the response. Interestingly, this is on the web service side of the flow, which wasn't reporting a problem at the time in question. When queried, they checked their logs and were surprised to notice huge numbers of timeouts for their web service calls to the flow. Apparently they don't have anything monitoring their monitoring system...

Anyway, the root cause seems to be this external system not responding and causing broker errors. These errors seemed to cause a huge increase in memory usage, which in turn caused the EG to pause while it waited for resource from the OS.

Does anyone have any insight / tips / views on why a SOAPReply abend (unable to serialize response) would cause a huge increase in the EG's JVM memory usage? We have a PMR open but I would welcome views from the community.
_________________
Honesty is the best policy.
Insanity is the best defence.

Display posts from previous:

Goto page 1, 2 Next

Page 1 of 2

MQSeries.net Forum Index » WebSphere Message Broker (ACE) Support » Unexpected pauses in WMB processing

Jump to:

You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum

Protected by Anti-Spam ACP