Vitor
Posted: Fri Dec 27, 2013 6:26 pm    Post subject: Unexpected pauses in WMB processing
Grand High Poobah; Joined: 11 Nov 2005; Posts: 26093; Location: Texas, USA
WMB v7.0.0.3, WMQ v7.0.1.9, Sun Solaris OS & a partridge in a pear tree...

The situation is this:

A simple flow: an MQInput node at one end, an MQOutput node at the other, and a number of nodes between them. The same flow is deployed on 2 brokers whose queue managers are in a cluster, so the queue the MQInput node reads splits the traffic 50/50 between them. Each flow is in its own EG on the broker and has 39 additional instances defined. It went live in early December and chugs away happily, with 15-20 readers on each clustered instance of the queue during business hours and depth zero on the input queues. Joy everywhere.

Since Xmas we've observed spikes where the input queue depth will suddenly build to 40 or 50 messages, stay there for a minute or so, then drop back to zero. It stays at zero for an hour or so, then the pattern repeats. It's happening on both brokers, but not at the same time; there's nothing in the broker or queue manager logs, all the messages show as committed, and we still have only 15-20 readers (so we're not maxed out on instances).

In the absence of a sensible idea, and because this is a business-critical flow, we issued an mqsireload for the EG in question on both brokers. Alarmingly, this solved the problem.

We have a PMR open, but does anybody have any ideas on where to look for a root cause before I assume the pixies were using UPS to deliver magic dust & we simply ran out?
_________________
Honesty is the best policy.
Insanity is the best defence.
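The depth pattern described above (build to 40-50, hold for about a minute, drop back to zero) is easy to flag automatically once depth samples exist. A minimal sketch, assuming the samples have already been collected somehow (for example by periodically polling CURDEPTH with runmqsc or pymqi; the polling itself is out of scope here, and the threshold is just the figure from this thread):

```python
def find_spike_episodes(samples, threshold=40):
    """Group (timestamp, depth) samples into episodes where depth >= threshold.

    Returns a list of (start, end, peak) tuples. The threshold of 40 matches
    the depths reported in this thread; tune it for your own traffic.
    """
    episodes = []
    start = peak = last_ts = None
    for ts, depth in samples:
        if depth >= threshold:
            if start is None:
                start, peak = ts, depth
            peak = max(peak, depth)
            last_ts = ts
        elif start is not None:
            episodes.append((start, last_ts, peak))
            start = peak = None
    if start is not None:
        episodes.append((start, last_ts, peak))
    return episodes

# One sample every 10 seconds: depth climbs to ~50 for about a minute, then drains.
samples = [(0, 0), (10, 0), (20, 45), (30, 50), (40, 48),
           (50, 44), (60, 41), (70, 0), (80, 0)]
print(find_spike_episodes(samples))  # [(20, 60, 50)]
```

Logging the episode start/end times makes it much easier to correlate the stalls with channel activity, DB commits, or anything else with a timestamp.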
fjb_saper
Posted: Sat Dec 28, 2013 12:32 am    Post subject:
Grand High Poobah; Joined: 18 Nov 2003; Posts: 20756; Location: LI, NY
If mqsireload seemed to fix the problem, I would look at 2 things:
- memory consumption of the EGs
- monitoring - just to make sure what you see is not due to a spike from the upstream system.

I once saw spikes from an upstream system that was waiting on a DB read before committing. Once the read returned, the commit went through and the messages appeared at the broker all at once. Any hiccup at DB level generated spikes at the broker level... (seen as a spike in put rates on the input queue)

Have fun
_________________
MQ & Broker admin
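One way to see the kind of upstream spike described here is to turn periodically sampled cumulative put counts into per-interval put rates. A sketch, assuming the counts have already been gathered (for example from queue statistics); the sampling mechanism and the spike factor are illustrative choices, not anything from this thread:

```python
def put_rates(samples):
    """Convert (timestamp, cumulative_put_count) samples into
    (interval_end, puts_per_second) pairs."""
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        rates.append((t1, (c1 - c0) / (t1 - t0)))
    return rates

def flag_spikes(rates, factor=2):
    """Flag intervals whose put rate exceeds `factor` times the mean rate."""
    mean = sum(r for _, r in rates) / len(rates)
    return [(t, r) for t, r in rates if r > factor * mean]

# Steady ~1 msg/s, then an upstream commit dumps ~300 extra messages
# into one 60-second interval.
samples = [(0, 0), (60, 60), (120, 120), (180, 480), (240, 540)]
rates = put_rates(samples)
print(rates)               # [(60, 1.0), (120, 1.0), (180, 6.0), (240, 1.0)]
print(flag_spikes(rates))  # [(180, 6.0)]
```

A put-rate spike that lines up with the depth spike points at the upstream system; a flat put rate during the depth spike points at the consumers (the flow instances) instead.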
mqjeff
Posted: Sat Dec 28, 2013 5:33 am    Post subject:
Grand Master; Joined: 25 Jun 2008; Posts: 17447
And the flow doesn't use shared variables/atomic blocks?
And it's only one flow that stalls, not all flows in all egs?
And you don't have another eg/flow using up lots of cpu/memory at the same time, or many flows using up enough cpu/memory at the same time?
And the MQ Logs aren't being cycled during the freeze?
And the disk admins aren't playing silly buggers behind the scenes because "nobody will notice"?
Vitor
Posted: Mon Dec 30, 2013 7:43 am    Post subject:
Grand High Poobah; Joined: 11 Nov 2005; Posts: 26093; Location: Texas, USA
fjb_saper wrote:
If mqsireload seemed to fix the problem, I would look at 2 things:
- memory consumption of the EGs
- monitoring - just to make sure what you see is not due to a spike from the upstream system.
I once saw spikes from an upstream system that was waiting on a DB read before committing. Once the read returned, the commit went through and the messages appeared at the broker all at once. Any hiccup at DB level generated spikes at the broker level... (seen as a spike in put rates on the input queue)
Have fun

Thanks!
_________________
Honesty is the best policy.
Insanity is the best defence.
Vitor
Posted: Mon Dec 30, 2013 7:48 am    Post subject:
Grand High Poobah; Joined: 11 Nov 2005; Posts: 26093; Location: Texas, USA
mqjeff wrote:
And the flow doesn't use shared variables/atomic blocks?

It does use shared variables.

mqjeff wrote:
And it's only one flow that stalls, not all flows in all egs?

It's just that one flow; all other EGs behave normally, but this flow has significantly higher throughput than the others.

mqjeff wrote:
And you don't have another eg/flow using up lots of cpu/memory at the same time, or many flows using up enough cpu/memory at the same time?

We do not. See above.

mqjeff wrote:
And the MQ Logs aren't being cycled during the freeze?

Apparently not, but we're checking.

mqjeff wrote:
And the disk admins aren't playing silly buggers behind the scenes because "nobody will notice"?

We've not ruled it out.
_________________
Honesty is the best policy.
Insanity is the best defence.
mqjeff
Posted: Mon Dec 30, 2013 8:04 am    Post subject:
Grand Master; Joined: 25 Jun 2008; Posts: 17447
Vitor wrote:
mqjeff wrote:
And the flow doesn't use shared variables/atomic blocks?

It does use shared variables

And you are using ATOMIC blocks to update the shared variable, to ensure consistency?
And you've ruled out that the flow is getting stalled because everyone is waiting for everyone else to release the ATOMIC block?
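The failure mode mqjeff is describing, with every instance queued up behind one holder of the ATOMIC block, is easy to reproduce with any mutual-exclusion primitive. A sketch in Python (a lock standing in for ESQL's ATOMIC block, ten threads standing in for additional instances; the 0.05 s "work" inside the lock is an arbitrary stand-in for whatever the flow actually does there):

```python
import threading
import time

shared_count = 0
lock = threading.Lock()  # stands in for an ESQL ATOMIC block guarding a SHARED variable

def instance(hold_seconds):
    """One 'additional instance': takes the lock, updates shared state, releases."""
    global shared_count
    with lock:                      # every instance serializes here
        current = shared_count
        time.sleep(hold_seconds)    # a slow lock holder stalls all other instances
        shared_count = current + 1

threads = [threading.Thread(target=instance, args=(0.05,)) for _ in range(10)]
start = time.monotonic()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.monotonic() - start

print(shared_count)     # 10: the lock keeps the updates consistent
print(elapsed >= 0.45)  # True: the ten holds run one after another, never in parallel
```

The point is the elapsed time: 40 instances that spend even a fraction of a second each inside an ATOMIC block will look, from the outside, like a flow that has simply stopped reading the queue.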
PeterPotkay
Posted: Mon Dec 30, 2013 8:48 am    Post subject:
Poobah; Joined: 15 May 2001; Posts: 7722
Double-checking: when the queue depth is at "40 or 50 for about a minute", can you browse these messages? In other words, are the messages definitely committed by the CLUSRCVR channel?

When the back-up occurs, does it happen on both brokers in lock step? If yes, focus on the infrastructure resource that both brokers share. There is no way memory or code freaks out on two separate systems at the exact same time. Maybe a shared database that they both update slows down for some reason at some time?
_________________
Peter Potkay
Keep Calm and MQ On
PeterPotkay
Posted: Mon Dec 30, 2013 8:51 am    Post subject: Re: Unexpected pauses in WMB processing
Poobah; Joined: 15 May 2001; Posts: 7722
Vitor wrote:
In the absence of a sensible idea, and because this is a business-critical flow, we issued an mqsireload for the EG in question on both brokers. Alarmingly, this solved the problem.

Did you do the mqsireload on both brokers when the flow was "hung" on both brokers, and that got things moving?
Or had things returned to normal, and to prevent a recurrence you reloaded the EG on both brokers and haven't seen the problem recur since (yet)?

Does the flow read/write from a N A S location?
(Edit: Why does this site not like the acronym for Network Attached Storage?)
_________________
Peter Potkay
Keep Calm and MQ On
Vitor
Posted: Mon Dec 30, 2013 9:56 am    Post subject:
Grand High Poobah; Joined: 11 Nov 2005; Posts: 26093; Location: Texas, USA
PeterPotkay wrote:
Double-checking: when the queue depth is at "40 or 50 for about a minute", can you browse these messages? In other words, are the messages definitely committed by the CLUSRCVR channel?

This was verified by an MQ admin displaying the local queue and checking the UNCOM parameter.

PeterPotkay wrote:
When the back-up occurs, does it happen on both brokers in lock step? If yes, focus on the infrastructure resource that both brokers share. There is no way memory or code freaks out on two separate systems at the exact same time. Maybe a shared database that they both update slows down for some reason at some time?

It happens on one broker and then the other. As these are Solaris Zones, it's possible they're hosted on the same hardware; I'm trying to get a straight answer from the sys admins.
_________________
Honesty is the best policy.
Insanity is the best defence.
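The UNCOM check can be scripted rather than eyeballed. A sketch that collects the NAME(value) attribute pairs runmqsc prints for DISPLAY QSTATUS; the sample text below is illustrative of the output shape, not captured from the system in this thread, and real output varies by MQ version and platform:

```python
import re

# Illustrative shape of runmqsc output for DISPLAY QSTATUS(...) CURDEPTH UNCOM.
sample = """AMQ8450: Display queue status details.
   QUEUE(APP.INPUT.QUEUE)                  TYPE(QUEUE)
   CURDEPTH(47)                            UNCOM(NO)
"""

def parse_qstatus(text):
    """Collect NAME(value) attribute pairs from runmqsc output into a dict."""
    return dict(re.findall(r"([A-Z]+)\(([^)]*)\)", text))

attrs = parse_qstatus(sample)
# UNCOM(NO) means no uncommitted messages: the depth is real, committed backlog.
print(attrs["CURDEPTH"], attrs["UNCOM"])  # 47 NO
```

Run on a timer during a spike window, this captures both CURDEPTH and UNCOM with a timestamp, which is exactly the evidence needed to answer the "committed or in-flight?" question after the fact.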
Vitor
Posted: Mon Dec 30, 2013 9:59 am    Post subject: Re: Unexpected pauses in WMB processing
Grand High Poobah; Joined: 11 Nov 2005; Posts: 26093; Location: Texas, USA
PeterPotkay wrote:
Vitor wrote:
In the absence of a sensible idea, and because this is a business-critical flow, we issued an mqsireload for the EG in question on both brokers. Alarmingly, this solved the problem.

Did you do the mqsireload on both brokers when the flow was "hung" on both brokers, and that got things moving?
Or had things returned to normal, and to prevent a recurrence you reloaded the EG on both brokers and haven't seen the problem recur since (yet)?

When the flow was running on both brokers. No recurrence since.

PeterPotkay wrote:
Does the flow read/write from a N A S location?
(Edit: Why does this site not like the acronym for Network Attached Storage?)

Entirely queue based, but the queues are on that kind of shared disc.
_________________
Honesty is the best policy.
Insanity is the best defence.
zpat
Posted: Tue Dec 31, 2013 1:19 am    Post subject:
Jedi Council; Joined: 19 May 2001; Posts: 5866; Location: UK
Does the flow contain anything that can cause the process or thread to enter a wait state?
Especially anything that might affect all the instances at once?

Candidates could be access to a database, a web service, or invoking some JAR file - anything synchronous in nature.

You might get better performance from running the flow several times in different EGs, rather than all the instances in one EG.
However, this might affect any use of shared variables.
_________________
Well, I don't think there is any question about it. It can only be attributable to human error. This sort of thing has cropped up before, and it has always been due to human error.
PeterPotkay
Posted: Tue Dec 31, 2013 4:12 am    Post subject:
Poobah; Joined: 15 May 2001; Posts: 7722
When the problem occurs, is it really all instances that pause? You say 40-50 messages back up for about a minute, but that you have a couple dozen of the possible 39 additional instances running already. It seems to me like more than 40-50 messages per minute are needed to drive that many additional instances. Does that mean that while the 40-50 were waiting in the queue, others were passing through the queue?

40-50 for about a minute... 50 messages, 60 seconds... I keep thinking about the fact that those are the default values for the MQ channel's BATCHSZ and SHORTRTY.

If I remember correctly, your shop has WMQ and WMB admin duties segregated. Do you have any ability to check with your own eyes the MQ error logs for any MQ channel errors on both the sending and receiving ends?

You kinda answered my question about whether you could browse the messages on the queue when the problem was occurring. Kinda, but not really
_________________
Peter Potkay
Keep Calm and MQ On
Vitor
Posted: Thu Jan 02, 2014 5:29 am    Post subject:
Grand High Poobah; Joined: 11 Nov 2005; Posts: 26093; Location: Texas, USA
zpat wrote:
Candidates could be access to a database, a web service, or invoking some JAR file - anything synchronous in nature.

There are a number of candidates which we are investigating.
_________________
Honesty is the best policy.
Insanity is the best defence.
Vitor
Posted: Thu Jan 02, 2014 5:40 am    Post subject:
Grand High Poobah; Joined: 11 Nov 2005; Posts: 26093; Location: Texas, USA
PeterPotkay wrote:
When the problem occurs, is it really all instances that pause? You say 40-50 messages back up for about a minute, but that you have a couple dozen of the possible 39 additional instances running already. It seems to me like more than 40-50 messages per minute are needed to drive that many additional instances. Does that mean that while the 40-50 were waiting in the queue, others were passing through the queue?

We lack the granularity needed to make that determination - a combination of widely separated responsibilities and inexperienced support staff.

PeterPotkay wrote:
40-50 for about a minute... 50 messages, 60 seconds... I keep thinking about the fact that those are the default values for the MQ channel's BATCHSZ and SHORTRTY.

Good point.

PeterPotkay wrote:
If I remember correctly, your shop has WMQ and WMB admin duties segregated. Do you have any ability to check with your own eyes the MQ error logs for any MQ channel errors on both the sending and receiving ends?

You remember correctly; I have no such ability, and this incident is another nail in the coffin of this segregation - at least in terms of support, where it's now clear we need a more coordinated view.

PeterPotkay wrote:
You kinda answered my question about whether you could browse the messages on the queue when the problem was occurring. Kinda, but not really

Yes, and that's the best answer I can offer. I do not believe, but cannot verify, that any messages were actually browsed. I believe that only the queue properties were examined.
_________________
Honesty is the best policy.
Insanity is the best defence.
Vitor
Posted: Fri Jan 03, 2014 5:59 am    Post subject:
Grand High Poobah; Joined: 11 Nov 2005; Posts: 26093; Location: Texas, USA
Update:

Having got someone to read the logs I don't have access to, there's now a little more information.

The flow in question uses a synchronous WMQ request/reply to an external party. There are a number of error messages where the flow is attempting to deal with the MQGet timing out, but failing to serialise the response. Interestingly, this is on the web service side of the flow, which wasn't reporting a problem at the time in question. When queried, they checked their logs and were surprised to notice huge numbers of timeouts for their web service calls to the flow. Apparently they don't have anything monitoring their monitoring system...

Anyway, the root cause seems to be this external system not responding and causing broker errors. These errors seemed to cause a huge increase in memory usage, which in turn caused the EG to pause while it waited for resource from the OS.

Does anyone have any insight / tips / views on why a SOAPReply abend (unable to serialize response) would cause a huge increase in the EG's JVM memory usage? We have a PMR open, but I would welcome views from the community.
_________________
Honesty is the best policy.
Insanity is the best defence.
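For what it's worth, one plausible mechanism for timeout-driven memory growth (purely an illustration of the pattern, not a diagnosis of the actual WMB defect, which is what the PMR should establish): if per-request reply state is registered when the synchronous call goes out but only cleaned up when a reply arrives, a non-responding backend means nothing is ever cleaned up, and memory grows with every timed-out request. A self-contained sketch of that pattern and its fix, a purge pass on expiry; all names here are hypothetical:

```python
class PendingReplies:
    """Tracks in-flight request state keyed by correlation ID.

    Illustrates one plausible leak mechanism: if timed-out requests are
    never purged, the table grows without bound during a backend outage.
    """

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.pending = {}  # correl_id -> deadline

    def register(self, correl_id, now):
        self.pending[correl_id] = now + self.timeout

    def complete(self, correl_id):
        self.pending.pop(correl_id, None)

    def purge_expired(self, now):
        """Housekeeping: drop entries whose deadline has passed."""
        expired = [cid for cid, deadline in self.pending.items() if deadline <= now]
        for cid in expired:
            del self.pending[cid]
        return expired

# The backend stops replying at t=0; 1000 requests arrive and none complete.
table = PendingReplies(timeout_seconds=30)
for i in range(1000):
    table.register(f"req-{i}", now=0)

leaked = len(table.pending)           # 1000 entries if nothing ever purges them
purged = table.purge_expired(now=31)  # a purge pass reclaims the state
print(leaked, len(purged), len(table.pending))  # 1000 1000 0
```

If something like this is happening inside the EG's JVM, heap growth would track the timeout rate, which matches the observed correlation between the web service timeouts and the memory spike.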