Peculiar PubSub Issue |
vsathyan |
Posted: Tue Apr 05, 2016 1:49 am Post subject: Peculiar PubSub Issue |
Centurion
Joined: 10 Mar 2014 Posts: 121
Hi Team,
We are facing a peculiar problem in our pub/sub layer. It's been more than three weeks since this production problem started, and we don't know what the root cause is. We have already engaged IBM with a Sev 1 PMR, and they are still asking for traces, command outputs, and other data. There doesn't seem to be any light on this issue.
Here is our setup and the message flow
Code:
Source Queue Manager (QMS1) --\                          /--> Destination Queue Manager 1 (QMD1)
Source Queue Manager (QMS2) ---\                        /---> Destination Queue Manager 2 (QMD2)
Source Queue Manager (QMS3) ----> PubSub Queue Manager -----> Destination Queue Manager 3 (QMD3)
            ...             ---/        (PSQM)          \---> ...
Source Queue Manager (QMSn) --/                          \--> Destination Queue Manager n (QMDn)
All the queue managers are cluster members.
MQ v7.5.0.5 on Linux, NFSv3 for data/log storage; all queue managers run in single-instance mode.
Total partial repositories: fewer than 60
Number of full repositories: 2
Here is the problem:
On February 18, we saw messages piling up on the SYSTEM.INTER.QMGR.PUBS (SIQP) queue on PSQM. This is an internal queue used by the pub/sub process to temporarily store publication messages received from the source queue managers in the cluster and route them to the topics defined on PSQM.
We saw that one specific channel was stuck in SUBSTATE(SEND) on PSQM. We stopped and restarted the channel, and messages started moving off the SIQP queue. We assumed it was a one-off incident and moved on.
The same problem - messages piling up on the SIQP queue - occurred again on March 5th. But this time, none of the cluster sender channels were processing messages out of the pub/sub queue manager PSQM, so we bounced the queue manager on March 5th, and it started processing messages from the SIQP queue again.
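For reference, these are the kinds of runmqsc checks that expose the symptom - depth on SIQP plus the channel substate. The channel name below is a placeholder for whichever cluster sender is stuck, and XQTIME needs channel monitoring (MONCHL) enabled:
Code:
DIS QL(SYSTEM.INTER.QMGR.PUBS) CURDEPTH
DIS CHSTATUS(TO.QMD1) STATUS SUBSTATE MSGS BYTSSENT XQTIME
STOP CHANNEL(TO.QMD1)
START CHANNEL(TO.QMD1)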
The issue got more attention at this point, because one of the receiving applications monitors the messages in real time, and if it doesn't get the messages from pub/sub within 10 minutes it raises a red alert.
We opened a PMR and engaged IBM; traces were collected, and discussions have been ongoing since March 5th.
Initially, IBM found that one process - amqzmuf0 - was not running when the traces were captured, but they could not determine why that process had been aborted. They asked us to run the traces while messages pile up on the SIQP queue.
By this time, we had observed that the issue occurs between about 9:00 PM and 9:30 PM CST every day.
We stopped collecting traces, because another issue surfaced: the trace process itself was aborting and generating an FDC. IBM gave us an iFix for it, and we applied it.
When we enabled the traces the next day, queue manager performance was so degraded that it never responded to MQSC commands. We forcefully stopped the traces again.
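If a full trace cripples the queue manager again, a restricted trace is sometimes light enough to leave running. A sketch of the commands we mean (option choices are our assumption; worth confirming the exact set for 7.5.0.5 with IBM support):
Code:
strmqtrc -m PSQM -t parms -l 2048
endmqtrc -m PSQM
dspmqtrc /var/mqm/trace/AMQ*.TRC
The first command traces only parameter-level data and wraps the trace file at 2048 MB; dspmqtrc formats the binary trace files for the PMR.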
We changed our flow a bit by removing the clustered topics and replacing them with normal topics plus cluster aliases resolving to those topics.
Code:
Original flow:
Application -> Put -> Alias Queue (SQM) -> CLUSTER XMIT Q (SQM) -> SIQP queue (PSQM) -> CLUSTER XMIT Q (PSQM) -> DESTINATION CLUSTER LOCAL QUEUE (DQM)

New flow:
Application -> Put -> Alias Queue (SQM) -> CLUSTER XMIT Q (SQM) -> CLUSTER ALIAS queue (PSQM) -> CLUSTER XMIT Q (PSQM) -> DESTINATION CLUSTER LOCAL QUEUE (DQM)
In the new flow, we bypassed the SIQP queue by creating a cluster alias that resolves to a local topic -> subscriptions -> destination.
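A minimal sketch of the definitions behind the new flow, with hypothetical object and cluster names - an alias queue with TARGTYPE(TOPIC) turns a put into a publication on the local topic, and an administrative subscription forwards the publications to the clustered destination queue:
Code:
* On PSQM - local (non-clustered) topic, fronted by a clustered alias queue
DEFINE TOPIC(PRICES.TOPIC) TOPICSTR('prices/feed')
DEFINE QALIAS(PRICES.PUB) TARGET(PRICES.TOPIC) TARGTYPE(TOPIC) CLUSTER(MYCLUS)
* Still on PSQM - route publications to the destination queue on QMD1
DEFINE SUB(PRICES.TO.QMD1) TOPICSTR('prices/feed') DEST(DEST.LOCAL.Q) DESTQMGR(QMD1)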
However, with the new flow, messages started to pile up on the source-side cluster transmit queue (SQM) instead.
One more peculiar behaviour we observed: say the oldest message age on the sending-side cluster transmit queue is 300 seconds. When we bounce PSQM, the message age on the SQM cluster transmit queue drops to 10 seconds or some other low value, then starts building up again.
The issue also recovers by itself after a few minutes: the messages piled up on the sending-side transmit queues drain as if there had been no issue, and CURDEPTH returns to 0.
The message sizes are small (average 25-30 KB, max seen 51 KB), and the volume is not high either - around 10,000 messages per 15 minutes.
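For anyone tracking the same symptom, the oldest-message-age figure comes from queue status. A sketch of the check - MSGAGE and QTIME need queue monitoring enabled, e.g. at the queue manager level:
Code:
ALTER QMGR MONQ(MEDIUM)
DIS QSTATUS(SYSTEM.CLUSTER.TRANSMIT.QUEUE) TYPE(QUEUE) CURDEPTH MSGAGE QTIME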
We are stuck and not sure how to fix this problem. Any suggestions or advice? _________________ Custom WebSphere MQ Tools Development C# & Java
WebSphere MQ Solution Architect Since 2011
WebSphere MQ Admin Since 2004 |
mqjeff |
Posted: Tue Apr 05, 2016 4:01 am Post subject: |
Grand Master
Joined: 25 Jun 2008 Posts: 17447
The specific time range of the incidents indicates a couple of things, maybe more:
- A significant increase in message traffic from the applications. The queue manager isn't tuned to handle that amount of load, so you get a backlog and issues.
- An external action - in the network layer, the infrastructure layer, etc. Maybe the system, or the mounted drives, are being backed up at this time. Maybe something else is overloading the network layer between the sending and receiving apps.
External factors are a good place to look.
You should also review the transactional behavior of the sending applications. Perhaps they are creating much bigger transactions than expected.
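A couple of runmqsc checks can expose oversized or long-running units of work on the sending side (this assumes the v7.5 shared SYSTEM.CLUSTER.TRANSMIT.QUEUE):
Code:
* Uncommitted messages currently on the transmit queue
DIS QSTATUS(SYSTEM.CLUSTER.TRANSMIT.QUEUE) TYPE(QUEUE) CURDEPTH UNCOM
* Connections with an active unit of work, and which application owns it
DIS CONN(*) TYPE(CONN) UOWSTATE UOWSTDA UOWSTTI APPLTAG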
If you got poor performance from an iFix, that should be taken back to the L3 team, and a new iFix provided that doesn't cause the same impact. But it could simply be the trace options being used that are slowing things down - again, perhaps because of a much larger message load than expected.
In general, messages that 'disappear' are non-persistent messages, and if the applications rely on them being delivered, they should be persistent, or the applications should be able to resend them. _________________ chmod -R ugo-wx / |
vsathyan |
Posted: Tue Apr 05, 2016 5:00 am Post subject: |
Centurion
Joined: 10 Mar 2014 Posts: 121
Hi Jeff,
First of all, thanks for your patience and response.
As mentioned in my earlier post, the load is not much - 10,000 messages of under 50 KB each is nothing for a queue manager running with 4 CPUs and 16 GB of RAM.
Also, we did not make any changes in our pub/sub layer when this issue started on Feb 18; pub/sub had been running fine for 20 months.
And when I said messages go away, I meant they move out of the transmit queue and reach the destination - the receiving application receives them. They are persistent messages, which get stuck on the XMITQ and then reach the pub/sub queue manager without any problem after some time.
Thanks again.
vsathyan _________________ Custom WebSphere MQ Tools Development C# & Java
WebSphere MQ Solution Architect Since 2011
WebSphere MQ Admin Since 2004 |
mqjeff |
Posted: Tue Apr 05, 2016 5:04 am Post subject: |
Grand Master
Joined: 25 Jun 2008 Posts: 17447
Again, look for disk or network events that occur at the same time.
If the disk hosting your log files suddenly slows down, then the performance of persistent messages is going to suffer. _________________ chmod -R ugo-wx / |
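Given that NFSv3 hosts the data and logs here, a crude probe of synchronous write latency on the log filesystem around the 9:00 PM window can confirm or rule this out. A minimal sketch - LOGFS is an assumption, point it at the actual /var/mqm/log mount:

```shell
#!/bin/sh
# Crude probe of synchronous write latency, similar in spirit to what the
# MQ logger experiences: small 4 KB writes, each forced to stable storage.
# LOGFS is an assumption - set it to the filesystem holding the MQ logs.
LOGFS="${LOGFS:-/tmp}"
PROBE="$LOGFS/mq_latency_probe.$$"

START=$(date +%s)
# 256 x 4 KB writes with O_DSYNC; slow NFS commits will show up here.
dd if=/dev/zero of="$PROBE" bs=4k count=256 oflag=dsync 2>/dev/null
END=$(date +%s)
rm -f "$PROBE"

ELAPSED=$((END - START))
echo "256 dsync writes took ${ELAPSED}s on $LOGFS"
```

Run it from cron at, say, 20:55 and again in a quiet window, and compare the elapsed times; a jump during the incident window points at the storage layer rather than MQ.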