Spontaneous QM failovers in MSCS
PeterPotkay
Posted: Mon Oct 08, 2007 8:02 am    Post subject: Spontaneous QM failovers in MSCS

Poobah

Joined: 15 May 2001
Posts: 7722

Windows 2003 SP1
MQ 6.0.2.1 + IC51904 + IC53266
Microsoft hardware cluster.

These 2-node clusters (a dozen of them) have been set up and running OK for ~3 years. Since July I have upgraded them all to MQ 6.0.2.1. In that time I have had 7 occurrences where the QM fails over to the other node or just plain comes offline and doesn't fail over. In some cases it was weeks after the MQ upgrade. Some clusters have never had this problem. One of them had it happen twice in 3 days, 2 weeks after I upgraded, but not again in the 5 weeks since.

Looking in the QM error logs the first "error" is the messages saying the Repository Manager is ending, the same message you get when you ask the QM to end.

We opened a case with MSFT and they said the MQ resource is reporting that the QM is not healthy (or not reporting at all), so the cluster initiates a failover. They referred us to this "fix":
http://www-1.ibm.com/support/docview.wss?rs=171&context=SSFKSJ&context=SSWHKB&q1=ic52378&uid=swg1IC52378&loc=en_US&cs=utf-8&lang=en
Quote:

"3) The timeout value for the looksAlive check has been made customizable by setting the AMQ_MSCS_TIMEOUT environment variable to the amount in milliseconds that the timeout should be. The cluster will need to be stopped and restarted for this to take effect. Note changing this value will impact how responsive the WMQ resource is to a 'real' failure, and it should only be set as an interim measure whilst the real problem is resolved. However, these changes only help reduce the symptoms, and the real problem is the source of why the check was taking so long. For example, in this instance an external process hogging the CPU and this needs to be remedied in tandem with applying this APAR."


But in all cases to the best of our knowledge the servers were barely breathing as far as CPU is concerned.

Has anyone dealt with this yet?
If I apply IC52378, what would I set "AMQ_MSCS_TIMEOUT" to? And if there is no indication that CPU was a problem, do I even go down this route?
Is there some sort of tracing I can put on specific to the MQ dll that talks with MSCS that might indicate why these 2 components are not conversing properly? I would need to leave the trace on for potentially weeks so I'm looking for a trace specific to the dll that will have a minimal effect on performance and whose log files can be set to wrap every day or so.
_________________
Peter Potkay
Keep Calm and MQ On
2d0
Posted: Thu Oct 11, 2007 11:39 pm

Apprentice

Joined: 08 Mar 2007
Posts: 25

Peter,

We experience the same problem here.
Our configuration is a Microsoft Windows Server 2003 cluster with Service Pack 2 installed, running in a VMware environment.
Our QM is failing over almost daily within a specific time window. We can see in the event log that a cluster service '<QM name>' fails. The cluster tries to restart the service; sometimes this succeeds and sometimes it fails.
If it fails, it switches over to the other node.

We can see it in the log happening between 20:35 - 20:45.
It starts with a failing clusterservice and then it stops the Repository Manager.
In our Acceptance environment, which has the same configuration, it doesn't happen. Our Wintel department can't find any trace in the Windows system logs. We can only find traces in the event log and the MQ error logs.
MQ error logs are stating that the Repository Manager ends.
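
One way to confirm a recurring window like this is to bucket the failure timestamps by time of day. The following is only a rough sketch: the timestamp format and the sample values are hypothetical, not taken from a real event log, and `failures_by_window` is an illustrative helper name.

```python
from collections import Counter
from datetime import datetime


def failures_by_window(timestamps, window_minutes=15):
    """Count failures per time-of-day window, ignoring the date.

    Assumes timestamps look like 'YYYY-MM-DD HH:MM:SS' (hypothetical format)
    and that window_minutes divides 60 evenly.
    """
    counts = Counter()
    for text in timestamps:
        t = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
        # Round the minute down to the start of its window.
        start_minute = (t.minute // window_minutes) * window_minutes
        counts[f"{t.hour:02d}:{start_minute:02d}"] += 1
    return counts


# Hypothetical sample: three evening failures and one unrelated one.
sample = [
    "2007-10-08 20:37:12",
    "2007-10-09 20:41:03",
    "2007-10-10 20:36:55",
    "2007-10-11 03:02:10",
]
print(failures_by_window(sample).most_common(1))  # [('20:30', 3)]
```

A window that dominates the counts points at something scheduled (backup, antivirus scan, snapshot) rather than at MQ itself.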

If there is a solution we would like to hear that asap.
Thanks.
PeterPotkay
Posted: Fri Oct 12, 2007 4:03 am

Poobah

Joined: 15 May 2001
Posts: 7722

What version of MQ are you on?
_________________
Peter Potkay
Keep Calm and MQ On
2d0
Posted: Mon Oct 15, 2007 12:17 am

Apprentice

Joined: 08 Mar 2007
Posts: 25

Same version as you, 6.0.2.1.
However, the failover hasn't happened again since last Friday.
This weekend it ran fine without any failovers.

I will keep you updated if it happens again.
ma.eyal
Posted: Mon Oct 15, 2007 1:14 am

Novice

Joined: 13 Sep 2005
Posts: 15
Location: Israel

Hello.

We had the same problem some time ago.

We resolved it by increasing the priority of the WMQ processes.

To us it seemed that the WMQ processes were being "held" by the Windows background services of the cluster, making them slow to answer the alive check.

I am not a Windows expert, so I can't tell you exactly what was happening between WMQ and MSCS on that system. But when we increased the priority, it eliminated the problem for us.

I hope this helps.
PeterPotkay
Posted: Mon Oct 15, 2007 9:26 am

Poobah

Joined: 15 May 2001
Posts: 7722

I found out in my PMR why things are worse after the MQ upgrade. In MQ 5.3 CSD 11 and earlier, the looksAlive/isAlive check that the cluster uses to determine the status of the QM would wait indefinitely for a response. At MQ 5.3 CSD12 and later, it would wait only 10 seconds.
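
To make the difference concrete, here is a toy model of a single health check under the two behaviours. It is only a sketch and not the actual logic of the MQ resource DLL; the function name `cluster_verdict` and the 45-second stall are made up for illustration.

```python
from typing import Optional


def cluster_verdict(response_delay_s: float, timeout_s: Optional[float]) -> str:
    """Toy model of one health check as seen by the cluster.

    timeout_s=None models the old behaviour (wait indefinitely for a reply).
    Any reply that arrives within the timeout counts as healthy.
    """
    if timeout_s is None or response_delay_s <= timeout_s:
        return "healthy"
    return "failover"


# A check that stalls for 45 seconds (hypothetical number).
for label, timeout in [("5.3 CSD11 and earlier", None),
                       ("5.3 CSD12 and later", 10.0)]:
    print(label, "->", cluster_verdict(45.0, timeout))
```

With the same stall, the older level simply waits it out while the newer one reports a failure, which would explain why the failovers only started after the upgrade.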

APAR IC52378
http://www-1.ibm.com/support/docview.wss?uid=swg1IC52378
allows us to set a system variable telling that health check how long to wait before timing out, plus cuts an FDC with a new specific Probe ID if the failover occurs because of a timeout. They still say the underlying problem is why the health check hangs and suspect high CPU, but that's not the case for us. We are going to look for heavy I/O as a possible culprit based on another company's experience.

Anyway, we are going to roll this APAR out and set the timeout to 300000 (5 minutes). If it waited forever in 5.3 CSD11, I can't see how 5 minutes is a bad thing. We tested in the lab by setting the timeout variable and then killing amqzmuc0.exe. The QM failed over immediately; it did not wait 5 minutes. They say the variable only comes into play if the health check (which runs every 5 seconds) is not responding.
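
Since AMQ_MSCS_TIMEOUT is given in milliseconds, a small helper avoids getting the number of zeros wrong. This is a sketch under assumptions: the helper names are made up, and using `setx` with `/M` is just one possible way to set a machine-wide variable (the thread does not say how it was set, and `setx` has to be available on the node). The function only builds the command; it does not run it.

```python
from typing import List


def minutes_to_ms(minutes: float) -> int:
    """Convert minutes to whole milliseconds."""
    return int(round(minutes * 60 * 1000))


def setx_args(timeout_ms: int) -> List[str]:
    """Build (but do not run) a setx command for a machine-wide variable.

    Assumption: setx with /M is used to set the system environment variable.
    """
    if timeout_ms <= 0:
        raise ValueError("timeout must be a positive number of milliseconds")
    return ["setx", "AMQ_MSCS_TIMEOUT", str(timeout_ms), "/M"]


print(" ".join(setx_args(minutes_to_ms(5))))
# setx AMQ_MSCS_TIMEOUT 300000 /M
```

Per the APAR text quoted earlier, the cluster has to be stopped and restarted before the new value takes effect.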


ma.eyal,
I can see how upping the priority of the MQ processes could also help prevent this.
_________________
Peter Potkay
Keep Calm and MQ On
2d0
Posted: Mon Oct 15, 2007 11:33 pm

Apprentice

Joined: 08 Mar 2007
Posts: 25

Peter, keep me updated on the progress.
The link you provided says the fix is in fix pack 6.0.2.3.
However, this fix pack has a planned release of January 2008.

How can you implement the APAR?
PeterPotkay
Posted: Tue Oct 16, 2007 6:19 am

Poobah

Joined: 15 May 2001
Posts: 7722

Contact IBM MQ Support and reference this APAR; they will provide a hotfix for whatever MQ version you are on.

For MQ 6.0.2.1 (what we have) the hotfix is one new dll and setting that environment variable.


No problems in our lab. Rolling it out to DEV this weekend, QA next, and Production the weekend after that.
_________________
Peter Potkay
Keep Calm and MQ On