Author |
Message
|
PeterPotkay |
Posted: Mon Oct 08, 2007 8:02 am Post subject: Spontaneous QM failovers in MSCS |
|
|
 Poobah
Joined: 15 May 2001 Posts: 7722
|
Windows 2003 SP1
MQ 6.0.2.1 + IC51904 + IC53266
Microsoft hardware cluster.
These 2-node clusters (a dozen of them) have been set up and running OK for ~3 years. Since July I have upgraded them all to MQ 6.0.2.1. In that time I have had 7 occurrences where the QM fails over to the other node or just plain comes offline and doesn't fail over. In some cases it was weeks after the MQ upgrade. Some clusters have never had this problem. One of them had it happen twice in 3 days, 2 weeks after I upgraded, but not again in the 5 weeks since.
Looking in the QM error logs, the first "error" is the message saying the Repository Manager is ending, the same message you get when you ask the QM to end.
We opened a case with MSFT and they said the MQ resource is reporting the QM is not healthy (or not reporting at all), so the cluster initiates a failover. They referred us to this "fix":
http://www-1.ibm.com/support/docview.wss?rs=171&context=SSFKSJ&context=SSWHKB&q1=ic52378&uid=swg1IC52378&loc=en_US&cs=utf-8&lang=en
Quote: |
"3) The timeout value for the looksAlive check has been made customizable by setting the AMQ_MSCS_TIMEOUT environment variable to the amount in milliseconds that the timeout should be. The cluster will need to be stopped and restarted for this to take effect. Note changing this value will impact how responsive the WMQ resource is to a 'real' failure, and it should only be set as an interim measure whilst the real problem is resolved. However, these changes only help reduce the symptoms, and the real problem is the source of why the check was taking so long. For example, in this instance an external process hogging the CPU and this needs to be remedied in tandem with applying this APAR." |
But in all cases, to the best of our knowledge, the servers were barely breathing as far as CPU is concerned.
Has anyone dealt with this yet?
If I apply IC52378, what would I set "AMQ_MSCS_TIMEOUT" to? And if there is no indication that CPU was a problem, do I even go down this route?
Is there some sort of tracing I can turn on, specific to the MQ DLL that talks with MSCS, that might indicate why these 2 components are not conversing properly? I would need to leave the trace on for potentially weeks, so I'm looking for a trace specific to the DLL that will have a minimal effect on performance and whose log files can be set to wrap every day or so.
_________________
Peter Potkay
Keep Calm and MQ On |
|
|
 |
2d0 |
Posted: Thu Oct 11, 2007 11:39 pm Post subject: |
|
|
 Apprentice
Joined: 08 Mar 2007 Posts: 25
|
Peter,
We experience the same problem here.
Our configuration is a Microsoft Windows Server 2003 cluster with Service Pack 2 installed, running in a VMware environment.
Our QM is failing over almost daily, within the same time window. We can see in the event log that a cluster service '<QM name>' fails. It tries to restart the service and sometimes does this successfully and sometimes fails.
If it fails, it switches over to the other node.
We can see it in the log happening between 20:35 - 20:45.
It starts with a failing clusterservice and then it stops the Repository Manager.
In our Acceptance environment, which has the same configuration, it doesn't happen. Our Wintel department can't find any trace within the Windows system logs. We can only find traces within the event log and the MQ error logs.
The MQ error logs state that the Repository Manager ends.
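For anyone wanting to confirm that the failovers really cluster in one time window, here is a rough Python sketch that buckets matching error-log entries into 10-minute slots of the day. It assumes each log entry starts with a header line carrying a `MM/DD/YYYY HH:MM:SS` timestamp, followed by the message text; that layout, the marker text, and the function names `event_times` and `by_ten_minute_slot` are assumptions for illustration, so adjust them to what your logs actually contain.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed entry header, e.g. "10/11/2007 20:37:02 - Process(...) ...".
# Change the pattern and the strptime format if your locale differs.
STAMP = re.compile(r"^(\d{1,2}/\d{1,2}/\d{4}) +(\d{2}:\d{2}:\d{2})")


def event_times(lines, marker="Repository manager"):
    """Yield the timestamp of each log entry whose text contains marker."""
    current = None
    for line in lines:
        match = STAMP.match(line)
        if match:
            # Remember the timestamp of the entry we are now inside.
            stamp = " ".join(match.groups())
            current = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S")
        elif current is not None and marker.lower() in line.lower():
            yield current
            current = None  # count each entry only once


def by_ten_minute_slot(times):
    """Count events per 10-minute slot of the day, keyed like '20:30'."""
    return Counter(
        f"{t.hour:02d}:{t.minute // 10 * 10:02d}" for t in times
    )
```

Calling `by_ten_minute_slot(event_times(open(path)))` on a queue manager error log such as AMQERR01.LOG gives a count per slot. A spike in one or two slots would point at something scheduled (backup, virus scan, snapshot) rather than a random fault.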
If there is a solution we would like to hear that asap.
Thanks.  |
|
|
 |
PeterPotkay |
Posted: Fri Oct 12, 2007 4:03 am Post subject: |
|
|
 Poobah
Joined: 15 May 2001 Posts: 7722
|
What version of MQ are you on? |
|
|
 |
2d0 |
Posted: Mon Oct 15, 2007 12:17 am Post subject: |
|
|
 Apprentice
Joined: 08 Mar 2007 Posts: 25
|
Same version as you, 6.0.2.1.
However, the failover hasn't happened again since last Friday.
This weekend it ran fine without any failovers.
I will keep you updated if it happens again. |
|
|
 |
ma.eyal |
Posted: Mon Oct 15, 2007 1:14 am Post subject: |
|
|
Novice
Joined: 13 Sep 2005 Posts: 15 Location: Israel
|
Hello.
We had the same problem some time ago.
We resolved it by increasing the priority of the WMQ processes.
To us the problem seemed to be that the WMQ processes were "held" by Windows backbone services of the cluster, making them wait to answer the alive check.
I am not a Windows expert, so I can't tell you exactly what was happening with WMQ on that system concerning MSCS. But when we decided to increase the priority, it eliminated the problem for us.
I hope this helps. |
|
|
 |
PeterPotkay |
Posted: Mon Oct 15, 2007 9:26 am Post subject: |
|
|
 Poobah
Joined: 15 May 2001 Posts: 7722
|
I found out in my PMR why things are worse after the MQ upgrade. In MQ 5.3 CSD11 and earlier, the looksAlive/isAlive check that the cluster uses to determine the status of the QM would wait indefinitely for a response. At MQ 5.3 CSD12 and later, it would wait only 10 seconds.
APAR IC52378
http://www-1.ibm.com/support/docview.wss?uid=swg1IC52378
allows us to set a system variable telling that health check how long to wait before timing out, plus cuts an FDC with a new specific Probe ID if the failover occurs because of a timeout. They still say the underlying problem is why the health check hangs and suspect high CPU, but that's not the case for us. We are going to look for heavy I/O as a possible culprit based on another company's experience.
Anyway, we are going to roll this APAR out and set the timeout to 300000 (5 minutes). If it waited forever in 5.3 CSD11, I can't see how 5 minutes is a bad thing. We tested in the lab by setting the timeout variable and then killing amqzmuc0.exe. The QM failed over immediately; it did not wait 5 minutes. They say that variable only comes into play if the health check (which checks every 5 seconds) is not responding.
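As a rough illustration of the behaviour described above, the Python sketch below models a health check with a timeout: a probe that answers "not alive" fails at once, and the timeout only matters when the probe does not answer at all. This is not IBM's implementation; `looks_alive`, `probe`, and `minutes_to_ms` are hypothetical names, and the only value taken from this thread is the 300000 ms setting.

```python
import threading
import time


def minutes_to_ms(minutes):
    """Convert minutes to the millisecond value the variable expects."""
    return int(minutes * 60 * 1000)


def looks_alive(probe, timeout_ms):
    """Return True only if probe() answers True within timeout_ms.

    probe is a hypothetical callable standing in for the real check.
    A prompt False (or an exception) is reported immediately; the
    timeout is only reached when probe() hangs.
    """
    result = {}

    def run():
        try:
            result["ok"] = bool(probe())
        except Exception:
            result["ok"] = False

    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    worker.join(timeout_ms / 1000.0)  # wait at most timeout_ms
    # No entry means the probe never answered: treat as not alive.
    return result.get("ok", False)


if __name__ == "__main__":
    timeout = minutes_to_ms(5)  # 300000, the value chosen in this thread
    started = time.monotonic()
    alive = looks_alive(lambda: False, timeout)  # "killed process" case
    print(alive, round(time.monotonic() - started, 2))
```

The "killed process" case returns straight away even with a 5-minute timeout, which matches the lab result: a longer timeout delays failover only for a hung check, not for a queue manager that is plainly down.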
ma.eyal,
I can see how upping the priority of the MQ processes could also help prevent this. |
|
|
 |
2d0 |
Posted: Mon Oct 15, 2007 11:33 pm Post subject: |
|
|
 Apprentice
Joined: 08 Mar 2007 Posts: 25
|
Peter, keep me updated on the progress.
The link you provided lists Fix Pack 6.0.2.3 as the delivery vehicle.
However, this fix pack has a planned release of January 2008.
How can you implement the APAR now? |
|
|
 |
PeterPotkay |
Posted: Tue Oct 16, 2007 6:19 am Post subject: |
|
|
 Poobah
Joined: 15 May 2001 Posts: 7722
|
Contact IBM MQ Support and reference this APAR; they will provide a hotfix for whatever MQ version you are at.
For MQ 6.0.2.1 (what we have), the hotfix consists of one new DLL plus setting that environment variable.
No problems in our lab. Rolling it out to DEV this weekend, QA next, and Production the weekend after that. |
|
|
 |
|