Recovery from a failed cluster member.
aboggis |
Posted: Thu Dec 02, 2004 11:33 am Post subject: Recovery from a failed cluster member. |
Centurion
Joined: 18 Dec 2001  Posts: 105  Location: Auburn, California
Env: Solaris 5.8, WMQ 5.3 (CSD05).
Scenario: Cluster of 4 queue managers, one queue manager per host, two full repositories, two partial. Everything working fine.
Settings:
Code:
From qm.ini:
TCP:
KeepAlive=yes
Channels:
AdoptNewMCA=ALL
AdoptNewMCATimeout=60
AdoptNewMCACheck=ALL
PipeLineLength=2
Code:
From runmqsc $QMGR
dis chl(TO*) hbint batchhb kaint
1 : dis chl(TO*) hbint batchhb kaint
AMQ8414: Display Channel details.
CHANNEL(TO.PP.AR1) HBINT(15)
BATCHHB(15) KAINT(AUTO)
AMQ8414: Display Channel details.
CHANNEL(TO.PP.AS1) HBINT(15)
BATCHHB(15) KAINT(AUTO)
AMQ8414: Display Channel details.
CHANNEL(TO.SP.AR1) HBINT(15)
BATCHHB(15) KAINT(AUTO)
AMQ8414: Display Channel details.
CHANNEL(TO.SP.AS1) HBINT(15)
BATCHHB(15) KAINT(AUTO)
The NIC card on a host was purposely failed for testing (using ifconfig to "down" the interface). In this case the failed host was one of the cluster's full repository managers.
The card was failed at x:04:31.
Using netstat to monitor the TCP/IP session state, we ascertained that the TCP/IP session went from ESTABLISHED to CLOSE_WAIT between x:04:55 and x:04:56 on all of the remote hosts. So, within 30 seconds.
During this time the sender channels to this host (the host with the now-failed NIC) were showing as RUNNING (using runmqsc). It wasn't until x:13:11 that the channels went to a RETRYING state.
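For reference, the failure and the monitoring were done with commands along these lines (interface name and listener port are illustrative, not our actual values):
Code:
# On the host under test: fail the interface (Solaris)
ifconfig hme0 down

# On each remote host: watch the channel's TCP session state
# (assuming the listener is on port 1414)
netstat -an | grep 1414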
We are running a custom cluster workload exit that checks the channel status to the destination queue manager before accepting a message, since we do not want to accept a message into the system if it cannot be delivered right away. We have internal application logic that will retry sending the message to an alternate destination if the exit rejects it. The problem here is that while the channel status remains RUNNING, the message is accepted. We can accept some delay in updating the channel status, but a delay of more than 8 minutes is not acceptable to us.
What can I do to minimise this delay?
Setting HBINT too low results in excessive (runaway) spawning of amqrmppa and amqzlaa0 processes and basically brings the machine to its knees, requiring a reboot (not something you want to do too often on a piece of hardware that costs nearly $1M and can take nearly an hour for a full system restart!).
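Since KAINT shows AUTO and qm.ini has KeepAlive=yes, the keepalive timing ultimately comes from the Solaris TCP stack, which defaults to two hours; that tunable may also be worth checking. A sketch, values illustrative:
Code:
# Show the current keepalive interval
# (milliseconds; Solaris default is 7200000, i.e. 2 hours)
ndd -get /dev/tcp tcp_keepalive_interval

# Lower it to 60 seconds so dead peers are detected sooner
ndd -set /dev/tcp tcp_keepalive_interval 60000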
Regards,
tonyB.
offshore |
Posted: Thu Dec 02, 2004 11:54 am Post subject: |
Master
Joined: 20 Jun 2002  Posts: 222
Aboggis,
What I did in a similar circumstance was to change the channel disconnect interval, along with several other channel parms.
I set DISCINT(300), i.e. 5 minutes: if nothing comes across the channel in that time, it shuts down.
The other things you may want to consider are (see the sketch after this list):
1.] short retry count
2.] short retry time
3.] long retry time
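For example, something along these lines on the channel definition (channel name and values are illustrative):
Code:
* Sketch only: channel name and values are illustrative
ALTER CHANNEL(TO.PP.AR1) CHLTYPE(CLUSRCVR) +
      DISCINT(300) +
      SHORTRTY(10) SHORTTMR(60) +
      LONGRTY(999999999) LONGTMR(600)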
HTH
Offshore
aboggis |
Posted: Thu Dec 02, 2004 1:03 pm Post subject: |
Centurion
Joined: 18 Dec 2001  Posts: 105  Location: Auburn, California
How does the channel disconnect affect this scenario (network failure)? I can understand where it comes into play in terms of conservation of network resources.
Also, my understanding of the retry counts and intervals is that they only take effect once the channel has gone down.
In our failure test, once the NIC card was re-enabled the channels were successfully re-established, so my retries were working fine.
My point here is that a cluster queue manager was abruptly "removed" from the cluster (essentially by unplugging its NIC). Why did it take the remaining cluster queue managers nearly 10 minutes to change their CLUSSDR channel statuses from RUNNING to RETRYING?
What can I do to minimise this time delay?
PeterPotkay |
Posted: Thu Dec 02, 2004 6:03 pm Post subject: |
Poobah
Joined: 15 May 2001  Posts: 7722
Tony, I don't have the answer, but some more questions. I too think that if you have your HBs set to 15, an HB is sent every 15 seconds, and if the sender can't send it, the sending MCA should get an immediate error from the network. Why does it take so long, I wonder?
You show a display of the HB values on some channels. Are those the CLUSSNDRs? Or the CLUSRCVRs? Remember, it's the CLUSRCVR channel of the FR QM whose NIC card you are pulling that determines what the HB will be on the automatically defined CLUSSDR channels on all the incoming QMs. Is the HB of the CLUSRCVRs set to 15 also?
To be 100% sure of the actual HB being used, do a display of the channel statuses once they are up and running. You will see a value for the HB in use.
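For example (generic channel name taken from the earlier display):
Code:
DIS CHSTATUS(TO.*) STATUS HBINT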
At this point I gotta think that maybe this is what is wrong; your channels are starting up with a bigger HB than you think.
What happens if instead of pulling the NIC, you stop the listener? How long does it take then?
_________________
Peter Potkay
Keep Calm and MQ On
PeterPotkay |
Posted: Thu Dec 02, 2004 6:22 pm Post subject: |
Poobah
Joined: 15 May 2001  Posts: 7722
Tony, check this post out:
http://www.mqseries.net/phpBB2/viewtopic.php?t=15619&highlight=heartbeat
Unfortunately, it only makes our point even more: the CLUSSNDR channel should see that the other side is down as soon as it tries to send an HB (at 15 seconds). Hopefully Paul Clarke from Hursley will see this question on the listserv, although since they just changed the address, it may be a while before we get everyone reading and posting on the listserv.
Jason? Nigel?
_________________
Peter Potkay
Keep Calm and MQ On
KeeferG |
Posted: Fri Dec 03, 2004 4:21 am Post subject: |
Master
Joined: 15 Oct 2004  Posts: 215  Location: Basingstoke, UK
Hi Tony,
I had a chat with Paul Clarke about our channel issues last week and will tell you about it next week when I am over. He is not in the office this week, so I can't get hold of him for this current issue, although I will have a talk with a couple of the current channel guys.
We don't really want to be using HBINT, as the channels should be constantly running; the HBINT code only gets executed when the sending MCA is doing no work. The low HBINT causes us issues when we hit a queue-full condition, because the receiving MCA is paused but the sending MCA does not get told this. I think we should move to a put-disable policy instead of queue full, but we can talk about this next week.
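A put-disable policy could be as simple as the following sketch (queue name illustrative):
Code:
* Sketch: inhibit puts instead of letting the queue fill
ALTER QLOCAL(APP.INPUT.QUEUE) PUT(DISABLED)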
_________________
Keith Guttridge
-----------------
Using MQ since 1995