MQSeries.net :: View topic - Recovery from a failed cluster member.

Recovery from a failed cluster member.

aboggis

Posted: Thu Dec 02, 2004 11:33 am Post subject: Recovery from a failed cluster member.

Centurion

Joined: 18 Dec 2001
Posts: 105
Location: Auburn, California

Env: Solaris 5.8, WMQ 5.3 (CSD05).

Scenario: Cluster of 4 queue managers, one queue manager per host: 2 full repositories and 2 partial repositories. Everything working fine.

Settings:

Code:

From qm.ini:

TCP:
   KeepAlive=yes
Channels:
   AdoptNewMCA=ALL
   AdoptNewMCATimeout=60
   AdoptNewMCACheck=ALL
   PipeLineLength=2

Code:

From runmqsc $QMGR

dis chl(TO*) hbint batchhb kaint
1 : dis chl(TO*) hbint batchhb kaint
AMQ8414: Display Channel details.
CHANNEL(TO.PP.AR1) HBINT(15)
BATCHHB(15) KAINT(AUTO)
AMQ8414: Display Channel details.
CHANNEL(TO.PP.AS1) HBINT(15)
BATCHHB(15) KAINT(AUTO)
AMQ8414: Display Channel details.
CHANNEL(TO.SP.AR1) HBINT(15)
BATCHHB(15) KAINT(AUTO)
AMQ8414: Display Channel details.
CHANNEL(TO.SP.AS1) HBINT(15)
BATCHHB(15) KAINT(AUTO)

The NIC on one host was deliberately failed for testing (using ifconfig to "down" the interface). The failed host was one of the cluster's full repository queue managers.

The card was failed at x:04:31.

Using 'netstat' to monitor the TCP/IP session state, we ascertained that the session went from ESTABLISHED to CLOSE_WAIT between x:04:55 and x:04:56 on all of the remote hosts. So, within 30 seconds.

During this time the sender channels to this host (the host with the now failed NIC) were showing as "RUNNING" (using runmqsc). It wasn't until x:13:11 that the channels went to a RETRYING state.
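To put the timings above side by side, here is a small Python sketch. The post gives the times only as x:MM:SS, so the hour is arbitrarily assumed to be 00; only the differences matter.

```python
from datetime import datetime

# The hour is not stated in the post (written as "x"), so 00 is assumed here.
def at(mm_ss: str) -> datetime:
    return datetime.strptime("00:" + mm_ss, "%H:%M:%S")

nic_failed = at("04:31")
close_wait_seen = at("04:56")  # later bound of the x:04:55 to x:04:56 window
retrying_seen = at("13:11")

tcp_delay = (close_wait_seen - nic_failed).total_seconds()
channel_delay = (retrying_seen - nic_failed).total_seconds()

print(f"TCP session left ESTABLISHED after at most {tcp_delay:.0f} s")
print(f"Channel status changed after {channel_delay:.0f} s "
      f"({channel_delay // 60:.0f} min {channel_delay % 60:.0f} s)")
```

So the TCP session state changed within about 25 seconds, while the channel status took 8 minutes 40 seconds to follow.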

We are running a custom cluster workload exit that checks the channel status to the destination queue manager before accepting a message, since we do not want to accept a message into the system if it cannot be delivered right away. We have internal application logic that will retry sending the message to an alternate destination if the exit rejects the message. The problem here is that if the channel status remains at "RUNNING", the message is accepted. We can accept that there is some sort of delay in updating the channel status, but a delay of more than 8 minutes is not acceptable to us.
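The effect of a stale status on this kind of exit can be modelled in a few lines. This is only an illustrative sketch in Python, not the actual exit (a real cluster workload exit is compiled code invoked by the queue manager); the function names and the queue manager names QM_FR1 and QM_PR1 are hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical model of the decision described above, not real exit code.
ACCEPT, REJECT = "accept", "reject"

def exit_decision(channel_status: str) -> str:
    """Accept a message only if the channel to the destination reports RUNNING."""
    return ACCEPT if channel_status == "RUNNING" else REJECT

def route(reported_status: Dict[str, str], preferred: List[str]) -> Optional[str]:
    """Try destinations in order; return the first one the exit accepts."""
    for qmgr in preferred:
        if exit_decision(reported_status[qmgr]) == ACCEPT:
            return qmgr
    return None

# QM_FR1 is the host with the failed NIC. Until its channel status is
# updated, it still reports RUNNING, so the exit keeps accepting messages for it.
during_outage = {"QM_FR1": "RUNNING", "QM_PR1": "RUNNING"}
after_update = {"QM_FR1": "RETRYING", "QM_PR1": "RUNNING"}

print(route(during_outage, ["QM_FR1", "QM_PR1"]))  # QM_FR1
print(route(after_update, ["QM_FR1", "QM_PR1"]))   # QM_PR1
```

The exit logic itself is sound in this model; the wrong routing comes entirely from the status it is given, which is why the length of the RUNNING to RETRYING delay matters.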

What can I do to minimise this delay?

Setting HBINT too low results in excessive (run-away) spawning of amqrmppa & amqzlaa0 processes and basically brings the machine to its knees, requiring a re-boot (not something you want to do too often on a piece of hardware that costs nearly $1m and can take nearly an hour for a full system restart!).

Regards,

tonyB.

offshore

Posted: Thu Dec 02, 2004 11:54 am Post subject:

Master

Joined: 20 Jun 2002
Posts: 222

Aboggis,

What I did for a similar circumstance is change the channel disconnect interval, along with several other channel parms.

I set DISCINT(300), i.e. 5 minutes: if nothing comes across the channel in that time, it shuts down.

The other things you may want to consider are:
1.] short retry count
2.] short retry time
3.] long retry time

HTH
Offshore

aboggis

Posted: Thu Dec 02, 2004 1:03 pm Post subject:

Centurion

Joined: 18 Dec 2001
Posts: 105
Location: Auburn, California

How does the channel disconnect affect this scenario (network failure)? I can understand where it comes into play in terms of conservation of network resources.

Also my understanding of the retry counts & intervals is that they only have any effect once the channel has gone down.

In our failure test, once the NIC card was re-enabled the channels were successfully re-established, so my retries were working fine.

My point here is that a cluster queue manager was basically abruptly "removed" from the cluster (essentially by unplugging its NIC). Why did it take the remaining cluster queue managers nearly 10 minutes to change their CLUSSDR channel statuses from RUNNING to RETRYING?

What can I do to minimise this time delay?

PeterPotkay

Posted: Thu Dec 02, 2004 6:03 pm Post subject:

Poobah

Joined: 15 May 2001
Posts: 7716

Tony, I don't have the answer, but some more questions. I too think that if you have your HBs set to 15, an HB is sent every 15 seconds, and if the sender can't send it, then the sending MCA should get an immediate error from the network. Why does it take so long, I wonder?

You show a display of the HB values on some channels. Are those the CLUSSNDRs? Or the CLUSRCVRs? Remember, it's the CLUSRCVR channel of the FR QM whose NIC you are pulling that determines what the HB will be on the automatically defined CLUSSNDR channels on all the incoming QMs. Is the HB of the CLUSRCVRs set to 15 also?

To be a 100% sure of what the actual HB is that is being used, do a display of the channel statuses once they are up and running. You will see a value for HB being used.
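When there are many channels, the values can be compared by parsing the runmqsc output rather than reading it by eye. The following is a minimal Python sketch that uses part of the DISPLAY CHANNEL output from the original post as its sample input. It assumes the channel status display uses the same KEYWORD(value) layout; the function name parse_runmqsc is made up for this example.

```python
import re

# Sample copied from the "dis chl(TO*) hbint batchhb kaint" output in the
# original post. Channel status output is assumed to have the same layout.
SAMPLE = """\
AMQ8414: Display Channel details.
CHANNEL(TO.PP.AR1) HBINT(15)
BATCHHB(15) KAINT(AUTO)
AMQ8414: Display Channel details.
CHANNEL(TO.PP.AS1) HBINT(15)
BATCHHB(15) KAINT(AUTO)
"""

# Matches KEYWORD(value) pairs such as HBINT(15).
PAIR = re.compile(r"([A-Z]+)\(([^)]*)\)")

def parse_runmqsc(text):
    """Return one dict of KEYWORD -> value per 'AMQnnnn:' record."""
    records = []
    for line in text.splitlines():
        if re.match(r"AMQ\d+:", line):
            records.append({})  # a message line starts a new record
        elif records:
            records[-1].update(PAIR.findall(line))
    return records

for rec in parse_runmqsc(SAMPLE):
    print(rec["CHANNEL"], rec.get("HBINT"))
```

Any channel whose reported HBINT differs from the expected 15 would then stand out immediately.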

At this point I gotta think that maybe this is what is wrong; your channels are starting up with a bigger HB than you think.

What happens if instead of pulling the NIC, you stop the listener? How long does it take then?
_________________
Peter Potkay
Keep Calm and MQ On

PeterPotkay

Posted: Thu Dec 02, 2004 6:22 pm Post subject:

Poobah

Joined: 15 May 2001
Posts: 7716

Tony, check this post out:
http://www.mqseries.net/phpBB2/viewtopic.php?t=15619&highlight=heartbeat

Unfortunately, it only makes our point even more: that the CLUSSNDR channel should see the other side is down as soon as it tries to send an HB (at 15 seconds). Hopefully Paul Clarke from Hursley will see this question on the listserve, although since they just changed the address, it may be a while before we get everyone reading and posting on the listserve.

Jason? Nigel?
_________________
Peter Potkay
Keep Calm and MQ On

KeeferG

Posted: Fri Dec 03, 2004 4:21 am Post subject:

Master

Joined: 15 Oct 2004
Posts: 215
Location: Basingstoke, UK

Hi Tony,

I have been having a chat with Paul Clarke about our channel issues last week and will tell you about it next week when I am over. He is not in the office this week, so I can't get hold of him for this current issue, although I will have a talk with a couple of the current channel guys.

We don't really want to be using HBINT, as the channels should be constantly running. The HBINT code only gets executed when the sending MCA is doing no work. The low HBINT causes us issues when we hit a queue full, because the receiving MCA is paused but the sending MCA does not get told this. I think we should adjust to a put-disable policy instead of queue full, but we can talk about this next week.
_________________
Keith Guttridge
-----------------
Using MQ since 1995
