MQSeries.net :: View topic - MQ Cluster - Is FailOver a prime concern?

MQ Cluster - Is FailOver a prime concern?

Vivekga

Posted: Mon Dec 09, 2002 5:20 am Post subject: MQ Cluster - Is FailOver a prime concern?

Newbie

Joined: 09 Dec 2002
Posts: 6

Hello All,

I am facing a problem with MQ clustering. I have explained the requirement below. Please help me in solving the problem.

Thanks and Regards,
Vivek

Environment
---------------------
MQSeries version: 5.2.4
Platform: Windows NT
Workload exit written in "C" language.

I have three queue managers, QM1, QM2 and QM3, running on three different machines. All are part of one cluster, and all three queue managers are full repositories. A local queue with the same name, say LQ, is defined on both QM1 and QM2 and is shared in the cluster. When messages are put from QM3 to this queue (LQ), they are distributed between QM1 and QM2 on a round-robin basis.
But the requirement is to put the message on QM2 only when QM1 is down. For this I am using a workload exit with network priority set in such a way that whenever a message is put from QM3, it always goes to QM1; QM2 comes into the picture only when QM1 is down. The workload exit program is available at
http://www-3.ibm.com/software/ts/mqseries/txppacs/mc76.html

Problem
--------------
Now the problem is that when a message is put from QM3 to LQ after QM1 has gone down, the cluster information has not yet been refreshed, so the message is still sent to QM1 and gets stuck in a system queue until QM1 comes back. The rest of the messages flow smoothly to QM2, since QM1 is then known to be down.

Requirement
--------------
I want to refresh the cluster information before any message is put to the queue from QM3, so that no message is sent to QM1 while it is down.

OR

Is there any other way to achieve this? Please share any pointers you have.

kolban

Posted: Mon Dec 09, 2002 8:34 am Post subject:

Grand Master

Joined: 22 May 2001
Posts: 1072
Location: Fort Worth, TX, USA

Interesting ...
If QM1 is down, I don't see that a cluster refresh is actually needed. The cluster information will not have changed. There is still a QM1, there is still a cluster shared queue called LQ at QM1. The only thing that is different is that QM1 is down. This will manifest itself as a down channel.

I am thinking that if a message is put to LQ from QM3 and QM1 is down, the exit should already send the messages to QM2 without any further action.

Let's look at how this might be possible ...

The QM3 queue manager is the one that invokes the exit to select the target queue manager. The exit is provided with a list of all the possibilities... these would be

QM1
QM2

As well as the possibilities, the information would also include the channel states ... QM1 would be something OTHER than RUNNING or INACTIVE ... this could be used as the discriminator to send to QM2.
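A minimal sketch of that discriminator in plain C follows. The `Destination` struct, the `ChannelState` enum and `choose_destination` are hypothetical stand-ins invented for illustration; a real workload exit gets its data from the structures in the MQ headers, which are not used here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified channel states (NOT the real MQ constants). */
typedef enum { CH_INACTIVE, CH_RUNNING, CH_RETRYING, CH_STOPPED } ChannelState;

/* Hypothetical view of one candidate destination offered to the exit. */
typedef struct {
    const char  *qmgr_name;
    ChannelState state;
    int          priority;   /* higher value = preferred */
} Destination;

/* A channel counts as usable if it is running or can be started on demand. */
int is_usable(ChannelState s)
{
    return s == CH_RUNNING || s == CH_INACTIVE;
}

/* Return the index of the highest-priority usable destination, or -1 if none. */
int choose_destination(const Destination *dests, size_t count)
{
    int best = -1;
    size_t i;

    assert(dests != NULL || count == 0);
    for (i = 0; i < count; i++) {
        if (!is_usable(dests[i].state))
            continue;   /* skip anything other than RUNNING or INACTIVE */
        if (best < 0 || dests[i].priority > dests[best].priority)
            best = (int)i;
    }
    return best;
}
```

With QM1 given the higher priority, this picks QM1 while its channel is RUNNING or INACTIVE and falls back to QM2 otherwise. It can only be as accurate as the channel state it is handed.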

Vivekga

Posted: Tue Dec 10, 2002 7:34 pm Post subject:

Newbie

Joined: 09 Dec 2002
Posts: 6

Thanks for the early reply, Kolban.

But the channel to QM1 is still shown as active; its status is not refreshed immediately when QM1 goes down. So the workload exit thinks the channel is still active and puts the message to QM1. Is there a way to refresh the channel status?

Regards,
Vivek

Vivekga

Posted: Wed Dec 11, 2002 5:06 am Post subject: Is there a way to refresh Cluster information

Newbie

Joined: 09 Dec 2002
Posts: 6

Hello all,

I am still facing the same problem. Even after QM1 is brought down, the channel status at QM3 does not reflect it, and hence the message is not diverted to QM2. In the workload exit, clwlFunction receives the MQWXP structure, which contains the unrefreshed status of the queue managers, so the channel status is always RUNNING or INACTIVE and never RETRYING.
This failure occurs only for the first message after QM1 is brought down; the rest of the messages go to QM2 smoothly.

I don't want the stuck message to wait until QM1 is brought back up.
Is there a way to solve this problem?

Regards,
Vivek

dutchman

Posted: Thu Dec 19, 2002 2:05 am Post subject:

Acolyte

Joined: 15 May 2001
Posts: 71
Location: Netherlands

Hi - I'm the author of MC76.

What you're seeing is a problem related to the way MQSeries detects changes in channel status, and is the same whether you use clustering or not. Any change to a remote queue manager is NOT noticed by the local queue manager for a period related to the heartbeat interval.

In other words, the local workload exit will look at the available queue managers, their channel status and whether the cluster queues are 'put enabled'. If it thinks things are 'ok' the first message is sent to that remote queue manager. The MCA retrieves the message from the common cluster transmit queue and gives it to the TCP layer to deliver. TCP does not respond back, and the channel goes to 'indoubt' and 'retrying'. When subsequent messages are sent, the local queue manager now 'knows' one of them is not 'ok' and so forwards the rest to the other remote queue manager.
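The sequence above can be mimicked with a toy model in C. This is a sketch only: the `Sim` struct and `send_one` are hypothetical names, no MQ calls are made, and the model assumes the local view flips to 'retrying' as soon as one send fails.

```c
#include <assert.h>
#include <stddef.h>

/* What the local queue manager (QM3) believes about its channel to QM1. */
typedef enum { VIEW_OK, VIEW_RETRYING } LocalView;

/* Hypothetical toy state; none of this is a real MQ structure. */
typedef struct {
    LocalView view_of_qm1;   /* local, possibly stale, belief */
    int qm1_really_up;       /* ground truth, unknown until a send fails */
    int stuck_on_xmitq;      /* messages marooned waiting for QM1 */
    int delivered_qm1;
    int delivered_qm2;
} Sim;

/* Route one message the way the exit would, using only the local view. */
void send_one(Sim *s)
{
    assert(s != NULL);
    if (s->view_of_qm1 == VIEW_OK) {
        /* The exit prefers QM1 because nothing looks wrong yet. */
        if (s->qm1_really_up) {
            s->delivered_qm1++;
        } else {
            /* Delivery fails: the message is stuck and the view is corrected. */
            s->stuck_on_xmitq++;
            s->view_of_qm1 = VIEW_RETRYING;
        }
    } else {
        /* QM1 is now known to be unavailable, so QM2 is chosen. */
        s->delivered_qm2++;
    }
}
```

Starting with `view_of_qm1 = VIEW_OK` and `qm1_really_up = 0`, the first call leaves one message stuck and every later call goes to QM2, which matches the behaviour described in this thread.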

This situation has become known as the 'stuck' or 'marooned' message syndrome.

Now, how do you solve this problem? It depends ...

I'll respond if there is interest.

Vivekga

Posted: Thu Dec 19, 2002 3:30 am Post subject: Please validate the alternate solution

Newbie

Joined: 09 Dec 2002
Posts: 6

Hello Dutchman,

Thanks for the reply. I'll try the scenario with a low heartbeat interval.
We tried a couple of POCs (proofs of concept) and failed. Finally we succeeded with the following approach. However, please validate the solution below.

We are running a daemon process which keeps checking whether the sender channel is in RETRYING status. When a message is stuck, the channel automatically goes to the RETRYING state. The daemon process then does the following:
1) Resolves the stuck message and puts it back on the transmission queue.
2) Replaces the sender channel with an active one.
3) Chops off the header added by the transmission queue.

And the message flows to the active queue manager.
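Step 3 might look roughly like the C sketch below. It is an illustration under assumptions: `XmitHeader` is a hypothetical stand-in and does not reproduce the real transmission-queue header layout, and `strip_xmit_header` is an invented helper, not an MQ API call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the header prefixed to a message on a
   transmission queue. The real header layout is different. */
typedef struct {
    char struc_id[4];
    char remote_q_name[48];
    char remote_qmgr_name[48];
} XmitHeader;

/* Copy the payload that follows the header into out.
   Returns the payload length, or -1 if the message is too short
   to contain a header or the payload does not fit in out. */
long strip_xmit_header(const unsigned char *msg, size_t msg_len,
                       unsigned char *out, size_t out_cap)
{
    size_t payload_len;

    assert(msg != NULL && out != NULL);
    if (msg_len < sizeof(XmitHeader))
        return -1;
    payload_len = msg_len - sizeof(XmitHeader);
    if (payload_len > out_cap)
        return -1;
    memcpy(out, msg + sizeof(XmitHeader), payload_len);
    return (long)payload_len;
}
```

The stripped payload is what would then be re-put to the cluster queue, so that the workload exit can pick a destination again.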

Is this the right way of backing out the stuck message, or is there a better way of achieving this?

Regards,
Vivek

dutchman

Posted: Thu Dec 19, 2002 3:45 am Post subject:

Acolyte

Joined: 15 May 2001
Posts: 71
Location: Netherlands

Hi Vivekga - that's an interesting solution you have there! Remember, though, that no matter how small you make the heartbeat interval, there will always be a small window in which a problem can occur.

Do I take it that you actually read the message off the cluster transmit queue, strip off the header, re-put it to the cluster queue, and then let the workload exit choose the next one?

The 'official' way is to use the new set of cluster utilities that come with the IBM Supportpac MS0G

http://www-3.ibm.com/software/ts/mqseries/txppacs/ms0g.html

It's well worth getting yourself familiar with these utilities before you go into production.

I do, however, have a question for you: what happens if you've already sent a number of messages to one queue manager, and then something goes wrong. The remaining messages are sent to the next queue manager, so at the end your messages will be spread across 2 queue managers. Is this ok?

Are you still using the MC76 supportpac?

BTW, MQ V5.3 comes with cluster enhancements.

Regards ... R

Vivekga

Posted: Thu Dec 19, 2002 4:03 am Post subject: I'll try the new links

Newbie

Joined: 09 Dec 2002
Posts: 6

Thank you very much for the immediate reply.

Yes we are using MC76, but the "Total Solution" is under testing.

At the receiving end, a custom program will keep monitoring the queues for messages in both places. Hence no problems have been seen so far.

Thanks for the link. I'll try to use the official solution.

Thanks again,

Regards,
Vivek

Vivekga

Posted: Tue Jan 07, 2003 2:43 am Post subject: Solution

Newbie

Joined: 09 Dec 2002
Posts: 6

Hello all,

There is no need to do the steps I explained above. Instead we can smoke out the 'stuck' message by executing a backout at the command prompt.

I am executing the backout command against all the sender channels of the cluster queue managers at regular intervals. My problem is solved, but if there is any better solution, please do post it.
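The post does not show the exact command. Assuming it refers to the MQSC command `RESOLVE CHANNEL(...) ACTION(BACKOUT)`, a small C helper that builds the command text for one channel could look like this. The function name `format_backout_command` and the channel name `TO.QM1` used below are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build the MQSC text for backing out an in-doubt channel.
   ASSUMPTION: "backout" in the post means RESOLVE CHANNEL ... ACTION(BACKOUT).
   Returns 0 on success, -1 if the command does not fit in buf. */
int format_backout_command(const char *channel, char *buf, size_t cap)
{
    int n;

    assert(channel != NULL && buf != NULL);
    n = snprintf(buf, cap, "RESOLVE CHANNEL(%s) ACTION(BACKOUT)", channel);
    return (n < 0 || (size_t)n >= cap) ? -1 : 0;
}
```

For a channel named `TO.QM1` this yields `RESOLVE CHANNEL(TO.QM1) ACTION(BACKOUT)`, which a scheduled job could then feed to `runmqsc` on the sending queue manager, once per cluster sender channel.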

Regards,
Vivek
