Author |
Message
|
JYama |
Posted: Sun Nov 04, 2007 6:31 pm Post subject: How do cluster channels detect a network error? |
|
|
 Master
Joined: 27 Mar 2002 Posts: 281
|
I'm using WMQv6.021 on AIX and have an MQ cluster environment.
My question is how an MQ cluster detects a network error and switches the route when one of the queue managers participating in the cluster stops unexpectedly.
For example, how long does a cluster channel need to detect a network error? Is it governed by the heartbeat or something else?
In my environment, with heartbeat=10 [secs], it took about 30 seconds before the failed/stopped QMgr was removed from the cluster.
I want to shorten this time if possible.
Any ideas?
My environment is as follows:
3 QMgrs are clustered; QMgr1 and QMgr2 are full repositories and QMgr0 is a partial repository.
MQ applications MQPUT/MQGET to/from AQ, which is an alias queue targeting remote Q1.
Therefore messages are round-robined between Q1 on QMgr1 and Q1 on QMgr2.
APL (MQPUT/GET to AQ) --> QMgr0 AQ (alias queue, tgtQ=Q1)
                              +
                              +--- QMgr1 (Q1)
                              +
                              +--- QMgr2 (Q1)
One interesting thing is that QMgr0 seemed to keep routing messages to Q1 on QMgr2 even though QMgr2 was not working (node down).
As a result, 4 messages were lost during the 30 secs of processing time.
I thought the MQ cluster would immediately detect that QMgr2 was down and remove it from the destinations. Is that wrong?
(I'm using non-persistent msgs, BTW.)
I have no idea how to cope with this.  |
|
Michael Dag |
Posted: Sun Nov 04, 2007 11:23 pm Post subject: |
|
|
 Jedi Knight
Joined: 13 Jun 2002 Posts: 2607 Location: The Netherlands (Amsterdam)
|
that's why clustering is not a high availability solution!
when the channel goes into retrying mode, messages are still delivered to the SCTQ destined for QMgr2, so messages are 'stuck' on the SCTQ.
When you stop the channel to QMgr2 or suspend QMgr2 and the route becomes unavailable, there is a 'last minute' mechanism that checks whether stuck messages can be delivered elsewhere... _________________ Michael
MQSystems Facebook page |
|
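The behaviour Michael describes can be pictured with a small toy model. This is only a sketch under stated assumptions, not MQ's actual workload algorithm: the function `route` and the message names are hypothetical, and the model simply treats a destination as usable until it is explicitly marked unavailable (for example by stopping the channel or suspending the queue manager).

```python
from itertools import cycle


def route(messages, destinations, unavailable=()):
    """Round-robin messages over destinations not marked unavailable (toy model)."""
    usable = [d for d in destinations if d not in unavailable]
    if not usable:
        raise ValueError("no usable destination")
    picker = cycle(usable)
    return {m: next(picker) for m in messages}


# While the channel to QMgr2 is merely RETRYING, QMgr2 still counts as a
# destination, so part of the traffic is queued on the SCTQ for it.
before = route(["m1", "m2", "m3", "m4"], ["QMgr1", "QMgr2"])

# Messages waiting on the SCTQ for the stopped queue manager.
stuck = [m for m, dest in before.items() if dest == "QMgr2"]

# Once the route is marked unavailable, the stuck messages are reallocated.
after = route(stuck, ["QMgr1", "QMgr2"], unavailable={"QMgr2"})

print(before)
print(stuck)
print(after)
```

In this model the messages are never lost; they wait until the route is either restored or declared unavailable, which is the point of the 'stuck on the SCTQ' remark.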
JYama |
Posted: Sun Nov 04, 2007 11:42 pm Post subject: |
|
|
 Master
Joined: 27 Mar 2002 Posts: 281
|
Thank you very much for the useful information, Michael.
Stuck messages are OK.
My problem is that multiple messages were lost and there are no stuck messages.
Are there 'external' parameters that affect the behavior of MQ Clustering?  |
|
Michael Dag |
Posted: Sun Nov 04, 2007 11:50 pm Post subject: |
|
|
 Jedi Knight
Joined: 13 Jun 2002 Posts: 2607 Location: The Netherlands (Amsterdam)
|
JYama wrote: |
Thank you very much for the useful information, Michael.
Stuck messages are OK.
My problem is that multiple messages were lost and there are no stuck messages.
Are there 'external' parameters that affect the behavior of MQ Clustering?  |
there are no lost messages... most likely they are 'in' the retrying channel... _________________ Michael
MQSystems Facebook page |
|
JYama |
Posted: Sun Nov 04, 2007 11:56 pm Post subject: |
|
|
 Master
Joined: 27 Mar 2002 Posts: 281
|
Michael Dag wrote: |
there are no lost messages... most likely they are 'in' the retrying channel... |
Thanks for your update.
Before I contact IBM support, I'd like to clarify the 'route' of clustered messages.
Is it correct that inbound messages targeted at QMgr2 always pass through QMgr1, like QMgr0 -> QMgr1 -> QMgr2?
Additionally, if QMgr2 is not working, would the messages be 'stuck' on the SCTQ on 'QMgr1'? |
|
JYama |
Posted: Mon Nov 05, 2007 2:22 am Post subject: |
|
|
 Master
Joined: 27 Mar 2002 Posts: 281
|
BTW, when does the status of a 'clussdr' channel change from RUNNING to RETRYING if one of the target QMgrs suddenly stops?
Who decides whether a target QMgr is running or not?
In my case, multiple messages were lost during the 'routing' process.
One interesting thing was that the 'clussdr' channel status was 'RUNNING' even though its target QMgr was NOT running. When does the status change to RETRYING??
I have a lot of questions about MQ Clustering now...  |
|
Vitor |
Posted: Mon Nov 05, 2007 2:45 am Post subject: |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
JYama wrote: |
Is it correct that inbound messages targeted at QMgr2 always pass through QMgr1, like QMgr0 -> QMgr1 -> QMgr2?
Additionally, if QMgr2 is not working, would the messages be 'stuck' on the SCTQ on 'QMgr1'? |
No, the cluster will auto-define channels between the source and target queue managers. _________________ Honesty is the best policy.
Insanity is the best defence. |
|
Michael Dag |
Posted: Mon Nov 05, 2007 2:47 am Post subject: |
|
|
 Jedi Knight
Joined: 13 Jun 2002 Posts: 2607 Location: The Netherlands (Amsterdam)
|
JYama wrote: |
BTW, when does the status of a 'clussdr' channel change from RUNNING to RETRYING if one of the target QMgrs suddenly stops?
Who decides whether a target QMgr is running or not?
|
JYama wrote: |
In my case, multiple messages were lost during the 'routing' process.
|
JYama wrote: |
One interesting thing was that the 'clussdr' channel status was 'RUNNING' even though its target QMgr was NOT running. When does the status change to RETRYING??
I have a lot of questions about MQ Clustering now...  |
_________________ Michael
MQSystems Facebook page |
|
Vitor |
Posted: Mon Nov 05, 2007 2:51 am Post subject: |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
JYama wrote: |
BTW, when does the status of a 'clussdr' channel change from RUNNING to RETRYING if one of the target QMgrs suddenly stops?
Who decides whether a target QMgr is running or not? |
At the same time as for a non-clustered queue manager - when the sender MCA fails to receive a response!
JYama wrote: |
In my case, multiple messages were lost during the 'routing' process.
|
It's the fate of non-persistent messages to be lost when things go a bit funny.
JYama wrote: |
One interesting thing was that the 'clussdr' channel status was 'RUNNING' even though its target QMgr was NOT running. When does the status change to RETRYING?? |
When the various intervals expire.
JYama wrote: |
I have a lot of questions about MQ Clustering now...  |
Then the Clustering manual will be your friend!  _________________ Honesty is the best policy.
Insanity is the best defence. |
|
Michael Dag |
Posted: Mon Nov 05, 2007 2:51 am Post subject: |
|
|
 Jedi Knight
Joined: 13 Jun 2002 Posts: 2607 Location: The Netherlands (Amsterdam)
|
Pressed submit too soon...
This is not a clustering course; take one or get a manual, where it is explained in great detail...
JYama wrote: |
BTW, when does the status of a 'clussdr' channel change from RUNNING to RETRYING if one of the target QMgrs suddenly stops?
Who decides whether a target QMgr is running or not?
|
you were probably looking at the clussdr channel to the repository?
JYama wrote: |
In my case, multiple messages were lost during the 'routing' process.
|
stop saying messages were lost... MQ does not lose messages unless you did something to make it lose messages, like resetting a channel...
JYama wrote: |
One interesting thing was that the 'clussdr' channel status was 'RUNNING' even though its target QMgr was NOT running. When does the status change to RETRYING??
I have a lot of questions about MQ Clustering now...  |
cluster channels are no different than other channels; like I said, you were probably looking at the defined cluster channels and not the auto-defined ones, like Vitor mentioned. _________________ Michael
MQSystems Facebook page |
|
JYama |
Posted: Mon Nov 05, 2007 3:06 am Post subject: |
|
|
 Master
Joined: 27 Mar 2002 Posts: 281
|
Quote: |
you were probably looking at the clussdr channel to the repository? |
Yes, that's right.
Doesn't it indicate the status of a target?
Quote: |
stop saying messages were lost... MQ does not lose messages unless you did something to make it lose messages, like resetting a channel...
|
What I did was try to shut down one of the target QMgrs.
Thus I guess at least one message should be lost, which is OK, but in my case 4 msgs were gone..., and this is my problem.. |
|
Vitor |
Posted: Mon Nov 05, 2007 3:17 am Post subject: |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
JYama wrote: |
Thus I guess at least one message should be lost, which is OK, but in my case 4 msgs were gone..., and this is my problem.. |
How did you come to this number of 4? As you say at least one message should be lost so why do you think 4 is a problem?
And if lost messages are a problem, use persistent messages!  _________________ Honesty is the best policy.
Insanity is the best defence. |
|
fjb_saper |
Posted: Mon Nov 05, 2007 4:36 am Post subject: |
|
|
 Grand High Poobah
Joined: 18 Nov 2003 Posts: 20756 Location: LI,NY
|
Vitor wrote: |
And if lost messages are a problem, use persistent messages!  |
 _________________ MQ & Broker admin |
|
PeterPotkay |
Posted: Mon Nov 05, 2007 7:24 am Post subject: |
|
|
 Poobah
Joined: 15 May 2001 Posts: 7722
|
The 4 messages that went to QM2 - were they told to go to QM2? Were they specifically addressed to QM2, or is it possible that the putting app used the BIND_ON_OPEN option?
What is the NPMSPEED of the CLUSRCVR channel on QM2? If it's set to FAST and you are sending non-persistent messages, it's quite possible that several messages were sent down a channel that was no longer 100% good, and those messages were discarded.
It's gonna take some time for a SNDR to realize the RCVR is having issues. But if your channel speed is NORMAL, and/or you are using persistent messages, I don't think you should lose any messages, assuming the putting app doesn't say that the messages should go to the down QM.
Try your tests again with persistent messages and let us know the results.
The Heartbeat will help identify a downed channel but only comes into play if there are no messages flowing. Also, realize that HB values set below 60 seconds act a little differently than you would think (or want!). Lookie here:
http://www.mqseries.net/phpBB2/viewtopic.php?t=15619&highlight=heartbeats+seconds _________________ Peter Potkay
Keep Calm and MQ On |
|
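The settings Peter mentions map onto ordinary MQSC attributes. The sketch below is only illustrative: it builds the MQSC command text in Python so the attributes can be seen side by side. NPMSPEED, HBINT, DEFBIND and DEFPSIST are real MQSC attributes, but the channel name TO.QMGR2 and the helper `mqsc_alter` are assumptions made up for this example; substitute the names used in your own cluster.

```python
def mqsc_alter(object_type, name, **attrs):
    """Build one MQSC ALTER command string from keyword attributes."""
    parts = [f"ALTER {object_type}({name})"]
    # Keyword order is preserved, so CHLTYPE stays right after the channel name.
    parts += [f"{key.upper()}({value})" for key, value in attrs.items()]
    return " ".join(parts)


commands = [
    # Hypothetical channel name. NPMSPEED(NORMAL) trades some speed for not
    # discarding non-persistent messages when the channel fails mid-batch.
    mqsc_alter("CHANNEL", "TO.QMGR2", chltype="CLUSRCVR",
               npmspeed="NORMAL", hbint=10),
    # DEFBIND(NOTFIXED) lets the destination be chosen per message instead of
    # being fixed when the queue is opened.
    mqsc_alter("QLOCAL", "Q1", defbind="NOTFIXED"),
    # DEFPSIST(YES) makes messages persistent when the putting app takes the
    # queue default for persistence.
    mqsc_alter("QALIAS", "AQ", defpsist="YES"),
]

for command in commands:
    print(command)
```

The printed lines could be pasted into runmqsc on the relevant queue manager. Whether NORMAL speed and persistent messages are acceptable depends on your throughput needs, so treat these values as a starting point for the retest Peter suggests rather than as a recommendation.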
JYama |
Posted: Mon Nov 05, 2007 7:42 am Post subject: |
|
|
 Master
Joined: 27 Mar 2002 Posts: 281
|
Vitor wrote: |
How did you come to this number of 4? As you say at least one message should be lost so why do you think 4 is a problem?
And if lost messages are a problem, use persistent messages!  |
You're right. Using persistent msgs is the only solution for this.
'4 msgs lost' means that I couldn't find them in the SCTQ, DLQ, etc...
Actually, I have an application which kept sending one msg every 5 secs to the front QMgr0 containing the aliasQ.
What happened was that it took approximately 20 to 30 secs before the 'failed route' was removed from routing.
Msgs #1, #2, #3 and #4 (20 secs in total) were gone, and #5 was successfully routed because, I guess, the MQ cluster had detected that the target QMgr was not running...
What I'd like to know is: why the clussdr channel indicated 'RUNNING' even though the target QMgr was NOT running; why the status didn't change to 'RETRYING' after sending ONE message to the stopped target QMgr; why it took 20 to 30 secs for the clussdr status to change to 'RETRYING'; and what this 'long' interval is.
How can I shorten this interval?
Again, I agree with you that I should use persistent msgs to avoid losing msgs, but I'd like to understand clearly what's going on in my environment.  |
|
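The arithmetic behind the '4 msgs' figure above can be sketched as follows. This is a simplified model, assuming every message put during the detection window is routed to the stopped queue manager, as reported in the post; with strict round-robin only some of them would be. The function name `messages_at_risk` is hypothetical.

```python
def messages_at_risk(put_interval_s, detection_window_s):
    """Count puts made before the failed route is detected (simplified model)."""
    # Puts happen at t = 0, put_interval_s, 2 * put_interval_s, ...
    # and are at risk while t is still inside the detection window.
    return len(range(0, detection_window_s, put_interval_s))


# One message every 5 secs, route removed after roughly 20 secs.
print(messages_at_risk(5, 20))  # 4

# Same put rate, but with the 30-sec upper bound mentioned above.
print(messages_at_risk(5, 30))  # 6
```

Under this model, shortening the detection window or lowering the put rate reduces the number of non-persistent messages exposed, but it never reaches zero, which is why persistent messages were suggested earlier in the thread.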