Author |
Message
|
Wally |
Posted: Wed Sep 22, 2010 5:19 am Post subject: Message loss with failover in MQ V7 Cluster |
|
|
Novice
Joined: 22 Sep 2010 Posts: 15
|
Hello out there,
- I run a small test MQ v7 cluster on a single Windows box
- I have 3 QMGRs named T1, T2 and S1 listening on ports 1501, 1502 and 1503
- T1 and T2 are full repositories and S1 is a partial repository
- T1 and T2 each have a local queue named TARGETQ which is shared in the cluster named CLUST1 (bind type is Not Fixed and default persistence is persistent)
- S1 has an alias queue SENDQ pointing at TARGETQ
- The setup was created via MQ Explorer: first the T1/T2 cluster, then S1 was added
The plan is to send messages to S1 and process them on T1 or T2.
When I send messages to S1.SENDQ, they are happily distributed across T1 and T2. But when I stop T1 or T2 and send messages again, the first message that round-robin assigns to the off-line QMGR is always lost, and all subsequent messages arrive on the on-line QMGR. So if I send messages M1, M2 and M3, the first message M1 going to the now off-line member disappears completely without any error or warning.
Even if the stopped QMGR is started again, the message stays lost. I tried sending the message with MQ Explorer as well as with a simple JMS client program that explicitly sets message persistence.
According to the documentation this should work. Has anyone had the same strange issue, or can anyone provide an example config I can compare to mine? Any help appreciated!
Please  |
|
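For comparison, the setup described above might look roughly like this in MQSC. This is only a sketch inferred from the post: the objects were created through MQ Explorer, so the real definitions may differ, and the DEFPSIST and DEFBIND values on the alias are assumptions rather than something the post confirms.

```
* On T1 and T2 (full repositories): the clustered target queue
DEFINE QLOCAL(TARGETQ) CLUSTER(CLUST1) DEFBIND(NOTFIXED) DEFPSIST(YES)

* On S1 (partial repository): the alias the sender puts to
* (DEFBIND and DEFPSIST here are assumed, not taken from the post)
DEFINE QALIAS(SENDQ) TARGET(TARGETQ) DEFBIND(NOTFIXED) DEFPSIST(YES)
```

Since the application opens the alias, the defaults on SENDQ are worth comparing as well as those on TARGETQ.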
Back to top |
|
 |
Vitor |
Posted: Wed Sep 22, 2010 5:50 am Post subject: Re: Message loss with failover in MQ V7 Cluster |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
Wally wrote: |
when I stop T1 or T2 and again try to send message always the first message based on Round-Robin to the off-line QMGR is lost and all subsequent messages appear on-line QMGR. So if I send messages M1, M2 and M3 the first message M1 going to the now off-line member disappears completely without any error or warning. |
I'd expect the message to be in the SCTQ rather than lost. Does the message have expiry set?
Wally wrote: |
According to the documentation this should work. |
It should work within limits. There have been a number of discussions on using a WMQ cluster for this kind of HA solution and why it doesn't work all that well; the "stuck message" problem.
(Your M1 should be stuck in the SCTQ rather than lost). _________________ Honesty is the best policy.
Insanity is the best defence. |
|
Wally |
Posted: Wed Sep 22, 2010 6:01 am Post subject: Re: Message loss with failover in MQ V7 Cluster |
|
|
Novice
Joined: 22 Sep 2010 Posts: 15
|
Quote: |
I'd expect the message to be in the SCTQ rather than lost. Does the message have expiry set?
It should work within limits. There have been a number of discussions on using a WMQ cluster for this kind of HA solution and why it doesn't work all that well; the "stuck message" problem.
(Your M1 should be stuck in the SCTQ rather than lost).
|
I would also expect to see the message in the TX or a DL queue, but I can't see any message there. Can you please point me to some material on the HA discussion?
The only thing I observe is this in the error log of S1:
Code: |
-------------------------------------------------------------------------------
9/22/2010 15:00:43 - Process(30492.15) User(_xyz) Program(amqrmppa.exe)
Host(abc)
AMQ9202: Remote host 'abc(xx.xxx.xxx.xx) (1502)' not available, retry
later.
EXPLANATION:
The attempt to allocate a conversation using TCP/IP to host 'abc
(xx.xxx.xxx.xx) (1502)' was not successful. However the error may be a
transitory one and it may be possible to successfully allocate a TCP/IP
conversation later.
ACTION:
Try the connection again later. If the failure persists, record the error
values and contact your systems administrator. The return code from TCP/IP is
10061 (X'274D'). The reason for the failure may be that this host cannot reach
the destination host. It may also be possible that the listening program at
host 'abc (xx.xxx.xxx.xx) (1502)' was not running. If this is the case,
perform the relevant operations to start the TCP/IP listening program, and try
again.
----- amqccita.c : 1289 -------------------------------------------------------
9/22/2010 15:00:43 - Process(30492.15) User(_xyz) Program(amqrmppa.exe)
Host(abc)
AMQ9999: Channel program ended abnormally.
EXPLANATION:
Channel program 'TO.T2' ended abnormally.
ACTION:
Look at previous error messages for channel program 'TO.T2' in the error files
to determine the cause of the failure.
----- amqrccca.c : 921 --------------------------------------------------------
|
|
exerk |
Posted: Wed Sep 22, 2010 6:07 am Post subject: Re: Message loss with failover in MQ V7 Cluster |
|
|
 Jedi Council
Joined: 02 Nov 2006 Posts: 6339
|
Wally wrote: |
...I would also expect to see the message in the TX or a DL queue, but I can't see any message there... |
Stop all your channels from the queue manager from where you 'sent' the message, then check the S.C.T.Q _________________ It's puzzling, I don't think I've ever seen anything quite like this before...and it's hard to soar like an eagle when you're surrounded by turkeys. |
|
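The check suggested above could be run in runmqsc on S1 along these lines. A minimal sketch: the channel name TO.T2 appears in the error log earlier in the thread, while TO.T1 is assumed by analogy and may be named differently.

```
* Stop the cluster-sender channels from S1
* (TO.T2 is from the error log; TO.T1 is an assumed name)
STOP CHANNEL(TO.T1)
STOP CHANNEL(TO.T2)

* Then see how many messages are waiting on the cluster transmit queue
DISPLAY QLOCAL(SYSTEM.CLUSTER.TRANSMIT.QUEUE) CURDEPTH
```

With the channels stopped, nothing can leave the transmit queue, so the depth should match the number of messages sent.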
mqjeff |
Posted: Wed Sep 22, 2010 6:08 am Post subject: |
|
|
Grand Master
Joined: 25 Jun 2008 Posts: 17447
|
If the message is not persistent, this is not unexpected.
If the channel is not fully stopped, then the message could be in transaction. |
|
bruce2359 |
Posted: Wed Sep 22, 2010 6:09 am Post subject: |
|
|
 Poobah
Joined: 05 Jan 2008 Posts: 9469 Location: US: west coast, almost. Otherwise, enroute.
|
Quote: |
I would also expect to see the message in the TX or a DL queue, but I can't see any message there. |
Exactly which queues did you look into? The SCTQ is not named TX. _________________ I like deadlines. I like to wave as they pass by.
ב''ה
Lex Orandi, Lex Credendi, Lex Vivendi. As we Worship, So we Believe, So we Live. |
|
Wally |
Posted: Wed Sep 22, 2010 6:17 am Post subject: |
|
|
Novice
Joined: 22 Sep 2010 Posts: 15
|
So I had a look into the S1.SYSTEM.CLUSTER.TRANSMIT.QUEUE, and it looks like it is non-persistent by default. I will change this to persistent and run my test again. The message should be persistent anyway, as I use this code to send it:
Code: |
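jmsTemplate.send(new MessageCreator() {
    public Message createMessage(Session session) throws JMSException {
        TextMessage msg = session.createTextMessage(message);
        // Request persistent delivery for this message.
        // Note: JMS providers set JMSDeliveryMode at send time, so the
        // producer's delivery mode may override this value.
        msg.setJMSDeliveryMode(DeliveryMode.PERSISTENT);
        return msg;
    }
});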
|
|
|
Vitor |
Posted: Wed Sep 22, 2010 6:23 am Post subject: |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
Wally wrote: |
it looks like this is by default non-persistent - will modify this to persistent |
That's only a default setting, as has been said many, many times here.
Wally wrote: |
Code: |
jmsTemplate.send(new MessageCreator() {
    public Message createMessage(Session session) throws JMSException {
        TextMessage msg = session.createTextMessage(message);
        msg.setJMSDeliveryMode(DeliveryMode.PERSISTENT);
        return msg;
    }
});
|
|
This should (my JMS is weak) override that setting |
|
Vitor |
Posted: Wed Sep 22, 2010 6:24 am Post subject: |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
mqjeff wrote: |
If the channel is not fully stopped, then the message could be in transaction. |
Check the count on the queue |
|
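One way to check the count, and whether anything is still held in a unit of work, is DISPLAY QSTATUS on the sending queue manager. This is a sketch that assumes it is run in runmqsc on S1.

```
* Current depth plus an indication of uncommitted work on the queue
DISPLAY QSTATUS(SYSTEM.CLUSTER.TRANSMIT.QUEUE) TYPE(QUEUE) CURDEPTH UNCOM
```

UNCOM reports whether uncommitted puts or gets are pending on the queue, which would fit the "in transaction" case mentioned above.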
Wally |
Posted: Wed Sep 22, 2010 6:33 am Post subject: |
|
|
Novice
Joined: 22 Sep 2010 Posts: 15
|
So when I run the "failover test" I stop the complete qmgr T1, wait a while, and then try to send messages again.
Changing the SYSTEM.CLUSTER.TRANSMIT.QUEUE to persistent does not help either.
When I stop both sender channels to the rest of the cluster, the messages sit in the SYSTEM.CLUSTER.TRANSMIT.QUEUE and wait. Starting the channel to the off-line member doesn't change the situation, and starting the second channel again transfers ALL of the messages to the on-line member.
But a failure of one node would not stop the sender channels beforehand, would it? |
|
bruce2359 |
Posted: Wed Sep 22, 2010 6:35 am Post subject: |
|
|
 Poobah
Joined: 05 Jan 2008 Posts: 9469 Location: US: west coast, almost. Otherwise, enroute.
|
Quote: |
So I had a look into the S1.SYSTEM.CLUSTER.TRANSMIT.QUEUE, but saying this it looks like this is by default non-persistent - will modify this to persistent and run my test again. |
First, queues are neither persistent nor non-persistent - messages are.
Is the SCTQ really named S1.SYSTEM.CLUSTER.TRANSMIT.QUEUE?
What queue does your application open? |
|
Wally |
Posted: Wed Sep 22, 2010 6:41 am Post subject: |
|
|
Novice
Joined: 22 Sep 2010 Posts: 15
|
bruce2359 wrote: |
First, queues are neither persistent nor non-persistent - messages are.
Is the SCTQ really named S1.SYSTEM.CLUSTER.TRANSMIT.QUEUE?
What queue does your application open? |
Sorry for the confusion, I meant SYSTEM.CLUSTER.TRANSMIT.QUEUE on my qmgr S1.
So I have 2 clustered queues on T1 and T2 named TARGETQ, and my little sample app or MQ Explorer sends its test message to the TARGETQ at S1. |
|
Vitor |
Posted: Wed Sep 22, 2010 6:46 am Post subject: |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
Wally wrote: |
So when I run the "failover test" I stop the complete qmgr T1; wait a time and then again try to send messages. |
How long a time? It will take a discussed and documented period for the channels to notice the failure.
Changing the System.Cluster.Transmit.Queue to persistent does help either.
Wally wrote: |
So when I stop both sender channels to the rest of the cluster the messages sit in the System.Cluster.Transmit.Queue and wait. Starting the channel to the off-line member doesn't change the situation |
By which you mean the 1st message (which you've seen by browsing on the SCTQ) does not arrive on the on-line queue manager but the others do?
Wally wrote: |
and starting the second channel again will transfer ALL of the message to the on-line member. |
So in one scenario you get all the messages, in the other you get all-1?
Wally wrote: |
But having a failure of one node will not stop the sender channels before - no? |
It will put the channel to the downed queue manager (don't call it a node - this isn't an HA situation) into retry after a period. |
|
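Whether the channel has actually gone into retry can be seen from S1 with something like the following. A sketch only: the TO.T* pattern assumes the cluster channels follow the TO.T2 naming seen in the error log.

```
* Status of the cluster-sender channels (look for RETRYING vs RUNNING)
DISPLAY CHSTATUS(TO.T*) STATUS

* S1's view of the other cluster queue managers and their channels
DISPLAY CLUSQMGR(*) STATUS
```

Running these shortly after stopping T1 and again a little later would show how long the channel takes to notice the failure.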
Wally |
Posted: Wed Sep 22, 2010 6:57 am Post subject: |
|
|
Novice
Joined: 22 Sep 2010 Posts: 15
|
Vitor wrote: |
How long a time? It will take a discussed and documented period for the channels to notice the failure.
|
I wait about 30 seconds, or at least until the Explorer comes back reporting it as stopped.
Vitor wrote: |
Changing the System.Cluster.Transmit.Queue to persistent does help either.
|
I have already changed it to persistent, but the behaviour is still the same.
Vitor wrote: |
By which you mean the 1st message (which you've seen by browsing on the SCTQ) does not arrive on the on-line queue manager but the other do?
So in one scenario you get all the messages, in the other you get all-1?
|
So when I have stopped the channels before sending out messages again, I can see all my messages in the S.C.T.Q on S1, and when starting up the channels all messages are transferred to the on-line qmgr.
Whereas in the scenario where I only bring one qmgr off-line and send messages, the first message targeted to the off-line qmgr (according to the round-robin algorithm) is lost (like me now). |
|
Vitor |
Posted: Wed Sep 22, 2010 7:23 am Post subject: |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
Wally wrote: |
Whereas in the scenario when I only bring one qmgr off-line and send messages the first message targeted to the off-line qmgr (according to the round-robin algorithm) is lost (like me now). |
How do you mean "targeted"? If a message is addressed to a given queue manager it bypasses the cluster workload distribution.
So if you have 3 messages browsable in the SCTQ and bring one of the queue managers on line what happens?
If a message (M1) isn't sent to the on-line queue manager but M2 & M3 are, what happens if you then bring the other queue manager on-line?
Are you certain expiry isn't in use? |
|