Author |
Message
|
In_love_with_MQ |
Posted: Fri Nov 03, 2006 5:44 am Post subject: CLWLPRTY FAILOVER TAKES SOME TIME : Urgent issue |
|
|
Acolyte
Joined: 10 Jul 2005 Posts: 70
|
Hi,
We have two P590 servers hosting two instances of a cluster queue, which receives messages from many queue managers in the cluster.
On one queue manager (A), a queue called X is defined with CLWLPRTY 9.
On the other (B), a queue called X is defined with CLWLPRTY 8.
So all messages flow via A (no issues). Now when I shut down A, by design the messages should start flowing immediately via B.
But no: it takes some time, and messages start flowing only after a while. Because of this, lots of messages are stuck in the XMITQ of the senders, since they still point to A.
So why is failover not happening instantly?
Is there a limitation? If so, then this is a bug. Or am I doing something wrong? |
|
|
 |
Vitor |
Posted: Fri Nov 03, 2006 6:04 am Post subject: |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
<Standard Rant>
I'm not happy about MQ clustering being used for failover; it's primarily for workload balancing, and I think high-availability clustering should be done with dedicated software.
</Standard Rant>
Having said that, I suspect the problem is that the workload balancing algorithm doesn't immediately detect that A has stopped processing messages. When you say "shutdown" do you mean "endmqm" or do you mean "pulled cable out of box"?
In any event I suspect the cluster heartbeat interval (or similar mechanism, I'm guessing at this point) needs to detect that A is unresponsive; the Clusters manual says:
Quote: |
WebSphere MQ obtains the priority of queue managers after checking channel status. This means that only accessible queue managers are available for selection, and it allows WebSphere MQ to prioritize, where multiple destinations are available. |
I might be inclined to try setting CLWLPRTY on the channel rather than the queue. Or doing failover with dedicated software.
Note: I accept unreservedly that the Clusters manual for v6.0 says these parameters can be used to provide failover (and they even put it in italics). My assertion is that it's not designed for that, the attributes are new for v6 and it's not especially good at it (as this poster seems to have found).
I mean, my car's manual says it can carry 60Kg in the boot. Doesn't say how fast it goes with that much load, and doesn't mean I won't hire a van if I ever need to carry something that weighs that much. _________________ Honesty is the best policy.
Insanity is the best defence. |
|
|
 |
Gaya3 |
Posted: Fri Nov 03, 2006 6:05 am Post subject: |
|
|
 Jedi
Joined: 12 Sep 2006 Posts: 2493 Location: Boston, US
|
Hi
Whenever the first queue manager (here, A) goes down, the second queue manager should automatically take over as per the failover condition.
I suspect it keeps trying to connect to A for some interval of time
before it learns that A is down. After that, failover to QM B works.
So if you have set some heart beat interval, disconnect interval of the channel
Thanks and Regards
Gaythri _________________ Regards
Gayathri
-----------------------------------------------
Do Something Before you Die |
|
|
 |
Vitor |
Posted: Fri Nov 03, 2006 6:22 am Post subject: |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
Gaya3 wrote: |
So if you have set some heart beat interval, disconnect interval of the channel
|
Isn't that what I said......?  _________________ Honesty is the best policy.
Insanity is the best defence. |
|
|
 |
Gaya3 |
Posted: Fri Nov 03, 2006 6:25 am Post subject: |
|
|
 Jedi
Joined: 12 Sep 2006 Posts: 2493 Location: Boston, US
|
Hi Vitor
Yes, exactly. We both worked on this issue at the same time, I believe.
Gayathri _________________ Regards
Gayathri
-----------------------------------------------
Do Something Before you Die |
|
|
 |
happyj |
Posted: Fri Nov 03, 2006 6:31 am Post subject: |
|
|
Voyager
Joined: 07 Feb 2005 Posts: 87
|
If I understand this right, the messages will still go to A until the sending queue manager knows that the channel is not active. They will then go to B. The channel status can be determined by using the BATCHHB on the channel. |
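The behaviour described in this post and in the Clusters manual quote earlier in the thread can be pictured with a small Python sketch. It is a simplified model only, not the real workload algorithm: the `Destination` class, the `channel_ok` flag and the `eligible` function are hypothetical names introduced for illustration. The point it tries to show is that the choice depends on what the sender currently believes about the channel, so A keeps winning until the failure has actually been noticed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Destination:
    qmgr: str         # hosting queue manager, e.g. "A"
    clwlprty: int     # queue CLWLPRTY, 0-9, higher is preferred
    channel_ok: bool  # sender's current belief about the channel (hypothetical flag)


def eligible(destinations):
    """Keep only the highest-priority destinations among those believed reachable."""
    reachable = [d for d in destinations if d.channel_ok]
    if not reachable:
        return []
    top = max(d.clwlprty for d in reachable)
    return [d for d in reachable if d.clwlprty == top]


a = Destination("A", 9, channel_ok=True)
b = Destination("B", 8, channel_ok=True)

# While the sender still believes the channel to A is healthy, A wins.
before = [d.qmgr for d in eligible([a, b])]

# Once the sender has noticed the failure, only B remains.
a_down = Destination("A", 9, channel_ok=False)
after = [d.qmgr for d in eligible([a_down, b])]
```

In this model the failover delay is simply the time it takes for `channel_ok` to flip to `False` on the sending side, which is where heartbeat-style settings would come in.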
|
|
 |
Vitor |
Posted: Fri Nov 03, 2006 6:47 am Post subject: |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
That's certainly what I suspect to be happening. Now to avoid this you need to detect change in channel status which is why I suggested using the channel priority; if the xmitq to A is filling up the workload balancer might be inclined to start routing to B. It's a question for better minds than mine how quickly it will do that & how configurable it is.
To get instant failover I fear you'd need heartbeats so frequent that message throughput would suffer (note this is a fear, not a fact). Another way round this would be to write a workload balancing exit that (by fair means or foul) determined the loss of a queue manager and rerouted traffic. Not a trivial task, and it opens up a raft of maintainability / support / efficiency issues.
Better to buy high availability software! <StandardRant/> _________________ Honesty is the best policy.
Insanity is the best defence. |
|
|
 |
Vitor |
Posted: Fri Nov 03, 2006 6:49 am Post subject: Re: CLWLPRTY FAILOVER TAKES SOME TIME : Urgent issue |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
In_love_with_MQ wrote: |
But no: it takes some time, and messages start flowing only after a while. Because of this, lots of messages are stuck in the XMITQ of the senders, since they still point to A.
|
Am I reading this correctly and no messages are getting stuck and unprocessed on A? Everything ends up on B, albeit after a delay? _________________ Honesty is the best policy.
Insanity is the best defence. |
|
|
 |
In_love_with_MQ |
Posted: Fri Nov 03, 2006 7:04 am Post subject: Veritas is costly affair |
|
|
Acolyte
Joined: 10 Jul 2005 Posts: 70
|
Hi ,
Third-party software like Veritas is beyond the project's budget, and since IBM clearly states in the documentation that CLWLPRTY is used for failover, we just went for it.
We initially did a POC but failed to test it thoroughly, and we have ended up with this issue in prod.
As you asked: the messages do not reach B. Messages sent during the failover gap remain in the xmitq and travel only after A is back up. |
|
|
 |
jefflowrey |
Posted: Fri Nov 03, 2006 7:17 am Post subject: |
|
|
Grand Poobah
Joined: 16 Oct 2002 Posts: 19981
|
If the budget of the project does not include costs for the requirements of the project, then it is a badly budgeted project, or the requirements were badly written or badly understood.
That is, if the requirements of the project say "high availability" and "failover", and the budget does not cover real HA software like Veritas and the hardware necessary to run that - then either the budget doesn't meet the requirements, or the requirements are only "suggestions". _________________ I am *not* the model of the modern major general. |
|
|
 |
Vitor |
Posted: Fri Nov 03, 2006 7:19 am Post subject: Re: Veritas is costly affair |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
In_love_with_MQ wrote: |
As you asked: the messages do not reach B. Messages sent during the failover gap remain in the xmitq and travel only after A is back up. |
The curse of the stuck messages.
I thought for one glorious moment it was fixed in the new release.
I do wish IBM had used a different word than "failover" when describing the new clustering parameters.
There might be some mileage in raising a PMR with IBM. At worst, you can suggest they use a less emotive & misleading word...  _________________ Honesty is the best policy.
Insanity is the best defence.
Last edited by Vitor on Fri Nov 03, 2006 7:23 am; edited 1 time in total |
|
|
 |
Vitor |
Posted: Fri Nov 03, 2006 7:21 am Post subject: Re: Veritas is costly affair |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
In_love_with_MQ wrote: |
....since IBM clearly states in the documentation that CLWLPRTY is used for failover, we just went for it.
|
Gotta admire the gung ho spirit though!  _________________ Honesty is the best policy.
Insanity is the best defence. |
|
|
 |
In_love_with_MQ |
Posted: Fri Nov 03, 2006 7:35 am Post subject: I think in that case Ibm is to be blamed |
|
|
Acolyte
Joined: 10 Jul 2005 Posts: 70
|
Then I think IBM needs to provide proper documentation.
I think we were fooled. Or we made fools of ourselves...
Now the headache of solving this issue begins, whatever it takes.
I tried BATCHHB as well, and even then I can see some messages getting stuck.
"CLWLPRTY IS NOT FOR FAILOVER": LESSON LEARNT
I am gonna sue IBM  |
|
|
 |
Vitor |
Posted: Fri Nov 03, 2006 8:06 am Post subject: |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
IBM have got to be a first port of call.
If I was in the situation you're in (and bear in mind I'm not!) and I was fairly happy with coding I might have a go at the workload exit, code something to distribute traffic away from failed queue managers.
Alternatively you could have something which monitors queue manager health directly and suspends the queue manager from the cluster at the 1st sign of trouble. You'd still get some stuck messages, but it would highlight to the default workload balancer that the queue manager should no longer be used.
Just an idea and not fully thought about as yet. Other posters may have better ideas (and whatever you decide, try it in dev first!!) _________________ Honesty is the best policy.
Insanity is the best defence. |
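The monitor-and-suspend idea in this post might be sketched roughly as follows in Python. This is an untested illustration under stated assumptions, not a supported tool: `probe` and `suspend` are hypothetical callables that you would have to supply (for example a health check of your choosing and something that feeds MQSC text to runmqsc), and `MYCLUSTER` is a placeholder cluster name. The only MQ-specific piece is the `SUSPEND QMGR CLUSTER(...)` MQSC command, which has to be issued on the queue manager being suspended, so the sketch assumes that queue manager can still process commands when trouble is detected.

```python
def suspend_command(cluster_name):
    """Build the MQSC text that asks the cluster to stop routing to this queue manager."""
    return f"SUSPEND QMGR CLUSTER({cluster_name})"


def monitor(probe, suspend, cluster_name, max_failures=1, max_checks=10):
    """Call probe() up to max_checks times; suspend after max_failures bad results in a row.

    probe   -- hypothetical callable returning True while the queue manager is healthy
    suspend -- hypothetical callable that submits MQSC text (e.g. to runmqsc)
    Returns True if a suspend was issued.
    """
    failures = 0
    for _ in range(max_checks):
        if probe():
            failures = 0
            continue
        failures += 1
        if failures >= max_failures:
            suspend(suspend_command(cluster_name))
            return True
    return False


# Dry run with canned probe results and a list standing in for runmqsc.
results = iter([True, True, False])
issued = []
suspended = monitor(lambda: next(results), issued.append, "MYCLUSTER", max_checks=3)
```

A real monitor would pause between probes and would need a matching resume step once the queue manager is healthy again; both are left out here to keep the sketch short. As the post says, try anything like this in dev first.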
|
|
 |
|