Strange behavior for load balancing with MQ 6.0
LMD
Posted: Mon May 21, 2007 9:36 am
Post subject: Strange behavior for load balancing with MQ 6.0
Acolyte
Joined: 30 Oct 2002  Posts: 56  Location: Paris - France
Hi all,
I'm facing an interesting situation with load balancing of clustered queues on MQ V6.0.
I have a network of 6 queue managers: 3 on a primary site (suffixed "P") and 3 on a secondary site (suffixed "S").
Each S queue manager is supposed to be the backup of the corresponding P queue manager in case of failure.
On the P side, I have the following QMs:
- POSP --> full repository
- GUEP --> the QM where the source application resides
- SAGP --> the QM where the target application resides
On the S side, I have the same 3 QMs, ending with S instead of P.
GUEx and SAGx have explicit sender channels to POSP and POSS.
All six QMs belong to the same MQ cluster.
On SAGP and SAGS, I have a cluster queue named GUE_SAG.
On SAGP, the channel TO.SAGP has a cluster priority (CLWLPRTY) of 6.
On SAGS, the channel TO.SAGS has a cluster priority of 0.
All other parameters are default.
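To give an idea, the relevant definitions look something like this (a sketch only; the cluster name, host names and port are invented here, and I can post the real definitions):
Code:
* On SAGP -- cluster-receiver channel preferred via CLWLPRTY(6),
* plus the clustered target queue (DEMOCLUS and the connames are invented)
DEFINE CHANNEL(TO.SAGP) CHLTYPE(CLUSRCVR) TRPTYPE(TCP) +
       CONNAME('sagp.host(1414)') CLUSTER(DEMOCLUS) CLWLPRTY(6)
DEFINE QLOCAL(GUE_SAG) CLUSTER(DEMOCLUS)

* On SAGS -- the same queue, but CLWLPRTY(0) so it should only be
* chosen when SAGP is unavailable
DEFINE CHANNEL(TO.SAGS) CHLTYPE(CLUSRCVR) TRPTYPE(TCP) +
       CONNAME('sags.host(1414)') CLUSTER(DEMOCLUS) CLWLPRTY(0)
DEFINE QLOCAL(GUE_SAG) CLUSTER(DEMOCLUS)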
Under normal circumstances, when I put a message from GUEP to the GUE_SAG queue, the message is delivered to the GUE_SAG queue on SAGP.
After a restart (endmqm + strmqm) of GUEP, if I put ONE message from GUEP to GUE_SAG, the message is delivered to the queue on SAGP.
After a restart (endmqm + strmqm) of GUEP, if I put MORE THAN ONE message from GUEP to GUE_SAG (for example using the "Put Message" option of MO71 with "Message count = 5") within a short period of time (say 2 seconds), load balancing occurs between the SAGP and SAGS queues.
The same happens if I use amqsput with a high rate of input (so this is not an MO71-related behaviour).
This is true only for the first "batch" of messages; subsequent puts send all messages to the SAGP queue.
This is not the behavior I expect from load balancing in this transient situation...
Have I missed something in the parameters?
Is this a bug, or a feature?
I was able to reproduce this:
- on AIX, with MQ 6.0.1.0
- on Linux, with MQ 6.0.1.0 and 6.0.2.0
- on WinXP, with MQ 6.0.2.0 and 6.0.2.1
Any help will be greatly appreciated.
PS: I can provide the MQSC definitions for this cluster, which actually runs on my laptop.
_________________
lmd_at_demey-consulting.fr - http://demey-consulting.fr - Paris, France.
WMQ, WAS & IIB Certified.
#IBMChampion
Ivans
Posted: Mon May 21, 2007 6:18 pm
Apprentice
Joined: 03 Jan 2006  Posts: 48  Location: Hursley
One explanation could be channel status changes as follows...
Channels TO.SAGP and TO.SAGS on GUEP are in INACTIVE state.
1) A message is put to GUE_SAG. As both channels have equal status the message goes to the highest priority channel (i.e. TO.SAGP).
2) A second message is put to GUE_SAG. TO.SAGP is now in STARTING state (because it was trigger started by the first message) and TO.SAGS is in INACTIVE state. INACTIVE channels are chosen over STARTING channels so the message goes to TO.SAGS.
3) A third message is put to GUE_SAG. TO.SAGP is now in STARTING state (because it was trigger started by the first message) and TO.SAGS is now in STARTING state (because it was trigger started by the second message). As both channels have equal status the message goes to the highest priority channel (i.e. TO.SAGP).
4) A fourth message is put to GUE_SAG. TO.SAGP is now in RUNNING state (because it was trigger started by the first message) and TO.SAGS is now in STARTING state (because it was trigger started by the second message). RUNNING channels are chosen over STARTING channels so the message goes to TO.SAGP.
5) A fifth message is put to GUE_SAG. Both channels are now in RUNNING state and, because they have equal status, the message goes to the highest priority channel (i.e. TO.SAGP).
You can see that as channels start they temporarily enter less preferential states, thus overriding CLWLPRTY. You can avoid this by starting the channels ahead of the workload, but keeping the channels up in this manner is likely to be difficult to manage. If you really must have all messages go to the production queue managers you may want to consider using CLWLRANK, but this would need manual intervention (to alter CLWLRANKs) at failover.
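If you want to watch this happening, displaying channel status on GUEP while the batch is put should show the sequence, and the CLWLRANK alternative would look roughly like this (a sketch only; TO.SAG* assumes your channel names, and the rank values are arbitrary):
Code:
* In runmqsc on GUEP, repeat this while the batch of messages is put;
* you should see the STARTING -> RUNNING transitions described above
DISPLAY CHSTATUS(TO.SAG*) STATUS

* The CLWLRANK alternative -- rank is honoured ahead of channel status,
* but has to be re-ranked by hand at failover
* (run on SAGP and SAGS respectively):
ALTER CHANNEL(TO.SAGP) CHLTYPE(CLUSRCVR) CLWLRANK(9)
ALTER CHANNEL(TO.SAGS) CHLTYPE(CLUSRCVR) CLWLRANK(0)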
Cheers,
Ian
LMD
Posted: Wed May 23, 2007 12:38 am
Acolyte
Joined: 30 Oct 2002  Posts: 56  Location: Paris - France
Hi Ivans,
Thank you for your answer.
I agree with your explanation about channel status, but it's not the behavior expected in a load-balancing environment.
Extract from the MQ Info Center, CLWLPRTY channel attribute:
Quote:
Where there are two possible destinations, you can use this attribute to allow one queue manager to act as a failover, if the other queue manager becomes unavailable.
If we restart GUEP with some persistent messages waiting in the xmitq for the GUE_SAG queue, and the TO.SAGP channel has a higher CLWLPRTY than TO.SAGS, no messages at all should be sent to the SAGS queue.
CLWLPRTY should not be overridden by channel status.
If load balancing occurs at this point, it means that CLWLPRTY is useless from a "failover QMgr" perspective.
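(As an aside, you can see those waiting messages with standard MQSC; in V6 all cluster traffic sits on the one cluster transmission queue:)
Code:
* On GUEP, after strmqm and before the channels reach RUNNING,
* the persistent messages wait on the cluster transmission queue:
DISPLAY QLOCAL(SYSTEM.CLUSTER.TRANSMIT.QUEUE) CURDEPTH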
I hope I am wrong, and that there is a combination of parameters that allows MQ clusters to be configured this way.
Thanks in advance for any ideas...
_________________
lmd_at_demey-consulting.fr - http://demey-consulting.fr - Paris, France.
WMQ, WAS & IIB Certified.
#IBMChampion
PeterPotkay
Posted: Wed May 23, 2007 8:46 am
Post subject: Re: Strange behavior for load balancing with MQ 6.0
Poobah
Joined: 15 May 2001  Posts: 7722
LMD wrote:
Under normal circumstances, when I put a message from GUEP to the GUE_SAG queue, the message is delivered to the GUE_SAG queue on SAGP.
After a restart (endmqm + strmqm) of GUEP, if I put ONE message from GUEP to GUE_SAG, the message is delivered to the queue on SAGP.
After a restart (endmqm + strmqm) of GUEP, if I put MORE THAN ONE message from GUEP to GUE_SAG (for example using the "Put Message" option of MO71 with "Message count = 5") within a short period of time (say 2 seconds), load balancing occurs between the SAGP and SAGS queues.
The same happens if I use amqsput with a high rate of input (so this is not an MO71-related behaviour).
This is true only for the first "batch" of messages; subsequent puts send all messages to the SAGP queue.
So for how long does the unexpected/unwanted load balancing occur?
I'm with you on this one, in that it doesn't work the way we think it should. But check out this quote a few sentences later under CLWLPRTY:
Quote:
WebSphere MQ obtains the priority of queue managers after checking channel status. This means that only accessible queue managers are available for selection, and it allows WebSphere MQ to prioritize, where multiple destinations are available.
After checking channel status???!!!
Well, there's your problem. I think Ian's explanation kinda matches this. And if true, I think it makes CLWLPRTY a little less useful than you would first imagine.
I guess it boils down to: is any channel status other than RUNNING or INACTIVE going to be treated as an un-"accessible" QM, in which case it is ignored and CLWLPRTY takes a back seat to a starting channel?
_________________
Peter Potkay
Keep Calm and MQ On
Last edited by PeterPotkay on Thu May 24, 2007 1:56 pm; edited 1 time in total
LMD
Posted: Wed May 23, 2007 9:27 am
Post subject: Re: Strange behavior for load balancing with MQ 6.0
Acolyte
Joined: 30 Oct 2002  Posts: 56  Location: Paris - France
PeterPotkay wrote:
So for how long does the unexpected/unwanted load balancing occur?
I have not made detailed measurements, but it's for the first 2 or 3 seconds after a restart of the QM.
Enough to jeopardize the application, as 2 or 3 waiting messages can be sent to the backup site.
PeterPotkay wrote:
I'm with you on this one, in that it doesn't work the way we think it should. But check out this quote a few sentences later under CLWLPRTY:
Quote:
WebSphere MQ obtains the priority of queue managers after checking channel status. This means that only accessible queue managers are available for selection, and it allows WebSphere MQ to prioritize, where multiple destinations are available.
Yes! But just before that:
Quote:
Where there are two possible destinations, you can use this attribute to allow one queue manager to act as a failover, if the other queue manager becomes unavailable.
I would be very interested to learn how to use this attribute to allow one queue manager to act as a failover.
PeterPotkay wrote:
And if true, I think it makes CLWLPRTY a little less useful than you would first imagine.
I think we should read "it makes CLWLPRTY a lot less useful..."
PeterPotkay wrote:
I guess it boils down to: is any channel status other than RUNNING or INACTIVE going to be treated as an un-"accessible" QM, in which case it is ignored and CLWLPRTY takes a back seat to a starting channel?
I totally agree that CLWLPRTY could be checked AFTER channel status, but I don't understand why a "starting" channel is less desirable than an "inactive" one. It looks like a design bug...
_________________
lmd_at_demey-consulting.fr - http://demey-consulting.fr - Paris, France.
WMQ, WAS & IIB Certified.
#IBMChampion
PeterPotkay
Posted: Wed May 23, 2007 9:34 am
Poobah
Joined: 15 May 2001  Posts: 7722
I replied to your post on the list server as well with the same comments, as I know there are more Hursley people there too. Hopefully we'll get an answer.
I can see their thinking on the way it's working now. What if the channel to your 1st QM was stuck in BINDING for hours? The argument could be made that you do want to route traffic to QM #2.
But for the 2 or 3 seconds a channel is starting? I think that's too sensitive. But where is the cutoff for a channel that isn't running - 5 seconds? 5 minutes? You'll never please everyone no matter what they choose.
At the very least the doc needs to be more explicit on how this attribute works in relation to various channel statuses. If we are going down the right path here, that is.
_________________
Peter Potkay
Keep Calm and MQ On
Last edited by PeterPotkay on Thu May 24, 2007 1:56 pm; edited 1 time in total
Ivans
Posted: Thu May 24, 2007 11:45 am
Apprentice
Joined: 03 Jan 2006  Posts: 48  Location: Hursley
The relationship I described between channel status and CLWLPRTY is by design, so I'll get together with our technical writers and get the manual reviewed and updated.
If you have ideas on how we can improve this function in the future (e.g. a timing window on channel startup during which less preferential states are ignored), please let me know, but I suggest that you also submit an official requirement so that the request is officially logged.
Cheers,
Ian
PeterPotkay
Posted: Thu May 24, 2007 1:52 pm
Poobah
Joined: 15 May 2001  Posts: 7722
Ian, I think CLWLPRTY INTERVAL needs to be added as a new channel attribute.
Let's say it defaults to 10 seconds. This would mean that the cluster workload algorithm would still consider the higher priority channel a valid destination for 10 seconds, regardless of the channel status. If the channel has been in a state other than INACTIVE or RUNNING for more than 10 seconds, messages are then sent to the next priority channel. As soon as the higher priority channel's status returns to INACTIVE or RUNNING, the algorithm once again promotes that channel to the head of the list, and the 10 second counter is reset.
By having this parm available, MQ admins can determine how long an outage they are willing to tolerate before automatic failover to the lower priority QMs occurs. AND, the documentation for the new attribute would make it very clear how and why a lower priority channel starts getting messages ahead of a higher priority one.
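In MQSC it could look something like this; to be clear, CLWLPRTYINT is made up, it is the attribute being proposed, not something that exists today:
Code:
* HYPOTHETICAL -- proposed grace interval in seconds, NOT a real attribute:
* keep honouring CLWLPRTY(6) for up to 10 seconds of non-RUNNING,
* non-INACTIVE channel status before failing over to a lower priority
ALTER CHANNEL(TO.SAGP) CHLTYPE(CLUSRCVR) CLWLPRTY(6) CLWLPRTYINT(10)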
Sounds simple enough, although you, being familiar with the internals, might think otherwise. What do you think?
_________________
Peter Potkay
Keep Calm and MQ On
LMD
Posted: Thu May 24, 2007 2:11 pm
Acolyte
Joined: 30 Oct 2002  Posts: 56  Location: Paris - France
PeterPotkay wrote:
Ian, I think CLWLPRTY INTERVAL needs to be added as a new channel attribute.
.../...
Sounds simple enough, although you, being familiar with the internals, might think otherwise. What do you think?
Yes Peter,
this is exactly what we need for reliable behaviour of MQ clusters.
How long until this magic parameter is included in the code?
MQ 6.0.2.3? MQ 6.0.3? MQ 7.0? ...
We (the project I am working on) can live without it (and so run the architecture unchanged) if we can be assured that this parameter will exist in the near future.
Hursley, any comments?
_________________
lmd_at_demey-consulting.fr - http://demey-consulting.fr - Paris, France.
WMQ, WAS & IIB Certified.
#IBMChampion
fjb_saper
Posted: Thu May 24, 2007 2:23 pm
Grand High Poobah
Joined: 18 Nov 2003  Posts: 20756  Location: LI, NY
10 seconds is an awfully long time if your processes are supposed to be subsecond...
But a decent parameter there would help.
As would a clear rule as to how the next channel is determined, and please an average estimated time for the channels as they go through the status changes,
e.g. from inactive to running:
a) inactive: ?? (hours)
b) starting: 2 ms
c) binding: 5 ms
d) running: ?? (hours)
So if a channel doesn't go from inactive to running in less than 20 ms (here starting plus binding would have been 7 ms, so at 20 ms we are at about 3 times that), let's try the next one...
_________________
MQ & Broker admin
PeterPotkay
Posted: Thu May 24, 2007 2:38 pm
Poobah
Joined: 15 May 2001  Posts: 7722
fjb_saper wrote:
10 seconds is an awfully long time if your processes are supposed to be subsecond...
A perfect example of why this should be configurable.
Company A manager:
Quote:
"WHAT??? It took 10 whole seconds to fail over to our secondary cluster channels? We can't wait 10 seconds for traffic to start flowing again! @&#$&!# That no good MQ!"
Company B manager:
Quote:
"WHAT??? We failed over MQ traffic to our DR-site clustered QMs just because there was a delay of 10 seconds in starting up that channel? Why, it looks like the higher priority channel did start up 20 seconds later. Why in the world did we fail over? Why's it so sensitive? #*%*#%*! That no good MQ!"
_________________
Peter Potkay
Keep Calm and MQ On
Ivans
Posted: Thu May 24, 2007 2:50 pm
Apprentice
Joined: 03 Jan 2006  Posts: 48  Location: Hursley
Quote:
How long until this magic parameter is included in the code?
MQ 6.0.2.3? MQ 6.0.3? MQ 7.0? ...
Please enter an official requirement and you will receive an official reply. But to set expectations... IBM does not announce features for future releases on a feature-by-feature basis on request, but rather through an official announcement preceding availability of the products.
A CLWLPRTY interval is worth considering further, so I'll certainly raise this idea, and the doc change, with the development team.
LMD
Posted: Mon May 28, 2007 12:41 am
Acolyte
Joined: 30 Oct 2002  Posts: 56  Location: Paris - France
Ivans,
we will open an official requirement on this.
Can you point me to the right procedure?
Thanks in advance.
_________________
lmd_at_demey-consulting.fr - http://demey-consulting.fr - Paris, France.
WMQ, WAS & IIB Certified.
#IBMChampion
Ivans
Posted: Mon May 28, 2007 6:26 am
Apprentice
Joined: 03 Jan 2006  Posts: 48  Location: Hursley