Redundant Brokers |
Author |
Message
|
PeterPotkay |
Posted: Mon Feb 06, 2012 9:59 am Post subject: Redundant Brokers |
|
|
 Poobah
Joined: 15 May 2001 Posts: 7722
|
lancelotlinc wrote: |
Update your WMB runtime to latest version.
For future reference, consider redundant brokers rather than multi-instance brokers, when designing for fault tolerance, performance, or high-availability.
Geographically separate your redundant brokers. |
Honest question: How do you deal with messages that were in flight on the QM / Broker / Server that goes down?
M.I. or hardware clustered brokers should fail over, start up, and complete any messages that were sitting in the queues, although there may be several minutes before that happens. This was always the reason given for a bunch of non MI or non hardware clustered Brokers (load balanced with MQ Clustering, Client Channel Tables and HTTP Load Balancers) not being enough. But sometimes it seems the hardware clustering and/or M.I. software causes more problems than it solves.
Advocates of stand alone QMs / brokers will say the underlying hardware is so much more stable now. If you go to z/Linux or VMware you don't have to worry about hardware failures. And the Broker is so fast you rarely have messages sitting in queues waiting anyway, so if it should go down you risk very few messages being stranded. Of course the one that might get stranded might be a million dollar message.
So, if you want to get rid of MI or hardware clustered Brokers / QMs, how do you deal with messages that were in flight on the QM / Broker / Server that goes down? _________________ Peter Potkay
Keep Calm and MQ On |
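The "very few messages stranded" argument in the post above can be made concrete with Little's law (average number of messages in the system = arrival rate × average time in the system). The following is a minimal Python sketch only: the rates and latencies are made-up illustrative numbers, not measurements from any broker, and `expected_in_flight` is a hypothetical helper name.

```python
def expected_in_flight(arrival_rate_per_s: float, avg_time_in_system_s: float) -> float:
    """Little's law: L = lambda * W (average messages queued or being processed)."""
    return arrival_rate_per_s * avg_time_in_system_s


# Hypothetical workloads: (messages per second, average seconds from arrival to completion).
scenarios = {
    "fast broker, light load": (50.0, 0.02),
    "slow back end, heavy load": (500.0, 2.0),
}

for name, (rate, latency) in scenarios.items():
    at_risk = expected_in_flight(rate, latency)
    print(f"{name}: about {at_risk:g} message(s) in flight at any instant")
```

The estimate only says how many messages are likely to be stranded when a server goes down, not how much any one of them is worth, which is the point of the "million dollar message" remark.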
|
|
 |
lancelotlinc |
Posted: Mon Feb 06, 2012 10:10 am Post subject: |
|
|
 Jedi Knight
Joined: 22 Mar 2010 Posts: 4941 Location: Bloomington, IL USA
|
Great topic of discussion, Peter. I'm sure we'll trade some thoughts on this thread. I will respond more as the thread progresses, so let me first start the discussion by laying a foundation of information. If the responders of this thread would read the following links first, it will help us have a common background for discussing the pros and cons.
This link discusses CLWLPRTY and NETPRTY and provides a real-world example (quoted in case future readers cannot access the link):
http://www-01.ibm.com/support/docview.wss?uid=swg21234112
Quote: |
Suppose you have a (comprehensive) railway network and a road network. Both can get you where you want to go. You prefer to use the railway because it is cheaper. So, you give the railway a NETPRTY of 7 and the roads a NETPRTY of 3.
Now, for some of your operation you send to town (queue manager) "Charlotte". But you have a standby in "Greensboro". So, you give Charlotte a CLWLPRTY of 8 and Greensboro a CLWLPRTY of 2.
Your system runs. Everything goes to Charlotte by rail. The rail system breaks down and everything now goes to Greensboro by road. The rail system is fixed, so everything goes back to using rail.
Implementing this in MQ terminology, you define four channels using both attributes NETPRTY and CLWLPRTY:
Channel                  NETPRTY  CLWLPRTY
TO.CHARLOTTE_BY_RAIL     7        8
TO.GREENSBORO_BY_RAIL    7        2
TO.CHARLOTTE_BY_ROAD     3        8
TO.GREENSBORO_BY_ROAD    3        2
The "rail" and "road" networks are simply different tcp/ip routings to the same destinations. This is the recommended implementation.
You could instead have chosen to implement this using just one of the attributes, NETPRTY:
Channel                  NETPRTY
TO.CHARLOTTE_BY_RAIL     9
TO.GREENSBORO_BY_RAIL    8
TO.CHARLOTTE_BY_ROAD     7
TO.GREENSBORO_BY_ROAD    6
However, this alternative does not convey the conceptual separation of ideas that using the two attributes achieves, and it would also require renumbering in the event that a third network was added. Note that in the recommended implementation, consistent values of NETPRTY and CLWLPRTY are used across the channels involved. There are a great many other sets of values that could be used to achieve the same result. However, assigning the same priority numbers to each channel is useful in understanding what the numbers mean (for example, NETPRTY=7 means the rail network).
For reference, here are the best descriptions of these two attributes:
CLWLPRTY channel attribute
-------------------------------------------
To apply a priority factor to a channel for the purposes of cluster workload distribution use the CLWLPRTY attribute. The value must be in the range zero through 9 where zero is the lowest priority and 9 is the highest. This parameter is valid only for channels with a channel type (CHLTYPE) of CLUSSDR or CLUSRCVR.
Use this attribute to ensure that WebSphere MQ selects some destination queue managers in preference to others with a lower priority. WebSphere MQ selects the destinations with the highest priority before selecting destinations with the lowest cluster destination sequence number (or the most recently used one). Where there are two possible destinations, you can use this attribute to allow one queue manager to act as a failover, if the other queue manager becomes unavailable. In this situation, messages go to the highest priority queue manager until it becomes unavailable, they then go to the next priority queue manager. WebSphere MQ obtains the priority of queue managers after checking channel status. This means that only accessible queue managers are available for selection, and it allows WebSphere MQ to prioritize, where multiple destinations are available.
NETPRTY channel attribute
-----------------------------------------
To apply a network priority to a channel for workload management purposes use the NETPRTY attribute. This attribute specifies the priority for the network connection. Use this attribute to make one network the primary network, and another network the backup network that can be used when there is a problem with the primary network. Clustering chooses the path with the highest priority if there are multiple paths available. The value must be in the range zero through 9; zero is the lowest priority. |
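To make the quoted rail/road example easier to experiment with, here is a small Python sketch of a two-step selection: NETPRTY picks the preferred path to each destination, then CLWLPRTY picks the preferred destination among those still reachable. This is a deliberately simplified model and an assumption on my part, not the full WebSphere MQ workload algorithm (it ignores channel states, CLWLRANK, weighting and round-robin between equal candidates). `Channel`, `CHANNELS` and `choose_channel` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Set


@dataclass(frozen=True)
class Channel:
    name: str
    destination: str  # queue manager ("town") the channel leads to
    netprty: int      # network priority, 0 (lowest) to 9 (highest)
    clwlprty: int     # cluster workload priority, 0 (lowest) to 9 (highest)


# The four channels from the quoted example.
CHANNELS = [
    Channel("TO.CHARLOTTE_BY_RAIL", "Charlotte", 7, 8),
    Channel("TO.GREENSBORO_BY_RAIL", "Greensboro", 7, 2),
    Channel("TO.CHARLOTTE_BY_ROAD", "Charlotte", 3, 8),
    Channel("TO.GREENSBORO_BY_ROAD", "Greensboro", 3, 2),
]


def choose_channel(channels: Iterable[Channel], available: Set[str]) -> Optional[Channel]:
    """Pick one channel from those whose names are in `available` (simplified model)."""
    usable = [c for c in channels if c.name in available]
    if not usable:
        return None
    # Step 1 (NETPRTY): for each destination keep only its highest-priority path.
    best_path: Dict[str, Channel] = {}
    for c in usable:
        current = best_path.get(c.destination)
        if current is None or c.netprty > current.netprty:
            best_path[c.destination] = c
    # Step 2 (CLWLPRTY): among reachable destinations take the highest priority.
    # Ties simply go to the first candidate here; the real product balances them.
    return max(best_path.values(), key=lambda c: c.clwlprty)


everything = {c.name for c in CHANNELS}
print(choose_channel(CHANNELS, everything).name)                 # TO.CHARLOTTE_BY_RAIL
print(choose_channel(CHANNELS, {"TO.GREENSBORO_BY_ROAD"}).name)  # TO.GREENSBORO_BY_ROAD
```

In this simplified model, losing only the two rail channels selects TO.CHARLOTTE_BY_ROAD, because Charlotte's CLWLPRTY still wins. Traffic reaches Greensboro only once no channel to Charlotte is available.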
Cluster manual:
http://publibfp.boulder.ibm.com/epubs/pdf/csqzah07.pdf
An interesting APAR that talks about how traffic is sent across:
http://www-01.ibm.com/support/docview.wss?uid=swg1PK51439
Lastly, I would advocate that nothing is a silver bullet, without risk, or without design challenge.
It may help us to define business objective, or a sample scenario. _________________ http://leanpub.com/IIB_Tips_and_Tricks
Save $20: Coupon Code: MQSERIES_READER |
|
|
 |
Vitor |
Posted: Mon Feb 06, 2012 10:15 am Post subject: Re: Redundant Brokers |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
PeterPotkay wrote: |
lancelotlinc wrote: |
Update your WMB runtime to latest version.
For future reference, consider redundant brokers rather than multi-instance brokers, when designing for fault tolerance, performance, or high-availability.
Geographically separate your redundant brokers. |
|
To give context, that's a quote. _________________ Honesty is the best policy.
Insanity is the best defence. |
|
|
 |
Vitor |
Posted: Mon Feb 06, 2012 10:22 am Post subject: |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
lancelotlinc wrote: |
It may help us to define business objective, or a sample scenario. |
I'll throw one in from my current local situation:
Messages arrive on an inbound queue which can be request/reply or asynchronous. The requests are typically customer inquiries from a human (account balance, creditworthiness, etc, etc) or updates to said customer details (or new customers). SLAs are tight and message values can be high (loans or deposits in the thousands or millions). The howl from the business on message delay is shrill (the asynchronous updates are the same - they push the button to do the update, then keep refreshing the inquiry screen to see if it's changed yet); I shudder at the thought of an update stuck on a downed queue manager. I strongly doubt it would make a difference if it was a $3,000,000 line of credit for a business or some silver-haired granny trying to borrow $400 for a plane ticket to visit her grandchildren - all customers here are "high value"!
Because of this we have active / passive on all the components for resilience, and a geographically separate mirrored DR site (not just for WMQ/WMB, the whole site is active/passive & mirrored). We use WMQ clustering for workload. _________________ Honesty is the best policy.
Insanity is the best defence. |
|
|
 |
mqjeff |
Posted: Mon Feb 06, 2012 11:23 am Post subject: |
|
|
Grand Master
Joined: 25 Jun 2008 Posts: 17447
|
Great link!
It fails to address the question of what happens to messages that have already been delivered to a given cluster queue manager, but not yet processed when that cluster queue manager dies. |
|
|
 |
adubya |
Posted: Mon Feb 06, 2012 12:01 pm Post subject: |
|
|
Partisan
Joined: 25 Aug 2011 Posts: 377 Location: GU12, UK
|
We use HACMP with shared storage (where the MQ and broker data/config live). Two servers form the HACMP cluster, each running their own broker, both of which are in an MQ cluster for workload. If a "failure event" occurs on one server then HA starts up the queue manager and broker on the other server which now has two queue managers and two broker instances running. The shared storage means that all queue data from the failed server is available on the remaining server.
We have floating server addresses which HA manages and switches between the servers as appropriate. |
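The failover behaviour described in the post above can be pictured with a toy Python model: each server normally runs its own queue manager and broker, and after a failure the surviving server runs both pairs from the shared storage. This is only an illustration of the placement change, not HACMP configuration. The server and queue manager names and the `fail_over` helper are hypothetical.

```python
from typing import Dict, List


def fail_over(placement: Dict[str, List[str]], failed: str) -> Dict[str, List[str]]:
    """Return a new placement with the failed server's instances restarted on a survivor."""
    survivors = [server for server in placement if server != failed]
    if not survivors:
        raise RuntimeError("no surviving server to take over")
    target = survivors[0]
    # Copy the lists so the caller's original placement is left untouched.
    new_placement = {s: list(instances) for s, instances in placement.items() if s != failed}
    new_placement[target].extend(placement[failed])
    return new_placement


# Normal running: one queue manager + broker pair per server.
placement = {"serverA": ["QM1+BROKER1"], "serverB": ["QM2+BROKER2"]}
print(fail_over(placement, "serverA"))  # {'serverB': ['QM2+BROKER2', 'QM1+BROKER1']}
```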
|
|
 |
Vitor |
Posted: Mon Feb 06, 2012 12:35 pm Post subject: |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
adubya wrote: |
We use HACMP with shared storage (where the MQ and broker data/config live). Two servers form the HACMP cluster, each running their own broker, both of which are in an MQ cluster for workload. If a "failure event" occurs on one server then HA starts up the queue manager and broker on the other server which now has two queue managers and two broker instances running. The shared storage means that all queue data from the failed server is available on the remaining server.
We have floating server addresses which HA manages and switches between the servers as appropriate. |
It's a very common set up. This thread is exploring @lancelotlinc's alternative. _________________ Honesty is the best policy.
Insanity is the best defence. |
|
|
 |
lancelotlinc |
Posted: Mon Feb 06, 2012 12:40 pm Post subject: |
|
|
 Jedi Knight
Joined: 22 Mar 2010 Posts: 4941 Location: Bloomington, IL USA
|
mqjeff wrote: |
Great link!
It fails to address the question of what happens to messages that have already been delivered to a given cluster queue manager, but not yet processed when that cluster queue manager dies. |
Thanks mqjeff.
I agree with you, this is a sticky sub-topic in our overall discussion. There is the potential of lost messages.
In order to design to the business requirement, there needs to be a ranking of priorities. Some would rate as "absolute" a hard-and-fast requirement that no messages are lost, no messages are duplicated, and no messages are processed out of order in the event of a failure.
Would you agree with this statement? ~~ As an absolute design requirement, this becomes quite expensive no matter which solution is chosen. _________________ http://leanpub.com/IIB_Tips_and_Tricks
Save $20: Coupon Code: MQSERIES_READER |
|
|
 |
Vitor |
Posted: Mon Feb 06, 2012 12:46 pm Post subject: |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
lancelotlinc wrote: |
In order to design to the business requirement, there needs to be a ranking of priorities. Some would rate as "absolute" a hard-and-fast requirement that no messages are lost, no messages are duplicated, and no messages are processed out of order in the event of a failure. |
I don't agree. We're discussing (or I'm discussing) a message which is not lost, not duplicated and has no affinity with any other message. It's in a known location (the local queue of a downed queue manager), is not duplicated (there's only that one copy) and all other messages before and after it have been processed quite happily in its absence. The design requirement that's not being met is that it can't be processed until the queue manager comes back up.
This is not an unreasonable requirement for any business to have. It's hardly a priority design requirement for messages to be processed within an SLA. _________________ Honesty is the best policy.
Insanity is the best defence. |
|
|
 |
Vitor |
Posted: Mon Feb 06, 2012 12:47 pm Post subject: |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
lancelotlinc wrote: |
There is the potential of lost messages. |
How? Worst case in your solution is that the message is stuck on a queue manager until it comes back up? _________________ Honesty is the best policy.
Insanity is the best defence. |
|
|
 |
lancelotlinc |
Posted: Mon Feb 06, 2012 12:54 pm Post subject: |
|
|
 Jedi Knight
Joined: 22 Mar 2010 Posts: 4941 Location: Bloomington, IL USA
|
Vitor wrote: |
lancelotlinc wrote: |
In order to design to the business requirement, there needs to be a ranking of priorities. Some would rate as "absolute" a hard-and-fast requirement that no messages are lost, no messages are duplicated, and no messages are processed out of order in the event of a failure. |
I don't agree. We're discussing (or I'm discussing) a message which is not lost, not duplicated and has no affinity with any other message. It's in a known location (the local queue of a downed queue manager), is not duplicated (there's only that one copy) and all other messages before and after it have been processed quite happily in its absence. The design requirement that's not being met is that it can't be processed until the queue manager comes back up.
This is not an unreasonable requirement for any business to have. It's hardly a priority design requirement for messages to be processed within an SLA. |
Just coming to your post, Sir Vitor.
There are ways to assure timely delivery of data, even if a message is stuck on a downed QMGR. A potential solution that addresses this, is to have a data management system that tracks the progress of data across the transactional boundaries, much like your human is trying to do, but automated. In the insurance industry package sold by IBM, this is called Activity Condition Place. _________________ http://leanpub.com/IIB_Tips_and_Tricks
Save $20: Coupon Code: MQSERIES_READER |
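As a rough picture of the kind of tracking described in the post above, here is a minimal Python sketch of a component that records when each request was sent and reports those that have not completed within the SLA. It is an assumption-laden illustration, not the IBM package mentioned. `SlaTracker`, its methods, and the use of plain numeric timestamps are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SlaTracker:
    sla_seconds: float
    # message id -> time the request was sent (seconds on any monotonic clock)
    _sent_at: Dict[str, float] = field(default_factory=dict, init=False)

    def mark_sent(self, message_id: str, now: float) -> None:
        self._sent_at[message_id] = now

    def mark_completed(self, message_id: str) -> None:
        # Completed requests are no longer tracked; unknown ids are ignored.
        self._sent_at.pop(message_id, None)

    def overdue(self, now: float) -> List[str]:
        """Ids of requests still outstanding for longer than the SLA."""
        return sorted(m for m, sent in self._sent_at.items() if now - sent > self.sla_seconds)


tracker = SlaTracker(sla_seconds=5.0)
tracker.mark_sent("req-1", now=0.0)
tracker.mark_sent("req-2", now=1.0)
tracker.mark_completed("req-1")
print(tracker.overdue(now=10.0))  # ['req-2']
```

Detecting the overdue request is the easy part. What to do about it (alert a human, re-send, wait for the queue manager) is a separate design decision that the sketch deliberately leaves open.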
|
|
 |
mqjeff |
Posted: Mon Feb 06, 2012 12:58 pm Post subject: |
|
|
Grand Master
Joined: 25 Jun 2008 Posts: 17447
|
lancelotlinc wrote: |
There are ways to assure timely delivery of data, even if a message is stuck on a downed QMGR. A potential solution that addresses this, is to have a data management system that tracks the progress of data across the transactional boundaries, much like your human is trying to do, but automated. In the insurance industry package sold by IBM, this is called Activity Condition Place. |
Yes, so one can design or purchase an expensive application that is capable of re-emitting a message that is known to be delayed outside of its SLA.
Or one can invest in an HA solution that goes as far as possible to make sure that a queue manager is restarted in enough time to process most messages that are on it within their SLA.
In either case, one still has to design and code one's business applications to do the right thing when a message is not processed within its SLA, and train one's users to do the right thing in such an occurrence as well.
|
|
 |
Vitor |
Posted: Mon Feb 06, 2012 1:01 pm Post subject: |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
lancelotlinc wrote: |
There are ways to assure timely delivery of data, even if a message is stuck on a downed QMGR. A potential solution that addresses this, is to have a data management system that tracks the progress of data across the transactional boundaries, much like your human is trying to do, but automated. In the insurance industry package sold by IBM, this is called Activity Condition Place. |
Ok, so this gadget can determine that a given piece of data has not arrived where it's supposed to. It can't possibly retrieve the actual message from the downed queue manager so how does it rectify the situation without sending another copy, violating the once-and-once-only-delivery which WMQ provides & which the application was designed to expect? Given that the original message will pop up like a Jack-in-the-Box as soon as the queue manager restarts?
And how is buying this different from using HA software? _________________ Honesty is the best policy.
Insanity is the best defence. |
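The duplicate problem raised in the post above is usually handled, if at all, in the receiving application rather than in the messaging layer. The Python sketch below shows one common pattern, an idempotent consumer that ignores a second copy of the same business transaction. It assumes every copy (the re-sent one and the original that reappears after restart) carries the same business identifier, which nothing in this thread guarantees. `IdempotentConsumer` and the identifiers are hypothetical, and a real implementation would need a persistent store updated in the same unit of work as the processing.

```python
from typing import Callable, Set


class IdempotentConsumer:
    def __init__(self, handler: Callable[[str], None]) -> None:
        self._handler = handler
        self._seen: Set[str] = set()  # in-memory only; a sketch, not durable

    def on_message(self, business_id: str, payload: str) -> bool:
        """Process the payload once per business_id; return False for a duplicate."""
        if business_id in self._seen:
            return False
        self._handler(payload)
        # Record the id only after the handler succeeded.
        self._seen.add(business_id)
        return True


processed = []
consumer = IdempotentConsumer(processed.append)
print(consumer.on_message("txn-42", "lend 400"))  # True: the re-sent copy arrives first
print(consumer.on_message("txn-42", "lend 400"))  # False: the original appears after restart
print(processed)                                  # ['lend 400']
```

This moves the once-only guarantee from the queue manager into application code, with its own cost and failure modes, so it does not by itself settle whether it is preferable to HA software.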
|
|
 |
lancelotlinc |
Posted: Mon Feb 06, 2012 1:04 pm Post subject: |
|
|
 Jedi Knight
Joined: 22 Mar 2010 Posts: 4941 Location: Bloomington, IL USA
|
mqjeff wrote: |
lancelotlinc wrote: |
There are ways to assure timely delivery of data, even if a message is stuck on a downed QMGR. A potential solution that addresses this, is to have a data management system that tracks the progress of data across the transactional boundaries, much like your human is trying to do, but automated. In the insurance industry package sold by IBM, this is called Activity Condition Place. |
Yes, so one can design or purchase an expensive application that is capable of re-emitting a message that is known to be delayed outside of its SLA.
Or one can invest in an HA solution that goes as far as possible to make sure that a queue manager is restarted in enough time to process most messages that are on it within their SLA.
In either case, one still has to design and code one's business applications to do the right thing when a message is not processed within its SLA, and train one's users to do the right thing in such an occurrence as well. |
Agree, no argument here. Every solution has its features and benefits which are touted by its marketing people. Every solution also has its drawbacks or less-than-optimized compromises, which are usually not shouted about.
What I'm trying to draw from you, is a priority list, that explains under a scenario, what are the priorities.
For example, is geographic dispersion higher in priority than system cost? _________________ http://leanpub.com/IIB_Tips_and_Tricks
Save $20: Coupon Code: MQSERIES_READER |
|
|
 |
mqjeff |
Posted: Mon Feb 06, 2012 1:10 pm Post subject: |
|
|
Grand Master
Joined: 25 Jun 2008 Posts: 17447
|
lancelotlinc wrote: |
What I'm trying to draw from you, is a priority list, that explains under a scenario, what are the priorities. |
time+money < value. |
|
|
 |