Author |
Message
|
malammik |
Posted: Thu Apr 03, 2008 10:24 am Post subject: Recovery Actions |
|
|
 Partisan
Joined: 27 Jan 2005 Posts: 397 Location: Philadelphia, PA
|
This is a discussion question. I am hoping to get some feedback from the community that will be valuable to all of us. Do you guys think a monitoring tool should attempt to fix a problem, or provide facilities to do so?
My personal opinion is simple: monitoring is there to detect situations you did not think could happen during design and implementation; otherwise you would've accounted for them in the design. So if something goes wrong, it does need human intervention.
Let's take a look at some examples:
1. Queue manager goes down. Recovery action: try to restart it. In my experience the most common reason for a queue manager going down is lack of storage, so restarting won't do anything. On the downside, one might be increasing resolution time by starting a queue manager that will not come up successfully.
2. Queue full. Recovery action: offload the queue and archive it. But if you have the space to do that, why not use it to allow a bigger queue depth in the first place?
These are just examples. I am more interested in your strategic opinions on whether any kind of automatic recovery is useful or not. _________________ Mikhail Malamud
http://www.netflexity.com
http://groups.google.com/group/qflex |
|
|
 |
Michael Dag |
Posted: Thu Apr 03, 2008 12:36 pm Post subject: |
|
|
 Jedi Knight
Joined: 13 Jun 2002 Posts: 2607 Location: The Netherlands (Amsterdam)
|
malammik,
Most failures I have seen were not queue full (dimensions set wrongly), but filesystem full or log full.
It would be interesting to see if the MQ monitor could include something as basic as a filesystem monitor for the filesystems the queues and logs are on...
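A minimal sketch of the kind of basic filesystem check suggested above, in Python. The function name, the 70% default threshold and the idea of passing in a list of paths are assumptions for illustration, not features of any particular MQ monitoring tool.

```python
import shutil
from typing import Dict, List


def check_filesystems(paths: List[str], warn_pct: float = 70.0) -> Dict[str, float]:
    """Return {path: percent_used} for each path at or above warn_pct.

    `paths` and `warn_pct` are hypothetical parameters chosen for this sketch.
    """
    over: Dict[str, float] = {}
    for path in paths:
        usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
        if usage.total == 0:
            continue  # skip pseudo-filesystems that report no capacity
        pct = 100.0 * usage.used / usage.total
        if pct >= warn_pct:
            over[path] = round(pct, 1)
    return over
```

A monitor could call this periodically with the directories holding the queue files and the logs (for example `/var/mqm/qmgrs` and `/var/mqm/log`, assuming the usual Unix defaults have not been moved) and raise an alert for anything returned.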
Automatic recovery often makes things worse, like you said. Most problems come from not intervening in time, or from simple ignorance...  _________________ Michael
MQSystems Facebook page |
|
|
 |
dgolding |
Posted: Thu Apr 03, 2008 11:43 pm Post subject: |
|
|
 Yatiri
Joined: 16 May 2001 Posts: 668 Location: Switzerland
|
I once implemented a rule that when the MQ filesystem hit a certain percentage (60-70%, say), a housekeeping job was kicked off to clean up the linear logs (which was done periodically anyway).
The filesystem monitoring was a Tivoli process owned by the Unix team.
One problem we found was that the "emergency" housekeep got kicked off when the regular one was already running, so I had to build in a check for another running instance before proceeding. But apart from that it seemed to work quite well.
The real solution is not to use linear logs.
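The "is another instance already running?" check described above can be sketched in Python with an advisory file lock. This is only an illustration under assumptions: the function names, the lock file path and the `cleanup` callback are hypothetical, and `fcntl.flock` is Unix-only. It is not the script that was actually used.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def single_instance(lock_path):
    """Yield True if this caller holds the lock, False if someone else does."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            # Non-blocking exclusive lock: fails at once if already held.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            yield False
            return
        try:
            yield True
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)


def run_housekeeping(lock_path, percent_used, threshold_pct, cleanup):
    """Run cleanup() only when over the threshold and no other run is active."""
    if percent_used < threshold_pct:
        return "below-threshold"
    with single_instance(lock_path) as acquired:
        if not acquired:
            return "already-running"
        cleanup()  # hypothetical callback, e.g. the linear log clean-up job
        return "cleaned"
```

If both the regular and the "emergency" housekeep go through the same lock file, whichever starts second returns `"already-running"` instead of colliding with the first. The lock is released by the operating system if the process dies, so a crashed run does not leave a stale lock behind.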
 |
|
|
 |
Vitor |
Posted: Fri Apr 04, 2008 12:43 am Post subject: |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
Another source of queue manager failure is a lack of non-disk machine resources. It's feasible that a badly behaved (or badly coded) app hogs enough resources to bring the queue manager down, so it's a valid strategy to attempt an auto-restart once. The problems come when the automatic process is not sufficiently bright to know it's beaten and keeps trying to restart the queue manager for several hours unsuccessfully!
This works best when combined with a basket of "likely" problems: housekeeping on the disk, checks on memory, that sort of thing. A lot of it depends on the ethos of the site. A proactive site will likely have tight housekeeping that prevents these common problems, so a crash is likely to be something requiring manual action, whereas a reactive site will use this kind of setup to fix problems when they manifest.
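As a rough Python sketch of "restart once, then know when you're beaten": the restart is attempted a bounded number of times, only after a basket of prechecks passes, and every outcome is reported. All names here (`attempt_recovery`, `prechecks`, `restart`, `is_running`, `alert`) are hypothetical hooks for illustration. How the queue manager is actually restarted and probed is left to the caller.

```python
import time
from typing import Callable, Iterable, Tuple


def attempt_recovery(
    prechecks: Iterable[Tuple[str, Callable[[], bool]]],
    restart: Callable[[], None],
    is_running: Callable[[], bool],
    alert: Callable[[str], None],
    max_attempts: int = 1,
    wait_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try a bounded number of restarts; return True if the service came up."""
    # Basket of "likely" problems (disk, memory, ...): skip the restart
    # entirely if any of them would make it pointless.
    failed = [name for name, ok in prechecks if not ok()]
    if failed:
        alert("not restarting; failed prechecks: " + ", ".join(failed))
        return False

    for attempt in range(1, max_attempts + 1):
        restart()
        sleep(wait_seconds)  # give the restart time to take effect
        if is_running():
            alert(f"restarted on attempt {attempt}; root cause still needs review")
            return True

    # Beaten: stop trying and hand over to a human.
    alert(f"gave up after {max_attempts} restart attempt(s); manual action needed")
    return False
```

Defaulting `max_attempts` to 1 matches the "attempt an auto-restart once" strategy. The alert on success is deliberate, so an automatic fix never hides the fact that something went down.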
On the other scenario, I've been on a site where a "queue full" error was dealt with by unloading the queue to a backup server. The rationale was that a queue full indicated the app reading the queue needed to be restarted (usually by bouncing the app server, done automatically). The "old" contents of the queue were farmed off to allow new stuff through and then drip fed back in once the app was deemed stable.
My personal opinion: if you want true automatic problem resolution, use an HA solution, set to fail over in a wide variety of cases. Also, don't expect any solution to fit every single case. _________________ Honesty is the best policy.
Insanity is the best defence. |
|
|
 |
SAFraser |
Posted: Tue Apr 15, 2008 1:25 pm Post subject: |
|
|
 Shaman
Joined: 22 Oct 2003 Posts: 742 Location: Austin, Texas, USA
|
The one type of auto recovery that I like is a reset/restart channel for a short time. If the reset/restart does not succeed after a specified time, then an alert should be generated.
This would be handy when there is a blip on the network and the channel needs a manual reset to recover. |
|
|
 |
PeterPotkay |
Posted: Tue Apr 15, 2008 1:31 pm Post subject: |
|
|
 Poobah
Joined: 15 May 2001 Posts: 7722
|
Network blips shouldn't cause channel sequence # errors.
Automatically resetting sequence #s masks bigger problems, since channel sequence # errors should not be happening all the time.
I suppose if you get an email any time the auto reset occurs, it might be OK, but I still don't like it.
It's like having a robot that intercepts your crying kid and slaps a band-aid on automatically. You probably want to know how they got their booboo, not just have RoboMommy band-aid it away! _________________ Peter Potkay
Keep Calm and MQ On |
|
|
 |
SAFraser |
Posted: Tue Apr 15, 2008 1:59 pm Post subject: |
|
|
 Shaman
Joined: 22 Oct 2003 Posts: 742 Location: Austin, Texas, USA
|
I don't think every retrying channel is evil. If the blip occurs when the message has been sent but the receiver hasn't sent back the ack, the channel will go to retry. Perhaps the channel should auto-recover, but sometimes it does not and needs a manual reset. I am not suggesting that a backout/commit should ever be done automatically, nor would I auto-reset the channel sequence itself. |
|
|
 |
mvic |
Posted: Tue Apr 15, 2008 2:04 pm Post subject: |
|
|
 Jedi
Joined: 09 Mar 2004 Posts: 2080
|
SAFraser wrote: |
This would be handy when there is a blip on the network and the channel needs a manual reset to recover. |
Doesn't a reset imply something is pretty badly wrong? I'm with Peter - it sounds like there's potentially a serious problem in there somewhere.
http://publib.boulder.ibm.com/infocenter/wmqv6/v6r0/topic/com.ibm.mq.csqzae.doc/ic10590_.htm
EDIT:
Just saw your update
Quote: |
nor would I auto-reset the channel sequence itself. |
Ah, so what reset is it that solves the problems? |
|
|
 |
fjb_saper |
Posted: Tue Apr 15, 2008 2:40 pm Post subject: |
|
|
 Grand High Poobah
Joined: 18 Nov 2003 Posts: 20756 Location: LI,NY
|
Maybe Shirley meant recycle?
Stop chl, start chl? _________________ MQ & Broker admin |
|
|
 |
SAFraser |
Posted: Tue Apr 15, 2008 6:31 pm Post subject: |
|
|
 Shaman
Joined: 22 Oct 2003 Posts: 742 Location: Austin, Texas, USA
|
I offered this once before, a long time ago, and some grand master somewhere got snooty with me. But unless Hursley tells me I'm wrong, I'm pretty sure about it....
A sender wants an ack back from the receiver. If it doesn't get it, the sender will try to confirm that the message was received. So a channel in retrying just means that the sender considers the state of receipt unknown. In earlier versions of MQ, 'reset channel' was always needed to force the sender to say "hey, did you get the message?"; and, when the receiver said "sure did", the 'start channel' command would restart the channel. (I've always assumed that the sender is comparing seqno and, when it matches, considers the ack to be good.)
Now in the newer versions, I think that this reset happens on its own because most of the time, a retrying channel will restart itself when the receiver is available again.
If a sender says "hey, did you get the message?", and the receiver says "nope, sure didn't", then the channel goes to indoubt status. Human intervention is a must in this case. Very infrequently, I might reset a channel sequence number; and this is only ever needed when the two queue managers are on different OS platforms. (No, I don't know why and yes, it is just anecdotal, and no, there's no reason why the OS should make a difference -- I'm just telling you what my own experience is.)
But every now and then, even with the newer versions of MQ, I'll find a channel in retrying (but not indoubt). And I'll issue 'reset chl', followed by 'start channel' and all is well. There is never any harm in 'reset chl' and, in fact, by old habit I always issue 'reset' before 'start'. There is no harm in a monitoring tool issuing reset/start, and since it works most of the time (for me), why not?
That's what I was thinking! |
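Pulling together the positions in this thread, here is a hedged Python sketch of what such a monitoring rule might look like: leave in-doubt channels to a human, try reset/start on a retrying channel only for a limited window, and send a notification every time an automatic action is taken. The function and its hooks (`get_status`, `reset_and_start`, `notify`) are hypothetical, and so is the assumption that the status hook returns the channel status as a string plus an in-doubt flag. How those hooks talk to MQ is outside the sketch.

```python
import time
from typing import Callable, Tuple


def recover_channel(
    get_status: Callable[[], Tuple[str, bool]],  # assumed: (status, indoubt)
    reset_and_start: Callable[[], None],
    notify: Callable[[str], None],
    window_seconds: float = 300.0,
    poll_seconds: float = 30.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return 'no-action', 'recovered' or 'escalated'."""
    status, indoubt = get_status()
    if indoubt:
        # Never resolve an in-doubt channel automatically.
        notify("channel is in doubt; manual action needed")
        return "escalated"
    if status != "RETRYING":
        return "no-action"

    deadline = clock() + window_seconds
    while clock() < deadline:
        reset_and_start()
        notify("automatic reset/start issued")  # nothing is fixed silently
        sleep(poll_seconds)
        status, indoubt = get_status()
        if indoubt:
            notify("channel went in doubt; manual action needed")
            return "escalated"
        if status == "RUNNING":
            return "recovered"

    notify("channel still not running after the retry window; manual action needed")
    return "escalated"
```

The time window covers the "alert if it does not succeed after a specified time" requirement, and the per-action notification addresses the worry that automatic resets mask a bigger problem.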
|
|
 |
fjb_saper |
Posted: Tue Apr 15, 2008 8:32 pm Post subject: |
|
|
 Grand High Poobah
Joined: 18 Nov 2003 Posts: 20756 Location: LI,NY
|
Thanks Shirley for clearing that up.
To my shame I must say I never used reset without the seqnum addendum...
We learn every day...  _________________ MQ & Broker admin |
|
|
 |
|