MQSeries.net :: View topic - Recovery Actions


Recovery Actions


malammik

Posted: Thu Apr 03, 2008 10:24 am Post subject: Recovery Actions

Partisan

Joined: 27 Jan 2005
Posts: 397
Location: Philadelphia, PA

This is a discussion question. I am hoping to get some feedback from the community that will be valuable to all of us. Do you guys think a monitoring tool should attempt to fix a problem, or provide facilities to do so?

My personal opinion is simple: monitoring is there to detect situations you did not think could happen during design and implementation; otherwise you would've accounted for them during design. So if something goes wrong, it does need human intervention.

Let's take a look at some examples:

1. Queue manager goes down. Recovery action: try to restart it. In my experience the most common reason for a queue manager going down has been lack of storage, so restarting won't do anything. On the downside, one might be increasing resolution time by starting a queue manager that will not come up successfully.

2. Queue full. Recovery action could offload the queue and archive it. But if you have the space to do that, why not use it to allow a bigger queue depth in the first place?

These are just examples. I am more interested in the strategic opinions you guys have whether any kind of automatic recovery is useful vs not useful.
_________________
Mikhail Malamud
http://www.netflexity.com
http://groups.google.com/group/qflex

Michael Dag

Posted: Thu Apr 03, 2008 12:36 pm Post subject:

Jedi Knight

Joined: 13 Jun 2002
Posts: 2607
Location: The Netherlands (Amsterdam)

malammik,
most failures I have seen were not queue full (dimensions set wrongly), but filesystem full or log full.

It would be interesting to see if the MQ monitor could include something as basic as a filesystem monitor for the filesystem the queues and logs are on...

automatic recovery often makes things worse like you said, most problems occur by not intervening in time or simple ignorance...

_________________
Michael

MQSystems Facebook page

dgolding

Posted: Thu Apr 03, 2008 11:43 pm Post subject:

Yatiri

Joined: 16 May 2001
Posts: 668
Location: Switzerland

I once implemented a rule that when the MQ filesystem hit a certain percentage (60-70%, say) a housekeep was kicked off to clean up the linear logs (which was done periodically anyway).

The filesystem monitoring was a Tivoli process owned by the Unix team.

One problem we found was that the "emergency" housekeep got kicked off when the regular one was already running - so I had to build in a check if another instance was running before proceeding. But apart from that it seemed to work quite well.
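The rule described above could be sketched roughly like this in Python. This is only an illustration, not the original Tivoli setup: the function names, the lock-file approach and the 70% default are all assumptions, and `housekeep` stands in for whatever job actually cleans up the linear logs.

```python
import os
import shutil


def usage_percent(path):
    """Return how full the filesystem holding `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def run_housekeep_once(lock_path, housekeep):
    """Run `housekeep` unless another instance already holds the lock file.

    Returns True if housekeeping ran, False if it was skipped.
    """
    try:
        # O_EXCL makes creation fail if the file already exists,
        # so only one caller at a time can take the lock.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    try:
        os.write(fd, str(os.getpid()).encode())
        housekeep()
        return True
    finally:
        os.close(fd)
        os.remove(lock_path)


def emergency_housekeep(path, lock_path, housekeep,
                        threshold=70.0, usage=usage_percent):
    """Kick off housekeeping only when usage has reached the threshold."""
    if usage(path) < threshold:
        return False
    return run_housekeep_once(lock_path, housekeep)
```

For the check to help, the regular scheduled housekeep would have to take the same lock. A lock file left behind by a crashed run would block every later run, so a real script would probably also want some stale-lock handling.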

real solution is not to use linear logs

Vitor

Posted: Fri Apr 04, 2008 12:43 am Post subject:

Grand High Poobah

Joined: 11 Nov 2005
Posts: 26093
Location: Texas, USA

Another source of queue manager failure is a lack of non-disc machine resource. It's feasible that a badly behaved (or badly coded) app is hogging sufficient resources that the queue manager can go down, so it's a valid strategy to attempt an auto-restart once. The problems come if the automatic process is not sufficiently bright to know it's beaten & keeps trying to restart the queue manager for several hours unsuccessfully!

This works best when combined with a basket of "likely" problems; housekeeping on the disk, checks on memory, that sort of thing. A lot of it depends on the ethos of the site - a proactive site will likely have tight housekeeping that prevents these common problems, so a crash is likely to be something requiring manual action, where a reactive site will use this kind of set up to fix problems when they manifest.
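As a rough sketch of an auto-restart that knows when it's beaten, something like the following Python could work. The function names, the `escalate` callback and the default of a single attempt are assumptions made for illustration; `strmqm` is the standard start command, and the sketch simply treats a zero exit code as success.

```python
import subprocess
import time


def start_queue_manager(name):
    """Try to start the queue manager with strmqm; True if the command exits 0."""
    result = subprocess.run(["strmqm", name], capture_output=True, text=True)
    return result.returncode == 0


def restart_with_limit(name, start=start_queue_manager, escalate=print,
                       max_attempts=1, wait_seconds=0.0):
    """Attempt a restart at most `max_attempts` times, then hand over to a human."""
    for attempt in range(1, max_attempts + 1):
        if start(name):
            return True
        if attempt < max_attempts:
            time.sleep(wait_seconds)
    # Give up instead of looping for hours; someone needs to look at it.
    escalate(f"{name}: {max_attempts} restart attempt(s) failed, manual action needed")
    return False
```

The "basket" of checks (disk housekeeping, memory and so on) would sit in front of the call to `start`, so the restart is only tried once the likely causes have been dealt with.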

On the other scenario, I've been on a site where a "queue full" error was dealt with by unloading the queue to a backup server. The rationale was that a queue full indicated the app reading the queue needed to be restarted (usually by bouncing the app server, done automatically), the "old" contents of the queue were farmed off to allow new stuff through and then drip fed back in once the app was deemed stable.

My personal opinion - if you want true automatic problem resolution, use an HA solution, set to failover in a wide variety of cases. Also don't expect any solution to fit every single case.
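To make the unload and drip-feed idea concrete, here is a toy Python model. It uses in-memory deques in place of real queues and a backup server, so it shows the flow only; the function names, `max_depth` and `batch_size` are assumptions, not anything from that site's actual implementation.

```python
from collections import deque


def offload(queue, backlog):
    """Move every message from the full queue into the backlog store, oldest first."""
    moved = 0
    while queue:
        backlog.append(queue.popleft())
        moved += 1
    return moved


def drip_feed(backlog, queue, max_depth, batch_size):
    """Put up to `batch_size` backlog messages back, never exceeding `max_depth`."""
    returned = 0
    while backlog and returned < batch_size and len(queue) < max_depth:
        queue.append(backlog.popleft())
        returned += 1
    return returned
```

One thing the model makes visible: the old messages come back behind anything that arrived in the meantime, so this approach only suits apps that don't depend on message order.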
_________________
Honesty is the best policy.
Insanity is the best defence.

SAFraser

Posted: Tue Apr 15, 2008 1:25 pm Post subject:

Shaman

Joined: 22 Oct 2003
Posts: 742
Location: Austin, Texas, USA

The one type of auto recovery that I like is a reset/restart channel for a short time. If the reset/restart does not succeed after a specified time, then an alert should be generated.

This would be handy when there is a blip on the network and the channel needs a manual reset to recover.
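A minimal sketch of that "try for a short time, then alert" behaviour, in Python. Everything here is hypothetical: `restart`, `is_running` and `alert` are callbacks the monitoring tool would have to supply (for example, `restart` might pipe a START CHANNEL command into runmqsc), and the five-minute window and 30-second poll are arbitrary defaults.

```python
import time


def recover_channel(channel, restart, is_running, alert,
                    window_seconds=300.0, poll_seconds=30.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Retry `restart` until the channel runs or the time window is used up.

    Returns True if the channel came back, False if an alert was raised.
    """
    deadline = clock() + window_seconds
    while True:
        restart(channel)
        if is_running(channel):
            return True
        # Stop if waiting for another poll would take us past the deadline.
        if clock() + poll_seconds > deadline:
            break
        sleep(poll_seconds)
    alert(f"channel {channel} still not running after {window_seconds:.0f}s")
    return False
```

The `clock` and `sleep` parameters are only there so the loop can be exercised without really waiting.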

PeterPotkay

Posted: Tue Apr 15, 2008 1:31 pm Post subject:

Poobah

Joined: 15 May 2001
Posts: 7723

Network blips shouldn't cause channel sequence # errors.

Automatically resetting sequence #s masks bigger problems, since channel sequence # errors should not be happening all the time.

I suppose if you get an email anytime the auto reset occurs, it might be OK, but I still don't like it.

It's like having a robot that intercepts your crying kid and slaps a band aid on automatically. You probably want to know how they got their booboo, not just have RoboMommy band-aid away!
_________________
Peter Potkay
Keep Calm and MQ On

SAFraser

Posted: Tue Apr 15, 2008 1:59 pm Post subject:

Shaman

Joined: 22 Oct 2003
Posts: 742
Location: Austin, Texas, USA

I don't think every retrying channel is evil. If the blip occurs when the message has been sent but the receiver hasn't sent back the ack, the channel will go to retry. Perhaps the channel should auto-recover, but sometimes it does not and needs a manual reset. I am not suggesting that a backout/commit should ever be done automatically, nor would I auto-reset the channel sequence itself.

mvic

Posted: Tue Apr 15, 2008 2:04 pm Post subject:

Jedi

Joined: 09 Mar 2004
Posts: 2080

SAFraser wrote:

This would be handy when there is a blip on the network and the channel needs a manual reset to recover.

Doesn't a reset imply something is pretty badly wrong? I'm with Peter - it sounds like there's potentially a serious problem in there somewhere.

http://publib.boulder.ibm.com/infocenter/wmqv6/v6r0/topic/com.ibm.mq.csqzae.doc/ic10590_.htm

EDIT:
Just saw your update

Quote:

nor would I auto-reset the channel sequence itself.

Ah, so what reset is it that solves the problems?

fjb_saper

Posted: Tue Apr 15, 2008 2:40 pm Post subject:

Grand High Poobah

Joined: 18 Nov 2003
Posts: 20763
Location: LI,NY

Maybe Shirley meant recycle?
Stop chl, start chl?
_________________
MQ & Broker admin

SAFraser

Posted: Tue Apr 15, 2008 6:31 pm Post subject:

Shaman

Joined: 22 Oct 2003
Posts: 742
Location: Austin, Texas, USA

I offered this once before, a long time ago, and some grand master somewhere got snooty with me. But unless Hursley tells me I'm wrong, I'm pretty sure about it....

A sender wants an ack back from the receiver. If it doesn't get it, the sender will try to confirm that the message was received. So a channel in retrying just means that the sender considers the state of receipt unknown. In earlier versions of MQ, 'reset channel' was always needed to force the sender to say "hey, did you get the message?"; and, when the receiver said "sure did", the 'start channel' command would restart the channel. (I've always assumed that the sender is comparing seqno and, when it matches, considers the ack to be good.)

Now in the newer versions, I think that this reset happens on its own because most of the time, a retrying channel will restart itself when the receiver is available again.

If a sender says "hey, did you get the message?", and the receiver says "nope, sure didn't", then the channel goes to indoubt status. Human intervention is a must in this case. Very infrequently, I might reset a channel sequence number; and this is only ever needed when the two queue managers are on different OS platforms. (No, I don't know why and yes, it is just anecdotal, and no, there's no reason why the OS should make a difference -- I'm just telling you what my own experience is.)

But every now and then, even with the newer versions of MQ, I'll find a channel in retrying (but not indoubt). And I'll issue 'reset chl', followed by 'start channel' and all is well. There is never any harm in 'reset chl' and, in fact, by old habit I always issue 'reset' before 'start'. There is no harm in a monitoring tool issuing reset/start, and since it works most of the time (for me), why not?

That's what I was thinking!

fjb_saper

Posted: Tue Apr 15, 2008 8:32 pm Post subject:

Grand High Poobah

Joined: 18 Nov 2003
Posts: 20763
Location: LI,NY

Thanks Shirley for clearing that up.
To my shame I must say I never used reset without the seqnum addendum...

We learn every day...

_________________
MQ & Broker admin
