ghoshly
Posted: Thu Jun 03, 2021 7:50 am    Post subject: Identify the reason of Fail over
Partisan
Joined: 10 Jan 2008    Posts: 333
Hello,
I have a multi-instance queue manager, and on top of it a multi-instance integration node, in an AIX environment using MQ v8 and IIB v10. In one of our environments we observe very frequent failover, while that is not the case in the others. I have not been able to find any pattern so far.
I understand that the locks on the shared data and log directories are released by the active instance, and that is why the passive instance gains the lock and becomes active. Is there any way, through logs or commands, by which we can find the actual reason for the failover? That way, we can work with the AIX / Storage / Network teams to reduce or fix it.
I appreciate your help.
exerk
Posted: Thu Jun 03, 2021 8:04 am
Jedi Council
Joined: 02 Nov 2006    Posts: 6339
QUESTIONS:
1. Anything indicative in /var/mqm/errors, e.g. FDC files?
2. Anything in the syslog indicative of loss of connection to the file server?
3. Is it always the same server that releases the locks?
_________________
It's puzzling, I don't think I've ever seen anything quite like this before...and it's hard to soar like an eagle when you're surrounded by turkeys.
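To make the FDC triage in point 1 quicker, the header of each FDC file can be scanned for its Probe Id and timestamp in one pass. A minimal sketch, assuming the usual FDC header layout with "Probe Id" and "Date/Time" lines; the directory is a parameter (defaulting to the standard MQ errors directory) so it can be pointed anywhere:

```shell
# Summarise the Probe Id and Date/Time header lines of each FDC file,
# so the probes can be matched against the failover windows.
fdc_summary() {
  dir="${1:-/var/mqm/errors}"
  for fdc in "$dir"/*.FDC; do
    [ -f "$fdc" ] || continue
    echo "== $fdc"
    grep -E 'Probe Id|Date/Time' "$fdc"
  done
}
```

Sorting that output by the Date/Time values should show whether the probes cluster around the times the fail-overs happened.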
ghoshly
Posted: Thu Jun 03, 2021 9:53 am
Partisan
Joined: 10 Jan 2008    Posts: 333
Thanks for your response.
I do see many FDC files created in the environment. On one server there are multiple queue managers, and it's not the case that all of them have failed over. Below are some of the Probe Ids; I am not sure whether any of them are relevant.
KN549055, ZX165131, RM220005, AL008080
Is there any specific error code / return code I should look for to pinpoint exactly when the failover happened? I could not find anything specific in syslog. In the MQ error log I see multiple occurrences of AMQ6183 and AMQ6184, which only mention an internal error.
Some of our scheduled / backup processes run via one server only. When those jobs fail, we get notified and we switch back to the original server. We have not set up, or noticed, automatic failover from server 2 back to server 1.
exerk
Posted: Fri Jun 04, 2021 5:06 am
Jedi Council
Joined: 02 Nov 2006    Posts: 6339
ghoshly wrote:
..In MQ error log, I could see multiple occurrences of AMQ6183 and AMQ6184 which only mentions internal error...
THIS page has some information regarding those AMQ codes, but the pages it references have broken links, as good old IBM is moving stuff around again...
ghoshly wrote:
...Some of our scheduled processed / backup processes only runs via one server. When those jobs gets failed, we get notified and we switch it back to original running server. We have not setup / noticed automatic fail over from server 2 to server 1.
If I understand you correctly, the only notification you have of queue manager fail-over is when your scheduled processes fail, and you then switch the queue manager back to the 'Primary' node?
If so, that's somewhat inefficient. The better solution would be to have the jobs set up on both nodes and have the job fail gracefully on the node not hosting the running instance of the queue manager, e.g. if it's a queue manager back-up job, have it check the status of the queue manager and run the back-up only if the status is 'Running'.
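That status check can be sketched as a small wrapper the scheduler installs on both nodes. The queue manager name and the backup command here are placeholders, and the `dspmq` STATUS(...) strings are assumed from typical multi-instance output, where the standby instance reports "Running as standby":

```shell
# Sketch: gate a scheduled backup job on the local queue manager status,
# so the same job can run on both nodes and exit gracefully on the standby.

is_active_line() {
  # The active instance reports STATUS(Running); a standby instance
  # reports STATUS(Running as standby) and must not match.
  case "$1" in
    *'STATUS(Running)'*) return 0 ;;
    *) return 1 ;;
  esac
}

run_backup_if_active() {
  qm="$1"                          # hypothetical queue manager name
  line=$(dspmq -m "$qm")           # e.g. "QMNAME(MYQM)  STATUS(Running)"
  if is_active_line "$line"; then
    echo "active instance of $qm is on this node - running backup"
    # /usr/local/scripts/qm_backup.sh "$qm"   # hypothetical backup step
  else
    echo "$qm is not active on this node - nothing to do"
  fi
}
```

Keeping the status parsing in its own function makes the "graceful" exit path easy to test without a running queue manager.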
Additionally, do a forensic comparison of the environments and look for detail differences.
ghoshly
Posted: Fri Jun 04, 2021 5:46 am
Partisan
Joined: 10 Jan 2008    Posts: 333
Yes, both AMQ6183 and AMQ6184 are generic IBM MQ internal errors. I will try to learn more through those Probe Ids.
Your understanding of our environment is correct, and thank you for the pro tip on the job scheduling.
I was looking for some obvious notification or command from IBM regarding failover, at least in the error log or syslog, to determine the exact time of the failover and the underlying reason.
exerk
Posted: Fri Jun 04, 2021 6:12 am
Jedi Council
Joined: 02 Nov 2006    Posts: 6339
ProbeID ZX165131 is related to I/O issues, ProbeID RM220005 is related to Queue Manager Cluster Repository issues, ProbeID AL008080 is related to a damaged object.
Fix one issue at a time, and I'd suggest you first look at I/O on the server that's losing the locks.
Andyh
Posted: Fri Jun 04, 2021 10:54 am
Master
Joined: 29 Jul 2010    Posts: 239
Once the locks have been lost on the old active instance, further I/O attempted on that node will fail, potentially leading to a variety of failure probes on that node.
The queue manager's AMQERR01.LOG error log should show what led to the locks being lost on the old active instance, most likely some NFS issue.
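A quick way to bracket the failover time is to pull out the timestamp line that precedes each AMQ6183/AMQ6184 entry in the error log. A sketch, assuming the usual AMQERR01.LOG layout of a timestamp/process line followed by the AMQnnnn message line (the log path is a parameter so it can be run against archived copies too):

```shell
# Print each AMQ6183/AMQ6184 message together with the timestamp line
# that precedes it, to bracket when the locks were lost.
failover_times() {
  # awk keeps the previous line in "prev", so each matching AMQ message
  # is printed with its preceding timestamp/process line.
  awk '/^AMQ(6183|6184):/ { print prev; print } { prev = $0 }' "$1"
}
```

Correlating those timestamps with the file server's and NFS client's logs for the same window is then the obvious next step.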