ghoshly
Posted: Thu Jun 03, 2021 7:50 am    Post subject: Identify the reason of Fail over
Partisan
Joined: 10 Jan 2008    Posts: 333
Hello,
I have a multi-instance queue manager, and on top of it a multi-instance integration node, in an AIX environment using MQ v8 and IIB v10. In one of our environments we observe very frequent failover, while that is not the case in the others. I have not been able to find any pattern so far.
I understand that the locks on the shared data and log directories are released by the active instance, and that is why the passive instance gains the lock and becomes active. Is there any way, through logs or commands, by which we can find the actual reason for the failover? That way, we can work with the AIX / Storage / Network teams to reduce or fix it.
I appreciate your help.
exerk
Posted: Thu Jun 03, 2021 8:04 am
Jedi Council
Joined: 02 Nov 2006    Posts: 6339
QUESTIONS:
1. Anything indicative in /var/mqm/errors, e.g. FDC files?
2. Anything in the syslog indicative of loss of connection to the file server?
3. Is it always the same server that releases the locks?
_________________
It's puzzling, I don't think I've ever seen anything quite like this before...and it's hard to soar like an eagle when you're surrounded by turkeys.
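To make the FDC triage in point 1 quicker, the header of each FDC file can be scanned for its Probe Id and timestamp in one pass. A minimal sketch, assuming the usual FDC header layout with "Probe Id" and "Date/Time" lines; the directory is a parameter (defaulting to the standard MQ errors directory) so it can be pointed anywhere:

```shell
# Summarise the Probe Id and Date/Time header lines of each FDC file,
# so the probes can be matched against the failover windows.
fdc_summary() {
  dir="${1:-/var/mqm/errors}"
  for fdc in "$dir"/*.FDC; do
    [ -f "$fdc" ] || continue
    echo "== $fdc"
    grep -E 'Probe Id|Date/Time' "$fdc"
  done
}
```

Sorting that output by the Date/Time values should show whether the probes cluster around the times the fail-overs happened.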
ghoshly
Posted: Thu Jun 03, 2021 9:53 am
Partisan
Joined: 10 Jan 2008    Posts: 333
Thanks for your response.
I do see many FDC files created in the environment. On one server there are multiple queue managers, and it's not the case that all of them have failed over. Below are some of the Probe Ids; I am not sure whether any of them are relevant.
KN549055, ZX165131, RM220005, AL008080
Is there any specific error code / return code I should look for to pinpoint exactly when the failover happened? I could not find anything specific in syslog. In the MQ error log I see multiple occurrences of AMQ6183 and AMQ6184, which only mention an internal error.
Some of our scheduled / backup processes run via one server only. When those jobs fail, we get notified and we switch back to the original server. We have not set up, or noticed, automatic failover from server 2 back to server 1.
exerk
Posted: Fri Jun 04, 2021 5:06 am
Jedi Council
Joined: 02 Nov 2006    Posts: 6339
ghoshly wrote:
..In MQ error log, I could see multiple occurrences of AMQ6183 and AMQ6184 which only mentions internal error...
THIS page has some information regarding those AMQ codes, but the pages it references have broken links, as good old IBM is moving stuff around again...
ghoshly wrote:
...Some of our scheduled processed / backup processes only runs via one server. When those jobs gets failed, we get notified and we switch it back to original running server. We have not setup / noticed automatic fail over from server 2 to server 1.
If I understand you correctly, the only notification you have of queue manager fail-over is when your scheduled processes fail, and you then switch the queue manager back to the 'Primary' node?
If so, that's somewhat inefficient. The better solution would be to have the jobs set up on both nodes and have the job fail gracefully on the node not hosting the running instance of the queue manager, e.g. if it's a queue manager back-up job, have it check the status of the queue manager and run the back-up only if the status is 'Running'.
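That status check can be sketched as a small wrapper the scheduler installs on both nodes. The queue manager name and the backup command here are placeholders, and the `dspmq` STATUS(...) strings are assumed from typical multi-instance output, where the standby instance reports "Running as standby":

```shell
# Sketch: gate a scheduled backup job on the local queue manager status,
# so the same job can run on both nodes and exit gracefully on the standby.

is_active_line() {
  # The active instance reports STATUS(Running); a standby instance
  # reports STATUS(Running as standby) and must not match.
  case "$1" in
    *'STATUS(Running)'*) return 0 ;;
    *) return 1 ;;
  esac
}

run_backup_if_active() {
  qm="$1"                          # hypothetical queue manager name
  line=$(dspmq -m "$qm")           # e.g. "QMNAME(MYQM)  STATUS(Running)"
  if is_active_line "$line"; then
    echo "active instance of $qm is on this node - running backup"
    # /usr/local/scripts/qm_backup.sh "$qm"   # hypothetical backup step
  else
    echo "$qm is not active on this node - nothing to do"
  fi
}
```

Keeping the status parsing in its own function makes the "graceful" exit path easy to test without a running queue manager.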
Additionally, do a forensic comparison of the environments and look for detail differences.
ghoshly
Posted: Fri Jun 04, 2021 5:46 am
Partisan
Joined: 10 Jan 2008    Posts: 333
Yes, both AMQ6183 and AMQ6184 are generic IBM MQ internal errors. I will try to learn more through those Probe Ids.
Your understanding of our environment is correct, and thank you for the pro tip on the job scheduling.
I was looking for some obvious notification or command from IBM regarding failover, at least in the error log or syslog, to determine the exact time of the failover and the underlying reason.
exerk
Posted: Fri Jun 04, 2021 6:12 am
Jedi Council
Joined: 02 Nov 2006    Posts: 6339
ProbeID ZX165131 is related to I/O issues, ProbeID RM220005 is related to Queue Manager Cluster Repository issues, ProbeID AL008080 is related to a damaged object.
Fix one issue at a time, and I'd suggest you first look at I/O on the server that's losing the locks.
Andyh
Posted: Fri Jun 04, 2021 10:54 am
Master
Joined: 29 Jul 2010    Posts: 239
Once the locks have been lost on the old active instance, further I/O attempted on that node will fail, potentially leading to a variety of failure probes on that node.
The queue manager's AMQERR01.LOG error log should show what led to the locks being lost on the old active instance, most likely some NFS issue.
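A quick way to bracket the failover time is to pull out the timestamp line that precedes each AMQ6183/AMQ6184 entry in the error log. A sketch, assuming the usual AMQERR01.LOG layout of a timestamp/process line followed by the AMQnnnn message line (the log path is a parameter so it can be run against archived copies too):

```shell
# Print each AMQ6183/AMQ6184 message together with the timestamp line
# that precedes it, to bracket when the locks were lost.
failover_times() {
  # awk keeps the previous line in "prev", so each matching AMQ message
  # is printed with its preceding timestamp/process line.
  awk '/^AMQ(6183|6184):/ { print prev; print } { prev = $0 }' "$1"
}
```

Correlating those timestamps with the file server's and NFS client's logs for the same window is then the obvious next step.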