MQSeries.net Forum Index » General IBM MQ Support » Identify the reason of Fail over

Identify the reason of Fail over
ghoshly
Posted: Thu Jun 03, 2021 7:50 am    Post subject: Identify the reason of Fail over

Partisan

Joined: 10 Jan 2008
Posts: 325

Hello,

I have a multi-instance queue manager, together with a multi-instance integration node, in an AIX environment using MQ v8 and IIB v10. In one of our environments we observe very frequent fail overs, which is not the case in the other environments. So far I have failed to find any pattern in them.

I understand that the active server releases its locks on the shared data and logs, which is why the passive instance gains the lock and becomes active. Is there any way, through logs or commands, to find the actual reason for the fail over? That way we can work with the AIX / Storage / Network teams to reduce or fix it.

I appreciate your
exerk
Posted: Thu Jun 03, 2021 8:04 am

Jedi Council

Joined: 02 Nov 2006
Posts: 6339

QUESTIONS:

1. Anything indicative in /var/mqm/errors, e.g. FDC files?

2. Anything in the syslog indicative of loss of connection to the file server?

3. Is it always the same server that releases the locks?
_________________
It's puzzling, I don't think I've ever seen anything quite like this before...and it's hard to soar like an eagle when you're surrounded by turkeys.
ghoshly
Posted: Thu Jun 03, 2021 9:53 am

Partisan

Joined: 10 Jan 2008
Posts: 325

Thanks for your response.

I do see many FDC files created in the environment. One server hosts multiple queue managers, and not all of them have failed over. Below are some of the Probe Ids; I am not sure whether any of them are relevant.

KN549055, ZX165131, RM220005, AL008080

Is there any specific error code / return code that I should look for to pin down the exact time when the fail over happened? I could not find anything specific in syslog. In the MQ error log, I can see multiple occurrences of AMQ6183 and AMQ6184, which only mention an internal error.


Some of our scheduled processes / backup processes run on only one server. When those jobs fail, we get notified and we switch it back to the originally running server. We have not set up, or noticed, automatic fail over from server 2 to server 1.
exerk
Posted: Fri Jun 04, 2021 5:06 am

Jedi Council

Joined: 02 Nov 2006
Posts: 6339

ghoshly wrote:
...In the MQ error log, I can see multiple occurrences of AMQ6183 and AMQ6184, which only mention an internal error...

THIS page has some information in regard to those AMQ codes, but the pages they reference have broken links as good old IBM are moving stuff around again...


ghoshly wrote:
...Some of our scheduled processes / backup processes run on only one server. When those jobs fail, we get notified and we switch it back to the originally running server. We have not set up, or noticed, automatic fail over from server 2 to server 1.

If I understand you correctly, the only notification you have of queue manager fail-over is when your scheduled processes fail, and you then switch the queue manager back to the 'Primary' node?

If so, that's somewhat inefficient. The better solution would be to have the jobs set up on both nodes and have the job fail gracefully on the node not hosting the running instance of the queue manager, e.g. if it's a queue manager back-up job, have it check the status of the queue manager and run the back-up only if the status is 'Running'.

Additionally, do a forensic comparison of the environments and look for detail differences.
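
The "run the back-up only if the status is 'Running'" check can be sketched as below. This is a minimal sketch, not a tested production job: it assumes `dspmq -m <qmgr>` prints a line of the form `QMNAME(...) STATUS(...)`, and that a standby or remote instance shows up as a status other than plain `Running` (for example `Running as standby` or `Running elsewhere`). The function names and the `backup` callable are hypothetical.

```python
import re
import subprocess
from typing import Callable, Optional

# Matches one dspmq line, e.g. "QMNAME(QM1)      STATUS(Running)".
STATUS_RE = re.compile(r"QMNAME\((?P<name>[^)]+)\)\s+STATUS\((?P<status>[^)]+)\)")


def parse_status(dspmq_output: str, qmgr: str) -> Optional[str]:
    """Return the STATUS(...) value reported for qmgr, or None if it is absent."""
    for match in STATUS_RE.finditer(dspmq_output):
        if match.group("name") == qmgr:
            return match.group("status")
    return None


def is_active_here(dspmq_output: str, qmgr: str) -> bool:
    """True only for the exact status 'Running' (the active instance on this node)."""
    return parse_status(dspmq_output, qmgr) == "Running"


def query_dspmq(qmgr: str) -> str:
    """Run `dspmq -m <qmgr>` on this node and return its standard output."""
    result = subprocess.run(
        ["dspmq", "-m", qmgr], capture_output=True, text=True, check=False
    )
    return result.stdout


def run_backup_if_active(qmgr: str, backup: Callable[[], None]) -> bool:
    """Call backup() only when this node hosts the running instance."""
    if not is_active_here(query_dspmq(qmgr), qmgr):
        return False  # exit quietly on the node that is not active
    backup()
    return True
```

Scheduled on both nodes, a job wrapped in `run_backup_if_active` does its work on whichever node is active and returns `False` without error on the other, so a fail over no longer breaks the schedule.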
ghoshly
Posted: Fri Jun 04, 2021 5:46 am

Partisan

Joined: 10 Jan 2008
Posts: 325

Yes, both AMQ6183 and AMQ6184 are generic IBM MQ internal errors. I will try to learn more through those Probe Ids.

Your understanding is correct about our environment and thank you for the Pro-tip on the job scheduling.

I was looking for some obvious notification or command from IBM regarding fail over, at least in the error log or syslog, to determine the exact time of the fail over and its underlying reason.
exerk
Posted: Fri Jun 04, 2021 6:12 am

Jedi Council

Joined: 02 Nov 2006
Posts: 6339

ProbeID ZX165131 is related to I/O issues, ProbeID RM220005 is related to Queue Manager Cluster Repository issues, ProbeID AL008080 is related to a damaged object.

Fix one issue at a time, and I'd suggest you first look at I/O on the server that's failing the locks.
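
To decide which issue to tackle first, it may help to count how often each Probe Id occurs across the FDC files in /var/mqm/errors. The sketch below assumes each FDC report header uses `| Key :- Value |` lines, that a report starts at its `Date/Time` line, and that `Probe Id` and `QueueManager` are among the keys. The function names are hypothetical, and the header layout should be checked against a real FDC file from the affected server.

```python
import re
from collections import Counter
from pathlib import Path
from typing import Dict, Iterable, Iterator

# Header line such as "| Probe Id          :- ZX165131          |".
FIELD_RE = re.compile(r"^\|\s*(?P<key>[A-Za-z/ ]+?)\s*:-\s*(?P<value>.*?)\s*\|?\s*$")


def iter_fdc_reports(text: str) -> Iterator[Dict[str, str]]:
    """Yield one dict of header fields per report; a new report begins at 'Date/Time'."""
    current = None
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if not match:
            continue
        key, value = match.group("key"), match.group("value")
        if key == "Date/Time":
            if current:
                yield current
            current = {}
        if current is not None:
            # Keep the first value seen for each key within a report.
            current.setdefault(key, value)
    if current:
        yield current


def count_probe_ids(reports: Iterable[Dict[str, str]]) -> Counter:
    """Count reports per Probe Id."""
    return Counter(report.get("Probe Id", "unknown") for report in reports)


def summarize_directory(errors_dir: str) -> Counter:
    """Count Probe Ids across every *.FDC file in errors_dir."""
    counts: Counter = Counter()
    for path in sorted(Path(errors_dir).glob("*.FDC")):
        text = path.read_text(errors="replace")
        counts.update(count_probe_ids(iter_fdc_reports(text)))
    return counts
```

Calling `summarize_directory("/var/mqm/errors").most_common()` lists the Probe Ids by frequency. Each report dict also carries its `Date/Time` and `QueueManager` fields, which can be compared with the times the fail overs were noticed.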
Andyh
Posted: Fri Jun 04, 2021 10:54 am

Master

Joined: 29 Jul 2010
Posts: 237

Once the locks have been lost on the old active instance, any further I/O attempted on that node will fail, potentially leading to a variety of failure probes on that node.
The queue manager's AMQERR01.LOG error log should show what led to the locks being lost on the old active instance, most likely some NFS issue.
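
To line the error log up with the times a fail over was noticed, the entries in AMQERR01.LOG can be reduced to (timestamp, message id, first line) tuples. This is a rough sketch that assumes each entry starts with a line of the form `<timestamp> - Process(...)` followed later by a line beginning `AMQnnnn:`. The function names are hypothetical. The ids used in the usage note are only the ones already mentioned in this thread; the message that actually records the lock loss has to be identified from the log itself.

```python
import re
from typing import Iterable, Iterator, List, Tuple

# Entry header, e.g. "06/03/21 07:50:12 - Process(1234.1) User(mqm) Program(...)".
# The timestamp format depends on locale, so everything before " - Process(" is kept.
HEADER_RE = re.compile(r"^(?P<stamp>\S.*?) - Process\(")
# Message line, e.g. "AMQ6184: An internal ... error has occurred ...".
MESSAGE_RE = re.compile(r"^(?P<msgid>AMQ\d{4,5}[A-Z]?):\s*(?P<text>.*)$")

Entry = Tuple[str, str, str]


def iter_messages(lines: Iterable[str]) -> Iterator[Entry]:
    """Yield (timestamp, message id, first line of text) for each log entry."""
    stamp = None
    for line in lines:
        header = HEADER_RE.match(line)
        if header:
            stamp = header.group("stamp")
            continue
        message = MESSAGE_RE.match(line)
        if message and stamp is not None:
            yield stamp, message.group("msgid"), message.group("text").strip()
            stamp = None  # report at most one message per entry header


def filter_messages(lines: Iterable[str], wanted: Iterable[str]) -> List[Entry]:
    """Keep only the entries whose message id is in wanted."""
    wanted_ids = set(wanted)
    return [entry for entry in iter_messages(lines) if entry[1] in wanted_ids]
```

For example, opening the queue manager's AMQERR01.LOG with `open(path, errors="replace")` and passing the file object to `filter_messages(log, ["AMQ6183", "AMQ6184"])` gives the times of the internal errors. Running `iter_messages` with no filter over the minutes before each of those times shows which messages came first.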