Author |
Message
|
banjar1 |
Posted: Fri May 28, 2010 1:54 am Post subject: MQ crash, no FDC, no reason, no trace... |
|
|
Acolyte
Joined: 29 Nov 2006 Posts: 54 Location: FRA
|
Hi,
System: MQ 7.0.1.1 on AIX (upgraded from 6.0.2.1 on 18 May; that was the last QM restart).
Symptoms: a remote QM was unable to start its sender channel to our HUB. Error message: AMQ9524: Remote queue manager unavailable. I tried to access the QM locally on the HUB with runmqsc and got a similar message. I checked for 'mqm' processes on the HUB and found fewer than expected. Far fewer: all the constantly running channels were still there, but no new one could be started. Because this is our production system and a lot of our customers depend on us delivering them critical data, I wanted to kill all the outstanding processes immediately, but I tried "endmqm -i" first - and it worked :/ How could it work if QM was "unavailable"??
After all processes were gone I started the QM again without any problems. It has been running for 20 hours now.
Now, time to look for the reason. Found nothing. No *.FDC file in /var/mqm/errors, no new entry in /var/mqm/errors/AMQERR*.LOG, no entry in errpt, nothing. Unfortunately the 3 AMQERR*.LOGs in /var/mqm/qmgrs/name/errors had already cycled past the time of the crash, so no info from them either (and we have them set to 1MB each :/ ).
Questions:
- what could possibly cause that?
- how to detect such a problem faster? |
|
mvic |
Posted: Fri May 28, 2010 3:16 am Post subject: Re: MQ crash, no FDC, no reason, no trace... |
|
|
 Jedi
Joined: 09 Mar 2004 Posts: 2080
|
banjar1 wrote: |
I tried "endmqm -i" first - and it worked :/ How could it work if QM was "unavailable"?? |
It all depends what "unavailable" really means. I don't think there is a precise definition.
But if your endmqm command worked, it must have been sufficiently "available" to allow the endmqm command to do its work.
It would be interesting to know precisely the list of processes that were missing. |
|
Vitor |
Posted: Fri May 28, 2010 5:27 am Post subject: Re: MQ crash, no FDC, no reason, no trace... |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
mvic wrote: |
It would be interesting to know precisely the list of processes that were missing. |
It would also be interesting to know the output of a dspmq command. _________________ Honesty is the best policy.
Insanity is the best defence. |
|
banjar1 |
Posted: Fri May 28, 2010 5:30 am Post subject: |
|
|
Acolyte
Joined: 29 Nov 2006 Posts: 54 Location: FRA
|
I know, both these things would be interesting. Unfortunately I don't have them; I was focused on bringing the QM back on-line.
The most important thing for me now is to find a way to detect such a problem faster in the future, rather than only when someone tries to initiate a channel to HUB. Any ideas? |
|
mqjeff |
Posted: Fri May 28, 2010 5:47 am Post subject: |
|
|
Grand Master
Joined: 25 Jun 2008 Posts: 17447
|
The question is what died or what was otherwise preventing the channel from starting.
It's possible, for example, that you merely hit MaxActiveChannels.
Or that you otherwise hit a resource limitation that prevented the qmgr from starting a new channel instance. |
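One way to check the MaxActiveChannels theory is to read the configured limits from the Channels stanza of qm.ini and compare them with the number of channels actually running. The Python sketch below only does the parsing step. The function name channel_limits and the sample text in the usage note are illustrative assumptions, not anything taken from the poster's system.

```python
def channel_limits(qm_ini_text):
    """Return the key=value pairs found in the Channels stanza of qm.ini text."""
    limits = {}
    in_channels = False
    for raw in qm_ini_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line[0] in "#;":
            continue
        # A stanza header is a name followed by a colon, e.g. "Channels:".
        if line.endswith(":"):
            in_channels = line.lower() == "channels:"
            continue
        if in_channels and "=" in line:
            key, _, value = line.partition("=")
            limits[key.strip()] = value.strip()
    return limits
```

For example, channel_limits(open("/var/mqm/qmgrs/QMNAME/qm.ini").read()) returns a dictionary such as {"MaxChannels": "200", "MaxActiveChannels": "150"} when both attributes are set. If an attribute is absent from the file, the product default applies and the dictionary simply will not contain it.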
|
mvic |
Posted: Fri May 28, 2010 5:52 am Post subject: |
|
|
 Jedi
Joined: 09 Mar 2004 Posts: 2080
|
banjar1 wrote: |
I know, both these things would be interesting. Unfortunately I don't have them; I was focused on bringing the QM back on-line. |
Not only "interesting", important too, if you want to have any hope of success when presenting the problem on a forum or to IBM.
That being said, I understand completely the situation you were in at the time. But if there is a "next time" that this happens, then store away some details for later.
Stored outputs from things like "ps -ef", "ipcs -a" and (if you ever needed to get IBM support involved) an MQ trace capturing the failure of the failing program or command would help with the later investigation work.
Hope this helps |
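A minimal Python sketch of "store away some details for later": it saves the output of the commands named above into a timestamped directory, so it can be run (by the helpdesk, for instance) just before the queue manager is ended. The function names, the "mqdiag-" directory prefix and the 60 second timeout are assumptions made for illustration; the MQ trace is not covered here.

```python
import subprocess
import time
from pathlib import Path

# Commands named in the post above; extend the dictionary to suit your site.
COMMANDS = {
    "ps": ["ps", "-ef"],
    "ipcs": ["ipcs", "-a"],
}


def run(cmd):
    """Run a command and return its combined output, or a note if it could not run."""
    try:
        done = subprocess.run(
            cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, timeout=60
        )
        return done.stdout.decode(errors="replace")
    except (OSError, subprocess.TimeoutExpired) as exc:
        return "could not run {}: {}\n".format(" ".join(cmd), exc)


def capture(out_root, commands=COMMANDS, runner=run):
    """Write each command's output to a timestamped directory and return that directory."""
    out_dir = Path(out_root) / time.strftime("mqdiag-%Y%m%d-%H%M%S")
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, cmd in commands.items():
        (out_dir / (name + ".txt")).write_text(runner(cmd))
    return out_dir
```

Calling capture("/tmp") leaves ps.txt and ipcs.txt in a new directory under /tmp. The runner parameter exists only so the function can be exercised without running real commands.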
|
mvic |
Posted: Fri May 28, 2010 5:56 am Post subject: Re: MQ crash, no FDC, no reason, no trace... |
|
|
 Jedi
Joined: 09 Mar 2004 Posts: 2080
|
banjar1 wrote: |
Unfortunately the 3 AMQERR*.LOGs in /var/mqm/qmgrs/name/errors had already cycled past the time of the crash, so no info from them either (and we have them set to 1MB each :/ ). |
That's a pity.
I assume this means you have a lot of "chatter" from client applications in those error logs..?
Maybe time to go higher than 1 MB. |
|
bruce2359 |
Posted: Fri May 28, 2010 6:14 am Post subject: |
|
|
 Poobah
Joined: 05 Jan 2008 Posts: 9469 Location: US: west coast, almost. Otherwise, enroute.
|
Quote: |
I know, both these things would be interesting. Unfortunately I don't have them; I was focused on bringing the QM back on-line.
The most important thing for me now is to find a way to detect such a problem faster in the future, rather than only when someone tries to initiate a channel to HUB. Any ideas? |
From these statements, we can appreciate the pressure you feel to get the problem(s) fixed - what you believe to be the most important thing.
I'd suggest that in your (understandable) haste, you missed an important opportunity to gather sufficient information that would lead you to detect such a problem in the future.
There is a difference between a symptom and a problem. Your post described the symptom. Without any details, we are only left to guess and speculate. I'd say that there are dozens or hundreds of possible causes of this symptom. A quick search here for "unavailable" will demonstrate this.
Basic problem-determination demands observation and documentation BEFORE attempting to fix anything. As a parallel, CTRL+ALT+DELETE in the Windoze world only fixes the symptom and never corrects the underlying problem.
While doing basic p-d might slow down the process of resolving the immediate symptom, it is critical to finding the underlying problem: the underlying thing(s) that need(s) to be fixed.
You might want to begin by downloading and reading the WMQ Problem Determination manual. You might also want to take a WMQ System Administration course. Of course, these are long(er)-term activities, while you are demanding an immediate solution.
You are not alone in the pressure (either yours or your management) to get it fixed as quickly as possible; but cooler heads must prevail if you are to learn from outages like this. _________________ I like deadlines. I like to wave as they pass by.
ב''ה
Lex Orandi, Lex Credendi, Lex Vivendi. As we Worship, So we Believe, So we Live. |
|
warrenJ |
Posted: Tue Jun 01, 2010 9:03 pm Post subject: |
|
|
Apprentice
Joined: 11 Jan 2004 Posts: 29 Location: AUSTRALIA
|
I'd suggest upgrading to v7.0.1.2 as soon as possible.
We have had similar problems on v7.0.1.1 on AIX with the QManager crashing (with FDC files) and Sender channel terminating (with NO FDC files). There were many other issues that also caused us grief in v7.0.1.1 and v7.0.1.0. |
|
banjar1 |
Posted: Thu Jun 03, 2010 11:12 pm Post subject: |
|
|
Acolyte
Joined: 29 Nov 2006 Posts: 54 Location: FRA
|
Thanks warrenJ, that's also my opinion. Considering how much is fixed in 7.0.1.2 we must upgrade asap.
Has anyone any ideas what to monitor to avoid it in the future? |
|
mvic |
Posted: Fri Jun 04, 2010 8:17 am Post subject: |
|
|
 Jedi
Joined: 09 Mar 2004 Posts: 2080
|
banjar1 wrote: |
Thanks warrenJ, that's also my opinion. Considering how much is fixed in 7.0.1.2 we must upgrade asap.
Has anyone any ideas what to monitor to avoid it in the future? |
The list of processes.
The qmgr error logs in /var/mqm/qmgrs/QMNAME/errors
The system-wide error logs in /var/mqm/errors
If the failure was because of a bug in MQ, monitoring will not avoid the problem, but will help you know it has happened as soon as possible. |
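The first and third items in that list can be sketched in Python as two small checks: one looks for expected queue manager processes in "ps -ef" output, the other lists FDC files newer than a given time. The process names in EXPECTED are an assumption and should be compared against a listing from your own healthy queue manager; the function names are hypothetical. Scanning the AMQERR*.LOG files is left out.

```python
import glob
import os

# Assumed names of long-running queue manager processes; verify them against
# "ps -ef" output from a healthy queue manager before relying on this list.
EXPECTED = ("amqzxma0", "amqzlaa0", "amqrrmfa", "amqpcsea", "runmqchi")


def missing_processes(ps_output, qmgr, expected=EXPECTED):
    """Return expected process names with no ps line that also mentions qmgr."""
    lines = [line for line in ps_output.splitlines() if qmgr in line]
    return [name for name in expected if not any(name in line for line in lines)]


def new_fdc_files(errors_dir, since_epoch):
    """Return FDC files in errors_dir modified after since_epoch, oldest first."""
    paths = glob.glob(os.path.join(errors_dir, "*.FDC"))
    recent = [p for p in paths if os.path.getmtime(p) > since_epoch]
    return sorted(recent, key=os.path.getmtime)
```

A scheduled job could feed missing_processes the output of "ps -ef" and raise an alert when the returned list is not empty, and call new_fdc_files("/var/mqm/errors", last_check_time) on each pass. As noted above, this only shortens the time to notice a failure; it does not prevent one.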
|
mvic |
Posted: Fri Jun 04, 2010 8:22 am Post subject: |
|
|
 Jedi
Joined: 09 Mar 2004 Posts: 2080
|
warrenJ wrote: |
... Sender channel terminating (with NO FDC files). |
There are dozens of reasons a channel can end, e.g. network failures, firewalls breaking connections, etc. FDCs would not be expected in these cases. |
|
ramires |
Posted: Fri Jun 04, 2010 11:05 am Post subject: Re: MQ crash, no FDC, no reason, no trace... |
|
|
Knight
Joined: 24 Jun 2001 Posts: 523 Location: Portugal - Lisboa
|
banjar1 wrote: |
I tried to access the QM locally on HUB with runmqsc and got similar message. |
Do you remember the error runmqsc returned? Did you check file system space? |
|
banjar1 |
Posted: Mon Jun 07, 2010 12:31 am Post subject: |
|
|
Acolyte
Joined: 29 Nov 2006 Posts: 54 Location: FRA
|
Another crash just happened this morning. Exactly the same situation - unfortunately I wasn't there to instruct the helpdesk what they should collect.
All I know is that suddenly our (simple) monitoring tool wasn't able to connect to the QM, and runmqsc also failed with "QM is not currently available".
Funny, but triggering kept working - my publishing script was started, but hung, unable to connect to QM.
No FDC, and no entry in any AMQERR*.LOG except this one:
----- amqrccca.c : 921 --------------------------------------------------------
06/07/10 05:56:09 - Process(19939444.1) User(mqm) Program(runmqchl_nd)
Host(qlhhubfc)
AMQ9508: Program cannot connect to the queue manager.
EXPLANATION:
The connection attempt to queue manager 'QLHHUB' failed with reason code 2059.
ACTION:
Ensure that the queue manager is available and operational.
Well, there is one FDC, but created AFTER the problems were noticed:
+-----------------------------------------------------------------------------+
| |
| WebSphere MQ First Failure Symptom Report |
| ========================================= |
| |
| Date/Time :- Mon June 07 2010 05:30:07 CUT |
| UTC Time :- 1275888607.087334 |
| UTC Time Offset :- 0 (CUT) |
| Host Name :- qlhhubfc |
| Operating System :- AIX 5.3 |
| PIDS :- 5724H7221 |
| LVLS :- 7.0.1.1 |
| Product Long Name :- WebSphere MQ for AIX |
| Vendor :- IBM |
| Probe Id :- XY051170 |
| Application Name :- MQM |
| Component :- InitPrivateServices |
| SCCS Info :- lib/cs/unix/generic/amqxiinx.c, 1.297.1.3 |
| Line Number :- 1212 |
| Build Date :- Dec 22 2009 |
| CMVC level :- p701-101-091221 |
| Build Type :- IKAP - (Production) |
| Effective UserID :- 436 (UNKNOWN) |
| Real UserID :- 0 () |
| Program Name :- runmqsc |
| Addressing mode :- 64-bit |
| Process :- 18628782 |
| Thread(n) :- 1 |
| UserApp :- FALSE |
| Last HQC :- 0.0.0-0 |
| Last HSHMEMB :- 0.0.0-0 |
| Major Errorcode :- OK |
| Minor Errorcode :- OK |
| Probe Type :- INCORROUT |
| Probe Severity :- 4 |
| Probe Description :- AMQ6125: An internal WebSphere MQ error has occurred. |
| FDCSequenceNumber :- 0 |
| Comment1 :- xcsGetpwuid failed to get password entry for process |
| with real uid 326. |
| Comment2 :- Details: getuid() returned 326; getpwuid(326) failed |
| with errno=5. |
| Comment3 :- A user name of "UNKNOWN" will be used, which will |
| likely cause later authorisation failures. Note this FFST can be turned |
| off by exporting env var AMQ_NOFFST_PROCESS_UID. |
| |
+-----------------------------------------------------------------------------+
The user '326' is our administration user, used routinely. |
|
mvic |
Posted: Mon Jun 07, 2010 12:57 am Post subject: |
|
|
 Jedi
Joined: 09 Mar 2004 Posts: 2080
|
banjar1 wrote: |
Details: getuid() returned 326; getpwuid(326) failed |
| with errno=5 |
Check the times again.. didn't the FDC happen before the error log entry?
Anyway the getuid/getpwuid failure looks like a failure in the username subsystem.. getuid said 326, and getpwuid (when called with 326 as input) said "fail" with errno=5 which is EIO. Check with your OS admins.
UPDATE: The FDC is also from a different process (pid 18628782) than the one writing the error log entry (19939444).
Last edited by mvic on Mon Jun 07, 2010 2:01 am; edited 1 time in total |
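To follow up on the getuid/getpwuid point, a short Python sketch that decodes the errno value and probes the same user lookup from outside MQ. Python's pwd.getpwuid wraps the same operating system lookup, but it reports a failed lookup as a KeyError and does not expose the errno, so this only tells you whether the lookup works, not why it fails. The function names are hypothetical.

```python
import errno
import os
import pwd


def describe_errno(code):
    """Return the symbolic name and message for an errno value."""
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


def lookup_user(uid, getpwuid=pwd.getpwuid):
    """Return the user name for uid, or None if the password database lookup fails."""
    try:
        return getpwuid(uid).pw_name
    except (KeyError, OSError):
        return None
```

describe_errno(5) gives the name EIO, matching the reading above. Running lookup_user(326) in a loop while logged in as the administration user is one way to see whether the lookup fails intermittently, which would be something concrete to take to the OS admins.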
|