MQSeries.net :: View topic - MQ crash, no FDC, no reason, no trace...

MQ crash, no FDC, no reason, no trace...

banjar1

Posted: Fri May 28, 2010 1:54 am Post subject: MQ crash, no FDC, no reason, no trace...

Acolyte

Joined: 29 Nov 2006
Posts: 54
Location: FRA

Hi,

System: MQ 7.0.1.1 (upgraded on 18.05 from 6.0.2.1, that was the last QM re-start), AIX.

Symptoms: a remote QM was unable to start its sender channel to our HUB. Error message: AMQ9524: Remote queue manager unavailable. I tried to access the QM locally on HUB with runmqsc and got similar message. I checked for 'mqm' processes on HUB and found fewer than expected. Far fewer: all the constantly running channels were still there, but no new one could be started. Because this is our production system and a lot of our customers depend on us delivering them critical data, I wanted to kill all the outstanding processes immediately, but I tried "endmqm -i" first - and it worked :/ How could it work if QM was "unavailable"??

After all processes were gone I started the QM again without any problems. It has been running for 20 hours now.

Now, time to look for the reason. Found nothing. No *.FDC file in /var/mqm/errors, no new entry in /var/mqm/errors/AMQERR*.LOG, no entry in errpt, nothing. Unfortunately the 3 AMQERR*.LOGs in /var/mqm/qmgrs/name/errors already cycled past the time of crash, so no info from them either (and we have them set to 1MB each :/ ).

Questions:
- what could possibly cause that?
- how to detect such a problem faster?

mvic

Posted: Fri May 28, 2010 3:16 am Post subject: Re: MQ crash, no FDC, no reason, no trace...

Jedi

Joined: 09 Mar 2004
Posts: 2080

banjar1 wrote:

I tried "endmqm -i" first - and it worked :/ How could it work if QM was "unavailable"??

It all depends what "unavailable" really means. I don't think there is a precise definition.

But if your endmqm command worked, it must have been sufficiently "available" to allow the endmqm command to do its work.

It would be interesting to know precisely the list of processes that were missing.

Vitor

Posted: Fri May 28, 2010 5:27 am Post subject: Re: MQ crash, no FDC, no reason, no trace...

Grand High Poobah

Joined: 11 Nov 2005
Posts: 26093
Location: Texas, USA

mvic wrote:

It would be interesting to know precisely the list of processes that were missing.

It would also be interesting to know the output of a dspmq command.
_________________
Honesty is the best policy.
Insanity is the best defence.

banjar1

Posted: Fri May 28, 2010 5:30 am Post subject:

Acolyte

Joined: 29 Nov 2006
Posts: 54
Location: FRA

I know, both these things would be interesting. Unfortunately I don't have them; I was focused on bringing the QM back on-line.
The most important thing for me now is to find a way to detect such a problem sooner in the future, rather than only when someone tries to initiate a channel to HUB. Any ideas?

mqjeff

Posted: Fri May 28, 2010 5:47 am Post subject:

Grand Master

Joined: 25 Jun 2008
Posts: 17447

The question is what died or what was otherwise preventing the channel from starting.

It's possible, for example, that you merely hit MaxActiveChannels.

Or that you otherwise hit a resource limitation that prevented the qmgr from starting a new channel instance.

mvic

Posted: Fri May 28, 2010 5:52 am Post subject:

Jedi

Joined: 09 Mar 2004
Posts: 2080

banjar1 wrote:

I know, both these things would be interesting. Unfortunately I don't have them; I was focused on bringing the QM back on-line.

Not only "interesting", important too, if you want to have any hope of success when presenting the problem on a forum or to IBM.

That being said, I understand completely the situation you were in at the time. But if there is a "next time" that this happens, then store away some details for later.

Storing away the output of things like "ps -ef", "ipcs -a" and (if you ever needed to get IBM support involved) an MQ trace capturing the failure of the failing program or command would help with the later investigation work.
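As a rough sketch of "store away some details", something like the following could be run just before endmqm. It is Python rather than anything MQ-specific; the function name collect_snapshot and the output directory are made up for illustration, and dspmq is added to the command list only because it was asked about earlier in the thread.

```python
import subprocess
import time
from pathlib import Path

# Commands named in this thread. Adjust the list to suit your system.
DEFAULT_COMMANDS = {
    "ps": ["ps", "-ef"],
    "ipcs": ["ipcs", "-a"],
    "dspmq": ["dspmq"],
}


def collect_snapshot(out_root, commands=DEFAULT_COMMANDS, timeout=30):
    """Run each command and save its output under a timestamped directory."""
    out_dir = Path(out_root) / time.strftime("snapshot-%Y%m%d-%H%M%S")
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, argv in commands.items():
        try:
            result = subprocess.run(
                argv, capture_output=True, text=True, timeout=timeout
            )
            text = result.stdout + result.stderr
            text += f"\n[exit code {result.returncode}]\n"
        except (OSError, subprocess.TimeoutExpired) as exc:
            # A hung or missing command is itself evidence; record it and move on.
            text = f"[could not run {argv!r}: {exc}]\n"
        (out_dir / f"{name}.txt").write_text(text)
    return out_dir
```

Calling collect_snapshot("/var/tmp/mq-diag") (a hypothetical path) takes a few seconds at most, and the timeout keeps a hung command from delaying the restart.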

Hope this helps

mvic

Posted: Fri May 28, 2010 5:56 am Post subject: Re: MQ crash, no FDC, no reason, no trace...

Jedi

Joined: 09 Mar 2004
Posts: 2080

banjar1 wrote:

Unfortunately the 3 AMQERR*.LOGs in /var/mqm/qmgrs/name/errors already cycled past the time of crash, so no info from them either (and we have them set to 1MB each :/ ).

That's a pity.

I assume this means you have a lot of "chatter" from client applications in those error logs..?

Maybe time to go higher than 1 MB.
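Besides making the logs bigger, another option is to copy them somewhere safe on a schedule so that a busy log cannot cycle the evidence away. A minimal sketch, assuming it runs from cron or similar; archive_error_logs and the archive location are hypothetical names, not anything MQ provides:

```python
import shutil
import time
from pathlib import Path


def archive_error_logs(errors_dir, archive_dir):
    """Copy AMQERR*.LOG files that changed since the last run into archive_dir."""
    archive = Path(archive_dir)
    archive.mkdir(parents=True, exist_ok=True)
    copied = []
    for log in sorted(Path(errors_dir).glob("AMQERR*.LOG")):
        mtime = int(log.stat().st_mtime)
        stamp = time.strftime("%Y%m%d-%H%M%S", time.localtime(mtime))
        target = archive / f"{log.stem}.{stamp}.LOG"
        if not target.exists():  # same name means this version is already saved
            shutil.copy2(log, target)
            copied.append(target)
    return copied
```

The archive grows without bound, so old copies need pruning separately.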

bruce2359

Posted: Fri May 28, 2010 6:14 am Post subject:

Poobah

Joined: 05 Jan 2008
Posts: 9475
Location: US: west coast, almost. Otherwise, enroute.

Quote:

From these statements, we can appreciate the pressure you feel to get the problem(s) fixed - what you believe to be the most important thing.

I'd suggest that in your (understandable) haste, you missed an important opportunity to gather sufficient information that would lead you to detect such a problem in the future.

There is a difference between a symptom and a problem. Your post described the symptom. Without any details, we are only left to guess and speculate. I'd say that there are dozens or hundreds of possible causes of this symptom. A quick search here for "unavailable" will demonstrate this.

Basic problem-determination demands observation and documentation BEFORE attempting to fix anything. As a parallel, ALT+CTL+DELETE in the Windoze world only fixes the symptom; and never corrects the underlying problem.

While doing basic p-d might slow down the process of resolving the immediate symptom, it is critical to finding the underlying problem - the underlying thing(s) that need(s) to be fixed.

You might want to begin by downloading and reading the WMQ Problem Determination manual. You might also want to take a WMQ System Administration course. Of course, these are long(er)-term activities, while you are demanding an immediate solution.

You are not alone in the pressure (either yours or your management) to get it fixed as quickly as possible; but cooler heads must prevail if you are to learn from outages like this.
_________________
I like deadlines. I like to wave as they pass by.
ב''ה
Lex Orandi, Lex Credendi, Lex Vivendi. As we Worship, So we Believe, So we Live.

warrenJ

Posted: Tue Jun 01, 2010 9:03 pm Post subject:

Apprentice

Joined: 11 Jan 2004
Posts: 29
Location: AUSTRALIA

I'd suggest upgrading to v7.0.1.2 as soon as possible.

We have had similar problems on v7.0.1.1 on AIX with the QManager crashing (with FDC files) and Sender channel terminating (with NO FDC files). There were many other issues that also caused us grief in v7.0.1.1 and v7.0.1.0.

banjar1

Posted: Thu Jun 03, 2010 11:12 pm Post subject:

Acolyte

Joined: 29 Nov 2006
Posts: 54
Location: FRA

Thanks warrenJ, that's also my opinion. Considering how much is fixed in 7.0.1.2 we must upgrade asap.

Has anyone any ideas what to monitor to avoid it in the future?

mvic

Posted: Fri Jun 04, 2010 8:17 am Post subject:

Jedi

Joined: 09 Mar 2004
Posts: 2080

banjar1 wrote:

Thanks warrenJ, that's also my opinion. Considering how much is fixed in 7.0.1.2 we must upgrade asap.

Has anyone any ideas what to monitor to avoid it in the future?

The list of processes.
The qmgr error logs in /var/mqm/qmgrs/QMNAME/errors
The system-wide error logs in /var/mqm/errors

If the failure was because of a bug in MQ, monitoring will not avoid the problem, but will help you know it has happened as soon as possible.
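To illustrate the first item in that list, here is a small Python sketch that scans saved "ps -ef" output for the queue manager's processes. The names in EXPECTED are an assumption based on a typical MQ 7 queue manager, and the matching on the queue manager name is deliberately loose; compare against "ps -ef" on your own healthy system before relying on it.

```python
# Process names are an assumption; check them against a healthy system.
EXPECTED = ["amqzxma0", "amqzmuc0", "amqrrmfa", "amqpcsea", "runmqchi", "amqzlaa0"]


def missing_processes(ps_output, qmgr, expected=EXPECTED):
    """Return the expected process names with no 'ps -ef' line for this qmgr."""
    found = set()
    for line in ps_output.splitlines():
        fields = line.split()
        # Accept "-m QM", "-mQM" and a bare "QM" argument on the command line.
        if qmgr not in fields and "-m" + qmgr not in fields:
            continue
        for field in fields:
            name = field.rsplit("/", 1)[-1]  # strip any directory prefix
            if name in expected:
                found.add(name)
    return [name for name in expected if name not in found]
```

An empty list means every expected process was seen. Anything else is worth an alert, and a copy of the ps output is worth keeping.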

mvic

Posted: Fri Jun 04, 2010 8:22 am Post subject:

Jedi

Joined: 09 Mar 2004
Posts: 2080

warrenJ wrote:

... Sender channel terminating (with NO FDC files).

There are dozens of reasons a channel can end, e.g. network failures, firewalls breaking connections, etc. FDCs would not be expected in these cases.

ramires

Posted: Fri Jun 04, 2010 11:05 am Post subject: Re: MQ crash, no FDC, no reason, no trace...

Knight

Joined: 24 Jun 2001
Posts: 523
Location: Portugal - Lisboa

banjar1 wrote:

I tried to access the QM locally on HUB with runmqsc and got similar message.

Do you remember the error runmqsc returned? Did you check file system space?

banjar1

Posted: Mon Jun 07, 2010 12:31 am Post subject:

Acolyte

Joined: 29 Nov 2006
Posts: 54
Location: FRA

Another crash just happened this morning. Exactly the same situation - unfortunately I wasn't there to instruct the helpdesk what they should collect.
All I know is that suddenly our (simple) monitoring tool wasn't able to connect to QM, also runmqsc failed with "QM is not currently available".
Funny, but triggering kept working - my publishing script was started, but hung, unable to connect to QM.
No FDC, and no entry in any AMQERR*LOG apart from this:

----- amqrccca.c : 921 --------------------------------------------------------
06/07/10 05:56:09 - Process(19939444.1) User(mqm) Program(runmqchl_nd)
Host(qlhhubfc)
AMQ9508: Program cannot connect to the queue manager.

EXPLANATION:
The connection attempt to queue manager 'QLHHUB' failed with reason code 2059.
ACTION:
Ensure that the queue manager is available and operational.
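Reason code 2059 is MQRC_Q_MGR_NOT_AVAILABLE, i.e. the connect itself is refused. A probe that actually connects should therefore notice this state even while existing channels keep running. A minimal Python sketch, assuming runmqsc is on the PATH of the user running it; qmgr_responds is a made-up name, and the timeout is there because the publishing script in this incident hung rather than failed:

```python
import subprocess


def qmgr_responds(qmgr, timeout=15, argv=None):
    """Return True if 'PING QMGR' through runmqsc exits with code 0."""
    argv = argv or ["runmqsc", qmgr]
    try:
        result = subprocess.run(
            argv, input="PING QMGR\n", capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired):
        # Treat a hang or a missing binary as "not responding".
        return False
    return result.returncode == 0
```

Run every minute or so, qmgr_responds("QLHHUB") returning False would raise the alarm well before a remote channel fails to start.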

Well, there is one FDC, but created AFTER the problems were noticed:

+-----------------------------------------------------------------------------+
| |
| WebSphere MQ First Failure Symptom Report |
| ========================================= |
| |
| Date/Time :- Mon June 07 2010 05:30:07 CUT |
| UTC Time :- 1275888607.087334 |
| UTC Time Offset :- 0 (CUT) |
| Host Name :- qlhhubfc |
| Operating System :- AIX 5.3 |
| PIDS :- 5724H7221 |
| LVLS :- 7.0.1.1 |
| Product Long Name :- WebSphere MQ for AIX |
| Vendor :- IBM |
| Probe Id :- XY051170 |
| Application Name :- MQM |
| Component :- InitPrivateServices |
| SCCS Info :- lib/cs/unix/generic/amqxiinx.c, 1.297.1.3 |
| Line Number :- 1212 |
| Build Date :- Dec 22 2009 |
| CMVC level :- p701-101-091221 |
| Build Type :- IKAP - (Production) |
| Effective UserID :- 436 (UNKNOWN) |
| Real UserID :- 0 () |
| Program Name :- runmqsc |
| Addressing mode :- 64-bit |
| Process :- 18628782 |
| Thread(n) :- 1 |
| UserApp :- FALSE |
| Last HQC :- 0.0.0-0 |
| Last HSHMEMB :- 0.0.0-0 |
| Major Errorcode :- OK |
| Minor Errorcode :- OK |
| Probe Type :- INCORROUT |
| Probe Severity :- 4 |
| Probe Description :- AMQ6125: An internal WebSphere MQ error has occurred. |
| FDCSequenceNumber :- 0 |
| Comment1 :- xcsGetpwuid failed to get password entry for process |
| with real uid 326. |
| Comment2 :- Details: getuid() returned 326; getpwuid(326) failed |
| with errno=5. |
| Comment3 :- A user name of "UNKNOWN" will be used, which will |
| likely cause later authorisation failures. Note this FFST can be turned |
| off by exporting env var AMQ_NOFFST_PROCESS_UID. |
| |
+-----------------------------------------------------------------------------+

The user '326' is our administration user, used routinely.

mvic

Posted: Mon Jun 07, 2010 12:57 am Post subject:

Jedi

Joined: 09 Mar 2004
Posts: 2080

banjar1 wrote:

Details: getuid() returned 326; getpwuid(326) failed with errno=5.

Check the times again.. didn't the FDC happen before the error log entry?

Anyway the getuid/getpwuid failure looks like a failure in the username subsystem.. getuid said 326, and getpwuid (when called with 326 as input) said "fail" with errno=5 which is EIO. Check with your OS admins.

UPDATE: The FDC is also from a different process (pid 18628782) than the one writing the error log entry (19939444).
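If the OS admins want something to watch, the same uid-to-name lookup can be repeated outside MQ. A hedged Python sketch: check_uid is a hypothetical helper, and whether Python reports an EIO-style failure as OSError or as KeyError may depend on the Python version and platform, so both are handled.

```python
import errno
import pwd


def check_uid(uid, lookup=pwd.getpwuid):
    """Return (ok, detail) for a uid-to-name lookup like the one in the FDC."""
    try:
        return True, lookup(uid).pw_name
    except KeyError:
        # No entry found (some Python versions may report other failures here too).
        return False, f"uid {uid} has no password entry"
    except OSError as exc:
        code = errno.errorcode.get(exc.errno, "unknown")
        return False, f"lookup of uid {uid} failed with errno={exc.errno} ({code})"
```

Logging the result of check_uid(326) at intervals would show whether the lookup fails only around the time of the outages or more often than that.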

Last edited by mvic on Mon Jun 07, 2010 2:01 am; edited 1 time in total
