aboggis
Posted: Mon Sep 20, 2004 3:54 pm Post subject: Queue manager failure... large numbers of processes.
Centurion
Joined: 18 Dec 2001 Posts: 105 Location: Auburn, California
I recently had a system "go down" when it ran out of file descriptors. This is a Solaris 8 system, 12 CPUs, 24 GB RAM, MQ v5.3, CSD05.
The max file descriptor limit is set to 8192.
We had three HUGE FFST (*.FDC) files generated, but examining them simply confirmed that the file descriptors were exhausted.
The sysadmin copied the process table from the time of the crash... I counted 847 MQ processes, mostly instances of amqzlaa0_nd (queue manager agent process) and amqrmppa (channel receiver process).
Has anyone seen this before? Is there a bug here? Does anyone know why so many of these processes would be created?
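For what it's worth, counts like that can be pulled straight from a saved process table; this assumes a plain "ps -ef" capture, and pstable.txt is just an example file name:
Code:
# total MQ processes in the saved capture
grep -c amq pstable.txt
# instances of the two main offenders
grep -c amqzlaa0 pstable.txt
grep -c amqrmppa pstable.txt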
siliconfish
Posted: Mon Sep 20, 2004 5:46 pm Post subject:
Master
Joined: 12 Aug 2002 Posts: 203 Location: USA
Check if the applications connecting to this queue manager are properly closing the connections.
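For reference, a minimal sketch of what a clean shutdown looks like in a C MQI application, with every MQOPEN paired with an MQCLOSE and every MQCONN with an MQDISC (the queue manager and queue names here are examples):
Code:
#include <string.h>
#include <cmqc.h>   /* MQI definitions */

void clean_shutdown(void)
{
    MQHCONN  hConn;                  /* connection handle */
    MQHOBJ   hObj;                   /* object handle     */
    MQOD     od = {MQOD_DEFAULT};    /* object descriptor */
    MQLONG   cc, rc;
    MQCHAR48 qmName = "QM1";         /* example name      */

    MQCONN(qmName, &hConn, &cc, &rc);
    if (cc == MQCC_FAILED) return;

    strncpy(od.ObjectName, "APP.QUEUE", MQ_Q_NAME_LENGTH);
    MQOPEN(hConn, &od, MQOO_OUTPUT, &hObj, &cc, &rc);

    /* ... MQPUT / MQGET work here ... */

    MQCLOSE(hConn, &hObj, MQCO_NONE, &cc, &rc); /* close the queue */
    MQDISC(&hConn, &cc, &rc);                   /* skipping this is how
                                                   agent processes and their
                                                   descriptors get orphaned */
}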
aboggis
Posted: Mon Sep 20, 2004 7:13 pm Post subject:
Centurion
Joined: 18 Dec 2001 Posts: 105 Location: Auburn, California
Good call, but since this is now after the event, it's difficult to tell.
In general, if an application (process/thread) does not cleanly disconnect, how long is it before MQ reclaims the "leaked" resources?
PeterPotkay
Posted: Tue Sep 21, 2004 7:04 am Post subject:
Poobah
Joined: 15 May 2001 Posts: 7722
aboggis wrote:
In general, if an application (process/thread) does not cleanly disconnect, how long is it before MQ reclaims the "leaked" resources?
TCP KeepAlive will recognize that the other side is gone. The default is 2 hours.
Alter that to a realistic value and make sure the QM is configured to use it.
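On the MQ side that means a TCP stanza in qm.ini; the 2-hour default itself is a kernel setting, which on Solaris can be tuned with ndd (the 5-minute figure below is only an example, in milliseconds):
Code:
# qm.ini -- have the queue manager request KeepAlive on its sockets
TCP:
   KeepAlive=Yes

# Solaris: drop the keepalive timer from the 2-hour default to 5 minutes
ndd -set /dev/tcp tcp_keepalive_interval 300000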
_________________
Peter Potkay
Keep Calm and MQ On
aboggis
Posted: Tue Sep 21, 2004 7:54 am Post subject:
Centurion
Joined: 18 Dec 2001 Posts: 105 Location: Auburn, California
You mean KeepAlive at the protocol level?
I have HBINT set, but I'll check what I set for KeepAlive in the ini file.
I should also add that there are no client connections involved. The applications putting/getting messages run on the same host as "their" local queue manager. The *ONLY* channels defined on our queue managers are the cluster sender/receiver channels.
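That inventory is easy to double-check from runmqsc, e.g.:
Code:
* List every defined channel and its type
DISPLAY CHANNEL(*) CHLTYPE
* Show the state of each current channel instance
DISPLAY CHSTATUS(*) STATUS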
PeterPotkay
Posted: Tue Sep 21, 2004 8:25 am Post subject:
Poobah
Joined: 15 May 2001 Posts: 7722
aboggis wrote:
You mean KeepAlive at the protocol level?
Yes.
aboggis wrote:
I should also add that there are no client connections involved. The applications putting/getting messages run on the same host as "their" local queue manager. The *ONLY* channels defined on our queue managers are the cluster sender/receiver channels.
That makes it less likely that orphaned connections caused this.
I responded to the same question on the listserv, where I said I saw the same thing on our Windows 2000 5.3 CSD04 servers, and IBM sent us two new DLLs to patch the problem until CSD08 came out. When we saw this, the only recourse was to reboot the server, because the QM wouldn't respond to anything, not even a shutdown attempt. Same symptoms: hundreds of MQ processes.
_________________
Peter Potkay
Keep Calm and MQ On
aboggis
Posted: Tue Sep 21, 2004 9:39 am Post subject:
Centurion
Joined: 18 Dec 2001 Posts: 105 Location: Auburn, California
Well, I shall be putting in a call to support shortly, since this is starting to happen across multiple hosts now.
The only change with regard to MQ configs has been to reduce the values of HBINT (changed to 5), BATCHHB (also to 5) and NPMSPEED (to NORMAL).
I set these values lower because we need to "know" about channel failures as quickly as possible (using a cluster workload exit) so that our applications can re-route messages over an available channel.
I had overlooked the KeepAlive setting in qm.ini.
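For the record, those values amount to something like this on the cluster-receiver definitions (TO.QM1 is an example channel name; the auto-defined cluster-senders pick the attributes up from it). Note that HBINT is specified in seconds while BATCHHB is in milliseconds:
Code:
* Example only -- channel name will differ
ALTER CHANNEL(TO.QM1) CHLTYPE(CLUSRCVR) +
      HBINT(5) BATCHHB(5) NPMSPEED(NORMAL)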
PeterPotkay
Posted: Tue Sep 21, 2004 11:20 am Post subject:
Poobah
Joined: 15 May 2001 Posts: 7722
Yeah, we just bounced the server the first 2 times we saw it. At that point we called IBM.
RE: HBINT set to 5:
By default the receiver times out and assumes a communications failure if no data arrives within twice the negotiated heartbeat interval (when that interval is under 60 seconds), or within the negotiated interval plus 60 seconds (when it is 60 seconds or more). The RCVR channel then goes INACTIVE.
Make sure both sides of the channel specify the same value; otherwise the larger of the two is used.
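Put as a tiny illustration (not IBM code), the rule works out like this:
Code:
#include <stdio.h>

/* Receiver time-out rule described above:
   under 60s negotiated -> 2 x HBINT; otherwise HBINT + 60s */
static int rcvr_timeout(int hbint)
{
    return (hbint < 60) ? 2 * hbint : hbint + 60;
}

int main(void)
{
    printf("HBINT 5  -> %d seconds\n", rcvr_timeout(5));   /* 10  */
    printf("HBINT 90 -> %d seconds\n", rcvr_timeout(90));  /* 150 */
    return 0;
}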
_________________
Peter Potkay
Keep Calm and MQ On
aboggis
Posted: Tue Sep 21, 2004 1:25 pm Post subject:
Centurion
Joined: 18 Dec 2001 Posts: 105 Location: Auburn, California
Right now I have a system that has almost 500 MQ processes active... mostly instances of amqrmppa (channel receiver) and amqzlaa0 (qmgr agent). To all intents and purposes the queue manager is "unavailable" (all remote qmgrs with channels to this qmgr are stuck in "binding") and runmqsc is unresponsive.
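Since runmqsc is hung locally, the stuck channels can at least be observed from one of the remote queue managers, e.g. (TO.SICKQM is an example channel name):
Code:
* Run on a remote queue manager with a channel to the hung one
DISPLAY CHSTATUS(TO.SICKQM) STATUS MSGS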