aboggis
Posted: Mon Sep 20, 2004 3:54 pm Post subject: Queue manager failure... large numbers of processes.
Centurion
Joined: 18 Dec 2001 Posts: 105 Location: Auburn, California
I recently had a system "go down" when it ran out of file descriptors. This is a Solaris 8 system, 12 CPUs, 24 GB RAM, MQ v5.3, CSD05.
The max file descriptor limit is set to 8192.
We had three HUGE FFST (*.FDC) files generated, but examining them simply confirmed that the file descriptors were exhausted.
The sysadmin copied the process table from the time of the crash... I counted 847 MQ processes, mostly instances of amqzlaa0_nd (queue manager agent process) and amqrmppa (channel receiver process).
Has anyone seen this before? Is there a bug here? Does anyone know why so many of these processes would be created?
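For what it's worth, counts like that can be pulled straight from a saved process table; this assumes a plain "ps -ef" capture, and pstable.txt is just an example file name:
Code:
# total MQ processes in the saved capture
grep -c amq pstable.txt
# instances of the two main offenders
grep -c amqzlaa0 pstable.txt
grep -c amqrmppa pstable.txt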
siliconfish
Posted: Mon Sep 20, 2004 5:46 pm Post subject:
Master
Joined: 12 Aug 2002 Posts: 203 Location: USA
Check if the applications connecting to this queue manager are properly closing the connections.
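For reference, a minimal sketch of what a clean shutdown looks like in a C MQI application, with every MQOPEN paired with an MQCLOSE and every MQCONN with an MQDISC (the queue manager and queue names here are examples):
Code:
#include <string.h>
#include <cmqc.h>   /* MQI definitions */

void clean_shutdown(void)
{
    MQHCONN  hConn;                  /* connection handle */
    MQHOBJ   hObj;                   /* object handle     */
    MQOD     od = {MQOD_DEFAULT};    /* object descriptor */
    MQLONG   cc, rc;
    MQCHAR48 qmName = "QM1";         /* example name      */

    MQCONN(qmName, &hConn, &cc, &rc);
    if (cc == MQCC_FAILED) return;

    strncpy(od.ObjectName, "APP.QUEUE", MQ_Q_NAME_LENGTH);
    MQOPEN(hConn, &od, MQOO_OUTPUT, &hObj, &cc, &rc);

    /* ... MQPUT / MQGET work here ... */

    MQCLOSE(hConn, &hObj, MQCO_NONE, &cc, &rc); /* close the queue */
    MQDISC(&hConn, &cc, &rc);                   /* skipping this is how
                                                   agent processes and their
                                                   descriptors get orphaned */
}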
aboggis
Posted: Mon Sep 20, 2004 7:13 pm Post subject:
Centurion
Joined: 18 Dec 2001 Posts: 105 Location: Auburn, California
Good call, but since this is now after the event, it's difficult to tell.
In general, if an application (process/thread) does not cleanly disconnect, how long is it before MQ reclaims the "leaked" resources?
PeterPotkay
Posted: Tue Sep 21, 2004 7:04 am Post subject:
Poobah
Joined: 15 May 2001 Posts: 7722
aboggis wrote:
In general, if an application (process/thread) does not cleanly disconnect, how long is it before MQ reclaims the "leaked" resources?
TCP KeepAlive will recognize that the other side is gone. The default is 2 hours.
Alter that to a realistic value and make sure the QM is configured to use it.
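On the MQ side that means a TCP stanza in qm.ini; the 2-hour default itself is a kernel setting, which on Solaris can be tuned with ndd (the 5-minute figure below is only an example, in milliseconds):
Code:
# qm.ini -- have the queue manager request KeepAlive on its sockets
TCP:
   KeepAlive=Yes

# Solaris: drop the keepalive timer from the 2-hour default to 5 minutes
ndd -set /dev/tcp tcp_keepalive_interval 300000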
_________________
Peter Potkay
Keep Calm and MQ On
aboggis
Posted: Tue Sep 21, 2004 7:54 am Post subject:
Centurion
Joined: 18 Dec 2001 Posts: 105 Location: Auburn, California
You mean KeepAlive at the protocol level?
I have HBINT set, but I'll check what I set for KeepAlive in the ini file.
I should also add that there are no client connections involved. The applications putting/getting messages run on the same host as "their" local queue manager. The *ONLY* channels defined on our queue managers are the cluster sender/receiver channels.
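That inventory is easy to double-check from runmqsc, e.g.:
Code:
* List every defined channel and its type
DISPLAY CHANNEL(*) CHLTYPE
* Show the state of each current channel instance
DISPLAY CHSTATUS(*) STATUS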
PeterPotkay
Posted: Tue Sep 21, 2004 8:25 am Post subject:
Poobah
Joined: 15 May 2001 Posts: 7722
aboggis wrote:
You mean KeepAlive at the protocol level?
Yes.
aboggis wrote:
I should also add that there are no client connections involved. The applications putting/getting messages run on the same host as "their" local queue manager. The *ONLY* channels defined on our queue managers are the cluster sender/receiver channels.
That makes it less likely that orphaned connections caused this.
I responded to the same question on the listserv, where I said I saw the same thing on our Windows 2000 5.3 CSD04 servers, and IBM sent us two new DLLs to patch the problem until CSD08 came out. When we saw this, the only recourse was to reboot the server, because the QM wouldn't respond to anything, not even a shutdown attempt. Same symptoms: hundreds of MQ processes.
_________________
Peter Potkay
Keep Calm and MQ On
aboggis
Posted: Tue Sep 21, 2004 9:39 am Post subject:
Centurion
Joined: 18 Dec 2001 Posts: 105 Location: Auburn, California
Well, I shall be putting in a call to support shortly, since this is starting to happen across multiple hosts now.
The only change with regard to MQ configs has been to reduce the values of HBINT (changed to 5), BATCHHB (also to 5) and NPMSPEED (to NORMAL).
I set these values lower because we need to "know" about channel failures as quickly as possible (using a cluster workload exit) so that our applications can re-route messages over an available channel.
I had overlooked the KeepAlive setting in qm.ini.
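For the record, those values amount to something like this on the cluster-receiver definitions (TO.QM1 is an example channel name; the auto-defined cluster-senders pick the attributes up from it). Note that HBINT is specified in seconds while BATCHHB is in milliseconds:
Code:
* Example only -- channel name will differ
ALTER CHANNEL(TO.QM1) CHLTYPE(CLUSRCVR) +
      HBINT(5) BATCHHB(5) NPMSPEED(NORMAL)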
PeterPotkay
Posted: Tue Sep 21, 2004 11:20 am Post subject:
Poobah
Joined: 15 May 2001 Posts: 7722
Yeah, we just bounced the server the first 2 times we saw it. At that point we called IBM.
RE: HBINT set to 5:
By default the receiver times out and assumes a communications failure if no data arrives within twice the negotiated heartbeat interval (when that interval is under 60 seconds), or within the negotiated interval plus 60 seconds (when it is 60 seconds or more). The RCVR channel then goes INACTIVE.
Make sure both sides of the channel specify the same value; otherwise the larger of the two is used.
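Put as a tiny illustration (not IBM code), the rule works out like this:
Code:
#include <stdio.h>

/* Receiver time-out rule described above:
   under 60s negotiated -> 2 x HBINT; otherwise HBINT + 60s */
static int rcvr_timeout(int hbint)
{
    return (hbint < 60) ? 2 * hbint : hbint + 60;
}

int main(void)
{
    printf("HBINT 5  -> %d seconds\n", rcvr_timeout(5));   /* 10  */
    printf("HBINT 90 -> %d seconds\n", rcvr_timeout(90));  /* 150 */
    return 0;
}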
_________________
Peter Potkay
Keep Calm and MQ On
aboggis
Posted: Tue Sep 21, 2004 1:25 pm Post subject:
Centurion
Joined: 18 Dec 2001 Posts: 105 Location: Auburn, California
Right now I have a system that has almost 500 MQ processes active... mostly instances of amqrmppa (channel receiver) and amqzlaa0 (qmgr agent). To all intents and purposes the queue manager is "unavailable" (all remote qmgrs with channels to this qmgr are stuck in "binding") and runmqsc is unresponsive.
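Since runmqsc is hung locally, the stuck channels can at least be observed from one of the remote queue managers, e.g. (TO.SICKQM is an example channel name):
Code:
* Run on a remote queue manager with a channel to the hung one
DISPLAY CHSTATUS(TO.SICKQM) STATUS MSGS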