GFORCE
Posted: Fri Feb 15, 2008 9:34 am    Post subject: AMQRMPPA in AIX
Voyager
Joined: 16 Jun 2003    Posts: 78    Location: WISCONSIN
We recycle our test AIX box weekly, and every time I have to log on and kill the AMQRMPPA process to recycle MQ. Is there any way around this besides killing the process?
_________________
THANKS

PeterPotkay
Posted: Fri Feb 15, 2008 10:13 am
Poobah
Joined: 15 May 2001    Posts: 7722
I have the same problem on my Linux x86 32-bit QMs. It happened at MQ 6.0.1.0, 6.0.2.0 and 6.0.2.1. We're going to 6.0.2.3 soon, hoping the problem goes away. It's annoying. It doesn't happen every time. Sometimes if I wait 5-10 minutes they eventually stop, but usually when you are restarting a QM you don't have time to sit there and wait for who knows how long.
Turning trace on makes the problem go away.
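For anyone who wants to reproduce the workaround, here is a minimal sketch of toggling queue manager trace around a restart; the queue manager name is just a placeholder.

Code:
# hypothetical example: trace the QM while it is recycled
strmqtrc -m TEST.QM1 -t all    # start tracing everything for this QM
endmqm -i TEST.QM1             # immediate shutdown while trace is active
strmqm TEST.QM1                # restart
endmqtrc -m TEST.QM1           # stop tracing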
_________________
Peter Potkay
Keep Calm and MQ On

PeterPotkay
Posted: Mon Feb 25, 2008 6:05 pm
Poobah
Joined: 15 May 2001    Posts: 7722
Has anyone else got this problem? Our MQ shutdown scripts run endmqlsr first, then endmqm -i. Yet if the QM has more than a few running client channels (i.e. there is more than one amqrmppa process running), more often than not we have to kill those amqrmppa processes. Even when they do go down on their own it takes 10-15 minutes. As I said before, running trace seems to make the problem go away. I just upgraded to 6.0.2.3 and the problem is still there.
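For reference, a rough sketch of that shutdown order; the queue manager name is a placeholder and the lingering-process check is only illustrative.

Code:
QM=TEST.QM1                    # placeholder queue manager name

endmqlsr -m $QM                # stop the listener so no new channels start
endmqm -i $QM                  # immediate shutdown of the queue manager

# see whether any channel pooling (amqrmppa) processes are still running
ps -ef | grep "[a]mqrmppa"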
_________________
Peter Potkay
Keep Calm and MQ On

GFORCE
Posted: Mon Mar 03, 2008 8:36 am    Post subject: AMQRMPPA in AIX
Voyager
Joined: 16 Jun 2003    Posts: 78    Location: WISCONSIN
I set the TCP keepalive parameter in the qm.ini and it appears to work as well. I am still trying several options, and I will try the trace option as you suggested, but I have to go through our change control with every change.....
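For anyone looking for that setting, this is roughly what the stanza looks like in the queue manager's qm.ini (a sketch, not a verified fix; as far as I know the actual keepalive timing still comes from the operating system's TCP tunables):

Code:
TCP:
   KeepAlive=Yes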
_________________
THANKS

PeterPotkay
Posted: Tue Mar 11, 2008 9:28 am
Poobah
Joined: 15 May 2001    Posts: 7722
IBM identified a problem and will be creating an interim fix. Tracing fixed the problem each time. They gave us a command to run every few seconds while the QM was taking forever to come down; it produced an FDC each time, and that finally highlighted the problem.
Quote:
Based on the supplied data, we have narrowed down the cause of the
endmqm delay to two parts of the code which stops channels. We are doing
some further testing in order to better understand the delay and hope to
produce an interim fix later this week. I expect to report back with
more information tomorrow.
************
Update on 11th March:
Further to yesterday's update, the FDCs supplied again showed that it
was the ending of channels which delayed the endmqm process. In
particular, channel process 16634 kept running for a very long time
after the queue manager was asked to end. The endmqm process was waiting
for channel process 16634 to end before it could finish.
.
When we look at what the FDCs showed for 16634, it seems that there were
a number of channel threads (e.g. threads 82, 85, 92 and 94) still
active inside it. Since these threads had not finished processing, the
process had not ended.
.
The FFST shows what these last threads had been doing as the queue
manager ended. We see that most threads had noticed that the queue
manager had ended, but then carried on regardless. For example, here is
an excerpt from thread 94's history:
.
----} zstMQGET rc=lrcE_Q_MGR_STOPPING
---} MQGET rc=lrcE_Q_MGR_STOPPING
...
---{ MQCLOSE
----{ zstMQCLOSE
-----{ zstVerifyPCD
-----} zstVerifyPCD rc=OK
-----{ zutCallApiExitsBeforeClose
------{ APIExit
-------{ MQGET
--------{ zstMQGET
---------{ zstVerifyPCD
---------} zstVerifyPCD rc=OK
---------{ ziiBreakConnection
---------} ziiBreakConnection rc=OK
--------} zstMQGET rc=lrcE_CONNECTION_BROKEN
-------} MQGET rc=lrcE_CONNECTION_BROKEN
------} APIExit rc=OK
-----} zutCallApiExitsBeforeClose rc=OK
-----{ zutCallApiExitsAfterClose
------{ APIExit
------} APIExit rc=lrcE_CONNECTION_BROKEN
-----} zutCallApiExitsAfterClose rc=OK
-----{ ziiBreakConnection
-----} ziiBreakConnection rc=OK
----} zstMQCLOSE rc=lrcE_CONNECTION_BROKEN
---} MQCLOSE rc=lrcE_CONNECTION_BROKEN
...
---{ ccxReceive
----{ cciTcpReceive
-----{ ccxAllocMem
-----} ccxAllocMem rc=OK
-----{ recv
-----} recv rc=Unknown(FFFF)
-----{ xcsWaitFd
------{ poll
------} poll rc=Unknown(1)
-----} xcsWaitFd rc=Unknown(1)
-----{ recv
-----} recv rc=Unknown(FFFF)
-----{ xcsWaitFd
------{ poll
------} poll rc=Unknown(1)
.
Despite knowing that the queue manager is ending and that its own
connection to the queue manager has been broken, the thread continued to
run and poll its network socket for more MQI calls from the client.
However, even if such a call arrived there would be nothing useful that
the channel could do with it because its connection has gone. So the
thread should really have ended at that point. It is only after multiple
failed poll() calls that the channel threads finally time out and end,
which allows endmqm processing to complete.
.
We should point out that client applications should specify the
appropriate FAIL_IF_QUIESCING option on all of their MQI calls in order
to speed up endmqm processing. The trace supplied on 3rd March shows
some clients which are not using the "fail if quiescing" option.
However, I believe that endmqm -i should still end the queue manager
within a reasonable time regardless of the MQI options. For this reason,
I think the queue manager should try harder to end client channels than
it currently does.
.
Based on the sequence of events in the FFSTs, it is clear that all of
the threads which failed to end had received MQRC_Q_MGR_STOPPING and
MQRC_CONNECTION_BROKEN as early as 08:12:39. Had they detected this fact
they would have ended much sooner, instead of hanging around until
18:18:55 when endmqm finally finished.
.
We are building a test fix which adds extra checking to the server
(SVRCONN) end of the channel in order to better handle shutdown in cases
where MQI calls report that the queue manager is ending. I will also
include additional FFST diagnostics in the code so as to produce better
SIGUSR2 FDC files in cases of future delays.
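To make the "fail if quiescing" advice concrete, here is a minimal sketch of a client get loop that sets the option. The queue manager and queue names are placeholders, and this is illustrative rather than a tested program.

Code:
#include <string.h>
#include <cmqc.h>                               /* MQI definitions */

int main(void)
{
    MQHCONN hConn;
    MQHOBJ  hObj;
    MQOD    od  = {MQOD_DEFAULT};
    MQMD    md  = {MQMD_DEFAULT};
    MQGMO   gmo = {MQGMO_DEFAULT};
    MQLONG  cc, rc, msgLen;
    MQCHAR  qmName[MQ_Q_MGR_NAME_LENGTH] = "TEST.QM1";      /* placeholder */
    char    buffer[4096];

    MQCONN(qmName, &hConn, &cc, &rc);
    if (cc == MQCC_FAILED)
        return 1;

    strncpy(od.ObjectName, "TEST.QUEUE", MQ_Q_NAME_LENGTH); /* placeholder */
    MQOPEN(hConn, &od, MQOO_INPUT_AS_Q_DEF | MQOO_FAIL_IF_QUIESCING,
           &hObj, &cc, &rc);

    /* Fail the wait as soon as the queue manager starts quiescing,
       instead of sitting in MQGET and holding the channel open. */
    gmo.Options      = MQGMO_WAIT | MQGMO_FAIL_IF_QUIESCING;
    gmo.WaitInterval = 30000;                   /* 30 seconds */

    while (1)
    {
        memcpy(md.MsgId,    MQMI_NONE, sizeof(md.MsgId));
        memcpy(md.CorrelId, MQCI_NONE, sizeof(md.CorrelId));

        MQGET(hConn, hObj, &md, &gmo, sizeof(buffer), buffer,
              &msgLen, &cc, &rc);

        if (rc == MQRC_Q_MGR_QUIESCING || rc == MQRC_Q_MGR_STOPPING ||
            rc == MQRC_CONNECTION_BROKEN)
            break;                              /* get out; don't delay endmqm */

        if (rc == MQRC_NO_MSG_AVAILABLE)
            continue;                           /* wait expired, try again */

        /* ... process the message ... */
    }

    MQCLOSE(hConn, &hObj, MQCO_NONE, &cc, &rc);
    MQDISC(&hConn, &cc, &rc);
    return 0;
}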
_________________
Peter Potkay
Keep Calm and MQ On

jefflowrey
Posted: Tue Mar 11, 2008 9:34 am
Grand Poobah
Joined: 16 Oct 2002    Posts: 19981
That seems to point the finger in two places: a) client apps that don't use FAIL_IF_QUIESCING, and b) the channel, which should end itself after it has returned at least one quiescing failure to the client.
So I'd a) wait for the fix, and b) fatten your trout for those app teams that aren't using FAIL_IF_QUIESCING.
_________________
I am *not* the model of the modern major general.

PeterPotkay
Posted: Tue Mar 11, 2008 9:43 am
Poobah
Joined: 15 May 2001    Posts: 7722
Even with a Monty Python-sized trout you'll never be able to guarantee that every app uses FAIL_IF_QUIESCING. Even if they say they use it. Even if you see some code that uses it, that's not proof it's what's running in PROD. That's why we rely on endmqm -i. I'm glad IBM found the problem. Waiting 10 minutes for the QM to come down is an eternity in the middle of the night with the change window's end time approaching.
_________________
Peter Potkay
Keep Calm and MQ On

Toronto_MQ
Posted: Wed Mar 12, 2008 7:46 am
Master
Joined: 10 Jul 2002    Posts: 263    Location: read my name
I'm glad you've gotten somewhere with this. We have the same problem (on Solaris) and our PMRs got us nowhere. We have taken to issuing endmqm -i, waiting a minute, then endmqm -p, waiting another minute, and then we start killing the amqrmppa processes. Nice to see a fix may eventually come around.
I agree that in an ideal world we would have the apps code FAIL_IF_QUIESCING. And we always stress this as rule #1. But I think we all know we don't live in an ideal world. If I have to listen to "this is vendor code, we can't change that" one more time...
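For what it's worth, a rough sketch of that escalation; the queue manager name, the waits and the status check are only examples.

Code:
QM=TEST.QM1                      # placeholder queue manager name

endmqm -i $QM                    # immediate shutdown
sleep 60

# still showing as running? try a preemptive shutdown
if dspmq -m $QM | grep Running > /dev/null ; then
    endmqm -p $QM
    sleep 60
fi

# last resort: kill any channel pooling processes left over
for pid in $(ps -ef | grep "[a]mqrmppa" | awk '{print $2}'); do
    kill $pid
done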

GFORCE
Posted: Tue Mar 18, 2008 5:04 am
Voyager
Joined: 16 Jun 2003    Posts: 78    Location: WISCONSIN
I am glad this resulted in a fix from IBM. I hope the PTF is available soon.
Thanks for your help... this forum is great!!!!
_________________
THANKS

PeterPotkay
Posted: Thu Mar 20, 2008 11:56 am
Poobah
Joined: 15 May 2001    Posts: 7722
Contact IBM Support if you need the interim fix for this. It's called IZ18142. It's past the cutoff for being included in 6.0.2.4, so the earliest it would be in is 6.0.2.5.
I only tested the fix on Linux. I informed them that Solaris and AIX appear to have the same bug, based on this thread.
_________________
Peter Potkay
Keep Calm and MQ On