Author |
Message
|
belchman |
Posted: Fri Mar 31, 2006 9:10 am Post subject: Recurring QM Outage that Gens Probid = XC130003 and ZX005022 |
|
|
Partisan
Joined: 31 Mar 2006 Posts: 386 Location: Ohio, USA
|
My questions are at the end. I have some possibly useful data after the FDC text. Thanks in advance for any bread crumbs you might throw my way.
Current Situation:
AIX Version = 5.2.00
MQ Server Version = 530.12 CSD12
MQ CMVC level = p530-12-L051208
Over the last few months, always on a Thursday or Friday, we have experienced incidents where all ServerConn channels are forced down by the queue manager, forcing all MQ clients to reconnect.
The following FDCs are created (in this order):
***************************************
+-----------------------------------------------------------------------------+
| |
| WebSphere MQ First Failure Symptom Report |
| ========================================= |
| |
| Date/Time :- Thursday March 30 15:27:39 EST 2006 |
| Host Name :- xxxxx (AIX 5.2) |
| PIDS :- 5724B4101 |
| LVLS :- 530.12 CSD12 |
| Product Long Name :- WebSphere MQ for AIX |
| Vendor :- IBM |
| Probe Id :- XC130003 |
| Application Name :- MQM |
| Component :- xehExceptionHandler |
| Build Date :- Dec 8 2005 |
| CMVC level :- p530-12-L051208 |
| Build Type :- IKAP - (Production) |
| UserID :- 00000250 (mqm) |
| Program Name :- amqrmppa |
| Process :- 00618638 |
| Thread :- 00000001 |
| QueueManager :- OQPEGW01 |
| Major Errorcode :- STOP |
| Minor Errorcode :- OK |
| Probe Type :- HALT6109 |
| Probe Severity :- 1 |
| Probe Description :- AMQ6109: An internal WebSphere MQ error has occurred. |
| FDCSequenceNumber :- 0 |
| Arith1 :- 4 4 |
| Comment1 :- SIGILL |
| |
+-----------------------------------------------------------------------------+
MQM Function Stack
rppPoolMain
cccJobMonitor
xcsFFST
MQM Trace History
.
.
.
and
+-----------------------------------------------------------------------------+
| |
| WebSphere MQ First Failure Symptom Report |
| ========================================= |
| |
| Date/Time :- Thursday March 30 15:27:50 EST 2006 |
| Host Name :- mqentpr01 (AIX 5.2) |
| PIDS :- 5724B4101 |
| LVLS :- 530.12 CSD12 |
| Product Long Name :- WebSphere MQ for AIX |
| Vendor :- IBM |
| Probe Id :- ZX005022 |
| Application Name :- MQM |
| Component :- zxcProcessChildren |
| Build Date :- Dec 8 2005 |
| CMVC level :- p530-12-L051208 |
| Build Type :- IKAP - (Production) |
| UserID :- 00000250 (mqm) |
| Program Name :- amqzxma0_nd |
| Process :- 00454856 |
| Thread :- 00000001 |
| QueueManager :- OQPEGW01 |
| Major Errorcode :- lrcW_S_FAST_PATH_APP_DEAD |
| Minor Errorcode :- OK |
| Probe Type :- MSGAMQ7159 |
| Probe Severity :- 3 |
| Probe Description :- AMQ7159: A FASTPATH application has ended |
| unexpectedly. |
| FDCSequenceNumber :- 0 |
| Arith1 :- 618638 9708e |
| Arith2 :- 26221 666d |
| |
+-----------------------------------------------------------------------------+
MQM Function Stack
zxcProcessChildren
xcsFFST
MQM Trace History
.
.
.
**************************
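One detail in the two FDCs above is worth cross-checking: the Arith1 value in the ZX005022 report (618638, hex 9708e) appears to match the Process field (00618638) of the amqrmppa process that raised XC130003 eleven seconds earlier. That would be consistent with the second FDC being the queue manager noticing that the channel pool process had died. The following is a minimal Python sketch of that comparison. The `parse_fdc_header` and `arith_values` helpers are hypothetical names made up for illustration, and the sample text is a trimmed copy of the headers above, not a full FDC.

```python
import re

# Trimmed copies of header fields from the two FDCs shown above.
FDC_XC130003 = """\
| Probe Id :- XC130003 |
| Program Name :- amqrmppa |
| Process :- 00618638 |
| Comment1 :- SIGILL |
"""

FDC_ZX005022 = """\
| Probe Id :- ZX005022 |
| Program Name :- amqzxma0_nd |
| Process :- 00454856 |
| Arith1 :- 618638 9708e |
| Arith2 :- 26221 666d |
"""

# Matches "| Field Name :- value |" with any amount of padding.
FIELD = re.compile(r"^\|\s*(.+?)\s*:-\s*(.*?)\s*\|\s*$")


def parse_fdc_header(text):
    """Hypothetical helper: return {field name: value} for an FDC header."""
    fields = {}
    for line in text.splitlines():
        match = FIELD.match(line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


def arith_values(value):
    """Split an Arith field into its (decimal, hex) columns as integers."""
    dec_text, hex_text = value.split()
    return int(dec_text), int(hex_text, 16)


first = parse_fdc_header(FDC_XC130003)
second = parse_fdc_header(FDC_ZX005022)

dead_pid, dead_pid_hex = arith_values(second["Arith1"])
crashed_pid = int(first["Process"])

print("Arith1 decimal and hex agree:", dead_pid == dead_pid_hex)
print("Arith1 matches amqrmppa process id:", dead_pid == crashed_pid)
```

If the two values match, as they do here, the SIGILL in amqrmppa is the event to chase; the ZX005022 report would then be a consequence rather than a separate failure.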
I have opened a PMR with IBM. Here is the info I provided IBM:
**************************
Other relevant information:
We have FASTPATH binding enabled on the queue manager.
We have a node that is continuously spamming the queue manager log with AMQ9208. However, apps hosted on the node do not report loss of MQ functionality.
In diagnosing the spamming node, we found that it hosts 4 MQ client apps: 1 Java, 2 VB.NET and 1 VB6.
In production, the MQ_CONNECT_TYPE is not set.
At least 1 of the apps is compiled with MQ ver 5.2 versions of cmqb.bas, cmqbb.bas,...,cmqxb.bas.
This host supports a high-TPS, customer-facing application, so I assume they are using MQCONNX. We have other MQ client nodes writing similar errors to the QM log, but not to the same degree. I assume the difference is down to TPS.
I also assume that all spamming MQ clients are sharing 1 or more of the compiled objects without having MQ_CONNECT_TYPE set, and hence are binding FASTPATH.
We have had 5 or 6 similar incidents since October 2005 and have noticed that all incidents have occurred on a Thursday or Friday.
We bounce the MQ host every Sunday morning.
Corrective actions already taken:
I am attempting to reproduce the problem in QA. We were able to get the QA MQ client node to write the same error to the QA QM log.
What I have done so far in QA:
-> Created and set the MQ_CONNECT_TYPE Env Var to FASTPATH.
-> Bounced 1 suspected MQ client app
-> Put messages that generated the log entries
-> Set MQ_CONNECT_TYPE Env Var to STANDARD
-> Bounced MQ client app
-> Put messages that did not generate log entries
-> Set MQ_CONNECT_TYPE Env Var to FASTPATH
-> Bounced MQ client app
-> Put messages that did not generate log entries
I am confused as to why the last FASTPATH test didn't generate errors.
Question: Can anyone shed some light on what is causing this incident to occur?
Question: Can anyone shed some light on how to replicate the problem?
Question: Can anyone share their experiences with using Windows clients in a large shop? |
|
jefflowrey |
Posted: Fri Mar 31, 2006 1:49 pm Post subject: |
|
|
Grand Poobah
Joined: 16 Oct 2002 Posts: 19981
|
I bet your firewall doesn't support long running connections as well as you think it does. _________________ I am *not* the model of the modern major general. |
|
belchman |
Posted: Tue Apr 04, 2006 4:49 am Post subject: |
|
|
Partisan
Joined: 31 Mar 2006 Posts: 386 Location: Ohio, USA
|
Thanks jefflowrey,
I will look into our firewall settings. However, the big problem is that the queue manager just "closes all serverconn channels" and writes FDCs. The queue manager recovers by itself, but not until we experience outages on MQ applications as they reconnect. The node that is spamming our logs is simply a suspect as the root cause of the problem because of its logging behavior. Prior to incidents of this nature, our logs are filled with spam from the node. After the queue manager recovers, the node starts spamming again.
1) Do you think it is appropriate to consider this node as a primary suspect RE root cause?
2) What would you check (and how) if your queue manager was behaving this way?
Running an MQ trace is difficult because it affects TPS, and we are hot/cold HA/CMP with a shared disk between the nodes. When this occurs, we do not have the luxury of extending the outage to diagnose it.
I would really appreciate any guidance on this. This problem has been ongoing for at least 6 months. The spamming node has been going on for over a year.
Thank you. |
|
jefflowrey |
Posted: Tue Apr 04, 2006 5:03 am Post subject: |
|
|
Grand Poobah
Joined: 16 Oct 2002 Posts: 19981
|
FASTPATH is always tricky. As you say, it could be causing the queue manager to reset the channel processes.
You say that the "spamming" node resumes spamming immediately after everything is restored. Is the rate of errors constant, regardless of the state of the queue manager? Or does the rate of errors change over time, and start increasing sometime before the crash?
AMQ9208 implies network issues of some kind, though. What are the inserts with the 9208? Are they always the same, or different? |
|
belchman |
Posted: Tue Apr 04, 2006 10:37 am Post subject: |
|
|
Partisan
Joined: 31 Mar 2006 Posts: 386 Location: Ohio, USA
|
Between 4/3/06 @ 16:39 and 4/3/06 @ 18:02, 479 entries were made at roughly 11-second intervals. The largest interval was 13 seconds and the smallest was 10 seconds.
More to follow... |
|
jefflowrey |
Posted: Tue Apr 04, 2006 10:39 am Post subject: |
|
|
Grand Poobah
Joined: 16 Oct 2002 Posts: 19981
|
And what about before 16:39 and after 18:02? |
|
belchman |
Posted: Tue Apr 04, 2006 10:50 am Post subject: |
|
|
Partisan
Joined: 31 Mar 2006 Posts: 386 Location: Ohio, USA
|
The aforementioned data was from AMQERR01.LOG. The other data would be in the 02 and 03. Give me a sec and I will analyse that also.
The problem is that we did not have an incident of this nature before all three logs were filled with this spam.
I may be able to go back in time and see if there were any interval changes immediately prior and just after the incident. It will take a few minutes. I am juggling things like a circus clown. |
|
jefflowrey |
Posted: Tue Apr 04, 2006 11:04 am Post subject: |
|
|
Grand Poobah
Joined: 16 Oct 2002 Posts: 19981
|
Don't rush on my account. |
|
belchman |
Posted: Tue Apr 04, 2006 11:05 am Post subject: |
|
|
Partisan
Joined: 31 Mar 2006 Posts: 386 Location: Ohio, USA
|
There is no significant difference in the remaining 2 queue manager logs. It looks like 11 seconds is the interval.
Here is a snippet of what is being logged every 11 seconds...
04/04/06 14:14:37
AMQ9208: Error on receive from host 170.128.171.54.
EXPLANATION:
An error occurred receiving data from 170.128.171.54 over TCP/IP. This may be
due to a communications failure.
ACTION:
The return code from the TCP/IP (read) call was 73 (X'49'). Record these values
and tell the systems administrator.
----- amqccita.c : 2718 ------------------------------------------------------- |
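For anyone wanting to repeat the interval measurement, here is a rough Python sketch of one way to pull the AMQ9208 timestamps out of an error log and compute the gaps between them. It assumes the layout shown in the snippet above (a `MM/DD/YY HH:MM:SS` line immediately followed by the `AMQ9208:` line); the function names are made up for illustration. Only the first timestamp in the sample comes from the real log; the other three are invented so the example has something to measure.

```python
import re
from datetime import datetime

# Assumed layout: a timestamp line, then the message line on the next line.
STAMP = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})")


def message_times(lines, msg_id="AMQ9208", host=None):
    """Return the timestamps of entries whose message line starts with msg_id."""
    times = []
    pending = None  # timestamp waiting for its message line
    for line in lines:
        match = STAMP.match(line)
        if match:
            pending = datetime.strptime(match.group(1), "%m/%d/%y %H:%M:%S")
            continue
        if pending is not None and line.startswith(msg_id + ":"):
            if host is None or host in line:
                times.append(pending)
        pending = None  # only the line right after the timestamp counts
    return times


def intervals_seconds(times):
    """Gaps in seconds between consecutive timestamps."""
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]


# Illustrative sample: first entry is from the log above, the rest are invented.
SAMPLE = """\
04/04/06 14:14:37
AMQ9208: Error on receive from host 170.128.171.54.
EXPLANATION:
An error occurred receiving data from 170.128.171.54 over TCP/IP.
04/04/06 14:14:48
AMQ9208: Error on receive from host 170.128.171.54.
04/04/06 14:15:01
AMQ9208: Error on receive from host 170.128.171.54.
04/04/06 14:15:11
AMQ9208: Error on receive from host 170.128.171.54.
""".splitlines()

times = message_times(SAMPLE, host="170.128.171.54")
gaps = intervals_seconds(times)
print("entries:", len(times))
print("min/max gap (s):", min(gaps), max(gaps))
```

To run it against a real log, pass the open file object in place of `SAMPLE`. Plotting or bucketing the gaps by hour would show whether the rate changes in the run-up to an incident.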
|
mvic |
Posted: Tue Apr 04, 2006 12:22 pm Post subject: |
|
|
 Jedi
Joined: 09 Mar 2004 Posts: 2080
|
belchman wrote: |
The return code from the TCP/IP (read) call was 73 (X'49'). Record these values
and tell the systems administrator. |
73 on AIX is ECONNRESET. In words, "connection reset by peer". It's a comms problem, probably originating in some pretty strict firewall settings on 170.128.171.54 or between the local machine and 170.128.171.54.
What do your network admins say about this? |
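A small Python sketch of the decoding step, for reference. The value 73 for ECONNRESET on AIX is taken from this thread and is hard-coded as an assumption; errno numbers are platform-specific, so the lookup at the end reports whatever number the machine running the script uses, which may well not be 73.

```python
import errno
import os

# Assumption from this thread: on AIX, errno 73 is ECONNRESET.
# This constant is not looked up; it is simply what the thread states.
AIX_ECONNRESET = 73

# AMQ9208 reports the return code twice: decimal 73 and hex X'49'.
hex_insert = "49"
print("hex insert equals decimal insert:", int(hex_insert, 16) == AIX_ECONNRESET)

# On the local platform the number can differ, but the symbolic name
# and the descriptive text can still be looked up.
local_code = errno.ECONNRESET
print(local_code, errno.errorcode[local_code], os.strerror(local_code))
```

The safest way to decode the number from a real log is to check `/usr/include/sys/errno.h` on the AIX box itself rather than on a workstation.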
|
fjb_saper |
Posted: Tue Apr 04, 2006 1:04 pm Post subject: |
|
|
 Grand High Poobah
Joined: 18 Nov 2003 Posts: 20756 Location: LI,NY
|
Could it be that you run out of bandwidth at those times ?
 _________________ MQ & Broker admin |
|
jefflowrey |
Posted: Tue Apr 04, 2006 1:06 pm Post subject: |
|
|
Grand Poobah
Joined: 16 Oct 2002 Posts: 19981
|
fjb_saper wrote: |
Could it be that you run out of bandwidth at those times ?
 |
Every 11 seconds? |
|
fjb_saper |
Posted: Tue Apr 04, 2006 1:13 pm Post subject: |
|
|
 Grand High Poobah
Joined: 18 Nov 2003 Posts: 20756 Location: LI,NY
|
Retrying every 11 seconds, while some huge FTP is clogging the bandwidth ? As this seems to be happening periodically like clockwork, I would suspect some culprit in a batch environment. Now if the batch environment is transmitting on a pulse basis (1,000s of files in very little time, multiple threads) you could have brief availability between the files until the next "pulse" from the batch environment...
But then what do I know about what batch processing you are doing around those times...
You really need to get your network folks involved.
Last edited by fjb_saper on Tue Apr 04, 2006 1:18 pm; edited 1 time in total |
|
jefflowrey |
Posted: Tue Apr 04, 2006 1:16 pm Post subject: |
|
|
Grand Poobah
Joined: 16 Oct 2002 Posts: 19981
|
fjb_saper wrote: |
Retrying every 11 seconds, while some huge FTP is clogging the bandwidth ? |
My reading of what's been said so far is that the message is produced constantly, every 11 seconds or so, and belchman only had log entries for the time period he mentioned (due to AMQERR roll-overs). |
|
fjb_saper |
Posted: Tue Apr 04, 2006 1:20 pm Post subject: |
|
|
 Grand High Poobah
Joined: 18 Nov 2003 Posts: 20756 Location: LI,NY
|
jefflowrey wrote: |
fjb_saper wrote: |
Retrying every 11 seconds, while some huge FTP is clogging the bandwidth ? |
My reading of what's been said so far is that the message is produced constantly, every 11 seconds or so, and belchman only had log entries for the time period he mentioned (due to AMQERR roll-overs). |
How about a faulty switch that kicks in only in certain conditions? |
|