Recurring QM Outage that Gens Probid = XC130003 and ZX005022
belchman
Posted: Fri Mar 31, 2006 9:10 am

Partisan

Joined: 31 Mar 2006
Posts: 386
Location: Ohio, USA

My questions are at the end. I have some possibly useful data after the FDC text. Thanks in advance for any bread crumbs you might throw my way.

Current Situation:

AIX Version = 5.2.00
MQ Server Version = 530.12 CSD12
MQ CMVC level = p530-12-L051208

Every Thursday or Friday for the last few months we have experienced incidents where all ServerConn channels are forced down by the queue manager, so all MQ clients have to reconnect.

The following FDCs are created (in this order):
***************************************
+-----------------------------------------------------------------------------+
| |
| WebSphere MQ First Failure Symptom Report |
| ========================================= |
| |
| Date/Time :- Thursday March 30 15:27:39 EST 2006 |
| Host Name :- xxxxx (AIX 5.2) |
| PIDS :- 5724B4101 |
| LVLS :- 530.12 CSD12 |
| Product Long Name :- WebSphere MQ for AIX |
| Vendor :- IBM |
| Probe Id :- XC130003 |
| Application Name :- MQM |
| Component :- xehExceptionHandler |
| Build Date :- Dec 8 2005 |
| CMVC level :- p530-12-L051208 |
| Build Type :- IKAP - (Production) |
| UserID :- 00000250 (mqm) |
| Program Name :- amqrmppa |
| Process :- 00618638 |
| Thread :- 00000001 |
| QueueManager :- OQPEGW01 |
| Major Errorcode :- STOP |
| Minor Errorcode :- OK |
| Probe Type :- HALT6109 |
| Probe Severity :- 1 |
| Probe Description :- AMQ6109: An internal WebSphere MQ error has occurred. |
| FDCSequenceNumber :- 0 |
| Arith1 :- 4 4 |
| Comment1 :- SIGILL |
| |
+-----------------------------------------------------------------------------+

MQM Function Stack
rppPoolMain
cccJobMonitor
xcsFFST

MQM Trace History
.
.
.

and

+-----------------------------------------------------------------------------+
| |
| WebSphere MQ First Failure Symptom Report |
| ========================================= |
| |
| Date/Time :- Thursday March 30 15:27:50 EST 2006 |
| Host Name :- mqentpr01 (AIX 5.2) |
| PIDS :- 5724B4101 |
| LVLS :- 530.12 CSD12 |
| Product Long Name :- WebSphere MQ for AIX |
| Vendor :- IBM |
| Probe Id :- ZX005022 |
| Application Name :- MQM |
| Component :- zxcProcessChildren |
| Build Date :- Dec 8 2005 |
| CMVC level :- p530-12-L051208 |
| Build Type :- IKAP - (Production) |
| UserID :- 00000250 (mqm) |
| Program Name :- amqzxma0_nd |
| Process :- 00454856 |
| Thread :- 00000001 |
| QueueManager :- OQPEGW01 |
| Major Errorcode :- lrcW_S_FAST_PATH_APP_DEAD |
| Minor Errorcode :- OK |
| Probe Type :- MSGAMQ7159 |
| Probe Severity :- 3 |
| Probe Description :- AMQ7159: A FASTPATH application has ended |
| unexpectedly. |
| FDCSequenceNumber :- 0 |
| Arith1 :- 618638 9708e |
| Arith2 :- 26221 666d |
| |
+-----------------------------------------------------------------------------+

MQM Function Stack
zxcProcessChildren
xcsFFST

MQM Trace History
.
.
.
**************************
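The two FDC headers above can be cross-checked mechanically. The following is a minimal sketch, assuming the headers keep the `| Key :- Value |` layout shown in the post; the function names `parse_fdc_header` and `arith_values` are hypothetical helpers, not part of any MQ tooling. It shows that Arith1 in the ZX005022 report (618638, hex 9708e) appears to be the process ID of the amqrmppa process that raised XC130003.

```python
import re

# Header lines in the FDC reports look like "| Probe Id :- XC130003 |".
FIELD_RE = re.compile(r"^\|\s*(?P<key>[^|]+?)\s*:-\s*(?P<value>.*?)\s*\|?\s*$")


def parse_fdc_header(text):
    """Return a dict of the 'Key :- Value' fields found in one FDC header."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line.strip())
        if match:
            fields[match.group("key")] = match.group("value")
    return fields


def arith_values(field):
    """Split an Arith field such as '618638 9708e' into (decimal, hex) ints."""
    dec_text, hex_text = field.split()
    return int(dec_text), int(hex_text, 16)


# Fields copied from the two reports in the post.
first = parse_fdc_header("""
| Probe Id :- XC130003 |
| Program Name :- amqrmppa |
| Process :- 00618638 |
""")
second = parse_fdc_header("""
| Probe Id :- ZX005022 |
| Program Name :- amqzxma0_nd |
| Arith1 :- 618638 9708e |
""")

dead_pid, dead_pid_hex = arith_values(second["Arith1"])
# Compare the PID reported by the second FDC with the process in the first.
print(dead_pid == dead_pid_hex == int(first["Process"]))
```

If the comparison holds for every incident, the second FDC is most likely just the queue manager noticing that the channel pooling process from the first FDC has died, which would make XC130003 the report to chase.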

I have opened a PMR with IBM. Here is the info I provided IBM:

**************************
Other relevant information:

We have FASTPATH binding enabled on the queue manager.

We have a node that is continuously spamming the queue manager log with AMQ9208. However, apps hosted on the node do not report loss of MQ functionality.

While diagnosing the spamming node, we found that it hosts 4 MQ client apps: 1 is Java, 2 are VB.NET and 1 is VB6.

In production, the MQ_CONNECT_TYPE is not set.

At least 1 of the apps is compiled with MQ ver 5.2 versions of cmqb.bas, cmqbb.bas,...,cmqxb.bas.

This host supports a high-TPS, customer-facing application, so I assume they are using MQCONNX. We have other MQ client nodes writing similar errors to the QM log, but not to the same degree. I assume this is based on TPS.

I also assume that all spamming MQ clients are sharing 1 or more of the compiled objects without having MQ_CONNECT_TYPE set, and hence are binding FASTPATH.

We have had 5 or 6 similar incidents since October 2005 and have noticed that all incidents have occurred on a Thursday or Friday.

We bounce the MQ host every Sunday morning.

Corrective actions already taken:

I am attempting to reproduce the problem in QA. We were able to get the QA MQ client node to write the same error to the QA QM log.

What I have done so far in QA:
-> Created and set the MQ_CONNECT_TYPE Env Var to FASTPATH.
-> Bounced 1 suspected MQ client app
-> Put messages that generated the log entries
-> Set MQ_CONNECT_TYPE Env Var to STANDARD
-> Bounced MQ client app
-> Put messages that did not generate log entries
-> Set MQ_CONNECT_TYPE Env Var to FASTPATH
-> Bounced MQ client app
-> Put messages that did not generate log entries

I am confused as to why the last FASTPATH test didn't generate errors.

Question: Can anyone shed some light on what is causing this incident to occur?
Question: Can anyone shed some light on how to replicate the problem?
Question: Can anyone share their experiences with using Windows clients in a large shop?
jefflowrey
Posted: Fri Mar 31, 2006 1:49 pm

Grand Poobah

Joined: 16 Oct 2002
Posts: 19981

I bet your firewall doesn't support long running connections as well as you think it does.
_________________
I am *not* the model of the modern major general.
belchman
Posted: Tue Apr 04, 2006 4:49 am

Partisan

Joined: 31 Mar 2006
Posts: 386
Location: Ohio, USA

Thanks jefflowrey,

I will look into our firewall settings. However, the big problem is that the queue manager just "closes all serverconn channels" and writes FDCs. The queue manager recovers by itself, but not until we experience outages on MQ applications as they reconnect. The node that is spamming our logs is simply a suspect as the root cause of the problem because of its logging behavior. Prior to incidents of this nature, our logs are filled with spam from the node. After the queue manager recovers, the node starts spamming again.

1) Do you think it is appropriate to consider this node as a primary suspect RE root cause?

2) What would you check (and how) if your queue manager was behaving this way?

Running an MQ trace is difficult because TPS is affected and we are hot/cold HACMP with a shared disk between the nodes. When this occurs, we do not have the luxury of extending the outage to diagnose.

I would really appreciate any guidance on this. This problem has been ongoing for at least 6 months. The spamming node has been going on for over a year.

Thank you.
jefflowrey
Posted: Tue Apr 04, 2006 5:03 am

Grand Poobah

Joined: 16 Oct 2002
Posts: 19981

FASTPATH is always tricky. As you say, it could be causing the queue manager to reset the channel processes.

You say that the "spamming" node resumes spamming immediately after everything is restored. Is the rate of errors constant, regardless of the state of the queue manager? Or does the rate of errors change over time, and start increasing sometime before the crash?

AMQ9208 implies network issues of some kind, though. What are the inserts with the 9208? Are they always the same, or different?
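One way to answer the question about whether the error rate drifts before an outage is to count entries per fixed time window. This is only a sketch: it assumes the AMQ9208 timestamps have already been pulled out of the AMQERR logs as `datetime` objects, and both the function name `errors_per_window` and the 10-minute default window are arbitrary choices for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta


def errors_per_window(timestamps, window=timedelta(minutes=10)):
    """Count entries per fixed-size window, keyed by each window's start time."""
    if not timestamps:
        return {}
    ordered = sorted(timestamps)
    origin = ordered[0]
    counts = Counter()
    for stamp in ordered:
        # Integer number of whole windows between the first entry and this one.
        index = (stamp - origin) // window
        counts[origin + index * window] += 1
    return dict(sorted(counts.items()))
```

A flat count per window right up to the outage would suggest the spamming node is background noise; a count that climbs beforehand would make it a stronger suspect.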
_________________
I am *not* the model of the modern major general.
belchman
Posted: Tue Apr 04, 2006 10:37 am

Partisan

Joined: 31 Mar 2006
Posts: 386
Location: Ohio, USA

Between 4/3/06 16:39 and 4/3/06 18:02, 479 entries were logged at roughly 11-second intervals; the smallest interval was 10 seconds and the largest was 13 seconds.

More to follow...
jefflowrey
Posted: Tue Apr 04, 2006 10:39 am

Grand Poobah

Joined: 16 Oct 2002
Posts: 19981

And what about before 16:39 and after 18:02?
_________________
I am *not* the model of the modern major general.
belchman
Posted: Tue Apr 04, 2006 10:50 am

Partisan

Joined: 31 Mar 2006
Posts: 386
Location: Ohio, USA

The aforementioned data was from AMQERR01.LOG. The other data would be in AMQERR02.LOG and AMQERR03.LOG. Give me a sec and I will analyse those as well.

The problem is that we did not have an incident of this nature before all three logs were filled with this spam.

I may be able to go back in time and see if there were any interval changes immediately prior and just after the incident. It will take a few minutes. I am juggling things like a circus clown.
jefflowrey
Posted: Tue Apr 04, 2006 11:04 am

Grand Poobah

Joined: 16 Oct 2002
Posts: 19981

Don't rush on my account.
_________________
I am *not* the model of the modern major general.
belchman
Posted: Tue Apr 04, 2006 11:05 am

Partisan

Joined: 31 Mar 2006
Posts: 386
Location: Ohio, USA

There is no significant difference in the remaining 2 queue manager logs. It looks like 11 seconds is the interval.

Here is a snippet of what is being logged every 11 seconds...

04/04/06 14:14:37
AMQ9208: Error on receive from host 170.128.171.54.

EXPLANATION:
An error occurred receiving data from 170.128.171.54 over TCP/IP. This may be
due to a communications failure.
ACTION:
The return code from the TCP/IP (read) call was 73 (X'49'). Record these values
and tell the systems administrator.
----- amqccita.c : 2718 -------------------------------------------------------
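The interval figures quoted in this thread could be reproduced with a small script along the following lines. It is a sketch under two assumptions: that every log entry starts with a timestamp line in the form shown above, followed by the message line, and that the date is month/day/year (the snippet alone does not settle that). The names `message_times` and `gap_summary` are hypothetical.

```python
import re
from datetime import datetime

# Timestamp lines look like "04/04/06 14:14:37" (assumed to be mm/dd/yy).
STAMP_RE = re.compile(r"^(\d{2}/\d{2}/\d{2}) (\d{2}:\d{2}:\d{2})$")


def message_times(log_text, message_id="AMQ9208"):
    """Return the timestamps of entries whose message line starts with message_id."""
    times = []
    pending = None  # timestamp waiting for its message line
    for raw in log_text.splitlines():
        line = raw.strip()
        stamp = STAMP_RE.match(line)
        if stamp:
            pending = datetime.strptime(" ".join(stamp.groups()), "%m/%d/%y %H:%M:%S")
        elif pending is not None and line:
            if line.startswith(message_id + ":"):
                times.append(pending)
            pending = None
    return times


def gap_summary(times):
    """Summarise the gaps, in seconds, between consecutive timestamps."""
    gaps = [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]
    if not gaps:
        return None
    return {
        "entries": len(times),
        "min": min(gaps),
        "max": max(gaps),
        "mean": sum(gaps) / len(gaps),
    }
```

Running `gap_summary(message_times(text))` over the contents of each of the three AMQERR logs in turn would show whether the spacing really stays in the 10 to 13 second range across all of them.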
mvic
Posted: Tue Apr 04, 2006 12:22 pm

Jedi

Joined: 09 Mar 2004
Posts: 2080

belchman wrote:
The return code from the TCP/IP (read) call was 73 (X'49'). Record these values
and tell the systems administrator.

73 on AIX is ECONNRESET. In words, "connection reset by peer". It's a comms problem, probably originating in some pretty strict firewall settings on 170.128.171.54 or between the local machine and 170.128.171.54.
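The decoding step above can be sketched in a few lines. The lookup table is deliberately limited to the one value discussed in this thread, because errno numbers are platform-specific; the function name `decode_return_code` and the table name are hypothetical.

```python
import re

# AIX errno values; only the one confirmed in this thread is listed.
AIX_ERRNO_NAMES = {73: "ECONNRESET"}

# Matches the tail of the ACTION text, e.g. "call was 73 (X'49')".
RC_RE = re.compile(r"call was (\d+) \(X'([0-9A-Fa-f]+)'\)")


def decode_return_code(action_text):
    """Extract the return code from an AMQ9208 ACTION line and name it."""
    match = RC_RE.search(action_text)
    if match is None:
        return None
    decimal = int(match.group(1))
    # The message prints the same value twice; make sure both forms agree.
    if decimal != int(match.group(2), 16):
        raise ValueError("decimal and hex return codes disagree")
    return decimal, AIX_ERRNO_NAMES.get(decimal, "unknown")


print(decode_return_code(
    "The return code from the TCP/IP (read) call was 73 (X'49')."
))
```

Python's own `errno.ECONNRESET` is not used here because it reports the value for the machine running the script, which need not be 73; on Linux, for example, it is 104.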

What do your network admins say about this?
fjb_saper
Posted: Tue Apr 04, 2006 1:04 pm

Grand High Poobah

Joined: 18 Nov 2003
Posts: 20756
Location: LI,NY

Could it be that you run out of bandwidth at those times ?

_________________
MQ & Broker admin
jefflowrey
Posted: Tue Apr 04, 2006 1:06 pm

Grand Poobah

Joined: 16 Oct 2002
Posts: 19981

fjb_saper wrote:
Could it be that you run out of bandwidth at those times ?


Every 11 seconds?
_________________
I am *not* the model of the modern major general.
fjb_saper
Posted: Tue Apr 04, 2006 1:13 pm

Grand High Poobah

Joined: 18 Nov 2003
Posts: 20756
Location: LI,NY

Retrying every 11 seconds, while some huge FTP is clogging the bandwidth ? As this seems to be happening periodically like clockwork I would suspect some culprit in a batch environment. Now if the batch environment is transmitting on a pulse basis (1,000s of files in very little time, multiple threads) you could have brief availability between the files until the next "pulse" from the batch environment...

But then what do I know about what batch processing you are doing around those times...

You really need to get your network folks involved.
_________________
MQ & Broker admin


Last edited by fjb_saper on Tue Apr 04, 2006 1:18 pm; edited 1 time in total
jefflowrey
Posted: Tue Apr 04, 2006 1:16 pm

Grand Poobah

Joined: 16 Oct 2002
Posts: 19981

fjb_saper wrote:
Retrying every 11 seconds, while some huge FTP is clogging the bandwidth ?


My reading of what's been said so far is that the message is produced constantly, every 11 seconds or so, and belchman only had log entries for the time period he mentioned (due to AMQERR roll-overs).
_________________
I am *not* the model of the modern major general.
fjb_saper
Posted: Tue Apr 04, 2006 1:20 pm

Grand High Poobah

Joined: 18 Nov 2003
Posts: 20756
Location: LI,NY

jefflowrey wrote:
fjb_saper wrote:
Retrying every 11 seconds, while some huge FTP is clogging the bandwidth ?


My reading of what's been said so far is that the message is produced constantly, every 11 seconds or so, and belchman only had log entries for the time period he mentioned (due to AMQERR roll-overs).


How about a faulty switch that kicks in only in certain conditions?
_________________
MQ & Broker admin