MQSeries.net :: View topic - 4.5 hours on a bridge because of a "/"


4.5 hours on a bridge because of a "/"


csmith28

Posted: Sun Nov 28, 2004 11:55 am Post subject: 4.5 hours on a bridge because of a "/"

Grand Master

Joined: 15 Jul 2003
Posts: 1196
Location: Arizona

Code:

/etc/rc.ventd -start

It started at 03:39 am with the first of 300 "AMQ9999 Channel ended abnormally" pages before my skytel mailbox filled up.

I logged on to discover that the MaxActiveChannels threshold of 2000 had been reached on an MQManager that usually has about 900 active channels during peak usage.

Just so happens that there were over 1400 instances of a certain SVRCONN channel, defined for a particular remote client application that had just coincidentally completed a WebSphere Console code deploy less than two hours before the problems started. But would they back out the new code? Oh no.

"NO!" It can't be the code. The application is working just fine in DEV and QA.

That's right, FOUR and ONE HALF HOURS on a bridge trying to convince the application programmers to check their code.

As usual I was outnumbered. "Must be something wrong with MQ because the errors in the stdout.log have an M and a Q in them."

Under protest they allow the individual application instances to be bounced. Still 2000 channels running.

Ok so I issue "stop chl(XX.XXXX.SVRCL01) mode(force) status(inactive)". The active channels go down to 585, and in less than 30 seconds the MaxActiveChannels threshold of 2000 is reached again.

I asked them to have a look at the new code and properties files.

Oh no they say, "If it were the code or properties files we would have most certainly seen the same problem in QA. It has to be a problem with MQ."

They suggest stopping and restarting the MQManager.

"NO!"

I try to explain, SVRCONN channels only start in response to a request from a Client application....

MQManagers don't just whimsically start SVRCONN channels.

The MQManager is up. No .FDC files have been created. The AMQERR0*.LOG files are wrapping in less than a minute, filled with "AMQ9999 channel XX.XXXX.SVRCL01 ended abnormally" errors. Please have a look at the new code, I say. There has to be something wrong with the code: a "," instead of a ".", a misplaced "`", a "[" that doesn't have a matching "]", a ";" instead of a ":", something like that........

Turns out I was right.

Turns out there is this JMS entry in one of the properties.jar files that references:

Code:

/jmscontext/application/functionality/mq/Q_APP_FUNC_REPLY

When it is supposed to point to:

Code:

jmscontext/application/functionality/mq/Q_APP_FUNC_REPLY

"/"

Further, the application is coded to retry upon failure to connect without issuing an MQDISC for the connection that failed. So every time they do an MQCONN, another instance of the XX.XXXX.SVRCL01 channel gets started, in this vicious loop.
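For anyone who wants to see the loop in miniature, here is a rough sketch. The application's real code was never posted, so everything here is an assumption: `FakeQueueManager`, `FakeConnection`, `leaky_retry` and `clean_retry` are hypothetical stand-ins written in Python, not MQ or JMS APIs. `connect()` plays the role of MQCONN, `disconnect()` the role of MQDISC, and the lookup fails on the leading "/" the way the bad properties entry did.

```python
class FakeQueueManager:
    """Hypothetical stand-in for a queue manager; only counts channel instances."""

    def __init__(self, max_active_channels):
        self.max_active_channels = max_active_channels
        self.active_channels = 0

    def connect(self):
        # Plays the role of MQCONN: each successful call starts one channel instance.
        if self.active_channels >= self.max_active_channels:
            raise ConnectionError("channel limit reached")
        self.active_channels += 1
        return FakeConnection(self)


class FakeConnection:
    """Hypothetical client connection holding one channel instance."""

    def __init__(self, qmgr):
        self.qmgr = qmgr
        self.open = True

    def lookup(self, name):
        # Assumption for the sketch: a name with a leading "/" never resolves.
        if name.startswith("/"):
            raise LookupError("name not found: " + name)
        return name

    def disconnect(self):
        # Plays the role of MQDISC: releases the channel instance exactly once.
        if self.open:
            self.open = False
            self.qmgr.active_channels -= 1


def leaky_retry(qmgr, name, attempts):
    """Retry without disconnecting: every failed attempt strands a channel."""
    for _ in range(attempts):
        try:
            conn = qmgr.connect()
            return conn, conn.lookup(name)
        except (LookupError, ConnectionError):
            continue  # conn is simply dropped, its channel stays up
    return None


def clean_retry(qmgr, name, attempts):
    """Same retry, but the failed connection is released before trying again."""
    for _ in range(attempts):
        conn = None
        try:
            conn = qmgr.connect()
            return conn, conn.lookup(name)
        except (LookupError, ConnectionError):
            if conn is not None:
                conn.disconnect()
    return None
```

With the bad name, `leaky_retry` leaves one channel running per attempt until the limit is hit, while `clean_retry` fails just as often but leaves the channel count where it started. Neither fixes the "/"; the second one just stops it from taking everyone else down.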

<sighs>

Then they accuse me of finger pointing because I want the Problem Record Assigned to them for Root Cause Analysis and Closure.

Their code error created almost 5 hours of downtime for their application and adversely impacted every application that uses my MQManager, and they want me to take ownership of the problem.

How about "NO!".
Does "NO" work for you?

Code:

/etc/rc.ventd ended with reason code 0.

I feel much better now.
_________________
Yes, I am an agent of Satan but my duties are largely ceremonial.

Last edited by csmith28 on Sun Nov 28, 2004 1:11 pm; edited 6 times in total

PeterPotkay

Posted: Sun Nov 28, 2004 12:41 pm Post subject:

Poobah

Joined: 15 May 2001
Posts: 7723

My favorite? Network outage of about 30 seconds because some network switchy thingamajig somewhere got screwed up. But since the app didn't code reconnect logic, they didn't reconnect after 30 seconds. Nope, 5 hours later someone took it upon themselves to bounce the app.

The next day.....

"I heard the network was down for 30 seconds, but MQ was down for 5 hours, causing our app to be down for 5 hours."

The next week, I announce that all QMs in the DEV environment that accept client connections will be stopped and restarted every night at 10 PM, to allow (force?) programmers to code and test reconnect logic.

The programmers who don't take the time and effort to code reconnect logic are never the ones getting paged at 2 in the morning when things go wrong.
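For what it's worth, reconnect logic does not have to be elaborate. Below is a minimal sketch in Python, not anyone's production code: `call_with_reconnect`, its parameters, and the idea that the connection object has a `close()` method are all assumptions made for illustration. It retries a failed connection with a doubling delay and always releases whatever connection it did get.

```python
import time


def call_with_reconnect(connect, work, max_attempts=5,
                        base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Run work(conn), reconnecting with a doubling delay on ConnectionError.

    connect: hypothetical callable that returns an object with close().
    work:    hypothetical callable that uses the connection.
    sleep:   injectable so the delays can be tested without waiting.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        conn = None
        try:
            conn = connect()
            return work(conn)
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up loudly instead of sitting dead for 5 hours
            sleep(delay)
            delay = min(delay * 2, max_delay)
        finally:
            # Always release the connection we obtained, success or failure.
            if conn is not None:
                conn.close()
```

A nightly queue manager bounce in DEV is exactly the kind of test this has to survive: the first attempts fail, the delays grow, and the app comes back on its own once the QM is up.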
_________________
Peter Potkay
Keep Calm and MQ On

kevinf2349

Posted: Mon Nov 29, 2004 7:31 am Post subject: War story

Grand Master

Joined: 28 Feb 2003
Posts: 1311
Location: USA

My own particular favorite is when one of our Windows servers (with a perfectly functioning queue manager) is having an issue and our Windows folks decide to yo-yo the box. Of course their stuff is so urgent that they don't take the time to shut things down nicely. So they give the box the old 3 finger salute with the MQ channels in 'running' status.

Back comes the box and eventually they call me and tell me that the application isn't working and it is all mainframe MQ's fault. I quickly reset and restart the channels at both ends and all is well again.

Then my PHB says..."What was wrong with MQ this morning?"...*sigh*
