smeunier |
Posted: Tue Apr 17, 2012 11:29 am Post subject: Trace/Logging of issued commands |
|
|
 Partisan
Joined: 19 Aug 2002 Posts: 305 Location: Green Mountains of Vermont
|
I have an environment with 4 servers (2 z/OS and 2 UNIX) in a cluster. The 2 UNIX servers hold the full repositories. A SUSPEND QMGR was issued from one of the z/OS servers, which it promptly performed. A few days later the RESUME QMGR command was issued. Other than a command-accepted message in the SYSLOG, there was no indication of whether it worked. A week later, the second z/OS server was suspended... all hell broke loose, as effectively both z/OS servers were now suspended.
Obviously the practitioner should have done some due diligence and confirmed the state of the first qmgr after it was RESUMED. My task now is to find out why the RESUME command did not complete or failed.
I have looked at the UNIX MQ logs but do not see any errors. Would a failure, or a success (apparently not), have been logged somewhere? If so, in which logs: the z/OS MQ logs or some other location?
Any idea where I could find the success/failure of the RESUME command? |
|
bruce2359 |
Posted: Tue Apr 17, 2012 12:21 pm Post subject: |
|
|
 Poobah
Joined: 05 Jan 2008 Posts: 9469 Location: US: west coast, almost. Otherwise, enroute.
|
Why did you/they do the SUSPEND?
What changes to qmgr(s) took place on the suspended qmgr?
Did the RESUME take place on the exact same qmgr where you did the SUSPEND?
What version/release of mq? _________________ I like deadlines. I like to wave as they pass by.
ב''ה
Lex Orandi, Lex Credendi, Lex Vivendi. As we Worship, So we Believe, So we Live. |
|
smeunier |
Posted: Tue Apr 17, 2012 12:41 pm Post subject: |
|
|
 Partisan
Joined: 19 Aug 2002 Posts: 305 Location: Green Mountains of Vermont
|
The suspend was issued because that server was coming down for maintenance. It is part of a dual z/OS logistics system instance. It is a procedural step to ensure that no transaction is caught in flight to the server going offline while the secondary assumes the full workload. The resume was issued from the same server as the suspend. No work was done on the QMGR itself, only on the OS: security fixes and general support upgrades to z/OS. An IPL followed.
MQ V7.0.1 is installed. |
|
PeterPotkay |
Posted: Tue Apr 17, 2012 3:38 pm Post subject: |
|
|
 Poobah
Joined: 15 May 2001 Posts: 7722
|
Whenever I suspended or resumed QMs in clusters, I would use the DISPLAY CLUSQMGR command on the suspended QM to see what the SUSPEND attribute in the output of the command said. I am not aware of any logs where this info is captured. So our scripts to suspend or resume the QM always included the DISPLAY CLUSQMGR command to validate what happened.
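As a rough illustration of that kind of validation step, here is a minimal Python sketch that scans captured DISPLAY CLUSQMGR output for the SUSPEND attribute. The sample text, the queue manager names (ZOS1, ZOS2), the cluster name and the function names are all made up for the example, and the exact output layout differs by platform and version, so treat the parsing as an assumption rather than a tested tool. It also assumes each queue manager appears in only one cluster.

```python
import re

# Hypothetical captured output; names and layout are assumptions for illustration.
SAMPLE_OUTPUT = """\
AMQ8441: Display Cluster Queue Manager details.
   CLUSQMGR(ZOS1)                          CHANNEL(TO.ZOS1)
   CLUSTER(LOGISTICS)                      SUSPEND(YES)
AMQ8441: Display Cluster Queue Manager details.
   CLUSQMGR(ZOS2)                          CHANNEL(TO.ZOS2)
   CLUSTER(LOGISTICS)                      SUSPEND(NO)
"""

# Matches KEYWORD(value) pairs as they appear in MQSC-style display output.
ATTR = re.compile(r"(\w+)\(([^)]*)\)")


def suspend_states(output):
    """Map each CLUSQMGR name found in the output to its SUSPEND value."""
    states = {}
    current = None
    for name, value in ATTR.findall(output):
        if name == "CLUSQMGR":
            current = value  # start of a new record
        elif name == "SUSPEND" and current is not None:
            states[current] = value
    return states


def still_suspended(output):
    """Return the sorted names of queue managers reported as SUSPEND(YES)."""
    return sorted(q for q, s in suspend_states(output).items() if s == "YES")


if __name__ == "__main__":
    print(suspend_states(SAMPLE_OUTPUT))
    print(still_suspended(SAMPLE_OUTPUT))
```

A resume script could fail loudly if `still_suspended` returns a non-empty list after the RESUME has been issued.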
I have opened a PMR in the past where they confirmed it is safe to issue the RESUME command for a QM not suspended, and safe to issue the SUSPEND command to a QM already suspended.
Your comment that all hell broke loose once the 2nd QM was suspended is perplexing to me. A suspended QM is not going to refuse traffic, unless you use the MODE(FORCE) option on the SUSPEND command. But I must admit I never tried suspending all the QMs in a cluster that hosted a particular queue. If all the destinations for a message in a cluster were on suspended QMs, I don't know what would happen. Define 'all hell'. Were messages going to DLQs - what were the reason codes in the dead letter header? Were apps getting failed MQPUTs - what was the reason code?
Perhaps this offers a clue where to look:
http://publib.boulder.ibm.com/infocenter/wmqv7/v7r0/topic/com.ibm.mq.csqzaj.doc/sc12930_.htm
Quote: |
On z/OS, if you define CLUSTER or CLUSNL:
The command fails if the channel initiator has not been started.
Any errors are reported to the console on the system where the channel initiator is running; they are not reported to the system that issued the command.
On z/OS, you cannot issue RESUME QMGR CLUSTER(clustername) or RESUME QMGR FACILITY commands from CSQINP2.
|
It's a good idea to have test queues defined on all QMs in a cluster and to have scripts that pump dummy messages to these queues to validate that all QMs in the cluster get their expected messages after maintenance. _________________ Peter Potkay
Keep Calm and MQ On |
|
fjb_saper |
Posted: Tue Apr 17, 2012 8:07 pm Post subject: |
|
|
 Grand High Poobah
Joined: 18 Nov 2003 Posts: 20756 Location: LI,NY
|
If I want to verify that a queue manager is correctly in the resumed state, I usually do 2 checks:
Code: |
dis clusqmgr(<qmgrname>) suspend |
Once on the qmgr that was suspended,
Once on each of the FRs.
If it does not show up correctly on the FRs, troubleshoot the communications.
If it does not show up correctly on one of the FRs, troubleshoot communications between the FRs...
And as an added precaution, check through monitoring that the cluster queues are seeing traffic (a selected sample of cluster queues).
Never had surprises this way...
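The decision steps above could be sketched roughly like this in Python. It assumes the SUSPEND value has already been collected from the qmgr itself and from each FR; the function name, the FR names (UNIX1, UNIX2) and the returned messages are hypothetical, and the hint for the case where the qmgr itself still shows the wrong state is an assumption that goes beyond the steps listed.

```python
def diagnose(local_view, fr_views, expected="NO"):
    """Suggest where to look, given the SUSPEND value seen in each place.

    local_view: SUSPEND value reported on the qmgr that was resumed.
    fr_views:   {full_repository_name: SUSPEND value it reports for that qmgr}.
    """
    if local_view != expected:
        # Assumption: the command did not take effect locally.
        return "qmgr itself does not show SUSPEND(%s)" % expected
    wrong = sorted(fr for fr, value in fr_views.items() if value != expected)
    if not wrong:
        return "consistent"
    if len(wrong) == len(fr_views):
        # No FR has the update: look at the path from the qmgr to the FRs.
        return "check communications from the qmgr to the FRs"
    # Only some FRs have the update: look at the path between the FRs.
    return "check communications between the FRs: " + ", ".join(wrong)


if __name__ == "__main__":
    print(diagnose("NO", {"UNIX1": "NO", "UNIX2": "YES"}))
```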
Have fun  _________________ MQ & Broker admin |
|
smeunier |
Posted: Wed Apr 18, 2012 6:01 am Post subject: |
|
|
 Partisan
Joined: 19 Aug 2002 Posts: 305 Location: Green Mountains of Vermont
|
First, thanks for all the replies. I think the consensus is what it should have been: check, check, then check again if it is a critical path. The DISPLAY CLUSQMGR(*) SUSPEND will be added to the checkout process. This we knew we would have to do. You can only be lucky so many times before you run out.
Let me explain the "all hell" statement.
The destination queues in the cluster are defined only on the z/OS servers, as this is the endpoint processing. With one QMGR already suspended from the cluster and presumably resumed, the second QMGR was suspended from the cluster, effectively removing all the endpoint QMGRs from the cluster. The "all hell" refers to the fact that these are real-time logistics transactions which, if not processed within 60 seconds, are stale. They have message expiry for self-cleanup, and the sending application has a thread timer that reports failure. Since these are logistics transactions at a very high volume, the factory had essentially stalled. Time is money.
This procedure is one that we have executed successfully for over 5 years, and it gives us a high degree of flexibility. This time it showed a flaw in our procedure. It's never too late to LEARN!
Thanks again. |
|
PeterPotkay |
Posted: Wed Apr 18, 2012 8:17 am Post subject: |
|
|
 Poobah
Joined: 15 May 2001 Posts: 7722
|
Were messages going to DLQs - what were the reason codes in the dead letter header? Were apps getting failed MQPUTs - what was the reason code? _________________ Peter Potkay
Keep Calm and MQ On |
|
fjb_saper |
Posted: Wed Apr 18, 2012 3:56 pm Post subject: |
|
|
 Grand High Poobah
Joined: 18 Nov 2003 Posts: 20756 Location: LI,NY
|
All Hell broke loose:
The manual says (somewhere) that if all destinations are on suspended qmgrs, it is as if none of the destinations were on a suspended qmgr.
So my guess is that he had a lot of transactions that expired on the queue without being processed, as he probably stopped the application before checking (monitoring) that no more transactions were flowing through the qmgr...  _________________ MQ & Broker admin |
|
mqjeff |
Posted: Wed Apr 18, 2012 5:28 pm Post subject: |
|
|
Grand Master
Joined: 25 Jun 2008 Posts: 17447
|
fjb_saper wrote: |
All Hell broke loose:
The manual says (somewhere) that if all destinations are on suspended qmgrs, it is as if none of the destinations were on a suspended qmgr |
Suspended just means "make the best effort to ignore this queue manager, unless it's the only queue manager we can talk to!". |
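To make that "best effort" behaviour concrete, here is a deliberately simplified Python sketch. It is only an illustration of the rule as described in this thread, not the actual cluster workload algorithm, and the function and queue manager names are hypothetical.

```python
def eligible_destinations(candidates):
    """Pick the queue managers a message may be routed to.

    candidates: {qmgr_name: True if suspended, False otherwise}.
    Non-suspended queue managers are preferred; if every candidate is
    suspended, all of them are treated as eligible again.
    """
    active = sorted(q for q, suspended in candidates.items() if not suspended)
    return active if active else sorted(candidates)


if __name__ == "__main__":
    # One suspended: traffic avoids it.
    print(eligible_destinations({"ZOS1": True, "ZOS2": False}))
    # Both suspended: neither is avoided any more.
    print(eligible_destinations({"ZOS1": True, "ZOS2": True}))
```

In the scenario from this thread, with the first qmgr still showing as suspended and the second one then suspended as well, both would count as eligible under this rule, so messages could still be routed to the qmgr that was about to be taken down.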
|