Automating MQ troubleshooting |
Author |
Message
|
jeevan |
Posted: Thu Jan 27, 2011 12:25 pm Post subject: Automating MQ troubleshooting |
|
|
Grand Master
Joined: 12 Nov 2005 Posts: 1432
|
Hi folks,
I am working on an MQ troubleshooting script. So far it does this:
check that the qmgr is running (OK)
check that the listener is running (OK)
check that the command server is running; if it is not running or is hung, stop and start it, etc.
Now, at the point where I check that the qmgr is running, I would also like to check that the qmgr is responsive. I can use the PING QMGR command, but my problem is this.
If I run the command in the main flow and the qmgr is hung or unresponsive, my script gets stuck. I can put that single command in another file and call it, but if that runs as a normal script the main flow still waits for it to complete. I have to run it in the background so that the main flow proceeds, and then I can check whether the child script is still running and decide whether the ping hung. But I could not find a way to do this in Perl or a batch script.
I tried this:
The main script looks something like this.
#!/usr/bin/perl
#require "count.pl";
#system("perl c:/temp1000/count.pl >/dev/null 2>&1 &");
#system("perl c:/temp1000/count.pl &");
my $test = `perl c:/temp1000/count.pl &`;
print " PID of child: $test\n";
print "Finished Counting!";
Whichever of the three ways shown above I use to call the child script, the main script waits for it to complete. Is there any way I can achieve this?
Further information:
MQ version: MQ 7.0.1.3
Perl dist: ActivePerl 5.10
Platform: Windows
Note: I think I can achieve this easily on Unix/Linux by running the script in the background and checking whether it has completed.
Thanks
Last edited by jeevan on Thu Jan 27, 2011 1:59 pm; edited 3 times in total |
|
fatherjack |
Posted: Thu Jan 27, 2011 12:42 pm Post subject: |
|
|
 Knight
Joined: 14 Apr 2010 Posts: 522 Location: Craggy Island
|
Or you could look at any of the many MQ systems management tools out there. _________________ Never let the facts get in the way of a good theory. |
|
gbaddeley |
Posted: Thu Jan 27, 2011 4:17 pm Post subject: |
|
|
 Jedi Knight
Joined: 25 Mar 2003 Posts: 2538 Location: Melbourne, Australia
|
On Windows, you can examine the output of the "tasklist" command to check that the MQ processes and service are running. This is pretty low level stuff. It would be much better and more robust to use a proper MQ monitoring tool. Search this forum. _________________ Glenn |
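As a rough sketch of the tasklist approach (written in Python rather than the thread's Perl, purely for illustration): `tasklist /FO CSV /NH` prints one quoted CSV row per process, which is easy to parse. The image names below (`amqzxma0.exe` for the queue manager's execution controller, `runmqlsr.exe` for a listener, `amqpcsea.exe` for the command server) are the usual WebSphere MQ process names, but treat that list as an assumption to verify on a healthy server. `running_images`, `missing_processes` and `capture_tasklist` are hypothetical helper names.

```python
import csv
import io
import subprocess

# Assumed image names; confirm them against a healthy server before relying on this.
EXPECTED = {
    "amqzxma0.exe": "queue manager (execution controller)",
    "runmqlsr.exe": "listener",
    "amqpcsea.exe": "command server",
}


def running_images(tasklist_csv):
    """Return the lower-cased image names found in `tasklist /FO CSV /NH` output."""
    rows = csv.reader(io.StringIO(tasklist_csv))
    return {row[0].lower() for row in rows if row}


def missing_processes(tasklist_csv, expected=EXPECTED):
    """Map each expected image that is NOT in the process list to its description."""
    found = running_images(tasklist_csv)
    return {name: desc for name, desc in expected.items() if name not in found}


def capture_tasklist(timeout=30):
    """Windows only: capture the live process list as CSV text."""
    result = subprocess.run(
        ["tasklist", "/FO", "CSV", "/NH"],
        capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout
```

On the server itself, `missing_processes(capture_tasklist())` would list whatever is absent. Note that this only shows that a process with that name exists. It does not say which queue manager the process belongs to, or whether the queue manager is actually responding.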
|
jeevan |
Posted: Thu Jan 27, 2011 8:06 pm Post subject: |
|
|
Grand Master
Joined: 12 Nov 2005 Posts: 1432
|
gbaddeley wrote: |
On Windows, you can examine the output of the "tasklist" command to check that the MQ processes and service are running. This is pretty low level stuff. It would be much better and more robust to use a proper MQ monitoring tool. Search this forum. |
I am trying to develop a troubleshooting script that runs all the basic tests, not a monitoring tool. We already have monitoring in place. |
|
bruce2359 |
Posted: Thu Jan 27, 2011 10:17 pm Post subject: |
|
|
 Poobah
Joined: 05 Jan 2008 Posts: 9469 Location: US: west coast, almost. Otherwise, enroute.
|
Hmmmm, interesting.
Monitoring identifies potential problems, while troubleshooting discovers the cause(s) of problems. Monitoring is observational, while troubleshooting is analytical.
What do you expect your script to discover that monitoring tools will not? _________________ I like deadlines. I like to wave as they pass by.
ב''ה
Lex Orandi, Lex Credendi, Lex Vivendi. As we Worship, So we Believe, So we Live. |
|
santnmq |
Posted: Fri Jan 28, 2011 3:27 am Post subject: |
|
|
Centurion
Joined: 11 Jan 2011 Posts: 125
|
It's confusing. You probably want to monitor MQ through a script, not troubleshoot it.
It's easy to write scripts for your customized monitoring and get alerted whenever there is an issue, or, as stated earlier, there are a number of monitoring tools that can be used for this purpose. |
|
jeevan |
Posted: Fri Jan 28, 2011 4:50 am Post subject: |
|
|
Grand Master
Joined: 12 Nov 2005 Posts: 1432
|
santnmq wrote: |
It's confusing. You probably want to monitor MQ through a script, not troubleshoot it.
It's easy to write scripts for your customized monitoring and get alerted whenever there is an issue, or, as stated earlier, there are a number of monitoring tools that can be used for this purpose. |
Why do you guess that I probably want a monitoring script? We have the ITM and Nagios monitoring tools in place. What I want, and am developing, is a troubleshooting script. |
|
Vitor |
Posted: Fri Jan 28, 2011 5:15 am Post subject: |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
jeevan wrote: |
We have the ITM and Nagios monitoring tools in place. What I want, and am developing, is a troubleshooting script. |
The question I'm struggling with here is the distinction you're making. To me, a monitoring tool monitors a software system (WMQ, database, application, whatever) and alerts in the event of a problem. That problem could be a response out of SLA but is more typically a failed component like a downed queue manager or a channel in retry. This is certainly what ITM does.
So given that, what additional information are you attempting to determine with this script? How will it help you troubleshoot in a way that the existing monitoring doesn't? _________________ Honesty is the best policy.
Insanity is the best defence. |
|
mqjeff |
Posted: Fri Jan 28, 2011 5:19 am Post subject: |
|
|
Grand Master
Joined: 25 Jun 2008 Posts: 17447
|
I suspect that jeevan is attempting to write something similar to the functionality provided with ISALite - where it will automatically assemble a predefined and fixed set of troubleshooting data at the current point in time - like saving out a copy of AMQERR logs and collecting FFST files (and maybe running ffstsummary) and etc. etc. etc.
So that *after* the monitoring tool has CREATED an alert, then something can be run to make a first pass at DIAGNOSING the issue.
It's not a bad idea, really.
In terms of the actual *problem* mentioned - how to fork a process that might potentially hang and then detect whether it has actually HUNG.... That's the halting problem, which is undecidable in the general case... In this case, I'd redirect the stdout of the command back into the code and use a read that times out. If I haven't gotten any data after a timeout, I'd assume that the command had hung, and kill it. |
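A minimal sketch of that "capture the output, wait with a timeout, kill on expiry" idea, shown in Python rather than the Perl of the original post so that it is easy to try out. `run_with_timeout` is a hypothetical helper name, and the queue manager name `QM1` in the usage note is a placeholder. Python's `subprocess.run` kills the child itself when the timeout expires, so no separate kill step is needed here.

```python
import subprocess


def run_with_timeout(cmd, timeout_s, input_text=None):
    """Run cmd (a list of arguments) and return (status, output).

    status is "ok" (exit code 0), "failed" (non-zero exit code) or
    "hung" (the command did not finish within timeout_s seconds).
    """
    try:
        done = subprocess.run(
            cmd,
            input=input_text,      # text fed to the command's stdin, if any
            capture_output=True,   # redirect stdout/stderr back into this code
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired as exc:
        # subprocess.run() has already killed the child at this point.
        partial = exc.stdout or ""
        if isinstance(partial, bytes):
            partial = partial.decode(errors="replace")
        return "hung", partial
    status = "ok" if done.returncode == 0 else "failed"
    return status, done.stdout
```

For the ping in question, a call such as `run_with_timeout(["runmqsc", "QM1"], 15, input_text="PING QMGR\n")` would return `"hung"` if runmqsc produced no result within 15 seconds. The timeout value is a judgment call: too short and a merely slow queue manager is reported as hung. For a Perl-only solution on Windows, the perlport documentation describes a `system(1, @args)` form that spawns the process and returns without waiting, which may be worth testing with ActivePerl.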
|
jeevan |
Posted: Fri Jan 28, 2011 6:01 am Post subject: |
|
|
Grand Master
Joined: 12 Nov 2005 Posts: 1432
|
mqjeff wrote: |
I suspect that jeevan is attempting to write something similar to the functionality provided with ISALite - where it will automatically assemble a predefined and fixed set of troubleshooting data at the current point in time - like saving out a copy of AMQERR logs and collecting FFST files (and maybe running ffstsummary) and etc. etc. etc.
So that *after* the monitoring tool has CREATED an alert, then something can be run to make a first pass at DIAGNOSING the issue.
It's not a bad idea, really.
In terms of the actual *problem* mentioned - how to fork a process that might potentially hang and then detect whether it has actually HUNG.... That's the halting problem, which is undecidable in the general case... In this case, I'd redirect the stdout of the command back into the code and use a read that times out. If I haven't gotten any data after a timeout, I'd assume that the command had hung, and kill it. |
mqjeff,
You are almost correct in the first part. I want to develop a script that does two things: collect data as ISALite does, and run basic health checks and recommend the next step (which saves time and typing and avoids missing a check in a critical situation).
Let me explain the purpose. Let's say a problem has been reported with our qmgr, MQ app, or broker flows (messages are not being processed, messages cannot be put, databases are not being updated, etc.), or there is a trouble ticket. What do we do? How do we proceed?
We jump onto the server and check the qmgr status. Let's say it is unresponsive (one of many possible situations). We restart it (our priority is to restore service ASAP).
We check a few things. If we figure out the cause, good; if not, and it is late at night, we think, "I will collect the qmgr log etc. and go to sleep."
The next day we may spend some time on it, or forget.
Even if we spend time on it, let's say we cannot figure it out, so we may approach IBM.
By then:
The qmgr log will have been overwritten.
The context (processes running, memory, disk space, etc. at the time of the crash) will have changed.
IBM may ask, "Can you recreate the scenario?" Probably not, as we may have lost all the state of the OS, MQ, qmgr, and app from when it crashed.
That said, let me come back to the purpose of my script:
I want it to do the things we do when we log in to troubleshoot. That helps us troubleshoot faster and collects the MustGather data for later use.
For example:
check the qmgr status, e.g.
dspmq
if it is running, check that it is responding (PING QMGR, which is where I am stuck now)
if the qmgr is responding, check whether the listener is running
check that the command server is running
check memory, disk space, and space in the /tmp directory (in some cases this could be the problem)
scan the qmgr log for any unusual AMQ errors
check whether any FDC files have been generated
In both of the last two cases, copy them to a temporary location (a fixed place, created if necessary, not just anywhere, so that we do not forget them later)
Again, I just want to:
be more systematic (not miss anything) by following a checklist (best practice)
save time in a critical situation
help junior administrators/engineers do the basic things without asking senior people
create knowledge (if I put this into a script and document or explain each step, knowledge is created (DIKW))
Also, my approach is to write a script whenever I have to do the same or a similar thing again and again. That is what automation really means for an administrator or support person.
In fact, I am still working out the requirements document and the flow. For now both are in my head. |
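The ordered checklist above could be sketched roughly as follows, again in Python for illustration only. It assumes dspmq prints lines of the form `QMNAME(name) STATUS(status)`. `parse_dspmq` and `run_checklist` are hypothetical names, and stopping after the first failure is one possible design choice, not something the post prescribes.

```python
import re

# Assumed dspmq line format: QMNAME(QM1)        STATUS(Running)
DSPMQ_LINE = re.compile(r"QMNAME\(([^)]*)\)\s+STATUS\(([^)]*)\)")


def parse_dspmq(output):
    """Turn dspmq output into {queue manager name: status}."""
    return {name.strip(): status.strip()
            for name, status in DSPMQ_LINE.findall(output)}


def run_checklist(steps):
    """Run (label, check) pairs in order; check() returns True when healthy.

    After the first failure the remaining steps are reported as "skipped",
    because each later check only makes sense if the earlier ones passed.
    """
    report = []
    failed = False
    for label, check in steps:
        if failed:
            report.append((label, "skipped"))
            continue
        try:
            ok = bool(check())
        except Exception as exc:  # a crashing check is itself a finding
            report.append((label, f"error: {exc}"))
            failed = True
            continue
        report.append((label, "ok" if ok else "FAILED"))
        failed = not ok
    return report
```

A step list might look like `[("qmgr running", lambda: parse_dspmq(dspmq_text).get("QM1") == "Running"), ("qmgr responding", ping_check), ("listener running", listener_check)]`, where `dspmq_text`, `QM1`, `ping_check` and `listener_check` are placeholders. Checks that do not depend on the queue manager, such as memory and disk space, would probably belong in a separate list so that a qmgr failure does not skip them.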
|
jeevan |
Posted: Fri Jan 28, 2011 6:20 am Post subject: |
|
|
Grand Master
Joined: 12 Nov 2005 Posts: 1432
|
Vitor wrote: |
jeevan wrote: |
We have the ITM and Nagios monitoring tools in place. What I want, and am developing, is a troubleshooting script. |
The question I'm struggling with here is the distinction you're making. To me, a monitoring tool monitors a software system (WMQ, database, application, whatever) and alerts in the event of a problem. That problem could be a response out of SLA but is more typically a failed component like a downed queue manager or a channel in retry. This is certainly what ITM does.
So given that, what additional information are you attempting to determine with this script? How will it help you troubleshoot in a way that the existing monitoring doesn't? |
Vitor,
I see an alert as a warning that a threshold has been reached, and troubleshooting as the art of figuring out why it happened.
The first is done by the tool you mentioned and is already automated, so I am not trying to reinvent the wheel, but I want to do the second in a more systematic way.
Even if it is a response out of SLA, we check certain things. I want to follow a checklist in a scripted way to do these checks and come out with a consistent result.
If it is a trouble situation, we definitely check certain things. I would like to check these and, based on our knowledge, recommend the next step.
I would say it is codifying knowledge and putting that knowledge into practice.
It helps save crucial information (e.g. qmgr log, FDCs) for later use in a systematic way.
It helps us figure out the problem faster and more reliably (we may miss a check when working from memory).
I hope that answers your question. I would appreciate more questions so that I can either come up with a better script or give up if it turns out not to be needed or useful. |
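Saving the evidence before it is overwritten could be sketched like this (Python, for illustration only). The file patterns `AMQERR*.LOG` and `*.FDC` follow the usual MQ naming, but the source directories differ between installations, so the caller passes them in. `collect_evidence` and the `mq-evidence-` folder prefix are made-up names.

```python
import shutil
import time
from pathlib import Path


def collect_evidence(sources, dest_root, patterns=("AMQERR*.LOG", "*.FDC")):
    """Copy files matching patterns from each source dir into a new timestamped folder.

    Returns the list of copied paths. Missing source directories are skipped.
    """
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(dest_root) / f"mq-evidence-{stamp}"
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for index, src in enumerate(map(Path, sources)):
        if not src.is_dir():
            continue
        for pattern in patterns:
            for path in sorted(src.glob(pattern)):
                # Prefix with the source index so same-named logs do not collide.
                target = dest / f"{index}_{path.name}"
                shutil.copy2(path, target)  # copy2 keeps the original timestamps
                copied.append(target)
    return copied
```

Running this as the very first step, before any restart, would preserve the logs as they were at the time of the failure. Keeping the original timestamps matters when the files are later sent to IBM support.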
|
Vitor |
Posted: Fri Jan 28, 2011 6:53 am Post subject: |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
jeevan wrote: |
I see an alert as a warning that a threshold has been reached, and troubleshooting as the art of figuring out why it happened.
The first is done by the tool you mentioned and is already automated, so I am not trying to reinvent the wheel, but I want to do the second in a more systematic way. |
That's a fair point. I remain unconvinced that a script could determine anything that the existing monitoring tool has not already recorded, but do see that the checklist concept has value & maybe could be scripted as you indicate. _________________ Honesty is the best policy.
Insanity is the best defence. |
|
mqjeff |
Posted: Fri Jan 28, 2011 6:54 am Post subject: |
|
|
Grand Master
Joined: 25 Jun 2008 Posts: 17447
|
The part where you automatically gather data is, as I said, not a bad idea. In fact, it's a really good idea.
The challenge is to build enough of an expert system on top of that that you can actually make any recommendations about what to do next. Given, for example, a channel ending with an MQRC2009 - is it the fault of the network? Is it the fault of the qmgr? Is it the fault of the network hardware in the server? How do you determine what is the "next" step in this case? Using what information?
This gets to be a complicated, hard problem.
It's worth examining at least, but it gets complicated and hard very quickly. |
|
bruce2359 |
Posted: Fri Jan 28, 2011 6:56 am Post subject: |
|
|
 Poobah
Joined: 05 Jan 2008 Posts: 9469 Location: US: west coast, almost. Otherwise, enroute.
|
Before you begin coding a troubleshooting script, document in flowchart fashion exactly what event/situation (symptom) will lead to which steps in the troubleshooting process. Do this as if you were training your replacement to diagnose suspected problems.
An example symptom: a queue that has no messages (current depth) - but should. There are a variety of possible causes for this. The underlying cause (problem) may be very different from another queue with the same symptom. _________________ I like deadlines. I like to wave as they pass by.
ב''ה
Lex Orandi, Lex Credendi, Lex Vivendi. As we Worship, So we Believe, So we Live. |
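One way to capture such a flowchart before writing any real checks is as plain data: each symptom maps to an ordered list of things to look at. The sketch below is in Python for illustration. The entries are examples only (the first symptom is the empty-queue case above, the second uses the questions raised earlier in the thread about a channel ending with reason code 2009), and `PLAYBOOK` and `next_steps` are hypothetical names.

```python
# Hypothetical playbook: symptom -> ordered checks. The entries are examples only
# and would need to be replaced with the site's own documented procedure.
PLAYBOOK = {
    "queue empty but should hold messages": [
        "Is the putting application connected and running?",
        "Are messages expiring, or being consumed by an unexpected reader?",
        "Is an upstream channel stopped or retrying?",
        "Have messages landed on the dead-letter queue instead?",
    ],
    "channel ended with reason code 2009": [
        "Is it the fault of the network?",
        "Is it the fault of the qmgr?",
        "Is it the fault of the network hardware in the server?",
    ],
}

UNKNOWN = ["No documented procedure: escalate and record the new symptom."]


def next_steps(symptom, playbook=PLAYBOOK):
    """Return a copy of the ordered checks for a symptom, or an escalation note."""
    steps = playbook.get(symptom.strip().lower())
    return list(steps) if steps is not None else list(UNKNOWN)
```

Keeping the playbook as data rather than code means it can be reviewed like a document, and a symptom that is not listed falls through to an explicit "escalate" answer instead of a guess.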
|
jeevan |
Posted: Fri Jan 28, 2011 7:02 am Post subject: |
|
|
Grand Master
Joined: 12 Nov 2005 Posts: 1432
|
mqjeff wrote: |
The part where you automatically gather data is, as I said, not a bad idea. In fact, it's a really good idea.
The challenge is to build enough of an expert system on top of that that you can actually make any recommendations about what to do next. Given, for example, a channel ending with an MQRC2009 - is it the fault of the network? Is it the fault of the qmgr? Is it the fault of the network hardware in the server? How do you determine what is the "next" step in this case? Using what information?
This gets to be a complicated, hard problem.
It's worth examining at least, but it gets complicated and hard very quickly. |
I agree. But I want to be able to conclude, for example, that it is not an MQ problem but a network problem, so that I can reach out to the network folks and/or create a trouble ticket.
I am not trying to go beyond my own area.
Thank you very much for the insight. I am pretty sure I will run into a lot of hurdles on this journey. |
|