Author |
Message
|
PeterPotkay |
Posted: Wed Mar 23, 2005 1:51 pm Post subject: Cluster Channels Hang |
|
|
 Poobah
Joined: 15 May 2001 Posts: 7722
|
MQ 5.3 CSD8
Windows 2000
Microsoft Hardware clustering.
QMA, QM1 and QM2 are in a MQ cluster. There are no other QMs in this cluster.
QM1 and QM2 are the Full Repositories.
Every few days, a CLUSSNDR hangs, usually in STARTING status. Messages destined for that channel queue up in the S.C.T.Q., but other cluster messages to the other QM flow fine. It seems random which CLUSSNDR hangs, but it is always to or from QMA.
If we try to stop the channel, hoping to manually restart it, it stays stuck in STOPPING. If we try to take the QM offline in MSCS, it then stays stuck in Offline Pending (eventually going to a FAILED status). The only solution is to reboot, which fixes the problem.
It does not happen in our LAB, DEV, or Production. Only QA.
Got a ticket with IBM, but no joy yet.
The following FDC is thrown when the channel gets stuck.
Code: |
+-----------------------------------------------------------------------------+
| |
| WebSphere MQ First Failure Symptom Report |
| ========================================= |
| |
| Date/Time :- Wed March 23 10:29:56 Eastern Standard Time 2005 |
| Host Name :- ERDSIMMQS001 (Windows 2000 Build 2195: Service Pack 4) |
| PIDS :- 5724B4100 |
| LVLS :- 530.8 CSD08 |
| Product Long Name :- WebSphere MQ for Windows |
| Vendor :- IBM |
| Probe Id :- XC130031 |
| Application Name :- MQM |
| Component :- xehExceptionHandler |
| Build Date :- Sep 22 2004 |
| CMVC level :- p530-08-L040921 |
| Build Type :- IKAP - (Production) |
| UserID :- MUSR_MQADMIN |
| Process Name :- E:\programs\MQSeries\amqrmppa\amqrmppa.exe |
| Process :- 00003860 |
| Thread :- 00001154 |
| QueueManager :- HIGHUBQB |
| Major Errorcode :- xecF_E_UNEXPECTED_SYSTEM_RC |
| Minor Errorcode :- OK |
| Probe Type :- MSGAMQ6119 |
| Probe Severity :- 2 |
| Probe Description :- AMQ6119: An internal WebSphere MQ error has occurred |
| (Access Violation at address 01C1C000 when reading) |
| FDCSequenceNumber :- 0 |
| Comment1 :- Access Violation at address 01C1C000 when reading |
| |
| |
+-----------------------------------------------------------------------------+
|
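When several FDCs pile up, it can help to pull the header fields out programmatically and compare Probe Id, Process, and Thread across incidents. The following is only a minimal sketch, assuming the boxed "Key :- Value" header layout shown above; `parse_fdc_header` and `SAMPLE` are hypothetical names for illustration, not part of any MQ tooling.

```python
def parse_fdc_header(text):
    """Return a dict of the 'Key :- Value' fields found in an FDC header box."""
    fields = {}
    last_key = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("|"):
            continue  # border rows ("+-----+") and anything outside the box
        inner = line.strip("|").strip()
        if not inner:
            continue  # empty row inside the box
        key, sep, value = inner.partition(":-")
        if sep:
            last_key = key.strip()
            fields[last_key] = value.strip()
        elif last_key is not None:
            # A row without ':-' after a field is treated as a wrapped value.
            fields[last_key] += " " + inner
    return fields


# Abbreviated copy of the header above, used only to exercise the parser.
SAMPLE = "\n".join([
    "+--------------------------------------------+",
    "| WebSphere MQ First Failure Symptom Report |",
    "| Probe Id :- XC130031 |",
    "| Process :- 00003860 |",
    "| Thread :- 00001154 |",
    "| Probe Description :- AMQ6119: An internal WebSphere MQ error has occurred |",
    "| (Access Violation at address 01C1C000 when reading) |",
    "| |",
    "+--------------------------------------------+",
])

fields = parse_fdc_header(SAMPLE)
for name in ("Probe Id", "Process", "Thread"):
    print(name, "=", fields[name])
```

Rows that appear before the first field, such as the report title, are ignored, so the same function should cope with a full header pasted as-is.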
The following is thrown to system MQ error log:
Code: |
-------------------------------------------------------------------------------
03/23/2005 10:29:55
AMQ6119: An internal WebSphere MQ error has occurred (Access Violation at
address 01C1C000 when reading)
EXPLANATION:
MQ detected an unexpected error when calling the operating system. The MQ error
recording routine has been called.
ACTION:
Use the standard facilities supplied with your system to record the problem
identifier, and to save the generated output files. Contact your IBM support
center. Do not discard these files until the problem has been resolved.
----- amqxfdcp.c : 631 --------------------------------------------------------
03/23/2005 10:29:56
AMQ6184: An internal WebSphere MQ error has occurred on queue manager HIGHUBQB.
EXPLANATION:
An error has been detected, and the WebSphere MQ error recording routine has
been called. The failing process is process 3860.
ACTION:
Use the standard facilities supplied with your system to record the problem
identifier, and to save the generated output files. Contact your IBM support
center. Do not discard these files until the problem has been resolved.
----- amqxfdcp.c : 665 --------------------------------------------------------
|
_________________ Peter Potkay
Keep Calm and MQ On |
|
|
 |
csmith28 |
Posted: Wed Mar 23, 2005 5:48 pm Post subject: |
|
|
 Grand Master
Joined: 15 Jul 2003 Posts: 1196 Location: Arizona
|
Ok I'll take a shot at this at the risk of being proven wrong.
I assume you have already searched the IBM Support, Google and this site for similar problems to no avail.
This looks like it is most likely related to some obscure Security issue related to the (Bill Gates/Spawn of Satan) MS Operating System and the MUSR_MQADMIN user ID.
Perhaps there is some secondary Domain Server that is suddenly deciding that MUSR_MQADMIN is a threat or shouldn't have permission to start a Channel on the QMA Server.
Or maybe Address 01C1C000 on the local disk is a corrupted sector. Maybe.
Have you run defrag or chkdsk /R to see if that helps?
If all else fails, call IBM.  _________________ Yes, I am an agent of Satan but my duties are largely ceremonial. |
|
|
 |
fjb_saper |
Posted: Thu Mar 24, 2005 2:42 pm Post subject: |
|
|
 Grand High Poobah
Joined: 18 Nov 2003 Posts: 20756 Location: LI,NY
|
Possible memory leak/problem on the server.
Does it still do this after a reboot?
Have you installed additional (incompatible) software recently?
Changed tuning on firewall, privacy or virus protectors?
Enjoy  |
|
|
 |
kman |
Posted: Thu Mar 24, 2005 7:36 pm Post subject: |
|
|
Partisan
Joined: 21 Jan 2003 Posts: 309 Location: Kuala Lumpur, Malaysia
|
Since you already raised a PMR with IBM on this, I hope they will get back with the resolution soon and you can share it with us.
But barring any network issue, I put my bet on the memory leak issue.
One suggestion is to revert to CSD07.
If it is a memory issue, you should be able to see from the system whether memory usage really keeps going up.
What's your problem ticket number? |
|
|
 |
PeterPotkay |
Posted: Fri Mar 25, 2005 5:09 am Post subject: |
|
|
 Poobah
Joined: 15 May 2001 Posts: 7722
|
PMR 09786 L6Q
We have CSD8 on because it specifically fixed a problem, so we can't roll back.
I'll look at memory next time it happens. _________________ Peter Potkay
Keep Calm and MQ On |
|
|
 |
vennela |
Posted: Fri Mar 25, 2005 11:15 am Post subject: |
|
|
 Jedi Knight
Joined: 11 Aug 2002 Posts: 4055 Location: Hyderabad, India
|
|
|
 |
PeterPotkay |
Posted: Fri Mar 25, 2005 2:10 pm Post subject: |
|
|
 Poobah
Joined: 15 May 2001 Posts: 7722
|
vennela, if you can get that PMR, that would be great. But I kinda doubt it, because when this problem affects our Gateway QM, the channels to other QMs from this QM keep working, and a second cluster channel in another overlapping cluster also keeps working.
We got the lads in Hursley involved now as well. I bumped up my DISCINTs in the QA environment, so the channels stay running, as I suspect it's a problem that occurs when the channels are either ready to go Inactive on their own, or are triggering back up. Dev I left as is, so the problem hopefully happens again to establish a pattern. _________________ Peter Potkay
Keep Calm and MQ On |
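For anyone wanting to try the same DISCINT change, here is a rough sketch of a helper that builds the MQSC text to paste into runmqsc. It is an illustration only: the post does not say which channel definitions were altered, so the channel name `TO.QMA`, the default of altering the CLUSRCVR definition, and the helper name `alter_discint` are all assumptions. The 0 to 999999 range is the documented range for DISCINT, with 0 meaning the channel is never disconnected for being idle.

```python
def alter_discint(channel, seconds, chltype="CLUSRCVR"):
    """Build an MQSC ALTER CHANNEL command that sets the disconnect interval."""
    if chltype not in ("CLUSRCVR", "CLUSSDR"):
        raise ValueError("this sketch only handles cluster channel types")
    if not isinstance(seconds, int) or not 0 <= seconds <= 999999:
        raise ValueError("DISCINT must be an integer from 0 to 999999")
    # Quote the name so MQSC does not fold it to upper case.
    return f"ALTER CHANNEL('{channel}') CHLTYPE({chltype}) DISCINT({seconds})"


# Hypothetical channel name; substitute the real cluster channel.
command = alter_discint("TO.QMA", 999999)
print(command)
```

Keeping the range check in the helper means a typo in the interval is caught before the command reaches the queue manager.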
|
|
 |
jefflowrey |
Posted: Fri Mar 25, 2005 2:27 pm Post subject: |
|
|
Grand Poobah
Joined: 16 Oct 2002 Posts: 19981
|
PeterPotkay wrote: |
But I kinda doubt it, because when this problem affects our Gateway QM, the channels to other QMs from this QM keep working, |
That sounds kinda like a reverse DNS lookup issue for a single IP address...? _________________ I am *not* the model of the modern major general. |
|
|
 |
PeterPotkay |
Posted: Fri Mar 25, 2005 2:30 pm Post subject: |
|
|
 Poobah
Joined: 15 May 2001 Posts: 7722
|
Next time this bites me in the arse, b4 I reboot the machine, I will try and telnet from the problem box to the destination box.
But note that I have 2 cluster senders going to the same destination, and only one gets a problem (and it's not always the same one), so maybe not, UNLESS it is a combo of a DNS problem at the same time a channel wants to start. Hmmmmm.... _________________ Peter Potkay
Keep Calm and MQ On |
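The telnet test and the reverse DNS idea can be combined into one scripted check to run from the problem box before rebooting. This is a minimal sketch under assumptions: port 1414 is the usual MQ listener default but may not match this environment, the function name `probe_destination` is made up for illustration, and the three lookup/connect callables are parameters only so the check can be exercised without a network.

```python
import socket


def probe_destination(host, port=1414, timeout=5.0,
                      resolve=socket.gethostbyname,
                      reverse=socket.gethostbyaddr,
                      connect=socket.create_connection):
    """Check forward DNS, reverse DNS and a TCP connect to the listener port."""
    result = {"host": host, "port": port, "address": None,
              "reverse_name": None, "tcp_ok": False, "errors": []}
    try:
        result["address"] = resolve(host)
    except OSError as exc:
        result["errors"].append(f"forward lookup failed: {exc}")
        return result  # nothing more to test without an address
    try:
        # gethostbyaddr returns (hostname, aliases, addresses).
        result["reverse_name"] = reverse(result["address"])[0]
    except OSError as exc:
        result["errors"].append(f"reverse lookup failed: {exc}")
    try:
        conn = connect((result["address"], port), timeout=timeout)
        conn.close()
        result["tcp_ok"] = True
    except OSError as exc:
        result["errors"].append(f"connect failed: {exc}")
    return result
```

Calling `probe_destination("destination-host")` with the real host name returns a dict whose `errors` list shows which of the three steps failed, which should help separate a DNS problem from a plain connectivity problem.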
|
|
 |
fjb_saper |
Posted: Sat Mar 26, 2005 7:23 am Post subject: |
|
|
 Grand High Poobah
Joined: 18 Nov 2003 Posts: 20756 Location: LI,NY
|
I've seen bizarre things happen with DNS:
comms from A to B are no problem, but from B to A they work only if a link from A to B is active at the same time....
Enjoy |
|
|
 |
|