Sudden Cluster resolution errors in a healthy cluster |
Author |
Message
|
mqdev |
Posted: Thu Feb 10, 2005 11:31 am Post subject: Sudden Cluster resolution errors in a healthy cluster |
|
|
Centurion
Joined: 21 Jan 2003 Posts: 136
|
Hello Folks,
We have a production cluster consisting of 1200+ QMgrs. The cluster was running without any problems until we bounced the Full Repo QMs as part of MQ maintenance. All hell broke loose then, with multiple QMs unable to resolve cluster queues and throwing 2189 errors (the channels to and from the Full Repository QMs were running, though some had to be started manually). Here are the symptoms of the problem:
An app running off a member QM throws an error 2189 for a cluster queue.
Issuing REFRESH CLUSTER(name) on the member QM does not start the channels to the Full Repos.
Manually start the channels to and from the Full Repos. They start and are in RUNNING status.
Again issue a REFRESH CLUSTER on the member QM.
Though the channels between the Full Repo and the member QM are RUNNING, the cluster queues are still unresolved and the app gets the 2189 error.
When we do a "dis ql(*) CURDEPTH" on the member QM, we see some messages sitting on the cluster transmit queue, and each REFRESH CLUSTER command adds about 10 more messages to it. The amqrrmfa (Repository Manager process) on the member QM is running.
We are running MQ v5.3 CSD05 on AIX 5.2.
Does anyone have any idea what the problem might be?
Thanks
mqdev |
|
PeterPotkay |
Posted: Thu Feb 10, 2005 3:02 pm Post subject: |
|
|
 Poobah
Joined: 15 May 2001 Posts: 7722
|
The channels from the partials can't communicate to the fulls.
Exactly what did you do on the fulls as part of "MQ maintenance"?
As a test, create a new local queue on one of the problem partials, cluster it, and go look on one of the fulls to see if the queue shows up (use the command line or MO71; do not use MQExplorer).
If that works, create a queue on the full, cluster it, and then do an amqsput to that queue on one of the partials. Does the message make it?
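The two tests above might look like this in runmqsc. This is only a sketch: the queue names CLUSTER.TEST.Q1 and CLUSTER.TEST.Q2 are made up for illustration, and yourClusterName stands in for the real cluster name.

```
* On the problem partial: define a test queue and advertise it in the cluster
DEFINE QLOCAL(CLUSTER.TEST.Q1) CLUSTER(yourClusterName)

* On one of the fulls: check whether the new queue has arrived
DISPLAY QCLUSTER(CLUSTER.TEST.Q1) CLUSQMGR

* On the full: define a second test queue for the put test from the partial
DEFINE QLOCAL(CLUSTER.TEST.Q2) CLUSTER(yourClusterName)
```

If the DISPLAY on the full returns the queue with the partial's name in CLUSQMGR, updates from the partial are reaching the full. For the second test, run amqsput against CLUSTER.TEST.Q2 on the partial and then check the queue's depth on the full. Delete both test queues afterwards.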
You could also verify that the manual CLUSSDR on the partial is 100% correct and points to one of the fulls, that the manual CLUSRCVR on the same partial is 100% correct, and that the CLUSRCVR on the full is 100% correct (verify those IP addresses!), then issue the following on the partial:
Code: |
REFRESH CLUSTER(yourClusterName) REPOS(YES)
|
That REPOS(YES) part is important. It will force the partial to make a 100% fresh start in the cluster, which it will do if all the defs are correct.
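The three definitions named above can be displayed on each side and compared by eye before any refresh is issued. A sketch, assuming hypothetical channel names TO.FULL1 (the full's CLUSRCVR and the partial's matching manual CLUSSDR) and TO.PARTIAL1 (the partial's CLUSRCVR):

```
* On the partial: the manual CLUSSDR should carry the full's address,
* and the CLUSRCVR should carry the partial's own address
DISPLAY CHANNEL(TO.FULL1) CHLTYPE CONNAME CLUSTER
DISPLAY CHANNEL(TO.PARTIAL1) CHLTYPE CONNAME CLUSTER

* On the full: the CLUSRCVR should carry the full's own address
DISPLAY CHANNEL(TO.FULL1) CHLTYPE CONNAME CLUSTER
```

The CONNAME on the partial's CLUSSDR should match the CONNAME on the full's CLUSRCVR, and all three definitions should name the same cluster.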
But before you take this battle axe approach, try those tests I mentioned first. _________________ Peter Potkay
Keep Calm and MQ On |
|
mqdev |
Posted: Fri Feb 11, 2005 8:48 am Post subject: Reply to Peter |
|
|
Centurion
Joined: 21 Jan 2003 Posts: 136
|
Peter,
First off thanks for taking time to reply to my problem.
Now as for your questions -
"MQ Maintenance" was bouncing the Full Repo (FR) QMs (stop and start them). Actually the maintenance had other steps in it, but since we saw issues after this first step we didn't get any further.
The Full Repos for all practical purposes "seem" to be OK - we have multiple member QMs (remember we have about 1200 of them!) talking to them and resolving cluster queues successfully. The only cause for concern is that we seem to be hitting a lot more 2189 errors than usual since bouncing the FRs, hence this query.
Also, I just wanted to know if anyone has faced the situation of the cluster SDR and RCVR channels between a member QM and an FR being in RUNNING status while the member is still unable to resolve cluster objects. If yes, how did you solve it? (I solved it by just deleting the member QM and recreating it - a bit drastic, but it worked...)
Thanks
-mqdev |
|
PeterPotkay |
Posted: Fri Feb 11, 2005 1:53 pm Post subject: |
|
|
 Poobah
Joined: 15 May 2001 Posts: 7722
|
Recreating the QM is extreme.
Did you look in the error logs on the QMs in question? Usually, but not always, "cluster" problems are really just basic MQ channel problems. If channels are having problems, the apps that rely on them fail. In the case of cluster channels, that app is the cluster.
PeterPotkay wrote: |
You could also verify that the manual CLUSSDR on the partial is 100% correct and points to one of the fulls, that the manual CLUSRCVR on the same partial is 100% correct, and that the CLUSRCVR on the full is 100% correct (verify those IP addresses!), then issue the following on the partial:
Code: |
REFRESH CLUSTER(yourClusterName) REPOS(YES)
|
That REPOS(YES) part is important. It will force the partial to make a 100% fresh start in the cluster, which it will do if all the defs are correct.
But before you take this battle axe approach, try those tests I mentioned first. |
|
mqdev |
Posted: Fri Feb 18, 2005 8:41 am Post subject: Resolved!! |
|
|
Centurion
Joined: 21 Jan 2003 Posts: 136
|
Hello,
An update on the problem:
After turning on IPTrace and MQ Trace on both the partial and the full repo QM, we found that the data packets sent by the FR weren't making it to the partial because of a network problem. Once that was fixed, everything worked well.
Thanks all for your time (esp. Peter).
mqdev |
|
Nigelg |
Posted: Fri Feb 18, 2005 10:59 pm Post subject: |
|
|
Grand Master
Joined: 02 Aug 2004 Posts: 1046
|
Generally, a 2189 error is caused when a message has to be put to a cluster queue for which a subscription request older than 10 seconds already exists. When a message is put to a cluster queue that the qmgr does not know about, a subscription request for the queue is sent to the full repositories. Normally, the reply returns from the repos well within 10 seconds, and the put can proceed. If the reply takes longer than 10 seconds, the qmgr returns 2189 because the cluster queue cannot be found in the cluster, probably because the channel to the repos is down. Subsequent puts to the same queue get the 2189 back much more quickly, until the reply to the subscription request has been received from the full repos. So, if you are getting 2189 on a regular basis, you need to check the comms between the partial and full repos.
Note that issuing REFRESH CLUSTER in this circumstance is probably the least sensible action you can take; the full repos has not replied to a single sub request, so it is not going to reply to a mass of requests, which is what REFRESH CLUSTER does. |
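Checking the comms from the partial's side might start with something like the following in runmqsc. It is a sketch rather than a full procedure, and the attributes requested are just one reasonable selection.

```
* Are the cluster channels running, and when did a message last move?
DISPLAY CHSTATUS(*) STATUS MSGS LSTMSGDA LSTMSGTI

* What does this partial currently know about the cluster's queue managers?
DISPLAY CLUSQMGR(*) QMTYPE STATUS

* Are requests piling up on the cluster transmit queue?
DISPLAY QLOCAL(SYSTEM.CLUSTER.TRANSMIT.QUEUE) CURDEPTH
```

A channel that shows RUNNING while its message count and last-message time never change, together with a growing depth on the transmit queue, would suggest that traffic is not actually flowing even though the channel looks healthy.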
|
mqdev |
Posted: Mon Feb 21, 2005 7:41 am Post subject: 2085 or 2189? |
|
|
Centurion
Joined: 21 Jan 2003 Posts: 136
|
Nigel,
Please help us understand this: I have recreated the problem QM. After this, when I tried to access a cluster queue, it failed with reason code 2085 (MQRC_UNKNOWN_OBJECT_NAME). Our CLUSSDR and CLUSRCVR channels are defined using DNS-resolvable aliases, and we have verified that DNS resolution (hostname -> IP address conversion) is not a problem.
When I changed the CLUSSDR and CLUSRCVR definitions to the hardcoded IP address of the Full Repos and the IP of the QM respectively, and restarted the problem QM, I got 2189 (MQRC_CLUSTER_RESOLUTION_ERROR).
In the initial scenario, the QM doesn't even recognize that the queue being accessed is a cluster object, while it does in the second case (with the Full Repo IP hardcoded into the CLUSSDR and CLUSRCVR channel definitions). Why are we getting different reason codes for something that is essentially the same (imperfect TCP/IP connectivity - we were seeing continuous TCP/IP timeout errors in the problem QM error logs in all cases)?
Thanks
-mqdev |
|
Nigelg |
Posted: Mon Feb 21, 2005 7:49 am Post subject: |
|
|
Grand Master
Joined: 02 Aug 2004 Posts: 1046
|
The difference may be that in the 2085 case the subscription request for info about the cluster queue has been sent to the repos qmgr, but no reply has been received, and in the 2189 case the subscription request is not making it out of the partial repos.
It does not really matter what the return code is; the underlying problem is the same, as you say. |
|
mqdev |
Posted: Mon Feb 21, 2005 8:21 am Post subject: another question |
|
|
Centurion
Joined: 21 Jan 2003 Posts: 136
|
|