Sudden Cluster resolution errors in a healthy cluster
mqdev
Posted: Thu Feb 10, 2005 11:31 am    Post subject: Sudden Cluster resolution errors in a healthy cluster

Centurion

Joined: 21 Jan 2003
Posts: 136

Hello Folks,
We have a production cluster consisting of 1200+ QMgrs. The cluster ran without any problems until we bounced the Full Repo QMs as part of MQ maintenance. All hell broke loose then, with multiple QMs unable to resolve cluster queues and throwing 2189 errors (the channels to and from the Full Repository QMs were running, though some had to be started manually). Here are the symptoms of the problem:

An app running off a member QM throws error 2189 for a cluster queue.
Issuing REFRESH CLUSTER(name) on the member QM does not start the channels to the Full Repos.
We manually start the channels to and from the Full Repos. They start and are in RUNNING status.
We issue REFRESH CLUSTER on the member QM again.
Although the channels between the Full Repo and the member QM are RUNNING, the cluster queues are still unresolved and the app gets a 2189 error.
When we do a "dis ql(*) CURDEPTH" on the member QM, we see messages sitting on the cluster transmit queue, and each REFRESH CLUSTER command adds about 10 more messages to it. The amqrrmfa (Repository Manager) process on the member QM is running.
We are running MQv5.3CSD05 on AIX 5.2.
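
As a rough sketch, the queue-depth symptom above can be narrowed down in runmqsc on the member QM by looking only at the two cluster system queues rather than every local queue. Nothing here is specific to this cluster; the queue names are the standard system objects.

```
* Messages waiting to leave this QM for other cluster members.
* A depth that keeps growing suggests they are not being delivered.
DISPLAY QLOCAL(SYSTEM.CLUSTER.TRANSMIT.QUEUE) CURDEPTH

* Input queue of the repository manager. IPPROCS of 0 would suggest
* that amqrrmfa does not have the queue open.
DISPLAY QLOCAL(SYSTEM.CLUSTER.COMMAND.QUEUE) CURDEPTH IPPROCS
```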

Does anyone have any idea what the problem might be?

Thanks
mqdev
PeterPotkay
Posted: Thu Feb 10, 2005 3:02 pm

Poobah

Joined: 15 May 2001
Posts: 7716

The channels from the partials can't communicate to the fulls.

Exactly what did you do on the fulls as part of "MQ maintenance"?

As a test, create a new local queue on one of the problem partials, cluster it, and look on one of the fulls to see if the queue shows up (use the command line or MO71; do not use MQExplorer).

If that works, create a queue on the full, cluster it, and then do an amqsput to that queue from one of the partials. Does the message make it?
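
The two tests above might look roughly like this in runmqsc. TEST.FROM.PARTIAL, TEST.FROM.FULL and MYCLUSTER are placeholder names for illustration, not objects from this thread.

```
* On the problem partial: define a clustered local queue
DEFINE QLOCAL(TEST.FROM.PARTIAL) CLUSTER(MYCLUSTER)

* On one of the fulls: the new queue should be listed, hosted by the partial
DISPLAY QCLUSTER(TEST.FROM.PARTIAL) CLUSQMGR

* On the full: define a clustered queue for the reverse test
DEFINE QLOCAL(TEST.FROM.FULL) CLUSTER(MYCLUSTER)
```

For the second test, running the amqsput sample on the partial with TEST.FROM.FULL as the queue name and the partial's QM name as the second argument puts test messages to the queue hosted on the full; the queue depth on the full shows whether they arrived.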

You could also verify that the manual CLUSSDR on the partial is 100% correct and points to one of the fulls, that the manual CLUSRCVR on the same partial is 100% correct, and that the CLUSRCVR on the full is 100% correct (verify those IP addresses!), then issue the following on the partial:
Code:

REFRESH CLUSTER(yourClusterName) REPOS(YES)


That REPOS(YES) part is important. It will force the partial to make a 100% fresh start in the cluster, which it will do if all the defs are correct.
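
Before running the refresh, the definitions could be checked with something like the following. The channel names TO.FULLREPO1 and TO.PARTIAL1 are hypothetical; substitute the names used in your cluster.

```
* On the partial: the manually defined cluster-sender.
* CONNAME should be the address and port of one of the fulls.
DISPLAY CHANNEL(TO.FULLREPO1) CONNAME CLUSTER

* On the partial: its own cluster-receiver.
* CONNAME should be this QM's own address as other members reach it.
DISPLAY CHANNEL(TO.PARTIAL1) CONNAME CLUSTER

* On the full: its cluster-receiver.
* CONNAME should match what the partial's cluster-sender points to.
DISPLAY CHANNEL(TO.FULLREPO1) CONNAME CLUSTER
```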

But before you take this battle axe approach, try those tests I mentioned first.
_________________
Peter Potkay
Keep Calm and MQ On
mqdev
Posted: Fri Feb 11, 2005 8:48 am    Post subject: Reply to Peter

Centurion

Joined: 21 Jan 2003
Posts: 136

Peter,
First off, thanks for taking the time to reply to my problem.

Now as for your questions -

"MQ Maintenance" was bouncing the Full Repo (FR) QMs (stopping and starting them). The maintenance actually had other steps, but since we saw issues after this first step we didn't get any further.
For all practical purposes the Full Repos "seem" to be OK: multiple member QMs (remember, we have about 1200 of them!) are talking to them and resolving cluster queues successfully. The only cause for concern is that, since bouncing the FRs, we are hitting a lot more 2189 errors than usual, hence this query.
I also wanted to know whether anyone has seen the cluster SDR and RCVR channels between a member QM and an FR in RUNNING status while the member is still unable to resolve cluster objects. If yes, how did you solve it? (I solved it by deleting the member QM and recreating it, a bit drastic but it worked...)

Thanks
-mqdev
PeterPotkay
Posted: Fri Feb 11, 2005 1:53 pm

Poobah

Joined: 15 May 2001
Posts: 7716

Recreating the QM is extreme.

Did you look in the error logs on the QMs in question? Usually, but not always, "cluster" problems are really just basic MQ channel problems. If channels are having problems, the apps that rely on them fail. In the case of cluster channels, that app is the cluster.
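
To tell a channel that is merely RUNNING from one that is actually moving messages, a check along these lines may help. TO.FULLREPO1 is a placeholder channel name.

```
* Run on the partial twice, a minute or so apart, and compare the output
DISPLAY CHSTATUS(TO.FULLREPO1) STATUS MSGS BYTSSENT LSTMSGDA LSTMSGTI
```

If STATUS stays RUNNING but MSGS, BYTSSENT and the last-message date and time do not change while messages for that destination sit on the cluster transmit queue, the problem is probably below MQ. On UNIX systems the queue manager error logs are normally found under /var/mqm/qmgrs/<QMgrName>/errors.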


PeterPotkay wrote:

You could also verify that the manual CLUSSDR on the partial is 100% correct and points to one of the fulls, that the manual CLUSRCVR on the same partial is 100% correct, and that the CLUSRCVR on the full is 100% correct (verify those IP addresses!), then issue the following on the partial:
Code:

REFRESH CLUSTER(yourClusterName) REPOS(YES)


That REPOS(YES) part is important. It will force the partial to make a 100% fresh start in the cluster, which it will do if all the defs are correct.

But before you take this battle axe approach, try those tests I mentioned first.

_________________
Peter Potkay
Keep Calm and MQ On
mqdev
Posted: Fri Feb 18, 2005 8:41 am    Post subject: Resolved!!

Centurion

Joined: 21 Jan 2003
Posts: 136

Hello,
An update on the problem:

After turning on IPTrace and MQ Trace on both the partial and the full repo QM, we found that the data packets sent by the FR weren't making it to the partial because of a network problem. Once that was fixed, everything worked well.
Thanks all for your time (esp. Peter).

mqdev
Nigelg
Posted: Fri Feb 18, 2005 10:59 pm

Grand Master

Joined: 02 Aug 2004
Posts: 1046

Generally, a 2189 error is caused when a msg has to be put to a cluster queue for which a subscription older than 10 seconds already exists. When a msg is put to a cluster queue that the qmgr does not know about, a subscription request for the queue is sent to the full repositories. Normally, the reply returns from the repos well within 10 seconds, and the put can proceed. If the reply takes longer than 10 seconds, the qmgr returns 2189 because the cluster queue cannot be found in the cluster, probably because the channel to the repos is down. Subsequent puts to the same queue get the 2189 back much more quickly, until the reply to the sub request has been received from the full repos. So, if you are getting 2189 on a regular basis, you need to check the comms between the partial and full repos.
Note that issuing REFRESH CLUSTER in this circumstance is probably the least sensible action you can take: the full repos has not replied to a single sub request, so it is not going to reply to a mass of requests, which is what REFRESH CLUSTER generates.
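
As a sketch of checking the comms without issuing REFRESH CLUSTER, something like the following could be run on the partial. QUEUE.NAME is a placeholder for the cluster queue the application is trying to open.

```
* Queue managers this partial knows about, including the full
* repositories (QMTYPE of REPOS), and the state of the channels to them
DISPLAY CLUSQMGR(*) QMTYPE STATUS DEFTYPE

* Whether the partial has learned about the cluster queue yet
DISPLAY QCLUSTER(QUEUE.NAME) CLUSQMGR
```

If the second command returns nothing while the channels to the fulls show as RUNNING, the reply to the subscription request has presumably not arrived, which points at the path from the full back to the partial.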
mqdev
Posted: Mon Feb 21, 2005 7:41 am    Post subject: 2085 or 2189?

Centurion

Joined: 21 Jan 2003
Posts: 136

Nigel,
Please help us understand this: I have recreated the problem QM. After this, when I tried to access a cluster queue, it failed with reason code 2085 (MQRC_UNKNOWN_OBJECT_NAME). Our CLUSSDR and CLUSRCVR channels are defined using DNS-resolvable aliases, and we have verified that DNS resolution (hostname -> IP address conversion) is not a problem.

When I changed the CLUSSDR and CLUSRCVR definitions to the hardcoded IP address of the Full Repos and the IP of the QM respectively, and restarted the problem QM, I got 2189 (MQRC_CLUSTER_RESOLUTION_ERROR).

In the initial scenario, the QM doesn't even recognize that the queue being accessed is a cluster object, while it does in the second case (with the Full Repo IP hardcoded into the CLUSSDR and CLUSRCVR channel defs). Why are we getting different reason codes for something that is essentially the same (imperfect TCP/IP connectivity; we were seeing continuous TCP/IP timeout errors in the problem QM error logs in all cases)?

Thanks
-mqdev
Nigelg
Posted: Mon Feb 21, 2005 7:49 am

Grand Master

Joined: 02 Aug 2004
Posts: 1046

The difference may be that in the 2085 case the subscription request for info about the cluster queue has been sent to the repos qmgr, but no reply has been received, and in the 2189 case the subscription request is not making it out of the partial repos.
It does not really matter what the return code is; the underlying problem is the same, as you say.
mqdev
Posted: Mon Feb 21, 2005 8:21 am    Post subject: another question

Centurion

Joined: 21 Jan 2003
Posts: 136

Thanks Nigelg for your time!

Could you look at "Auto CLUSSDR channels not starting up!"
http://www.mqseries.net/phpBB2/viewtopic.php?t=20584
when you get a chance? I would greatly appreciate any help on that!

-mqdev