MQSeries.net :: View topic - 2085 on a queue that was already in the cluster and working

2085 on a queue that was already in the cluster and working

vennela

Posted: Thu Sep 23, 2004 4:11 am Post subject: 2085 on a queue that was already in the cluster and working

Jedi Knight

Joined: 11 Aug 2002
Posts: 4055
Location: Hyderabad, India

The QMGR on the mainframe is the repository

The QMGR on the unix side (HP 11.11 - MQ 5.3 CSD 06) is the partial repository.

The queue is defined and is clustered on the unix QMGR. The PUTting
application is on the mainframe. This queue will be used by the
mainframe application to put messages and the messages flow to the UNIX
QMGR.

But today we got a 2085 on the mainframe side when the PUTting
application tried to PUT a message. I then issued a REFRESH CLUSTER,
after which the PUTting application ran just fine.

I want to know why the cluster queue was not visible.
Had the queue just been created, a delay before it showed up would have been understandable. But the queue was already in use, and today it could not be seen in the cluster.

PeterPotkay

Posted: Thu Sep 23, 2004 4:52 am Post subject:

Poobah

Joined: 15 May 2001
Posts: 7723

We are seeing the same problem on Windows 5.3 CSD04. Our Gateway QM sometimes throws a message to the DLQ with a 2085, even though the queue is definitely in the cluster and has been used before.

Instead of REFRESHing the cluster, I just run the DLQ Handler, and the message goes just fine the second time.

My guess is that when the gateway QM gets a message for a queue it hasn't used for a while, it needs to resubscribe to the Full Repositories to get the current info. I have 4 overlapping clusters, so it has 8 FRs to subscribe to. If it asks the FRs of the clusters that do not host the queue first, and the FRs of the cluster that does host the queue last, it may take so long that the CLUSSNDR channel assumes this message is not going to find a home, and puts it to the DLQ. A split second later the current queue info finally arrives, and when I replay the message, it goes.

I have upped the Message Retry to 3 from 1 (Message Interval is still 1000). This means the CLUSSNDR will retry the message for 3 seconds. I am hoping that if my guess at what is wrong is true, this will fix it.
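The reasoning behind raising the retry count can be sketched as a toy timing model. This is only an illustration of the guess above, not how the queue manager actually schedules retries: the function name and the parameters `info_arrival_ms`, `retry_count` and `retry_interval_ms` are hypothetical, and the model assumes one initial put attempt at time zero followed by evenly spaced retries.

```python
def survives_retry(info_arrival_ms, retry_count, retry_interval_ms):
    """Return True if any put attempt happens after the queue info arrived.

    Toy model (assumption): the first attempt is made at t=0 and each of the
    retry_count retries follows retry_interval_ms after the previous attempt.
    If every attempt happens before the cluster info arrives, the message is
    treated as dead-lettered.
    """
    attempt_times = [n * retry_interval_ms for n in range(retry_count + 1)]
    return any(t >= info_arrival_ms for t in attempt_times)


# Cluster info that takes 1.5 seconds to arrive:
print(survives_retry(1500, retry_count=1, retry_interval_ms=1000))
print(survives_retry(1500, retry_count=3, retry_interval_ms=1000))
```

Under these assumptions, one retry at 1000 ms is too early for info arriving at 1500 ms, while three retries cover arrivals up to 3000 ms, which matches the "3 seconds" figure in the post.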
_________________
Peter Potkay
Keep Calm and MQ On

vennela

Posted: Thu Sep 23, 2004 1:27 pm Post subject:

Jedi Knight

Joined: 11 Aug 2002
Posts: 4055
Location: Hyderabad, India

Peter:

Do you have any duplicate QMIDs for the cluster queues?
Which DLQ are the messages ending up on: the DLQ on the PUTting side, or the one on the QMGR where the queue is hosted?

I have opened a PMR

PeterPotkay

Posted: Thu Sep 23, 2004 6:42 pm Post subject:

Poobah

Joined: 15 May 2001
Posts: 7723

Quote:

Do you have any duplicate QMIDs for the cluster queues?

Nope. DIS QCLUSTER(MyQueueName) shows only 1 QMID.

Quote:

Which DLQ are the messages ending up on.

QM1 sends a message to QMGateway over regular channels. QMGateway should deliver the message to the queue on one of the clustered queue managers over a cluster channel, but once in a very great while, it goes to the DLQ on QMGateway. I can then immediately replay it.
_________________
Peter Potkay
Keep Calm and MQ On

fitzcaraldo

Posted: Wed Aug 24, 2005 5:13 pm Post subject:

Voyager

Joined: 05 May 2003
Posts: 98

Sorry to revive this after so long but I am experiencing the same problem.

I am testing two gateway machines that are also the FRs for a cluster. On another machine in the cluster I start 3000 processes that attempt to send to 3000 different target queues outside the cluster. All QRemote definitions of these target queues are exposed in the cluster on BOTH gateway QMs.

Sometimes under heavy startup load (when the sender's partial repository is empty), I get intermittent 2085 and 2189 errors on the sending machine, and I'm trying to understand why. (All definitions do exist, and resending the same message works fine.)
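Since resending works, one workaround is for the sending application to replay the put itself when it sees one of these two reason codes. The following is a minimal sketch under stated assumptions: `put` is a hypothetical callable standing in for whatever the application uses to put a message, and `PutError` is a hypothetical exception carrying the MQ reason code. Only the two constants are real MQ reason codes (2085 is MQRC_UNKNOWN_OBJECT_NAME, 2189 is MQRC_CLUSTER_RESOLUTION_ERROR).

```python
import time

MQRC_UNKNOWN_OBJECT_NAME = 2085
MQRC_CLUSTER_RESOLUTION_ERROR = 2189

# Reason codes this thread reports as going away on a resend.
REPLAYABLE = {MQRC_UNKNOWN_OBJECT_NAME, MQRC_CLUSTER_RESOLUTION_ERROR}


class PutError(Exception):
    """Hypothetical exception type carrying an MQ reason code."""

    def __init__(self, reason):
        super().__init__(f"put failed with reason {reason}")
        self.reason = reason


def put_with_replay(put, message, attempts=3, delay_s=1.0, sleep=time.sleep):
    """Call put(message), replaying on 2085/2189 up to `attempts` times.

    Any other reason code, or a replayable one on the last attempt,
    is re-raised to the caller unchanged.
    """
    for attempt in range(1, attempts + 1):
        try:
            return put(message)
        except PutError as err:
            if err.reason not in REPLAYABLE or attempt == attempts:
                raise
            sleep(delay_s)  # give the cluster info time to arrive
```

The attempt count and delay are arbitrary illustration values. This hides the symptom rather than explaining it, so it is no substitute for finding out why resolution is slow.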

When the sending machine needs to resolve a target it puts a message on the SYSTEM.CLUSTER.TRANSMIT.QUEUE and at some stage receives a reply.

The sending machine is under enormous strain (due to the 3000 sending processes) whereas the gateways/repos machines are cruising.

Can anyone explain the mechanics of these errors?

Is the 2189 the result of not processing the reply from the REPOS within a time limit? Can this be adjusted? Did the request even get there?
What about the 2085? Any ideas?

(PS I've seen the SYSTEM.CLUSTER.TRANSMIT.QUEUE get to more than 50000 messages deep!)

Regards.

PeterPotkay

Posted: Wed Aug 24, 2005 6:20 pm Post subject:

Poobah

Joined: 15 May 2001
Posts: 7723

What does the SYSTEM.CLUSTER.COMMAND.QUEUE look like on the machine in the cluster under strain? What about the SYSTEM.CLUSTER.TRANSMIT.QUEUE on each of the FRs (which might have cluster info trying to get back to the PR)?

If that QM is asked to put to a queue it does not yet know about, it needs to subscribe to a FR to get that info. That takes a little time and a little CPU. If there is no CPU left over to handle the response back from the FR in a timely fashion, then the strained PR is left wondering if anyone will ever tell it about this mystery queue, and so gives up on the message. By the time you replay it, the subscription has been processed, so the QM knows about the queue the second time around.

This is just pure speculation.

We are at 5.3.0.8 now, and the problem seems to have gone away, although it did still occasionally appear even after the upgrade to CSD08. Our machine throwing to the DLQ also happened to be a machine under extreme stress (dinky DEV box, Hub QM, WB-IMB QM, Config Manager).
_________________
Peter Potkay
Keep Calm and MQ On

fitzcaraldo

Posted: Wed Aug 24, 2005 6:49 pm Post subject:

Voyager

Joined: 05 May 2003
Posts: 98

Thanks Peter.
This highlights a gap in my knowledge regarding the mechanics of cluster name resolution - particularly the role of the SYSTEM.CLUSTER.COMMAND.QUEUE.

When the QM subscribes to the FR presumably this is done via the SYSTEM.CLUSTER.TRANSMIT.QUEUE? Does the response come to the SYSTEM.CLUSTER.COMMAND.QUEUE?

On the sending machine (the one in the cluster that is under so much strain) the SYSTEM.CLUSTER.TRANSMIT.QUEUE gets to depths like 80000+. I will try again and look at the COMMAND queue as well. And on the FR as well.

I can understand the circumstances where I get a 2189 - when the sending queue manager gives up - but a 2085? A 2085 seems much more definitive, almost as if it has been told by something that the target does not exist as opposed to just timed out.

Nothing in the logs to indicate that anything went wrong.

BTW I am on version 5.3.0.9 (CSD09).

PeterPotkay

Posted: Thu Aug 25, 2005 8:20 am Post subject:

Poobah

Joined: 15 May 2001
Posts: 7723

Yes, I am 99.99% sure that when the FR sends back info to the PR, the FR puts it into its own cluster XMITQ, and it gets shipped to the cluster command queue of the PR, where the PR's cluster Repository Manager picks it up, processes it, and updates the cluster repository queue. I highly doubt the FR is sending directly to the PR's repository queues.
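The flow described above can be modelled in a few lines to show why a busy PR might fail a put and then succeed on replay. This is a toy sketch of the speculation in this thread, not the real repository manager: the class and method names are hypothetical, and returning 2085 for an unknown queue is an assumption taken from the symptoms reported here.

```python
from collections import deque


class PartialRepository:
    """Toy model of a PR that learns about queues via its command queue."""

    def __init__(self):
        # Stands in for SYSTEM.CLUSTER.COMMAND.QUEUE on the PR.
        self.command_queue = deque()
        # Stands in for the PR's local repository of known cluster queues.
        self.known_queues = set()

    def receive_update(self, queue_name):
        """An update from the FR has arrived but is not yet processed."""
        self.command_queue.append(queue_name)

    def run_repository_manager(self):
        """Drain the command queue into the local repository."""
        while self.command_queue:
            self.known_queues.add(self.command_queue.popleft())

    def resolve(self, queue_name):
        """Return 0 if the queue is known, else 2085 (assumed failure code)."""
        return 0 if queue_name in self.known_queues else 2085


pr = PartialRepository()
pr.receive_update("MyQueueName")
first = pr.resolve("MyQueueName")   # update delivered, not yet processed
pr.run_repository_manager()         # the PR finally gets CPU time
second = pr.resolve("MyQueueName")  # replay after processing
print(first, second)
```

In this model the update sitting unprocessed on the command queue is enough to make the first resolution fail, which is why checking the depth of that queue on the strained machine is a reasonable next step.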

Yeah, your comment about the 2085 being a different "type" of error than 2189 makes sense. I dunno - the magic and mystery of clusters I guess!
_________________
Peter Potkay
Keep Calm and MQ On
