2035s for 150 seconds??? |
PeterPotkay |
Posted: Mon Aug 29, 2005 6:56 am Post subject: 2035s for 150 seconds??? |
|
|
 Poobah
Joined: 15 May 2001 Posts: 7722
|
MQ 5.3.0.8
Windows 2000 SP4
Microsoft Stretch Cluster
QMA runs on Node1 primarily. No problems. We swing it over to Node2, and it runs there no problem. We reboot Node1, and swing QMA back to its primary server, Node1. For 150 seconds, applications get 2035s! Both my QM error logs and their application logs agree on this point. And then the errors go away on their own! What the?!?!?!?
Any ideas? No other errors in the QM logs besides the 2035s on the connect call. No FDCs. Nothing in the system level MQ error log. Nothing odd in the Windows Event Viewer. This is production.
Grasping at straws, but I wonder if we tried to move QMA back to Node1 too fast after the reboot, and maybe the server hadn't had its morning coffee yet and was not fully functional? It was about 5 minutes after the reboot that the 2035 errors just stopped on their own.
QMB runs primarily on Node2 in this cluster pair. As we moved it back and forth during these reboots, it never had a problem with 2035s on either node. Only QMA for the first 150 seconds after it returned to Node1. Weird.
_________________
Peter Potkay
Keep Calm and MQ On
bower5932 |
Posted: Mon Aug 29, 2005 7:39 am Post subject: |
|
|
 Jedi Knight
Joined: 27 Aug 2001 Posts: 3023 Location: Dallas, TX, USA
|
I'd be suspicious of something in the network not being quite up and having an indirect effect on WMQ. You might be able to take a trace and see something in it that sheds some light.
wschutz |
Posted: Mon Aug 29, 2005 7:42 am Post subject: |
|
|
 Jedi Knight
Joined: 02 Jun 2005 Posts: 3316 Location: IBM (retired)
|
Peter... are you able to do a quick dspmqaut during this 150-second interval? And is amqzfuma running?
_________________
-wayne
kingsley |
Posted: Mon Aug 29, 2005 8:31 am Post subject: |
|
|
Disciple
Joined: 30 Sep 2001 Posts: 175 Location: Hursley
|
It takes a while before the last process comes up. I'd say your app didn't wait until the status is Online in Cluster Administrator.
PeterPotkay |
Posted: Mon Aug 29, 2005 11:45 am Post subject: |
|
|
 Poobah
Joined: 15 May 2001 Posts: 7722
|
Yes, the app tries to reconnect 5 times 1 second apart, 5 times 10 seconds apart, then once a minute after that. It does not wait for us to say MQ is up.
But note that the app connects successfully within seconds of the QM saying it is started (message in AMQERR01.LOG) when that QM was moved to Node2. And the other QM had no connection issues on either node. I have to say that if the QM wrote the "I am Started" message to the log, and it is online, apps should be able to connect. Immediately, without waiting another 2.5 minutes (that is an eternity to keep getting 2035s). I looked at the app's logs, and see the 2059s while the QM is coming up, and then see the 2035s for 2.5 minutes.
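To put a number on how many connect attempts that retry schedule produces during a 2.5 minute window, here is a minimal sketch. It is not the application's actual code: the function names are made up for illustration, and it assumes each connect attempt fails instantly and that the window starts when the first wait begins.

```python
from itertools import chain, islice, repeat


def retry_delays():
    """Yield the wait in seconds before each reconnect attempt:
    5 waits of 1 s, 5 waits of 10 s, then 60 s forever."""
    return chain(repeat(1, 5), repeat(10, 5), repeat(60))


def attempts_within(window_seconds):
    """Count the reconnect attempts that fire inside the window.

    Assumption (not from the app): each attempt takes no time itself,
    so elapsed time is just the sum of the waits.
    """
    elapsed = 0
    count = 0
    for delay in retry_delays():
        elapsed += delay
        if elapsed > window_seconds:
            break
        count += 1
    return count


# First twelve waits of the schedule, then the attempt count for 150 s.
print(list(islice(retry_delays(), 12)))
print(attempts_within(150))
```

Under those assumptions the app makes eleven attempts inside the window, with the last one at about 115 seconds, so once the schedule drops to one attempt per minute the length of the 2035 window can only be read off the app's log to within a minute or so.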
We did not do a dspmqaut while this was happening, as we only learned about it after the fact. I wonder what that would throw.
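For the next failover, something like the sketch below could grab that evidence automatically instead of relying on someone typing fast enough. It is only a sketch under assumptions: the principal name in the usage comment is hypothetical, the dspmqaut flags (-m, -t qmgr, -p) are as I understand the documented syntax, and the amqzfuma check uses the Windows tasklist image-name filter.

```python
import subprocess
import time
from datetime import datetime


def dspmqaut_cmd(qmgr, principal):
    """Build a dspmqaut call showing a principal's authorities
    on the queue manager object itself."""
    return ["dspmqaut", "-m", qmgr, "-t", "qmgr", "-p", principal]


def tasklist_cmd(image="amqzfuma.exe"):
    """Build a Windows tasklist call filtered to one image name."""
    return ["tasklist", "/FI", f"IMAGENAME eq {image}"]


def capture(qmgr, principal, interval=5, duration=180, run=subprocess.run):
    """Run both commands every `interval` seconds for `duration` seconds
    and return (timestamp, command, return code, stdout) tuples."""
    samples = []
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        for cmd in (dspmqaut_cmd(qmgr, principal), tasklist_cmd()):
            result = run(cmd, capture_output=True, text=True)
            samples.append(
                (datetime.now().isoformat(), cmd[0],
                 result.returncode, result.stdout.strip())
            )
        time.sleep(interval)
    return samples


# Example (hypothetical principal; start it right after the QM reports started):
#   for sample in capture("QMA", "appuser"):
#       print(sample)
```

Lining the timestamps up against the 2035s in AMQERR01.LOG should show whether the authority check itself was failing, or whether amqzfuma simply was not there yet.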
Our Windows SysAdmin said it is entirely possible that we moved the QM over to the server so close to the reboot that some other O/S services had not yet started, and were thus unable to serve MQ's request to validate the domain ID.
The next time we reboot that server, we are going to wait 5 minutes before swinging the QM back, to see if that makes a diff.
Sure would like to know what is really going on.
_________________
Peter Potkay
Keep Calm and MQ On
hopsala |
Posted: Mon Aug 29, 2005 12:21 pm Post subject: |
|
|
 Guardian
Joined: 24 Sep 2004 Posts: 960
|
I'd go with wschutz's suggestion: possibly the authorization service process (amqzfuma) hasn't come up yet (though it would be silly for the QM to allow MQCONNs before it comes up), or with your admin's suggestion... I think the fact that it happened on Node1 and not Node2 is strictly coincidental.
Another option worth checking: in the past, and possibly still today, clustering products inhibited access to "swung" (to use Peter's term) resources by using security checks. That is, if you switched a cluster resource to comp2, then on comp1 the cluster would restrict access to it using normal security authorizations.
I don't think this happens in today's products, since it's not a very good way of handling cluster resources, but I'm not familiar with exactly how Microsoft clusters work... It is possible that the Microsoft cluster simply didn't have time to renew authorizations to the cluster resource (MQ, or possibly some MQ file).