9.4.0.0 Short/Long Retry Bug?
Jeff.VT
Posted: Mon Aug 19, 2024 1:44 pm    Post subject: 9.4.0.0 Short/Long Retry Bug?

Acolyte

Joined: 02 Mar 2017
Posts: 71

There's probably going to be a lot of "but why are you doing that" in this post, but the answer is "because that's the way we do it".

3 Queue Managers on the same 2-node Windows Failover Cluster.

QM1 - The 'gateway' / hub
- Is an MQ Cluster Full Repository
- QM2 and QM3 are not part of that Cluster

QM2
- QM2-SND1 - Connects to QM1 via Sender Channel for production traffic
- QM2-RCV1 - Connected to by QM1 for production traffic
- QM2-SND2 - Connects to QM1 via Sender Channel for 'Administrative' traffic
- also connects to an MQ Cluster at a third party, if it matters

QM3
- QM3-SND1 - Connects to QM1 via Sender Channel for production traffic
- QM3-RCV1 - Connected to by QM1 for production traffic
- QM3-SND2 - Connects to QM1 via Sender Channel for 'Administrative' traffic
- also connects to a different MQ Cluster at a third party, if it matters

We recently upgraded from 9.3.0.17 to 9.4.0.0

When we came back up, QM2-SND1, QM2-SND2, QM3-SND1, and QM3-SND2 all failed to connect to QM1 on their first attempt. Unknown why, just some weird network fluke. All 4 of these channels connect to the same DNS Name & Port.

QM2-RCV1, QM3-RCV1 both connected fine.

Here's what's strange...

Both QM2-SND1 and QM3-SND1 are configured for
- 60 Second Short Retry Interval
- 10 Count Short Retries
- 1200 Second Long Retry Interval

Both QM2-SND2 and QM3-SND2 are configured for
- 60 Second Short Retry Interval
- 10 Count Short Retries
- 300 Second Long Retry Interval

On *BOTH* QM2 and QM3, the SND2 channel retried after 60 seconds, the short retry interval, and connected.

BUT on *BOTH* QM2 and QM3, the SND1 channel waited for the LONG retry interval before its next attempt, and reconnected after 20 minutes.
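To make the expected timing concrete, here is a minimal Python sketch of the retry schedule implied by the settings above. It is a simplified model, not IBM MQ code: the function name `retry_offsets` and the `short_left` parameter are made up for illustration, the long retry count is ignored, and retries are assumed to be purely timer-driven.

```python
from itertools import islice


def retry_offsets(short_count, short_interval, long_interval, short_left=None):
    """Yield seconds after the first failed attempt at which each retry fires.

    short_left is the number of short retries still available when the
    first failure happens (hypothetical parameter; defaults to the full
    configured count).
    """
    remaining = short_count if short_left is None else short_left
    elapsed = 0
    while True:
        if remaining > 0:
            # Still in short retry: wait the short interval.
            elapsed += short_interval
            remaining -= 1
        else:
            # Short retries used up: wait the long interval.
            elapsed += long_interval
        yield elapsed


# Production channels (SND1): 10 x 60 s short retries, 1200 s long retry.
fresh = list(islice(retry_offsets(10, 60, 1200), 12))

# Same settings, but assuming no short retries were left at startup.
exhausted = list(islice(retry_offsets(10, 60, 1200, short_left=0), 2))

print(fresh)
print(exhausted)
```

With a full set of short retries, the first retry in this model lands 60 seconds after the failure and the first long-interval wait only begins after 10 minutes. A first retry at exactly 1200 seconds, as seen on SND1, matches the case where the short retry counter was already at zero.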

All other clustered and non-clustered channels started without a hitch. The ones on QM1 that didn't start right away all used the short retry interval.

This was after a restart, where I had assumed all retry counts would be reset to their defaults.

I've since changed these values to 10 short retries at 10 seconds and a 60 second long retry interval, so it shouldn't cause a major problem in the future. I never thought that a Queue Manager connecting to another Queue Manager hosted on the same server would hit a network problem bad enough to reach long retries. That was certainly a miss by me, but it still seems weird that it got into long retries at all.

Is this expected behavior? Or is this a bug?

Edit: I was looking into the past behavior, and it seems like QM2 and QM3 often have trouble connecting to QM1 after a failover, and they go into Short Retries to connect. That was on 9.3.0.17.
gbaddeley
Posted: Mon Aug 19, 2024 4:35 pm

Jedi Knight

Joined: 25 Mar 2003
Posts: 2538
Location: Melbourne, Australia

Every time a channel start retry occurs, there should be entries in the qmgr's error log, giving the reason why the sender channel couldn't start. What was being logged for the 10 short retries, before it eventually started running?
_________________
Glenn
Jeff.VT
Posted: Wed Aug 21, 2024 7:43 am

Acolyte

Joined: 02 Mar 2017
Posts: 71

The error logs are where I found this information.

For the Production channel, the first attempt was immediately after the queue manager came up. And the second attempt was 20 minutes (long retry) later.

For the Administrative channel, the first attempt was also immediately after the queue manager came up, but its second attempt was after 1 minute (short retry).

I was looking up previous failovers, and it seems like it always has trouble reconnecting right when it comes up - probably just some lag in re-assigning the IP or something.

But before we upgraded to 9.4.0.0, all channels always used the short retry to reconnect and were reconnected after 60 seconds.

I would just write it off as a fluke, but *BOTH* queue managers QM2 and QM3 connected using short retry on the Admin data channel, and the long retry on the Production data channel.
bruce2359
Posted: Wed Aug 21, 2024 11:20 am

Poobah

Joined: 05 Jan 2008
Posts: 9469
Location: US: west coast, almost. Otherwise, enroute.

You were asked what you discovered in the error log. Please copy the relevant error log data and paste it here - include AMQnnnn error messages and error message text.
_________________
I like deadlines. I like to wave as they pass by.
ב''ה
Lex Orandi, Lex Credendi, Lex Vivendi. As we Worship, So we Believe, So we Live.
gbaddeley
Posted: Wed Aug 21, 2024 3:47 pm

Jedi Knight

Joined: 25 Mar 2003
Posts: 2538
Location: Melbourne, Australia

Was the production channel already in its long retry state?
ie. Look at long retries left and short retries left in the channel status.
_________________
Glenn
fjb_saper
Posted: Thu Aug 22, 2024 10:52 am

Grand High Poobah

Joined: 18 Nov 2003
Posts: 20756
Location: LI,NY

Are the sender channels set up with the VIP of the MSCS cluster in the local address (LOCLADDR)?
_________________
MQ & Broker admin
Jeff.VT
Posted: Thu Aug 22, 2024 12:45 pm

Acolyte

Joined: 02 Mar 2017
Posts: 71

Quote:
You were asked what you discovered in the error log. Please copy the relevant error log data and paste it here - include AMQnnnn error messages and error message text.


QM2 Logs:
02:38:45 - AMQ9542W: Queue manager is ending.
02:38:55 - First Event starting up
02:39:02 - AMQ9002I: Channel 'QM2-SND1' is starting.
02:39:03 - AMQ9002I: Channel 'QM2-SND2' is starting.
02:39:08 - AMQ9202E: Remote host not available, retry later. (QM2.SND1)
02:39:08 - AMQ9202E: Remote host not available, retry later. (QM2.SND2)

02:40:03 - AMQ9002I: Channel 'QM2-SND2' is starting.
02:40:03 - AMQ9299I: Channel 'QM2-SND2' has started.

02:59:02 - AMQ9002I: Channel 'QM2-SND1' is starting.
02:59:02 - AMQ9299I: Channel 'QM2-SND1' has started.

I don't see anywhere that the errors say if they're in short or long retries.
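Although the messages do not name the retry state, the gap between consecutive AMQ9002I entries for a channel gives it away. A small Python sketch, assuming the condensed `HH:MM:SS - AMQnnnn: text` layout used in this post (real AMQERR01.LOG entries are laid out differently, and `start_gaps` is just an illustrative helper name):

```python
from datetime import datetime

# The AMQ9002I lines from the QM2 excerpt above, in the condensed format of this post.
LOG = """\
02:39:02 - AMQ9002I: Channel 'QM2-SND1' is starting.
02:39:03 - AMQ9002I: Channel 'QM2-SND2' is starting.
02:40:03 - AMQ9002I: Channel 'QM2-SND2' is starting.
02:59:02 - AMQ9002I: Channel 'QM2-SND1' is starting.
"""


def start_gaps(log_text):
    """Return {channel: [seconds between consecutive AMQ9002I starts]}."""
    starts = {}
    for line in log_text.splitlines():
        if "AMQ9002I" not in line:
            continue
        stamp, _, rest = line.partition(" - ")
        channel = rest.split("'")[1]  # text between the first pair of quotes
        starts.setdefault(channel, []).append(datetime.strptime(stamp, "%H:%M:%S"))
    return {
        ch: [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]
        for ch, times in starts.items()
    }


gaps = start_gaps(LOG)
print(gaps)
```

For this excerpt the gap is 60 seconds for QM2-SND2 (the short retry interval) and 1200 seconds for QM2-SND1 (the long retry interval). The sketch assumes all timestamps fall on the same day.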

Quote:
Was the production channel already in its long retry state?
ie. Look at long retries left and short retries left in the channel status.


It's not now. It says it has 10 short retries remaining. But I'd expect that anyway since it connected successfully.

Quote:
Are the sender channels set up with the VIP of the MSCS cluster in the local address (LOCLADDR)?


They're set up to point to VIP Host Names. They resolved fine, so I'd guess that if the IPs were in there instead, it would still have errored. I'm just chalking it up to it being a little slow to move the Host & VIP to the other Node.

We're not going to be using Windows VMs forever - at the moment we are because we haven't moved to a Linux Container just yet after moving to Azure... Azure's weird Load Balanced Failover Cluster VIP thing is certainly not as fast as an on-prem VM Failover Cluster VIP. We've seen that in a few other places now too - namely a Listener tries to listen too quickly, so we have to use the IP rather than the Name to listen on.

It's almost certainly the cause for the initial failure, but I'm less worried about that than I am about it using long retries upon a fresh restart.
gbaddeley
Posted: Thu Aug 22, 2024 5:36 pm

Jedi Knight

Joined: 25 Mar 2003
Posts: 2538
Location: Melbourne, Australia

Setting the IBM MQ Service to Automatic (Delayed Start) may help the situation. It gives some time for other services to stabilize before MQ queue managers start up.
_________________
Glenn
hughson
Posted: Wed Aug 28, 2024 2:39 am

Padawan

Joined: 09 May 2013
Posts: 1959
Location: Bay of Plenty, New Zealand

I don't know if it is relevant, but remember that channel retry counts are only set back to their initial defined values once a channel has successfully connected to the partner AND successfully put a message across the channel. If there has been no message traffic after a reconnection and then you have another network outage, the channel will continue retrying from where it left off last time.

This could explain what you are seeing, but I don't know enough about your setup to know whether or not that is what has happened.
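The behaviour described above can be sketched as a toy state machine in Python. This is only an illustration of the rule as stated in this post, not IBM MQ's implementation: the class and method names are invented, the long retry count is left out, and whether the counters survive a queue manager restart is not modelled.

```python
class RetryState:
    """Toy model of a sender channel's retry counters (not IBM MQ code)."""

    def __init__(self, short_count, short_interval, long_interval):
        self.short_count = short_count
        self.short_interval = short_interval
        self.long_interval = long_interval
        self.short_left = short_count

    def on_connect_failure(self):
        """Return the wait in seconds before the next attempt."""
        if self.short_left > 0:
            self.short_left -= 1
            return self.short_interval
        return self.long_interval

    def on_connected(self):
        """Connecting alone does not restore the counters in this model."""

    def on_message_transferred(self):
        """A message crossing the channel restores the configured count."""
        self.short_left = self.short_count


# Production channel settings from the first post: 10 x 60 s, then 1200 s.
chan = RetryState(short_count=10, short_interval=60, long_interval=1200)

waits = [chan.on_connect_failure() for _ in range(10)]  # uses up the short retries
chan.on_connected()                                     # reconnects, no traffic flows
wait_without_traffic = chan.on_connect_failure()

chan.on_connected()
chan.on_message_transferred()                           # a message crosses the channel
wait_after_traffic = chan.on_connect_failure()

print(wait_without_traffic, wait_after_traffic)
```

In this model, a channel that reconnects but moves no messages goes straight to the 1200 second wait on its next failure, while one that has moved a message starts again from the 60 second short retry.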

Cheers,
Morag
_________________
Morag Hughson @MoragHughson
IBM MQ Technical Education Specialist
Get your IBM MQ training here!
MQGem Software