9.4.0.0 Short/Long Retry Bug? |
Jeff.VT |
Posted: Mon Aug 19, 2024 1:44 pm Post subject: 9.4.0.0 Short/Long Retry Bug? |
|
|
Acolyte
Joined: 02 Mar 2017 Posts: 71
|
There's probably going to be a lot of "but why are you doing that" in this post, but the answer is "because that's the way we do it".
3 Queue Managers on the same 2-node Windows Failover Cluster.
QM1 - The 'gateway' / hub
- Is an MQ Cluster Full Repository
- QM2 and QM3 are not part of that Cluster
QM2
- QM2-SND1 - Connects to QM1 via Sender Channel for production traffic
- QM2-RCV1 - Connected to by QM1 for production traffic
- QM2-SND2 - Connects to QM1 via Sender Channel for 'Administrative' traffic
- also connects to an MQ Cluster at a third party, if it matters
QM3
- QM3-SND1 - Connects to QM1 via Sender Channel for production traffic
- QM3-RCV1 - Connected to by QM1 for production traffic
- QM3-SND2 - Connects to QM1 via Sender Channel for 'Administrative' traffic
- also connects to a different MQ Cluster at a third party, if it matters
We recently upgraded from 9.3.0.17 to 9.4.0.0
When we came back up, QM2-SND1, QM2-SND2, QM3-SND1, and QM3-SND2 all failed to connect to QM1 on their first attempt. Unknown why, just some weird network fluke. All 4 of these channels connect to the same DNS Name & Port.
QM2-RCV1, QM3-RCV1 both connected fine.
Here's what's strange...
Both QM2-SND1 and QM3-SND1 are configured for
- 60 Second Short Retry Interval
- 10 Count Short Retries
- 1200 Second Long Retry Interval
Both QM2-SND2 and QM3-SND2 are configured for
- 60 Second Short Retry Interval
- 10 Count Short Retries
- 300 Second Long Retry Interval
On *BOTH* QM2 and QM3, the SND2 channel retried after 60 seconds (the short retry interval) and connected.
BUT on *BOTH* QM2 and QM3, the SND1 channel waited for the LONG retry interval and reconnected after 20 minutes.
All other clustered and non-clustered channels started without a hitch, and those on QM1 that didn't all used the short retry interval.
This was after a restart, where I had assumed all retry counts would be reset to their defaults.
I've since changed these values to 10 short retries at 10 seconds and a 60-second long retry, so it shouldn't ever cause a major problem in the future. I just never thought that a Queue Manager connecting to another Queue Manager hosted on the same server would have a network problem bad enough to reach long retries. It was certainly a miss by me, but it still seems weird that it got into long retries at all.
Is this expected behavior? Or is this a bug?
Edit: I was looking into the past behavior, and it seems like QM2 and QM3 often have trouble connecting to QM1 after a failover, and they go into Short Retries to connect. That was on 9.3.0.17. |
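For reference, here is a minimal Python sketch of the retry schedule those settings would normally produce, assuming the counters start fresh and the channel works through all of its short retries before moving on to long retries. The function names are made up for illustration, and the long retries are modelled as unlimited for simplicity.

```python
from itertools import islice


def retry_offsets(short_interval, short_count, long_interval):
    """Yield the seconds after the first failed start at which each retry runs."""
    elapsed = 0
    # Short retries come first: short_count attempts, short_interval apart.
    for _ in range(short_count):
        elapsed += short_interval
        yield elapsed
    # Then long retries (modelled as unlimited here, which is a simplification).
    while True:
        elapsed += long_interval
        yield elapsed


def first_offsets(short_interval, short_count, long_interval, n):
    """Return the first n retry offsets as a list."""
    return list(islice(retry_offsets(short_interval, short_count, long_interval), n))


# Settings taken from the post: SND1 (production), SND2 (administrative),
# and the revised 10 x 10 seconds with a 60-second long retry.
production = first_offsets(60, 10, 1200, 12)
administrative = first_offsets(60, 10, 300, 12)
revised = first_offsets(10, 10, 60, 12)

print("production    :", production)
print("administrative:", administrative)
print("revised       :", revised)
```

Under that assumption, both SND1 and SND2 would make their second attempt 60 seconds after the first failure, which is what SND2 did. A 20-minute gap on SND1 only fits the schedule if its short retries were already used up.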
gbaddeley |
Posted: Mon Aug 19, 2024 4:35 pm Post subject: |
|
|
 Jedi Knight
Joined: 25 Mar 2003 Posts: 2538 Location: Melbourne, Australia
|
Every time a channel start retry occurs, there should be entries in the qmgr's error log, giving the reason why the sender channel couldn't start. What was being logged for the 10 short retries, before it eventually started running? _________________ Glenn |
Jeff.VT |
Posted: Wed Aug 21, 2024 7:43 am Post subject: |
|
|
Acolyte
Joined: 02 Mar 2017 Posts: 71
|
The error logs are where I found this information.
For the Production channel, the first attempt was immediately after the queue manager came up, and the second attempt was 20 minutes (long retry) later.
For the Administrative channel, the first attempt was also immediately after the queue manager came up, but its second attempt was after 1 minute (short retry).
I was looking up previous failovers, and it seems like it always has trouble reconnecting right when it comes up - probably just some lag in re-assigning the IP or something.
But before we upgraded to 9.4.0.0, all channels always used the short retry to reconnect and were reconnected after 60 seconds.
I would just write it off as a fluke, but *BOTH* queue managers QM2 and QM3 connected using short retry on the Admin data channel, and the long retry on the Production data channel. |
bruce2359 |
Posted: Wed Aug 21, 2024 11:20 am Post subject: |
|
|
 Poobah
Joined: 05 Jan 2008 Posts: 9469 Location: US: west coast, almost. Otherwise, enroute.
|
You were asked what you discovered in the error log. Please copy the relevant error log data and paste it here - include AMQnnnn error messages and error message text. _________________ I like deadlines. I like to wave as they pass by.
ב''ה
Lex Orandi, Lex Credendi, Lex Vivendi. As we Worship, So we Believe, So we Live. |
gbaddeley |
Posted: Wed Aug 21, 2024 3:47 pm Post subject: |
|
|
 Jedi Knight
Joined: 25 Mar 2003 Posts: 2538 Location: Melbourne, Australia
|
Was the production channel already in its long retry state?
ie. Look at long retries left and short retries left in the channel status. _________________ Glenn |
fjb_saper |
Posted: Thu Aug 22, 2024 10:52 am Post subject: |
|
|
 Grand High Poobah
Joined: 18 Nov 2003 Posts: 20756 Location: LI,NY
|
Are the sender channels set up with the VIP of the MSCS cluster in the local address? _________________ MQ & Broker admin |
Jeff.VT |
Posted: Thu Aug 22, 2024 12:45 pm Post subject: |
|
|
Acolyte
Joined: 02 Mar 2017 Posts: 71
|
Quote: |
You were asked what you discovered in the error log. Please copy the relevant error log data and paste it here - include AMQnnnn error messages and error message text. |
QM2 Logs:
02:38:45 - AMQ9542W: Queue manager is ending.
02:38:55 - First Event starting up
02:39:02 - AMQ9002I: Channel 'QM2-SND1' is starting.
02:39:03 - AMQ9002I: Channel 'QM2-SND2' is starting.
02:39:08 - AMQ9202E: Remote host not available, retry later. (QM2.SND1)
02:39:08 - AMQ9202E: Remote host not available, retry later. (QM2.SND2)
02:40:03 - AMQ9002I: Channel 'QM2-SND2' is starting.
02:40:03 - AMQ9299I: Channel 'QM2-SND2' has started.
02:59:02 - AMQ9002I: Channel 'QM2-SND1' is starting.
02:59:02 - AMQ9299I: Channel 'QM2-SND1' has started.
I don't see anywhere that the errors say if they're in short or long retries.
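Since the messages don't name the retry phase, one way to infer it is from the gap between consecutive start attempts. Below is a rough Python sketch using only the timestamps from the excerpt above and the intervals from the first post. The helper names and the 5-second tolerance are arbitrary choices for illustration, and it does not handle a log that crosses midnight.

```python
from datetime import datetime

# "Channel is starting" entries copied from the QM2 excerpt above: (time, channel).
START_ATTEMPTS = [
    ("02:39:02", "QM2-SND1"),
    ("02:39:03", "QM2-SND2"),
    ("02:40:03", "QM2-SND2"),
    ("02:59:02", "QM2-SND1"),
]

# Configured retry intervals in seconds, from the first post: (short, long).
INTERVALS = {"QM2-SND1": (60, 1200), "QM2-SND2": (60, 300)}


def seconds_between(first, second):
    """Seconds between two HH:MM:SS stamps on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(second, fmt) - datetime.strptime(first, fmt)
    return int(delta.total_seconds())


def classify_gaps(attempts, intervals, tolerance=5):
    """Label the gap between consecutive start attempts of each channel."""
    last_seen = {}
    results = []
    for stamp, channel in attempts:
        if channel in last_seen:
            gap = seconds_between(last_seen[channel], stamp)
            short, long_ = intervals[channel]
            if abs(gap - short) <= tolerance:
                label = "short"
            elif abs(gap - long_) <= tolerance:
                label = "long"
            else:
                label = "unknown"
            results.append((channel, gap, label))
        last_seen[channel] = stamp
    return results


gaps = classify_gaps(START_ATTEMPTS, INTERVALS)
for channel, gap, label in gaps:
    print(f"{channel}: {gap}s between attempts -> {label} retry interval")
```

For this excerpt it reports a 60-second gap for QM2-SND2 (short) and a 1200-second gap for QM2-SND1 (long).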
Quote: |
Was the production channel already in its long retry state?
ie. Look at long retries left and short retries left in the channel status. |
It's not now. It says it has 10 short retries remaining. But I'd expect that anyway since it connected successfully.
Quote: |
Are the sender channels set up with the VIP of the MSCS cluster in the local address? |
They're set up to point to VIP Host Names. They resolved fine, so I'd guess that if the IPs were in there instead, it would still have errored. I'm just chalking it up to it being a little slow to move the Host & VIP to the other Node.
We're not going to be using Windows VMs forever - at the moment we are because we haven't moved to a Linux Container just yet after moving to Azure... Azure's weird Load Balanced Failover Cluster VIP thing is certainly not as fast as an on-prem VM Failover Cluster VIP. We've seen that in a few other places now too - namely a Listener tries to listen too quickly, so we have to use the IP rather than the Name to listen on.
It's almost certainly the cause for the initial failure, but I'm less worried about that than I am about it using long retries upon a fresh restart. |
gbaddeley |
Posted: Thu Aug 22, 2024 5:36 pm Post subject: |
|
|
 Jedi Knight
Joined: 25 Mar 2003 Posts: 2538 Location: Melbourne, Australia
|
Setting the IBM MQ Service to Automatic (Delayed Start) may help the situation. It gives some time for other services to stabilize before MQ queue managers start up. _________________ Glenn |
hughson |
Posted: Wed Aug 28, 2024 2:39 am Post subject: |
|
|
 Padawan
Joined: 09 May 2013 Posts: 1959 Location: Bay of Plenty, New Zealand
|
I don't know if it is relevant, but remember that channel retry counts are only set back to their initial defined values once a channel has successfully connected to the partner AND successfully put a message across the channel. If there has been no message traffic after a reconnection and you then have another network outage, the channel will continue retrying from where it left off last time.
This could explain what you are seeing, but I don't know enough about your setup to know whether or not that is what has happened.
Cheers,
Morag _________________ Morag Hughson @MoragHughson
IBM MQ Technical Education Specialist
Get your IBM MQ training here!
MQGem Software |
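The reset rule described in the post above can be illustrated with a toy Python model. It is only a sketch of the behaviour as stated in this thread, not MQ's actual implementation. The class and method names are hypothetical, and the scenario (short retries used up in an earlier outage, then an idle reconnect) is an assumption, since the thread does not establish that this is what happened or whether counters carry over a queue manager restart.

```python
class ChannelRetryModel:
    """Toy model of sender-channel retry counters (illustration only)."""

    def __init__(self, short_count, short_interval, long_interval):
        self.short_count = short_count
        self.short_interval = short_interval
        self.long_interval = long_interval
        self.short_left = short_count

    def connect_failed(self):
        """Return the wait in seconds before the next retry."""
        if self.short_left > 0:
            self.short_left -= 1
            return self.short_interval
        # No short retries left: fall through to the long interval.
        return self.long_interval

    def connected(self, message_sent):
        """Counters reset only once a message has crossed the channel."""
        if message_sent:
            self.short_left = self.short_count


# Production channel settings from the first post.
channel = ChannelRetryModel(short_count=10, short_interval=60, long_interval=1200)

# Hypothetical earlier outage: all ten short retries are used up.
for _ in range(10):
    channel.connect_failed()

# The channel reconnects but no message flows, so nothing is reset.
channel.connected(message_sent=False)
wait_after_idle_reconnect = channel.connect_failed()

# Once a message has flowed, the next outage starts with short retries again.
channel.connected(message_sent=True)
wait_after_traffic = channel.connect_failed()

print("wait after idle reconnect:", wait_after_idle_reconnect)
print("wait after message traffic:", wait_after_traffic)
```

In this model the first wait after the idle reconnect is the 1200-second long interval, while after message traffic it drops back to 60 seconds.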