MQSeries.net :: View topic

Jeff.VT · Posted: Mon Aug 19, 2024 4:35 pm Post subject:

There's probably going to be a lot of "but why are you doing that" in this post, but the answer is "because that's the way we do it".

3 Queue Managers on the same 2-node Windows Failover Cluster.

QM1 - The 'gateway' / hub
- Is an MQ Cluster Full Repository
- QM2 and QM3 are not part of that Cluster

QM2
- QM2-SND1 - Connects to QM1 via Sender Channel for production traffic
- QM2-RCV1 - Connected to by QM1 for production traffic
- QM2-SND2 - Connects to QM1 via Sender Channel for 'Administrative' traffic
- also connects to an MQ Cluster at a third party, if it matters

QM3
- QM3-SND1 - Connects to QM1 via Sender Channel for production traffic
- QM3-RCV1 - Connected to by QM1 for production traffic
- QM3-SND2 - Connects to QM1 via Sender Channel for 'Administrative' traffic
- also connects to a different MQ Cluster at a third party, if it matters

We recently upgraded from 9.3.0.17 to 9.4.0.0.

When the queue managers came back up, QM2-SND1, QM2-SND2, QM3-SND1, and QM3-SND2 all failed to connect to QM1 on their first attempt. I don't know why; it looks like some weird network fluke. All 4 of these channels connect to the same DNS name and port.

QM2-RCV1 and QM3-RCV1 both connected fine.

Here's what's strange...

Both QM2-SND1 and QM3-SND1 are configured for
- 60 Second Short Retry Interval
- 10 Count Short Retries
- 1200 Second Long Retry Interval

Both QM2-SND2 and QM3-SND2 are configured for
- 60 Second Short Retry Interval
- 10 Count Short Retries
- 300 Second Long Retry Interval
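For reference, here is a rough sketch of the retry timeline I would have expected from those settings. It is a simplified model, not MQ's actual implementation: it assumes the first attempt fails at t=0, every retry fires exactly one interval after the previous failure, and the full short retry count is used up before any long retry happens. The function name `retry_times` and the `long_count` parameter are made up for illustration; the inputs correspond to the channel's SHORTTMR, SHORTRTY and LONGTMR attributes.

```python
def retry_times(short_interval, short_count, long_interval, long_count=3):
    """Return the elapsed seconds at which each retry attempt would start.

    Simplified model (assumption): the initial attempt fails at t=0, the
    channel makes `short_count` retries spaced `short_interval` seconds
    apart, then switches to retries spaced `long_interval` seconds apart.
    Connection timeouts and attempt duration are ignored.
    """
    times = []
    t = 0
    for _ in range(short_count):
        t += short_interval
        times.append(t)
    for _ in range(long_count):
        t += long_interval
        times.append(t)
    return times


# SND1: 60 s short interval, 10 short retries, 1200 s long interval
snd1 = retry_times(60, 10, 1200)
# SND2: 60 s short interval, 10 short retries, 300 s long interval
snd2 = retry_times(60, 10, 300)

print(snd1[:3])   # first three short retries
print(snd1[9])    # last short retry
print(snd1[10])   # first long retry for SND1
print(snd2[10])   # first long retry for SND2
```

Under this model, both channel types would make short retries every 60 seconds for the first 10 minutes, and only then fall back to the long interval.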

On *BOTH* QM2 and QM3, the SND2 channel retried on the 60-second short retry interval, and both connected.

BUT on *BOTH* QM2 and QM3, the SND1 channel retried on the LONG retry interval, and reconnected after 20 minutes.

All other clustered and non-clustered channels started without a hitch. Those on QM1 that didn't all used the short retry interval.

This was after a restart, where I guess I had assumed all retry counts would be reset to their defaults.

I've since changed these values to 10 short retries at 10 seconds each and a 60-second long retry interval, so it shouldn't ever cause a major problem in the future. I just never thought that a Queue Manager connecting to another Queue Manager hosted on the same server as itself would have a network problem that would ever push it into long retries. That was certainly a miss by me, but it still seems weird that it got into long retries at all.

Is this expected behavior? Or is this a bug?

Edit: I was looking into the past behavior, and it seems like QM2 and QM3 often have trouble connecting to QM1 after a failover, and they go into Short Retries to connect. That was on 9.3.0.17.