Posted: Fri Jun 04, 2004 9:22 am Post subject: Firewall outage, Retry Intervals and heartbeats
We have the following:
Win2k MQ 5.3 CSD6
z/OS MQ 5.3.1
Yesterday we took a hit on the firewall that the sender/receiver channels communicate through.
On Win2K the channels were:
Sender channel Win.to.ZOS was in RETRY
Recvr channel ZOS.to.Win was INACTIVE
On z/OS both channels showed as 'RUN'
The channels were stopped and reset, and we attempted to restart them, but they were 'stuck' in this state for 50 minutes. Then, magically, they started working again. We have identified the cause of the problem (the firewall decided to restart itself), but I am anxious to shorten the outage and get things up and running again, automatically if possible.
All our channels are triggered and we have a disconnect interval of 600 set so we aren't too worried about the inactive channel.
We also have the short retry interval set to 60 seconds and the short retry count set to 10. My understanding was that this would cause MQ to retry a failing channel once a minute for 10 minutes, and then it would 'bubble down' to the long retry (in our case an interval of 1200 seconds and a count of 99999999).
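To make that retry arithmetic concrete, here is a minimal Python sketch of the schedule as I understand it. It is only a model of the settings quoted above, not MQ code: the function name `retry_times` is made up for illustration, and it assumes each attempt fails immediately and the next one is made exactly one interval later.

```python
from itertools import islice

def retry_times(short_interval, short_count, long_interval, long_count):
    """Yield the elapsed seconds at which each retry attempt would be made.

    Hypothetical model: every attempt fails at once and the channel waits
    exactly one interval before trying again.
    """
    elapsed = 0
    for _ in range(short_count):
        elapsed += short_interval
        yield elapsed
    for _ in range(long_count):
        elapsed += long_interval
        yield elapsed

# Values from the post: 60 s x 10 short retries, then 1200 s x 99999999 long retries.
schedule = list(islice(retry_times(60, 10, 1200, 99999999), 12))

for attempt, seconds in enumerate(schedule, start=1):
    print(f"attempt {attempt:2d} at {seconds // 60} min")
```

Under those assumptions the tenth short retry lands at 10 minutes, the first long retry at 30 minutes, and the second long retry at 50 minutes, which lines up with the 50 minutes the channels were stuck.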
The heartbeat interval is set at 300.
I am a little bit confused as to why it took so long for recovery to take place. Is it because the firewall is somehow holding the socket open on one side (the z/OS end), so z/OS doesn't think it has a problem? Short of bouncing the CHIN, is there any other way to force a TCP/IP 'retry'?
Would it help to reduce the long retry interval (1200 seems a little excessive)?
The only thing that seems to make sense is that the long retry kicked in after 10 minutes of the short retry sequence failing to reconnect, and the long retry had to try twice (at 20-minute intervals) before successfully reconnecting. I am not sure where the TCP/IP keepalive fits into all this, if at all, but it seems to me that the keepalive interval could be causing MQ to think it still has a valid connection at the z/OS end.
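For anyone unsure what the keepalive does at the socket level, here is a rough Python sketch. It is generic socket code and not how MQ itself is configured (MQ and the operating system have their own settings for this). The helper name `make_keepalive_socket` and the timer values are made up for illustration, and the three tuning options are platform-dependent, so the sketch skips any that are not available.

```python
import socket

def make_keepalive_socket(idle_secs=60, probe_interval_secs=10, probe_count=5):
    """Create a TCP socket with keepalive probes switched on.

    With keepalive enabled, the TCP stack probes an idle connection and
    reports an error if the peer (or a firewall in between) has silently
    dropped it, instead of leaving the socket looking healthy forever.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Optional per-socket timers; names and availability vary by platform.
    tuning = (
        ("TCP_KEEPIDLE", idle_secs),             # idle time before first probe
        ("TCP_KEEPINTVL", probe_interval_secs),  # gap between probes
        ("TCP_KEEPCNT", probe_count),            # failed probes before giving up
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is None:
            continue
        try:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
        except OSError:
            pass  # option exists in Python but is not supported here
    return sock

sock = make_keepalive_socket()
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
print("keepalive enabled:", keepalive_on)
sock.close()
```

The point is that without probes like these (or with a very long idle timer), the end that never sees a reset has no reason to notice that the connection is dead, which would fit z/OS still showing the channels as running.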
I am open to suggestions and comments, but I would really like to shorten the recovery time if possible.
I assume the 50 minutes 'stuck' time started from after the firewall restarted and not before (i.e. part of the 50 minutes wasn't caused by the firewall not running).
You might want to look at the ADOPTMCA parameter. It fixed similar problems I had with firewalls and channels.