Author |
Message
|
hguapluas |
Posted: Fri Aug 26, 2005 10:01 am Post subject: MQ xmit channel failure from hard NW break... |
|
|
Centurion
Joined: 05 Aug 2004 Posts: 105 Location: San Diego
|
Hi all,
Just wondering if anybody else has run into a situation like the one below. I have hit it a few times now, and the pattern for restoration seems to be the same in each case.
In a large multi-network environment running through two or more firewalls, when there is a firewall shutdown or other hard failure in network connectivity, the xmit channels on one or both ends will not reconnect and resume traffic flow once connections are restored. What is usually required is manual intervention on one or both sides, usually on the side where messages are queuing up. The MQ guy has to bounce the xmit channel several times and usually send a fresh (new) message, and then the channel status goes active and traffic flow resumes. A simple channel start on one or both ends does not resume traffic flow; it always seems to take a series of channel starts on the sender channel.
In all cases, there is an associated complete network outage tied in with this between the sender/receiver. Other channels on either end that are on network paths that were not affected by the outage continue as normal. It is only the channels that were impacted by the hard network break.
The MQ version on the Windows side is 5.3, with CSDs ranging from 06 to 09. The mainframe side runs z/OS; I am not sure which version of MQ they use, but it is an older one. In almost all cases, the problem was with mainframe-to-Windows connectivity. I do not think this has happened on any Windows-to-Windows MQ connections, which would all have been at v5.3.
Some of the outages were caused by hardware failure. Others were caused by a decision to shut down firewall connections suddenly to stop a recent worm from spreading between networks. The worm itself did not impact MQ traffic.
Curious minds want to know your experiences and what you've done to remedy the situation. Is there a more elegant way to restore service short of manually bouncing the sender channel multiple times?!?
Thanks. |
|
|
 |
wschutz |
Posted: Fri Aug 26, 2005 10:09 am Post subject: |
|
|
 Jedi Knight
Joined: 02 Jun 2005 Posts: 3316 Location: IBM (retired)
|
I assume you're saying the sender end of the channel goes into (or remains in) retry? Have the MQ guys looked at the MQ logs on Windows and z/OS (the xxxxCHIN log)? Are there any interesting messages? _________________ -wayne |
|
|
 |
jefflowrey |
Posted: Fri Aug 26, 2005 10:11 am Post subject: |
|
|
Grand Poobah
Joined: 16 Oct 2002 Posts: 19981
|
In all cases that I have ever seen, when a channel has failed to start, an error message has been produced that described why that channel failed to start.
In all cases that I have ever seen, that error message was usually a very good hint if not a complete description of what needed to be done to solve the problem.
In no case that I know of, with one exception, would restarting a channel multiple times cause the channel to start working on its own.
The one exception to this is incomplete recovery from a network failure - and the error message would indicate that the network wasn't up yet. _________________ I am *not* the model of the modern major general. |
|
|
 |
hguapluas |
Posted: Fri Aug 26, 2005 10:35 am Post subject: |
|
|
Centurion
Joined: 05 Aug 2004 Posts: 105 Location: San Diego
|
I was third party to most of this except for one instance where it was my receiver channel on my side. In my logs, no errors were shown and I was told that the sender channels were going into a "bind" state on the other side and would not complete the connection. After they cycled their xmit channel a couple of times, the channel started successfully and all messages backlogged in the queue were transmitted.
Afraid I don't have any more info than that. |
|
|
 |
wschutz |
Posted: Fri Aug 26, 2005 11:01 am Post subject: |
|
|
 Jedi Knight
Joined: 02 Jun 2005 Posts: 3316 Location: IBM (retired)
|
So the senders went into bind and remained there until a manual stop / start of the channel? And are you saying there was nothing in the log on the sender's end (or you don't know)? _________________ -wayne |
|
|
 |
fjb_saper |
Posted: Fri Aug 26, 2005 11:41 am Post subject: |
|
|
 Grand High Poobah
Joined: 18 Nov 2003 Posts: 20756 Location: LI,NY
|
We had a similar problem with MQ 5.3 on Unix and MQ 2.1 on MF.
After a network interruption we would see the channel in retry mode for way longer than the retry period.
The culprit was the MF end of the channel (receiver) which seemed to believe it was still connected.
Don't remember what the logs said.
The way we resolved it was:
1. Shut down the sender channel. (Unix)
2. Shut down (force) the receiver channel. (MF)
3. Start the receiver channel. (MF)
4. Start the sender channel. (Unix)
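For anyone who wants to script that order, here is a minimal sketch. It only builds the MQSC command text for each side, in the order listed above; it does not connect to any queue manager. The channel name `UNIX.TO.MF` is a hypothetical placeholder, and it assumes the sender and receiver definitions share the same channel name.

```python
def restart_sequence(channel):
    """Return (side, MQSC command) pairs in the stop/force/start order above."""
    return [
        ("sender", f"STOP CHANNEL({channel})"),                # Unix side
        ("receiver", f"STOP CHANNEL({channel}) MODE(FORCE)"),  # mainframe side
        ("receiver", f"START CHANNEL({channel})"),             # mainframe side
        ("sender", f"START CHANNEL({channel})"),               # Unix side
    ]


if __name__ == "__main__":
    # UNIX.TO.MF is a made-up channel name, used for illustration only.
    for side, command in restart_sequence("UNIX.TO.MF"):
        print(f"{side:<8}  {command}")
```

Each command would still have to be run on the queue manager that owns that end of the channel (for example through `runmqsc` on the Unix side).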
Enjoy  |
|
|
 |
sradiraju |
Posted: Fri Aug 26, 2005 1:55 pm Post subject: |
|
|
 Apprentice
Joined: 08 Sep 2002 Posts: 34 Location: Chicago,IL
|
Fjb_saper's solution is the right one and will avoid having to restart the channels multiple times to get communication working. However, there is a much more elegant solution.
hguapluas, if you have observed carefully, the error usually occurs when the network breaks between MQ on distributed servers (Windows or UNIX) and the mainframe. The solution is to use the right combination of heartbeat and DISCINT intervals and, most importantly, to set a parameter on the mainframe called AdoptNewMCA = YES. This will enable the receiver to adopt the new incoming MCA from the sender channel after the DISCINT expires.
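As a rough sketch of the interval part only: the snippet below builds the MQSC `ALTER CHANNEL` text that sets the heartbeat (`HBINT`) and disconnect (`DISCINT`) intervals on a sender channel. The channel name `WIN.TO.MF` and the values 60 and 600 are assumptions for illustration, not recommendations from this thread. Where the adopt-MCA setting lives depends on platform and version, so it is left out here.

```python
def alter_sender_intervals(channel, hbint, discint):
    """Build an MQSC command setting heartbeat and disconnect intervals (seconds)."""
    if hbint < 0 or discint < 0:
        raise ValueError("intervals must be non-negative")
    return (
        f"ALTER CHANNEL({channel}) CHLTYPE(SDR) "
        f"HBINT({hbint}) DISCINT({discint})"
    )


if __name__ == "__main__":
    # Hypothetical channel name and illustrative values.
    print(alter_sender_intervals("WIN.TO.MF", 60, 600))
```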
Hope this helps.
Somesh |
|
|
 |
PeterPotkay |
Posted: Fri Aug 26, 2005 2:48 pm Post subject: |
|
|
 Poobah
Joined: 15 May 2001 Posts: 7722
|
sradiraju wrote: |
This will enable the receiver to adopt the new incoming MCA from the sender channel after the DISCINT expires.
|
AdoptMCA and AdoptNewMCA will kick in when required regardless of DISCINT. The RCVR does not have to wait for it to pass. _________________ Peter Potkay
Keep Calm and MQ On |
|
|
 |
hguapluas |
Posted: Fri Aug 26, 2005 2:51 pm Post subject: |
|
|
Centurion
Joined: 05 Aug 2004 Posts: 105 Location: San Diego
|
Thanks for your feedback on this. Fjb_saper's description is about what I remember. I don't have any details on the mainframe side, since that is controlled by a different organization and I have no influence on how they decide to configure their channels. I can only respond on my end with best efforts to restore the connection, unless I am able to get hold of their support staff and work together on it. And they frequently refuse to admit a problem exists on their end since, according to their monitoring tools, the "queue" is connected. They don't always dig down into the problem to find the root cause. Their reliance on old troubleshooting scripts can be a royal pain sometimes. |
|
|
 |
|