hughson
Posted: Tue Mar 05, 2019 7:47 pm Post subject: Re: Where is TCP timeout value set on z/OS
bruce2359 wrote:
zpat wrote:
On a sender channel between two z/OS QMs (QMR1 at v7.1, QMR2 at v8.0), we see a timeout from TCP.
Backing up a bit... please describe the network configuration.
Are these two qmgrs on the same z/OS instance?
Are these z/OS instances in the same physical z box?
Over what type of channel are they communicating? HiperSockets? CTC? Copper cat 5 or 6? What type of adapters?
What z/OS releases are at both ends of the channel?
He did say this:
zpat wrote:
The two QMs are located in different organisations.
bruce2359
Posted: Tue Mar 05, 2019 8:34 pm
Oops. Missed that. So much for my investment in a speed-reading course.
What are the CHINUT settings at both ends?
zpat
Posted: Wed Mar 06, 2019 12:08 am
As mentioned, they are in different organisations.
Ours is z/OS 2.2. The issue is likely with the external network, but we can't prove it.
CHINUT?
fjb_saper
Posted: Wed Mar 06, 2019 3:46 am
zpat wrote:
As mentioned, they are in different organisations.
Ours is z/OS 2.2. The issue is likely with the external network, but we can't prove it.
CHINUT?
I figure he meant Channel INIT, or CHINIT...
bruce2359
Posted: Wed Mar 06, 2019 5:28 am
Yes, CHINIT, the channel initiator address space.
bruce2359
Posted: Wed Mar 06, 2019 12:57 pm
What are the adapters and dispatchers values?
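(For reference, the adapter and dispatcher counts can be checked with MQSC on z/OS. A minimal sketch follows; the values shown are purely illustrative, and changes made with ALTER QMGR are only picked up at the next channel initiator restart:

DISPLAY CHINIT
DISPLAY QMGR CHIADAPS CHIDISPS
ALTER QMGR CHIDISPS(20) CHIADAPS(30)

DISPLAY CHINIT reports how many dispatchers and adapter subtasks are actually running, while DISPLAY QMGR shows the configured CHIDISPS and CHIADAPS values.)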
hughson
Posted: Wed Mar 06, 2019 1:23 pm
bruce2359 wrote:
What are the adapters and dispatchers values?
Are you thinking that the network slowdown is caused by having too few dispatchers?
We've already ruled out a commit slowdown, since NETTIME is seen to increase just before the timeout is seen, so I don't think the number of adapters is at fault.
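(As a sketch of how that breakdown can be seen at the sending end, with a hypothetical channel name, DISPLAY CHSTATUS reports the network, exit and transmission-queue components separately:

DISPLAY CHSTATUS(QMR1.TO.QMR2) NETTIME EXITTIME XQTIME

NETTIME, EXITTIME and XQTIME are each shown as a pair of indicators in microseconds, one short-term and one longer-term, so a NETTIME that climbs while XQTIME and EXITTIME stay flat points at the network rather than at MQ processing.)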
bruce2359
Posted: Wed Mar 06, 2019 5:14 pm
hughson wrote:
bruce2359 wrote:
What are the adapters and dispatchers values?
Are you thinking that the network slowdown is caused by having too few dispatchers?
We've already ruled out a commit slowdown, since NETTIME is seen to increase just before the timeout is seen, so I don't think the number of adapters is at fault.
It's a possibility. I've seen small test-system values accidentally percolate into production. Dispatchers face the network. Adapters face inward to support MQI calls.
Generally, 300 msgs/sec is not a very heavy load for z/OS MQ. I'm always suspicious of firewalls.
What else is going on in the entire network at the time of the failure? Is someone FTPing huge files? Streaming video?
Is EREP reporting anything for the NIC cards? Is RMF reporting anything?
zpat
Posted: Thu Mar 07, 2019 6:18 am
60 dispatchers started.
There are some big FTPs on the same adapter, but not over the same external network link. There is no obvious correlation with the time of restart.
There are no apparent hardware errors, and the timeouts have happened on this channel, which has a CHLDISP of SHARED on both sides of the QSG; the QSG members are on different sites and hardware.
bruce2359
Posted: Thu Mar 07, 2019 7:50 am
I'm guessing that your replies re EREP and RMF are about your end of the channel. Do you have access to SYSLOG, or a helpful sysprog, at the other end?
zpat
Posted: Tue Mar 12, 2019 6:01 am
The other end can't see any issues.
We've now been seeing relatively high network latency on this link recently, without actual timeouts.
We can't seem to find the cause of this latency, which shows up in the sender channel NETTIME value.
The network guys can't see any issues either, but the NETTIME values are almost 10 times higher than usual.
Could anything in the z/OS TCP stack cause delays? It seems unlikely to me.
Restarting the channel restored normal latency.
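(A sketch of how that shows up in channel status, with the channel name hypothetical and the figures made up to match the behaviour described here: NETTIME is reported as two values in microseconds, a short-term indicator followed by a longer-term one, so

DISPLAY CHSTATUS(QMR1.TO.QMR2) NETTIME

on a degraded link might return something like NETTIME(250000,30000), i.e. roughly 250 ms recently against a longer-term figure of around 30 ms.)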
hughson
Posted: Tue Mar 12, 2019 4:25 pm
zpat wrote:
Restarting the channel restored normal latency.
The fact that closing the old socket and making a new one returned the network to normal latency suggests that the socket had perhaps gone into re-transmission mode. Perhaps a router in the network has been having issues.
Just a guess though.
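(If retransmission is suspected, the channel's underlying connection can be inspected from the z/OS console. This is a sketch only: the TCP/IP procedure name and partner address below are placeholders, and the exact command form and report fields vary by Communications Server level:

D TCPIP,TCPIP,NETSTAT,ALL,IPADDR=203.0.113.10

The Netstat ALL report for that connection includes round-trip time and retransmission counters; a retransmit count that climbs while NETTIME is high would support the retransmission theory, while a clean report points back towards the far end or the wider network.)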
zpat
Posted: Wed Mar 13, 2019 12:01 pm
That's what I have been trying to convince the network team of.
So, just to be totally clear, what things can cause NETTIME to increase?
The z/OS TCP software layer (causes?)
The internal network (VIPA sysplex adapter)
Our firewall
The virtual circuit from the telecom company
The firewall at the 3rd party
The network inside the 3rd party
z/OS TCP at the 3rd party
Are all of these possible?
None of these are MQ itself - can we rule out z/OS MQ on the two QMs as a cause of the latency measured by NETTIME? Is there any point in taking an MQ trace?
Sorry to be pedantic, but what exactly is NETTIME measuring?
When working normally it is around 30 millisecs; when slow it is consistently up at 250 millisecs, which leads to delays as the channel can't process the messages fast enough.
After the stop/start it has been running fine, but this slowdown has happened occasionally, so it will probably recur unless we can find the root cause.
Thanks.
hughson
Posted: Wed Mar 13, 2019 2:19 pm
zpat wrote:
What exactly is NETTIME measuring?
An MQ channel, when it is doing a round trip (end of batch or a heartbeat), remembers the time it sent the "Request for confirmation" flow, and when it gets back the "Acknowledgement" flow it takes the time again. Inside the "Acknowledgement" flow is the amount of time that the partner end spent doing the MQCMIT (if it was an end of batch), and this value is removed from the time taken to do the round trip.
So NETTIME is as close to measuring only the time spent in the network as it can be (from the perspective of MQ, the owner of the socket).
Its intent was to give MQ administrators some ammunition when talking to the network team, to point out to them that there was a problem on the network.
Cheers,
Morag
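(A worked illustration of that description, using made-up numbers: if the sender records the time when it sends the end-of-batch "Request for confirmation", the "Acknowledgement" arrives 42 ms later, and the acknowledgement reports that the remote end spent 12 ms in the MQCMIT, then the time attributed to the network is roughly 42 ms minus 12 ms, i.e. about 30 ms, recorded as 30000 since NETTIME is kept in microseconds.)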
zpat
Posted: Thu Mar 14, 2019 1:22 pm
Thanks. Looking at the KC, I see this:
Quote:
The NETTIME value is the amount of time, displayed in microseconds, taken to send an end of batch request to the remote end of the channel and receive a response minus the time to process the end of batch request. This value can be large for either of the following reasons:
The network is slow.
A slow network can affect the time it takes to complete a batch. The measurements that result in the indicators for the NETTIME field are measured at the end of a batch. However, the first batch affected by a slowdown in the network is not indicated with a change in the NETTIME value because it is measured at the end of the batch.
Requests are queued at the remote end, for example a channel can be retrying a put, or a put request may be slow due to page set I/O. Once any queued requests have completed, the duration of the end of batch request is measured. So if you get a large NETTIME value, check for unusual processing at the remote end.
I am confused by the last paragraph, which suggests that MQ processing delays at the remote end are included in NETTIME.