BBM
Posted: Tue Aug 16, 2011 7:16 am Post subject: Channel in-doubt
Master
Joined: 10 Nov 2005 Posts: 217 Location: London, UK

Hi,
We are testing with an external party, and the sender-receiver channel pair between us is intermittently not passing any data.
We went as far as deleting and re-creating the channel pair, which enabled the sending end to send us 4 messages. We also noticed that the channel intermittently goes into in-doubt status.
We have set the sender channel's BATCHHB attribute to 30000 ms, then 5000 ms, but this makes no difference. The channel at both ends shows as running, but the sequence numbers have a discrepancy equal to the number of messages that haven't been delivered. The channel status also shows that 0 messages have been sent.
I'm wondering what to try next. I'm guessing it must be network-related, as we can send in the other direction with no issues.
Any ideas?
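The attribute being tuned here is the sender channel's batch heartbeat, BATCHHB, which is specified in milliseconds. As a minimal runmqsc sketch, assuming the channel name SENDQM.TO.RCVQM that appears later in the thread, setting it and then checking the in-doubt flag and sequence numbers on the sending queue manager might look like this:

```mqsc
* Run in runmqsc against the SENDING queue manager (SENDQM).
* BATCHHB is in milliseconds: before committing a batch, the sender
* sends a batch heartbeat if it has heard nothing from the receiver
* in the last 5000 ms.
ALTER CHANNEL(SENDQM.TO.RCVQM) CHLTYPE(SDR) BATCHHB(5000)

* Changes to a channel definition take effect the next time it starts.
STOP CHANNEL(SENDQM.TO.RCVQM)
START CHANNEL(SENDQM.TO.RCVQM)

* Show whether a batch is in doubt and where the sequence numbers stand.
DISPLAY CHSTATUS(SENDQM.TO.RCVQM) STATUS INDOUBT CURSEQNO LSTSEQNO MSGS
```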

mqjeff
Posted: Tue Aug 16, 2011 7:32 am
Grand Master
Joined: 25 Jun 2008 Posts: 17447

I would confirm that the other end is not doing something like trying to use BigIP to load balance between qmgrs.

Mr Butcher
Posted: Tue Aug 16, 2011 11:11 pm
Padawan
Joined: 23 May 2005 Posts: 1716

If both channel ends are running, but the channel is in-doubt and data is not passing through (apart from the first batch of messages, which is in-doubt), then it could be that the receiving channel cannot deliver the messages and is in message-retry processing. So I would check the message-retry parameters of the receiving channel and remove them. Make sure a DLQ is defined for the receiving queue manager. Then try again and check whether anything lands on the DLQ, and if so, check why.
_________________
Regards, Butcher
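On a receiver channel, the retry settings in question are the message-retry attributes MRRTY (retry count) and MRTMR (retry interval in milliseconds). A sketch of switching message retry off and then looking at the DLQ, assuming the channel name used in this thread; the queue name SYSTEM.DEAD.LETTER.QUEUE is an assumption, so use whatever DISPLAY QMGR DEADQ actually reports:

```mqsc
* Run in runmqsc against the RECEIVING queue manager (RCVQM).
* MRRTY(0): do not retry a failed put; the message goes to the DLQ instead.
* Like any channel attribute change, this applies from the next channel start.
ALTER CHANNEL(SENDQM.TO.RCVQM) CHLTYPE(RCVR) MRRTY(0)

* Confirm which queue is configured as the dead-letter queue.
DISPLAY QMGR DEADQ

* After retesting, see whether anything arrived there (queue name assumed).
DISPLAY QLOCAL(SYSTEM.DEAD.LETTER.QUEUE) CURDEPTH
```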

gbaddeley
Posted: Wed Aug 17, 2011 1:19 am
Jedi Knight
Joined: 25 Mar 2003 Posts: 2538 Location: Melbourne, Australia

Look at the MQ error logs and the output of DIS CHS(xx) ALL on both sides. This will give you more information about why the channel is in-doubt, and will determine what you need to do to correct it.
_________________
Glenn
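A sketch of that comparison, assuming the channel name used in this thread. When the sender shows INDOUBT(YES), the usual check is whether the sender's CURLUWID matches the receiver's LSTLUWID: if it does, the receiver has already committed that batch; if not, it has not. These commands only display status and change nothing.

```mqsc
* On the SENDING queue manager (SENDQM):
DISPLAY CHSTATUS(SENDQM.TO.RCVQM) ALL
* Note INDOUBT, CURLUWID and CURSEQNO in the output.

* On the RECEIVING queue manager (RCVQM):
DISPLAY CHSTATUS(SENDQM.TO.RCVQM) ALL
* Note LSTLUWID and LSTSEQNO in the output.

* Sender CURLUWID equal to receiver LSTLUWID: receiver committed the batch.
* Sender CURLUWID different from receiver LSTLUWID: receiver did not commit it.
```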

BBM
Posted: Wed Aug 17, 2011 5:12 am
Master
Joined: 10 Nov 2005 Posts: 217 Location: London, UK

Thanks everyone. I changed the retry values as per Mr Butcher's suggestion, which did not correct the issue.
The sending end is not clustered or using BIGIP.
Here is the receiving end's CHS output. Note that it says the channel is not in-doubt, but the sender's CHS states that it is in-doubt:
CHANNEL(SENDQM.TO.RCVQM) CHLTYPE(RCVR)
BATCHES(0) BATCHSZ(50)
BUFSRCVD(3) BUFSSENT(3)
BYTSRCVD(356) BYTSSENT(340)
CHSTADA(2011-08-17) CHSTATI(14.02.43)
COMPHDR(NONE,NONE) COMPMSG(NONE,NONE)
COMPRATE(0,0) COMPTIME(0,0)
CONNAME(x.x.x.x) CURLUWID(4E3D201D10024901)
CURMSGS(0) CURRENT
CURSEQNO(39) EXITTIME(0,0)
HBINT(300) INDOUBT(NO)
JOBNAME(0000414B00075AC5) LOCLADDR(::ffff:x.x.x.x(1540))
LSTLUWID(4E3D201D10024901) LSTMSGDA( )
LSTMSGTI( ) LSTSEQNO(39)
MCASTAT(RUNNING) MCAUSER(mqm)
MONCHL(OFF) MSGS(0)
NPMSPEED(FAST) RQMNAME(SENDQM)
SSLCERTI( ) SSLKEYDA( )
SSLKEYTI( ) SSLPEER( )
SSLRKEYS(0) STATUS(RUNNING)
STOPREQ(NO) SUBSTATE(RECEIVE)
XBATCHSZ(0,0)
I am running out of things to try. We changed DISCINT to 300 to try to 'refresh' the connection every 5 minutes rather than leaving the channel up for an hour or more. This didn't make any difference; it still goes in-doubt after sending a few messages successfully.
There is nothing in the error logs; as far as MQ is concerned, the channel is running normally.
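DISCINT is a sender-channel attribute expressed in seconds, so 300 corresponds to the five minutes mentioned above. A sketch, assuming the channel name shown in the status output; once the disconnect interval expires, the channel ends normally and stays INACTIVE until something, typically triggering on the transmission queue, starts it again:

```mqsc
* Run in runmqsc against the SENDING queue manager (SENDQM).
* DISCINT is in seconds: end the channel after 300 s with no messages.
ALTER CHANNEL(SENDQM.TO.RCVQM) CHLTYPE(SDR) DISCINT(300)

* HBINT (seconds) controls heartbeats while the transmission queue is
* empty; the value actually used is negotiated with the other end.
DISPLAY CHANNEL(SENDQM.TO.RCVQM) DISCINT HBINT
```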

mqjeff
Posted: Wed Aug 17, 2011 5:21 am
Grand Master
Joined: 25 Jun 2008 Posts: 17447

There has to be something in the sender's MQ error logs.

Mr Butcher
Posted: Wed Aug 17, 2011 5:37 am
Padawan
Joined: 23 May 2005 Posts: 1716

Just to make it clear: my suggestion was not meant to correct the problem but to find it, because while the channel is in message retry, no error messages are written to the logs until all retries have been performed.
However, as already suggested, check the logs on both ends.
What MQ versions are the sending and receiving queue managers on?
_________________
Regards, Butcher

BBM
Posted: Wed Aug 17, 2011 6:12 am
Master
Joined: 10 Nov 2005 Posts: 217 Location: London, UK

Yep, apologies Mr Butcher. We didn't get anything on the DLQ after the retry values were changed.
The versions are: sender 6.x (will find out exact version), and at our end 7.0.1.1.
I've just been speaking to the network team on our side and apparently they have managed to capture a TCP/IP anomaly in their trace.
I asked the sender side to re-examine their logs, and they came up with the following error, which correlates with our network team's trace. I think we have found the issue. The strange thing is that on our side there was nothing in the logs and the channel is running happily.
Many thanks!
17/08/2011 13:37:52 - Process(21964.1) User(mqm) Program(runmqchl_nd)
AMQ9209: Connection to host 'host (x.x.x.x)' closed.
EXPLANATION:
An error occurred receiving data from 'host (x.x.x.x)' over TCP/IP.
The connection to the remote host has unexpectedly terminated.
ACTION:
Tell the systems administrator.
----- amqccita.c : 3094 -------------------------------------------------------
17/08/2011 13:37:52 - Process(21964.1) User(mqm) Program(runmqchl_nd)
AMQ9999: Channel program ended abnormally.
EXPLANATION:
Channel program 'SENDQM.TO.RCVQM' ended abnormally.
ACTION:
Look at previous error messages for channel program 'SENDQM.TO.RCVQM' in
the error files to determine the cause of the failure.

fjb_saper
Posted: Wed Aug 17, 2011 8:22 am
Grand High Poobah
Joined: 18 Nov 2003 Posts: 20756 Location: LI,NY

You might want to upgrade your 7.0.1.1 to 7.0.1.6 (latest) or at least to 7.0.1.3 (GA).
_________________
MQ & Broker admin

bruce2359
Posted: Wed Aug 17, 2011 8:41 am
Poobah
Joined: 05 Jan 2008 Posts: 9469 Location: US: west coast, almost. Otherwise, enroute.

BBM wrote:
"Yep apologies Mr Butcher, we didn't get anything on the DLQ after the retry values were changed."
Have you defined a DLQ on the destination qmgr? Have you altered the qmgr object to tell the qmgr the name of the DLQ?
_________________
I like deadlines. I like to wave as they pass by.
ב''ה
Lex Orandi, Lex Credendi, Lex Vivendi. As we Worship, So we Believe, So we Live.
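Those two questions correspond to two separate steps on the destination queue manager: the queue has to exist, and the queue manager's DEADQ attribute has to name it. A minimal sketch; the queue name RCVQM.DLQ is hypothetical:

```mqsc
* Run in runmqsc against the DESTINATION queue manager (RCVQM).
* 1. Create a local queue to act as the DLQ (hypothetical name).
DEFINE QLOCAL(RCVQM.DLQ)

* 2. Tell the queue manager to use it as its dead-letter queue.
ALTER QMGR DEADQ(RCVQM.DLQ)

* Verify the setting.
DISPLAY QMGR DEADQ
```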

gbaddeley
Posted: Wed Aug 17, 2011 8:13 pm
Jedi Knight
Joined: 25 Mar 2003 Posts: 2538 Location: Melbourne, Australia

The RCVR appears to still be in RUNNING status, with substate RECEIVE. It probably doesn't know that the SDR has died, because it is sitting on a blocked TCP socket read.
I suggest that you STOP and then START the RCVR. It will either go into INACTIVE status (i.e. DIS CHS() CURRENT shows no status information), or it will go into RUNNING, RETRY or INDOUBT status.
At the far end, after starting the SDR, it may be in RETRY or INDOUBT status.
Check the latest messages in the error logs on both sides. Take appropriate action, such as RESOLVE ... BACKOUT, or RESET ... SEQNUM.
It's fairly safe to do a RESOLVE ... BACKOUT on the SDR side, and then RESET SEQNUM(1) on the SDR (the RCVR will then start at 1 again too).
_________________
Glenn
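That sequence written out as runmqsc commands, assuming the channel name used in this thread. RESOLVE and RESET are issued at the sending end while the channel is not running. BACKOUT returns the in-doubt batch to the transmission queue, so if the receiver had in fact committed it the messages arrive twice; COMMIT discards the batch, so if the receiver had not committed it the messages are lost. The worst case for BACKOUT is duplicates rather than loss, which is why it is described as fairly safe.

```mqsc
* On the RECEIVING queue manager (RCVQM): bounce the receiver.
STOP CHANNEL(SENDQM.TO.RCVQM)
START CHANNEL(SENDQM.TO.RCVQM)
DISPLAY CHSTATUS(SENDQM.TO.RCVQM) CURRENT

* On the SENDING queue manager (SENDQM): stop the channel first.
STOP CHANNEL(SENDQM.TO.RCVQM)
* Put the in-doubt batch back on the transmission queue.
RESOLVE CHANNEL(SENDQM.TO.RCVQM) ACTION(BACKOUT)
* Restart numbering; the receiver picks this up on the next start.
RESET CHANNEL(SENDQM.TO.RCVQM) SEQNUM(1)
START CHANNEL(SENDQM.TO.RCVQM)
```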

BBM
Posted: Thu Aug 18, 2011 1:14 am
Master
Joined: 10 Nov 2005 Posts: 217 Location: London, UK

Hi Bruce, yep, the DLQ was defined and the qmgr was configured with the name of the DLQ.
I've seen in-doubt channels in the past, but this is the first time I haven't immediately been able to pin it down to one thing or another.
When the issue occurred (which was 50% of the time), we were able to resolve the channel and then try again, sometimes successfully and sometimes not.
The problem was that it was happening far too often, so it needed manual intervention all the time, which is not acceptable. Stopping the channel at either end and resetting sequence numbers usually did not cure the issue.
After a lot of testing, we found that the issue is less likely to occur with small messages. When we looked at the network trace, we discovered that the sending end was not responding to TCP SACKs (selective acknowledgements), which tell the sender which segments have been received so that it can retransmit only the missing data.
The sending end did not respond to these SACKs and carried on sending regardless; this is what caused the channel to go in-doubt. We are now looking into why this is happening, and the evidence is that it's an issue with the NIC/NIC driver on the sending end.
Thanks again for all the suggestions.

fjb_saper
Posted: Thu Aug 18, 2011 1:34 am
Grand High Poobah
Joined: 18 Nov 2003 Posts: 20756 Location: LI,NY

BBM wrote:
"The sending end did not respond to these SACKs and carried on sending regardless, this is what caused the channel to go in-doubt. We are now looking into why this is happening and the evidence is that it's an issue with the NIC/NIC driver on the sending end."
Wrong conclusion.
Again, check the version of your MQ. Remember that for V7.0.1.x the GA version is V7.0.1.3, and your version precedes that. My suggestion is to upgrade to V7.0.1.6 and verify whether the problem is still happening.
I would not be surprised if the problem has been fixed since.
_________________
MQ & Broker admin

BBM
Posted: Thu Aug 18, 2011 2:25 am
Master
Joined: 10 Nov 2005 Posts: 217 Location: London, UK

Hi fjb_saper,
I'm willing to be proved wrong on the network theory, but we have 18 queue managers on this box, all running fine. We also have 12 other external parties connecting into this queue manager that are not experiencing the issue.
Can you elaborate on why you think it might not be the network infrastructure, and why an upgrade would fix it?
Thanks

BBM
Posted: Thu Aug 18, 2011 3:41 am
Master
Joined: 10 Nov 2005 Posts: 217 Location: London, UK

By the way, the sending end is on 6.0.2.2.