Question about long running UOW's |
Author |
Message
|
GheorgheDragos |
Posted: Thu Aug 29, 2019 12:14 am Post subject: Question about long running UOW's |
|
|
Acolyte
Joined: 28 Jun 2018 Posts: 51
|
Dears,
We have encountered the following situation, in which, to my regret, I made a mistake by claiming in front of the customer that "we cannot help from the mainframe side". I am not proud of myself, and it is not like me at all to dismiss a possibility like that.
For several days one of our queue managers has not taken a checkpoint. All well and good: we identified the culprit queues, channels and servers. We told the customer to commit their threads, and were told that they do not know how to do so, since their application uses JMS. We had a meeting and they asked me: when we stop and start the application, how come that doesn't solve the issue (since they restart it twice per night, at 1 am and at 5 am)? I replied that when an application is stopped, at restart it will re-use any hanging threads in order to complete them, just like CICS starting up in warm mode. OK. So they asked for confirmation that the problem is coming from their side into the mainframe, and whether there is anything I can do from my side to fix it. The channel being a SVRCONN, it is impossible for us to run a commit. So I replied with a definitive NO (well, more politely than that) and suggested that they commit all their units of work. Later that day I received an email from them with an IBM article about a possibility to use STOP CHANNEL(FORCE)!!!!!!! to commit the threads, from the RECEIVING side. I had no idea about this, and I feel terrible, having already stopped the channel several times from my side (in normal quiesce mode). I would like to ask the following:
KAINT - One SVRCONN has multiple threads opened against it (running instances), coming from different servers, all from the same application. In the case of an unexpected TCP/IP problem, when the connection between the client and the mainframe is broken and the orphaned thread stays hanging on the mainframe (is that even possible? we are running MQ 8), can KAINT help? For example, when there are 30 threads active (and by the way, these threads might not even be doing any I/O, just keeping the channel active), TCP/IP breaks down and one incoming message is on its way: can this message somehow remain in a 'loop' on the MQ side and, when the thread it was assigned to gets timed out due to KAINT, will it be 'purged'? Is KAINT managing the channel itself, or each individual thread? For example, we have 30 individual threads on one SVRCONN; 29 are idle, yet still active, and one is doing active I/O. Will the 29 threads be timed out?
Thank you for your time.
Dragos Gheorghe |
|
hughson |
Posted: Thu Aug 29, 2019 1:54 am Post subject: |
|
|
Padawan
Joined: 09 May 2013 Posts: 1948 Location: Bay of Plenty, New Zealand
|
When an application running over a SVRCONN ends, and the CHINIT detects that ending of the SVRCONN, either because you stopped it, or because you used HBINT or KAINT to help to detect broken TCP/IP connections, the SVRCONN will, as part of its clean-up, issue an MQBACK to roll back any MQ-only UoWs.
If these UoWs you are having trouble with were single phase commit UoWs they would already be tidied up. Therefore I have to assume that these are 2-phase commit transactions?
I am not aware of an article from IBM that suggests doing a STOP MODE(FORCE) on a channel to Commit any transaction. Is it possible that you could provide us with the link to the article that they sent you?
P.S. KAINT is generally these days unnecessary since from MQ V7, client channels have much better heart-beating. Check that your SVRCONNs are using HBINT by looking at the HBINT value on DISPLAY CHSTATUS - which will show you the negotiated value.
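To make the clean-up behaviour described above a little more concrete, here is a toy Python model. It is not MQ code and makes no claim about how the CHINIT is implemented. It assumes that each running SVRCONN instance is checked separately, that a reachable client answers the probe and is left alone, and that only an instance whose peer has gone away is cleaned up with an MQBACK-style rollback. The names `Instance`, `peer_alive` and `sweep` are invented for the illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """One running instance of a SVRCONN channel (one client connection)."""
    name: str
    peer_alive: bool = True  # assumption: a reachable client answers the probe
    uncommitted: list = field(default_factory=list)  # messages held in an open UoW


def sweep(instances):
    """Check every instance on its own and clean up only the broken ones.

    Returns (survivors, backed_out): the instances that stay running and the
    messages made available again by the rollback done during clean-up.
    """
    survivors, backed_out = [], []
    for inst in instances:
        if inst.peer_alive:
            survivors.append(inst)  # idle but reachable: nothing happens
        else:
            backed_out.extend(inst.uncommitted)  # rollback of the open UoW
    return survivors, backed_out


# 29 idle but healthy instances, plus one whose network connection broke
# while it still had a message in an uncommitted unit of work.
instances = [Instance(f"conn{i}") for i in range(29)]
instances.append(Instance("conn29", peer_alive=False, uncommitted=["msg-A"]))

survivors, backed_out = sweep(instances)
print(len(survivors), backed_out)
```

Under those assumptions the 29 idle instances are not timed out just for being idle; only the broken one is removed, and its message is backed out rather than purged.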
Cheers,
Morag _________________ Morag Hughson @MoragHughson
IBM MQ Technical Education Specialist
MQGem Software |
|
bruce2359 |
Posted: Thu Aug 29, 2019 2:26 am Post subject: |
|
|
Poobah
Joined: 05 Jan 2008 Posts: 9442 Location: US: west coast, almost. Otherwise, enroute.
|
Some more general questions:
Has this application worked properly in the past?
You mentioned CICS. Is CICS involved in this? Are any other resource managers (databases, for example) involved in the UofW?
When did this problem begin? Has the app been modified recently?
What errors did you see in the error logs? _________________ I like deadlines. I like to wave as they pass by.
ב''ה
Lex Orandi, Lex Credendi, Lex Vivendi. As we Worship, So we Believe, So we Live. |
|
HubertKleinmanns |
Posted: Thu Aug 29, 2019 4:30 am Post subject: Re: Question about long running UOW's |
|
|
Shaman
Joined: 24 Feb 2004 Posts: 732 Location: Germany
|
GheorgheDragos wrote: |
Later that day I received an email from them with an IBM article about a possibility to use STOP CHANNEL(FORCE)!!!!!!! to commit the threads, from the RECEIVING side. I had no idea about this, and I feel terrible, having already stopped the channel several times from my side (in normal quiesce mode). |
I would never stop a running channel with mode force - you may lose messages. The only situation where I had to stop a channel with mode force was a sender channel in state BINDING. This happens, e.g., when a firewall blocks the connection. In this case no messages have passed through the channel, so stopping it with mode force is harmless.
In addition only the application, which uses the channel, is able to decide, whether a COMMIT has to be done or not. You as an administrator do not know about the application's logic (normally).
- When the application stops in a "normal" manner, the QMgr will perform a COMMIT.
- When the application crashes, the QMgr will perform a BACKOUT.
So there is no need for MQ Administrators, to force a COMMIT (except in some special cases).
GheorgheDragos wrote: |
For example, we have 30 individual threads on one SVRCONN; 29 are idle, yet still active, and one is doing active I/O. Will the 29 threads be timed out? |
I saw such situations often at my customers. Normally
1. either the client app does not properly close a connection (and "forgets" the connection handle)
2. or a firewall cuts the connection (e. g. due to low traffic).
The second case sometimes occurs in test environments, whereas in production the channel - and so the IP connection - is more busy. _________________ Regards
Hubert |
|
tczielke |
Posted: Thu Aug 29, 2019 5:25 am Post subject: |
|
|
Guardian
Joined: 08 Jul 2010 Posts: 941 Location: Illinois, USA
|
You can definitely perform rollbacks and commits with JMS. If the JMS developer is not aware of this, they need to familiarize themselves with the JMS 2.0 specification. _________________ Working with MQ since 2010. |
|
Vitor |
Posted: Thu Aug 29, 2019 5:26 am Post subject: |
|
|
Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
Taking a step back, I would express the opinion that it's a no-brainer for the processing application to control UOW. What would happen if, instead of MQ it was a database and the database was determining when and if to commit or rollback any work?
Moving specifically to this:
GheorgheDragos wrote: |
We informed the customer to commit their threads, and we were informed that they do not have the knowledge on how to do so, their application using JMS |
There's another comment here about no brains.
I would have, with minimal politeness, advised them to maybe learn how to use the programming framework they've elected to employ.
I also throw into the mix that my lack of knowledge about Java is legend on this forum. I typed "controlling unit of work with JMS" into Google and not only got this, this and this as the first 3 hits but also this instructional video.
GheorgheDragos wrote: |
Later on that day I receive an email from them with an ibm article about a possibility to use STOP CHANNEL(FORCE)!!!!!!! to commit the threads, from the RECEIVING side. |
Post the link. I would rebut it with this, which is not an "article" but from the product documentation and has the following to say about the FORCE option:
Quote: |
The channel does not complete processing the current batch of messages, and can, therefore, leave the channel in doubt. In general, consider using the quiesce stop option. |
So would your customer prefer to recode their application to handle commits correctly, or recode their application (and the containing server in all likelihood) to handle in-doubt client channels?
(Hint: the correct answer is "handle commits" unless their admin is a masochist who enjoys spending time at the office) _________________ Honesty is the best policy.
Insanity is the best defence. |
|
GheorgheDragos |
Posted: Mon Sep 09, 2019 9:17 am Post subject: |
|
|
Acolyte
Joined: 28 Jun 2018 Posts: 51
|
The situation was handled so that the official response was that it is, of course, not possible to commit anything from the SVRCONN side; however, STOP MODE(FORCE) should work, according to the manual. Unfortunately our shop is a little bit disorganized, so we don't have a JMS specialist to work on their application and change the way its error handling works. The STOP MODE(FORCE) was done while the application was stopped. After that we were able to take the checkpoint, and this particular administrator looked like he didn't know what he was talking about when he said that the action (commit) has to be taken from the sender side and that we cannot help from the mainframe side. |
|
Vitor |
Posted: Mon Sep 09, 2019 10:38 am Post subject: |
|
|
Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
GheorgheDragos wrote: |
The stop mode(force) was done while the application was stopped. |
I hope it was disconnected as well, not just stopped but using a JMS connection pool that was still active.
I'm still waiting for a link to this IBM article they sent.
How's the channel now? Running properly, running properly but some weirdness, in-doubt or just plain broken?
Is the customer happy that the resolution to this problem is to stop the application periodically so you can force the channel (risking integrity) and take a checkpoint (which given the use of a FORCE may not be worth the electrons you're using to store it)? Do they still think this is better and has less business risk than learning how to use JMS properly? _________________ Honesty is the best policy.
Insanity is the best defence. |
|
GheorgheDragos |
Posted: Fri Sep 27, 2019 3:44 am Post subject: |
|
|
Acolyte
Joined: 28 Jun 2018 Posts: 51
|
Hi,
The channel is running again without any issues, out of sequence or anything like that. Since my last post here, the checkpoint has been taken successfully. The problem is that our client doesn't have people who know how to support the JMS side where the problem started, and they were trying to pin it on us to solve. Force-stopping the channel (while the JCM was down) proved to them that we can indeed do it, and that it has to be done like this, rather than them digging deeper into their stuff and realising that firing the grey heads who wrote it was NOT a good idea.
Dragos
EDIT
I am not against knowledge, but I want to be paid if I am to support JMS as well as mainframe. |
|
gbaddeley |
Posted: Sun Sep 29, 2019 4:25 pm Post subject: |
|
|
Jedi Knight
Joined: 25 Mar 2003 Posts: 2527 Location: Melbourne, Australia
|
GheorgheDragos wrote: |
...The problem is that our client doesn't have people that know how to support the JMS from where the problem started, and were trying to pin it on us to solve it. And by force stopping the channel ( while the JCM was down ) proved to them that indeed we can do it and it has to be done like this, than to dig deeper in to their stuff and realise that firing the grey heads who wrote it was NOT a good idea... |
Seems that it's their problem, not yours. Push back and say that force stopping a channel may result in lost data and that you won't be doing it again. Don't give in. _________________ Glenn |
|
GheorgheDragos |
Posted: Tue Oct 01, 2019 1:27 am Post subject: |
|
|
Acolyte
Joined: 28 Jun 2018 Posts: 51
|
Hello,
I would like to take this time to reply to some of the questions in this thread, for the betterment of new (and experienced!) colleagues. I start:
"Therefore I have to assume that these are 2-phase commit transactions?"
I honestly do not know, because I wasn't around when these were developed and was not part of the meetings between development and the long-retired MQ system programmers.
Here is the link mentioned, sent by the client: https://www.ibm.com/support/pages/kaint-keepalive-orphaned-client-channel-ibm-mq-zos
"Has this application worked properly in the past?" Yet another question I cannot answer with certainty. Generally it has problems closing its threads, as we constantly see long UOWs in the job logs, but this doesn't last more than a few days. The problem is that their applications are as old as time, and getting them changed in a corporation is, as many of you know, VERY hard. We had a situation where messages put from the client side would simply disappear. Lots of ping-pong emails; the client would simply reject the fact that the clues pointed to his application side, as they were using a home-written app that would PUT messages on the client side, and then QR or channels would send them to us on the mainframe. After several calls and tests we proved to them that, for several message types, the messages simply disappeared and NEVER reached the QMGR on distributed. So you can see the state of our client's infrastructure (not all that bad, to be honest, but it could use a nice tune-up!).
"You mentioned CICS. Is CICS involved in this? Are any other resource managers (databases, for example) involved in the UofW?" No, no CICS involved, only client and server qmgrs. CICS was only an example of how their application might behave during a normal restart, the way CICS does, recovering all UOWs including toxic ones.
"When did this problem begin? Has the app been modified recently?" Not according to them. We always ask these questions, of course, to rule out a common problem, and sometimes they say yes, we rolled out some new modifications, we will roll back and test more, etc. Not in this case.
"What errors did you see in the error logs?" Checkpoint expired with 55 seconds. That is all. And of course the long UoW shunted over several days.
"I never would stop a running channel with mode force - you may lose messages" None of us would do it on a normal day, but when the client application closes OK and the customer says their business is fine and no records are missing, what is there left to do? Especially when the customer says that everything is fine from their side, and a JMS specialist is dreadfully missing.
"In addition only the application, which uses the channel, is able to decide, whether a COMMIT has to be done or not. You as an administrator do not know about the application's logic (normally)." Correct. Since the client adopted the "I want it to work" policy, I was left with no choice; the checkpoint had to be taken. Look at my newest ghost messages thread and see that QMGRs crashed due to our most recent P1. What would happen if the checkpoint had not been taken for 3-4 weeks?
"- When the application stops in a "normal" manner, the QMgr will perform a COMMIT." No, it didn't.
"- When the application crashes, the QMgr will perform a BACKOUT." No, it hasn't.
You see, the theory doesn't always match the practice. When we increased the CF structure dynamically during the ghost messages P1, we were told by L4 and others I do not wish to mention that the change would be seamless and MQ would not suffer, even though I advised the big fishes to inform the customer that a disruption is necessary, during disconnect and connect. Results were catastrophic, yet necessary. When the labs informed us that there is a GET in a loop, why didn't we see any uncommitted UoW on the same queue manager where they correctly predicted the problem was hosted? And how can a GET be in a loop and cause the structure to fill up? I understand a put, but a get? Here is their explanation, with some info blanked out for obvious reasons:
Per MQ/zOS Development it looks like the reason the CF is filling up is because there's a CF list, #523, that contains nearly 27 million entries. List header #523 represents the uncommitted gets for queue manager xxxx.
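A toy Python model may help show how a GET loop, and not only a PUT loop, can fill storage. It is only a sketch and not how MQ or the coupling facility is implemented. The assumption is that a message gotten under syncpoint moves to an "uncommitted gets" list and keeps occupying space until the unit of work is committed or backed out. `ToyQueue`, `run` and `commit_every` are names invented for the illustration.

```python
from collections import deque


class ToyQueue:
    """Toy model of gets under syncpoint; not MQ's real implementation."""

    def __init__(self):
        self.available = deque()    # messages visible to getters
        self.uncommitted_gets = []  # gotten under syncpoint, space still held

    def put(self, msg):
        self.available.append(msg)

    def get_under_syncpoint(self):
        msg = self.available.popleft()
        self.uncommitted_gets.append(msg)  # held until commit or backout
        return msg

    def commit(self):
        self.uncommitted_gets.clear()  # space is finally released

    def backout(self):
        # Return the messages to the front of the queue in their original order.
        self.available.extendleft(reversed(self.uncommitted_gets))
        self.uncommitted_gets.clear()

    def storage_used(self):
        return len(self.available) + len(self.uncommitted_gets)


def run(commit_every, total=1000):
    """Put and get `total` messages; commit every `commit_every` gets (0 = never).

    Returns the peak number of entries held at any one time.
    """
    q = ToyQueue()
    peak = 0
    for i in range(total):
        q.put(i)
        q.get_under_syncpoint()
        peak = max(peak, q.storage_used())
        if commit_every and (i + 1) % commit_every == 0:
            q.commit()
    return peak


print("never commits:", run(commit_every=0))
print("commits every 10 gets:", run(commit_every=10))
```

In this model the queue itself always looks empty, yet the getter that never commits ends up holding every message it has ever read, which is the same shape as the list of uncommitted gets described in the lab's explanation.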
Continuing in next post |
|
GheorgheDragos |
Posted: Tue Oct 01, 2019 1:31 am Post subject: |
|
|
Acolyte
Joined: 28 Jun 2018 Posts: 51
|
"I would have, with minimal politeness, advised them to maybe learn how to use the programming framework they've elected to employ."
And that would change what exactly... there are customers and customers.
"The channel does not complete processing the current batch of messages, and can, therefore, leave the channel in doubt. In general, consider using the quiesce stop option."
Have done that on multiple occasions. No help. The only success was with force.
"How's the channel now? Running properly, running properly but some weirdness, in-doubt or just plain broken?"
The channel is running again with, as mentioned above, long-running units of work shunted and, as I realised just now, blocking our checkpoints... yet again, with no alerts. Back to work.
EDIT: it appears that our colleague from the past set a limit of 5 consecutive alerts to generate one alert, while a positive checkpoint resets the counter. Quite OK.
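The alerting rule described in the EDIT can be sketched in a few lines of Python. This is a guess at the logic, not the actual monitoring code: it assumes an alert fires once, when the fifth failed checkpoint in a row is seen, and that any successful checkpoint resets the count. `CheckpointAlerter` and `record` are invented names.

```python
class CheckpointAlerter:
    """Raise one alert after `threshold` consecutive failed checkpoints."""

    def __init__(self, threshold=5):
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, checkpoint_ok):
        """Record one checkpoint attempt; return True if an alert fires now."""
        if checkpoint_ok:
            self.consecutive_failures = 0  # a good checkpoint resets the counter
            return False
        self.consecutive_failures += 1
        # Assumption: fire only when the threshold is first reached.
        return self.consecutive_failures == self.threshold


alerter = CheckpointAlerter()

# Four failures, one success, then five failures in a row.
history = [False] * 4 + [True] + [False] * 5
alerts = [alerter.record(ok) for ok in history]
print(alerts)
```

With this rule, a pattern of four failed checkpoints followed by one good one can repeat indefinitely without ever producing an alert, which is worth keeping in mind when deciding whether the threshold suits a queue manager with long-running units of work.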
"Seems that it's their problem, not yours. Push back and say that force stopping a channel may result in lost data and that you won't be doing it again. Don't give in."
I'm afraid we already set a bad example by "fixing" the problem. Anyhow, it is solved. Now we are preparing to migrate to MQ 9. We still haven't decided between LTS and CD; I guess LTS, since we aren't exploiting a lot of the new functionality anyway (at least not intentionally).
Dragos Gheorghe |
|
gbaddeley |
Posted: Tue Oct 01, 2019 4:25 pm Post subject: |
|
|
Jedi Knight
Joined: 25 Mar 2003 Posts: 2527 Location: Melbourne, Australia
|
Wow. I guess you can inform management of all the technical issues, the consequences of failures, and the actions that have been taken. They then have a basis to assess the risk and impact of business transaction failures against the cost and benefit of remediating all the known issues and other likely failure modes. Sometimes there is no win-win situation at the MQ admin level, you just need to manage the situation as best you can, and keep prodding management. _________________ Glenn |
|
HubertKleinmanns |
Posted: Tue Oct 01, 2019 11:13 pm Post subject: |
|
|
Shaman
Joined: 24 Feb 2004 Posts: 732 Location: Germany
|
I think, this is the relevant statement:
Quote: |
You observe a SVRCONN client channel in a RUN state even though the network connection to the client has stopped. |
This means the client is already disconnected from the QMgr, but the MQ channel is still visible in MQ. I have seen this in the past (on Unix), when I issued a "STOP CONN(...)" against an application that had connected through a SVRCONN channel. The STOP CONN disconnected the client from the QMgr, but didn't free the channel instance.
I guess, the described situation at your posted link is similar to what I had.
But don't stop a channel with MODE(FORCE), when the connection is still active! _________________ Regards
Hubert |
|
Vitor |
Posted: Wed Oct 02, 2019 5:17 am Post subject: |
|
|
Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
Quote: |
I would have, with minimal politeness, advised them to maybe learn how to use the programming framework they've elected to employ. |
GheorgheDragos wrote: |
and that would change what exactly.... there are customers and customers. |
At worst nothing, at best shame them into doing their jobs. It would also highlight to management (yours and theirs) the data risks their lack of knowledge is causing. "The customer is always right" does not apply in IT.
Quote: |
The channel does not complete processing the current batch of messages, and can, therefore, leave the channel in doubt. In general, consider using the quiesce stop option. |
GheorgheDragos wrote: |
Have done that on multiple occasions. No help. Only success was with force. |
Be aware that by doing this, you're exposing yourself. When data is lost (and by the laws of random chance, it will sooner or later) it will be your fault:
"It's never lost data before - he must have done it wrong"
Much better to fix the problem.
Quote: |
How's the channel now? Running properly, running properly but some weirdness, in-doubt or just plain broken? |
GheorgheDragos wrote: |
channel is running again with, as mentioned above, long running units of works shunted, and realised just now, blocking our checkpoints....... yet again, with no alerts. Back to work. |
My comment above applies.
GheorgheDragos wrote: |
it appears that our colleague from the past has set a limit of 5 consecutive alerts to generate one alert, while a positive checkpoints reset the counter. Quite Ok. |
You and I have different definitions of "OK".
Quote: |
Seems that it's their problem, not yours. Push back and say that force stopping a channel may result in lost data and that you won't be doing it again. Don't give in. |
GheorgheDragos wrote: |
I'm afraid we already gave them a bad example by "fixing" the problem. Anyhow, is solved. |
No it's not "solved". It's fixed until the next time. _________________ Honesty is the best policy.
Insanity is the best defence. |
|