Author |
Message
|
belchman |
Posted: Thu May 17, 2018 7:11 am Post subject: Mother of All Cluster Problems |
|
|
Partisan
Joined: 31 Mar 2006 Posts: 386 Location: Ohio, USA
|
I am hoping I can get some ideas from you folks on this one. I opened a ticket with IBM on this and their initial response was "a gasp" (it seemed) when I described the issue.
I have this situation where I get an AMQ9469 after a while. It says "AMQ9469: Update not received for CLUSRCVR channel TO.MQP1 hosted on queue manager MQP1.C5D30588EE85EF01 in cluster MYCLUSTER."
This occurs with predictable regularity, and I have to issue a REFRESH CLUSTER on this queue manager (the AIX one, not the z/OS one behind the cluster receiver) to make it stop.
If I do not refresh, some cluster queues on this AIX queue manager disappear and an outage ensues in which I have to refresh the cluster to restore service.
I have no idea how or why this is occurring and am looking for help.
Here is why IBM gasped when I opened the PMR. This one queue manager is a full repos for 8 different clusters.
Aside from that, the cluster that I need to refresh with regularity has 2 fulls and 1 partial. The 2 fulls are an AIX qmgr and a z/OS qmgr (MQP1). The local queues that are shared in the cluster are on MQP1.
I do not admin the z/OS qmgr so I have to ask the admin to look up stuff. All I can see is what is on AIX.
I need to figure out why MQP1 is not sending updates about its receiver or queues. When this AIX qmgr gets this error report, it is a countdown to the outage.
AMQ9456: Update not received for queue MYCLUSTERQUEUE, queue manager MQP1.C5D30588EE85EF01 from full repository for cluster MYCLUSTER.
EXPLANATION:
The repository manager detected a cluster queue that had been used sometime in the last 30 days for which updated information should have been sent from a full repository. However, this has not occurred.
The repository manager will keep the information about this queue for a further 60 days from when the error first occurred.
Any ideas you may have would be appreciated. _________________ Make three correct guesses consecutively and you will establish a reputation as an expert. ~ Laurence J. Peter |
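The two messages quoted above (AMQ9469 for the CLUSRCVR channel and AMQ9456 for the queue) can be tallied from a copy of the queue manager error log to see which objects are affected and how often. The following is a minimal Python sketch, not a definitive tool: it assumes each message sits on a single line worded exactly as quoted in this post, and the function name `tally_missing_updates` and the sample text are illustrative only.

```python
import re
from collections import Counter

# Patterns follow the message wording quoted in the post above.
# Assumption: each message appears on one line in the log copy being scanned.
PATTERNS = {
    "AMQ9469": re.compile(
        r"AMQ9469: Update not received for CLUSRCVR channel (\S+) "
        r"hosted on queue manager (\S+) in cluster (\S+)\.\s*$"
    ),
    "AMQ9456": re.compile(
        r"AMQ9456: Update not received for queue (\S+), "
        r"queue manager (\S+) from full repository for cluster (\S+)\.\s*$"
    ),
}


def tally_missing_updates(log_text):
    """Count (message id, object, queue manager, cluster) occurrences."""
    counts = Counter()
    for line in log_text.splitlines():
        for msg_id, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[(msg_id, *match.groups())] += 1
    return counts


# Sample built from the two messages quoted in the post.
SAMPLE_LOG = """\
AMQ9469: Update not received for CLUSRCVR channel TO.MQP1 hosted on queue manager MQP1.C5D30588EE85EF01 in cluster MYCLUSTER.
AMQ9456: Update not received for queue MYCLUSTERQUEUE, queue manager MQP1.C5D30588EE85EF01 from full repository for cluster MYCLUSTER.
"""

for key, count in sorted(tally_missing_updates(SAMPLE_LOG).items()):
    print(count, *key)
```

In a real error log the message text may wrap across several lines, so the log copy would need to be unwrapped (or the patterns loosened) before a tally like this is trustworthy.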
|
Anant.v |
Posted: Thu May 17, 2018 7:20 am Post subject: |
|
|
 Apprentice
Joined: 26 Nov 2014 Posts: 40 Location: Malaysia
|
I'm facing similar issues in my environment. The conclusion I have come to is that, in my case, it happens only after a DR simulation. Is it also happening for you after a DR? |
|
belchman |
Posted: Thu May 17, 2018 10:32 am Post subject: |
|
|
Partisan
Joined: 31 Mar 2006 Posts: 386 Location: Ohio, USA
|
Anant,
My situation is not related to a DR exercise or anything. It has been going on in production for over a year. _________________ Make three correct guesses consecutively and you will establish a reputation as an expert. ~ Laurence J. Peter |
|
gbaddeley |
Posted: Thu May 17, 2018 4:12 pm Post subject: |
|
|
 Jedi Knight
Joined: 25 Mar 2003 Posts: 2538 Location: Melbourne, Australia
|
AFAIK, a PR qmgr will choose to send cluster object updates to one of the FR qmgrs that it knows about. If there is an error that prevents updates from being processed by that FR, the PR will NOT choose another FR to send updates.
Check the MQ error logs and check for FDCs on the PR and FR qmgrs.
If a PR cannot do this processing for ~90 days, it will silently delete all the cluster queue defs in its local repository. Apps will then fail with RC 2189 (cluster resolution error). _________________ Glenn |
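As a rough illustration of the timescale, the AMQ9456 explanation quoted earlier in the thread says the information is kept for a further 60 days from when the error first occurred. The Python sketch below only does that date arithmetic; the first-error date is a hypothetical example, and the 60-day figure is taken from the message text rather than from any measurement on this system.

```python
from datetime import date, timedelta

# Retention period taken from the AMQ9456 explanation quoted in the thread.
RETENTION_AFTER_FIRST_ERROR = timedelta(days=60)


def retention_ends(first_error):
    """Date on which the 60-day retention window ends."""
    return first_error + RETENTION_AFTER_FIRST_ERROR


# Hypothetical date on which AMQ9456 was first reported for a queue.
first_seen = date(2018, 5, 17)
print(retention_ends(first_seen))
```

Tracking this date per queue gives a concrete deadline for resolving the missing updates before the definitions go away.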
|
belchman |
Posted: Fri May 18, 2018 3:39 am Post subject: |
|
|
Partisan
Joined: 31 Mar 2006 Posts: 386 Location: Ohio, USA
|
gbaddeley,
That's why I am confused. The queues and the clusrcvr that are shared (and that go away) are shared on an FR, and they become unavailable to another FR.
And the fact that they come back when I do a manual refresh of the cluster tells me that these FRs are able to communicate.
I am truly stumped. I opened another ESR with IBM.
Is it possible that the z/OS MQ code is experiencing that repository manager bug IT12700? _________________ Make three correct guesses consecutively and you will establish a reputation as an expert. ~ Laurence J. Peter |
|
tczielke |
Posted: Fri May 18, 2018 4:46 am Post subject: |
|
|
Guardian
Joined: 08 Jul 2010 Posts: 941 Location: Illinois, USA
|
It sounded like one FR is on z/OS and the other FR is on distributed AIX. Are they at the same code level? Personally, I would run the FRs on the same platform and IBM MQ code level. _________________ Working with MQ since 2010. |
|
belchman |
Posted: Fri May 18, 2018 4:57 am Post subject: |
|
|
Partisan
Joined: 31 Mar 2006 Posts: 386 Location: Ohio, USA
|
tczielke,
There are a number of things I would have done differently but I inherited this stuff and it is a challenge.
The mainframe queue manager is a FP (I believe) so the mainframe MQ admin can see the cluster. We have a separation between mainframe and open systems MQ that is another challenge.
I will reach out to the mainframe MQ person to see what version of MQ is installed there. I am not sure whether IT12700 would affect z/OS MQ the way it did AIX MQ. _________________ Make three correct guesses consecutively and you will establish a reputation as an expert. ~ Laurence J. Peter |
|
bruce2359 |
Posted: Fri May 18, 2018 5:46 am Post subject: |
|
|
 Poobah
Joined: 05 Jan 2008 Posts: 9469 Location: US: west coast, almost. Otherwise, enroute.
|
belchman wrote: |
The mainframe queue manager is a FP (I believe) ... |
You “believe?” MQSC DISPLAY commands will tell you which are PRs and which are FRs. _________________ I like deadlines. I like to wave as they pass by.
ב''ה
Lex Orandi, Lex Credendi, Lex Vivendi. As we Worship, So we Believe, So we Live. |
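One way to check the roles is the MQSC command `DISPLAY CLUSQMGR(*) QMTYPE`, which reports `QMTYPE(REPOS)` for a full repository and `QMTYPE(NORMAL)` for a partial one. The Python sketch below parses output of that command. The sample text is a hypothetical illustration in the style of distributed `runmqsc` output (z/OS output is formatted differently), and the queue manager names other than MQP1 are invented.

```python
import re

# Command whose output is parsed below:
#   DISPLAY CLUSQMGR(*) CLUSTER(MYCLUSTER) QMTYPE
# SAMPLE_OUTPUT is hypothetical; AIXQM1 and PARTQM1 are invented names.
SAMPLE_OUTPUT = """\
AMQ8441: Display Cluster Queue Manager details.
   CLUSQMGR(MQP1)                          CHANNEL(TO.MQP1)
   CLUSTER(MYCLUSTER)                      QMTYPE(REPOS)
AMQ8441: Display Cluster Queue Manager details.
   CLUSQMGR(AIXQM1)                        CHANNEL(TO.AIXQM1)
   CLUSTER(MYCLUSTER)                      QMTYPE(REPOS)
AMQ8441: Display Cluster Queue Manager details.
   CLUSQMGR(PARTQM1)                       CHANNEL(TO.PARTQM1)
   CLUSTER(MYCLUSTER)                      QMTYPE(NORMAL)
"""

# Matches MQSC-style NAME(VALUE) attribute pairs.
ATTR = re.compile(r"(\w+)\(([^)]*)\)")


def parse_clusqmgr(text):
    """Split the output into one attribute dict per cluster queue manager."""
    records, current = [], None
    for line in text.splitlines():
        if line.startswith("AMQ8441"):
            current = {}
            records.append(current)
        elif current is not None:
            current.update(ATTR.findall(line))
    return records


def repository_roles(text):
    """Map each queue manager name to 'FR' or 'PR' based on QMTYPE."""
    return {
        rec["CLUSQMGR"]: "FR" if rec.get("QMTYPE") == "REPOS" else "PR"
        for rec in parse_clusqmgr(text)
    }


print(repository_roles(SAMPLE_OUTPUT))
```

Running the command on each full repository and comparing the results shows whether both agree on who the FRs are.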
|
belchman |
Posted: Fri May 18, 2018 5:49 am Post subject: |
|
|
Partisan
Joined: 31 Mar 2006 Posts: 386 Location: Ohio, USA
|
bruce2359,
Communication error.
1) I know MQP1 is a full repository
2) I believe it was made a full repos so that the z/OS admin can see the full cluster
Sorry for confusion _________________ Make three correct guesses consecutively and you will establish a reputation as an expert. ~ Laurence J. Peter |
|
bruce2359 |
Posted: Fri May 18, 2018 11:55 am Post subject: |
|
|
 Poobah
Joined: 05 Jan 2008 Posts: 9469 Location: US: west coast, almost. Otherwise, enroute.
|
belchman wrote: |
bruce2359,
Communication error.
1) I know MQP1 is a full repository
2) I believe it was made a full repos so that the z/OS admin can see the full cluster
Sorry for confusion |
MQ clustering software will only use the first two FRs as FRs. A 3rd FR, like MQP1, will not be used as an FR. _________________ I like deadlines. I like to wave as they pass by.
ב''ה
Lex Orandi, Lex Credendi, Lex Vivendi. As we Worship, So we Believe, So we Live. |
|
fjb_saper |
Posted: Fri May 18, 2018 8:18 pm Post subject: |
|
|
 Grand High Poobah
Joined: 18 Nov 2003 Posts: 20756 Location: LI,NY
|
bruce2359 wrote: |
MQ clustering software will only use the first two FRs as FRs. A 3rd FR, like MQP1, will not be used as an FR. |
Hi Bruce, can you please elaborate? I used to have a cluster with 4 FRs (a transitory phase while moving from AIX to Linux, with 2 FRs on each platform). _________________ MQ & Broker admin |
|
exerk |
Posted: Sat May 19, 2018 1:17 am Post subject: |
|
|
 Jedi Council
Joined: 02 Nov 2006 Posts: 6339
|
bruce2359 wrote: |
MQ clustering software will only use the first two FRs as FRs. A 3rd FR, like MQP1, will not be used as an FR. |
Bruce, where do you get the idea that MQP1 is a 3rd FR?
belchman wrote: |
Aside from that, the cluster that I need to refresh with regularity has 2 fulls and 1 partial. The 2 fulls are an AIX qmgr and a z/OS qmgr (MQP1)... |
_________________ It's puzzling, I don't think I've ever seen anything quite like this before...and it's hard to soar like an eagle when you're surrounded by turkeys. |
|
bruce2359 |
Posted: Sat May 19, 2018 6:11 am Post subject: |
|
|
 Poobah
Joined: 05 Jan 2008 Posts: 9469 Location: US: west coast, almost. Otherwise, enroute.
|
Did I misinterpret this?
belchman wrote: |
Communication error.
2) I believe it was made a full repos so that the z/OS admin can see the full cluster |
Seems like it was a PR before it was made an FR by the z folks so they could see all cluster stuff. So, was MQP1 one of the two original explicit FRs - the FRs that will propagate cluster info? _________________ I like deadlines. I like to wave as they pass by.
ב''ה
Lex Orandi, Lex Credendi, Lex Vivendi. As we Worship, So we Believe, So we Live. |
|
mvic |
Posted: Sun May 20, 2018 7:03 am Post subject: Re: Mother of All Cluster Problems |
|
|
 Jedi
Joined: 09 Mar 2004 Posts: 2080
|
belchman wrote: |
AMQ9469: Update not received for CLUSRCVR channel |
PRs subscribe to FRs for the queues they use and the qmgrs that host those queues.
As long as apps on the PR continue to use a queue / qmgr, the PR itself should continue to renew its subscriptions for the queue and qmgr.
And, in return, the FRs are supposed to send updates to the PR for anything relating to that queue name or qmgr.
I don't know what causes your particular problem, but it could be something like:
- FRs have both "lost" their record of the subscriptions the PR sent to them (unlikely)
- PR neglected to make or remake its subscriptions (unlikely)
- Owner of the queue has been deleted or failed to re-announce itself or its queue (unlikely)
- Messages announcing presence of queue / qmgr or remaking the PR's subscription have been deleted from a SCTQ somewhere by an administrator (unlikely?)
- A DR test was done sometime in the past, and your internal prod sequence numbers are a long way behind what DR increased them to (likely in some cases, but in yours? You said no DR. Did you mean "never", or has there been a DR test sometime in the distant past?).
So all of these ideas are unlikely and quite probably untrue in your particular case. Hopefully IBM will get to the root cause for you.
One more thing: there have been bugs in the past. What levels are you at on the PR and FR? |
|
gbaddeley |
Posted: Sun May 20, 2018 4:49 pm Post subject: |
|
|
 Jedi Knight
Joined: 25 Mar 2003 Posts: 2538 Location: Melbourne, Australia
|
tczielke wrote: |
Personally, I would run the FRs on the same platform and IBM MQ code level. |
I would think this is mandatory for cluster reliability. Also, the FR code level should be equal to or above that of all PR qmgrs in the clusters. We enforce these requirements at our site.
If z/OS folks need a view of all clustered objects, they should use a tool that has access to the FR on distributed platforms. _________________ Glenn |
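The code-level rule above can be checked mechanically once the version of each queue manager is known. This is a minimal Python sketch under assumptions: versions are given as dotted V.R.M.F strings, and every queue manager name and version number in the example is hypothetical, not taken from this environment.

```python
def parse_vrmf(version):
    """Turn a dotted version string such as '8.0.0.5' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def frs_below_prs(fr_versions, pr_versions):
    """Return the FRs whose level is below the highest PR level."""
    if not pr_versions:
        return []
    highest_pr = max(parse_vrmf(v) for v in pr_versions.values())
    return sorted(
        name for name, v in fr_versions.items() if parse_vrmf(v) < highest_pr
    )


def frs_at_same_level(fr_versions):
    """True when every FR reports the same code level."""
    return len({parse_vrmf(v) for v in fr_versions.values()}) <= 1


# Hypothetical inventory; names and levels are invented for illustration.
full_repos = {"FR_A": "8.0.0.5", "FR_B": "9.0.0.2"}
partial_repos = {"PR_A": "8.0.0.9"}

print(frs_below_prs(full_repos, partial_repos))
print(frs_at_same_level(full_repos))
```

Comparing tuples of integers rather than raw strings avoids mis-ordering levels such as 8.0.0.10 and 8.0.0.9.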
|