Author |
Message
|
Old Lag |
Posted: Sat Sep 27, 2014 2:47 am Post subject: Restoring a Full Repository from a backup |
|
|
Novice
Joined: 02 Jul 2014 Posts: 13
|
Has anyone got any experience of connecting a Qmgr with a partial repository (PR) back into a cluster where every other Qmgr, including the full repositories (FRs), has been restored from backups?
We were testing in QA the recommended procedure for moving FRs (define new FRs, interconnect all FRs, demote old FRs to PRs, remove old FRs from the cluster, etc.) and we had successfully tested this.
During the QA test, one of the cluster PRs failed to see the demotion of one of the old FRs, and we then found that this PR had a SYSTEM.CLUSTER.COMMAND.QUEUE (SYS.CLUS.CMD.Q) depth in the hundreds of thousands and growing.
Nothing we tried worked and eventually it was decided to restore the cluster (PRs and FRs) to its state before we started.
(We are pretty certain this MUST be an MQ bug because even if we had screwed up the procedure, we cannot see how that should cause what appeared to be some sort of loop on the cmd.q.)
Unfortunately, we did not backup one of the PRs.
We now have a consistent, working cluster (as it was) but with one member (currently stopped) whose PR state is as it was at the time of backout, i.e. it probably is aware of the new FRs and it may think the old FRs are PRs.
This Qmgr, along with the new FRs, is currently stopped and we are concerned that starting it could cause problems with the rest of the cluster. I know that timestamps are used in PR-FR comms and we now have an FR restored to some days earlier than the PR.
Of course, the PR may well think that the restored FRs are no longer FRs, in which case I guess it can't damage them. And the new FRs it does know about are not running.
We will be raising a PMR with IBM for the original problem (yes, we KNOW that by backing out and restoring we have lost PD info!) and asking for their advice, but we wonder if anyone on this forum could offer their advice/experience?
Thanks in advance. |
|
Old Lag |
Posted: Sat Sep 27, 2014 4:12 am Post subject: |
|
|
Novice
Joined: 02 Jul 2014 Posts: 13
|
I should have said that we are running MQ for Linux (x86-64) Version 7.5.0.2.
Google searches show old problems with the cmd.q filling up and a more recent one that seems to be fixed in 7.5.0.3, but this is speculation right now and we will seek IBM advice.
But in the meantime, since we have (probably) destroyed any useful PD info, we need to get our cluster fully back to the way it was, if only to try to reproduce the problem.
Thus the request here for any advice as to how we might safely re-introduce the Qmgr that has a more recent view of the Cluster state than that held in the FRs.
Thanks again |
|
PeterPotkay |
Posted: Sat Sep 27, 2014 4:55 am Post subject: |
|
|
 Poobah
Joined: 15 May 2001 Posts: 7722
|
If the SYSTEM.CLUSTER.COMMAND.QUEUE (S.C.C.Q.) is piling up, it sounds like the Cluster Repository Manager took a dirt nap. The only way to restart it is to restart the whole queue manager.
Were this my queue manager in a play environment, I would:
#1 Make sure the Cluster Sender channel is properly defined to the correct FR and to the correct cluster.
#2 Make sure the Cluster Receiver is correctly defined.
#3 If it's got old messages piled up on it, clear out the S.C.C.Q.
#4 If the S.C.C.Q. was piling up, that means the repository manager is dead, so restart the QM to get it going again.
#5 Issue the REFRESH CLUSTER command with the REPOS(YES) option to cause this PR to purge any and all old info it has and reintroduce itself to the cluster using the info from #1 and #2.
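For readers who want to see roughly what steps #1 to #5 look like as MQSC, here is a minimal sketch in Python that only builds the command text for `runmqsc`; it does not talk to a queue manager. The cluster and channel names are hypothetical placeholders, not values from this thread. The output is split in two because the restart in step #4 happens outside `runmqsc`, and `CLEAR QLOCAL` may be rejected while the queue is still open.

```python
def build_rejoin_scripts(cluster, clussdr, clusrcvr, clear_command_queue=False):
    """Return (before_restart, after_restart) MQSC text for one partial repository.

    All names passed in are assumptions supplied by the caller.
    Lines starting with '*' are MQSC comments.
    """
    before = [
        "* Step 1: cluster-sender should name the intended FR and cluster",
        f"DISPLAY CHANNEL({clussdr}) CONNAME CLUSTER",
        "* Step 2: cluster-receiver should be defined for the same cluster",
        f"DISPLAY CHANNEL({clusrcvr}) CONNAME CLUSTER",
        "* Step 3: check for a backlog of cluster commands",
        "DISPLAY QLOCAL(SYSTEM.CLUSTER.COMMAND.QUEUE) CURDEPTH",
    ]
    if clear_command_queue:
        # Only requested explicitly: this discards every queued cluster command.
        before.append("CLEAR QLOCAL(SYSTEM.CLUSTER.COMMAND.QUEUE)")

    # Step 4 (restart the queue manager) is done between the two scripts.
    after = [
        "* Step 5: discard the local cluster view and rejoin",
        f"REFRESH CLUSTER({cluster}) REPOS(YES)",
    ]
    return "\n".join(before) + "\n", "\n".join(after) + "\n"


if __name__ == "__main__":
    pre, post = build_rejoin_scripts("QA.CLUSTER", "TO.FR1", "TO.PR9")
    print(pre, end="")
    print(post, end="")
```

Each returned string could be saved to a file and fed to `runmqsc` for the PR's queue manager, the first before the restart and the second after it.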
That's just me, a stranger on the internet, with a play queue manager.
You, dealing with a "real" queue manager, might want to bump that PMR to a Sev 1 and only do what they say. That way if things go from bad to worse you can say you were just following official IBM guidance. _________________ Peter Potkay
Keep Calm and MQ On |
|
Old Lag |
Posted: Sat Sep 27, 2014 8:01 am Post subject: |
|
|
Novice
Joined: 02 Jul 2014 Posts: 13
|
Many thanks for the prompt reply Peter.
When the problem with s.c.c.q occurred it was only with one cluster Qmgr (a PR); other PRs in the cluster were behaving quite normally - i.e. they recognised that we had demoted both old FRs.
As far as we can tell, all sender channels look OK but we'll certainly check the next time round.
So now, we have all Qmgrs in the cluster (FRs and the PR that had the problem) restored and a few tests show all is as it should be. But we have one Qmgr (a PR) that is currently down and which has NOT been restored and which seemed OK during the problem scenario.
If we could get it to come back safely (i.e. without screwing up the restored FRs) then I had certainly thought of REFRESH CLUSTER with REPOS(YES) to get its PR restored. My concern is that before we are able to issue the cmd, we MIGHT already have caused some mayhem in the FRs, because this Qmgr, having NOT been restored, has PR entries of a later date than any in the FRs.
I hope this is making sense.
(This is a QA system and NOT real Production so - at the risk of causing a bit of test outage - we can be savage if we have to be).
Thanks again. |
|
PeterPotkay |
Posted: Sat Sep 27, 2014 10:41 am Post subject: |
|
|
 Poobah
Joined: 15 May 2001 Posts: 7722
|
The REFRESH CLUSTER command doesn't refresh the cluster. When issued on a PR with the REPOS(YES) option it's saying "I am a PR, I am going to purge all knowledge I have of the cluster, and then I am going to reintroduce myself into the cluster."
So make sure the PR looks right for the cluster it's supposed to be in (Steps #1 and #2), make sure its Repository Manager is functioning (Steps #3 and #4 if needed), then tell the PR to forget everything it knew and welcome it to the cluster (Step #5).
Should get you back in business. But you are in a bit of a messy situation. All bets are off. _________________ Peter Potkay
Keep Calm and MQ On |
|
Old Lag |
Posted: Sun Sep 28, 2014 2:42 am Post subject: |
|
|
Novice
Joined: 02 Jul 2014 Posts: 13
|
Many thanks Peter and this is really helpful.
I think the final bit of PR-FR "how it works" info that would completely put my mind at rest is whether there is any danger that, before I get to issue the REFRESH cmd, there might already have been some PR-FR chat that could have screwed up the FR, because the PR may hold Cluster info of a later date than that in the FR.
My understanding of PR-FR comms is that, apart from setting up initial subscriptions, it is always FR -> PR, so even if the PR had info of a later date, there would not automatically be PR -> FR comms.
If anyone can confirm my understanding, then I am as happy as I can be that I can safely bring up the PR Qmgr and issue the REFRESH cmd.
Thanks again, |
|
PeterPotkay |
Posted: Sun Sep 28, 2014 6:34 am Post subject: |
|
|
 Poobah
Joined: 15 May 2001 Posts: 7722
|
PRs do initiate and push communications to a FR.
Create a new queue on the PR, cluster it.
Put Inhibit a clustered queue on the PR.
Uncluster a queue on the PR.
Just change the description of a clustered queue on a PR.
All of these things and lots of others I'm sure cause a PR to push data to its FRs. _________________ Peter Potkay
Keep Calm and MQ On |
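As an illustration of the kinds of change listed in the post above, the following sketch builds the corresponding MQSC commands as plain strings. The queue and cluster names are hypothetical, and the order is rearranged so that the description change comes before the queue is removed from the cluster. It shows only the commands an administrator might run on the PR; it makes no claim about exactly what the PR then sends to its FRs.

```python
def clustered_queue_changes(queue, cluster):
    """Return MQSC commands, run on a partial repository, that alter a
    clustered queue definition. Names are placeholders chosen by the caller."""
    return [
        # Create a new queue on the PR and advertise it to the cluster.
        f"DEFINE QLOCAL({queue}) CLUSTER({cluster})",
        # Put-inhibit the clustered queue.
        f"ALTER QLOCAL({queue}) PUT(DISABLED)",
        # Change only the description of the clustered queue.
        f"ALTER QLOCAL({queue}) DESCR('changed description')",
        # Uncluster the queue by blanking its CLUSTER attribute.
        f"ALTER QLOCAL({queue}) CLUSTER(' ')",
    ]


if __name__ == "__main__":
    for command in clustered_queue_changes("APP.REQUEST", "QA.CLUSTER"):
        print(command)
```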
|
Old Lag |
Posted: Sun Sep 28, 2014 9:43 am Post subject: |
|
|
Novice
Joined: 02 Jul 2014 Posts: 13
|
Thanks again Peter.
But during our testing (before we discovered we had the s.c.c.q error), we definitely did NOT make any changes from the Qmgr (PR) in question (i.e. the one currently down that has not been restored to the state prior to testing start). What its PR probably knows about is the existence of the 2 new FRs we created during our testing, but that's it. We did NOT try to access any existing Cluster Qs from that Qmgr during testing, nor did we create any new ones.
Given the above, would you say that restarting that Qmgr should NOT cause its PR to start comms. with the FR ?
I realise that this is an unfair question to ask on the basis of these brief emails and we WILL be seeking IBM advice before we do anything, but if we were pushed to start the Qmgr before we got clear IBM advice (it is a QA system after all), then we might have to "give it a shot" based on best-can-do advice from MQ professionals. |
|
PeterPotkay |
Posted: Sun Sep 28, 2014 10:04 am Post subject: |
|
|
 Poobah
Joined: 15 May 2001 Posts: 7722
|
Every circumstance in which a PR talks to its FR is one of those MQ-internal things that I just don't know about.
Were it my play QM, I would follow the steps I laid out above. Steps 1 and 2 make sure things are the way they need to be on the PR.
Steps 3 and 4 get rid of those suspect command messages, which may be valid, bogus, corrupt, or obsolete. Whatever, I would blow them all away since you are about to issue REFRESH CLUSTER REPOS(YES) anyway.
Then issue that command to tell that PR to forget anything else it may know about (all the info still in its SYSTEM.CLUSTER.REPOSITORY.QUEUE).
Back in MQ 5.2 and early 5.3 when MQ clustering was flakier than a croissant in Paris we would occasionally need to cold start a cluster because it was a mess. Same steps as above but it would also involve manually clearing out all the messages in the S.C.R.Q. before restarting the QM to manually purge any bogus info. That was a decade ago. But back then this process worked when we needed it to. I would actually add that as step 3.5 were it my play Queue Manager and I was experimenting. I'm not advising you do any of this...just rambling.
Alternatively you may want to delete and recreate the PR QM. Then issue the RESET CLUSTER command from one of the FRs to get rid of any knowledge of the old instance of the rebuilt QM. _________________ Peter Potkay
Keep Calm and MQ On |
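The RESET CLUSTER alternative mentioned in the post above could look roughly like the sketch below, which only formats the MQSC command text to be run on a full repository. The cluster and queue manager names are hypothetical. Targeting by QMID instead of QMNAME is offered as an option on the assumption that a rebuilt queue manager reuses the old name, in which case the name alone would not distinguish the old instance from the new one.

```python
def build_reset_cluster(cluster, qmname=None, qmid=None, remove_queues=True):
    """Return a RESET CLUSTER ... ACTION(FORCEREMOVE) command for use on an FR.

    Exactly one of qmname or qmid must be given; all values are placeholders
    supplied by the caller.
    """
    if (qmname is None) == (qmid is None):
        raise ValueError("give exactly one of qmname or qmid")
    target = f"QMNAME({qmname})" if qmname is not None else f"QMID({qmid})"
    queues = "YES" if remove_queues else "NO"
    return f"RESET CLUSTER({cluster}) ACTION(FORCEREMOVE) {target} QUEUES({queues})"


if __name__ == "__main__":
    print(build_reset_cluster("QA.CLUSTER", qmname="PR9"))
```

The QMID of the old instance, if needed, would have to be read from an FR's own view of the cluster (for example with `DISPLAY CLUSQMGR`) before forming the command.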
|
Old Lag |
Posted: Mon Sep 29, 2014 9:11 am Post subject: |
|
|
Novice
Joined: 02 Jul 2014 Posts: 13
|
Peter,
Many thanks for all your advice and I will keep you (and everyone else) posted on our next steps - especially including the advice from IBM.
This may take a few days but I will update this thread.
Thanks again |
|