MQSeries.net :: View topic - MQ failover issue + SRDF strategies


sm138929

Posted: Wed Feb 06, 2008 4:29 pm Post subject: MQ failover issue + SRDF strategies

Apprentice

Joined: 29 Aug 2007
Posts: 25

Hi,

We are setting up failover for our production MQ server to our DR site. We have created the queue manager objects at the secondary site by copying the /var/mqm data on the SAN using SRDF. If we stop the queue managers at the primary site, bring up the queue managers at the failover site using the data copied from the primary, and have our vendor point their traffic to our backup/failover site over a different VPN, will we get a smooth failover with all the channels and queue managers running at the failover site? Also, during failback, do we need to reverse the SRDF so that the MQ system data and logs are copied back to the primary site, or can we keep the golden copy at the primary as it is, without copying the data back from the failover site? I am a bit confused about these steps because I am new to this. All the application data, however, is picked up from the MQ queues by the application servers and kept in a separate database, so I am concerned only with the MQ-related data.

Please advise.

SM

jsware

Posted: Thu Feb 07, 2008 1:08 am Post subject:

Chevalier

Joined: 17 May 2001
Posts: 455

I'd look at the MC91 SupportPac and at the documentation on "backup queue managers", as these may be what you are looking for.
_________________
Regards
John
The pain of low quality far outlasts the joy of low price.

PeterPotkay

Posted: Thu Feb 07, 2008 10:05 am Post subject:

Poobah

Joined: 15 May 2001
Posts: 7723

How much lag time is there in copying the data from the SAN in DataCenter#1 (DC1) to the SAN in DataCenter#2 (DC2)? If it's more than zero, you have problems in your design.

App A connects to QM1 on Server1 in DC1, does an MQPUT to a local queue, and does a put to a remote queue that causes an MQ channel to start up. The MQ channel's sequence number changes as the message is shipped to some other QM in DC1. App B gets the MQ messages from the local queue a few seconds later.

Disaster strikes. DC1 and its SAN are gone. Due to the lag time in copying the data to SAN2 in DC2, you now have some missing data.

What's the state of your data on the SAN in DC2? Maybe the data related to the first MQPUT was successfully transmitted to the other SAN, but what if the MQGET wasn't? When things come back up, that MQ message is back in the queue. Can all your apps handle duplicate messages? What if an MQPUT happened in DC1, the message was in the queue, and disaster struck before that info was shipped over? Can your apps all tolerate missing messages and/or duplicate messages? The sequence numbers on all your channels will almost certainly be off.

Unless your server is updating both SANs synchronously, which is only possible if the DCs are close to each other and you have the bandwidth (i.e. stretch clusters), you have to deal with potential data loss or duplicate messages.
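To make the scenario above concrete, here is a minimal toy simulation in Python. It is only a sketch under stated assumptions: `Site`, `apply` and `replicate` are hypothetical names invented for this illustration, the "lag" is modelled as the last few operations simply never reaching DC2, and nothing here talks to MQ or SRDF.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    """Toy model of what one SAN knows: queue contents and a channel sequence number."""
    queue: list = field(default_factory=list)
    channel_seq: int = 0


def apply(site, op):
    """Apply one operation (kind, message) to a site's state."""
    kind, msg = op
    if kind == "put":
        site.queue.append(msg)
    elif kind == "get":
        site.queue.remove(msg)
    elif kind == "send":
        # A message went out over a channel, so the sequence number advances.
        site.channel_seq += 1


def replicate(ops, lag):
    """Return DC2's state if the last `lag` operations were never copied over."""
    dc2 = Site()
    for op in ops[:len(ops) - lag]:
        apply(dc2, op)
    return dc2


# What actually happened in DC1 before the disaster (hypothetical workload).
ops = [
    ("put", "order-1"),   # App A puts to the local queue
    ("send", None),       # a channel ships a message, sequence number changes
    ("get", "order-1"),   # App B consumes order-1
    ("put", "order-2"),   # another message arrives
]

dc1 = Site()
for op in ops:
    apply(dc1, op)

dc2_sync = replicate(ops, lag=0)     # synchronous copy: identical to DC1
dc2_lagged = replicate(ops, lag=2)   # the MQGET and the last MQPUT never arrived

print("DC1:       ", dc1)
print("DC2 sync:  ", dc2_sync)
print("DC2 lagged:", dc2_lagged)
```

In the lagged case DC2 still holds order-1, which App B already processed (a duplicate), and has never heard of order-2 (a lost message). With `lag=3` the channel sequence number in DC2 also falls behind DC1.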
_________________
Peter Potkay
Keep Calm and MQ On

sm138929

Posted: Wed Feb 20, 2008 5:11 pm Post subject: MQ data failover between datacenters ...

Apprentice

Joined: 29 Aug 2007
Posts: 25

Hi,

We have been talking to our storage and network teams about the latency for data transferred from DC1 to DC2.
Our plan is as follows.

Pre-failover from DC1 to DC2:
1. Stop the channels to our business partners. Ensure our queues are empty on the DC2 MQ servers.
2. Stop our queue manager.
3. Take a system-level backup of the DC2 queue manager data and log files (/var/mqm and all its subdirectories), as per the backup procedure in the System Administration Guide.
4. Unmount the file system in DC2. There we have a single Solaris box running all three queue managers.
5. Start EMC SRDF (replication between the DC1 and DC2 SANs) and run it until all the data and logs from DC1 are copied to DC2, with DC1 as primary and DC2 as secondary, as per EMC SRDF technology.
6. Stop the sender and receiver channels to our business partner in the DC1 MQ clusters. Check for any messages in the queues still to be picked up by the applications connected to the DC1 queues.
7. Stop the app server connectivity to MQ.
8. Stop the MQIPT service.
9. Stop the Sun Cluster that manages the MQ queue managers.
10. Split the SRDF connectivity to DC2.
11. Take a backup of the MQ data and log files in /var/mqm on the DC1 storage.
12. Unmount the file system on the DC1 MQ boxes.
13. Mount the file system on the DC2 Unix box (there is no Sun Cluster at our DR site).
14. Start the queue managers, ensuring that the VIPs now point to the queue managers in DC2. Make the necessary configuration changes to the channel parameters, because the CONNAME points to a different load balancer VIP in DC2. We have MQIPT and a load balancer in our environment to talk to our business partner.
15. Validate that the load balancer and MQIPT are connecting properly to our MQ cluster queue managers.
16. Ensure that 3DNS is pointing to our DC2 MQ VIPs.
17. Start our app servers in DC2, connect the channels, and check that application traffic flows from our side to our business partner and back.
18. Once done, we need to fail back to our DC1 data center.
19. Stop the MQ channels from the business partner in DC2. Stop app connectivity.
20. Stop the MQIPT service. Ensure all messages are processed.
21. Start the reverse SRDF replication with DC2 as primary and DC1 as secondary. Ensure replication is complete.
22. Stop the queue managers. Take a backup of the MQ data and logs in DC2. Unmount the file system in DC2.
23. Start the MQIPT service in DC1. Mount the file system in DC1. Start the Sun Cluster and the queue managers in DC1, make the channel-specific configuration changes in DC1, start the channels, and connect the app server in DC1.
24. Mount the file system in DC1, start the queue managers, start the channels, and connect the apps.

My question is: are the steps we have devised suitable for failover/failback testing as far as MQ is concerned? We are using SRDF and 3DNS for the first time, so I am mostly concerned with the implementation.

Thanks,
SM

PeterPotkay

Posted: Wed Feb 20, 2008 5:20 pm Post subject:

Poobah

Joined: 15 May 2001
Posts: 7723

You can't call it DR because you can't do any of the steps for DC1 if DC1 is gone. You can't consider it H.A. because it looks like it will take hours. Even if you could do all the steps it looks like everything needs to be functioning and available prior to starting so why bother doing it?

What is this supposed to accomplish?
_________________
Peter Potkay
Keep Calm and MQ On

sm138929

Posted: Wed Feb 20, 2008 6:26 pm Post subject: Good question

Apprentice

Joined: 29 Aug 2007
Posts: 25

Actually, we told our management the same thing, but they said we can consider it a failover site and that we can have 6-7 hours of downtime to check whether this works or not.

Our concern is whether SRDF will work in this test as per our plan.

One question: what is the best way to take an MQ backup and restore it?

Stop the channels, take the queue managers down, take a backup of /var/mqm/data and the MQ logs with subdirectories, and restore it in the same way when required?

Can you suggest any hidden tips? I am very new to MQ backup procedures.
Thanks...

SM

PeterPotkay

Posted: Wed Feb 20, 2008 7:41 pm Post subject:

Poobah

Joined: 15 May 2001
Posts: 7723

Maybe someone else will have a different perspective, or maybe I'm missing something, but your plan is pointless (sorry for the harsh words). It's too slow for H.A. and impossible for a real DR, so what do you prove by executing it?
_________________
Peter Potkay
Keep Calm and MQ On

Vitor

Posted: Thu Feb 21, 2008 2:25 am Post subject: Re: Good question

Grand High Poobah

Joined: 11 Nov 2005
Posts: 26093
Location: Texas, USA

sm138929 wrote:

Actually, we told our management the same thing, but they said we can consider it a failover site and that we can have 6-7 hours of downtime to check whether this works or not.

No you can't, because your plan relies on the primary site being closed down cleanly before the secondary comes up. In either of the likely scenarios (catastrophic hardware failure or catastrophic damage to site) this cannot be the case.

Unless your management have some means of predicting unexpected events with 100% accuracy?

You need to explain to them what a disaster is; most management don't read dictionaries because they find the plot hard to follow. I offer as an example one of my previous employers who had me running not just the MQ but the network & local server hardware. The working day was disrupted when the guy outside trying to find the water pipe with a mechanical digger found the power cable & fibre link running just above it. This really put us out of action, and I spent a good 30 minutes trying to explain to this guy I couldn't "plug the server into a different wall socket" or "repatch some of the cables" or anything because the whole building was dead to the world.

But a mechanical digger flies surprisingly well when it hits a high voltage underground cable and snaps the insulation!

_________________
Honesty is the best policy.
Insanity is the best defence.

jefflowrey

Posted: Thu Feb 21, 2008 2:37 am Post subject:

Grand Poobah

Joined: 16 Oct 2002
Posts: 19981

I keep a small piece of fibre cable in my shoulder bag, in case I am ever stranded on a desert isle.

Once I bury that thing, some fool with a digger is going to come along and cut it.
_________________
I am *not* the model of the modern major general.

Vitor

Posted: Thu Feb 21, 2008 2:43 am Post subject:

Grand High Poobah

Joined: 11 Nov 2005
Posts: 26093
Location: Texas, USA

jefflowrey wrote:

I keep a small piece of fibre cable in my shoulder bag, in case I am ever stranded on a desert isle.

Once I bury that thing, some fool with a digger is going to come along and cut it.

Bury an electric cable next to it. It'll be easier to steal his boat while he's distracted trying to get his digger out of the tree.

_________________
Honesty is the best policy.
Insanity is the best defence.

SAFraser

Posted: Thu Feb 21, 2008 2:33 pm Post subject:

Shaman

Joined: 22 Oct 2003
Posts: 742
Location: Austin, Texas, USA

I think a controlled failover to a production DR site does accomplish a few things. It assures that network access works, that internal & external trading partners can communicate with MQ (firewall danger!!), that stupid little things in the documentation haven't been missed (such as a channel reset if a newbie is executing the plan) and that contact lists are accurate at that point in time. It also confirms that a successful procedure is in place for keeping the two sites exactly the same in terms of MQ objects, channel exits and security.

However, once a planned failover exercise is successful, the task list should be saved as a "planned failover" checklist. Then, a second checklist should be created for an actual disaster. The second checklist assumes that nothing was brought down gracefully, that messages may have been lost, that trading partners have suddenly lost connection to MQ and so forth.

And here's an obvious tip.... don't store your only copy of the checklists at your primary site....

Whether a planned failover exercise yields sufficient benefits to justify the exercise is a matter of debate. For it to be really useful, everything should be failed over (not just the MQ servers). It's an enormous amount of work.

Shirley

sm138929

Posted: Thu Feb 21, 2008 2:44 pm Post subject: Thanks for the tips

Apprentice

Joined: 29 Aug 2007
Posts: 25

Hi,

I understand that this is not really a DR site or an HA site, but maybe our management would like to see that if something odd happens to MQ at the primary site, we can at least bring the MQ servers up at the second site, with SRDF transferring the data to the secondary site. They told me they have plans for HA with Sun Cluster at the DR site.

Regarding MQ backup and recovery, can someone suggest the easiest way to do it in the least possible time? This might be helpful if something happens to our primary site servers and we need to restore them quickly at the same site.

Thanks,
SM

jefflowrey

Posted: Thu Feb 21, 2008 3:31 pm Post subject:

Grand Poobah

Joined: 16 Oct 2002
Posts: 19981

You can't take a backup of a running queue manager.

You may not get the results you expect using SRDF.

You should review the documentation in the HA support packs, and compare it (without looking at the scripts for working with a particular HA product) with what you're trying to do.

You should also review the "backup qmgr" feature in MQ v6. This is almost certainly what you want to do.
_________________
I am *not* the model of the modern major general.

sm138929

Posted: Thu Feb 21, 2008 3:42 pm Post subject: Question

Apprentice

Joined: 29 Aug 2007
Posts: 25

Hi,

I am not planning to take a backup of a running queue manager. We will schedule a monthly/weekly backup by taking the queue manager down, then take a backup of the important files and logs under /var/mqm, check the permissions on each, and restore it when needed into the same directory structure.
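The offline backup and restore described above could be sketched roughly as follows in Python. This is only an illustration under assumptions: `backup_qmgr_files`, `restore_qmgr_files` and the `qmgr_is_running` callback are hypothetical names, and how you actually detect a running queue manager (and which directories you pass in) is left to the caller.

```python
import tarfile
from pathlib import Path


def backup_qmgr_files(dirs, archive_path, qmgr_is_running):
    """Archive the given directories into one gzip tar, but only if the queue manager is down."""
    if qmgr_is_running():
        raise RuntimeError("queue manager is still running; stop it before backing up")
    with tarfile.open(str(archive_path), "w:gz") as tar:
        for d in dirs:
            d = Path(d)
            # arcname keeps each top-level directory name, so data and logs
            # restore side by side under the target root.
            tar.add(str(d), arcname=d.name)
    return archive_path


def restore_qmgr_files(archive_path, target_root, qmgr_is_running):
    """Unpack a backup into target_root, again refusing while the queue manager is up."""
    if qmgr_is_running():
        raise RuntimeError("queue manager is still running; stop it before restoring")
    with tarfile.open(str(archive_path), "r:gz") as tar:
        # The archive is one we created ourselves, so it is treated as trusted here.
        tar.extractall(str(target_root))
```

The tar archive records file modes and ownership, but whether they come back exactly depends on the Python version's extraction rules and on the privileges of the user doing the restore, so checking permissions after a restore, as planned above, is still worthwhile.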

But there are concepts called media backup and system backup in the MQ backup and recovery book from IBM, and that confuses me to some extent.
What is the best practice for MQ backups?

Also, regarding SRDF, why would we not get the desired results if we do a controlled failover/failback with all messages processed?

Please explain so that we can rethink our failover plan, which is happening in our prod environment next weekend.

Thanks,

SM

SAFraser

Posted: Fri Feb 22, 2008 10:39 am Post subject:

Shaman

Joined: 22 Oct 2003
Posts: 742
Location: Austin, Texas, USA

Your idea to back up /var/mqm once a week is relatively useless, unless there is something here I am missing. The rcdmqimg that you mention from the manuals would only be useful if your failover occurred immediately after the image was taken.

You need to ensure that the queue managers in DC1 and DC2 are identical all the time. I myself would not use any type of file copying for that, but your situation may be different. MS03 is your friend.

I mean this in the most helpful way possible: Given the type of questions you are asking, I would do the following to prevent a disaster of your disaster recovery exercise! Have your queue managers in DC2 ready to go. Copy nothing. Do not shut down DC1 until queues are empty. Stop your applications from feeding them and wait for all messages to be consumed. Wait till queues on DC2 are empty before you fail back.
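One way to check "queues are empty" before shutting anything down is to look at the CURDEPTH values reported by an MQSC `DISPLAY QLOCAL(*) CURDEPTH` command. The Python sketch below only parses text that has already been captured; the sample output, the queue names, and the function names are all made up for illustration, and the layout is an assumption modelled on typical runmqsc output.

```python
import re

# Illustrative sample only: hypothetical queue names, layout assumed
# to resemble runmqsc output for DISPLAY QLOCAL(*) CURDEPTH.
SAMPLE_OUTPUT = """\
AMQ8409: Display Queue details.
   QUEUE(APP.REQUEST)                      TYPE(QLOCAL)
   CURDEPTH(0)
AMQ8409: Display Queue details.
   QUEUE(APP.REPLY)                        TYPE(QLOCAL)
   CURDEPTH(3)
"""


def queue_depths(mqsc_output):
    """Map each queue name to its current depth, pairing QUEUE(...) with the CURDEPTH(...) that follows."""
    depths = {}
    current = None
    for line in mqsc_output.splitlines():
        q = re.search(r"\bQUEUE\(([^)]+)\)", line)
        if q:
            current = q.group(1)
        d = re.search(r"\bCURDEPTH\((\d+)\)", line)
        if d and current is not None:
            depths[current] = int(d.group(1))
    return depths


def non_empty_queues(mqsc_output):
    """Return the sorted names of queues that still hold messages."""
    return sorted(name for name, depth in queue_depths(mqsc_output).items() if depth > 0)


if __name__ == "__main__":
    waiting = non_empty_queues(SAMPLE_OUTPUT)
    if waiting:
        print("Do not shut down yet, messages remain on:", ", ".join(waiting))
    else:
        print("All listed queues are empty.")
```

A real listing would probably also include SYSTEM.* and transmission queues, so you would likely want to filter to the application queues you care about before deciding it is safe to proceed.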

Your overall strategy using SRDF needs to be examined more thoroughly in relation to the way MQ processes and logs messages, but I'm not sure you have time for that study before your failover exercise.
