Challenge Question - 07 / 2008
AkankshA
Posted: Thu Jul 10, 2008 2:14 am  Post subject: Challenge Question - 07 / 2008

Grand Master

Joined: 12 Jan 2006
Posts: 1494
Location: Singapore

no challenge
_________________
Cheers
Mehrdad
Posted: Thu Jul 10, 2008 2:32 am

Master

Joined: 27 Feb 2004
Posts: 219
Location: Europe

Running late but being worked on.

Two different challengers are refining their suggested entries. One will post the July Challenge in the next couple of days; the other will be held in reserve and used for August.
Challenger
Posted: Thu Jul 10, 2008 8:36 pm

Centurion

Joined: 31 Mar 2008
Posts: 115

Here comes the July Challenge :

Problem: How do you keep multiple resource managers at a DR site in synchronization with the Primary site when the DR site is WAN distance away? At that distance a synchronous replication scheme is not feasible.

A set of input queues (one or more) is used to place requests for updates/changes to application DB tables. The DB tables are replicated to the remote DR site via a DB utility. The message queue(s) are also replicated by some means (your choice). How do you keep the input queue(s) in sync with the local DB and have the remote DR site reflect this same synchronization? Thus no duplicated messages at the DR site, and minimize (as close to zero as possible) message loss due to replication latency.

For example, 1000 messages are written to the request Q. An application reads the input queue and updates the DB tables in 2PC fashion. Say 100 GETs have been processed, so the local queue now shows 900 messages and the DB tables show 100 updates. How does one reflect this at the DR site, namely 900 messages on the queue and 100 message updates in the DB tables? Pick your resource managers as you please: say MQ for the message server, and DB2, Oracle or whatever for the DB.

The key is that the remote DR site is synchronized, so one just needs to know the last message on the queue not yet processed to know where to pick up. Thus 901 messages on the queue and 99 messages updated is OK as well, or 999 messages on the queue and 1 updated (but not so good, of course!).
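
A minimal sketch of the example numbers above, purely as an illustration. It assumes a single in-memory SQLite database standing in for both resource managers, so that one local transaction plays the role of the 2PC unit of work; the table names `request_q` and `app_table` and the function `run_consumer` are hypothetical, and no MQ or DB2 API is used.

```python
import sqlite3


def run_consumer(total=1000, to_process=100):
    """Simulate the primary site: load `total` requests, process `to_process`."""
    # Assumption: one SQLite database stands in for BOTH the queue manager and
    # the application DB, so a single transaction models the 2PC unit of work.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE request_q (seq INTEGER PRIMARY KEY, body TEXT)")
    conn.execute("CREATE TABLE app_table (seq INTEGER PRIMARY KEY, body TEXT)")

    # Put the requests on the (simulated) input queue.
    with conn:
        conn.executemany(
            "INSERT INTO request_q VALUES (?, ?)",
            [(i, f"request-{i}") for i in range(1, total + 1)],
        )

    for _ in range(to_process):
        # "with conn" commits on success and rolls back on an exception, so the
        # destructive GET and the DB update succeed or fail together.
        with conn:
            row = conn.execute(
                "SELECT seq, body FROM request_q ORDER BY seq LIMIT 1"
            ).fetchone()
            conn.execute("DELETE FROM request_q WHERE seq = ?", (row[0],))
            conn.execute("INSERT INTO app_table VALUES (?, ?)", row)

    depth = conn.execute("SELECT COUNT(*) FROM request_q").fetchone()[0]
    updates = conn.execute("SELECT COUNT(*) FROM app_table").fetchone()[0]
    conn.close()
    return depth, updates


print(run_consumer())  # (900, 100)
```

The primary site is the easy part: every committed unit of work keeps queue depth plus DB updates equal to 1000. The challenge is getting the DR copies of the queue and the tables to show a pair of numbers with the same property.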

Challenge Question Repeated: How do you keep the input queue(s) in sync with the local DB and have the remote DR site reflect this same synchronization? Thus no duplicated messages at the DR site, and minimize (as close to zero as possible) message loss due to replication latency.

Good Luck !
PeterPotkay
Posted: Fri Jul 11, 2008 7:51 am

Jedi Council

Joined: 15 May 2001
Posts: 7466

Challenger wrote:
...when the DR site is WAN distance away?

1. Define this distance please.

2. What is the network latency between the sites?

3. What is the Recovery Point Objective (RPO)? i.e. How much data, in seconds, minutes or hours, can be lost in a true disaster?

4. What is the Recovery Time Objective (RTO)? How much time do we have to get the DR site to meet the RPO when a disaster is declared?

5. How big will the messages be? How many? How fast will they be going thru the queue?

6. Can the applications deal with missing messages in the case of a disaster?

7. Can the applications deal with duplicate messages in the case of a disaster?

8. How much money can you spend on this?

9. Is there a restriction on the platforms, or can it be Windows, UNIX and/or z/OS?

10. Do you need to fail over to the DR site automatically? Or does a human have to make a conscious decision to declare DR, and then manually kick off an automated process to fail over?

11. Is there a requirement for H.A. in the primary data center? In the DR data center?


Quite a challenge you've proposed!
_________________
Peter Potkay
Keep Calm and MQ On
Challenger
Posted: Fri Jul 11, 2008 11:15 am  Post subject: reply to PP questions

Centurion

Joined: 31 Mar 2008
Posts: 115

All the correct questions!

1.) 1000's of miles, outside any metro cluster distance (62 mi max)
2.) second to multi-second to multi-multi-second, not a determinate bandwidth availability
3.) As close to zero as possible ... last transaction update on local system lost on replication is considered desired target
4.) Only the time it takes to start up the application; hardware in warm standby, data expected to be in place from replication
5a.) Bulk ~20K, over time more megabyte to multi-10's of MB in size
5b&c.) Low throughput requirements, single-digit to low double-digit msgs/sec, high payload value
6.) Application (yes), Potential consequences (yes) but ... see later comments
7.) Application (yes), Potential consequences (yes), thus yes, although there are still situations where dups can have negative results and should be avoided if possible
8.) Proving viability of the solution is key here ... but for solution placement, single to low millions (remember, duplicate hardware is in place at DR); this is additional money just for the challenge solution.
9.) Unix platforms (Solaris and/or Linux)
10.) Human, conscious decision, manual kick-off
11.) Yes, Yes


Last edited by Challenger on Mon Jul 14, 2008 9:32 am; edited 1 time in total
PeterPotkay
Posted: Fri Jul 11, 2008 11:45 am  Post subject: Re: reply to PP questions

Jedi Council

Joined: 15 May 2001
Posts: 7466

So, we have a problem here:
Challenger wrote:

2.) second to multi-second to multi-multi-second, not a determinate bandwidth availability

Let's call it 10 seconds just to have a number. App A puts the message on the queue at 12:00:00. It commits it and receives an MQRC of 0 for both the MQPUT and the MQCMIT. App A is satisfied that at 12:00:00 MQ has its message. In this example App B doesn't grab the message until the bottom of the hour. But the DR datacenter won't see the message until 12:00:10 due to network latency, agreed?

Challenger wrote:

6.) Application (yes), Potential consequences(NO) thus no missing messages

If disaster strikes at 12:00:01, we've lost that message. The primary datacenter blew up after App A got a successful return code for its MQPUT + MQCMIT, but before the asynchronous replication asynchronously started to ship the data to the DR data center.

Hmmm, how can we solve this?

Or are the requirements going to be altered so that the RPO is >= the agreed upon latency, in this case 10 seconds? Or maybe you can live with a couple of missing messages if a whole datacenter goes bye-bye?
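
A minimal sketch of the timing argument above, under a simplifying assumption that is not in the thread: a fixed replication latency, so a commit made at time t is safe at the DR site from t + latency onwards. The names `lost_messages` and `meets_rpo` are hypothetical.

```python
from datetime import datetime, timedelta


def lost_messages(commit_times, latency, disaster_time):
    """Return the commits that had not reached the DR site when disaster struck.

    Assumption: fixed latency, so a commit at time t arrives at DR at t + latency.
    """
    return [t for t in commit_times if t + latency > disaster_time]


def meets_rpo(latency, rpo):
    """With asynchronous replication the loss window is the latency itself,
    so the RPO can only be met if it is at least as large as the latency."""
    return rpo >= latency


noon = datetime(2008, 7, 11, 12, 0, 0)
latency = timedelta(seconds=10)

# App A commits at 12:00:00 and disaster strikes at 12:00:01: the message is lost.
print(lost_messages([noon], latency, noon + timedelta(seconds=1)))

# If disaster strikes at 12:00:10 or later, the message has already arrived.
print(lost_messages([noon], latency, noon + timedelta(seconds=10)))  # []

print(meets_rpo(latency, rpo=timedelta(seconds=0)))   # False
print(meets_rpo(latency, rpo=timedelta(seconds=10)))  # True
```

With the indeterminate latency described earlier in the thread the window is not fixed, so in practice the loss bound would be the worst-case lag rather than a single number.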

99.9% of the time the queues are empty as getters and putters are passing messages almost instantaneously. The real vulnerability is for messages that sit in queues. For our customer, is this a concern? How often do they have messages sitting in queues? Can they easily reproduce these messages? When the whole place blows up, are all the other technologies marching to the same RPO? There is no point in the MQ team spending millions for a low RPO if all the surrounding technologies have an RPO of minutes or hours, and the applications are going to rely on their own checks and balances to reconcile back to the last hour (day?) after a disaster anyway.

Does the customer want to spend millions on a very complex solution that will never achieve a zero RPO anyway? Or would they rather spend less on a robust system that will actually work as designed in DR and will get them up and running quickly? When your whole world just went down the tubes, what's more important - getting up and running again, doors open for business, or ensuring you had every last MQ message that happened to be sitting in a queue? Remember, we are not designing for H.A. here, that can be spanned across 30, 40 miles. You are trying to protect against a regional disaster where everything in a 50 mile radius is out of commission.

Consider these questions and points food for thought for you, the hypothetical customer. Let's get our REAL requirements defined first, then the solution. Right now I consider your first post a wish list. Maybe that will be the final list of requirements, maybe not.
_________________
Peter Potkay
Keep Calm and MQ On
fjb_saper
Posted: Fri Jul 11, 2008 12:35 pm

Grand Poobah

Joined: 18 Nov 2003
Posts: 19770
Location: LI,NY

And please keep in mind that you may well go with a different model altogether.

MQ being asynchronous... let's view the picture...

Would pub/sub do it for you?
a) messages piling up / left in DB inputq
b) moving back to primary from recovery
c) synchronization to the message or synchronization to the sender app state?

Enjoy
_________________
MQ & Broker admin
Challenger
Posted: Fri Jul 11, 2008 12:49 pm  Post subject: teams

Centurion

Joined: 31 Mar 2008
Posts: 115

Try not to think from an MQ team vs other team perspective ... I said resource managers ... not restricting to MQ ... but as this is an MQ forum it would be good to see an MQ solution.

Quote:
"Remember, we are not designing for H.A. here, that can be spanned across 30, 40 miles. ..."


Are you giving the requirements now?

I said would like to shoot for
Quote:
"last transaction update on local system lost on replication is considered desired target"


In your well analysed scenario ...
Quote:
"If disaster strikes at 12:00:01, we've lost that message. The primary datacenter blew up after App A got a successful return code for its MQPUT + MQCMIT, but before the asynchronous replication asynchronously started to ship the data to the DR data center. "


Is there any pre-supposition in this dilemma??

And yes, messages can be reproduced in some cases but not all. So best not to lose any ... thus that is the target ... perhaps not achievable...

So you can ask for a number but there is no REAL acceptable number !!

Right now I consider your response a parochial perspective of the total solution space.
Challenger
Posted: Fri Jul 11, 2008 2:13 pm  Post subject: response to FJb

Centurion

Joined: 31 Mar 2008
Posts: 115

Quote:
Would pub/sub do it for you?
a) messages piling up / left in DB inputq
b) moving back to primary from recovery
c) synchronization to the message or synchronization to the sender app state?


Pub/sub would work to get request message across to DR site, where DR site hangs a sub for request topic

Not following your b & c steps ... how would you deplete the input Q at primary, update the DB, and keep synchronized with the DR site? ... need more explanation
PeterPotkay
Posted: Fri Jul 11, 2008 2:32 pm  Post subject: Re: response to FJb

Jedi Council

Joined: 15 May 2001
Posts: 7466

Challenger wrote:

Pub/sub would work to get request message across to DR site

Not if the disaster strikes after the publisher publishes but before the Broker pushes it out all the way to the DR Broker.

When dealing with asynchronicity (is that a word?!) and disasters you have the potential for message loss, even of committed persistent messages. There is no way around this fact.
_________________
Peter Potkay
Keep Calm and MQ On
Challenger
Posted: Fri Jul 11, 2008 6:46 pm  Post subject: restatement

Centurion

Joined: 31 Mar 2008
Posts: 115

We're getting hung up on no missing messages ... here is the original challenge statement ...

Quote:
"minimize (as close to zero as possible) message loss due to replication latency. "


So my response to 6 should have just restated the above phrase. Replace it with that.

Better to think in transactions: with n transactions locally, shoot for n-1 synchronized transactions at the DR site as the best-case target ... as in the example, 901 msgs on queue and 99 updates in the database reflected at the DR site, even though there were 900 msgs on queue and 100 updates on the primary site before disaster struck.

The aim is to need to replay just the last transactions and go forward. Indeterminate latency may cause this to be one transaction or multiple, but the key is that the Q and DB are in sync at the DR site.

Source input requests can be redone in most but not all cases ... so we need to minimize that need, and duplication can cause problems so it needs to be minimized as well.

And yes goal is to be up and operational quickly at DR site but with all the above conditions optimized to desired goals for reasons given.

Hope that clarifies ... so no more issue with 'no message loss' !
fjb_saper
Posted: Sat Jul 12, 2008 4:28 pm

Grand Poobah

Joined: 18 Nov 2003
Posts: 19770
Location: LI,NY

Ok so let me give you more details on b) and c)

At point of disaster declaration you have
m) messages on inputq to DB
n) messages on inputq to Recovery DB

So you switch over to the Recovery DB. Note that because of asynchronicity of the messaging system you have an m-n difference in state of your DB

During Disaster your messages continue to pile up (m+k)

So at the end of the Disaster you need to continue processing before switching the DB in order for the regular DB to catch up with the disaster one... Your state is:
regular site: m+k
disaster site n+l-s

You want to wait until
m+k+v-o (regular) is very near to n+l-s+t-u (recovery)

In other words before switching DB back to the normal site you will have to wait until it has caught up with the work that piled up while it was down. Thus the question about synchronicity on the message or on the app state....
Note that depending on the app it might be one and the same thing...
_________________
MQ & Broker admin
Challenger
Posted: Mon Jul 14, 2008 8:57 am  Post subject: Db rep not tied to Msg rep

Centurion

Joined: 31 Mar 2008
Posts: 115

Quote:
"Note that because of asynchronicity of the messaging system you have an m-n difference in state of your DB "


Not necessarily. If messages are replicated independently from the DB updates, there is no guarantee that an m-n difference in msgs on the Qs is reflected in the DBs.

DBs may be up to date and messages may not be, or messages may be up to date and DB updates may not be ... it is not merely a simple asynchronicity delay problem.

Also, messages are not building up at Primary to get m+k once disaster hits. DR is assuming true DR, with nothing happening at primary. RPO and RTO are for the backup DR site, before any new work can continue.

How can the DR site start up at m-1 (at best) and go forward with the DB reflecting m-1 messages? Or, to state the original challenge another way:

Let the msg Q show some number N of messages and the DB reflect some number M of update messages at some point in time.
(Example using the same challenge numbers: N=900, M=100. We want any pair with N+M=1000 reflected at the DR site ... ideally N=900 and M=100 (perfection), next best N+1 and M-1, next best N+2 and M-2, etc.)
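
The N+M condition above can be sketched as a simple check on a DR snapshot. This is an illustration only: the function name `check_dr_snapshot` and the status labels are hypothetical, and the check works purely on counts.

```python
def check_dr_snapshot(total, primary_depth, dr_depth, dr_updates):
    """Classify a DR snapshot against the N + M = total condition.

    total          messages originally put to the request queue (1000 in the example)
    primary_depth  queue depth at the primary when disaster struck (N at primary)
    dr_depth       queue depth seen at the DR site (N at DR)
    dr_updates     DB updates seen at the DR site (M at DR)
    """
    accounted = dr_depth + dr_updates
    if accounted < total:
        # A message is gone from the DR queue but its update never arrived.
        return ("message_lost", None)
    if accounted > total:
        # A message is still queued at DR although its update is already in
        # the DR database: replaying it would apply the update twice.
        return ("duplicate_risk", None)
    # Queue and DB agree; the lag is how many transactions DR is behind.
    return ("consistent", dr_depth - primary_depth)


print(check_dr_snapshot(1000, 900, 900, 100))  # ('consistent', 0)
print(check_dr_snapshot(1000, 900, 901, 99))   # ('consistent', 1)
print(check_dr_snapshot(1000, 900, 900, 99))   # ('message_lost', None)
print(check_dr_snapshot(1000, 900, 901, 100))  # ('duplicate_risk', None)
```

The last two cases are what independent replication of the queue and the DB can produce: each copy is internally valid, but the pair is not. A count-based check is necessary but not sufficient, since it does not prove that the messages left on the DR queue are the ones whose updates are missing from the DR database.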

Also, do not factor in a scenario of DR going back to primary! Primary is out for an unknown duration.
fjb_saper
Posted: Mon Jul 14, 2008 4:47 pm

Grand Poobah

Joined: 18 Nov 2003
Posts: 19770
Location: LI,NY

OK... Here I thought you were trying to keep 2 DBs synchronized, to use one in normal traffic and the other in DR.

Apparently this is quite a bit more complex.
Quote:
Not necessarily. If messages are replicated independently from the DB updates, there is no guarantee that an m-n difference in msgs on the Qs is reflected in the DBs.


Then what is the point of replicating the messages?

Differences between sites
Assumptions: 1000 msgs put to queue, curdepth = 900
There is no guarantee that the same 100 have been consumed, especially if there are multiple routes from source to target, or if the receiving queues are mismatched (priority vs FIFO).
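
A minimal sketch of the priority vs FIFO mismatch, as an illustration only. It assumes each message carries a priority from 0 to 9 with higher numbers delivered first; the message IDs, the priority assignment, and the function names are hypothetical.

```python
import heapq
from collections import deque


def consume_fifo(messages, count):
    """Consume `count` messages in arrival order; return the consumed IDs."""
    queue = deque(messages)
    return {queue.popleft()[0] for _ in range(count)}


def consume_by_priority(messages, count):
    """Consume `count` messages, highest priority first, FIFO within a priority."""
    # heapq is a min-heap, so the priority is negated; the arrival index
    # breaks ties so that equal-priority messages keep their arrival order.
    heap = [(-prio, arrival, msg_id)
            for arrival, (msg_id, prio) in enumerate(messages)]
    heapq.heapify(heap)
    return {heapq.heappop(heap)[2] for _ in range(count)}


# 1000 messages as (msg_id, priority); the priority pattern is made up.
messages = [(i, i % 10) for i in range(1, 1001)]

fifo_done = consume_fifo(messages, 100)          # site with a FIFO queue
prio_done = consume_by_priority(messages, 100)   # site with a priority queue

# Both sites now report curdepth = 900, yet they processed different messages.
print(len(fifo_done), len(prio_done))  # 100 100
print(len(fifo_done & prio_done))      # 10
```

Both queues show the same depth, so comparing depths alone would not reveal that the two sites have applied different updates.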

You would expect that once all the messages have been consumed the state of the DBs in both sites would be equal...


_________________
MQ & Broker admin
sridhsri
Posted: Mon Jul 14, 2008 8:09 pm

Master

Joined: 19 Jun 2008
Posts: 297

I have never done DR, nor have I read up on any material on how to do so, and I am definitely oversimplifying the issue when I propose this. But could you please tell me what possible disadvantages you would see if:

1) We used linear logging for the Qmgr (and used all the usual scripts that come with that for maintenance)
2) We used archival logging for DB2, or similar techniques for other databases

and used disk mirroring on the file systems between the primary site and the DR site?