Author |
Message
|
AkankshA |
Posted: Thu Jul 10, 2008 2:14 am Post subject: Challenge Question - 07 / 2008 |
|
|
 Grand Master
Joined: 12 Jan 2006 Posts: 1494 Location: Singapore
|
no challenge  _________________ Cheers |
|
|
 |
Mehrdad |
Posted: Thu Jul 10, 2008 2:32 am Post subject: |
|
|
Master
Joined: 27 Feb 2004 Posts: 219 Location: Europe
|
Running late but being worked on.
Two different challengers are refining their suggested entries; one will post the July Challenge in the next couple of days. The other will be reserved and used for August. |
|
|
 |
Challenger |
Posted: Thu Jul 10, 2008 8:36 pm Post subject: |
|
|
 Centurion
Joined: 31 Mar 2008 Posts: 115
|
Here comes the July Challenge :
Problem: How do you keep multiple resource managers at a DR site in synchronization with the Primary site when the DR site is WAN distance away? This means that a synchronous replication scheme is not feasible.
A set of input queues (one or more) is used to place requests for updates/changes to application DB tables. DB tables are replicated to the remote DR site via a DB utility. Message queue(s) are also replicated by some means (your choice). How to keep Input queue(s) in sync with local DB and have remote DR site reflect this same synchronization? Thus no duplicated messages at DR site and minimize (as close to zero as possible) message loss due to replication latency.
For example, 1000 messages are written to the request Q. An application, in 2PC fashion, reads the input queue and updates DB tables. Say 100 GETs are processed. The local queue now shows 900 messages and the DB tables show 100 updates. How does one reflect this at the DR site, namely 900 messages on the queue and 100 message updates in the DB tables? Pick your resource managers as you please: say MQ for the message server, DB2 or Oracle for the DB, whatever.
The key is that the remote DR site is synchronized and one just needs to know the last message on the queue not yet processed for where to pick up. Thus 901 messages on queue and 99 messages updated is OK as well, or 999 messages on queue and 1 updated (but not so good of course!!).
Challenge Question Repeated: How to keep Input queue(s) in sync with local DB and have remote DR site reflect this same synchronization? Thus no duplicated messages at DR site and minimize (as close to zero as possible) message loss due to replication latency.
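To make the target condition concrete, here is a minimal sketch of what "in sync at the DR site" could mean if every request message carries a unique ID. The `SiteState` class and `check_dr_state` function are hypothetical names used only for illustration; they are not part of MQ, DB2, or any replication product, and the challenge itself does not prescribe message IDs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteState:
    """Snapshot of one site (hypothetical model, not an MQ or DB2 structure)."""
    queued: frozenset   # IDs of request messages still on the input queue
    applied: frozenset  # IDs of request messages already applied to the DB


def check_dr_state(all_ids, dr):
    """Classify a DR snapshot against the full set of request message IDs."""
    all_ids = frozenset(all_ids)
    duplicated = dr.queued & dr.applied        # would be processed twice after failover
    lost = all_ids - (dr.queued | dr.applied)  # neither queued nor applied at DR
    return {
        "duplicated": sorted(duplicated),
        "lost": sorted(lost),
        "in_sync": not duplicated and not lost,
    }


all_ids = range(1, 1001)
# DR one transaction behind primary: 901 on queue, 99 applied (acceptable).
behind = SiteState(frozenset(range(100, 1001)), frozenset(range(1, 100)))
# DB replica ahead of queue replica: message 100 is applied AND still queued.
db_ahead = SiteState(frozenset(range(100, 1001)), frozenset(range(1, 101)))
# Queue replica ahead of DB replica: message 100 is in neither place.
queue_ahead = SiteState(frozenset(range(101, 1001)), frozenset(range(1, 100)))

for name, state in [("behind", behind), ("db_ahead", db_ahead), ("queue_ahead", queue_ahead)]:
    print(name, check_dr_state(all_ids, state))
```

The second and third snapshots are the two failure modes the challenge wants to avoid: a duplicate when the DB replica runs ahead of the queue replica, and a loss when the queue replica runs ahead of the DB replica.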
Good Luck ! |
|
|
 |
PeterPotkay |
Posted: Fri Jul 11, 2008 7:51 am Post subject: |
|
|
 Poobah
Joined: 15 May 2001 Posts: 7722
|
Challenger wrote: |
...when the DR site is WAN distance away? |
1. Define this distance please.
2. What is the network latency between the sites?
3. What is the Recovery Point Objective (RPO)? i.e. How much data, in seconds, minutes, hours, can be lost in a true disaster?
4. What is the Recovery Time Objective (RTO)? How much time do we have to get the DR site to meet the RPO when a disaster is declared?
5. How big will the messages be? How many? How fast will they be going thru the queue?
6. Can the applications deal with missing messages in the case of a disaster?
7. Can the applications deal with duplicate messages in the case of a disaster?
8. How much money can you spend on this?
9. Is there a restriction on the platforms, or can it be Windows, UNIX and/or z/OS?
10. Do you need to fail over to the DR site automatically? Or does a human have to make a conscious decision to declare DR, and then manually kick off an automated process to fail over?
11. Is there a requirement for H.A. in the primary data center? In the DR data center?
Quite a challenge you've proposed!  _________________ Peter Potkay
Keep Calm and MQ On |
|
|
 |
Challenger |
Posted: Fri Jul 11, 2008 11:15 am Post subject: reply to PP questions |
|
|
 Centurion
Joined: 31 Mar 2008 Posts: 115
|
All the correct questions!
1.) 1000's of miles, outside any metro cluster distance (62 mi max)
2.) second to multi-second to multi-multi-second, not a determinate bandwidth availability
3.) As close to zero as possible ... last transaction update on local system lost on replication is considered desired target
4.) Only the time it takes to start up the application; hardware in warm standby, data expected to be in place from replication
5.a) Bulk ~20K; over time, more at a megabyte to multi-10's of MB in size
5 b&c.) Low throughput requirements, single digit to low double digit msgs/sec, high payload value
6.) Application (yes), Potential consequences (yes) but ... see later comments
7.) Application (yes), Potential consequences (yes), thus yes, although there are still situations where dups can have negative results and should be avoided if possible
8.) Proving viability of the solution is key here ... but for solution placement, single to low millions (remember, duplicate hardware is in place at DR); this is additional money just for the challenge solution.
9.) Unix platforms (Solaris and/or Linux)
10.) Human, conscious decision, manual kick off
11.) Yes, Yes
Last edited by Challenger on Mon Jul 14, 2008 9:32 am; edited 1 time in total |
|
|
 |
PeterPotkay |
Posted: Fri Jul 11, 2008 11:45 am Post subject: Re: reply to PP questions |
|
|
 Poobah
Joined: 15 May 2001 Posts: 7722
|
So, we have a problem here:
Challenger wrote: |
2.) second to multi-second to multi-multi-second, not a determinate bandwidth availability
|
Let's call it 10 seconds just to have a #. App A puts the message on the queue at 12:00:00. It commits it and receives an MQRC of 0 for both the MQPUT and the MQCMIT. App A is satisfied that at 12:00:00 MQ has its message. In this example App B doesn't grab the message until the bottom of the hour. But the DR datacenter won't see the message until 12:00:10 due to network latency, agreed?
Challenger wrote: |
6.) Application (yes), Potential consequences(NO) thus no missing messages
|
If disaster strikes at 12:00:01, we've lost that message. The primary datacenter blew up after App A got a successful return code for its MQPUT + MQCMIT, but before the asynchronous replication asynchronously started to ship the data to the DR data center.
Hmmm, how can we solve this?
Or are the requirements going to be altered so that the RPO is >= the agreed upon latency, in this case 10 seconds? Or maybe you can live with a couple of missing messages if a whole datacenter goes bye-bye?
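The timeline above can be worked through numerically. This is a minimal sketch under a simplifying assumption that is not stated in the thread: every committed message becomes visible at the DR site exactly `replication_lag` seconds after its commit. The function name `messages_lost` is made up for illustration, and times are seconds after 12:00:00.

```python
def messages_lost(commit_times, replication_lag, disaster_time):
    """Return the commit times of messages that never reached the DR site.

    A message committed at time t is assumed to arrive at DR at
    t + replication_lag. It is lost if the primary site is destroyed
    after the commit but before that arrival.
    """
    return [
        t for t in commit_times
        if t <= disaster_time < t + replication_lag
    ]


# App A commits at 12:00:00 (t=0), lag is 10 s, disaster at 12:00:01 (t=1).
print(messages_lost([0], replication_lag=10, disaster_time=1))   # [0]
# Same commit, but the disaster strikes at 12:00:10: the message made it.
print(messages_lost([0], replication_lag=10, disaster_time=10))  # []
```

Under this model the exposure window for each message equals the replication lag, which is why the achievable RPO cannot be smaller than the latency.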
99.9% of the time the queues are empty as getters and putters are passing messages almost instantaneously. The real vulnerability is for messages that sit in queues. For our customer, is this a concern? How often do they have messages sitting in queues? Can they easily reproduce these messages? When the whole place blows up, are all the other technologies marching to the same RPO? There is no point in the MQ team spending millions for a low RPO if all the surrounding technologies have an RPO of minutes or hours, and the applications are going to rely on their own checks and balances to reconcile back to the last hour (day?) after a disaster anyway.
Does the customer want to spend millions on a very complex solution that will never achieve a zero RPO anyway? Or would they rather spend less on a robust system that will actually work as designed in DR and will get them up and running quickly? When your whole world just went down the tubes, what's more important - getting up and running again, doors open for business, or ensuring you had every last MQ message that happened to be sitting in a queue? Remember, we are not designing for H.A. here, that can be spanned across 30, 40 miles. You are trying to protect against a regional disaster where everything in a 50 mile radius is out of commission.
Consider these questions and points food for thought for you, the hypothetical customer. Let's get our REAL requirements defined first, then solution. Right now I consider your first post a wish list. Maybe that will be the final list of requirements, maybe not. _________________ Peter Potkay
Keep Calm and MQ On |
|
|
 |
fjb_saper |
Posted: Fri Jul 11, 2008 12:35 pm Post subject: |
|
|
 Grand High Poobah
Joined: 18 Nov 2003 Posts: 20756 Location: LI,NY
|
And please keep in mind that you may well go with a different model at all.
MQ being asynchronous... let's view the picture...
Would pub/sub do it for you?
a) messages piling up / left in DB inputq
b) moving back to primary from recovery
c) synchronization to the message or synchronization to the sender app state?
Enjoy  _________________ MQ & Broker admin |
|
|
 |
Challenger |
Posted: Fri Jul 11, 2008 12:49 pm Post subject: teams |
|
|
 Centurion
Joined: 31 Mar 2008 Posts: 115
|
Try not thinking from an MQ team vs other team perspective ... I said resource managers ... not restricting to MQ ... but as this is an MQ forum it would be good to see an MQ solution.
Quote: |
"Remeber, we are not designing for H.A. here, that can be spanned across 30, 40 miles. ..." |
Are you giving the requirements now?
I said would like to shoot for
Quote: |
"last transaction update on local system lost on replication is considered desired target" |
In your well analysed scenario ...
Quote: |
"If disaster strikes at 12:00:01, we've lost that message. The primary datacenter blew up after App A got a successful return code for its MQPUT + MQCMIT, but before the asynchronous replication asynchronously started to ship the data to the DR data center. " |
Is there any pre-supposition in this dilemma??
And yes, messages can be reproduced in some cases but not all. So best not to lose any ... thus that is the target ... perhaps not achievable...
So you can ask for a number but there is no REAL acceptable number !!
Right now I consider your response a parochial perspective of the total solution space. |
|
|
 |
Challenger |
Posted: Fri Jul 11, 2008 2:13 pm Post subject: response to FJb |
|
|
 Centurion
Joined: 31 Mar 2008 Posts: 115
|
Quote: |
Would pub/sub do it for you?
a) messages piling up / left in DB inputq
b) moving back to primary from recovery
c) synchronization to the message or synchronization to the sender app state? |
Pub/sub would work to get request message across to DR site, where DR site hangs a sub for request topic
Not following your b & c steps ... how would you deplete the input Q at primary, update the DB, and keep synchronized with the DR site?? ... need more explanation |
|
|
 |
PeterPotkay |
Posted: Fri Jul 11, 2008 2:32 pm Post subject: Re: response to FJb |
|
|
 Poobah
Joined: 15 May 2001 Posts: 7722
|
Challenger wrote: |
Pub/sub would work to get request message across to DR site |
Not if the disaster strikes after the publisher publishes but before the Broker pushes it out all the way to the DR Broker.
When dealing with asynchronicity (is that a word?!) and disasters you have the potential for message loss, even for committed persistent messages. There is no way around this fact. _________________ Peter Potkay
Keep Calm and MQ On |
|
|
 |
Challenger |
Posted: Fri Jul 11, 2008 6:46 pm Post subject: restatement |
|
|
 Centurion
Joined: 31 Mar 2008 Posts: 115
|
We're getting hung up on no missing messages ... here is the original challenge statement ...
Quote: |
"minimize (as close to zero as possible) message loss due to replication latency. " |
So my response to 6 should have just restated the above phrase. Replace it with that.
Better to think in transactions: n transactions locally, shooting for n-1 synchronized transactions at the DR site as the best-case target ... as given in the example, 901 msgs on queue and 99 updates in database reflected at the DR site, even though there were 900 msgs on queue and 100 updates at the primary site before disaster struck.
The aim is to need to replay just the last transactions and go forward. Indeterminate latency may cause this to be one transaction or multiple, but the key is that Q and DB are in sync at the DR site.
Source input requests can be redone in most but not all cases ... so the need for that should be minimized, and duplication can cause problems so it needs to be minimized as well.
And yes goal is to be up and operational quickly at DR site but with all the above conditions optimized to desired goals for reasons given.
Hope that clarifies ... so no more issue with 'no message loss' ! |
|
|
 |
fjb_saper |
Posted: Sat Jul 12, 2008 4:28 pm Post subject: |
|
|
 Grand High Poobah
Joined: 18 Nov 2003 Posts: 20756 Location: LI,NY
|
Ok so let me give you more details on b) and c)
At point of disaster declaration you have
m) messages on inputq to DB
n) messages on inputq to Recovery DB
So you switch over to the Recovery DB. Note that because of asynchronicity of the messaging system you have an m-n difference in state of your DB
During Disaster your messages continue to pile up (m+k)
So at the end of the Disaster you need to continue processing before switching the DB in order for the regular DB to catch up with the disaster one... Your state is:
regular site: m+k
disaster site: n+l-s
You want to wait until
m+k+v-o (regular) is very near to n+l-s+t-u (recovery)
In other words before switching DB back to the normal site you will have to wait until it has caught up with the work that piled up while it was down. Thus the question about synchronicity on the message or on the app state....
Note that depending on the app it might be one and the same thing... _________________ MQ & Broker admin |
|
|
 |
Challenger |
Posted: Mon Jul 14, 2008 8:57 am Post subject: Db rep not tied to Msg rep |
|
|
 Centurion
Joined: 31 Mar 2008 Posts: 115
|
Quote: |
"Note that because of asynchronicity of the messaging system you have an m-n difference in state of your DB " |
Not necessarily; if messages are replicated independently from the DB updates, there is no guarantee that an m-n difference in msgs on the Qs is reflected in the DBs.
DBs may be up to date and messages may not be, or messages may be up to date and DB updates may not be .... it is not merely a simple asynchronicity delay problem.
Also, messages are not building up at Primary to get m+k once disaster hits; DR is assuming true DR, nothing happening at primary. RPO and RTO are for the backup DR site, before any new work can continue.
How can DR site start up at m-1 (at best) and go forward with DB reflecting m-1 messages or to state original challenge another way,
let the msg Q show some N number of messages and DB reflects some M number of update messages at some point in time.
(ex using same challenge numbers say: N=900, M=100, want any pair for N+M=1000 reflected at the DR site. ... ideally N=900 and M=100 perfection, next best N+1 and M-1, next best N+2 and M-2, etc. ....)
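The preference order in that example can be written down as a small ranking function. This is only a sketch of the stated goal, not a replication mechanism; the name `rank_dr_states` and its parameters are hypothetical.

```python
def rank_dr_states(total, primary_applied, candidates):
    """Sort candidate (queue_depth, db_updates) pairs from best to worst.

    A pair is acceptable only if queue_depth + db_updates == total, meaning
    the queue and the DB at the DR site agree with each other. Pairs that
    claim more DB updates than the primary ever made are rejected as well.
    Acceptable pairs are ordered by how many transactions must be replayed.
    """
    valid = [
        (n, m) for n, m in candidates
        if n + m == total and 0 <= m <= primary_applied
    ]
    return sorted(valid, key=lambda pair: primary_applied - pair[1])


candidates = [(999, 1), (901, 99), (900, 100), (900, 99), (902, 98)]
print(rank_dr_states(total=1000, primary_applied=100, candidates=candidates))
# [(900, 100), (901, 99), (902, 98), (999, 1)]
```

The pair (900, 99) is dropped because it sums to 999: one message is neither on the queue nor in the DB, which is exactly the out-of-sync state that independent replication of queue and DB can produce.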
Also do not factor a DR going back to primary scenario !!?? Primary is out for unknown duration. |
|
|
 |
fjb_saper |
Posted: Mon Jul 14, 2008 4:47 pm Post subject: |
|
|
 Grand High Poobah
Joined: 18 Nov 2003 Posts: 20756 Location: LI,NY
|
OK... Here I thought you were trying to keep 2 DBs synchronized, to use one in normal traffic and the other in DR.
Apparently this is quite a bit more complex.
Quote: |
Not necessarily; if messages are replicated independently from the DB updates, there is no guarantee that an m-n difference in msgs on the Qs is reflected in the DBs. |
Then what is the point of replicating the messages?
Differences between sites
Assumptions: 1000 msgs put to queue, curdepth = 900
There is no guarantee that the same 100 have been consumed, especially if there are multiple routes from source to target, or if the receiving queues are mismatched (priority vs FIFO).
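The priority vs FIFO mismatch is easy to demonstrate with a toy model. The sketch below uses plain Python containers rather than real queues, and it assumes (purely for illustration) that message priority is the message ID modulo 10, with 9 being the highest priority.

```python
import heapq
from collections import deque


def consume_fifo(messages, count):
    """Consume `count` messages in arrival order; return the consumed IDs."""
    queue = deque(messages)
    consumed = set()
    for _ in range(count):
        msg_id, _priority = queue.popleft()
        consumed.add(msg_id)
    return consumed


def consume_by_priority(messages, count):
    """Consume `count` messages, highest priority first, FIFO within a priority."""
    heap = [(-priority, seq, msg_id) for seq, (msg_id, priority) in enumerate(messages)]
    heapq.heapify(heap)
    consumed = set()
    for _ in range(count):
        _neg_priority, _seq, msg_id = heapq.heappop(heap)
        consumed.add(msg_id)
    return consumed


# Same 1000 messages delivered to both sites as (id, priority) pairs.
messages = [(msg_id, msg_id % 10) for msg_id in range(1, 1001)]

primary_done = consume_fifo(messages, 100)        # IDs 1..100
dr_done = consume_by_priority(messages, 100)      # IDs 9, 19, ..., 999

print(len(primary_done), len(dr_done))            # 100 100
print(len(primary_done & dr_done))                # 10
```

Both sites report curdepth = 900, yet only 10 of the 100 consumed messages are the same, so equal queue depths say nothing about equal DB state.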
You would expect that once all the messages have been consumed the state of the DBs in both sites would be equal...
 _________________ MQ & Broker admin |
|
|
 |
sridhsri |
Posted: Mon Jul 14, 2008 8:09 pm Post subject: |
|
|
Master
Joined: 19 Jun 2008 Posts: 297
|
I have never done DR, nor have I read up on any material to do so. I am definitely oversimplifying the issue when I propose this, but I was wondering if you could please tell me what possible disadvantages you would see if we:
1) used linear logging for the Qmgr (with all the usual scripts that come with that for maintenance)
2) used archival logging for DB2, or similar techniques for other databases
and used disk mirroring on the file systems between the primary site and the DR site. |
|
|
 |
|