MQSeries.net Forum Index » General IBM MQ Support » Linear logs & recovering after disk failure

Linear logs & recovering after disk failure
jpw
Posted: Mon Nov 18, 2002 10:54 am    Post subject: Linear logs & recovering after disk failure

Newbie

Joined: 28 Jul 2002
Posts: 5

Hi there,
I've got two questions about recovering a queue manager after disk failures when linear logging is used.

Our queue manager is going to run on MQSeries v5.2 on HP-UX. In some extreme cases that QM may hold 4-5 GB of persistent messages. We have a requirement to protect that data from disk drive failures or accidental deletion of files. So we are going to use linear logging, and we are also going to take a daily backup of the QM data (everything under /var/mqm/qmgrs/<QM_name>) and the QM log files (everything under /var/mqm/log/<QM_name>).
Because of the large amount of data that may be stored in the queues, we would like to avoid recording media images of the queues in the linear log.

Question1:
Is it indispensable to record media images of the queues in the log in order to restore the queue manager to its last committed state before the failure, after a disk drive failure or accidental deletion of queue manager file(s)?

Question2:
We are thinking about the following backup and recovery procedure, where images of the QM objects are not recorded.

1. Every day the queue manager is stopped and a backup of the following directories is taken:
- /var/mqm/qmgrs/<QMGR name>/
- /var/mqm/log/<QMGR name>/
Before taking the backup, the log is truncated to minimize the volume of data that must be backed up in order to restore the queue manager in a consistent state if such a need occurs. All the log files that are not needed to RESTART the queue manager are deleted.

2. If the queue manager data gets damaged or is lost (due to disk failure or accidental deletion of files) we are going to stop the qmgr and restore only the queue manager data backup (the directory /var/mqm/qmgrs/<QMGR name>/ will be restored). The log data on the disk is left intact.
Then we want to restart the queue manager, and we expect that the information in the log will be replayed and the queue manager will bring itself up to the last committed state before the failure.
The logs on the disk are relevant for the just-restored queue manager backup and contain the information from the time of that backup up to the time of the failure.

Is this procedure correct?
Can I recover the queue manager and all queue definitions, process objects and persistent messages in this way? (I know that channels created just before the failure will be lost.)

Thanks in advance,
JPW
jhalstead
Posted: Tue Nov 19, 2002 1:01 am

Master

Joined: 16 Aug 2001
Posts: 258
Location: London

My understanding of linear logging on distributed platforms is that the record media image command is essential for performing any sort of recovery from the log files, and indeed any form of linear log management. So if you don't have a media image in the logs you can never recover damaged objects. The good thing about taking the media image is that once it is taken, all the logs which pre-date it should be prime candidates for deletion / archiving. So the concern about the image using loads of disk only counts if you don't manage the out-of-date log files. However, taking the media image of a few gig of messages will probably take a while!

W.r.t. your recovery approach: as it stands, I don't think it's going to work. The logs don't just automatically play themselves forward. The rcrmqobj command is used to recreate damaged objects from the media image and any subsequent logs.

Another point to recognise is that in combination with the backup of the old log/data files you also need all more recent log files intact. Otherwise the best you can do is restore to the backup point in time.

Obviously some form of hardware redundancy helps out - for instance mirroring the log and data disks.

I hope this helps...

Jamie
jpw
Posted: Tue Nov 19, 2002 5:00 am

Newbie

Joined: 28 Jul 2002
Posts: 5

Thanks for the comment (any comment counts).

I've tested my recovery procedure a bit and it seemed to work. The idea of this approach is to rely on two things:
- First, the checkpoint file is restored when the backup of the /var/mqm/qmgrs/<QMGR> directory is restored. According to the documentation (book: MQSeries System Administration->Chapter 15. Recovery and restart->Recovery scenarios) this file contains the information determining how much of the data in the log must be applied to give a consistent queue manager.
- Second, I assume that the oldest log file that was required to start the queue manager at the time of the backup, and all subsequent log files, will be available in the log file directory. As I understand it, that condition must also be satisfied if I record media images in the log.

In my test I put some persistent messages to a queue, then stopped the QM and truncated the logs so that only the logs needed for QM restart were left on the disk. Then I made a backup of /var/mqm/qmgrs/<QMGR>. I started the QM, put some more persistent messages, created a process object and a remote queue object, and then deleted the /var/mqm/qmgrs/<QMGR> directory.
Next I killed the QM, restored the backup of the QM directory and started the QM. I didn't notice that any object or message was lost. It seems that the log entries created after the backup were replayed during the start of the QM and that recording images in the log is not necessary.

One thing that confuses me is that I found that approach in the documentation in the recovery scenario titled "For linear logging with media recovery and with an undamaged log". This "with media recovery" is unclear to me.
Is it essential in the "For linear logging with media recovery and with an undamaged log" scenario or not?

Any other comments?
JPW
jhalstead
Posted: Wed Nov 20, 2002 12:58 am

Master

Joined: 16 Aug 2001
Posts: 258
Location: London

I still think there are a few weaknesses here.

1) Messages which have been resident for a period of time are no longer covered by the logs required for qmgr restart, and therefore cannot be recovered.

2) If an object gets damaged, you need to perform media recovery, which requires a media image.

My main concern is the loss of messages - so I think you need to consider an object getting damaged.

I'd try the same thing but with lots more messages, maybe a couple of stop/starts before the backup, and perhaps damage a few objects...

Thanks

Jamie
nimconsult
Posted: Wed Nov 20, 2002 11:38 pm

Master

Joined: 22 May 2002
Posts: 268
Location: NIMCONSULT - Belgium

This is an original and clever way of considering recovery.

I understand the concern raised in the replies.

Three comments for my contribution:

- You rely on a mechanism which is not clearly documented by IBM, but which can be deduced from the well-known recovery mechanism. Basically, we know that MQ Series performs forward recovery (plays the log file forward) to synchronize the queue content with the log image, so why not perform forward recovery from a backup image of the data? Definitely this is a clever idea.

- It is to be noted that the main assumption of this mechanism is that the backup image does not contain corrupted objects. Maybe you could keep multiple generations of backup images, just in case the last one would contain corrupted objects, but I am afraid of the time it will take to run the forward recovery.

- In your initial requirements you are asked to build an infrastructure protected against accidental file corruption. I suppose that you also want to protect your log files? By also doing a daily backup? Together with the backup of the data file? In this case why don't you simply record a media image in the log files and backup the log files only?
_________________
Nicolas Maréchal
Senior Architect - Partner

NIMCONSULT Software Architecture Services (Belgium)
http://www.nimconsult.be
jhalstead
Posted: Thu Nov 21, 2002 12:48 am

Master

Joined: 16 Aug 2001
Posts: 258
Location: London

Well, this is interesting, isn't it! I think I fully understand your approach after a second read through...

As long as a good backup of the object file system is available, in combination with the logs since the backup, you can recover every object to its most recent state. It's an all-or-nothing recovery: if any sort of damage occurs to any of the objects or underlying object files, does this mean that you have to do a complete restore from the log/old object files? I guess this will probably take a significant period of time...


Jamie
nimconsult
Posted: Thu Nov 21, 2002 11:29 pm

Master

Joined: 22 May 2002
Posts: 268
Location: NIMCONSULT - Belgium

Well I have some bad news.

I really wanted to test it and here is the result:

Tested on W2K with MQ Series 5.2.

1) Create a queue manager TTT, with linear log files, 1M per log file (256x4k)

2) Create a backup copy of <MQ>\qmgrs\TTT

3) Start the queue manager TTT

4) Create a queue LQ.TEST (increase maxdepth to 500000)

5) Put 105000 persistent messages, 10 bytes app buffer each. This takes about 5 minutes.

6) 68 log files are created. MQ Series output says that the last one is required to restart the queue manager (S0000068.LOG). It means that the last checkpoint is in the last log file.

7) Stop the queue manager.

8) Restore the backup performed in point 2 on <MQ>\qmgrs\TTT

9) My hope is to see LQ.TEST recreated, and forward recovery performed until the queue contains 105000 messages again.

10) Restart the queue manager.

11) The start command is accepted and starts working!

12) MQ Series has recreated the queue LQ.TEST! (I see the file <MQ>\qmgrs\TTT\QUEUES\LQ!TEST, which was not part of the backup)

13) MQ Series is performing forward recovery. It takes about 10 minutes until the queue file reaches the same size as before the restore.

14) After 10 minutes, CRASH! I receive the error "The instruction 0x00256992 referenced memory 0x00000004. The memory could not be read". The queue manager has refused to start.


I performed the test a second time, but I took the backup after the creation of LQ.TEST. The result was exactly the same.

Hard to explain why it crashed so close to the end. Maybe other contributors will perform the same test and confirm?

Nicolas
jhalstead
Posted: Fri Nov 22, 2002 1:16 am

Master

Joined: 16 Aug 2001
Posts: 258
Location: London

I did a real quick version of the same test last night, putting 10010 persistent messages of 10 bytes each onto a queue.

stopped queue manager and took copy of object filesystem.

started queue manager

put another 10010 messages onto queue..

stopped queue manager

replaced object filesystem

started queue manager

....

it seemed to work Ok.

I think I'll increase the number of messages by an order of magnitude and try again!
pgorak
Posted: Fri Nov 22, 2002 1:30 am

Disciple

Joined: 15 Jul 2002
Posts: 158
Location: Cracow, Poland

Hello Nicolas,

I performed the same test on HP-UX, MQSeries V5.2 and recreated LQ.TEST and all messages without problems.

My understanding is that to recreate LQ.TEST without recording an image, you always need the oldest log file, that is S0000000.LOG, whereas if you record an image, the set of files required to perform media recovery shrinks, so that you need, for example, S0000060.LOG and later, and can discard all log files prior to that one.
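
To make that rule concrete, here is a minimal Python sketch (not an IBM tool). It assumes log files are named S<number>.LOG as in this thread, and that you already know the two numbers the queue manager reports: the oldest log needed for restart and the oldest log needed for media recovery. The function names are made up for illustration. It only lists the files that fall below both thresholds; it deletes nothing.

```python
import re

# Linear log file names as they appear in this thread, e.g. S0000060.LOG
LOG_NAME = re.compile(r"^S(\d+)\.LOG$", re.IGNORECASE)


def log_number(name):
    """Return the numeric part of a log file name, or None if it does not match."""
    match = LOG_NAME.match(name)
    return int(match.group(1)) if match else None


def prunable_logs(names, oldest_for_restart, oldest_for_media):
    """List log files older than BOTH thresholds, oldest first.

    A file must be kept if it is needed for restart or for media recovery,
    so the cut-off is the smaller of the two numbers (hypothetical helper).
    """
    cutoff = min(oldest_for_restart, oldest_for_media)
    candidates = []
    for name in names:
        number = log_number(name)
        if number is not None and number < cutoff:
            candidates.append((number, name))
    return [name for _, name in sorted(candidates)]


if __name__ == "__main__":
    logs = [f"S{i:07d}.LOG" for i in range(69)]  # S0000000.LOG .. S0000068.LOG
    # Image recorded: media recovery starts at 60, restart at 68
    print(prunable_logs(logs, oldest_for_restart=68, oldest_for_media=60))
    # No image recorded: media recovery still needs S0000000.LOG
    print(prunable_logs(logs, oldest_for_restart=68, oldest_for_media=0))
```

With the image recorded, the first call lists S0000000.LOG through S0000059.LOG as candidates for archiving; without an image, the second call lists nothing, which matches the point above that the oldest log file is always needed in that case.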

Piotr
nimconsult
Posted: Mon Nov 25, 2002 12:19 am

Master

Joined: 22 May 2002
Posts: 268
Location: NIMCONSULT - Belgium

If I perform the test with a small number of messages (say 1000), the recovery works, but with a higher number (say 100000) it does not work.

Maybe this is linked to checkpoints? I would tend to believe so.

My understanding of MQ Series recovery is the following:
- When you start MQ Series, it performs a combination of backward and forward recovery.
- Starting from the last checkpoint in the log files, MQ Series performs an inventory of the transactions that are completed and those that are not completed.
- For transactions that are completed, MQ Series performs forward recovery. Starting from the latest checkpoint, MQ Series applies the modifications from the log files into the queue data files.
- For transactions that are not completed, a backward recovery is performed. The backward recovery may access log files that are older than the last checkpoint. This is why the oldest log file required to restart MQ Series is the one containing information about the oldest uncompleted transaction.
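
The reasoning in the list above can be sketched as a toy model in Python. This is only an illustration of the forward/backward recovery idea as described in this post, not MQ Series internals; `Record`, `lsn` (a position in the log) and `oldest_lsn_needed` are hypothetical names, and real log records carry far more than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    lsn: int   # position of the record in the log (hypothetical numbering)
    tx: str    # transaction identifier
    kind: str  # "update" or "commit"


def oldest_lsn_needed(records, checkpoint_lsn):
    """Return the oldest log position a restart would have to read.

    Forward recovery starts at the last checkpoint. Backward recovery of a
    transaction that never committed has to reach that transaction's first
    record, which may lie before the checkpoint.
    """
    committed = {r.tx for r in records if r.kind == "commit"}
    first_record = {}
    for r in records:
        # Remember the earliest position at which each transaction appears
        first_record[r.tx] = min(r.lsn, first_record.get(r.tx, r.lsn))
    incomplete_starts = [
        lsn for tx, lsn in first_record.items() if tx not in committed
    ]
    return min([checkpoint_lsn] + incomplete_starts)


if __name__ == "__main__":
    log = [
        Record(10, "T1", "update"),
        Record(20, "T2", "update"),   # T2 never commits
        Record(30, "T1", "commit"),
        Record(50, "T3", "update"),
        Record(60, "T3", "commit"),
    ]
    # Checkpoint at position 40, but T2 is still open since position 20
    print(oldest_lsn_needed(log, checkpoint_lsn=40))
```

In this toy log the checkpoint sits at position 40, yet the function returns 20 because the uncommitted transaction T2 began there. Once every transaction has committed, the answer is simply the checkpoint position.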

JPW, you said that by restoring the data files you also restore the checkpoint file. That's true, but the issue is: does MQ Series expect to meet checkpoints in the log files during the forward recovery operation?

If I had to guess the reason why my test crashed, I would say that the sequence of checkpoints is broken after the recovery operation because MQ Series does not expect to meet checkpoints during recovery.

By the way JPW, can you consider this comment that I made earlier in this thread?
Quote:
- In your initial requirements you are asked to build an infrastructure protected against accidental file corruption. I suppose that you also want to protect your log files? By also doing a daily backup? Together with the backup of the data file? In this case why don't you simply record a media image in the log files and backup the log files only?



Nicolas
jpw
Posted: Thu Nov 28, 2002 10:30 am

Newbie

Joined: 28 Jul 2002
Posts: 5

Thank you guys for all the replies.

Referring to the idea of recording a media image in the log files and then backing up the log files only: I think that you can't give up backing up the directory /var/mqm/qmgrs/<QM>. If that directory is deleted or severely damaged then the QM may crash, and you won't be able to restart it in order to issue rcrmqobj and recreate the lost objects and messages. I think the same may happen if the disk drive holding the queue manager files fails.


JPW
nimconsult
Posted: Thu Nov 28, 2002 11:02 pm

Master

Joined: 22 May 2002
Posts: 268
Location: NIMCONSULT - Belgium

Yes but you only need one initial backup of the directory qmgrs after the creation. Not a backup of the directory every day. I think.

I tried the following and it worked:

1) Create a queue manager TTT, with linear log files, 1M per log file (256x4k)
2) Create a backup copy of <MQ>\qmgrs\TTT
3) Start the queue manager TTT
4) Create a queue LQ.TEST (increase maxdepth to 500000)
5) Put 15000 persistent messages on LQ.TEST, 10 bytes app buffer each.
6) rcdmqimg -m TTT -t all "*"
7) Put another 15000 persistent messages on LQ.TEST, 10 bytes app buffer each.
8) Stop the queue manager TTT
9) Restore initial backup of <MQ>\qmgrs\TTT
10) Restart the queue manager

After the restart of the queue manager, the queue LQ.TEST appears with 30000 messages. Recovery is successful!

Steps 1-2 are one-shot operations
Steps 3-6 represent every-day operations
Steps 7-10 happen the day of the crash

I hope I am not missing anything.

Nicolas
pgorak
Posted: Wed Dec 04, 2002 2:40 am

Disciple

Joined: 15 Jul 2002
Posts: 158
Location: Cracow, Poland

Nicolas,
Did you look at which log files were needed to restart the queue manager and which were needed to perform media recovery?

I believe the queue manager recreated from the initial backup will expect the oldest log file (i.e. S0000000.LOG) to be present, because it stores relevant information concerning logs and checkpoints in amqhlctl.lfh and amqalchk.fil files (I'm referring to UNIX directory structure). This is only what I suspect, I have not made a test yet.

Piotr
Back to top
View user's profile Send private message
pgorak
PostPosted: Wed Dec 04, 2002 6:56 am    Post subject: Reply with quote

Disciple

Joined: 15 Jul 2002
Posts: 158
Location: Cracow, Poland

Nicolas,

I repeated the same operations with the difference that after stopping the queue manager I deleted the "unnecessary" log files. So after step 8 I read that:
- S00000026.LOG is the oldest log file required to restart the queue manager and
- S00000010.LOG is the oldest log file required to perform media recovery

I deleted all log files older than S00000010.LOG and unfortunately, the restored queue manager refuses to start with AMQ7017 Log not available. So I also restored amqalchk.fil and amqhlctl.lfh files I had copied just after performing step 8. Then, the queue manager starts, but the queue LQ.TEST is not recreated. What's more, I cannot connect to the queue manager - the application I use for putting messages fails with RC 2059.

Supposing I found myself in such a situation in a production environment...

Piotr
nimconsult
Posted: Wed Dec 04, 2002 11:32 pm

Master

Joined: 22 May 2002
Posts: 268
Location: NIMCONSULT - Belgium

That's an excellent analysis, Piotr. I definitely missed something.

I played with the challenge proposed in this thread, but my conclusion is: follow the rules of the product, and do not build a recovery strategy on undocumented assumptions.