MQOPEN Performance w/SAN
fordcam (Apprentice; Joined: 28 Mar 2002; Posts: 35; Location: MGIC)
Posted: Tue Nov 11, 2003 2:42 pm    Post subject: MQOPEN Performance w/SAN
We've been battling sporadic performance issues on one of our Solaris servers. This is Solaris 2.6, MQ 5.2, CSD6. After much legwork on my part, I found a pattern. These slowdowns coincided with BCV work that was done on our SAN; specifically, EMC TimeFinder software running on the mainframe. We didn't have these slowdowns every time the TF stuff was run, but every time we did have a slowdown, TF was running.
Our SAN group has spent three months on this, to no avail. They have repeatedly asked us to provide them with more info. So we found one of our simpler applications, and had it record where it was spending its time. The developer reported that when we had a problem, his program was waiting for an MQPUT to complete. That made sense, but still no progress was made.
But a week ago I found that it was actually an MQPUT1, not an MQPUT, that was stalled. So I wrote a program of my own that iteratively did an MQOPEN, MQPUT, MQGET and MQCLOSE, then waited 5 seconds. I captured any of these calls that took more than 1 second. To my surprise, it is always the MQOPEN that is stalled.
For example, yesterday I had 51 of these MQOPENs that took more than 1 second. Since I'm doing one of these loops every 5 seconds, that's 51 out of about 17,000, so it usually works quickly. But three of the opens took nearly a full minute to complete, and there were many of the 51 that took 20 seconds, too.
So the question is... what would cause an MQOPEN to slow down this much, given that PUTs and GETs are OK? FWIW - I also wrote a tiny C program that constantly writes a 1K string to a file on /var/mqm, and it never has any problems. It's just MQOPEN.
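The timing loop described above can be sketched as follows. This is a minimal sketch, not the original program: the real test was presumably C against the MQI, so the `mq_*` functions here are hypothetical stand-ins for MQOPEN, MQPUT, MQGET and MQCLOSE.

```python
import time

# Hypothetical stand-ins for the real MQI calls; in the actual test program
# these would be MQOPEN, MQPUT, MQGET and MQCLOSE against a local test queue.
def mq_open():  pass
def mq_put():   pass
def mq_get():   pass
def mq_close(): pass

SLOW_THRESHOLD = 1.0  # seconds; calls slower than this get recorded

def timed(name, call, slow_calls):
    """Run one call and record it if it exceeds the threshold."""
    start = time.monotonic()
    call()
    elapsed = time.monotonic() - start
    if elapsed > SLOW_THRESHOLD:
        slow_calls.append((name, elapsed))
    return elapsed

def probe_once(slow_calls):
    """One iteration: open, put, get, close, each timed individually."""
    for name, call in (("MQOPEN", mq_open), ("MQPUT", mq_put),
                       ("MQGET", mq_get), ("MQCLOSE", mq_close)):
        timed(name, call, slow_calls)

# The original ran one such loop every 5 seconds, all day:
#   while True: probe_once(slow); time.sleep(5)
```

Timing each MQI verb separately is what isolated the stall to MQOPEN rather than the PUT or GET.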
The SAN team wants me to call IBM and open an MQSeries problem, but it's so vague that it seems hopeless to me. And the machine is way too busy to turn on traces, especially since we can't recreate the problem at will. I think the correlation between their BCV replication and these slowdowns is too strong to ignore, but I'm not a SAN guy at all.
Thanks for reading this, and any help of any kind would be great!
PeterPotkay (Poobah; Joined: 15 May 2001; Posts: 7722)
Posted: Wed Nov 12, 2003 4:50 pm
Fordcam, please do post the solution if you ever get it. We sometimes have the same problem.
I minimized it by making all my channels FAST and making sure all the apps that go through these channels send non-persistent messages only, but we still occasionally see delays when a SAN gets updated. Not every time, like you said, but sometimes. And when we do see a problem, there is always a SAN update.
Our QMs that rely on the SAN are our HUB QMs (hence the channel semi-solution above) as well as MQSI QMs. The actual apps are on non-SAN boxes.
The fact that you narrowed it down to the MQOPEN makes sense to me, since with high DISCINTs on my channels I have eliminated most of the MQOPENs. MQSI must still be doing some that are being affected, though, which causes the occasional problem.
We finally have a new monitoring tool (Transaction Vision) coming in next month. I know I will be able to solve this then, as the info this tool gave us during the Proof of Concept we did with it last year is just what I need.
_________________
Peter Potkay
Keep Calm and MQ On
fordcam (Apprentice; Joined: 28 Mar 2002; Posts: 35; Location: MGIC)
Posted: Thu Nov 13, 2003 7:09 am
OK, I'll post anything I learn.
Out of curiosity, does anybody know what in general could cause an open to be slow? According to a conference presentation I once attended, an MQOPEN involves some shared memory work and then security checking via the OAM. Shared memory isn't disk, and I'm sure the OAM definitions are cached, so no disk access there either.
After that, the queue file is opened. But I'm able to open files in /var/mqm without any problems even while the MQOPEN is struggling. So opening a file shouldn't be a problem either.
What's left??? It's all so frustrating!
JasonE (Grand Master; Joined: 03 Nov 2003; Posts: 1220; Location: Hursley)
Posted: Thu Nov 13, 2003 7:30 am
Isn't the open the point in time the queue gets loaded? If there are many messages on the queue, that could definitely cause problems. If you see this with an empty queue, I have no idea.
PeterPotkay (Poobah; Joined: 15 May 2001; Posts: 7722)
Posted: Thu Nov 13, 2003 7:34 am
Hmmm, in your tests was the queue sometimes empty and sometimes full?
_________________
Peter Potkay
Keep Calm and MQ On
fordcam (Apprentice; Joined: 28 Mar 2002; Posts: 35; Location: MGIC)
Posted: Thu Nov 13, 2003 7:42 am
You're right, queue loading is done at MQOPEN time if no one else has the queue open. When I'm running my little tester program, my local queue is always empty; I put to it and then get the message, so I know that.
The application that isolated the MQPUT1 as the point of the problem is putting reply messages using ReplyToQ/ReplyToQMgr, so the actual queue is a transmission queue. And this is a busy channel, so the sender channel is active all day. The queue should always be empty, and at least one process (the channel) has it open all day.
So I was thinking that queue loading wasn't the problem.
fordcam (Apprentice; Joined: 28 Mar 2002; Posts: 35; Location: MGIC)
Posted: Tue Dec 02, 2003 6:46 am    Post subject: Update
Last night I was able to catch the problem with the MQSeries trace turned on. I did an MQOPEN, which took 7 seconds to complete.
The trace file for the process doing the MQOPEN had its first line written at +7 seconds, after the "pause" had ended. That was unexpected. From that point on, the trace data looks fine, and the open quickly gets a completion code of 0.
The trace file for the associated agent process, though, had entries beginning at +0 seconds. But there's a spot where the trace data shows an "xlsRequestMutex," and after that there are no more entries until +7 seconds, when I see an entry that says "cond_timedwait rc=0." Then everything takes off again.
So it looks like my problem is that this particular "RequestMutex" stalls. What does that mean? I have no idea, so I forwarded the trace data to IBM.
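A simple way to spot stalls like this in trace output is to scan for large gaps between consecutive timestamps. The sketch below assumes each line starts with a relative timestamp in seconds; real MQ trace formats differ, so the parsing and the sample entries are illustrative only.

```python
import re

def find_gaps(lines, threshold=1.0):
    """Return (prev_line, next_line, gap) tuples wherever two consecutive
    trace entries are more than `threshold` seconds apart.  Assumes each
    line begins with a float timestamp in seconds (illustrative format)."""
    gaps = []
    prev_t, prev_line = None, None
    for line in lines:
        m = re.match(r"\s*([\d.]+)", line)
        if not m:
            continue  # skip lines without a leading timestamp
        t = float(m.group(1))
        if prev_t is not None and t - prev_t > threshold:
            gaps.append((prev_line, line, t - prev_t))
        prev_t, prev_line = t, line
    return gaps

# Made-up entries shaped like the stall described above:
trace = [
    "0.001 xcsRequestMutex entry",
    "0.002 xlsRequestMutex entry",
    "7.002 cond_timedwait rc=0",    # a 7-second hole in the trace
    "7.003 xlsRequestMutex exit",
]
for before, after, gap in find_gaps(trace):
    print(f"{gap:.3f}s gap between:\n  {before}\n  {after}")
```

Running this over each per-process trace file finds the "hole" automatically, rather than eyeballing 1100 files for the entry before and after the pause.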
fordcam (Apprentice; Joined: 28 Mar 2002; Posts: 35; Location: MGIC)
Posted: Tue Feb 24, 2004 11:26 am    Post subject: Queue Loading is the Problem!
I think I've gotten to the bottom of this at long last, so I thought I'd post what I found.
I ran a 10-minute MQ trace a few days ago. During the trace, one of our key apps did an MQPUT1 that took 54 seconds to complete. I did my best to analyze the trace data (1.3 GB, and 1100 files!) and have come to a conclusion.
Around the time of the pause, an unrelated process opened an unrelated queue that has 185,000 XML messages on it. That caused a flurry of activity: a large number of trace entries implied that "queue loading" was occurring. From an internals class I attended at the MQ conference, I remembered them talking about MQOPEN and queue loading. The handout said:
Quote:
The DAP performs the operations needed to make the physical (local) queue available. This is termed loading the queue. It involves opening the file containing the underlying message data and allocating the shared memory buffers and other shared resources necessary for the queue to be used. Of course, if the queue is already loaded, this work can be avoided.
Earlier it said that the DAP (Data Abstraction and Persistence component) "provides storage and recovery of the data held by the queue manager. DAP holds the messages."
I decided that queue loading could be serialised, and that was what was slowing things down. Furthermore, EMC replication, no matter what the SAN guys say, has a drastic impact on SAN performance, making things much worse.
So to test my theory I wrote a tiny Perl program that opened the queue in question and then, in an endless loop, did a BROWSE_FIRST and slept for 5 minutes. It's been running for 2 days, and in that time we haven't noticed a single slow MQI call!
Everyone is thrilled, and I'm a semi-hero. Per my request the app team is converting this queue to a database, which is what it should have been in the first place. In the meantime I intend to just let this program run. Is there any downside? This is a frequently used queue, so I think having it "pre-loaded" does more good than harm.
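The keep-alive workaround was a few lines of Perl; the logic can be sketched like this. This is not the original program: the MQ calls are injected as plain callables so the loop can be shown (and tested) without a real queue manager. In production, `open_queue` would wrap an MQOPEN with MQOO_BROWSE and `browse_first` an MQGET with MQGMO_BROWSE_FIRST + MQGMO_NO_WAIT.

```python
import time

def keep_loaded(open_queue, browse_first, interval=300, iterations=None):
    """Hold a queue open so the DAP never has to reload it, nudging it
    with a browse every `interval` seconds.  `open_queue` and
    `browse_first` are injected stand-ins for the real MQI calls.
    `iterations=None` means loop forever, as the original program did."""
    handle = open_queue()          # queue loading happens here, at open time
    n = 0
    while iterations is None or n < iterations:
        browse_first(handle)       # optional nudge; the open handle is
        n += 1                     # what actually keeps the queue loaded
        if iterations is None or n < iterations:
            time.sleep(interval)
    return n
```

As the thread goes on to note, the browse may well be unnecessary: simply holding the queue open should be enough to keep it loaded.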
PeterPotkay (Poobah; Joined: 15 May 2001; Posts: 7722)
Posted: Tue Feb 24, 2004 11:56 am
The DAP took a hit as soon as the app tried to OPEN a queue with 185,000 messages? Or did it take the hit when the app did a GET from this huge queue?
I wonder: if your app simply opened the queue and never browsed, would that also solve the problem?
(Thanks for posting what you found.)
_________________
Peter Potkay
Keep Calm and MQ On
jefflowrey (Grand Poobah; Joined: 16 Oct 2002; Posts: 19981)
Posted: Tue Feb 24, 2004 12:05 pm
So you spent how much time, and effort, to finally end up showing that it really was an application problem in the first place?
_________________
I am *not* the model of the modern major general.
fordcam (Apprentice; Joined: 28 Mar 2002; Posts: 35; Location: MGIC)
Posted: Tue Feb 24, 2004 1:32 pm
I suspect that the browses are unnecessary, since the DAP does the loading as part of MQOPEN; that's when I saw the performance hit. I may try doing away with the browse, or I may just let sleeping dogs lie.
I spent a lot of time on this, and it's been a problem for around 6 months, believe it or not. It really surfaced when we started doing heavy-duty SAN replication last August. I naturally tried to focus on what the replication was doing to us, and was met by a stone wall from the SAN team. They maintained, and still maintain, that replication isn't adding any appreciable time to I/Os, which is BS. So this will rear its head again in some other form, I'm sure.