SAFraser
Posted: Thu Sep 09, 2010 7:03 am    Post subject: RESOLVED: Solaris 10 Zone Impact on MQ Performance
Shaman (Joined: 22 Oct 2003, Posts: 742, Location: Austin, Texas, USA)
We are working on a performance issue.
Operating system: Solaris 10
WebSphere MQ: 7.0.1.3
System tuning as specified in IBM documents.
During application testing, we noticed a degraded dequeue rate. In addition to application testing, we were also clearing a queue with a destructive-get utility (based upon amqsget). When we killed the utility, the dequeue rate improved but still was not what we expect.
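For reference, a drain utility in the amqsget mould does little more than the following (pseudocode sketch only; the MQI option names are standard, but the loop details here are assumed rather than taken from the actual utility):

```
MQCONN(queue_manager)
MQOPEN(queue, MQOO_INPUT_SHARED)
loop:
    MQGET(hObj, MQGMO_WAIT, wait interval = a few seconds)
    // a destructive get: the MQGET itself removes the message;
    // the message body is simply discarded
    if reason == MQRC_NO_MSG_AVAILABLE: break   // queue is drained
MQCLOSE(queue)
MQDISC(queue_manager)
```

Each iteration is an independent destructive get, which matters later in this thread when per-get I/O costs are discussed.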
So, we ran the following test:
MQ running on a Solaris 10 global zone:
Sent and consumed 150,000 small test messages.
Enqueue rate: 42,857 avg
Dequeue rate: 42,857 avg

MQ running on a Solaris 10 sparse zone:
Sent and consumed 150,000 small test messages.
Enqueue rate: 33,335 avg
Dequeue rate: 5,642 avg
System resources were virtually unaffected during either test. In other words, no appreciable change in CPU and memory usage were noted. Our initial troubleshooting suggests that the global zone / sparse zone variable is the significant one.
We are working with our Solaris admins, and perhaps a ticket to Sun will follow. We will probably open a PMR shortly, but I'd be interested in thoughts from this forum. Thanks very much.
Last edited by SAFraser on Thu Sep 16, 2010 4:26 pm; edited 1 time in total
mvic
Posted: Fri Sep 10, 2010 12:12 pm    Post subject: Re: Solaris 10 Zone Impact on MQ Performance
Jedi (Joined: 09 Mar 2004, Posts: 2080)
SAFraser wrote:
> Our initial troubleshooting suggests that the global zone / sparse zone variable is the significant one.
> We are working with our Solaris admins, and perhaps a ticket to Sun will follow. We will probably open a PMR shortly, but I'd be interested in thoughts from this forum. Thanks very much.
Is the OS/hardware delivering the same CPU and I/O resources to MQ in these two situations? Your OS support will be the best place to start. If (unlikely) they say MQ did something different in the two cases, then the job would be to work out why that was.
SAFraser
Posted: Fri Sep 10, 2010 12:44 pm
Yes, the same system resources are available in the global and sparse zones. We removed all resource caps and performed the test in the global zone and the sparse zone on the same physical server. Get rates were 8,000/minute on the global zone and 4,500/minute on the sparse zone. Same volume of messages, same message content.
IBM has not been helpful, but I didn't expect it to be an MQ problem anyway. Our OS admins are planning a trouble ticket to Sun, I think.
I am hopeful that this can be corrected somehow. The alternative is to tear down all that we have built in sparse zones and rebuild it in the global zone. <sigh> I didn't want those sparse zones in the first place. MQ and WMB are not sparse zone friendly; there is little value in virtualization on Solaris unless whole root zones are used.
fjb_saper
Posted: Fri Sep 10, 2010 6:51 pm
Grand High Poobah (Joined: 18 Nov 2003, Posts: 20756, Location: LI, NY)
SAFraser wrote:
> Yes, the same system resources are available in the global and sparse zones. We removed all resource caps and performed the test in the global zone and the sparse zone on the same physical server. Get rates were 8,000/minute on the global zone and 4,500/minute on the sparse zone. Same volume of messages, same message content.
> IBM has not been helpful, but I didn't expect it to be an MQ problem anyway. Our OS admins are planning a trouble ticket to Sun, I think.
> I am hopeful that this can be corrected somehow. The alternative is to tear down all that we have built in sparse zones and rebuild it in the global zone. <sigh> I didn't want those sparse zones in the first place. MQ and WMB are not sparse zone friendly; there is little value in virtualization on Solaris unless whole root zones are used.
Could it be that the only difference between your sparse and global zone is the number of CPUs allocated or available to each? Could it be that in your case you have only about half the CPUs allocated to the sparse zone MQ is running in, and all the CPUs allocated to the global zone?
Another avenue of research would be the workload manager.
Let's assume that the sparse zone has exactly the same resources as the global zone. If the workload manager gives priority to the processes in the global zone over those in the sparse zone, and there is activity in the global zone at the time of your test, I would expect you to see a performance slowdown in the sparse zone. How much of a performance hit would of course depend on the level of activity in the global zone.
A bizarre test here would be to have the MQ clients working in the global zone and the queue manager in the sparse zone. The more you scale the clients, the more you slow down the qmgr!
Think about the nightmare of setting priorities in z/OS between the database, CICS and MQ. Well, you just might run into the same type of problem with sparse vs. global zones on Solaris...
And at this rate we haven't even talked about memory allocation across the zones...
Have fun.
_________________
MQ & Broker admin
SAFraser
Posted: Sun Sep 12, 2010 9:46 am
As always, fjb, you've raised some good points.
We performed the first set of tests with resource allocations assigned to both the global and sparse zones. Thinking (as you did) that the resource caps were a variable that might impact performance, we removed them. There was a slight increase in sparse zone performance, but not nearly enough. (We removed caps for both CPUs and memory.)
We have engaged our data center services provider and have asked them to open a ticket with Sun. Several of us agree with you that the issue is workload management. These are T-series machines, which are designed specifically to manage multi-threaded applications in a virtualized environment and are notoriously slow with single-threaded work. (Heck no, I didn't pick these machines.)
SAFraser
Posted: Thu Sep 16, 2010 6:37 am
Current status:
A global zone uses the UFS filesystem.
A sparse zone uses the ZFS filesystem.
The "write" operation is the same speed on both types of zones.
The "read" operation is five times slower on the sparse zone compared to the global zone.
PUT rates are about the same speed on both types of zones.
GET rates are dramatically slower on the sparse zone compared to the global zone.
Tell me what you know about GET... does this anomaly make sense?
(Oh yes, we have high priority tickets open with both IBM and Sun. But I value your thoughts, too.)
Thanks.
mvic
Posted: Thu Sep 16, 2010 6:52 am
SAFraser wrote:
> The "read" operation is five times slower on the sparse zone compared to the global zone.

Why the large difference? Is it somewhere in the configuration? Or is this "to be expected" in your environment? I guess these are questions that your support tickets are there to answer.
MQ can't go quicker if the I/O goes slower.
PeterPotkay
Posted: Thu Sep 16, 2010 6:54 am
Poobah (Joined: 15 May 2001, Posts: 7722)
Are the puts done under syncpoint, in batches, reducing how often MQ actually has disk I/O to just the frequency of the MQCMITs?
Are the gets done one at a time, meaning MQ has disk I/O each time?
_________________
Peter Potkay
Keep Calm and MQ On
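Peter's distinction can be mimicked outside MQ with a small sketch: forcing data to disk once per message versus once per batch of 100. Plain files and os.fsync stand in for the queue manager's recovery log here; the file names and batch sizes are hypothetical, chosen only to make the I/O count visible.

```python
import os
import tempfile

def write_messages(path, messages, batch_size):
    """Append each message to a log file and force it to disk with
    fsync once per batch: batch_size=1 models one forced write per
    message; batch_size=100 models committing in batches (MQCMIT-style)."""
    forced_writes = 0
    with open(path, "wb") as log:
        for i, msg in enumerate(messages, start=1):
            log.write(msg)
            if i % batch_size == 0:
                log.flush()
                os.fsync(log.fileno())  # wait for the data to reach disk
                forced_writes += 1
    return forced_writes

msgs = [b"x" * 128] * 1000  # 1,000 small test messages

with tempfile.TemporaryDirectory() as tmp:
    per_message = write_messages(os.path.join(tmp, "one.log"), msgs, batch_size=1)
    per_batch = write_messages(os.path.join(tmp, "batch.log"), msgs, batch_size=100)

print(per_message, per_batch)  # 1000 forced disk writes vs. 10
```

If each forced write carries the extra synchronous-write latency seen on the sparse zone, the one-at-a-time pattern pays it a thousand times over while the batched pattern pays it ten times.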
SAFraser
Posted: Thu Sep 16, 2010 6:57 am
mvic, my current question is: how much 'read' I/O is involved in a 'get' operation? I'm not sure if the 'read' degradation is relevant or a red herring.
I think that's the right question. Which, yes, is being pursued with both tech supports.
SAFraser
Posted: Thu Sep 16, 2010 7:02 am
Peter, good thought! But I don't think we do any darn thing under syncpoint at our site. You meant syncpoint by the application, right? Not the under-the-covers stuff that MQ does to assure once-and-once-only delivery?
PeterPotkay
Posted: Thu Sep 16, 2010 7:05 am
SAFraser wrote:
> Peter, good thought! But I don't think we do any darn thing under syncpoint at our site. You meant syncpoint by the application, right? Not the under-the-covers stuff that MQ does to assure once-and-once-only delivery?

Correct.
_________________
Peter Potkay
Keep Calm and MQ On
mvic
Posted: Thu Sep 16, 2010 7:08 am
SAFraser wrote:
> mvic, my current question is: how much 'read' I/O is involved in a 'get' operation? I'm not sure if the 'read' degradation is relevant or a red herring.

Difficult to generalise, but if your queue is building up, and/or your message throughput is high, probably each MQGET results in some I/O. If it's persistent and/or in syncpoint, it'll result in I/O. If you're doing only non-persistent on an empty queue, then MQ tries (and maybe even succeeds, if there is a waiting MQGET) to do no I/O at all.
For such a massive I/O performance difference, it has got to hurt. I take it from the way you presented the info that the difference is seen at an OS or driver level? If so, there has got to be some OS or driver explanation. IMHO.
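mvic's rule of thumb can be restated as a tiny predicate. This is only a paraphrase of the post above, not the queue manager's actual logic, and the function name is hypothetical:

```python
def mqget_causes_log_io(persistent: bool, in_syncpoint: bool) -> bool:
    """Rule of thumb from the post above: a get of a persistent message,
    or any get inside a unit of work (syncpoint), results in recovery-log
    I/O; a non-persistent get outside syncpoint can often avoid disk I/O."""
    return persistent or in_syncpoint

# Persistent gets outside syncpoint still hit the disk on every MQGET
print(mqget_causes_log_io(persistent=True, in_syncpoint=False))   # True
# Best case: non-persistent, outside syncpoint
print(mqget_causes_log_io(persistent=False, in_syncpoint=False))  # False
```

Under this rule, a drain utility doing destructive gets of persistent messages pays the synchronous-write cost on every single message, which is consistent with the GET-side slowdown reported in this thread.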
SAFraser
Posted: Thu Sep 16, 2010 4:26 pm
Root cause has been determined.
The ZFS filesystem for sparse zones does synchronous writes to disk: data is first written to a buffer, then to disk, and control is not returned to the application until the write to disk has succeeded. In ZFS, this introduces noticeable latency.
We turned off the synchronous write flag on the sparse zone, and the performance issue was completely corrected.
Unfortunately, running without synchronous write protection is not feasible. If the server crashes with messages in the buffer, we believe the queue manager would not come back up, as its logs would not reflect correct information.
For the short term, we may try to move /var/mqm/* to SAN storage. Longer term, we will rebuild everything on global zones.
I am surprised no one has posted about this before. Surely we can't be the only site that has noticed this problem!
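The thread never names the exact setting, so the following is an illustration under an assumption, not a record of what was done here: on ZFS releases that expose a per-dataset sync property, the behaviour described above can be inspected and relaxed like this (the pool/dataset name is hypothetical):

```shell
# Inspect how the dataset honours synchronous write requests
zfs get sync zonepool/sparse-zone-root

# Trade durability for latency: acknowledge writes before they reach disk.
# NOT safe for MQ recovery logs, per the crash scenario described above.
zfs set sync=disabled zonepool/sparse-zone-root

# Restore the default behaviour
zfs set sync=standard zonepool/sparse-zone-root
```

On the Solaris 10 vintage in this thread, the equivalent control may instead be a system-wide tunable rather than a per-dataset property, which is exactly the kind of detail the Sun ticket would confirm.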
mvic
Posted: Thu Sep 16, 2010 4:41 pm
SAFraser wrote:
> I am surprised no one has posted about this before. Surely we can't be the only site that has noticed this problem!

What was this Solaris setting you changed? Is there a web page you can give us a link to? Thanks.
PeterPotkay
Posted: Thu Sep 16, 2010 4:56 pm
SAFraser wrote:
> The ZFS filesystem for sparse zones does synchronous writes to disk: data is first written to a buffer, then to disk, and control is not returned to the application until the write to disk has succeeded. In ZFS, this introduces noticeable latency.

But PUTs were faster than GETs? I would think both would be equally affected.
_________________
Peter Potkay
Keep Calm and MQ On