QSG CHIN high CPU / workload skew - why?
GheorgheDragos
Posted: Thu Sep 24, 2020 1:09 pm    Post subject: QSG CHIN high CPU / workload skew - why?
Acolyte, Joined: 28 Jun 2018, Posts: 51
Dear MQ community,
For several days now, up to a week or more, the MQ team has been getting complaints that two of our channel initiators (from different QSGs but on the same LPAR) are running with high CPU and causing performance issues for the whole LPAR. The problem does not occur in production. The QSG setup is the same. We use two gateway queue managers (one on each LPAR) in the queue sharing group. The number of dispatcher tasks is the same on both, and MAXCHL is the same. How should I interpret the data below? Why is the workload going mostly to B, and why is B showing such high numbers in the charts below?
CPUA is the OK one, CPUB is not (fake names, not the real LPAR names). Stats gathered with MP1B.
Code:
DNS
mvs qm qsg date time Type Requests Busy% Total CPU CPU% avg CPU avg ET
CPUA MQA QSG 24/09/2020 18:01:28 DNS 0 0.0 0.000000 0.0 0.0 0
CPUA MQA QSG 24/09/2020 18:31:22 DNS 0 0.0 0.000000 0.0 0.0 0
CPUA MQA QSG 24/09/2020 19:01:16 DNS 0 0.0 0.000000 0.0 0.0 0
CPUA MQA QSG 24/09/2020 19:31:11 DNS 0 0.0 0.000000 0.0 0.0 0
CPUA MQA QSG 24/09/2020 19:59:53 DNS 982 0.0 0.004972 0.0 5.1 1
mvs qm qsg date time Type Requests Busy% Total CPU CPU% avg CPU avg ET
CPUB MQB QSG 24/09/2020 18:04:42 DNS 2622 0.0 0.023521 0.0 9.0 4
CPUB MQB QSG 24/09/2020 18:34:37 DNS 2490 0.0 0.022688 0.0 9.1 3
CPUB MQB QSG 24/09/2020 19:04:31 DNS 2548 0.0 0.019088 0.0 7.5 2
CPUB MQB QSG 24/09/2020 19:34:25 DNS 2492 0.0 0.023597 0.0 9.5 3
SSL
mvs qm qsg date time Type Requests Busy% Total CPU CPU% avg CPU avg ET longest request Date Time
CPUA MQA QSG 24/09/2020 18:01:28 SSL 2239868 0.2 2.025935 0.2 5.5 7 73 24/09/2020 18:01:01.537272
CPUA MQA QSG 24/09/2020 18:31:22 SSL 2610255 0.2 2.454760 0.2 5.5 7 13 24/09/2020 18:02:12.108316
CPUA MQA QSG 24/09/2020 19:01:16 SSL 1893426 0.1 1.661657 0.1 5.1 6 21 24/09/2020 18:32:06.223394
CPUA MQA QSG 24/09/2020 19:31:11 SSL 1909699 0.1 1.786110 0.1 5.3 6 70 24/09/2020 19:02:10.645409
CPUA MQA QSG 24/09/2020 19:59:53 SSL 1872263 0.2 3.279193 0.2 9.9 11 17 24/09/2020 19:45:21.998967
mvs qm qsg date time Type Requests Busy% Total CPU CPU% avg CPU avg ET longest request Date Time
CPUB MQB QSG 24/09/2020 18:04:42 SSL 2944971 3.2 8.067913 0.5 15.7 97 1721 24/09/2020 18:02:05.566018
CPUB MQB QSG 24/09/2020 18:34:37 SSL 2939129 1.8 8.179115 0.5 15.4 55 240 24/09/2020 18:05:12.318716
CPUB MQB QSG 24/09/2020 19:04:31 SSL 3728686 1.1 8.400311 0.5 13.1 26 154 24/09/2020 18:45:00.996398
CPUB MQB QSG 24/09/2020 19:34:25 SSL 3709952 3.3 8.438746 0.6 13.4 80 371 24/09/2020 19:30:00.976832
DISP
mvs qm qsg date time Type Requests Busy% Total CPU CPU% avg CPU avg ET
CPUA MQA QSG 24/09/2020 18:01:28 DISP 6726835 0.4 0.157831 0.4 7.9 9
CPUA MQA QSG 24/09/2020 18:31:22 DISP 7827337 0.4 0.187892 0.3 7.8 8
CPUA MQA QSG 24/09/2020 19:01:16 DISP 5726524 0.2 0.173564 0.2 6.9 6
CPUA MQA QSG 24/09/2020 19:31:11 DISP 5790634 0.2 0.176444 0.2 7.1 7
CPUA MQA QSG 24/09/2020 19:59:53 DISP 5606207 0.2 0.466262 0.2 7.1 6
mvs qm qsg date time Type Requests Busy% Total CPU CPU% avg CPU avg ET
CPUB MQB QSG 24/09/2020 18:04:42 DISP 6832301 0.9 8.178705 0.5 11.9 23
CPUB MQB QSG 24/09/2020 18:34:37 DISP 6752118 0.6 8.971264 0.4 11.8 16
CPUB MQB QSG 24/09/2020 19:04:31 DISP 9018203 0.6 8.376705 0.5 10.7 12
CPUB MQB QSG 24/09/2020 19:34:25 DISP 9010019 0.9 9.356962 0.6 11.9 17
In the TASKET tab I have 2525 counts with a total CPU time of 81 seconds for CPUA, while for CPUB I have 10000 counts with a total of 75s CPU time! How does that make sense? This is beyond my current understanding.
Thank you in advance.
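A minimal MQSC sketch of how the two CHIN configurations could be compared, since the post says dispatchers and MAXCHL match (a hedged illustration, not output from this system; the values in the ALTER are placeholders, and SSLTASKS/CHIDISPS changes only take effect when the channel initiator is restarted):
Code:
* Channel initiator sizing on each gateway queue manager
* (CHIDISPS = dispatcher tasks, SSLTASKS = TLS tasks, MAXCHL/ACTCHL = channel limits)
DISPLAY QMGR CHIDISPS SSLTASKS MAXCHL ACTCHL TCPCHL

* Show the running CHIN: adapters, dispatchers and SSL tasks actually in use
DISPLAY CHINIT

* Placeholder values - only if one CHIN turns out to need more SSL tasks or dispatchers
ALTER QMGR SSLTASKS(8) CHIDISPS(20)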
gbaddeley
Posted: Thu Sep 24, 2020 4:05 pm
Jedi Knight, Joined: 25 Mar 2003, Posts: 2538, Location: Melbourne, Australia
High CHIN CPU can indicate excessive MQ channel activity, including connect/disconnect cycles and MQI calls made by client channels. Also check your queue depths and indexing.
_________________
Glenn
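As an illustration of these checks (a hedged MQSC sketch; the queue name is a placeholder, not one from this system):
Code:
* Message and batch counts per running channel instance - high MSGS with a
* small negotiated BATCHSZ, or constantly restarting channels, points at
* connect/disconnect churn
DISPLAY CHSTATUS(*) CURRENT MSGS BATCHES BATCHSZ

* Current depth and open handles for a queue (placeholder name)
DISPLAY QSTATUS(APP.REQUEST.QUEUE) CURDEPTH IPPROCS OPPROCS

* INDXTYPE matters on z/OS when getting by MsgId/CorrelId from deep queues
DISPLAY QLOCAL(APP.REQUEST.QUEUE) INDXTYPE MAXDEPTH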
hughson
Posted: Thu Sep 24, 2020 6:57 pm    Post subject: Re: QSG CHIN high CPU / workload skew - why?
Padawan, Joined: 09 May 2013, Posts: 1959, Location: Bay of Plenty, New Zealand
GheorgheDragos wrote:
Why is the workload going mostly to B?
Could you tell us a little more about how the workload chooses where to go? Are these short-lived client connections, or connections in from other queue managers (which would be assumed to be longer lived)? Do you use Sysplex Distributor, VIPAs, CCDTs, or something else to have these connections choose A or B?
GheorgheDragos wrote:
In the TASKET tab I have 2525 counts with a total CPU time of 81 seconds for CPUA, while for CPUB I have 10000 counts with a total of 75s CPU time! How does that make sense?
You don't show us these numbers in your output - could you add that data too?
One way that those totals could make sense would be if the connections which go to CPUA manage to do some work, but those that go to CPUB don't spend very long there, which might suggest that they fail because a resource they need is missing. Are QMA and QMB identical in all ways - i.e. do they have all the queues that are needed, and so on?
Cheers,
Morag
_________________
Morag Hughson @MoragHughson
IBM MQ Technical Education Specialist
Get your IBM MQ training here!
MQGem Software
GheorgheDragos
Posted: Thu Sep 24, 2020 10:01 pm
Acolyte, Joined: 28 Jun 2018, Posts: 51
Hello,
The problem is that only the gateway queue managers show very high CPU: the one on one LPAR is high while the other shows almost nothing. Both of the GW QMGRs are listening on the same port.
There is an equal number of local queue managers on CPUA and CPUB. Connection to the QSG is done via SVRCONN/CLNTCONN. How the QSG chooses where to distribute the work I don't know exactly, since as far as I know MQ does not have a real workload balancer; however, the numbers seem balanced in terms of shared puts/gets (also taken from MP1B). So why does one channel initiator, on an LPAR that is identical to the other, spend so much more CPU% in DNS, DISP and SSL? I don't understand. The queue managers on CPUA and CPUB all host shared queues, obviously, but they also host local queues for non-HA applications. But why would an HA application access non-HA queues? Everything else is identical: number of CICS regions, etc.
Code:
CPUA
Date Time Type Tran1 Tran2 Count CPU S logBytesMB put MB Get MB
2020/09/24 17:00:00 M xxxxxxxx 10.212.203 50 0,00893 0 0 0
2020/09/24 18:00:00 M xxxxxxxx 10.224.228 2 0,03164 0.84247 0 0
2020/09/24 18:00:00 M xxxxxxxx 10.212.202 2 0,02260 0 0 0
2020/09/24 18:00:00 M xxxxxxxx 10.212.194 4 0,03793 0 0 0
2020/09/24 18:00:00 M xxxxxxxx 10.212.194 10 0,11330 0 0 0
2020/09/24 18:00:00 M xxxxxxxx 10.212.194 20 5,94150 0 0 0
2020/09/24 18:00:00 M xxxxxxxx 10.212.203 2000 4,81220 2.8282 0 0.54474
2020/09/24 18:00:00 M xxxxxxxx 10.212.203 20 1,80220 0 0 108.71
2020/09/24 18:00:00 M xxxxxxxx 10.212.224 20 1,64010 0 0 0
2020/09/24 18:00:00 M xxxxxxxx 10.212.225 100 18,08600 0.23779 0 0.13488
2020/09/24 18:00:00 M xxxxxxxx 10.212.194 4 0,16914 0 0 0
2020/09/24 19:00:00 M xxxxxxxx 10.224.228 2 0,01723 1.0133 0 0
2020/09/24 19:00:00 M xxxxxxxx 10.212.202 2 0,02396 0 0 0
2020/09/24 19:00:00 M xxxxxxxx 10.212.194 4 0,04046 0 0 0
2020/09/24 19:00:00 M xxxxxxxx 10.212.194 10 0,12677 0 0 0
2020/09/24 19:00:00 M xxxxxxxx 10.212.194 50 6,38420 24.371 0 0.015259
2020/09/24 19:00:00 M xxxxxxxx 10.212.203 20 1,87190 0 0 125
2020/09/24 19:00:00 M xxxxxxxx 10.212.224 20 1,74550 0 0 0
2020/09/24 19:00:00 M xxxxxxxx 10.212.225 100 21,67600 4.4541 0 1.3639
2020/09/24 19:00:00 M xxxxxxxx 10.212.194 4 0,17811 0 0 0
2020/09/24 20:00:00 M xxxxxxxx 10.224.228 1 0,00004 0 0 0
2020/09/24 20:00:00 M xxxxxxxx 10.212.202 1 0,01179 0 0 0
2020/09/24 20:00:00 M xxxxxxxx 10.212.194 2 0,02047 0 0 0
2020/09/24 20:00:00 M xxxxxxxx 10.212.194 6 0,06355 0 0 0
2020/09/24 20:00:00 M xxxxxxxx 10.212.194 10 3,20660 0 0 0
2020/09/24 20:00:00 M xxxxxxxx 10.212.203 9 0,91672 0 0 63.203
2020/09/24 20:00:00 M xxxxxxxx 10.212.224 10 0,91649 0.03764 0 6.5819
2020/09/24 20:00:00 M xxxxxxxx 10.212.225 40 11,38800 0 0 0
2020/09/24 20:00:00 M xxxxxxxx 10.212.194 2 0,08643 0 0 0
CPUB
Date Time Type Tran1 Tran2 Count CPU S logBytesMB put MB Get MB
2020/09/24 17:00:00 M XXXXX 10.212.194 3 0,00102 0.081135 0 0.02721
2020/09/24 17:00:00 M XXXXX 10.212.203 50 0,01605 0 0 0
2020/09/24 18:00:00 M XXXXX 10.212.194 50 0,51360 0 0 0
2020/09/24 18:00:00 M XXXXX 10.212.194 500 20,97100 0.13149 0 0.23563
2020/09/24 18:00:00 M XXXXX 10.173.71. 20 0,38541 0.0083981 0 0.0034599
2020/09/24 18:00:00 M XXXXX 10.212.203 3000 4,65350 2.3298 0 0.46959
2020/09/24 18:00:00 M XXXXX 10.212.202 20 2,28460 0 0 0
2020/09/24 18:00:00 M XXXXX 10.212.202 20 1,65450 0 0 0
2020/09/24 18:00:00 M XXXXX 10.212.202 4 0,44455 0 0 0
2020/09/24 18:00:00 M XXXXX 10.212.202 4 0,44264 0 0 0
2020/09/24 18:00:00 M XXXXX 10.212.202 80 3,42220 0.53914 0 0.16206
2020/09/24 18:00:00 M XXXXX 10.212.225 2 0,01920 0.069759 0 0.065398
2020/09/24 18:00:00 M XXXXX 10.212.129 4 0,07144 5.9453 0 3.4928
2020/09/24 19:00:00 M XXXXX 10.212.194 50 0,51573 0 0 0
2020/09/24 19:00:00 M XXXXX 10.212.194 500 22,13100 5.2621 0 3.5808
2020/09/24 19:00:00 M XXXXX 10.173.71. 10 0,37586 0 0 0
2020/09/24 19:00:00 M XXXXX 10.212.203 6000 8,92010 6.7903 0 1.3451
2020/09/24 19:00:00 M XXXXX 10.212.202 20 2,29600 0 0 0
2020/09/24 19:00:00 M XXXXX 10.212.202 20 1,63590 0 0 0
2020/09/24 19:00:00 M XXXXX 10.212.202 4 0,43805 0 0 0
2020/09/24 19:00:00 M XXXXX 10.212.202 4 0,43040 0 0 0
2020/09/24 19:00:00 M XXXXX 10.212.202 70 3,42850 0.055463 0 0.13924
2020/09/24 19:00:00 M XXXXX 10.212.225 2 0,06561 0.36759 0 0.3596
2020/09/24 19:00:00 M XXXXX 10.212.129 4 0,02668 2.1645 0 1.4387
2020/09/24 20:00:00 M XXXXX 10.212.194 4 0,00212 0.047628 0 0.0092936
2020/09/24 20:00:00 M XXXXX 10.212.203 80 0,34718 0.047455 0 0.022057
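To see the connection skew directly rather than inferring it from CPU, a hedged MQSC sketch (the channel name is a placeholder; CMDSCOPE(*) assumes the command can be routed to every active QSG member):
Code:
* Count current channel instances on each queue manager in the QSG
DISPLAY CHSTATUS(*) CMDSCOPE(*) CURRENT CONNAME STATUS

* Or narrow it to the shared SVRCONN the clients use (placeholder name)
DISPLAY CHSTATUS(GW.SVRCONN) CMDSCOPE(*) CURRENT CONNAME CHSTADA CHSTATI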
hughson
Posted: Thu Sep 24, 2020 10:07 pm
Padawan, Joined: 09 May 2013, Posts: 1959, Location: Bay of Plenty, New Zealand
GheorgheDragos wrote:
Both of the GW QMGRs are listening on the same port.
And what is the technology that routes the clients to one or other LPAR?
From everything you have said so far, it would seem that a lot more clients are routed to CPUB than CPUA, so the above question is quite important! This would account for more CPU in the DNS task, SSL tasks and DISPatcher tasks, as those are the things client connections would use.
Cheers,
Morag
_________________
Morag Hughson @MoragHughson
IBM MQ Technical Education Specialist
Get your IBM MQ training here!
MQGem Software
GheorgheDragos
Posted: Thu Sep 24, 2020 10:13 pm
Acolyte, Joined: 28 Jun 2018, Posts: 51
I'm sorry, I don't understand. A CLNTCONN uses a CCDT file to connect to a shared channel on the mainframe. How the workload is split between A and B I don't know yet; the TCP/IP task on z/OS, I guess, but I don't know.
Edit: apparently there is a Sysplex Distributor running to balance the IPs. Never heard of that one before.
Edit 2: the problem doesn't seem to be uneven workload distribution, but the fact that on one LPAR the gateway CHIN spends much more time and CPU than the other.
GheorgheDragos
Posted: Thu Sep 24, 2020 10:52 pm
Acolyte, Joined: 28 Jun 2018, Posts: 51
Moreover, in MSGM we have more or less equal numbers:
Code:
MVS QM Date Time Puts Put1s Gets Open Close Inquire Set Close all H Sub SubR Reg CB Control Stat Publish PersPuts NPersPuts PersBytsPut NPersBytsPut
CPUA MQA 24/09/2020 18:01:28 10825 7 515095 58182 58178 9184 0 0 0 0 542218 495164 0 0 402 10436 331242 23468749
CPUA MQA 24/09/2020 18:31:22 10335 48 508796 49264 49295 8774 0 0 0 0 517361 478681 0 0 247 10142 214584 20857070
CPUA MQA 24/09/2020 19:01:16 10073 0 469256 1532 1533 988 0 0 0 0 387988 387991 0 0 82 9997 61957 16595862
CPUA MQA 24/09/2020 19:31:10 10243 498 471700 1530 1532 988 0 0 0 0 393000 395003 0 0 729 10018 1573570 17225862
CPUA MQA 24/09/2020 20:01:05 10952 0 473516 3946 3901 3667 0 0 0 0 388872 388961 0 0 951 10007 25096289 15517766
Code:
MVS QM Date Time Puts Put1s Gets Open Close Inquire Set Close all H Sub SubR Reg CB Control Stat Publish PersPuts NPersPuts PersBytsPut NPersBytsPut
CPUB MQB 24/09/2020 18:04:42 10998 0 578826 80184 80064 13764 0 0 0 0 215880 148814 0 0 887 10129 3010612 16039250
CPUB MQB 24/09/2020 18:34:37 10577 0 594522 75927 75815 11850 0 0 0 0 206953 141334 0 0 475 10118 1614906 15852338
CPUB MQB 24/09/2020 19:04:31 10335 0 657688 123020 122841 19433 0 0 0 0 337048 232711 0 0 223 10130 575430 16042447
CPUB MQB 24/09/2020 19:34:25 10981 0 623083 131931 131812 19800 0 0 0 0 362101 251467 0 0 887 10110 2730701 15617762
bruce2359
Posted: Fri Sep 25, 2020 6:10 am
Poobah, Joined: 05 Jan 2008, Posts: 9469, Location: US: west coast, almost. Otherwise, enroute.
GheorgheDragos wrote:
I'm sorry, I don't understand. A CLNTCONN uses a CCDT file to connect to a shared channel on the mainframe. How the workload is split between A and B I don't know yet; the TCP/IP task on z/OS, I guess, but I don't know.
If I understand correctly, the inbound connections are from midrange Windows or UNIX. What channel and IP address are in the CCDT? Is the IP defined as a shared port on z/OS, or a local port for a single qmgr?
_________________
I like deadlines. I like to wave as they pass by.
ב''ה
Lex Orandi, Lex Credendi, Lex Vivendi. As we Worship, So we Believe, So we Live.
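On the shared-port point, a hedged sketch of the z/OS MQSC involved (port numbers and channel name are placeholders, not this system's definitions):
Code:
* A shared port is a group listener - the CHIN of every QSG member listens on it
START LISTENER TRPTYPE(TCP) PORT(1544) INDISP(GROUP)

* A port for a single queue manager would instead be started with INDISP(QMGR)
START LISTENER TRPTYPE(TCP) PORT(1545) INDISP(QMGR)

* The inbound SVRCONN is typically a group object so both GW qmgrs share the definition
DEFINE CHANNEL(GW.SVRCONN) CHLTYPE(SVRCONN) TRPTYPE(TCP) QSGDISP(GROUP)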
GheorgheDragos
Posted: Fri Sep 25, 2020 7:45 am
Acolyte, Joined: 28 Jun 2018, Posts: 51
Hello Bruce,
All incoming connections go to the shared port, and then the connection reaches one of the two clones, on LPAR A or B. How this is chosen, in depth, I don't have much of a clue, as there is no load balancer between the incoming connections and the QSG. We use an F5 in front of agencies, ARMs, etc., but not in this case. And it is curious, since the setup is identical, and in PRD we do not have the issue (the CPU% usage of both gateway CHINs is reasonably balanced).
PMR opened with IBM today.
Dragos
hughson
Posted: Sat Sep 26, 2020 8:02 pm
Padawan, Joined: 09 May 2013, Posts: 1959, Location: Bay of Plenty, New Zealand
You have told us that a CCDT is used, that both GW QMgrs listen on the same port, and that a Sysplex Distributor is also in the picture. Could you complete our understanding of this part of the puzzle by showing us the CCDT CLNTCONN definitions, or at a minimum telling us whether their CONNAMEs are the same (i.e. the Sysplex Distributor is doing the routing) or different (i.e. the CCDT is choosing the route)?
You opened this thread saying that more connections were going to one system than the other, but now you say they are even. What data has led you to this new understanding?
Cheers,
Morag
_________________
Morag Hughson @MoragHughson
IBM MQ Technical Education Specialist
Get your IBM MQ training here!
MQGem Software
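To illustrate the two cases (a hedged sketch of CLNTCONN definitions as they might appear in a script used to build the CCDT; the channel name, DVIPA and host names are placeholders):
Code:
* Case 1: a single CONNAME pointing at the Sysplex Distributor DVIPA -
* the Distributor decides whether the connection lands on LPAR A or B
DEFINE CHANNEL(QSG.SVRCONN) CHLTYPE(CLNTCONN) TRPTYPE(TCP) +
       CONNAME('qsg-dvipa.example.com(1544)') QMNAME(QSG)

* Case 2: a connection name list naming each LPAR's stack - here the
* client-side CCDT processing, not the Sysplex Distributor, picks the route
DEFINE CHANNEL(QSG.SVRCONN) CHLTYPE(CLNTCONN) TRPTYPE(TCP) +
       CONNAME('lpara.example.com(1544),lparb.example.com(1544)') QMNAME(QSG)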
fjb_saper
Posted: Sun Sep 27, 2020 10:58 am
Grand High Poobah, Joined: 18 Nov 2003, Posts: 20756, Location: LI, NY
You say that you have HA traffic and non-HA traffic...
Do you separate the channels for these? Channels that do not use the same IP?
Channels with the correct IP in the local address (LOCLADDR) field?
_________________
MQ & Broker admin
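A hedged sketch of that kind of separation (channel names, transmission queues and IPs are placeholders, not definitions from this system):
Code:
* HA traffic bound to one local IP, non-HA traffic to another, so each class
* of work leaves the LPAR from its own address
DEFINE CHANNEL(TO.PARTNER.HA) CHLTYPE(SDR) TRPTYPE(TCP) +
       CONNAME('partner.example.com(1414)') XMITQ(PARTNER.HA.XMITQ) +
       LOCLADDR('10.1.1.1')
DEFINE CHANNEL(TO.PARTNER.NONHA) CHLTYPE(SDR) TRPTYPE(TCP) +
       CONNAME('partner.example.com(1414)') XMITQ(PARTNER.XMITQ) +
       LOCLADDR('10.1.1.2')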
GheorgheDragos
Posted: Tue Sep 29, 2020 4:11 am
Acolyte, Joined: 28 Jun 2018, Posts: 51
Dear MQ community,
The last few days have been hectic. A quick update though: with your suggestions, the IBM support team and Gwydion, we found out that load balancing is seriously skewed towards CPUB, and, after activating STATCHL and running the newest version of MP1B, we found the culprit for the very high CPU%: two or three channels from the same application opening, putting, getting, closing and disconnecting for one message!
Will continue with updates that might help newer members in the future.
Dragos
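For anyone following the same path, a hedged sketch of how channel accounting can be switched on for MP1B to process (as I understand the z/OS MQSC; verify the SMF and accounting-trace setup for your own site before copying):
Code:
* Collect channel accounting data at HIGH detail
ALTER QMGR STATCHL(HIGH)

* Channel accounting on z/OS also needs accounting trace class 4 active
START TRACE(ACCTG) CLASS(4)
The resulting SMF records are what MP1B summarises per channel, which is the view that exposed the connect-per-message application described above.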
zpat
Posted: Tue Sep 29, 2020 6:04 am
Jedi Council, Joined: 19 May 2001, Posts: 5866, Location: UK
SCTQ "message relocation" can consume a lot of CPU; check for retrying channels.
_________________
Well, I don't think there is any question about it. It can only be attributable to human error. This sort of thing has cropped up before, and it has always been due to human error.
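A hedged one-liner for that check (standard MQSC; add CMDSCOPE(*) if you want every QSG member at once):
Code:
* List channels currently in retry, and the cluster transmit queue depth
DISPLAY CHSTATUS(*) WHERE(STATUS EQ RETRYING)
DISPLAY QSTATUS(SYSTEM.CLUSTER.TRANSMIT.QUEUE) CURDEPTH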
GheorgheDragos
Posted: Tue Oct 27, 2020 3:30 am
Acolyte, Joined: 28 Jun 2018, Posts: 51
Hello, good afternoon.
The problem has been found: uneven LPAR weight allocation was causing more connections to prefer one LPAR over the other.
Dragos