MQSeries.net Forum Index » IBM MQ Installation/Configuration Support » RDQM in "Ended Unexpectedly" status - clueless..

mqdev
PostPosted: Mon Jul 15, 2019 7:23 am    Post subject: RDQM in "Ended Unexpectedly" status - clueless..

Hello,
Mine is a 3-node RDQM HA group hosting four replicated queue managers: QM0, QM1, QM2 and QM3.

QM1/2/3 are healthy and Running on one of the 3 nodes. QM0, however, is in "Ended unexpectedly" status on one of the nodes, while the other 2 nodes report "Running elsewhere" for QM0.

Here is my investigation:
1. QM1/2/3 are healthy - able to fail over etc. with no issues whatsoever.
2. QM0 is generating an FDC with the following:

| Major Errorcode :- xecF_E_UNEXPECTED_SYSTEM_RC |
| Minor Errorcode :- OK |
| Probe Type :- MSGAMQ6119 |
| Probe Severity :- 1 |
| Probe Description :- AMQ6119S: An internal IBM MQ error has occurred (Rc=5 |
| from write) |
| FDCSequenceNumber :- 0 |
| Comment1 :- Rc=5 from write |


MQM Function Stack
amqzxma0
zxcExecutionController
zxcStartupLogic
kpiStartup
apiStartup
almPerformReDoPass
apiReplayLogRecord
aqmReplayLog
aqlDoReplayLog
aqpWriteData
adhWrite
adiWriteFile
xcsFFST

MQM Trace History
-------{ xcsFreeMemFn
Data: 0x00000000 0x0243eb30
-------} xcsFreeMemFn rc=OK
------} xcsFreeQuickCellEnumerator rc=OK
-----} xcsSearchQuickCellBlock rc=xecU_W_KEY_NOT_FOUND
-----{ xlsInitMutex
------{ xlsSpinLockInit

The above FDC was generated when I rebooted the node where QM0 is "Ended unexpectedly".
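
For what it is worth on decoding the probe: the "Rc=5 from write" in AMQ6119 is the operating-system return code, and errno 5 on Linux is EIO (I/O error). The function stack shows the failure happening while the queue manager replays its recovery log at startup (almPerformReDoPass down to adiWriteFile), i.e. while writing to the DRBD-backed queue manager filesystem. A minimal, hedged set of checks along those lines - the DRBD resource name qm0_test_rdqm is an assumption, inferred from the naming of the other cluster resources:

Code:
# Decode OS return code 5 (expected: "Input/output error")
python -c 'import os; print(os.strerror(5))'

# Look for disk or DRBD errors in the kernel log around the failure time
dmesg | grep -iE 'drbd|i/o error'

# Check the DRBD replication state for the queue manager's volume
# (resource name qm0_test_rdqm is an assumption)
drbdadm status qm0_test_rdqm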


root@host01 /var/mqm/errors
# dspmq
QMNAME(QM0.TEST_RDQM) STATUS(Ended unexpectedly)
QMNAME(QM1.TEST_RDQM) STATUS(Running elsewhere)
QMNAME(QM2.TEST_RDQM) STATUS(Running elsewhere)
QMNAME(QM3.TEST_RDQM) STATUS(Running elsewhere)

root@host01 /var/mqm/errors
#




Now (post reboot), if I attempt to start the QM (strmqm <QM Name>), I get the following error:

# strmqm QM0.TEST_RDQM
IBM MQ queue manager 'QM0.TEST_RDQM' HA sub-system is already active.

root@host01 /var/mqm/errors



Any ideas as to how to troubleshoot this error? There is nothing suspicious under /var/mqm/vols/qmgrs for QM0, the QM under duress.
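
One hedged aside on the strmqm message above: an HA RDQM is started and stopped by the Pacemaker rdqm resource agent rather than by strmqm directly, so being refused while the HA sub-system is active is expected. A sketch of the Pacemaker-side route instead - the resource name qm0_test_rdqm is an assumption based on RDQM's resource naming convention:

Code:
# Per-queue-manager RDQM view, including HA status
rdqmstatus -m QM0.TEST_RDQM

# Clear any recorded failures for the Pacemaker resource, then start it
# (resource name qm0_test_rdqm is an assumption)
crm resource cleanup qm0_test_rdqm
crm resource start qm0_test_rdqm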


hughson
PostPosted: Mon Jul 15, 2019 3:01 pm

What is the output for the following commands:-
Code:
rdqmstatus

and:-
Code:
crm status

and are there any logged issues with Pacemaker
Code:
systemctl status pacemaker

or with Corosync
Code:
systemctl status corosync

Cheers,
Morag
_________________
Morag Hughson @MoragHughson
IBM MQ Technical Education Specialist
Get your IBM MQ training here!
MQGem Software
mqdev
PostPosted: Tue Jul 16, 2019 8:11 am

Results from host03:

root@host03 ~
# rdqmstatus
Node: host03.xyz.com

Queue manager name: QM0.TEST.RDQM
Queue manager status: Running elsewhere
HA current location: host01.xyz.com

Queue manager name: QM1.TEST.RDQM
Queue manager status: Running elsewhere
HA current location: host02.xyz.com

Queue manager name: QM2.TEST.RDQM
Queue manager status: Running elsewhere
HA current location: host02.xyz.com

Queue manager name: QM3.TEST.RDQM
Queue manager status: Running elsewhere
HA current location: host02.xyz.com

root@host03 ~
root@host03 ~
# dspmq
QMNAME(QM0.TEST.RDQM) STATUS(Running elsewhere)
QMNAME(QM1.TEST.RDQM) STATUS(Running elsewhere)
QMNAME(QM2.TEST.RDQM) STATUS(Running elsewhere)
QMNAME(QM3.TEST.RDQM) STATUS(Running elsewhere)

root@host03 ~

============================================================================

Results from host02:

root@host02 ~
# rdqmstatus
Node: host02.xyz.com

Queue manager name: QM0.TEST.RDQM
Queue manager status: Running elsewhere
HA current location: host01.xyz.com

Queue manager name: QM1.TEST.RDQM
Queue manager status: Running
HA current location: This node

Queue manager name: QM2.TEST.RDQM
Queue manager status: Running
HA current location: This node

Queue manager name: QM3.TEST.RDQM
Queue manager status: Running
HA current location: This node

root@host02 ~
root@host02 ~
# dspmq
QMNAME(QM0.TEST.RDQM) STATUS(Running elsewhere)
QMNAME(QM1.TEST.RDQM) STATUS(Running)
QMNAME(QM2.TEST.RDQM) STATUS(Running)
QMNAME(QM3.TEST.RDQM) STATUS(Running)

root@host02 ~

============================================================================


Results from host01:

root@host01 /var/mqm/errors
# rdqmstatus
Node: host01.xyz.com

Queue manager name: QM0.TEST.RDQM
Queue manager status: Ended unexpectedly
HA current location: This node

Queue manager name: QM1.TEST.RDQM
Queue manager status: Running elsewhere
HA current location: host02.xyz.com

Queue manager name: QM2.TEST.RDQM
Queue manager status: Running elsewhere
HA current location: host02.xyz.com

Queue manager name: QM3.TEST.RDQM
Queue manager status: Running elsewhere
HA current location: host02.xyz.com

root@host01 /var/mqm/errors
# dspmq
QMNAME(QM0.TEST.RDQM) STATUS(Ended unexpectedly)
QMNAME(QM1.TEST.RDQM) STATUS(Running elsewhere)
QMNAME(QM2.TEST.RDQM) STATUS(Running elsewhere)
QMNAME(QM3.TEST.RDQM) STATUS(Running elsewhere)

-------------------------------------------------------------------------------------------------------------------------

crm status output:
------------------

Stack: corosync
Current DC: host03.xyz.com (version 1.1.15.linbit-2.0+20160622+e174ec8.el7-e174ec8) - partition with quorum
Last updated: Tue Jul 16 12:02:11 2019 Last change: Tue Jul 16 10:06:33 2019 by root via crm_attribute on host02.xyz.com

3 nodes and 24 resources configured

Online: [ host02.xyz.com host01.xyz.com host03.xyz.com ]

Full list of resources:

Master/Slave Set: ms_drbd_qm0_test_rdqm [p_drbd_qm0_test_rdqm]
    Masters: [ host01.xyz.com ]
    Slaves: [ host02.xyz.com host03.xyz.com ]
p_fs_qm0_test_rdqm (ocf::heartbeat:Filesystem): Started host01.xyz.com
p_rdqmx_qm0_test_rdqm (ocf::ibm:rdqmx): Started host01.xyz.com
qm0_test_rdqm (ocf::ibm:rdqm): Stopped
Master/Slave Set: ms_drbd_qm1_test_rdqm [p_drbd_qm1_test_rdqm]
    Masters: [ host02.xyz.com ]
    Slaves: [ host01.xyz.com host03.xyz.com ]
p_fs_qm1_test_rdqm (ocf::heartbeat:Filesystem): Started host02.xyz.com
p_rdqmx_qm1_test_rdqm (ocf::ibm:rdqmx): Started host02.xyz.com
qm1_test_rdqm (ocf::ibm:rdqm): Started host02.xyz.com
Master/Slave Set: ms_drbd_qm2_test_rdqm [p_drbd_qm2_test_rdqm]
    Masters: [ host02.xyz.com ]
    Slaves: [ host01.xyz.com host03.xyz.com ]
p_fs_qm2_test_rdqm (ocf::heartbeat:Filesystem): Started host02.xyz.com
p_rdqmx_qm2_test_rdqm (ocf::ibm:rdqmx): Started host02.xyz.com
qm2_test_rdqm (ocf::ibm:rdqm): Started host02.xyz.com
Master/Slave Set: ms_drbd_qm3_test_rdqm [p_drbd_qm3_test_rdqm]
    Masters: [ host02.xyz.com ]
    Slaves: [ host01.xyz.com host03.xyz.com ]
p_fs_qm3_test_rdqm (ocf::heartbeat:Filesystem): Started host02.xyz.com
p_rdqmx_qm3_test_rdqm (ocf::ibm:rdqmx): Started host02.xyz.com
qm3_test_rdqm (ocf::ibm:rdqm): Started host02.xyz.com

Failed Actions:
* qm0_test_rdqm_start_0 on host03.xyz.com 'unknown error' (1): call=142, status=complete, exitreason='none',
    last-rc-change='Mon Jul 15 10:36:39 2019', queued=1ms, exec=915ms
* qm0_test_rdqm_start_0 on host01.xyz.com 'unknown error' (1): call=91, status=complete, exitreason='none',
    last-rc-change='Tue Jul 16 10:15:35 2019', queued=0ms, exec=932ms
* qm0_test_rdqm_start_0 on host02.xyz.com 'unknown error' (1): call=132, status=complete, exitreason='none',
    last-rc-change='Mon Jul 15 10:36:34 2019', queued=0ms, exec=1244ms
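
A note on those Failed Actions, since they matter later in the thread: qm0_test_rdqm_start_0 failing on all three nodes means the start operation for the rdqm resource itself failed everywhere. By default Pacemaker treats a failed start as fatal (start-failure-is-fatal=true), which sets the resource's fail-count on that node to INFINITY and bans the resource from running there until the failure is cleaned up. A hedged way to inspect that with the crm shell already in use here:

Code:
# Show the accumulated fail count for the rdqm resource on each node
crm resource failcount qm0_test_rdqm show host01.xyz.com
crm resource failcount qm0_test_rdqm show host02.xyz.com
crm resource failcount qm0_test_rdqm show host03.xyz.com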

-------------------------------------------------------------------------------------------------------------------------

Pacemaker status:

# systemctl status pacemaker

● pacemaker.service - Pacemaker High Availability Cluster Manager
Loaded: loaded (/usr/lib/systemd/system/pacemaker.service; enabled; vendor preset: disabled)
Active: active (running) since Tue 2019-07-16 10:15:08 EDT; 1h 49min ago
Docs: man:pacemakerd
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/index.html
Main PID: 6247 (pacemakerd)
CGroup: /system.slice/pacemaker.service
├─6247 /usr/sbin/pacemakerd -f
├─6270 /usr/libexec/pacemaker/cib
├─6271 /usr/libexec/pacemaker/stonithd
├─6272 /usr/libexec/pacemaker/lrmd
├─6273 /usr/libexec/pacemaker/attrd
├─6274 /usr/libexec/pacemaker/pengine
└─6275 /usr/libexec/pacemaker/crmd

Jul 16 10:15:36 host01.xyz.com attrd[6273]: notice: Update relayed from host03.xyz.com
Jul 16 10:15:36 host01.xyz.com attrd[6273]: notice: Sending flush op to all hosts for: last-failure-qm0_test_rdqm (1563286536)
Jul 16 10:15:36 host01.xyz.com attrd[6273]: notice: Sent update 22: last-failure-qm0_test_rdqm=1563286536
Jul 16 10:15:36 host01.xyz.com attrd[6273]: notice: Update relayed from host03.xyz.com
Jul 16 10:15:36 host01.xyz.com attrd[6273]: notice: Sending flush op to all hosts for: fail-count-qm0_test_rdqm (INFINITY)
Jul 16 10:15:36 host01.xyz.com attrd[6273]: notice: Sent update 24: fail-count-qm0_test_rdqm=INFINITY
Jul 16 10:15:36 host01.xyz.com attrd[6273]: notice: Update relayed from host03.xyz.com
Jul 16 10:15:36 host01.xyz.com rdqm[8663]: stop(qm0_test_rdqm)
Jul 16 10:15:36 host01.xyz.com rdqm[8663]: stop(qm0_test_rdqm) rc(0)
Jul 16 10:15:36 host01.xyz.com crmd[6275]: notice: Result of stop operation for qm0_test_rdqm on host01.xyz.com: 0 (ok)

root@host01 /var/mqm/errors

-------------------------------------------------------------------------------------------------------------------------

Corosync status:

# systemctl status corosync
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/usr/lib/systemd/system/corosync.service; disabled; vendor preset: disabled)
Active: active (running) since Tue 2019-07-16 10:15:08 EDT; 1h 51min ago
Process: 5943 ExecStart=/usr/share/corosync/corosync start (code=exited, status=0/SUCCESS)
Main PID: 6108 (corosync)
CGroup: /system.slice/corosync.service
└─6108 corosync

Jul 16 10:15:08 host01.xyz.com corosync[6108]: [TOTEM ] adding new UDPU member {172.24.102.130}
Jul 16 10:15:08 host01.xyz.com corosync[6108]: [TOTEM ] adding new UDPU member {172.24.103.121}
Jul 16 10:15:08 host01.xyz.com corosync[6108]: [TOTEM ] adding new UDPU member {172.24.103.151}
Jul 16 10:15:08 host01.xyz.com corosync[6108]: [TOTEM ] A new membership (172.24.103.121:2196) was formed. Members joined: 2
Jul 16 10:15:08 host01.xyz.com corosync[6108]: [TOTEM ] A new membership (172.24.102.130:2200) was formed. Members joined: 1 3
Jul 16 10:15:08 host01.xyz.com corosync[6108]: [QUORUM] This node is within the primary component and will provide service.
Jul 16 10:15:08 host01.xyz.com corosync[6108]: [QUORUM] Members[3]: 1 2 3
Jul 16 10:15:08 host01.xyz.com corosync[6108]: [MAIN ] Completed service synchronization, ready to provide service.
Jul 16 10:15:08 host01.xyz.com corosync[5943]: Starting Corosync Cluster Engine (corosync): [ OK ]
Jul 16 10:15:08 host01.xyz.com systemd[1]: Started Corosync Cluster Engine.



-------------------------------------------------------------------------------------------------------------------------


mqdev
PostPosted: Tue Jul 16, 2019 8:13 am

host01 seems to be the problematic node - there are errors in Pacemaker, as seen above.

host02 and host03 are reporting a valid status for QM0 (Running elsewhere).

host01 is reporting an error for QM0 (Ended unexpectedly).
mqdev
PostPosted: Tue Jul 16, 2019 8:18 am

Also tried the following command (on host01):

Code:
crm resource cleanup p_drbd_qm0_test_rdqm

Result: it did not work...
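
A hedged suggestion based on the crm status output above: the Failed Actions are recorded against qm0_test_rdqm (the rdqm resource), not against the DRBD resource that was cleaned up, so an INFINITY fail-count on qm0_test_rdqm would survive a cleanup of p_drbd_qm0_test_rdqm alone. A sketch of cleaning the resource whose start actually failed:

Code:
# Clean up the failed rdqm resource itself, then re-check the cluster
crm resource cleanup qm0_test_rdqm
crm status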
hughson
PostPosted: Wed Jul 17, 2019 1:53 pm

Also:
Code:
rdqmstatus -m QM0

I'll prod Hursley to take a look at this post.

Cheers,
Morag
hughson
PostPosted: Thu Jul 18, 2019 1:52 am

Hursley said you should raise a PMR.

Cheers,
Morag
mqdev
PostPosted: Thu Jul 18, 2019 4:44 am

Thanks Morag... we already raised a PMR. Not much progress there either, I'm afraid. Will keep pushing though...

Thanks
-mqdev
mqdev
PostPosted: Thu Jul 18, 2019 4:48 am

I am just curious about the following entries in the Pacemaker log:

Jul 16 10:15:36 host01.xyz.com attrd[6273]: notice: Update relayed from host03.xyz.com
Jul 16 10:15:36 host01.xyz.com attrd[6273]: notice: Sending flush op to all hosts for: last-failure-qm0_test_rdqm (1563286536)
Jul 16 10:15:36 host01.xyz.com attrd[6273]: notice: Sent update 22: last-failure-qm0_test_rdqm=1563286536
Jul 16 10:15:36 host01.xyz.com attrd[6273]: notice: Update relayed from host03.xyz.com
Jul 16 10:15:36 host01.xyz.com attrd[6273]: notice: Sending flush op to all hosts for: fail-count-qm0_test_rdqm (INFINITY)
Jul 16 10:15:36 host01.xyz.com attrd[6273]: notice: Sent update 24: fail-count-qm0_test_rdqm=INFINITY
Jul 16 10:15:36 host01.xyz.com attrd[6273]: notice: Update relayed from host03.xyz.com
Jul 16 10:15:36 host01.xyz.com rdqm[8663]: stop(qm0_test_rdqm)
Jul 16 10:15:36 host01.xyz.com rdqm[8663]: stop(qm0_test_rdqm) rc(0)
Jul 16 10:15:36 host01.xyz.com crmd[6275]: notice: Result of stop operation for qm0_test_rdqm on host01.xyz.com: 0 (ok)


Questions:

1. What exactly does "INFINITY" signify here? If it indicates some unrecoverable condition, what is it? How can we identify this (and similar conditions) going forward? More than tackling this particular problem, we are looking for tips on troubleshooting such problems with RDQM...

2. It looks like QM0 was stopped... why?
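
For what it is worth, the two questions are connected in stock Pacemaker behaviour: a failed start sets the resource's fail-count to INFINITY by default (start-failure-is-fatal=true), and a fail-count at INFINITY bans the resource from that node, so the policy engine immediately issues the stop(qm0_test_rdqm) seen in the log. The ban persists until the failure is cleaned up (e.g. with crm resource cleanup). For spotting this condition going forward, a hedged sketch using standard Pacemaker tooling:

Code:
# One-shot cluster view including inactive resources and fail counts
crm_mon -1 -r -f

# Follow Pacemaker's log stream live while reproducing a failover
journalctl -u pacemaker -f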