Timeout during xcsReadFileLock component (NFS)
progruma
Posted: Fri Jan 30, 2015 11:48 am    Post subject: Timeout during xcsReadFileLock component (NFS)
Novice
Joined: 19 Dec 2012
Posts: 11
Hello folks,
I need some help figuring out what's going on here. We have a multi-instance queue manager, v7.1.0.9, running on Red Hat Enterprise Linux Server release 6.6 (Santiago) with NFS v4 on a NetApp appliance.
This queue manager fails over to its standby node once every 2 to 4 weeks. The failover summary from one of the generated FDCs is as follows:
1- component: xcsReadFileLock
AMQ6119: An internal WebSphere MQ error has occurred ('5 - Input/output error' from read.)
2- component: zutVerifyQMFileLocks
Minor Errorcode :- xecN_E_FILE_ERROR
3- component: zutVerifyQMFileLocks
Minor Errorcode :- xecN_E_LOCK_NOT_GRANTED
4- component: zxcFileLockVerifyThread
Major Errorcode :- lrcE_S_Q_MGR_LOCK_LOST
AMQ7279: WebSphere MQ queue manager 'QM01' lost ownership of data lock.
5- component: zxcProcessChildren
Major Errorcode :- zrcX_PROCESS_MISSING
AMQ5008: An essential WebSphere MQ process 19556 (amqrrmfa) cannot be found and is assumed to be terminated.
The mount configuration is set per IBM's recommendation (sync,hard,intr,noac).
The current FileLockHeartBeatLen is set to 30 in the qm.ini file.
The NetApp nfs.v4.lease_seconds is set to 30.
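For reference, the mount entry looks roughly like this (server name and export path are placeholders, not our real ones):
# /etc/fstab entry for the shared queue manager data, illustrative values only
netappfiler:/vol/mqha   /MQHA   nfs4   rw,hard,intr,sync,noac   0 0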
At this point, I'm not sure what's wrong. Is it a configuration problem? A network problem?
If it's a configuration problem, what am I missing here?
If it's a network problem, how can I prevent it from happening? Would increasing nfs.v4.lease_seconds on the NetApp or FileLockHeartBeatLen in qm.ini help?
Do they have to match?
Any guidance is much appreciated.
Please let me know if I missed stating any information needed about the current configuration/setup.
Last edited by progruma on Fri Feb 06, 2015 6:24 am; edited 1 time in total
samsansam
Posted: Thu Feb 05, 2015 12:09 pm    Post subject:
Apprentice
Joined: 19 Mar 2014
Posts: 41
First you need to know why the queue manager keeps failing over every 2-4 weeks.
We had the same issue before. In my case, the active queue manager process kept hitting 100% CPU, which made our queue manager fail over every 4 weeks.
We opened a PMR and they recommended the following:
The way that the MQ multi-instance feature works is as follows:
One qmgr is started on one node. Another qmgr is started on another
node. Both qmgrs access the same qmgr data. Both try to get the lock on
the same "master" file. The successful qmgr regards itself as the active
qmgr, writes identification information into the "master" file,
and holds onto the lock. It commences full running as the active qmgr.
The unsuccessful qmgr regards itself as the standby qmgr and simply
keeps retrying to obtain the lock on "master" every 2s.
The active qmgr "monitors" the "master" file, reading it every 10s to
check that the information it wrote into it when it became the active
qmgr is unchanged.
If the active qmgr ends for whatever reason, the lock on "master" is
dropped and the standby qmgr then gains the lock and becomes the active
qmgr, writing its info into "master" and starting up full running.
The successful operation of the MQ multi-instance feature critically
depends on the correct functioning of NFS file locks. It also requires
that operations (such as open and read) on an NFS file-locked file do
not take longer than a small number of tens of seconds at most. If the
qmgr does incur delays of tens of seconds then the qmgr has to conclude
that the system is not operating correctly and the qmgr will shutdown
to allow the standby instance to take over.
Other utilities, particularly and perhaps exclusively dspmq, may also
enquire of the "master" file (and other files), reading the file
contents. In the case of dspmq, that is to determine the state of the
qmgr. The frequency of the checking would depend on the frequency of
invocation of dspmq. The dspmq execution time is
quite variable too, so the times when it checks the "master" files can
be quite variable. The dspmq program will typically check the "master"
file of each defined qmgr, in turn.
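To illustrate (this part is mine, not from the PMR): dspmq -x shows which instance currently holds the active role, with output along these lines (hostnames invented):
dspmq -x -m QM01
QMNAME(QM01)   STATUS(Running)
    INSTANCE(hosta) MODE(Active)
    INSTANCE(hostb) MODE(Standby)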
The 10s (default) verification time is configurable via the
TuningParameters stanza of the active qmgr's qm.ini file. E.g. to set
it to 30s:
TuningParameters:
   FileLockHeartBeatLen=30
The health check thread time is computed as twice that value, so setting
FileLockHeartBeatLen to 30s will mean the lock verification thread can
be hung for a period of 2*30s to 4*30s before triggering the qmgr to
end.
There is no similar tuning for the standby qmgr. If the standby qmgr is
able to obtain the lock, then it has to conclude the active qmgr has
dropped it, and the standby qmgr then attempts to start.
If the NFS failover causes the lock to appear to have been dropped, when
in fact the active qmgr has not issued an instruction to drop it, then
that would be an issue with the NFS failover mechanism.
There is further a Liveness check that is made by default every 60s,
and if there are two consecutive occasions when the qmgr fails the
check, then an FDC is written and the qmgr is brought down.
The Liveness check time is configurable up to 600s. If the system may
"pause" for up to 180s, then the check time could be set to 190s to
avoid the check being tripped. Add the following tuning parameters
stanza to the qm.ini and recycle the qmgr:
TuningParameters:
   LivenessHeartBeatLen=190
*** If it is found necessary to adjust from the default times, then
there is likely an underlying issue on the filesystem/network/file
server or NFS client that should be investigated and addressed.
This is outside of WebSphere MQ's scope.
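Putting the two together, the TuningParameters stanza in qm.ini ends up looking something like this (the values are just examples, use whatever fits your environment):
TuningParameters:
   FileLockHeartBeatLen=30
   LivenessHeartBeatLen=190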
We fixed our issue by updating our NFS configuration, following this link:
http://www-01.ibm.com/support/docview.wss?uid=swg21592501
progruma
Posted: Fri Feb 06, 2015 6:49 am    Post subject:
Novice
Joined: 19 Dec 2012
Posts: 11
Thanks samsansam.
At the beginning, yes, that was happening: the process hung at 100% CPU. That problem got solved after we updated the kernel.
After the kernel upgrade, we did notice that iowait% goes up to 100% just before a failover happens.
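In case it helps, this is roughly how we watch for the spike (assuming the sysstat package is installed):
# CPU breakdown including %iowait, sampled every 5 seconds until interrupted
sar -u 5
# NFS client RPC counters; retransmissions and timeouts show up here
nfsstat -c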
This line in your reply caught my attention:
Quote:
If the NFS failover causes the lock to appear to have been dropped, when in fact the active qmgr has not issued an instruction to drop it, then that would be an issue with the NFS failover mechanism.
That, I think, is what's happening.
IBM recommended using the FileLockHeartBeatLen tuning parameter, and we did.
What I think is happening here (I need someone to validate it):
The active queue manager does not issue an instruction to drop the lock, but for some reason (a network packet drop or whatever) NFS releases the lock after the 30 seconds (NetApp nfs.v4.lease_seconds is set to 30). That surfaces as an error in the OS kernel (NFS client), causing the process to die. Meanwhile, the standby is already polling the master file (at a 2-second interval), so it takes over.
Now, two things, if my theory is valid:
#1 What causes the 30-second delay, and how do we prevent that root cause? (We did specify a static route to help from a network perspective.)
#2 How do we prevent a failover from happening? If we increased the NetApp nfs.v4.lease_seconds to 60 seconds, would that let MQ control the failover decision rather than the kernel?
RedHat recommendation snippet:
Quote:
In addition, a workaround for this issue is to increase the NFS server lease time from 30 seconds to a higher value. As the NFS server takes a long time to respond to RENEW requests, increasing the lease time will provide a bit of a buffer: a larger window for the RENEW to be received on the client side before the lease expires.
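For anyone checking the same thing on their side, this is how we verify what the client is actually using (nfsstat ships with nfs-utils):
# show the options each NFS mount is really using (vers, timeo, hard, ac flags)
nfsstat -m
# confirm the mount from the kernel's point of view
grep nfs /proc/mounts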
samsansam
Posted: Fri Feb 06, 2015 10:57 am    Post subject:
Apprentice
Joined: 19 Mar 2014
Posts: 41
The core reason for this problem is that the MQ queue manager had a problem obtaining a file lock from the file system (via the operating system) in a timely manner. Since MQ relies on having access to its files in order to run, any interruption to file access can cause the queue manager to fail. The MQ queue manager runs a file lock monitor thread which, every 10 seconds (in your case 30), checks that it has access to the resources it needs over the network. Another health-check thread monitors this one to determine whether it has hung, which is a sign that there is a network problem. The FDC with probe id ZX155001 is from this health-check thread, which detected that the file lock monitor thread had not responded for 60 seconds. Subsequent FDCs may result from MQ processes being unable to access files over the network.
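If you want to see which probes you are actually hitting, the probe id is in each FDC header (default error directory assumed):
# list the probe ids recorded in the FDC headers
grep "Probe Id" /var/mqm/errors/*.FDC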
Keep in mind that MQ is the victim here, not the cause.
samsansam
Posted: Fri Feb 06, 2015 11:00 am    Post subject:
Apprentice
Joined: 19 Mar 2014
Posts: 41
For the 100% CPU process issue, we had to increase our memory from 8 GB to 32 GB.