Author |
Message
|
lorenmcc |
Posted: Tue Dec 11, 2012 8:46 am Post subject: Cluster workload exit intermittent SIGSEGV |
|
|
Newbie
Joined: 11 Dec 2012 Posts: 5
|
I am intermittently receiving a SIGSEGV (address not mapped) in an exit I wrote for the cluster workload (WLM) exit point. The purpose of the exit is to logically split our cluster so applications can test an implementation before making it live (one side of the cluster is BAU, the other side is for validation testing). The exit is loaded all the time (and normally does nothing except return), but is only activated or deactivated when it receives a signal to do so (I won't go into how that happens or how the split is determined, as that is not where the problem is).
Now for the problem. I want to be able to log the "enable" and "disable" activities of the exit. This works most of the time; however, the exit sometimes terminates (and of course restarts the cluster process) due to a SIGSEGV detected in the fprintf function. I suppose this is due to multi-threading, but I am at a loss as to how to prevent this from happening.
The stack trace and the code causing the SIGSEGV are below. This is for 64-bit Linux (Red Hat), but the problem also occurs on 32-bit Linux (Red Hat), zLinux (Red Hat) and Solaris.
Any help appreciated.
Thanks
O/S Call Stack for current thread
/opt/mqm/lib64/libmqmcs_r.so(xcsPrintStackForCurrentThread+0xa0)[0x2b670b420070]
/opt/mqm/lib64/libmqmcs_r.so(signalHandlerInternal+0x5c)[0x2b670b4365ac]
/opt/mqm/lib64/libmqmcs_r.so(PrepareDumpAreas+0xd2)[0x2b670b434ac2]
/opt/mqm/lib64/libmqmcs_r.so(xcsFFSTFn+0x20d9)[0x2b670b438f89]
/opt/mqm/lib64/libmqmcs_r.so(xehExceptionHandler+0x625)[0x2b670b4332e5]
/lib64/libpthread.so.0[0x39efa0ebe0]
/lib64/libc.so.6(_IO_vfprintf+0x39)[0x39ef242b59]
/lib64/libc.so.6(_IO_fprintf+0x88)[0x39ef24cd28]
/var/mqm/exits64/hmclwley(clwlFunction+0x62b)[0x2aaaaeb54339]
/opt/mqm/lib64/libmqmr_r.so(rfxCallClusterWorkloadExit+0xf6)[0x2b670b00e4e6]
/opt/mqm/bin/amqzlwa0(xcsTerminate+0x903)[0x401e5b]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x39ef21d994]
/opt/mqm/bin/amqzlwa0(xcsTerminate+0x42)[0x40159a]
Failing code:
curtime = time(NULL);
/*char* dt = ctime(&curtime);*/
curtm = localtime (&curtime);
strftime(dt, 25, "%a %b %e %Y %H:%M:%S", curtm);
fprintf(debugf, "%s:\tEnableExit\n", dt);
fflush(debugf);
+-----------------------------------------------------------------------------+
| |
| WebSphere MQ First Failure Symptom Report |
| ========================================= |
| |
| Date/Time :- Tue December 11 2012 14:29:19 UTC |
| UTC Time :- 1355236159.970230 |
| UTC Time Offset :- -300 (EST) |
| Host Name :- chln884 |
| Operating System :- Linux 2.6.18-308.13.1.el5 |
| PIDS :- 5724H7230 |
| LVLS :- 7.0.1.6 |
| Product Long Name :- WebSphere MQ for Linux (x86-64 platform) |
| Vendor :- IBM |
| Probe Id :- XC130003 |
| Application Name :- MQM |
| Component :- xehExceptionHandler |
| SCCS Info :- lib/cs/unix/amqxerrx.c, 1.242.1.2 |
| Line Number :- 1386 |
| Build Date :- Jul 25 2011 |
| CMVC level :- p701-106-110725 |
| Build Type :- IKAP - (Production) |
| Effective UserID :- 1701 (mqm) |
| Real UserID :- 1701 (mqm) |
| Program Name :- amqzlwa0 |
| Addressing mode :- 64-bit |
| Process :- 12135 |
| Process(Thread) :- 12135 |
| Thread :- 1 |
| ThreadingModel :- PosixThreads |
| QueueManager :- mqsb_qma |
| UserApp :- FALSE |
| ConnId(1) IPCC :- 191 |
| Last HQC :- 1.0.0-58944 |
| Last HSHMEMB :- 0.0.0-0 |
| Major Errorcode :- STOP |
| Minor Errorcode :- OK |
| Probe Type :- HALT6109 |
| Probe Severity :- 1 |
| Probe Description :- AMQ6109: An internal WebSphere MQ error has occurred. |
| FDCSequenceNumber :- 0 |
| Arith1 :- 11 (0xb) |
| Comment1 :- SIGSEGV: address not mapped(0xc0) |
| |
+-----------------------------------------------------------------------------+ |
|
mqjeff |
Posted: Tue Dec 11, 2012 9:35 am Post subject: |
|
|
Grand Master
Joined: 25 Jun 2008 Posts: 17447
|
Your applications should have configuration that can be easily changed such that you can tell them to put to queues that are shared in a validation cluster rather than queues that are shared in the business-as-usual cluster.
Then you can overlap these clusters across the relevant set of queue managers, and not have to worry about damaging your BAU due to a failure of an inhouse-written cluster exit.
Plus you won't have to solve your SIGSEGV problem.
What I mean to say is, EVERY SINGLE TIME YOU THINK YOU SHOULD WRITE OR INVOLVE AN MQ EXIT, you're doing things the hard way rather than the right way. |
|
Vitor |
Posted: Tue Dec 11, 2012 9:43 am Post subject: |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
mqjeff wrote: |
EVERY SINGLE TIME YOU THINK YOU SHOULD WRITE OR INVOLVE AN MQ EXIT, you're doing things the hard way rather than the right way. |
If the answer is an MQ exit, you're asking the wrong question.  _________________ Honesty is the best policy.
Insanity is the best defence. |
|
lorenmcc |
Posted: Tue Dec 11, 2012 10:02 am Post subject: |
|
|
Newbie
Joined: 11 Dec 2012 Posts: 5
|
We have several hundred integrated apps (WebSphere AS and mainframe), across about 50 qmgrs and several thousand queues involved. I don't need a debate about whether this should have been done. This was the resolution chosen (with involvement from IBM) and I have to deliver a solution. |
|
bruce2359 |
Posted: Tue Dec 11, 2012 10:03 am Post subject: |
|
|
 Poobah
Joined: 05 Jan 2008 Posts: 9469 Location: US: west coast, almost. Otherwise, enroute.
|
Decades ago I attended a tech conference where one of the sessions was entitled "Exits: a guide to self-inflicted wounds." _________________ I like deadlines. I like to wave as they pass by.
ב''ה
Lex Orandi, Lex Credendi, Lex Vivendi. As we Worship, So we Believe, So we Live. |
|
mqjeff |
Posted: Tue Dec 11, 2012 10:33 am Post subject: |
|
|
Grand Master
Joined: 25 Jun 2008 Posts: 17447
|
You're not likely to get meaningful advice on solving multithreading issues in MQ exits outside of a PMR process.
The very very few people that have experience with these things who do not work for Hursley tend to view it as the source of their revenue.
There are likely general-purpose strategies for debugging problems with reentrant code: making sure you know how and where variables are declared, when they can be accessed from more than one thread, and so on.
There are likely guidelines in the relevant C library and compiler documentation for the different build environments you're using as to which standard library functions can be used safely from reentrant code (i.e. it's possible that fprintf is simply not thread-safe and you have to use fprintfcxpdqrlm or something).
As you say, you have to deliver a solution and you already have involvement from IBM, so follow up that way.
But you're adding a significant production risk to all of your systems for a gain that's hard to quantify externally. This is not a debate, this is a fact.
I state this fact not because I believe you don't understand it, but to make sure that future readers understand it. |
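On the thread-safety question: on POSIX systems each individual stdio call such as fprintf locks the stream internally, so a single call is not by itself a data race. What is not guaranteed is that a sequence of calls (a write followed by a flush) from one thread stays together. A minimal sketch, assuming a POSIX platform, that groups the two calls under the stream's own lock with flockfile/funlockfile. The helper name log_event is hypothetical and is not part of the original exit.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>

/* Hypothetical helper: writes one log line and flushes it while holding
 * the stream's own lock, so output from other threads cannot be
 * interleaved between the write and the flush.
 * Returns 0 on success, -1 if an argument is NULL or an I/O call fails. */
int log_event(FILE *fp, const char *stamp, const char *event)
{
    int rc = 0;

    if (fp == NULL || stamp == NULL || event == NULL)
        return -1;

    flockfile(fp);                      /* per-stream lock, POSIX */
    if (fprintf(fp, "%s:\t%s\n", stamp, event) < 0)
        rc = -1;
    if (fflush(fp) != 0)
        rc = -1;
    funlockfile(fp);

    return rc;
}
```

A call would look like log_event(debugf, dt, "EnableExit"). This only protects the stream; it does nothing for other shared state such as the timestamp buffer.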
|
rekarm01 |
Posted: Tue Dec 11, 2012 10:44 am Post subject: Re: Cluster workload exit intermittent SIGSEGV |
|
|
Grand Master
Joined: 25 Jun 2008 Posts: 1415
|
lorenmcc wrote: |
I suppose this is due to multi-threading, but I am at a loss as to how to prevent this from happening. |
The C datetime formatting functions are not thread-safe. If that's an issue, there should be variants of these functions that are, (such as ctime_r, localtime_r, etc.). Consult the platform documentation for more details. |
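As a rough illustration of this suggestion, the timestamp step from the original snippet could use localtime_r, which fills a struct tm supplied by the caller instead of returning a pointer to storage shared by every thread. This is a sketch assuming a POSIX platform; the helper name format_timestamp is hypothetical and not part of the original exit.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <time.h>

/* Hypothetical helper: formats `when` as local time into buf using the
 * same format string as the original snippet.
 * Returns the number of characters written (excluding the terminator),
 * or 0 on failure, in which case buf holds an empty string. */
size_t format_timestamp(time_t when, char *buf, size_t buflen)
{
    struct tm tmbuf;                    /* per-call storage, not shared */

    if (buf == NULL || buflen == 0)
        return 0;

    if (localtime_r(&when, &tmbuf) == NULL) {
        buf[0] = '\0';
        return 0;
    }

    size_t n = strftime(buf, buflen, "%a %b %e %Y %H:%M:%S", &tmbuf);
    if (n == 0)
        buf[0] = '\0';                  /* result did not fit in buf */
    return n;
}
```

The caller would declare its own buffer, for example char dt[64], call format_timestamp(time(NULL), dt, sizeof dt), and only log when the return value is non-zero.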
|
lorenmcc |
Posted: Tue Dec 11, 2012 10:47 am Post subject: |
|
|
Newbie
Joined: 11 Dec 2012 Posts: 5
|
I agree on the risk, but this is the decision that was made.
As for IBM's involvement, it stops as soon as you write the exit. They will not debug user code or give advice on how to write it, even if it involves their exits. |
|
bruce2359 |
Posted: Tue Dec 11, 2012 10:57 am Post subject: |
|
|
 Poobah
Joined: 05 Jan 2008 Posts: 9469 Location: US: west coast, almost. Otherwise, enroute.
|
lorenmcc wrote: |
I agree on the risk, but this is the decision that was made.
As for IBM's involvement, it stops as soon as you write the exit. They will not debug user code or give advice on how to write it, even if it involves their exits. |
You are half-right. IBM will, for a hefty fee, come to your site to write code for you, debug your code, or help you debug it.
If this is a crisis, a significant risk to your business, then management must treat it as such. _________________ I like deadlines. I like to wave as they pass by.
ב''ה
Lex Orandi, Lex Credendi, Lex Vivendi. As we Worship, So we Believe, So we Live. |
|
mqjeff |
Posted: Tue Dec 11, 2012 11:17 am Post subject: Re: Cluster workload exit intermittent SIGSEGV |
|
|
Grand Master
Joined: 25 Jun 2008 Posts: 17447
|
rekarm01 wrote: |
lorenmcc wrote: |
I suppose this is due to multi-threading, but I am at a loss as to how to prevent this from happening. |
The C datetime formatting functions are not thread-safe. If that's an issue, there should be variants of these functions that are, (such as ctime_r, localtime_r, etc.). Consult the platform documentation for more details. |
 |
|
Vitor |
Posted: Tue Dec 11, 2012 11:24 am Post subject: |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
lorenmcc wrote: |
As for IBM's involvement, it stops as soon as you write the exit. They will not debug user code or give advice on how to write it, even if it involves their exits. |
Well it's not their exits, in so far as the product is not supplied with any. Just points where you can insert your code. However, IBM will (to a point) give advice in the face of a PMR on user code (I've got advice on ESQL before now) and you say these exits were written with involvement from IBM. Depending on the terms of the engagement, and if you mean IBM rather than a consultant who specialises in IBM then you may have detailed recourse.
(The term "IBM consultant" has been used rather liberally in my experience. There is the consultant who works on IBM, and the consultant who works for IBM. Many imply the latter but are the former.) _________________ Honesty is the best policy.
Insanity is the best defence. |
|
lorenmcc |
Posted: Tue Dec 11, 2012 11:54 am Post subject: |
|
|
Newbie
Joined: 11 Dec 2012 Posts: 5
|
you were lucky on the ESQL, but I guess that is on usage of something they support. I guess it depends on who you get. I opened a PMR asking specific questions on writing the exits and I was told flat out that they do not give advice or help debug user written code. |
|
Vitor |
Posted: Tue Dec 11, 2012 12:14 pm Post subject: |
|
|
 Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
lorenmcc wrote: |
you were lucky on the ESQL, but I guess that is on usage of something they support. I guess it depends on who you get. |
Some of the L3 people are very nice.
lorenmcc wrote: |
I opened a PMR asking specific questions on writing the exits and I was told flat out that they do not give advice or help debug user written code. |
Then, as I and my most worthy associate have said, you need to go back to whoever on the IBM side negotiated the "involvement" and see what's available to you. Even if that actual person has moved on, up or out, your account rep will be able to determine what's available to you. IBM's involvement typically doesn't end as soon as you've written code, especially when (as you indicated) it was at their recommendation.
Or, as my other most worthy associate points out, IBM can be paid to debug your user code. Which they do very well.
Or your management can save the consultancy fee and put the money into splitting out the clusters and avoiding the need for a user exit.
There are always options. Including living with things as they are. _________________ Honesty is the best policy.
Insanity is the best defence. |
|
gbaddeley |
Posted: Tue Dec 11, 2012 3:39 pm Post subject: Re: Cluster workload exit intermittent SIGSEGV |
|
|
 Jedi Knight
Joined: 25 Mar 2003 Posts: 2538 Location: Melbourne, Australia
|
rekarm01 wrote: |
lorenmcc wrote: |
I suppose this is due to multi-threading, but I am at a loss as to how to prevent this from happening. |
The C datetime formatting functions are not thread-safe. If that's an issue, there should be variants of these functions that are, (such as ctime_r, localtime_r, etc.). Consult the platform documentation for more details. |
Agreed. Do not use non-thread-safe functions in MQ exits (i.e. functions that return a pointer to memory that is internal to, or allocated by, the C libraries).
Glenn (experienced MQ exit programmer) |
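To make the hazard concrete, here is a small sketch of the two styles. Both functions are hypothetical and exist only for illustration: the first returns a pointer to one static buffer, the same pattern used by ctime() and localtime(), and the second writes into storage the caller owns, in the style of ctime_r() and localtime_r(). The "DisableExit" label is an assumption modelled on the "EnableExit" string in the original post.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical and deliberately unsafe: every caller shares one static
 * buffer, so a later call (from any thread) overwrites an earlier result. */
const char *state_label_shared(int enabled)
{
    static char buf[16];
    snprintf(buf, sizeof buf, "%s", enabled ? "EnableExit" : "DisableExit");
    return buf;
}

/* Hypothetical reentrant variant: the caller supplies the storage.
 * Returns buf, or NULL if no usable buffer was given. */
char *state_label_r(int enabled, char *buf, size_t buflen)
{
    if (buf == NULL || buflen == 0)
        return NULL;
    snprintf(buf, buflen, "%s", enabled ? "EnableExit" : "DisableExit");
    return buf;
}
```

Even in a single thread, two results from state_label_shared point at the same bytes, so the first is silently replaced by the second. With several threads the buffer can also be read while another thread is halfway through rewriting it.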
|
mvic |
Posted: Tue Dec 11, 2012 3:55 pm Post subject: Re: Cluster workload exit intermittent SIGSEGV |
|
|
 Jedi
Joined: 09 Mar 2004 Posts: 2080
|
lorenmcc wrote: |
Failing code:
curtime = time(NULL);
curtm = localtime (&curtime);
strftime(dt, 25, "%a %b %e %Y %H:%M:%S", curtm);
fprintf(debugf, "%s:\tEnableExit\n", dt);
|
Your call stack at the time of the exception was in fprintf(). The only reasonable conclusion is that dt (you didn't indicate what data type that is) held invalid character data, causing the code in fprintf() to innocently try to read an invalid memory address.
I see you provided a length of 25 in strftime(), but is this sufficient for the data you expect to be stored in the buffer at dt?
Is the buffer at dt of sufficient length for the data you need to write into it and read out of it?
Is the buffer at dt NUL-terminated at the time you pass it to fprintf()?
If localtime_r() is available on this system, should it be used in preference to localtime() in an MQ exit?
Hope this helps. |
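The questions above can be turned into explicit checks. The following is a defensive sketch, not the original exit code: the function name log_enable, the return codes and the 64-byte buffer are all assumptions. It also checks the stream itself. The FFST reports the faulting address as 0xc0, and an address that close to zero is often what a read through a NULL pointer looks like, so as a guess (nothing in the thread confirms it) a NULL debugf may be worth ruling out alongside a bad dt.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <time.h>

/* Assumed size: the format below needs 25 bytes in the C locale
 * (24 characters plus the terminator); other locales may need more. */
enum { STAMP_LEN = 64 };

/* Hypothetical helper. Returns 0 on success, or a negative code that
 * identifies which check failed. */
int log_enable(FILE *debugf, time_t now)
{
    char dt[STAMP_LEN];
    struct tm tmbuf;

    if (debugf == NULL)
        return -1;          /* log file was never opened successfully */

    if (localtime_r(&now, &tmbuf) == NULL)
        return -2;          /* time could not be converted */

    /* strftime returns 0 when the result does not fit; the contents of
     * dt are then unspecified and must not be printed. */
    if (strftime(dt, sizeof dt, "%a %b %e %Y %H:%M:%S", &tmbuf) == 0)
        return -3;

    if (fprintf(debugf, "%s:\tEnableExit\n", dt) < 0)
        return -4;

    return fflush(debugf) == 0 ? 0 : -5;
}
```

Because dt and tmbuf are locals, each thread entering the exit gets its own copies, and a failed check skips the log line instead of crashing the cluster process.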
|