
MQSeries.net Forum Index > User Exits > Cluster workload exit intermittent SIGSEGV

Cluster workload exit intermittent SIGSEGV
lorenmcc
Posted: Tue Dec 11, 2012 8:46 am Post subject: Cluster workload exit intermittent SIGSEGV

Newbie

Joined: 11 Dec 2012
Posts: 5

I am intermittently receiving a SIGSEGV (address not mapped) in a cluster workload (WLM) exit that I wrote. The purpose of the exit is to logically split our cluster so applications can test an implementation before making it live (one side of the cluster is BAU, the other side is for validation testing). The exit is live all the time (and normally does nothing except exit), but is only activated or deactivated when it receives a signal to do so (I won't go into details as to how that happens or how the split is determined, as that is not where the problem is).

Now for the problem. I want to be able to log the "enable" and "disable" activities of the exit. This works most of the time, however, sometimes the exit terminates (and of course restarts the cluster process) due to a detected SIGSEGV in the fprintf function. I suppose this is due to multi-threading, but I am at a loss as to how to prevent this from happening.

The stack trace and the code causing the SIGSEGV are included below. This is on 64-bit Red Hat Linux, but the problem also occurs on 32-bit Linux (Red Hat), zLinux (Red Hat) and Solaris. The exit works a good portion of the time but crashes intermittently.

Any help appreciated.

Thanks

O/S Call Stack for current thread

/opt/mqm/lib64/libmqmcs_r.so(xcsPrintStackForCurrentThread+0xa0)[0x2b670b420070]
/opt/mqm/lib64/libmqmcs_r.so(signalHandlerInternal+0x5c)[0x2b670b4365ac]
/opt/mqm/lib64/libmqmcs_r.so(PrepareDumpAreas+0xd2)[0x2b670b434ac2]
/opt/mqm/lib64/libmqmcs_r.so(xcsFFSTFn+0x20d9)[0x2b670b438f89]
/opt/mqm/lib64/libmqmcs_r.so(xehExceptionHandler+0x625)[0x2b670b4332e5]
/lib64/libpthread.so.0[0x39efa0ebe0]
/lib64/libc.so.6(_IO_vfprintf+0x39)[0x39ef242b59]
/lib64/libc.so.6(_IO_fprintf+0x88)[0x39ef24cd28]
/var/mqm/exits64/hmclwley(clwlFunction+0x62b)[0x2aaaaeb54339]
/opt/mqm/lib64/libmqmr_r.so(rfxCallClusterWorkloadExit+0xf6)[0x2b670b00e4e6]
/opt/mqm/bin/amqzlwa0(xcsTerminate+0x903)[0x401e5b]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x39ef21d994]
/opt/mqm/bin/amqzlwa0(xcsTerminate+0x42)[0x40159a]



Failing code:

curtime = time(NULL);
/*char* dt = ctime(&curtime);*/
curtm = localtime (&curtime);
strftime(dt, 25, "%a %b %e %Y %H:%M:%S", curtm);
fprintf(debugf, "%s:\tEnableExit\n", dt);
fflush(debugf);



+-----------------------------------------------------------------------------+
| |
| WebSphere MQ First Failure Symptom Report |
| ========================================= |
| |
| Date/Time :- Tue December 11 2012 14:29:19 UTC |
| UTC Time :- 1355236159.970230 |
| UTC Time Offset :- -300 (EST) |
| Host Name :- chln884 |
| Operating System :- Linux 2.6.18-308.13.1.el5 |
| PIDS :- 5724H7230 |
| LVLS :- 7.0.1.6 |
| Product Long Name :- WebSphere MQ for Linux (x86-64 platform) |
| Vendor :- IBM |
| Probe Id :- XC130003 |
| Application Name :- MQM |
| Component :- xehExceptionHandler |
| SCCS Info :- lib/cs/unix/amqxerrx.c, 1.242.1.2 |
| Line Number :- 1386 |
| Build Date :- Jul 25 2011 |
| CMVC level :- p701-106-110725 |
| Build Type :- IKAP - (Production) |
| Effective UserID :- 1701 (mqm) |
| Real UserID :- 1701 (mqm) |
| Program Name :- amqzlwa0 |
| Addressing mode :- 64-bit |
| Process :- 12135 |
| Process(Thread) :- 12135 |
| Thread :- 1 |
| ThreadingModel :- PosixThreads |
| QueueManager :- mqsb_qma |
| UserApp :- FALSE |
| ConnId(1) IPCC :- 191 |
| Last HQC :- 1.0.0-58944 |
| Last HSHMEMB :- 0.0.0-0 |
| Major Errorcode :- STOP |
| Minor Errorcode :- OK |
| Probe Type :- HALT6109 |
| Probe Severity :- 1 |
| Probe Description :- AMQ6109: An internal WebSphere MQ error has occurred. |
| FDCSequenceNumber :- 0 |
| Arith1 :- 11 (0xb) |
| Comment1 :- SIGSEGV: address not mapped(0xc0) |
| |
+-----------------------------------------------------------------------------+

mqjeff
Posted: Tue Dec 11, 2012 9:35 am

Grand Master

Joined: 25 Jun 2008
Posts: 17448

Your applications should have configuration that can be easily changed such that you can tell them to put to queues that are shared in a validation cluster rather than queues that are shared in the business-as-usual cluster.

Then you can overlap these clusters across the relevant set of queue managers, and not have to worry about damaging your BAU due to a failure of an inhouse-written cluster exit.

Plus you won't have to solve your SIGSEGV problem.

What I mean to say is, EVERY SINGLE TIME YOU THINK YOU SHOULD WRITE OR INVOLVE AN MQ EXIT, you're doing things the hard way rather than the right way.

Vitor
Posted: Tue Dec 11, 2012 9:43 am

Grand High Poobah

Joined: 11 Nov 2005
Posts: 24614
Location: Ohio, USA

mqjeff wrote:
EVERY SINGLE TIME YOU THINK YOU SHOULD WRITE OR INVOLVE AN MQ EXIT, you're doing things the hard way rather than the right way.




If the answer is an MQ exit, you're asking the wrong question.
_________________
Honesty is the best policy.
Insanity is the best defence.

lorenmcc
Posted: Tue Dec 11, 2012 10:02 am

Newbie

Joined: 11 Dec 2012
Posts: 5

We have several hundred integrated apps (WebSphere AS and mainframe), across about 50 qmgrs and several thousand queues involved. I don't need a debate about whether this should have been done. This was the resolution chosen (with involvement from IBM) and I have to deliver a solution.

bruce2359
Posted: Tue Dec 11, 2012 10:03 am

Poobah

Joined: 05 Jan 2008
Posts: 7843
Location: US: west coast, almost. Otherwise, enroute.



Decades ago I attended a tech conference where one of the sessions was entitled "Exits: a guide to self-inflicted wounds."
_________________
I didn't know that Schrödinger had a cat.

mqjeff
Posted: Tue Dec 11, 2012 10:33 am

Grand Master

Joined: 25 Jun 2008
Posts: 17448

You're not likely to get meaningful advice on solving multithreading issues in MQ exits outside of a PMR process.

The very very few people that have experience with these things who do not work for Hursley tend to view it as the source of their revenue.

There are likely general-purpose strategies for debugging problems with reentrant code: making sure you know how and where variables are declared, when they can be accessed from more than one thread, and so on.

There are likely guidelines in the relevant C library documentation and compiler documentation for the different build environments you're using as to which functions in the standard library can be used safely from reentrant code (i.e. it's possible that fprintf is simply not threadsafe and you have to use fprintfcxpdqrlm or something).

As you say, you have to deliver a solution and you already have involvement from IBM, so follow up that way.

But you're adding a significant production risk to all of your systems for a gain that's hard to quantify externally. This is not a debate, this is a fact.

I state this fact not because I believe you don't understand it, but to make sure that future readers understand it.
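
As a hedged illustration of the general strategy described above (knowing which variables can be reached from more than one thread), the sketch below serializes every use of a shared log handle with a POSIX mutex. POSIX requires stdio calls such as fprintf to lock the individual stream, so the risk is more likely to sit in the handle's lifecycle (open, close, reassignment) than in fprintf itself. The names log_lock, log_fp, log_open, log_line and log_close are hypothetical and do not come from the original exit, and whether the exit really is entered on more than one thread is an assumption.

```c
#include <pthread.h>
#include <stdio.h>

/* Hypothetical shared state: one log handle guarded by one mutex. */
static pthread_mutex_t log_lock = PTHREAD_MUTEX_INITIALIZER;
static FILE *log_fp = NULL;

/* Open the log once; later calls reuse the existing handle. */
int log_open(const char *path)
{
    int rc;

    pthread_mutex_lock(&log_lock);
    if (log_fp == NULL)
        log_fp = fopen(path, "a");
    rc = (log_fp != NULL) ? 0 : -1;
    pthread_mutex_unlock(&log_lock);
    return rc;
}

/* Write one line; returns -1 instead of crashing if the log is not open. */
int log_line(const char *text)
{
    int rc = -1;

    pthread_mutex_lock(&log_lock);
    if (log_fp != NULL && text != NULL &&
        fprintf(log_fp, "%s\n", text) >= 0 &&
        fflush(log_fp) == 0)
        rc = 0;
    pthread_mutex_unlock(&log_lock);
    return rc;
}

/* Close the log and clear the pointer so no caller can use a stale handle. */
void log_close(void)
{
    pthread_mutex_lock(&log_lock);
    if (log_fp != NULL) {
        fclose(log_fp);
        log_fp = NULL;
    }
    pthread_mutex_unlock(&log_lock);
}
```

The point of the design is that the pointer is only ever read or changed while the lock is held, so a writer can never see a handle that another thread is halfway through closing.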

rekarm01
Posted: Tue Dec 11, 2012 10:44 am Post subject: Re: Cluster workload exit intermittent SIGSEGV

Grand Master

Joined: 25 Jun 2008
Posts: 1262

lorenmcc wrote:
I suppose this is due to multi-threading, but I am at a loss as to how to prevent this from happening.

The C datetime formatting functions are not thread-safe. If that's an issue, there should be variants of these functions that are, (such as ctime_r, localtime_r, etc.). Consult the platform documentation for more details.
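
A minimal sketch of that suggestion, assuming a POSIX platform where localtime_r is available. The helper name format_timestamp is hypothetical; the format string is the one from the failing code. In the default C locale that format produces 24 characters (for example "Tue Dec 11 2012 14:29:19"), so a 25-byte buffer is only just large enough and could overflow the limit in a locale with longer abbreviations.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* request localtime_r from <time.h> */
#endif
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: format `when` as local time into a caller-owned
 * buffer. Returns the number of characters written, or 0 on failure,
 * in which case buf holds an empty string. */
size_t format_timestamp(time_t when, char *buf, size_t buflen)
{
    struct tm tmbuf;   /* per-call storage, not shared with other threads */
    size_t n;

    if (buf == NULL || buflen == 0)
        return 0;
    buf[0] = '\0';

    if (localtime_r(&when, &tmbuf) == NULL)
        return 0;

    n = strftime(buf, buflen, "%a %b %e %Y %H:%M:%S", &tmbuf);
    if (n == 0)
        buf[0] = '\0';   /* buffer contents are unspecified when it did not fit */
    return n;
}
```

Unlike localtime(), which returns a pointer to storage owned by the C library, localtime_r() fills in a struct tm that the caller supplies, so concurrent callers do not overwrite each other's result.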

lorenmcc
Posted: Tue Dec 11, 2012 10:47 am

Newbie

Joined: 11 Dec 2012
Posts: 5

I agree on the risk, but this is the decision that was made.

As for IBM's involvement, it stops as soon as you write the exit. They will not debug user code or give advice on how to write it, even if it involves their exits.

bruce2359
Posted: Tue Dec 11, 2012 10:57 am

Poobah

Joined: 05 Jan 2008
Posts: 7843
Location: US: west coast, almost. Otherwise, enroute.

lorenmcc wrote:
I agree on the risk, but this is the decision that was made.

As for IBM's involvement, it stops as soon as you write the exit. They will not debug user code or give advice on how to write it, even if it involves their exits.

You are half-right. IBM will, for a hefty fee, come to your site to write code for you, debug your code, or help you debug your code.

If this is a crisis, a significant risk to your business, then management must treat it as such.
_________________
I didn't know that Schrödinger had a cat.

mqjeff
Posted: Tue Dec 11, 2012 11:17 am Post subject: Re: Cluster workload exit intermittent SIGSEGV

Grand Master

Joined: 25 Jun 2008
Posts: 17448

rekarm01 wrote:
lorenmcc wrote:
I suppose this is due to multi-threading, but I am at a loss as to how to prevent this from happening.

The C datetime formatting functions are not thread-safe. If that's an issue, there should be variants of these functions that are, (such as ctime_r, localtime_r, etc.). Consult the platform documentation for more details.



Vitor
Posted: Tue Dec 11, 2012 11:24 am

Grand High Poobah

Joined: 11 Nov 2005
Posts: 24614
Location: Ohio, USA

lorenmcc wrote:
As for IBM's involvement, it stops as soon as you write the exit. They will not debug user code or give advice on how to write it, even if it involves their exits.


Well it's not their exits, in so far as the product is not supplied with any. Just points where you can insert your code. However, IBM will (to a point) give advice in the face of a PMR on user code (I've got advice on ESQL before now) and you say these exits were written with involvement from IBM. Depending on the terms of the engagement, and if you mean IBM rather than a consultant who specialises in IBM then you may have detailed recourse.

(The term "IBM consultant" has been used rather liberally in my experience. There is the consultant who works on IBM, and the consultant who works for IBM. Many imply the latter but are the former.)
_________________
Honesty is the best policy.
Insanity is the best defence.

lorenmcc
Posted: Tue Dec 11, 2012 11:54 am

Newbie

Joined: 11 Dec 2012
Posts: 5

you were lucky on the ESQL, but I guess that is on usage of something they support. I guess it depends on who you get. I opened a PMR asking specific questions on writing the exits and I was told flat out that they do not give advice or help debug user written code.

Vitor
Posted: Tue Dec 11, 2012 12:14 pm

Grand High Poobah

Joined: 11 Nov 2005
Posts: 24614
Location: Ohio, USA

lorenmcc wrote:
you were lucky on the ESQL, but I guess that is on usage of something they support. I guess it depends on who you get.


Some of the L3 people are very nice.

lorenmcc wrote:
I opened a PMR asking specific questions on writing the exits and I was told flat out that they do not give advice or help debug user written code.


Then as I and my most worthy associate have said, you need to go back to whoever on the IBM side negotiated the "involvement" and see what's available to you. Even if that actual person has moved on, up or out, your account rep will be able to determine what's available to you. IBM's involvement typically doesn't end as soon as you've written code, especially when (as you indicated) it was at their recommendation.

Or, as my other most worthy associate points out, IBM can be paid to debug your user code. Which they do very well.

Or your management can save the consultancy fee and put the money into splitting out the clusters and avoiding the need for a user exit.

There are always options. Including living with things as they are.
_________________
Honesty is the best policy.
Insanity is the best defence.

gbaddeley
Posted: Tue Dec 11, 2012 3:39 pm Post subject: Re: Cluster workload exit intermittent SIGSEGV

Padawan

Joined: 25 Mar 2003
Posts: 1732
Location: Melbourne, Australia

rekarm01 wrote:
lorenmcc wrote:
I suppose this is due to multi-threading, but I am at a loss as to how to prevent this from happening.

The C datetime formatting functions are not thread-safe. If that's an issue, there should be variants of these functions that are, (such as ctime_r, localtime_r, etc.). Consult the platform documentation for more details.

Agree. Do not use non-thread-safe functions in MQ exits (i.e. functions that return a pointer to memory that is internal to, or allocated by, the C libraries).

Glenn (experienced MQ exit programmer)
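
To make that rule concrete, here is a small hedged sketch of why a returned pointer to library-internal memory is a hazard. The two function names are hypothetical. The C standard allows localtime() to reuse one static struct tm for every call, so a second call (from this thread or another) may overwrite what the first caller is still reading; whether the two pointers actually compare equal depends on the C library.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* request localtime_r from <time.h> */
#endif
#include <time.h>

/* Returns 1 if two localtime() results point at the same internal
 * storage, meaning the second call clobbered the first result. */
int localtime_results_alias(void)
{
    time_t a = 0;
    time_t b = (time_t)86400 * 365;   /* one (non-leap) year later */
    struct tm *pa = localtime(&a);
    struct tm *pb = localtime(&b);

    return pa == pb;
}

/* With localtime_r() each result lives in caller-owned storage, so
 * both values stay intact and their years differ. Returns 1 if so. */
int localtime_r_results_differ(void)
{
    time_t a = 0;
    time_t b = (time_t)86400 * 365;
    struct tm ta, tb;

    if (localtime_r(&a, &ta) == NULL || localtime_r(&b, &tb) == NULL)
        return 0;
    return ta.tm_year != tb.tm_year;
}
```

The same reasoning applies to the other functions in this family that return a pointer rather than filling a caller-supplied buffer, such as ctime(), asctime() and gmtime().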

mvic
Posted: Tue Dec 11, 2012 3:55 pm Post subject: Re: Cluster workload exit intermittent SIGSEGV

Padawan

Joined: 09 Mar 2004
Posts: 1981

lorenmcc wrote:
Failing code:

curtime = time(NULL);
curtm = localtime (&curtime);
strftime(dt, 25, "%a %b %e %Y %H:%M:%S", curtm);
fprintf(debugf, "%s:\tEnableExit\n", dt);

Your call stack at the time of exception was in fprintf(). The only reasonable conclusion is that dt (you didn't include indication of what data type that is) held not-valid character data, causing the code in fprintf() to innocently try to read an invalid memory address.

I see you provided a length of 25 in strftime(), but is this sufficient for the data you expect to be stored in the buffer at dt?

Is the buffer at dt of sufficient length for the data you need to write into it and read out of it?

Is the buffer at dt 0-byte-terminated at the time you pass it to fprintf() ?

If localtime_r() is available on this system, then should this be used in preference to localtime() in an MQ exit?

Hope this helps.
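
The questions above can be turned into explicit checks. The following is a sketch under stated assumptions, not the original exit code: log_event is a hypothetical name, the 64-byte buffer size is an arbitrary choice, and the FILE pointer is passed in rather than read from a global. The FFST comment reports the unmapped address as 0xc0, which is a small offset from zero; that pattern is often what a NULL pointer dereference looks like, so checking the stream pointer itself seems at least as worthwhile as checking dt.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* request localtime_r from <time.h> */
#endif
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Hypothetical logging helper: write "<timestamp>:\t<event>\n" to fp.
 * Returns 0 on success and -1 on any failure, without touching fp or
 * the timestamp buffer unless they are known to be valid. */
int log_event(FILE *fp, const char *event)
{
    char dt[64];          /* generous room for the 24-character timestamp */
    struct tm tmbuf;
    time_t now;

    if (fp == NULL || event == NULL)
        return -1;        /* never hand a NULL stream to fprintf */

    now = time(NULL);
    if (now == (time_t)-1 || localtime_r(&now, &tmbuf) == NULL)
        return -1;

    /* strftime returns 0 if the result did not fit; in that case dt is
     * not guaranteed to be terminated, so it must not be printed. */
    if (strftime(dt, sizeof dt, "%a %b %e %Y %H:%M:%S", &tmbuf) == 0)
        return -1;

    if (fprintf(fp, "%s:\t%s\n", dt, event) < 0)
        return -1;
    return (fflush(fp) == 0) ? 0 : -1;
}
```

A caller would use it as log_event(debugf, "EnableExit") and treat a -1 return as "logging unavailable" rather than letting the exit fail.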