
MQSeries.net Forum Index » WebSphere Message Broker (ACE) Support » What is Your Error Handling Design for Message Flows

What is Your Error Handling Design for Message Flows
Bob (Newbie; joined 03 Oct 2001; 4 posts)
Posted: Thu Oct 04, 2001 2:01 pm
I'm trying to design robust error handling for my message flows. It appears that some combination of TryCatch, Throw, Trace, and MQOutput nodes is required.

The IBM examples I've seen to date wire MQOutput nodes to the Failure terminals of every node. This allows the current state of the input message to be examined. In their examples, the point in the flow at which the failure occurred could be determined, because they created a distinct queue name for each MQOutput node. The downsides are the number of queues that need to be defined, and that the original message may have been modified to the extent that it can no longer be copied back to the MQInput queue after the exception has been resolved.

Another approach would be to not wire any Failure terminals and to use the Catch terminal of the MQInput node to process the exception. Possibly TryCatch nodes could be interspersed as well. The issue here is that as the message makes its way back to the MQInput Catch terminal, the message is reinstated to what it was when it was read (or to when it passed through the TryCatch node). Since the message is reinstated, the only tracking information is in the ExceptionList. Possibly Compute nodes could be interspersed that inserted UserExceptions into the ExceptionList as the nodes were executed, to indicate progress.

Currently, my debugging is done with Trace nodes and by examining the NT event log.

Ideally, there would be a compound node (or plugin) that could be wired to the MQInput Catch terminal. It would append an XML Diagnostic tag to the message identifying the message flow, the node it was in when it died, and the contents of the ExceptionList; output the original message to some retry queue (we've got a program to copy the message, minus the XML Diagnostic tag, back to the original queue after the error has been resolved); and send an e-mail with the above to a support group.

I'm interested in hearing how you're coding your message flows so that when failures occur you can resolve them, without making the message flows unnecessarily complicated.
Sandman (Centurion; joined 16 Oct 2001; 134 posts; Lincoln, RI)
Posted: Wed Oct 17, 2001 10:37 am

Bob,

We're doing exactly what you suggested. I've developed a subflow that we attach to the MQInput node's Catch terminal.

It interrogates the ExceptionList and builds some XML diagnostic info that we then write out to a queue. It also inserts the input message's MessageId as the CorrelId of this "reason" message. At the end of the subflow is a Throw node, which causes a rollback and ultimately backout processing. This way, we can correlate the backed-out input message with the reason message.
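A minimal ESQL sketch of the Compute node at the heart of such a subflow might look like the following (the Reason field names are illustrative, not Sandman's actual code):

```
-- Sketch only: build a "reason" message correlated to the failing input.
-- Copy the MQMD so the output is a well-formed MQ message.
SET OutputRoot.MQMD = InputRoot.MQMD;
-- Correlate the reason message to the backed-out original.
SET OutputRoot.MQMD.CorrelId = InputRoot.MQMD.MsgId;
-- Carry the diagnostics as XML built from the ExceptionList tree.
SET OutputRoot.XML.Reason.ExceptionList = InputExceptionList;
```

After the MQOutput node writes this message, the trailing Throw node rolls the unit of work back, so the original message ends up on the backout queue with a MsgId matching this CorrelId.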

Miriam Kaestner (Centurion; joined 26 Jun 2001; 103 posts; IBM IT Education Services, Germany)
Posted: Fri Oct 19, 2001 5:50 am

I agree with both of you. Another good location to put error information is the usr folder of the MQRFH2 header.
So I built a GenericErrorHandler subflow, attached to the MQInput node's Catch terminal, that:
1. puts information from the ExceptionList into MQRFH2.usr.MyError
2. does a ResetContentDescriptor to BLOB
3. outputs the original message, with the added RFH2, to an errorQ (with transaction mode "No"!)
4. throws an error (using a FlowOrder node to ensure this step happens AFTER steps 1-3) to roll back any database updates

The failure terminal of MQInput node must be connected to a dummy node (for example, a Trace node with nothing in it), so that the rolled back message is thrown away.

Then I have the original message on the errorQ together with all error information.
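A hedged ESQL sketch of step 1, stashing ExceptionList fields in the RFH2 usr folder (the MyError field names are illustrative, and only the first RecoverableException is copied here):

```
-- Sketch only: carry error details in the RFH2 usr folder,
-- leaving the message body itself untouched (treated as BLOB in step 2).
SET OutputRoot.MQMD = InputRoot.MQMD;
SET OutputRoot.MQRFH2.usr.MyError.Label =
    InputExceptionList.RecoverableException.Label;
SET OutputRoot.MQRFH2.usr.MyError.Number =
    InputExceptionList.RecoverableException.Number;
SET OutputRoot.MQRFH2.usr.MyError.Text =
    InputExceptionList.RecoverableException.Text;
```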
mapa (Master; joined 09 Aug 2001; 257 posts; Malmö, Sweden)
Posted: Fri Oct 26, 2001 2:52 am

Hi,

In my error-handling subflow I am using SupportPac IA07 (SendMail).

In a Compute node I extract the information from the ExceptionList and generate an XML message that is passed to the SendMail node. (Reset the domain to BLOB to make sure you don't get a parsing error on your message again, if that was the cause of the exception.) I then send the e-mail via an SMTP server.

I also put the failed message on a failure queue.

Best regards, Magnus

[ This Message was edited by: mapa on 2001-10-26 03:53 ]
Bob (Newbie; joined 03 Oct 2001; 4 posts)
Posted: Mon Oct 29, 2001 8:43 am

My thanks go to Sandman, Miriam Kaestner, and Magnus for responding.

I’m providing the method we’re using for error handling and am hoping that folks will critique it.

Some of the goals of the error handling process are to:
1) Minimize the dependency on the system log (the NT Eventlog in our case) for problem resolution.
2) Capture as much of the message state as possible at the time of failure.
3) Rollback all effects of the message processing after a failure.
4) Prevent endless looping of the message flow.
5) Simplify the reinstatement of the message after the problem has been cleared.
6) Keep the message flow design as simple as possible.
7) Be notified if a message flow exception occurs.

Warning: at times I will state that MQSI works a certain way. What I really mean is that this is how I think MQSI works, based on IBM documentation I've read and what I've had time to test; I write this because I want to avoid prefixing everything with "I think...". I'll admit ignorance on how MQSI commits or rolls back database updates when there is an MQSI component failure or server failure. From what I can make out from the documentation, "global transaction coordination" is only available with DB2 databases. Suppose a message flow with an MQInput Transaction Mode of "Yes" has database inserts (all with a Transaction Mode of "Automatic") pending against Oracle, MSSQL, Sybase, and DB2/OS390, and MQSI is in the process of telling the DBMSs to commit. I'm guessing that if the message flow abends or the MQSI server crashes, it's possible for the Oracle and MSSQL inserts to commit, the Sybase and DB2/OS390 inserts to roll back, and the database inserts to be reapplied to all the DBMSs (that's twice for Oracle and MSSQL) when the message is reprocessed. However, that's worrying about database updates when the message flow doesn't throw exceptions. If the same message flow takes an exception and no Failure or Catch terminals are wired, I'm assuming that all DBMS inserts will be rolled back, even if the MQSI server crashes while the DBMS rollbacks are pending.

MQSI has error handling and rollback capabilities. If an exception occurs, control is passed to a failure terminal (if something is wired to it) or a catch terminal (again, if something is wired) of the current node. If control can’t be passed, control goes toward the MQInput node, looking for a wired catch terminal.

The challenge is that if you wire a Failure terminal or Catch terminal, you are responsible for rolling back everything the message flow did to that point (e.g., it took the message off the input queue, it made database updates, it wrote messages to other queues), UNLESS you cause another exception to be thrown by the logic you've wired to the Failure or Catch terminal. If another exception is thrown, the transaction mode of each node in the message flow determines whether the changes that node made are committed or rolled back.

Another behavior to be aware of is that when exception handling starts going towards the MQInput node in search of a Catch terminal, changes to the message are undone. I'm aware of only three nodes (Compute, Extract, and ResetContentDescriptor) that alter message content. Assume a message flow has three Compute nodes in a series, where each node adds information from a database to the message, and this added information is used by the next Compute node. Also assume an exception occurs in the third Compute node. If the Failure terminal of the third Compute node isn't wired, the exception handler searches towards the MQInput node looking for a catch (of a TryCatch or MQInput node). When a wired Catch terminal is found, the message contents are reinstated to the way the message looked when it first passed through that MQInput or TryCatch node. In the hypothetical case we've discussed, the message won't have any of the changes made by the three Compute nodes when the MQInput Catch is given control. Please note that any database updates made by the three Compute nodes are still ready to be committed (the Compute node transaction mode is set to "Automatic" and can't be changed; the transaction mode of Database nodes can be either "Automatic" or "Commit").

If the nodes wired to the Catch terminal throw an exception, the exception handler continues going towards the MQInput node looking for more wired Catch terminals. If all the remaining Catch terminals are either not wired or throw additional exceptions, the MQInput node will get control. The MQInput node is complex from an error handling perspective, because the way it processes is dependent on many things (MQInput's Transaction Mode, the input queue's BackOut Threshold, whether the input queue's BackOut Queue has been defined, whether MQSI takes an exception writing to the BackOut Queue or Dead Letter Queue, whether the MQInput Failure or Catch terminal is wired, and whether an exception was returned from the MQInput Out, Failure, or Catch terminals).

What we’re doing at this point, is:
1) Defining two queues instead of just the one we need. The second queue name equals the first queue name || ‘_ERR’. For instance, if we define QUEUE1, we would also define QUEUE1_ERR. The BackOut Threshold of QUEUE1 is set to 1 and the BackOut Queue of QUEUE1 is set to QUEUE1_ERR. QUEUE1_ERR will be set to trigger an e-mail/pager notification program (still under construction) when the trigger depth of one is reached.
2) The MQInput Node processes QUEUE1. The MQInput Transaction Mode is set to “Yes”. The MQInput Failure terminal is not wired. The MQInput Catch terminal is wired to an ERROR_HANDLER_SUBFLOW (a compound node to be described later).
3) The ErrorQueueName and the ErrorQueueManagerName, which are promoted properties of the ERROR_HANDLER_SUBFLOW, need to be set. Normally, we leave the ErrorQueueManagerName blank (it defaults to the MQSI Queuemanager) and specify the _ERR queue (QUEUE1_ERR) for the ErrorQueueName.
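In MQSC terms, the queue setup in step 1 might look like the following. The initiation queue and process names are placeholders, since the notification program is still under construction:

```
* Input queue backs out to its companion _ERR queue after one delivery.
DEFINE QLOCAL(QUEUE1) +
       BOTHRESH(1) +
       BOQNAME(QUEUE1_ERR)

* _ERR queue fires the e-mail/pager notifier when the first message lands.
DEFINE QLOCAL(QUEUE1_ERR) +
       TRIGGER TRIGTYPE(DEPTH) TRIGDPTH(1) +
       INITQ(NOTIFY.INITQ) +
       PROCESS(NOTIFY.PROCESS)
```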

The above is the minimum amount of error handling. If you want “extra credit”, you can also:
1) Use TryCatch nodes. Wire the TryCatch's In terminal to the Out terminal of each Compute, Extract, and ResetContentDescriptor node. Wire the TryCatch's Try terminal to the next processing node in the flow. Wire the TryCatch's Catch terminal to the same ERROR_HANDLER_SUBFLOW to which the MQInput Catch terminal is wired. As mentioned above, these nodes can change the message, and if you'd like to see the message contents after coming through the Out terminal, this is a way to do it. In the earlier example of three Compute nodes in a series, placing TryCatch nodes in this manner would allow us to observe the content added by the Compute1 and Compute2 nodes when Compute3 throws an exception.
2) Wire the Failure terminals of the Extract node (it’s the only node that passes modified message contents to its Failure terminal; the others pass the original message contents to the Failure terminal). Wire the Extract Failure terminal to the same ERROR_HANDLER_SUBFLOW to which the MQInput Catch terminal is wired.
3) If for some reason you find it advantageous to wire the Failure terminals so that you can do some extra processing that you didn’t want to have in the ERROR_HANDLER_SUBFLOW, you could wire a FlowOrder node to the Failure terminal in question. Wire the FlowOrder’s First terminal to your extra logic (maybe send an e-mail that we’ve got problems). Wire the FlowOrder’s Second terminal to the same ERROR_HANDLER_SUBFLOW to which the MQInput Catch terminal is wired. Don’t wire the FlowOrder’s Failure terminal (you could, but it would have to be treated like a Failure terminal again with a FlowOrder wired to it).

You’re probably asking, “So I’ve done all this, what happens when an exception is thrown processing QUEUE1?” The answer is:
1) The database updates with a Transaction Mode of “Automatic” are rolled back.
2) The original message is written to QUEUE1_ERR, if possible. It might not be possible if the Max Queue Depth is reached, the queue is input inhibited, or you misspelled QUEUE1_ERR when defining it as the BackOut Queue for QUEUE1.
3) If the original message can’t be written to QUEUE1_ERR, it writes it to the MQ System Dead Letter queue, if possible.
4) If the original message can’t be written to QUEUE1_ERR and can’t be written to the MQ System Dead Letter queue, the message flow loops until the message flow doesn’t throw any exceptions, or the QUEUE1_ERR or MQ System Dead Letter queue become available.
5) If the original message is written to QUEUE1_ERR or MQ System Dead Letter queue, and if the queue count is one, the notification program should send an e-mail and/or page. A diagnostic message in XML format is written to the queue specified in the ERROR_HANDLER_SUBFLOW’s ErrorQueueName for each time the ERROR_HANDLER_SUBFLOW is entered, if possible. This diagnostic message contains all the message data that a Trace node could provide (i.e., the message headers, the message body, the ExceptionList, and the DestinationList).

So, you've gotten paged. Now what? We run a program that retrieves and formats the diagnostic messages written by ERROR_HANDLER_SUBFLOW. We determine the problem. Let's say the database update nodes failed because the database manager crashed. We get the database manager working again. Then we run another program that copies the original messages on the _ERR queue back to the original queue and deletes any diagnostic messages found on the _ERR queue.
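As an illustration of the requeue program's job (not our actual code, which talks to the queue manager directly), the logic that separates backed-out originals from the XML diagnostic messages sharing the _ERR queue could be as simple as:

```python
def split_err_queue(bodies):
    """Split _ERR queue message bodies into originals to requeue and
    diagnostics to discard. A diagnostic message, per the convention in
    this thread, is an XML document with an MQI_ERRORS root element."""
    originals, diagnostics = [], []
    for body in bodies:
        if "<MQI_ERRORS>" in body:
            diagnostics.append(body)
        else:
            originals.append(body)
    return originals, diagnostics

# Example: one backed-out original and one diagnostic message.
msgs = ['<Order><Id>42</Id></Order>',
        '<?xml version="1.0"?><MQI_ERRORS><ExceptionList/></MQI_ERRORS>']
originals, diagnostics = split_err_queue(msgs)
```

The originals go back to the source queue; the diagnostics are deleted once the problem is resolved.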

Let’s step back and see how we’re stacking up against our goals:
Some of the goals of the error handling process are to:
1) Minimize the dependency on the system log (the NT Eventlog in our case) for problem resolution.
We should only have to go to the system log if there is a system problem.
2) Capture as much of the message state as possible at the time of failure.
As long as the diagnostic message can be written, we’ll have all that MQSI provides.
3) Rollback all effects of the message processing after a failure.
We’re always throwing an exception in an error situation to force MQSI to do the work for us.
4) Prevent endless looping of the message flow.
Unless the _ERR queue and Dead Letter queue are unavailable, we’ll only process a message once.
5) Simplify the reinstatement of the message after the problem has been cleared.
Just run a program with the names of the _ERR and original queue to reinstate the messages on the original queue.
6) Keep the message flow design as simple as possible.
Normally, we only need to wire the MQInput Catch terminal and maybe add a TryCatch after Compute nodes.
7) Be notified if a message flow exception occurs.
The depth-of-one trigger on the _ERR queue drives the e-mail/pager notification program.

Now’s a good time to understand the details of the ERROR_HANDLER_SUBFLOW. It consists of five nodes:
1) An InputTerminal node (renamed In).
2) A FlowOrder node (named FlowOrder1).
3) A Compute node (renamed Build Error Message) that builds a diagnostic message.
4) A MQOutput (renamed MQOutput to Error Queue) that writes the diagnostic message to the queue defined by the ErrorQueueManagerName and ErrorQueueName.
5) A Throw node (renamed ERROR_HANDLER_SUBFLOW Throw).

The wiring:
1) InputTerminal Out -> FlowOrder In.
2) FlowOrder Failure not wired.
3) FlowOrder First -> Compute In.
4) FlowOrder Second -> Throw In.
5) Compute Failure not wired.
6) Compute Out -> MQOutput In.
7) MQOutput Failure not wired.

The Build Error Message:
1) Properties: all are at default settings.
2) Logic:
DECLARE C INTEGER;
SET C = CARDINALITY(InputRoot.*[]);
DECLARE I INTEGER;
SET I = 1;
WHILE I < C DO
SET OutputRoot.*[I] = InputRoot.*[I];
SET I=I+1;
END WHILE;
-- Enter SQL below this line. SQL above this line might be regenerated, causing any modifications to be lost.
-- *****************************************************************
-- Make sure the output message is XML regardless of how it came in.
-- *****************************************************************
SET "OutputRoot"."Properties"."MessageFormat" = 'XML';
SET "OutputRoot"."XML".(XML.XmlDecl).(XML.Version)='1.0';
-- *****************************************************************
-- Capture everything available about the input message.
-- *****************************************************************
DECLARE MyCardinality INTEGER;
SET MyCardinality = CARDINALITY("InputRoot".*[]);
DECLARE MyIndex INTEGER;
SET MyIndex = 1;
WHILE MyIndex < MyCardinality DO
SET "OutputRoot"."XML"."MQI_ERRORS".*[MyIndex] = "InputRoot".*[MyIndex];
SET MyIndex = MyIndex + 1;
END WHILE;
SET "OutputRoot"."XML"."MQI_ERRORS"."HexBody" = BITSTREAM("InputBody");
SET "OutputRoot"."XML"."MQI_ERRORS"."ExceptionList" = "InputExceptionList";
SET "OutputRoot"."XML"."MQI_ERRORS"."DestinationList" = "InputDestinationList";


MQOutput to Error Queue Properties that aren’t at their default settings:
1) Transaction Mode = “No”. We want the diagnostic message written even though we’re throwing an exception.
2) Persistence Mode = “As Defined for Queue”.

ERROR_HANDLER_SUBFLOW Throw Properties that aren’t at their default settings:
Message Text = “MQI.ERROR_HANDLER_SUBFLOW generated Throw.”


An example Diagnostic message follows:
<?xml version="1.0"?>
<MQI_ERRORS>
<Properties>
<MessageSet/>
<MessageType/>
<MessageFormat/>
<Encoding>785</Encoding>
<CodedCharSetId>500</CodedCharSetId>
<Transactional>FALSE</Transactional>
<Persistence>TRUE</Persistence>
<CreationTime>2001-10-24 07:54:14.570</CreationTime>
<ExpirationTime>-1</ExpirationTime>
<Priority>0</Priority>
<Topic/>
</Properties>
<MQMD>
<SourceQueue>EAGLE.GET_GAS_SERVICE_INFO</SourceQueue>
<Transactional>FALSE</Transactional>
<Encoding>785</Encoding>
<CodedCharSetId>500</CodedCharSetId>
<Format/>
<Version>2</Version>
<Report>0</Report>
<MsgType>1</MsgType>
<Expiry>-1</Expiry>
<Feedback>0</Feedback>
<Priority>0</Priority>
<Persistence>1</Persistence>
<MsgId>c3e2d840d4d8e2c44040404040404040b6a1d4eed1232541</MsgId>
<CorrelId>000000000000000000000000000000000000000000000000</CorrelId>
<BackoutCount>0</BackoutCount>
<ReplyToQ>EAGLE.RDGMQ4_B6A1D4EECF47B441 </ReplyToQ>
<ReplyToQMgr>MQSD </ReplyToQMgr>
<UserIdentifier>MQBOB1 </UserIdentifier>
<AccountingToken>0bc3f1f1f2f060c9c6d4f0f10000000000000000000000000000000000000000</AccountingToken>
<ApplIdentityData/>
<PutApplType>2</PutApplType>
<PutApplName>MQBOB1MQ </PutApplName>
<PutDate>2001-10-24</PutDate>
<PutTime>07:54:14.570</PutTime>
<ApplOriginData/>
<GroupId>000000000000000000000000000000000000000000000000</GroupId>
<MsgSeqNumber>1</MsgSeqNumber>
<Offset>0</Offset>
<MsgFlags>0</MsgFlags>
<OriginalLength>121</OriginalLength>
</MQMD>
<HexBody>4c6fa7949340a58599a28996957e7ff14bf07f6f6e4ce3c5e2e36dd9c5c36e4cc9e3c5d46e4cd5c1d4c56ef1f0404040404040404040404040404040404040404040404040404040404040404040404040404c61d5c1d4c56e4cc1c7c56ef0f0f14c61c1c7c56e4c61c9e3c5d46e4c61e3c5e2e36dd9c5c36e</HexBody>
<ExceptionList>
<RecoverableException>
<File>F:/build/S000_P/src/DataFlowEngine/ImbDataFlowNode.cpp</File>
<Line>538</Line>
<Function>ImbDataFlowNode::createExceptionList</Function>
<Type>CombmMQInputNode</Type>
<Name>7bd9895b-e800-0000-0080-a0872cccdfe4</Name>
<Label>EAGLE.GET_GAS_SERVICE_INFO_1.EAGLE.GET_GAS_SERVICE_INFO</Label>
<Text>Node throwing exception</Text>
<Catalog>MQSIv201</Catalog>
<Severity>3</Severity>
<Number>2230</Number>
<UserException>
<File>F:/build/S000_P/src/DataFlowEngine/BasicNodes/ImbThrowNode.cpp</File>
<Line>229</Line>
<Function>ImbThrowNode::evaluate</Function>
<Type>ComIbmThrowNode</Type>
<Name>957f8eba-e900-0000-0080-828ea8d80971</Name>
<Label>EAGLE.GET_GAS_SERVICE_INFO_1.Throw1</Label>
<Text>User exception thrown by throw node</Text>
<Catalog>MQSIv201</Catalog>
<Severity>1</Severity>
<Number>3001</Number>
<Insert>
<Type>5</Type>
<Text/>
</Insert>
</UserException>
</RecoverableException>
</ExceptionList>
<DestinationList/>
</MQI_ERRORS>


Now that you've read what we're doing, I'd like to give the rationale for some of the design decisions:
1. Why would you want to create an _ERR queue for each queue?
1.1. For our shop, defining an _ERR queue simplifies reinstating the message to the original queue. We have a program to which we can give the _ERR and original queue names, and it moves the messages back to the original queue. Having a separate _ERR queue gives me more confidence that the message won't be lost and makes it less likely to be moved back to the wrong queue.
1.2. If you have a central queue to collect the messages, you need to make sure its queue attributes are compatible with all the queues that may have their messages placed there. The program that reinstates the message back to the original queue is also more complex.
2. Why did you put your diagnostic messages on the _ERR queue?
2.1. The _ERR queue was there; we didn’t have to define anything else.
2.2. The program that moves the messages back to _ERR queue is smart enough to bypass the diagnostic messages.
2.3. By having them together, you may not have to look at the BITSTREAM version of the message body in the diagnostic message, because you can see the original message, which is also on the _ERR queue.
2.4. Since diagnostic messages may not always be able to be generated, you can look in one place to see if they were and you know that they pertain to the original message.
3. Why did you use BITSTREAM to process the InputBody?
3.1. Well, I didn't really want to use BITSTREAM to populate "HexBody". It's just that I needed something that would work with all message formats, and I was unsuccessful in getting MQSI to do a tree copy of InputBody into a CDataSection. If someone has a better idea, I'd like to hear it.
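One consolation: since the diagnostic message records the CodedCharSetId (500 in the example above), the HexBody remains readable offline. A quick sketch in Python, assuming the message really is CCSID 500 EBCDIC as in the sample (Python's "cp500" codec corresponds to that CCSID):

```python
# Decode the opening bytes of the sample HexBody from the diagnostic message.
hexbody = "4c6fa7949340a58599a28996957e7ff14bf07f6f6e"
text = bytes.fromhex(hexbody).decode("cp500")
print(text)  # <?xml version="1.0"?>
```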
4. Why didn’t you put a SendMailPlugin as part of the ERROR_HANDLER_SUBFLOW?
4.1. We're writing a notification program that is triggered by each _ERR queue. We're concerned that if a high-volume message flow is broken, we could flood our e-mail system or our pagers. By triggering on a queue depth of one, I'll get one e-mail or page instead of potentially thousands.

I welcome feedback on this topic.
Cliff (Centurion; joined 27 Jun 2001; 145 posts; Wiltshire)
Posted: Tue Oct 30, 2001 2:20 am

Wow! What a big posting.
I agree with pretty much all of it, but here are a few comments/differences of emphasis based on my understanding of how things work:

1 When processing an exception, changes to XA-compliant databases (configured so MQSeries is the resource manager) will be rolled back when the UOW is backed out when control returns to the MQInput node. This used to be DB2 only, other databases may now comply. Changes to non-XA-compliant databases will not be managed within a global UOW.

2 I assume you set the backout requeue queue name to be the _ERR queue? Else how do you get the original message onto the _ERR queue too?

3 You could write the diagnostic information to a file using a Trace node, if you didn't want it on the _ERR queue. This could be a way of keeping your diagnostic data centrally.

4 You could have a single _ERR queue, as the diagnostic data has the MSGID which you could use to identify your message for requeueing. Personally, I don't like the idea of duplicating all my queues! This is also less prone to typo errors.

5 Because I trust MQSeries, the underlying technology, I have complete confidence that messages won't be lost, so I'm happy to keep the belt but abandon the braces.

This is a crucial topic that affects us all, and I'm sure that while we can all agree on the principles we will have different preferences for implementation. I also look forward to hearing what others have to say!

Cliff