Thursday, September 11, 2008

Code for resiliency - circuit breakers in loops

Scenario
One common problem I run into on different critsits is runaway applications. You know the type. The application is sitting there in a tight loop, allocating 1 or 2 MB of data on each iteration, chewing up CPU and giving the garbage collector fits. Sometimes the code is buggy. Other times the data has changed and points to larger data sets than it should. A few times a new use case was discovered.

Runaway loops in an online web application are a bad thing. If enough users go down the same use case, the application can have several threads all running away. The CPU will be pegged at or close to 100%, and the verbose GC data will show tons of memory being allocated at a very fast rate while the garbage collector stays busy collecting it.

As more users go down the same use case, more and more threads get hung up on different application servers in the farm. Pretty soon you're recycling application servers, because that is the only administrative solution once the code has run away. The problem with recycling is that you are also dumping users who have good transactions in progress. Hopefully those users will retry their transactions when they fail and not go shopping somewhere else.

Solution
Pick some arbitrary count limit as a circuit breaker. Once the count limit is reached, the loop exits. In addition, when the circuit breaker is activated, the application should log that this occurred and dump some of the relevant data around that point. This way, when a problem occurs, someone can be alerted that the circuit breaker was activated and examine the problem.

I would implement the loop exit by throwing an exception (this is an exceptional case, right?) from an if statement at the beginning of the loop. Let the exception handler dump the data to the log. Make sure something is monitoring the log for the exception and that someone is alerted to examine the problem.
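As a rough illustration of the above, here is a minimal sketch in Java. Everything named here is made up for the example: `LoopLimitExceededException`, `CatalogScanner`, the `next` array of item links, and the limit of 1000 are assumptions, not code from a real application. The loop checks its counter in an if statement at the top of each iteration, throws when the limit is passed, and the handler logs the data that was captured at that point.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Logger;

/** Hypothetical exception thrown when a loop runs past its iteration limit. */
class LoopLimitExceededException extends RuntimeException {
    LoopLimitExceededException(String message) {
        super(message);
    }
}

class CatalogScanner {
    private static final Logger LOG = Logger.getLogger(CatalogScanner.class.getName());

    // Assumed limit: pick a count that valid data should never reach.
    static final int MAX_ITERATIONS = 1000;

    /** Follows next-item links from start until a negative link ends the chain. */
    static List<Integer> walk(int[] next, int start) {
        List<Integer> visited = new ArrayList<>();
        int count = 0;
        int current = start;
        while (current >= 0) {
            // Circuit breaker: checked at the top of every iteration.
            if (++count > MAX_ITERATIONS) {
                throw new LoopLimitExceededException(
                    "walk exceeded " + MAX_ITERATIONS + " iterations; start=" + start
                        + ", current=" + current);
            }
            visited.add(current);
            current = next[current];
        }
        return visited;
    }

    /** Request-level handler: logs the breaker trip and reports the request as failed. */
    static boolean handleRequest(int[] next, int start) {
        try {
            walk(next, start);
            return true;
        } catch (LoopLimitExceededException e) {
            // Something should be monitoring the log for this message.
            LOG.severe("Circuit breaker activated: " + e.getMessage());
            return false;
        }
    }
}
```

With links that form a cycle, such as `{1, 0}`, `walk` would otherwise never return. Here it throws on the 1001st pass, and `handleRequest` logs the event and fails that one request instead of pinning a thread.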

Resiliency in code is not difficult. But you do have to anticipate that the worst will happen. Every loop has to have a way to abort if it has executed an unreasonable number of times. Otherwise production can be very unstable.

Unreasonable Arguments Opposed
While I tend to think I'm fairly even-handed, a number of people have told me this is a stupid idea. Well, let me tell you why it isn't. If the circuit breaker is hit, then you know you are already in an invalid state. Regardless of the reason, continuing to process the request is pointless because something is not right. If we only have 800 items in a product catalog, then looping 1000 times means something is wrong. Therefore there has to be a way to abort the request before we degrade production. As for that user's request, there is no way they will be able to complete it. However, since the circuit breaker was activated, data was logged and someone was alerted, not only do we know that something happened, but someone is now actively working the problem to figure out why the breaker was activated.

What about transaction integrity, people ask? Well, if you handle exceptions properly (and that'll be another CfR article), then you won't have any problems with transaction boundaries and everything will be rolled back appropriately.

What about open locks? Again, if you have coded your exception handling properly then all locks should be released.

What about open JDBC connections? Again, if you have coded your exception handling properly then all JDBC connections should be released in the finally block, no?
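
To make the point about locks and connections concrete, here is a small sketch in Java, under some loud assumptions: `CleanupOnBreaker` and `FakeConnection` are invented for this example, and `FakeConnection` is only a stand-in for a real JDBC connection so the snippet runs without a database. The circuit breaker throws from inside the loop, and the finally block still closes the connection and releases the lock.

```java
import java.util.concurrent.locks.ReentrantLock;

class CleanupOnBreaker {
    static final ReentrantLock LOCK = new ReentrantLock();

    /** Hypothetical stand-in for a JDBC connection; counts how many are open. */
    static class FakeConnection implements AutoCloseable {
        static int open = 0;

        FakeConnection() {
            open++;
        }

        @Override
        public void close() {
            open--;
        }
    }

    /** Runs a loop that never ends on its own; only the breaker stops it. */
    static void process(int limit) {
        LOCK.lock();
        FakeConnection conn = null;
        try {
            conn = new FakeConnection();
            int count = 0;
            while (true) {
                // Circuit breaker at the top of the loop.
                if (++count > limit) {
                    throw new IllegalStateException(
                        "loop exceeded " + limit + " iterations");
                }
                // ... work that uses conn would go here ...
            }
        } finally {
            // Runs even though the breaker threw: release everything we took.
            if (conn != null) {
                conn.close();
            }
            LOCK.unlock();
        }
    }
}
```

After `process` throws, `LOCK.isLocked()` is false and `FakeConnection.open` is back to zero, so the aborted request leaves nothing behind for the next one to trip over.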
