Thursday, September 25, 2008

Redundancy and unreliable software

Redundancy is often a great thing to have and has saved many a production environment. However, redundancy only works if the underlying hardware and software are themselves reliable. Making an unreliable environment redundant simply means that the redundant system is unreliable too. Failing over from one failing environment to another that will also fail does nothing for availability.

Therefore, the best thing to do is to make sure that the system being replicated is reliable in and of itself.

Thursday, September 18, 2008

Deploying with Resiliency - shared libraries testing

I've been talking about Coding for Resiliency, but figured I should focus a little on Deploying with Resiliency today.

Anytime one pushes out a new release candidate, it should be tested in the same configuration it will have in the production environment. This means that when multiple applications share the same app server, any shared libraries must be deployed together with the applications that use them.

The best that can happen is that production remains stable, as long as what was tested is what is in production. The worst that can happen (and it will) is that the site becomes unstable. And that is not the end of it. Whatever deployment was put in place has to be backed out (here is where multiple cells really help you out; if you don't know what I'm talking about, go to this great article by my colleague Peter) and the previous deployment has to be re-established in production. The decision to revert also has to be made, and depending on the people running production that can be a difficult decision, even more so if a back-out cell is not available.

Therefore, depending on when it is discovered that the production site is unstable and when the decision to back out is finalized, it can be several hours before production is stabilized.

This is one reason why it is important to test prior to deploying to production, and why "testing in production" is never a good idea.

Friday, September 12, 2008

Code for resiliency - validate input

Scenario 1
I was speaking with one of my colleagues about problems they were seeing with an application. It seems that if the user typed alphabetic characters where a number was expected, the application choked, died and rolled over.

Solution
I was amazed. "Don't they validate their input?" I asked. Surprisingly (or not, depending on your viewpoint), they didn't.

Always, always validate user input. If you are expecting a name, make sure it doesn't contain characters you don't normally see in names (e.g., Joe123). If you are expecting a zip code, make sure it is formatted properly. Never just take the input from the user and pass it on to the next layer.

In addition, validating input will help prevent XSS attacks.
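
As a minimal sketch of what I mean (the field names, length limit and patterns here are my own illustrations, not from the application in question), validation can be as simple as a few regular-expression checks before any value is passed to the next layer:

    import java.util.regex.Pattern;

    // Illustrative validator; adjust the patterns to your own data rules.
    public final class InputValidator {

        // Letters, spaces, hyphens and apostrophes only -- rejects "Joe123".
        private static final Pattern NAME = Pattern.compile("[A-Za-z][A-Za-z '\\-]*");

        // US ZIP code: five digits, optionally followed by -NNNN.
        private static final Pattern ZIP = Pattern.compile("\\d{5}(-\\d{4})?");

        public static boolean isValidName(String input) {
            return input != null && input.length() <= 60 && NAME.matcher(input).matches();
        }

        public static boolean isValidZip(String input) {
            return input != null && ZIP.matcher(input).matches();
        }

        public static int parseQuantity(String input) {
            // Reject garbage before it ever reaches the business layer.
            if (input == null || !input.matches("\\d{1,6}")) {
                throw new IllegalArgumentException("quantity must be 1-6 digits");
            }
            return Integer.parseInt(input);
        }

        private InputValidator() {}
    }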

Scenario 2
You're running a B2x Web service and exchange complex XML documents. But modern-day parsers are vulnerable to malformed XML that can contain circular entity references.

Solution
Acquire DataPower and have it front your Web services. DataPower has incredibly good XML validation and is able to prevent DoS attacks from malformed XML.
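
If an appliance is not an option, you can at least harden the parser itself. This is my own sketch, not a DataPower configuration; the feature URI shown is Xerces-specific and may differ for other parsers:

    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.parsers.ParserConfigurationException;

    // Sketch: refuse DTDs outright so recursive entity definitions
    // (e.g. the "billion laughs" attack) never reach the parser.
    public final class HardenedParserFactory {

        public static DocumentBuilder newBuilder() throws ParserConfigurationException {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            factory.setExpandEntityReferences(false);
            return factory.newDocumentBuilder();
        }

        private HardenedParserFactory() {}
    }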

Thursday, September 11, 2008

This one keeps on going and going

I blogged about this problem a year and a half ago. It continues to live on. If you are writing non-batch applications, set the connection sharing scope to unshareable. Yes, it is unfortunate that the default is shareable, but we need to deal with it. This descriptor seems to move around, so look for it in either the JDBC or thread section of the descriptors.

Alexandre Polozoff on WebSphere Performance: Day One SHAREABLE JDBC connection setting
change the descriptor to UNSHAREABLE
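
For a web module, the place to look is the resource-ref in web.xml; a minimal sketch (the JNDI name is illustrative):

    <resource-ref>
        <res-ref-name>jdbc/MyDataSource</res-ref-name> <!-- illustrative name -->
        <res-type>javax.sql.DataSource</res-type>
        <res-auth>Container</res-auth>
        <res-sharing-scope>Unshareable</res-sharing-scope>
    </resource-ref>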

Code for resiliency - circuit breakers in loops

Scenario
One common problem I run into on different critsits is runaway applications. You know the type. The application sits in a tight loop allocating 1 or 2 MB of data on each iteration, chewing up CPU and giving the garbage collector fits. Sometimes the code is buggy. Other times the data has changed and points to larger data sets than it should. A few times a new use case was discovered.

Runaway loops in an online web application are a bad thing. If enough users go down the same use case, the application can have several threads all running away. The CPU will be pegged at or close to 100%, and the verbose GC data shows tons of memory being allocated at very fast rates, keeping the garbage collector busy collecting it!

As more users go down the same use case, more and more threads get hung up on different application servers in the farm. Pretty soon you're recycling application servers, as that is the only administrative solution for this problem. The problem with recycling is that you are also dumping users who have good transactions in progress, but because the code has run away there are no other options. Hopefully those users will retry their transactions when they fail and not go shopping somewhere else.

Solution
Pick some arbitrary count limit as a circuit breaker. Once the count limit is reached, the loop exits. In addition, when the circuit breaker is activated, the application should log that this occurred and dump some of the relevant data around that point. This way, when a problem occurs, someone can be alerted that the circuit breaker was activated and examine the problem.

I would implement the loop exit by throwing an exception (this is an exceptional case, right?) from an if statement at the beginning of the loop. Let the exception handler dump the data to the log. Make sure something is monitoring the log for the exception and that someone is alerted to examine the problem.
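
Here is a minimal sketch of that pattern; the limit, class names and exception type are my own illustrations:

    import java.util.Iterator;
    import java.util.List;
    import java.util.logging.Logger;

    // Illustrative circuit breaker in a loop; size MAX_ITERATIONS to your data.
    public class CatalogProcessor {

        private static final Logger LOG = Logger.getLogger(CatalogProcessor.class.getName());
        private static final int MAX_ITERATIONS = 1000; // catalog holds ~800 items

        public static class CircuitBreakerException extends RuntimeException {
            public CircuitBreakerException(String message) {
                super(message);
            }
        }

        public void processCatalog(List items) {
            int count = 0;
            for (Iterator it = items.iterator(); it.hasNext();) {
                Object item = it.next();
                // Circuit breaker: abort before we degrade production.
                if (++count > MAX_ITERATIONS) {
                    throw new CircuitBreakerException(
                            "tripped after " + count + " iterations; last item: " + item);
                }
                // ... normal processing of item ...
            }
        }

        // The caller dumps the relevant data to the log; monitoring watches
        // for the exception and alerts someone to examine the problem.
        public void handleRequest(List items) {
            try {
                processCatalog(items);
            } catch (CircuitBreakerException e) {
                LOG.severe("Circuit breaker activated: " + e.getMessage());
                throw e; // let the container roll the transaction back
            }
        }
    }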

Resiliency in code is not difficult, but you do have to anticipate that the worst will happen. Every loop has to have a way to abort if it has executed an unreasonable number of times. Otherwise production can be very unstable.

Unreasonable Arguments Opposed
While I tend to think I'm fairly even-handed, a number of people have told me this is a stupid idea. Well, let me tell you why it isn't. If the circuit breaker is hit, then you know you are already in an invalid state. Regardless of the reason, continuing to process the request is pointless because something is not right. If we only have 800 items in a product catalog, then looping 1,000 times means something is wrong. Therefore there has to be a way to abort the request before we degrade production. As for that user's request, there is no way they will be able to complete it. However, since the circuit breaker was activated, data was logged and someone was alerted, not only do we know that something happened but someone is now actively working the problem to figure out why the breaker was activated.

What about transaction integrity, people ask? Well, if you handle exceptions properly (and that will be another CfR article), then you won't have any problems with transaction boundaries, and everything will be rolled back appropriately.

What about open locks? Again, if you have coded your exception handling properly then all locks should be released.

What about open JDBC connections? Again, if you have coded your exception handling properly then all JDBC connections should be released in the finally block, no?
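
Here is a minimal sketch of what I mean by coding the exception handling properly (the query and data source are illustrative); the finally block releases the connection no matter what is thrown, circuit breaker exceptions included:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import javax.sql.DataSource;

    public class OrderDao {

        private final DataSource dataSource;

        public OrderDao(DataSource dataSource) {
            this.dataSource = dataSource;
        }

        public int countOrders(String customerId) throws SQLException {
            Connection conn = null;
            PreparedStatement stmt = null;
            ResultSet rs = null;
            try {
                conn = dataSource.getConnection();
                stmt = conn.prepareStatement(
                        "SELECT COUNT(*) FROM orders WHERE customer_id = ?");
                stmt.setString(1, customerId);
                rs = stmt.executeQuery();
                rs.next();
                return rs.getInt(1);
            } finally {
                // Release in reverse order; swallow close failures so they
                // don't mask the original exception.
                if (rs != null) try { rs.close(); } catch (SQLException ignored) {}
                if (stmt != null) try { stmt.close(); } catch (SQLException ignored) {}
                if (conn != null) try { conn.close(); } catch (SQLException ignored) {}
            }
        }
    }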

Monday, September 8, 2008

5 9s is not easy; it can be done, but you have to know what you're doing

Continuing the coverage of production outages that make the news: it seems the London Stock Exchange had a serious outage today, most likely due to the high volume resulting from the US Government takeover of Freddie & Fannie.

We at IBM often put together high-volume environments that have high-availability requirements. In order to do this, and do it well, one has to build in resiliency, provision enough capacity, and then add disaster recovery for business continuity. I've worked with a number of household-name companies, worldwide, on providing just such capabilities. It is disastrous when an e-commerce retailer is unable to sell product during the holiday shopping season. Things get particularly bad for financial institutions when money is on the line.

I can't say what they did or didn't do, but it certainly seems like people want answers. Reassurances are going to be hard to come by until they do a lot more groundwork.

London Stock Exchange crippled by system outage | Reuters
The exchange would not say whether volume was the issue and declined to give details on what had caused the problem. But angry customers were demanding an explanation.

"We want answers as to how this happened in the first place and reassurances that it will not happened again," said Angus Rigby, chief executive of brokerage TD Waterhouse.

Friday, September 5, 2008

debug logging has no place in production

You know, I'm often confronted with this problem and I have yet to really understand why it exists. Lots of people run applications in production with logging set to the debug level. From a performance perspective this is intolerably horrible! You're constantly hitting disk, garbage collecting spurious strings, and serializing threads around the disk access; it makes no sense to me how I keep getting javacores with stack traces in logger.debug (or worse, SystemOut.println!).

Speaking of println, one should never, ever use println for logging. There is no way to control it like a logger. As of JDK 1.4 there is a logger built into the JDK (java.util.logging). Use it! And set your log level to WARN or ERROR, but nothing more granular than that unless you're actively debugging a problem.
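
A minimal sketch with the JDK's own logger (note that java.util.logging calls its levels WARNING and SEVERE rather than WARN and ERROR); the class and logger names are illustrative:

    import java.util.logging.Level;
    import java.util.logging.Logger;

    public class CheckoutService {

        private static final Logger LOG = Logger.getLogger(CheckoutService.class.getName());

        public void checkout(String orderId) {
            // Guard debug-level messages so the message string is never even
            // built in production when the level is WARNING or above.
            if (LOG.isLoggable(Level.FINE)) {
                LOG.fine("starting checkout for order " + orderId);
            }
            try {
                // ... business logic ...
            } catch (RuntimeException e) {
                LOG.log(Level.SEVERE, "checkout failed for order " + orderId, e);
                throw e;
            }
        }
    }

The level itself belongs in logging.properties (e.g. ".level=WARNING") so it can be changed without touching code.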

Thursday, September 4, 2008

tprof is your friend

I've collected data on various 100% CPU problems and am always amazed at how useful tprof data is, yet how unintrusive it is to collect.

IBM - MustGather: 100% CPU usage on AIX
Collecting data for 100% CPU usage problems with IBM® WebSphere® Application Server on the AIX® operating system. Gathering this information before calling IBM support will help familiarize you with the troubleshooting process and save you time.