It is rare for anyone to provide details about the root cause of a production outage, so Facebook's report on a recent outage of theirs is worth reading. If you're into troubleshooting and problem determination it is an interesting read. It sounds like they could turn off the particular function, but had to completely restart the environment to do it. This is why it is important to have circuit breakers that can be activated dynamically.
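To make the idea concrete, here is a minimal sketch of that kind of circuit breaker. The class and method names are my own for illustration; the point is that the breaker can trip itself after repeated failures, and operators can also force it open at runtime (say, through a JMX bean or an admin console) without restarting anything.

```java
// Minimal circuit breaker sketch (illustrative names, not a real library).
// Trips after a threshold of consecutive failures, and can also be opened
// or closed manually at runtime without a restart.
public class CircuitBreaker {
    private final int failureThreshold;
    private int consecutiveFailures = 0;
    private volatile boolean open = false;

    public CircuitBreaker(int failureThreshold) {
        this.failureThreshold = failureThreshold;
    }

    // Operators can flip these without restarting the JVM,
    // e.g. from a JMX operation or an admin page.
    public void forceOpen()  { open = true; }
    public void forceClose() { open = false; consecutiveFailures = 0; }

    // Callers check this before invoking the protected function.
    public boolean allowRequest() { return !open; }

    public void recordSuccess() { consecutiveFailures = 0; }

    public void recordFailure() {
        if (++consecutiveFailures >= failureThreshold) {
            open = true;  // trip: stop sending traffic to the failing back-end
        }
    }

    public boolean isOpen() { return open; }
}
```

With something like this wrapped around the troublesome function, turning it off is a flag flip rather than a full environment restart.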
One also wonders what infrastructural changes could be made in the environment to help. It sounds like the application logic continued to retry requests. This is why I'm not a fan of applications automatically retrying requests: when a failure occurs, the retries can quickly overwhelm the back-ends. A firewall could at least have helped shut off the pipe to the database, though the consequences to the application would have been no different, and a restart would still have been required since there seemed to be no way to dynamically shut off that particular function.
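If retries must exist at all, they should at least be bounded. Here is a sketch, with names of my own invention, of a retry wrapper that caps the number of attempts and backs off exponentially between them, so a struggling back-end sees a fixed worst-case number of requests per caller rather than an unbounded storm:

```java
import java.util.concurrent.Callable;

// Sketch of a bounded retry policy (illustrative, not a specific library).
// A hard cap on attempts plus exponential backoff means a failing back-end
// sees at most maxAttempts requests per caller, not an unbounded retry storm.
public class BoundedRetry {
    public static <T> T call(Callable<T> op, int maxAttempts, long baseDelayMs)
            throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;
                // Exponential backoff: 1x, 2x, 4x ... the base delay.
                Thread.sleep(baseDelayMs << attempt);
            }
        }
        throw last;  // give up after the cap; do NOT keep hammering the back-end
    }
}
```

Even better is combining a cap like this with a circuit breaker, so that once the back-end is known to be down, callers stop retrying entirely.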
Certainly the error logic sounded confusing at best. And error paths through code are the ones least frequently tested, so they tend to fail magnificently in production.
Tuesday, September 14, 2010
I'm reminded today of a simple check to run after performance tests, especially when adding more JVMs to a cluster: count the number of exceptions in each server's log files. (Clear the logs before running the test, of course.) If the counts are not all roughly the same, or one app server is significantly skewed from the others, then it is clear that JVM has issues that need to be checked. Sometimes it is configuration or a misplaced JAR file.
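The check is easy to script. A minimal sketch, assuming "Exception" appearing in a line is a good-enough marker for your log format (adjust the filter for whatever your app server actually emits):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Sketch: count exception markers in an app server log so totals can be
// compared across cluster members after a test run. Matching on the
// substring "Exception" is an assumption; tune it to your log format.
public class LogExceptionCount {
    public static long countExceptions(Path log) throws IOException {
        try (Stream<String> lines = Files.lines(log)) {
            return lines.filter(l -> l.contains("Exception")).count();
        }
    }
}
```

Run it over each member's log and eyeball the numbers side by side; the outlier is the JVM to investigate first.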