Wednesday, August 27, 2008

Disaster Recovery is not easy (just ask the FAA)

I commented on the flight plan problem yesterday, and it is interesting to see that the FAA actually had a Disaster Recovery (DR) plan and a running DR site in place. But it didn't work.

U.S. Airports Back to Normal After Computer Glitch - NYTimes.com
The other flight-plan facility in Salt Lake City had to handle the entire country when the Atlanta system failed but the backup system quickly overloaded, Brown said.
The interesting thing to point out here is that while the FAA had a DR plan in place, it seems no one looked at the capacity of the backup site. Was this bad planning? Bad execution? I wonder, did they ever test their DR environment? It seems pointless to have spent the time and effort to build out the DR environment only to have it fail when it was needed. A waste of taxpayers' dollars, IMHO.

Unfortunately, a DR site failing to accomplish what it was intended to do is more often the case than not. First off, building a DR environment is not easy; I know some pretty smart people who have gotten this wrong. Secondly, after the DR environment is built it must be tested, and tested as if a real disaster had occurred. The best analogy I can think of is testing your backups. Does anyone ever restore a server to see if the backups are good? I was once working with an organization whose servers crashed due to a hardware failure. They went to recover the servers and only then discovered the backups were corrupt. It took them a few days to rebuild the servers, so I got to go home.
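To make that concrete, here is a minimal sketch of what "testing your backups" can look like in practice: actually restore the archive somewhere and check what came back. It assumes the backup is a plain tar archive with a sha256 manifest written at backup time; the archive and manifest names are hypothetical, not anything standard.

```python
# Minimal sketch of a "test restore", assuming backups are tar archives and
# that a checksum manifest (one "<sha256> <file>" line per file) was written
# at backup time. The archive and manifest names below are hypothetical.
import hashlib
import tarfile
import tempfile
from pathlib import Path

def sha256(path: Path) -> str:
    """Hash a file in chunks so large restored files don't blow up memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_backup(archive: str, manifest: str) -> bool:
    """Restore the archive to a scratch directory and compare checksums."""
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive) as tar:
            tar.extractall(scratch)            # the actual test restore
        ok = True
        for line in Path(manifest).read_text().splitlines():
            expected, name = line.split(maxsplit=1)
            restored = Path(scratch) / name
            if not restored.is_file() or sha256(restored) != expected:
                print(f"FAILED: {name}")
                ok = False
        return ok

if __name__ == "__main__":
    good = verify_backup("server-backup.tar.gz", "backup.sha256")
    print("backups look good" if good else "backups are corrupt")
```

The point is not the script itself but the habit: if nothing ever restores the backup and inspects the result, you find out the backups are corrupt the same way that organization did.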

On a tangent, seeing that this is a blog on WebSphere Application Server performance, the one mistake I have seen people make with DR from a WebSphere Application Server perspective is to have a cell cross data center boundaries (I'm sure I've blogged about this before, but here it goes again). The reason this is a mistake is that networks are not reliable, and TCP/IP does not guarantee timely delivery. Any hiccup, whether from dropped packets, static electricity, or planetary alignment in the solar system, that causes even a slight degradation or lag in the network between the two data centers can wreak complete havoc within the cell. And guess how hard it is to troubleshoot that problem? Yeah, tough, real tough. Thus, even when a disaster is not occurring, strange "problems" can occur with the applications running in that cell that just cannot be easily explained. And the more distance you put between the data centers, the more likely those strange problems become.
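If you want a feel for what "even a slight degradation or lag" means in practice, something as simple as the probe below, run from a box in one data center against a host in the other, will show the round-trip times, jitter, and occasional drops you would be asking the cell to live with. The host name and port here are placeholders for the example, not anything WebSphere-specific.

```python
# Rough sketch of a latency/jitter probe between two data centers: measure
# TCP connect round-trip times from one data center to a host in the other.
# The host name and port below are placeholders, nothing more.
import socket
import statistics
import time

def probe(host: str, port: int, samples: int = 20) -> list:
    """Return per-attempt connect times in milliseconds (inf = failed)."""
    rtts = []
    for _ in range(samples):
        start = time.perf_counter()
        try:
            with socket.create_connection((host, port), timeout=2):
                rtts.append((time.perf_counter() - start) * 1000)
        except OSError:
            rtts.append(float("inf"))          # dropped or refused attempt
        time.sleep(0.5)                        # don't hammer the link
    return rtts

if __name__ == "__main__":
    rtts = probe("dmgr.other-datacenter.example.com", 9043)
    good = [r for r in rtts if r != float("inf")]
    print(f"failed attempts: {len(rtts) - len(good)}/{len(rtts)}")
    if good:
        print(f"avg {statistics.mean(good):.1f} ms, "
              f"max {max(good):.1f} ms, "
              f"jitter (stdev) {statistics.pstdev(good):.1f} ms")
```

If a trivial probe like this already shows spikes and lost attempts between the sites, imagine what the constant chatter inside a cell does with that same link.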

Likewise, when a disaster occurs and half the cell disappears, that alone can cause other problems. For one, the application servers left running in the other data center will keep looking for their lost siblings, spending time, CPU cycles, RAM, and network bandwidth searching. This too affects the applications running in this configuration.
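As a generic illustration of that cost (and emphatically not a depiction of WebSphere internals), consider what any naive "keep looking for the lost members" loop does: each probe of an unreachable peer can stall for its full connect timeout before failing, and the surviving JVMs pay that price pass after pass. The peer names and port below are made up.

```python
# Generic illustration, NOT WebSphere internals: a surviving member that
# keeps probing unreachable peers can stall for the full connect timeout on
# each attempt, burning wall-clock time and bandwidth on every pass.
# The peer host names and port are made up for the example.
import socket
import time

LOST_PEERS = [("node3.dc2.example.com", 11004),
              ("node4.dc2.example.com", 11004)]

def probe_lost_peers(timeout: float = 2.0) -> float:
    """Return the seconds spent on one pass over the unreachable peers."""
    start = time.perf_counter()
    for host, port in LOST_PEERS:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                pass                            # peer came back; nothing to do
        except OSError:
            pass                                # DNS failure, refusal, or timeout
    return time.perf_counter() - start

if __name__ == "__main__":
    # A black-holed route means each probe waits out the full 2 s, so one
    # pass over two lost peers can cost ~4 s; repeat that every few seconds
    # in every surviving JVM and the overhead adds up quickly.
    print(f"one discovery pass took {probe_lost_peers():.1f} s")
```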

Therefore, the moral of the story is to never let cell boundaries leave the data center. In fact, there are a number of reasons one should be running multiple cells within the same data center to better manage planned and unplanned maintenance, particularly in high-volume, business-critical, high-availability environments.

Oh, that, and hire a really good DR expert if you're planning on implementing DR. There is nothing like having the DR plan fail. In the FAA's case there are no repercussions (i.e., no fines imposed that cost them millions of dollars a minute). Granted, this probably did cost the airlines and the unfortunate passengers money, but nothing the FAA will have to reimburse. For a lot of private enterprises there could be severe repercussions, not just in terms of penalties and fines but in customer loyalty, customer trust, and how your enterprise is viewed as a reliable business partner going forward.
