Tuesday, July 22, 2008

Reliability and Availability

Reliability and availability are common topics in the performance space. The article quoted below describes some of the problems that can occur in a cloud computing environment. The underlying challenge is building reliable systems from unreliable components. From a hardware perspective this can be done if you have enough money for all the redundancy that is necessary. If you try to do it on the cheap, you will fail.
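To put a rough number on what redundancy buys, here is a minimal sketch of the standard availability formula for redundant components, 1 - (1 - a)^n. It assumes failures are independent, which shared power, networks and bad deployments often violate, so treat the results as an upper bound. The function name and the 99% figure are made up for illustration.

```python
def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of `copies` redundant components when any one is enough.

    Assumes independent failures (an idealization): the system is down
    only when every copy is down at the same time.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("component_availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - component_availability) ** copies


# Hypothetical component that is up 99% of the time.
for copies in (1, 2, 3):
    print(copies, f"{parallel_availability(0.99, copies):.6f}")
```

Each extra copy multiplies the downtime by the failure probability of a single component, which is why redundancy works and also why every extra "nine" costs another full set of hardware.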

There is an interesting point in the article: Google is trying to solve this problem with software. While products like WebSphere XD provide software-level solutions for some problems, they can't solve the problem as easily as hardware can. For example, suppose the database is slow. Give the database faster disks, more RAM or a RAM-backed SAN and you can eliminate that problem. There is relatively little you can do from a software perspective to fix it. Another example is a server going down. Sure, we could route data to another server using a software component, but then why not just do it at the hardware level? Okay, so there are a couple of places where software is useful, like maintaining J2EE session affinity, but the same can be done by a hardware load balancer. It just depends on where you put the smarts.
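For a sense of what the software version of "route data to another server" looks like, here is a minimal sketch of client-side failover. Everything in it is hypothetical: the server names, the `send` callable and the choice of `ConnectionError` as the failure signal are assumptions for illustration, not how WebSphere XD or any particular product does it.

```python
from typing import Callable, Optional, Sequence


def call_with_failover(servers: Sequence[str], send: Callable[[str], str]) -> str:
    """Try each server in order and return the first successful response."""
    last_error: Optional[Exception] = None
    for server in servers:
        try:
            return send(server)
        except ConnectionError as exc:
            # Remember the failure and fall through to the next server.
            last_error = exc
    raise ConnectionError("all servers failed") from last_error


def fake_send(server: str) -> str:
    """Stand-in for a real network call; pretends that app1 is down."""
    if server == "app1":
        raise ConnectionError("app1 is down")
    return f"handled by {server}"


print(call_with_failover(["app1", "app2"], fake_send))  # prints "handled by app2"
```

Even this toy version has to decide what counts as a failure, how long to wait and what to do when every server is down, and that is the extra layer of smarts a hardware load balancer would otherwise own.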

The problem with using software to fix this is that it introduces another, more complex layer of hardware and software, whereas redundant hardware keeps things a lot simpler.

A few people will argue that software is cheaper. I don't agree. I think hardware is cheaper, especially when it makes troubleshooting that much easier (and quicker) than it is with home-built software. If it takes 6-12 months to debug software in production, that is a lot of money down the drain (and bad press earned).

Amazon S3: For now at least, sometimes you have to reboot the cloud | News - Business Tech - CNET News
Afterward, Om Malik called cloud computing frail: "The S3 outage points to a bigger (and a larger) issue: the cloud has many points of failure--routers crashing, cable getting accidentally cut, load balancers getting misconfigured, or simply bad code." And he's right, to a degree, but there are three things that shouldn't be overlooked before writing cloud computing off as a failure.
