Monday, August 25, 2008

Operational stability

One class of performance problems comes from runtime operations: the stability of the environment itself, separate from application code or content changes. Below are some things to consider when building out a 24/7 site so that you can maintain availability even when things take a turn for the worse. I assume you have already built out multiple cells for your high availability environment. I prefer cells over clusters because cells give you more operational/runtime flexibility than clusters, particularly when you are applying maintenance (for example, shutting off load balancing to the cell that is going to be updated).

1. Ability to make repeatable changes to the configuration. This involves scripting and testing the scripts (preferably not in production) to be sure they work as intended.

2. Identify the active configuration. You need to know which configuration is active so that the correct cells are taken out of rotation.

3. Make the scripts aware of the active configuration. One really doesn't want to have scripts making changes to the active configuration by mistake.

4. Back out. The details depend on your requirements, but you should be able to flip back to a previously working configuration as quickly as possible to minimize downtime.
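
To make points 3 and 4 concrete, here is a minimal sketch in Python of a change script that refuses to touch a cell that is still in rotation and keeps a snapshot for back out. It is only an illustration: the `Cell` class, the `in_rotation` flag, `apply_change`, `back_out`, and the `heap_mb` setting are all hypothetical names, not part of any product's administration API, and a real script would read the rotation state from the load balancer rather than from a flag in memory.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Cell:
    """Hypothetical stand-in for one cell and its configuration."""
    name: str
    in_rotation: bool  # True while the load balancer sends traffic here
    config: dict = field(default_factory=dict)


class ActiveCellError(Exception):
    """Raised when a script tries to change a cell that is serving traffic."""


def apply_change(cell, change):
    """Apply `change` to an out-of-rotation cell and return a back-out snapshot."""
    if cell.in_rotation:
        # Point 3: never modify the active configuration by mistake.
        raise ActiveCellError(f"{cell.name} is in rotation; refusing to modify it")
    snapshot = copy.deepcopy(cell.config)  # Point 4: keep the working config
    cell.config.update(change)
    return snapshot


def back_out(cell, snapshot):
    """Restore the configuration captured before the change."""
    cell.config = copy.deepcopy(snapshot)


if __name__ == "__main__":
    cell_a = Cell("cellA", in_rotation=True, config={"heap_mb": 512})
    cell_b = Cell("cellB", in_rotation=False, config={"heap_mb": 512})

    saved = apply_change(cell_b, {"heap_mb": 1024})  # allowed: cellB is idle
    back_out(cell_b, saved)                          # flip back if it misbehaves

    try:
        apply_change(cell_a, {"heap_mb": 1024})      # rejected: cellA is live
    except ActiveCellError as err:
        print(err)
```

Because the guard lives inside `apply_change` rather than in the operator's checklist, the same script can be run repeatedly (point 1) without relying on someone remembering which cell is live.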
