Monday, October 19, 2009

Running out of disk space

An event this morning reminded me that performance problems are sometimes environmentally driven. One application went down with various file system errors. After some investigation, it turned out that /tmp had run out of space, yet no one knew it had happened. No monitors were in place to watch for resource exhaustion such as low disk space.

One could ask why /tmp filled up at all if daily (automated) housekeeping tasks were in place (they were not). But that question misses the point: housekeeping does not help if someone moves a whole bunch of files into /tmp between cleanup runs. They could have copied files there to install them, intending to delete them once the install was done. Or the files could have been logs staged for transfer to another environment via ftp. In any case, an automated monitoring tool should have alerted the operators that disk space was close to exhaustion and that they needed to take action to remedy the situation.

If automated monitoring is not in place, the production applications that use /tmp will fail and cause an outage. And without alerts about low disk space, it can take several hours for the operations and troubleshooting teams to figure out that the application failed because it could not write a temporary file. That is a several-hour outage that could have been completely avoided had the right resource monitors been in place.
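To make the idea concrete, here is a minimal sketch of a low-disk-space check in Python. It is not the tool we use; the function names (percent_used, check_disk) and the 90% threshold are assumptions chosen for illustration, and a real monitor would send the alert to the operators instead of printing it.

```python
import shutil
import tempfile
from typing import Optional


def percent_used(total: int, used: int) -> float:
    """Return used space as a percentage of total capacity."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * used / total


def check_disk(path: str, warn_pct: float = 90.0) -> Optional[str]:
    """Return an alert message if the file system holding `path` is at or
    above warn_pct percent used, otherwise None.

    The 90% default is an illustrative assumption, not a recommendation.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    pct = percent_used(usage.total, usage.used)
    if pct >= warn_pct:
        free_mb = usage.free / (1024 * 1024)
        return f"ALERT: {path} is {pct:.1f}% full ({free_mb:.0f} MB free)"
    return None


# Check the system temporary directory (typically /tmp on Unix).
message = check_disk(tempfile.gettempdir())
if message is not None:
    print(message)
```

Run from cron every few minutes, a check like this would have flagged the problem well before the application failed to write its temporary file.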

Other resource monitors should be watching CPU utilization, page file activity, and so on.
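For CPU, a rough sketch along similar lines is shown below. It uses the 1-minute load average as a stand-in for CPU utilization, which is only a proxy (load also counts processes waiting on I/O on some systems), and os.getloadavg() is available on Unix only. The function names and the limit of 1.0 per CPU are assumptions for illustration. Page file activity is not covered here.

```python
import os
from typing import Optional


def load_per_cpu(load1: float, cpus: int) -> float:
    """Return the 1-minute load average divided by the number of CPUs."""
    if cpus <= 0:
        raise ValueError("cpus must be positive")
    return load1 / cpus


def cpu_alert(load1: float, cpus: int, limit: float = 1.0) -> Optional[str]:
    """Return an alert message if load per CPU is at or above `limit`.

    The default limit of 1.0 is an illustrative assumption.
    """
    ratio = load_per_cpu(load1, cpus)
    if ratio >= limit:
        return (f"ALERT: 1-minute load {load1:.2f} on {cpus} CPUs "
                f"({ratio:.2f} per CPU)")
    return None


def check_current_load(limit: float = 1.0) -> Optional[str]:
    """Check the live system load (Unix only)."""
    load1, _, _ = os.getloadavg()   # 1-, 5-, and 15-minute averages
    cpus = os.cpu_count() or 1      # cpu_count() may return None
    return cpu_alert(load1, cpus, limit)
```

Keeping the threshold logic in cpu_alert, separate from the call that reads the live value, makes it easy to test the alert rule without loading the machine.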