It is rare for anyone to publish details about the root cause of a production outage, so Facebook's report on an outage they had is an interesting read for anyone into troubleshooting and problem determination. It sounds like they could have turned off the particular function, but had to completely restart the environment to do it. This is why it is important to have circuit breakers that can be activated dynamically.
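To make the idea concrete, here is a minimal sketch of a dynamically togglable kill switch; the names (`KillSwitch`, `lookup_profile`, `query_database`) are hypothetical and not from the report, and a real deployment would flip the flag through an operations tool rather than in-process:

```python
import threading

class KillSwitch:
    """Hypothetical runtime kill switch: code checks it before calling
    a risky dependency, so the feature can be disabled without a restart."""
    def __init__(self):
        self._enabled = True
        self._lock = threading.Lock()

    def disable(self):
        with self._lock:
            self._enabled = False

    def enable(self):
        with self._lock:
            self._enabled = True

    @property
    def enabled(self):
        with self._lock:
            return self._enabled

switch = KillSwitch()

def query_database(user_id):
    # Stand-in for the real back-end call.
    return {"id": user_id}

def lookup_profile(user_id):
    # If operators have flipped the switch off, degrade gracefully
    # instead of hammering the failing back-end.
    if not switch.enabled:
        return None  # serve a default or cached value instead
    return query_database(user_id)
```

The point is that the check happens on every request, so operators can turn the feature off in seconds instead of restarting the whole environment.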
One also wonders what infrastructural changes could be made in the environment to help. It sounds like the application logic continued to retry requests. This is why I'm not a fan of applications automatically retrying requests: when failure occurs, the retries can quickly overwhelm the back-ends. A firewall could at least have helped shut off the pipe to the database, though the consequences for the application would have been no different, and a restart would still have been required since there seemed to be no way to dynamically shut off that particular function.
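If retries are used at all, they should be bounded and backed off so a struggling back-end sees a fixed number of requests per caller rather than a retry storm. A minimal sketch (the helper name and parameters are my own, not from the report):

```python
import time

def call_with_bounded_retries(op, max_attempts=3, base_delay=0.01):
    """Retry an operation a bounded number of times with exponential
    backoff. A failing back-end sees at most max_attempts requests per
    caller, instead of an unbounded stream of retries."""
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up and surface the failure to the caller
            time.sleep(base_delay * (2 ** attempt))

# Example: an operation that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("back-end unavailable")
    return "ok"
```

Even this only limits amplification; it doesn't replace a kill switch, because every caller still sends its full retry budget at the back-end.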
Certainly the error logic sounded confusing at best. And error paths through code are the ones least frequently tested, so they tend to fail magnificently in production.
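One cheap way to exercise those rarely-hit error paths is to inject the failure in a unit test. A sketch using Python's standard mock library (the handler `fetch_or_default` is a hypothetical example, not code from the report):

```python
from unittest import mock

def fetch_or_default(query, default="fallback"):
    """Hypothetical handler: return the query result, or a safe
    default when the back-end raises."""
    try:
        return query()
    except ConnectionError:
        return default

# Deliberately force the failure branch, which is otherwise
# rarely exercised outside of a production incident.
failing = mock.Mock(side_effect=ConnectionError("db down"))
```

Tests like this don't prove the error logic is right, but they at least guarantee the branch has run once before it runs for real.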