AWS Outage Takeaways: To Err is Human, But Not a Best Practice

March 6th, 2017

There is an expression, “To err is human, but to really mess things up, you need a computer.” That could describe the recent #AWSoutage, which caused many websites to malfunction. To Amazon’s credit, the company was quite specific in revealing the cause of the outage: a typo in a command issued by an operator.

Amazon explains it this way: “At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.”

Stuff happens

There is another expression that applies. The clean version is, “Stuff happens.” But unlike hard disk drive failures or power outages or earthquakes, this outage was man-made and avoidable. The best way to avoid self-inflicted problems like this is with visibility, control, and governance: in other words, best practices. These are often checklists that, when followed, ensure operators “do no harm.”

Alas, a checklist is only as reliable as the person who is supposed to follow it. The Amazon operator was indeed using the Playbook, i.e. following Amazon’s checklist of “Best Practices”, but still made a typo.

Automating best practices

The best way to ensure employees follow Best Practices is to automate those checks. Businesses need to make Best Practices part of the system, programmatically. Amazon has promised to do that now, by making it impossible for an operator to enter such a destructive command, stating, “We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level. This will prevent an incorrect input from triggering a similar event in the future.”
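The two safeguards Amazon describes, slower removal and a minimum-capacity floor, can be sketched in a few lines of Python. This is only an illustration of the idea, not Amazon's actual tool: the names Subsystem, remove_capacity, min_required, and max_per_step are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


class CapacityGuardError(Exception):
    """Raised when a removal request would breach the capacity floor."""


@dataclass
class Subsystem:
    # Hypothetical model of a subsystem; not an AWS data structure.
    name: str
    active_servers: int
    min_required: int  # assumed minimum capacity the subsystem needs


def remove_capacity(subsystem: Subsystem, requested: int,
                    max_per_step: int = 2) -> List[int]:
    """Remove `requested` servers in small batches.

    The whole request is rejected up front if it would take the
    subsystem below its minimum, so a mistyped number fails safely.
    Returns the list of batch sizes that were applied.
    """
    if requested <= 0 or max_per_step <= 0:
        raise ValueError("requested and max_per_step must be positive")

    remaining = subsystem.active_servers - requested
    if remaining < subsystem.min_required:
        raise CapacityGuardError(
            f"{subsystem.name}: removing {requested} would leave "
            f"{remaining} servers, below the minimum of "
            f"{subsystem.min_required}"
        )

    batches = []
    left = requested
    while left > 0:
        step = min(left, max_per_step)  # never remove more than one batch at a time
        subsystem.active_servers -= step
        batches.append(step)
        left -= step
    return batches
```

With a guard like this, an operator who types 50 instead of 5 gets an error message rather than an outage. A real tool would also pause and run health checks between batches.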

Automation is how Best Practices become foolproof. Indeed, Amazon already offers a few dozen automated Best Practices, for a fee, to its cloud customers. That’s a good start.

Simplified governance

Building best practice checks directly into a unified cloud management platform reduces the potential for human error even more. In fact, CloudCheckr integrates over 400 Best Practice checks, not only to increase reliability but also to optimize cost, security, and utilization. For example, one automated security check ensures root (unrestricted) access is not used or abused, so an individual cannot easily bring down your cloud.
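As a rough illustration of what an automated root-access check might look like (not CloudCheckr's implementation), the Python sketch below scans a list of audit events and flags any performed with root credentials. The event shape is an assumption loosely modeled on AWS CloudTrail records, where the userIdentity type is "Root" for root-account activity. The function name find_root_activity and the sample events are hypothetical.

```python
from typing import Dict, List


def find_root_activity(events: List[Dict]) -> List[Dict]:
    """Return the events that were performed by the root identity.

    Assumes each event is a dict with an optional "userIdentity" dict
    containing a "type" field; events without it are ignored.
    """
    return [
        event for event in events
        if event.get("userIdentity", {}).get("type") == "Root"
    ]


# Hypothetical sample data for illustration only.
sample_events = [
    {"eventName": "ListBuckets", "userIdentity": {"type": "IAMUser"}},
    {"eventName": "DeleteBucket", "userIdentity": {"type": "Root"}},
    {"eventName": "GetObject"},
]

flagged = find_root_activity(sample_events)
for event in flagged:
    print("Root credentials used for:", event["eventName"])
```

Run on a schedule, a check like this turns “don’t use root” from a checklist item into an alert.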

Try CloudCheckr free for 14 days to explore how over 400 best practice checks can automate and optimize your cloud.
