3 More AWS Outage Post-Mortems

More than a week after the now-infamous Amazon Web Services outage, the company has issued a statement explaining the incident. Netflix has also published its much-anticipated blog post detailing how it stayed up during the outage, and SimpleGeo has explained its own strategy.

AWS’ Official Statement

According to AWS, the problem stemmed from an error made during a network upgrade. The error caused a brief network outage, and when connectivity was restored, Elastic Block Store (EBS) volumes went into a “re-mirroring storm.” The sheer number of volumes trying to re-mirror at once overloaded the EBS control plane, preventing customers from creating new EBS volumes.
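
To make the mechanism concrete, here is a toy sketch of how a backlog of re-mirror requests can starve ordinary requests on a shared control plane. It is not AWS's architecture: the single first-in, first-out queue, the function name simulate_control_plane and all of the numbers are assumptions chosen purely for illustration.

```python
from collections import deque


def simulate_control_plane(remirror_backlog, capacity_per_tick, creates_per_tick, ticks):
    """Toy model (an assumption, not AWS's design): one FIFO queue
    handles both re-mirror requests and customer create-volume requests."""
    # The storm: every affected volume asks to re-mirror at the same moment.
    queue = deque(["remirror"] * remirror_backlog)
    served_creates = 0
    for _ in range(ticks):
        # Customers keep submitting create-volume requests each tick.
        queue.extend(["create"] * creates_per_tick)
        # The control plane can process only a fixed number of requests per tick.
        for _ in range(min(capacity_per_tick, len(queue))):
            if queue.popleft() == "create":
                served_creates += 1
    waiting_creates = sum(1 for request in queue if request == "create")
    return served_creates, waiting_creates


calm = simulate_control_plane(0, 10, 5, 20)
storm = simulate_control_plane(1000, 10, 5, 20)
print("calm :", calm)
print("storm:", storm)
```

With no backlog, every create request in this model is served in the tick it arrives. With a backlog of 1,000 re-mirror requests ahead of them, none of the create requests is served within the 20 simulated ticks, which is the kind of starvation the statement describes.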

AWS downplays the impact of the EBS crash on other Availability Zones. AWS claims that only 2.5% of Relational Database Service customers with multi-zone accounts had trouble. It does, however, acknowledge that many customers rely on the EBS control plane to recover from failures: “Our EBS control plane is designed to allow users to access resources in multiple Availability Zones while still being tolerant to failures in individual zones. This event has taught us that we must make further investments to realize this design goal.”

AWS says it will make it easier for customers to take advantage of multiple zones. It also admitted to problems with its customer communication.

AWS is providing a 10-day credit equal to 100% of usage to all customers with an attached EBS volume in the affected Availability Zone, relieving concerns that the company wouldn’t consider the EBS outage to be covered by the EC2 SLA.

Netflix

Netflix engineers Adrian Cockcroft, Cory Hicks and Greg Orzell posted a long lessons-learned summary to the Netflix Tech Blog. The team says it will automate its zone fail-over and recovery process, host its services in multiple regions and further reduce its dependence on EBS.

SimpleGeo

SimpleGeo’s team posted an article titled How SimpleGeo Stayed Up During The AWS Downtime. The company relied mostly on multi-zone deployment. Interestingly, SimpleGeo relied heavily on EBS yet made it through the outage with very little downtime, even though AWS doesn’t offer a multi-zone option for EBS. SimpleGeo didn’t detail how it managed its multi-zone EBS deployments, but presumably the team has its own system for replicating data between data stores, since it wasn’t possible to create new EBS volumes during the outage.