3 More AWS Outage Post-Mortems

More than a week after the now-infamous Amazon Web Services outage, the company has issued a statement explaining the incident. Netflix has also published its much-anticipated blog post detailing how it managed to stay up during the outage, and SimpleGeo has explained its own strategy.

AWS’ Official Statement

According to AWS, the problem stemmed from an error made during a network upgrade. The error caused a brief network outage, and when connectivity was restored, Elastic Block Store (EBS) volumes went into a “re-mirroring storm.” The sheer number of EBS volumes trying to re-mirror at once overloaded the EBS control plane, preventing customers from creating new EBS volumes.
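To make the overload concrete, here is a toy sketch in Python of a request queue swamped by re-mirror traffic. It is not a model of AWS's actual control plane: the first-in, first-out queue, the per-tick capacity, the timeout, and the function name `simulate_control_plane` are all assumptions chosen for illustration.

```python
from collections import deque


def simulate_control_plane(remirror_requests, capacity_per_tick, timeout_ticks):
    """Toy model (hypothetical, not AWS's design).

    A single "create volume" request is queued behind a backlog of
    re-mirror requests. The control plane serves a fixed number of
    requests per tick, in arrival order. Returns the tick at which the
    create request is served, or None if it times out first.
    """
    queue = deque(["remirror"] * remirror_requests)
    queue.append("create")

    tick = 0
    while queue:
        tick += 1
        if tick > timeout_ticks:
            return None  # the customer's request gave up waiting
        for _ in range(min(capacity_per_tick, len(queue))):
            if queue.popleft() == "create":
                return tick
    return None


# A small backlog delays the create request slightly.
print(simulate_control_plane(10, capacity_per_tick=5, timeout_ticks=10))
# A storm of re-mirror requests starves it until it times out.
print(simulate_control_plane(1000, capacity_per_tick=5, timeout_ticks=10))
```

In this simplified picture, the create request itself is cheap. It fails only because it is stuck behind a backlog that the control plane cannot drain before the timeout.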

AWS downplays the impact of the EBS crash on other Availability Zones. AWS claims that only 2.5% of Relational Database Service customers with multi-zone accounts had trouble. It does, however, acknowledge that many customers rely on the EBS control plane to recover from failures: “Our EBS control plane is designed to allow users to access resources in multiple Availability Zones while still being tolerant to failures in individual zones. This event has taught us that we must make further investments to realize this design goal.”

AWS also says it will make it easier for customers to take advantage of multiple zones, and it admitted to having trouble with its customer communication.

AWS is providing a 10-day credit equal to 100% of usage to all customers with an attached EBS volume, relieving concerns that the company wouldn’t consider the EBS outage part of the EC2 SLA.

Netflix

Netflix engineers Adrian Cockcroft, Cory Hicks and Greg Orzell posted a long lessons-learned summary to the Netflix Tech Blog. The team says it will automate its zone fail-over and recovery process, host its services in multiple regions and further reduce its dependence on EBS.
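Netflix did not publish code for its fail-over process in the material summarized here, so the following Python sketch only illustrates the general idea of automated zone fail-over: traffic is split evenly across whichever zones currently report healthy. The function name `failover_weights`, the zone names, and the even-split policy are all hypothetical.

```python
def failover_weights(zones, healthy):
    """Hypothetical fail-over policy: split traffic evenly across healthy zones.

    zones   -- list of all zone names the service is deployed in
    healthy -- set of zone names currently passing health checks
    Returns a dict mapping every zone to its share of traffic (0.0 to 1.0).
    """
    live = [z for z in zones if z in healthy]
    if not live:
        raise RuntimeError("no healthy zone available; regional fail-over needed")
    share = 1.0 / len(live)
    return {z: (share if z in healthy else 0.0) for z in zones}


# Example with made-up zone names: one of three zones is down.
zones = ["zone-a", "zone-b", "zone-c"]
print(failover_weights(zones, healthy={"zone-a", "zone-c"}))
```

The error raised when no zone is healthy is the case that hosting services in multiple regions is meant to address.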

SimpleGeo

SimpleGeo’s team posted an article titled “How SimpleGeo Stayed Up During The AWS Downtime.” The company relied mostly on multi-zone deployment. The interesting part is that SimpleGeo relied heavily on EBS but made it through the outage with very little downtime, even though AWS doesn’t offer a multi-zone option for EBS. SimpleGeo didn’t detail how it managed its multi-zone EBS deployments, but presumably the team has its own system for replicating data between data stores, since it wasn’t possible to create new EBS volumes during the outage.
