The Amazon Web Services disruption is now in its second day as engineers work to restore the last of the availability zones. Meanwhile, affected customers spent the day talking with their own customers, while others, such as Twilio, were able to show how they avoided outages.
The latest update from AWS came late Thursday night:
10:58 PM PDT Just a short note to let you know that the team continues to be all-hands on deck trying to add capacity to the affected Availability Zone to re-mirror stuck volumes. It’s taking us longer than we anticipated to add capacity to this fleet. When we have an updated ETA or meaningful new update, we will make sure to post it here. But, we can assure you that the team is working this hard and will do so as long as it takes to get this resolved.
AWS kept its status page updated throughout the day. By early afternoon, all but one availability zone had been restored:
12:30 PM PDT We have observed successful new launches of EBS (Elastic Block Store) backed instances for the past 15 minutes in all but one of the availability zones in the US-EAST-1 Region. The team is continuing to work to recover the unavailable EBS volumes as quickly as possible.
Data Center Knowledge wrote that the issue is specifically about EBS. You may recall that Reddit has struggled with EBS for some time. We wrote about the issue in March.
AWS briefly explained what happened in an update yesterday morning:
8:54 AM PDT We’d like to provide additional color on what we’re working on right now (please note that we always know more and understand issues better after we fully recover and dive deep into the post mortem). A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones, which impacted new EBS volume creation as well as the pace with which we could re-mirror and recover affected EBS volumes. Additionally, one of our internal control planes for EBS has become inundated such that it’s difficult to create new EBS volumes and EBS backed instances. We are working as quickly as possible to add capacity to that one Availability Zone to speed up the re-mirroring, and working to restore the control plane issue. We’re starting to see progress on these efforts, but are not there yet. We will continue to provide updates when we have them.
The AWS region that went down is based in Northern Virginia and has four availability zones. Customers have learned to spread their systems across multiple availability zones within a region to avoid outages, and that’s what’s puzzling. How did this all happen? It may come down to how AWS defines availability zones.
The founder of FathomDB examined that question in a post on the topic:
AWS has two concepts that relate to availability – Regions and Availability Zones. They have five Regions – two in the US (one east coast, one west coast), one in Europe (Ireland), and two in Asia (Tokyo, Singapore). Each region has within it multiple “Availability Zones” (AZs), which are supposed to be isolated so that they have no single point of failure less than a natural disaster or something of that magnitude. AWS says that by “launching instances in separate Availability Zones, you can protect your applications from failure of a single location”. It’s not clear whether ‘location’ means separate datacenters or separate floors/areas of a single datacenter, but it doesn’t really matter – the point is that AZs should fail independently until a catastrophic failure occurs. [Update below: it seems likely that they are in fact separate datacenters]
The post goes on to argue that AWS, not its customers, is to blame:
This morning, multiple availability zones failed in the us-east region. AWS broke their promises on the failure scenarios for Availability Zones. It means that AWS have a common single point of failure (assuming it wasn’t a winning-the-lottery-while-being-hit-by-a-meteor-odds coincidence). The sites that are down were correctly designing to the ‘contract’; the problem is that AWS didn’t follow their own specifications. Whether that happened through incompetence or dishonesty or something a lot more forgivable entirely, we simply don’t know at this point. But the engineers at quora, foursquare and reddit are very competent, and it’s wrong to point the blame in that direction.
BigDoor CEO Keith Smith wrote in a blog post that he and his colleagues spent the day in crisis control. Amazon has historically been quiet about its problems, and Smith said that silence has made it more difficult to respond to customers:
We aren’t just sitting around waiting for systems to recover. We are actively moving instances to areas within the AWS cloud that are actually functioning. If Amazon had been more forthcoming with what they are experiencing, we would have been able to restore our systems sooner.
But Twilio delighted in the outage, which gave the company a chance to tout its architecture. Twilio has instituted a set of architectural design principles that minimize the impact of occasional, but inevitable, issues in underlying infrastructure. A post on the new Twilio engineering blog outlines the company’s approach.
The AWS outage will pass, but some things still need to be explained, in particular how AWS treats availability zones. Customers need that clarification to better prepare for future outages.