Stop Blaming the Customers – the Fault is on Amazon Web Services

Almost as galling as the Amazon Web Services outage itself is the litany of blog posts, such as this one and this one, that place the blame not on AWS for suffering a long failure and failing to communicate with its customers about it, but on AWS customers for not being better prepared for an outage. It’s a tendency that displays a “blame the customer” mentality I’ve been seeing a lot lately. To understand why it’s wrong, one has to understand what actually happened and what claims AWS made about its services.

We covered the differences between Availability Zones and Regions, and AWS’ lack of communication, in our previous coverage. Now that the dust has settled, it’s worth looking back at what happened. This timeline by Eric Kidd explains the series of events, and the various options different customers had. RightScale provides another good summary. What can we learn?

What Amazon Claims

Here’s what AWS claims about Availability Zones:

Each availability zone runs on its own physically distinct, independent infrastructure, and is engineered to be highly reliable. Common points of failures like generators and cooling equipment are not shared across Availability Zones. Additionally, they are physically separate, such that even extremely uncommon disasters such as fires, tornados or flooding would only affect a single Availability Zone.

In other words, AWS claimed that putting your data in different Availability Zones within one region made it redundant. As far as AWS’ customers were concerned, they didn’t have a single point of failure.

Amazon Relational Database Service customers have the option of paying double the regular cost of the service for a multi-zone service: “When you run your DB Instance as a Multi-AZ deployment for enhanced data durability and availability, Amazon RDS provisions and maintains a standby in a different Availability Zone for automatic failover in the event of a scheduled or unplanned outage.”

What Happened vs. What Was Supposed to Happen

The mass outage was due to problems with the Elastic Block Storage (EBS) service in a single Availability Zone. An EBS volume can live in only one Availability Zone, but users should have been able to use their snapshots to create new volumes in another Availability Zone. RDS depends on EBS, but RDS customers paying for multi-zone service should have had their databases failed over to another zone automatically.

However, the “control plane” for creating new EBS volumes suffered congestion, preventing failover, whether manual or automatic. The current assumption is that it was overloaded by customers whose initial EBS volumes had failed. Kidd calls this a “bank run.”

The important thing here is that there actually was, unbeknownst to AWS customers, a single point of failure across zones: the control plane. This made AWS unable to fulfill its own failover promises. In fact, RDS customers ended up in worse shape than many others: it took over 14 hours to move many of their databases, longer than it took customers who were able to fail over manually.
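The failure mode is easy to sketch. Each zone’s storage is independent, but replacement volumes are created through one shared control plane. A toy simulation (all numbers are invented for illustration, not taken from AWS) shows how a single zone’s failure can rate-limit everyone’s recovery:

```python
# Toy model of the EBS "bank run" (all numbers invented for illustration).
# Each zone's storage is independent, but requests to create replacement
# volumes funnel through one shared control plane -- the hidden cross-zone
# single point of failure.

CUSTOMERS_PER_ZONE = 1000
CONTROL_PLANE_RATE = 50  # volume-creation requests served per time step

def simulate_zone_failure(ticks):
    """Every customer in the failed zone requests a replacement volume at
    once; recovery proceeds only as fast as the shared control plane."""
    pending = CUSTOMERS_PER_ZONE
    recovered = 0
    for _ in range(ticks):
        served = min(pending, CONTROL_PLANE_RATE)
        pending -= served
        recovered += served
    return recovered, pending

recovered, waiting = simulate_zone_failure(ticks=5)
# The other zones' storage is perfectly healthy, yet every failover request,
# including requests from healthy zones, queues behind this backlog.
print(f"After 5 ticks: {recovered} recovered, {waiting} still waiting")
```

The point of the sketch is that the bottleneck is orthogonal to zone isolation: no amount of per-zone redundancy helps when recovery itself runs through a shared queue.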

Multi-Region Multi-Vendor Deployments

So why not place applications in multiple regions, just to be safe? It’s not that simple. First, AWS charges more for transfers between regions. But more importantly, it’s technologically more complex. Amazon Machine Images (AMIs) can’t simply be moved from one region to another. Justin Santa Barbara writes “The different regions have different features available, different AMI ids, I think reserved instances can’t be moved between datacenters – in reality failover between regions is not realistic.”

Santa Barbara writes that it may actually be easier to fail over to an entirely separate cloud than to use regions for failover. I’m not sure that’s the case, but regional failover is certainly complicated. And based on the claims made about Availability Zones, it would have seemed unnecessary before last week. After all, if each data center in a region is a discrete entity insulated from the failures of every other data center, why would it be necessary to add yet another data center in another region? Especially if doing so adds great expense?

Chris M Evans recommends using multiple cloud providers. To his credit, he recommended this even before the AWS outage (one of the things that bothers me about the blame-the-customer crowd is that their wisdom about what customers should have done comes entirely after the fact). Again, however, this adds complexity, and with that complexity come additional costs and additional risks. To many customers it seemed natural to rely on multiple Availability Zones instead of multiple providers.

Even BigDoor CEO Keith Smith concluded his widely cited piece on Amazon’s failure to communicate with customers by writing:

We can spend cycles designing and building technical belts and suspenders that will help us avoid a massive failure like this in the future, or we can continue to rely on a single huge partner and also continue our break-neck pace of iteration and product development.

I can’t tell you today which option we will choose. But I’m sure it will be the question on the mind of many startups across the country.

George Reese of enStratus wrote for O’Reilly Media: “In short, if your systems failed in the Amazon cloud this week, it wasn’t Amazon’s fault. You either deemed an outage of this nature an acceptable risk or you failed to design for Amazon’s cloud computing model.”

That misses the point. Accepting a certain amount of downtime is one thing; accepting 14 hours of downtime when you’ve already paid extra for redundancy is another.

Yes, customers accept a certain amount of risk, but that doesn’t make it their fault when Amazon screws up.

Why Didn’t Some Sites, Like SmugMug and Twilio, Go Down?

What about the companies that had the good fortune to avoid outages? Aren’t they evidence that it’s the customers’ fault for not setting things up right? Not really. Both Twilio and SmugMug boast about their “design for failure” but the important thing is that neither company relied on EBS. Had these companies been dependent on EBS, they likely would have suffered a similar fate.

What About Netflix?

What about Netflix? Netflix, as documented by Adrian Cockcroft, does use EBS.

Kidd writes about Netflix:

Run in 3 AZs, at no more than 60% capacity in each. This is the approach taken by Netflix, which sailed through this outage with no known downtime. If a single AZ fails, then the remaining two zones will be at 90% capacity. And because the extra capacity is running at all times, Netflix doesn’t need to launch new instances in the middle of a “bank run.”
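Kidd’s arithmetic is worth spelling out. A quick back-of-the-envelope calculation, using only the figures from his post, shows why 60% utilization across three zones leaves enough headroom:

```python
# Headroom arithmetic from Kidd's description of the Netflix approach.
ZONES = 3
UTILIZATION = 0.60  # normal per-zone load

total_load = ZONES * UTILIZATION           # 1.8 zones' worth of work
after_failure = total_load / (ZONES - 1)   # spread across the 2 survivors

print(f"Per-zone load after losing one AZ: {after_failure:.0%}")  # 90%
```

Because the surviving zones stay just under full capacity, no new instances need to be created during the outage, which is exactly when the control plane is at its most congested.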

It’s not clear how much Netflix uses EBS, but Cockcroft gave a presentation saying Netflix avoids it. This tweet indicates that Netflix is more reliant on S3, SimpleDB and Apache Cassandra than on EBS, but Cockcroft did note that the company was having EBS trouble during the outage.

It’s also worth noting that Cockcroft tweeted that Netflix only runs out of one region.

It’s The Customers’ Fault Because They Shouldn’t Have Been Using EBS in the First Place

I love this argument – that it’s customers’ fault for using EBS in the first place. Mashery co-founder Clay Loveless makes this case.

AWS has been offering the EBS service since 2008. It’s not considered a “beta” product. Why shouldn’t customers be able to rely on it? True, it’s had issues over the years, leading some companies to decide not to use it. But AWS has happily taken money from customers for years now. If it’s a product that isn’t ready for production, AWS should have said so. (Unfortunately for customers, the EBS outage won’t count towards their SLAs.)

What the “they shouldn’t have used EBS” argument comes down to is: customers are stupid for trusting AWS to provide the service promised. It’s saying that customers that paid for multi-zone RDS replication should have expected 14+ hours of downtime. If AWS itself were to tell its customers “You should have known better than to trust our service,” we would be up in arms – wouldn’t we?

I keep seeing similar arguments. “We shouldn’t blame Dropbox for lying about its encryption, we should blame customers for trusting Dropbox.” “We shouldn’t blame Apple for not giving users control over their location logs, we should blame customers for expecting privacy.” I’m sick of it.

It might in fact be true that we can’t expect vendors to provide customers what they promise. But that is squarely on the shoulders of the vendors, not the customers. And I’m sick of “savvy” pundits putting down customers and excusing failure and bad behavior on the part of companies.

Yes, things happen. AWS is run by humans, and humans make mistakes. AWS deserves some forgiveness. But let’s not forget who messed up.

How to Fix the Problem

In the short term, I suspect many customers will move away from using EBS and RDS. In the medium term, infrastructure-as-a-service providers need to come up with a standard system for sharing instances across clouds, whether that’s OpenStack, Cloud Foundry, Eucalyptus or something else. Customers shouldn’t have to choose between trusting only one provider or committing to a complex and potentially unreliable multi-vendor solution. The days of vendor lock-in must come to an end.

Meanwhile, bloggers, analysts, journalists and other opinion-makers need to put the blame back where it belongs: on service providers that don’t live up to their promises.

(Lead image by Ian)

Disclosure: Mashery is a ReadWriteWeb sponsor.
