Last week, Amazon Web Services (AWS) suffered a brief network outage in its US-East-1 region (Northern Virginia). While relatively mild, the outage did affect a few AWS customers, including Engine Yard, a PaaS provider that relies on AWS to provide its services to more than 2,300 customers. Total Engine Yard customers affected by the outage? Three.
Three out of 2,300 is a pretty impressive ratio. Granted, this was a very minor outage for AWS. But it could have affected more customers if Engine Yard didn’t follow a few best practices for avoiding disruptions on cloud services. Bill Platt, Engine Yard’s VP of operations, and Mark Gaydos, SVP of worldwide marketing, talked to me a bit today about how Engine Yard stays up when AWS is down.
Communication
No matter what, outages will happen at some point to one or more customers. Platt says that it’s important to “create as much transparency as possible for the customer to know what’s happening. We share what’s happening as fast as possible.”
Platt also says that the company tries to “manage ahead of the urgency.” That means that if a customer is looking for hourly updates, “we’re providing information by the minute, ahead of their requested level of urgency.” In other words, don’t make the customer come looking for updates; provide them as quickly as possible.
Communication with the team, internal and external, is important as well. Platt says that “we really are partners with our infrastructure provider [Amazon], we look at it as one team.” When there is an outage, the teams from Engine Yard and AWS meet in a “virtual war room” and communicate in real time for the duration. The companies also work together to approve external publications so customers know as soon as possible what’s happening, why, and what’s being done about it.
Note that this doesn’t require a fancy toolset. The go-to tool for communication in a crisis? IRC. “It’s the most rudimentary interface that requires the smallest amount of bandwidth, and we’re sure it’s going to work.”
Isolation and Redundancy
One of the reasons that few Engine Yard customers were hit by the outage is the way that Engine Yard’s service is structured. While AWS suffers outages occasionally, they’re usually confined to a single region. If all of your eggs are in the unlucky basket, you’re toast. That’s why, according to Platt, Engine Yard has built its infrastructure to span multiple regions, multiple AWS availability zones and multiple physical infrastructures.
A problem affecting hardware, like the network issue last week, is less likely to affect customers outside of a single data center. By spreading its customers and infrastructure out, Engine Yard reduces the odds that an outage will affect more than a fraction of its customer base.
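To make the arithmetic behind that concrete, here is a minimal sketch. The zone names and the round-robin placement are assumptions made up for illustration, not a description of how Engine Yard actually places customers. It simply compares the share of customers hit by a single-zone failure when everyone sits in one zone versus when they are spread evenly.

```python
def assign_round_robin(customer_ids, zones):
    """Place customers across zones in round-robin order (hypothetical scheme)."""
    return {cid: zones[i % len(zones)] for i, cid in enumerate(customer_ids)}


def fraction_affected(placement, failed_zone):
    """Return the share of customers whose zone is the one that failed."""
    if not placement:
        return 0.0
    hit = sum(1 for zone in placement.values() if zone == failed_zone)
    return hit / len(placement)


# Hypothetical zone names; the customer count matches the article's 2,300.
zones = ["zone-a", "zone-b", "zone-c", "zone-d"]
customers = range(2300)

spread = assign_round_robin(customers, zones)
single = {cid: "zone-a" for cid in customers}

spread_share = fraction_affected(spread, "zone-a")  # customers spread over four zones
single_share = fraction_affected(single, "zone-a")  # every customer in one zone

print(f"spread over {len(zones)} zones: {spread_share:.0%} affected")
print(f"all in one zone: {single_share:.0%} affected")
```

With four evenly loaded zones, a failure in one of them touches a quarter of the customers instead of all of them. Real placements are rarely this even, and many failures hit only part of a zone, so the affected share can be far smaller.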
The company also isolates its components, so that an outage might affect one component, but not the entire stack. For example, Platt says that the Engine Yard dashboard and the platform for running customer apps are isolated from one another. If a customer’s app goes down, the dashboard should remain up so they can monitor it and report problems to Engine Yard. If the dashboard goes down, the app should continue humming away just fine.
Engine Yard also lets customers mirror their application to another availability zone, ready to bring up if the usual zone goes down. This redundancy means that even if Amazon has another major event like last April’s meltdown, customers should be able to get back up in a short time in a different region.
Multi-zone deployments are a good idea, but they also cost more. Though it’s impossible to say exactly what redundancy costs Engine Yard’s customers across the board, Platt says that the premium can range from less than 25% to 50% “to make sure they have true fault-tolerant, real-time” replication. This is an optional feature for Engine Yard, and customers can pick and choose the level of redundancy they want and how quickly they want new instances to start up if one fails.
That’s on Engine Yard’s platform, but Platt says that if you’re building directly on top of AWS, the premium would also be in the 25-50% range for AWS services, plus the cost of engineers who can set up that kind of redundancy.
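As a rough worked example of what that range means for a budget, the sketch below applies the 25% and 50% premiums to a monthly bill. The $1,000 base figure is invented for illustration, and the calculation leaves out the engineering time Platt mentions.

```python
def redundancy_cost_range(base_monthly_cost, low=0.25, high=0.50):
    """Return (low, high) monthly totals after adding a redundancy premium.

    The default premiums are the 25% and 50% figures quoted by Platt.
    """
    return base_monthly_cost * (1 + low), base_monthly_cost * (1 + high)


base = 1000.0  # hypothetical single-zone monthly bill, in dollars
low_total, high_total = redundancy_cost_range(base)

print(f"single zone: ${base:,.2f}")
print(f"with redundancy: ${low_total:,.2f} to ${high_total:,.2f}")
```

On those assumptions, a $1,000 monthly bill becomes $1,250 to $1,500. Platt notes that some customers land below the 25% mark.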
Defect Removal
Platt also says that it’s important to have a defect removal policy to ensure the best service levels. This means doing a post-mortem on any outage and finding the root cause. Once the root cause is identified, the organization has to take quick action to ensure that the problem doesn’t recur. In this case, the problem was with a router on Amazon’s side, which will be addressed. Sometimes the problem is in Engine Yard’s platform, and sometimes it’s in the customer’s application.
In any case, organizations that want to avoid outages need to be vigilant about identifying points of failure and addressing them immediately.
All of this planning and execution, of course, costs time and money and requires expertise that many organizations don’t have, which is why, says Gaydos, companies turn to Engine Yard in the first place. But if you’re not a candidate for Engine Yard’s services, its best practices can at least help you avoid or mitigate downtime.