Lightning may not strike twice, but it only had to strike once to take out cloud services from Microsoft and Amazon. A lightning strike in Dublin knocked out Amazon and Microsoft services on August 7th, and the outage has exposed an unrelated issue with Amazon’s Elastic Block Store (EBS) software. Today, most services are back up, but some Amazon services (EC2 Ireland and RDS Ireland) are still affected, and Amazon is telling customers to restart services in another zone of its infrastructure to get back online faster.
According to the most recent update from Amazon, posted at 3:11 PDT on August 8th, Amazon is not only being hit by fallout from the power outage but also dealing with “an error in the EBS software that cleans up unused snapshots.” Typically, a loss of power won’t affect a data center too adversely, but as Amazon wrote in its initial summary of the event, the lightning strike caused problems with the system that synchronizes to the backup generators. Even though power was restored relatively quickly, a lot of servers went down hard, and that’s never a good thing.
In fact, it’s pretty bad. According to one of Amazon’s updates, the EBS instances that went down require “manual operations” to restore volumes. That means Amazon has to make copies of the data, which in turn has required it to install additional capacity to support the process. At 11:04 PDT on August 7th, the company was predicting that the process would take 24-48 hours to complete. If EBS problems sound familiar, that’s because they were the root of the previous AWS outage this spring.
Customers with Multi-AZ RDS database instances should be able to turn on services in another zone, but it looks like a small number of customers with Single-AZ instances are still without service more than 24 hours later. As mentioned previously, Microsoft Business Productivity Online Suite (BPOS) customers were also affected, but service was restored within a few hours. Note that Microsoft doesn’t seem to make its Service Health Dashboard information publicly available.
Amazon is gracious enough to make its Service Health Dashboard available to anybody, but I had no luck getting someone from Amazon Web Services to comment for the story.
Given this event and the longer outage in April, it seems fair to say that deployments relying on AWS services tied to EBS (at a minimum) need to be multi-homed if they’re mission critical. Though AWS services are designed to be fairly robust, there’s simply no compensating for a natural event that takes a data center offline and disrupts EBS volumes. One way or another, customers have to worry about redundancy even in the cloud.