Top 5 Cloud Outages of the Past Two Years: Lessons Learned

A look at cloud computing outages over the past two years shows that most are limited in scope, affecting only a subset of a provider’s network. Very few cloud outages have caused massive data losses.

But after reviewing most of these outages, it’s clear that cloud service providers are still adjusting how they perform upgrades and how they estimate network load during maintenance. We saw these issues arise more than once.

The findings come from research conducted by Mark Williams, a cloud computing consultant based in the United Kingdom who is writing a book about cloud computing. The outage durations cited here are based on Williams’s research. He is actively looking for more instances of outages and is asking people to report examples of problems.

Williams found 23 reports of cloud computing failure. Google had 12 outages. Amazon had five. He reported that Microsoft had four outages. Salesforce.com had two.

Here are the top results from Williams’s list:

Microsoft Sidekick: March 13, 2009

Outage: 6 days

The massive outage left Sidekick customers without access to their calendar, address book, and other key aspects of their service.

Outages are one thing, but data loss is an entirely different issue. Microsoft restored most of the data to Sidekick customers.

According to a Microsoft spokesperson, a system failure caused data loss in both the core database and its backup. Microsoft has since put a backup process in place for the database to help prevent data losses in the future.

Lesson Learned: Check with your cloud services provider or hosted service about its disaster recovery policy. And as always, back up your data.

Google Gmail: October 16, 2008

Outage: 30 hours

It’s still unclear how many people were affected by this outage. Google says very few had problems. The issue did affect Google Apps customers. There were some desperate pleas in the Google Apps discussion group. We see this a lot. It’s that sudden realization that the CEO can’t access e-mail. Suddenly, cloud-based services are now thought of as a process for how data is created, distributed and stored. How can this problem be averted?

Lesson Learned: Google Apps does offer a Premier edition for $50. With that service, users get 24×7 phone and email support. Still, you can avoid the issue entirely by keeping a backup for your email.

As one person stated in the discussion forum:

“For businesses/customers wanting immediate access to support via
telephone or via correspondence as well as ensuring that they are
compensated for when things do go wrong, they should look towards the
Premier edition rather than the Standard edition.
Since I’ve been using Google Apps as far back as 2006, I have not
encountered any major downtime. I keep back-up email accounts in case
things do wrong (and things DO go wrong on the odd occasion even with
internal mail systems managed in-house no matter how good the planning
and expense). I have never had to go without email for any
considerable length of time. If it’s REALLY urgent, I usually email
and then telephone the recipient to ensure that all is well.
A bit of planning and organisation goes a long way with modern digital
communications.”

Google Gmail, Google Apps: August 15, 2008

Outage: 24 hours

Those affected by the outage received a 502 Server error when trying to log in to Gmail and Google Apps. The outage followed a similar disruption the week before. Google attributed the initial outage to issues with its contacts system that prevented Gmail from loading properly.

Microsoft Azure: March 13, 2009

Outage: 22 hours

The outage occurred last March before the service came out of beta. The outage left people without access to their applications.

Lesson Learned: Microsoft recommended that application owners deploy their application with multiple instances of each role.

Conclusion

Most of the outages affected email, which can easily be backed up. Generally, the outages have been fairly minor. But they should not be dismissed lightly, as each one created its own crisis for the people affected. Overall, we need more data to see how outages in the cloud compare with on-premises failures.