Internet Outage Last Weekend Was Preventable

If ever there was a time that seemed to demonstrate the vulnerability of cloud computing, last weekend was certainly it. Mother Nature’s capricious whims put the smackdown on an Amazon Web Services (AWS) data center in Ashburn, Virginia, Friday night, bringing down hundreds of websites and quite a few popular online services.

But human judgement and a reliance on what some call Amazon’s “sell it cheap” service performance may also have played a role in the storm’s effects on Netflix, Pinterest and Heroku.

The story seems simple on the surface: A massive line of thunderstorms plowed through the U.S. Midwest and Eastern seaboard regions on June 29. They reached Virginia with powerful straight-line winds - a phenomenon known as a derecho - that killed over 20 people and left millions without power. A huge disaster any way you look at it.

But should the storm have wiped out a big swath of the Internet as well?

Failure to Failover

It is not immediately clear why the application services for these sites were not configured for a failover mode when the data center went down, particularly since the same data center had a power outage just 15 days earlier on June 14 - a power outage that affected many of the same sites.

Some Web service providers learned their lesson from the earlier outage: HootSuite was bumped offline when the Ashburn data center went down that day, but it managed to avoid being taken down again on June 29.

HootSuite’s resilience was likely due to the company’s strategy of working with Amazon as a cloud provider. A HootSuite spokesperson told ReadWriteWeb today that the company has multiple backups across different availability zones and data centers.

But not every Web-based company has such a forward-thinking policy, laments Jason Currill, CEO of Ospero, a London-based global hosting and infrastructure company that provides supplementary and backup cloud services.

Web Companies Need Backup Cloud Providers

“This is what happens when we put all our eggs in one basket,” Currill said. “And in the cloud, when one thing happens, it all comes down like a house of cards.”

Mixed metaphors aside, Currill is frustrated by how little attention many large-scale Web services pay to their cloud strategies. Too often, these Web service providers choose not to build any redundancy into their systems.

“They always cite cost” as a reason why, Currill explained, which he finds exasperating.

“Look at Instagram. They’re nine guys in a room that just got bought by Facebook for $1 billion,” Currill said. “You really think those guys don’t have a spare nickel lying around for a DR [disaster recovery] plan?”

Currill is also scornful of Netflix, another site caught in the Ashburn outages last month. The company often cites the need to keep customer costs low, but Currill believes that if Netflix had a true competitor, it would lose a lot of customers when this kind of thing happened. Churn, he added, would change Netflix’s mind about the importance of proper disaster recovery and failover planning.

Competition for Amazon Web Services

Ospero both works with and competes against Amazon, and not surprisingly, Currill is glad that Amazon is getting more competition in providing computing services in the cloud. “With Amazon services, no one’s exactly surprised that they keep going down like this,” Currill said. “Their model has always been stack it high and sell it cheap.”

That could change as Google (see "Google Compute Engine a Direct Challenge to Amazon Web Services") and Microsoft step up their own Infrastructure as a Service (IaaS) offerings. “I’m expecting quality from Google and even Microsoft down the road,” he said.

Customers using any cloud services, though, should take a very realistic look at their own disaster policies. Whether it’s a derecho or a bad line of code, servers will go down, and Web-based businesses had better be prepared.

Image courtesy of Shutterstock.