Storm Warning: Why 100% Cloud Uptime Is Impossible

Guest author Mike Pav is engineering vice president of Spanning Cloud Apps, a provider of data protection solutions for the cloud.

When Amazon Web Services crashed on Christmas Eve (which brought down Netflix among other high-profile sites), Amazon offered this explanation: its elastic load balancers failed. Load balancers, as the name implies, distribute the network's workload. Among their most important functions is protecting the system's components from becoming overburdened and shutting down. 
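For readers who want a concrete picture of what "distributing the workload" means, here is a minimal sketch of one common balancing strategy, least connections, in which each new request goes to the server currently handling the fewest. This is purely illustrative and is not a description of how Amazon's elastic load balancers work; the class name, method names and server names are all hypothetical.

```python
class LeastConnectionsBalancer:
    """Toy load balancer: route each request to the least-busy server."""

    def __init__(self, servers):
        # Track how many requests each (hypothetical) server is handling.
        self.active = {name: 0 for name in servers}

    def acquire(self):
        # Pick the server with the fewest active requests; ties go to the
        # server listed first.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        # Call when a request finishes so the server can take on more work.
        self.active[server] -= 1


balancer = LeastConnectionsBalancer(["app-1", "app-2", "app-3"])
assigned = [balancer.acquire() for _ in range(6)]
print(assigned)
print(balancer.active)
```

Because no server is handed a new request while a less-busy one is available, load stays even across the pool, which is the protective function described above. When the balancer itself fails, that protection disappears.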

After Amazon's outage, the Web became a virtual fount of suggestions for avoiding more such glitches. Some said Amazon's cloud customers should write their own load balancers. Others said service providers like Netflix should deploy multiple data centers as insurance against another PaaS failure.

(See also: Why Netflix's Christmas Eve Crash Was Its Own Fault)

Failure Is An Option

A month later, it seems clear to me: Cloud outages, while rare, will continue to be a fact of life.

Here's why: Perfection is simply too expensive. Achieving uptime of more than 99.99% requires an investment of money, machines and human resources that - given the rarity of failures - just isn't worth it. The extra cost would inevitably be passed along to customers, all but negating the cloud's cost advantage.
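To put the 99.99% figure in concrete terms, the back-of-the-envelope sketch below converts an uptime percentage into the downtime it permits per year. The function name and the 365-day year are assumptions made for illustration; real service-level agreements often measure uptime over a month and define "downtime" in their own terms.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(uptime_percent: float) -> float:
    """Minutes of downtime per year permitted by a given uptime percentage."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)


for pct in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(pct)
    print(f"{pct}% uptime -> {minutes:.1f} minutes of downtime per year")
```

By this arithmetic, 99.99% uptime leaves room for a little under an hour of downtime a year, and each additional "nine" cuts that allowance by a factor of 10.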

Instead, customers should expect PaaS providers to give them a well-reasoned plan for handling any disruptions.

PaaS providers should meet the following expectations:

  • They should be the first to know when an outage has occurred.
  • They should be able to estimate when service will be restored.
  • They should know and be willing to report who was impacted, and whether data was irrevocably lost.
  • After an outage has been reported and until service is restored, PaaS providers should supply customers with regular status updates.
  • Once service has been restored, they should offer a detailed post-mortem as well as a plan for avoiding future interruptions.

Here's where it gets tricky: PaaS providers are understandably reluctant to offer gory details for fear that they will lose current or prospective customers. If the PaaS company in question is publicly traded, those fears will be compounded by the worry that its stock price will tumble.

The real reason to sign onto a PaaS has nothing to do with whether it claims to offer 100% uptime. You choose a PaaS provider because it offers scalability and elasticity, and the same efficiency and user experience regardless of the level of system usage. Applications can be built and delivered on a PaaS an order of magnitude faster than on non-cloud-based systems.

Using a PaaS not only reduces a customer's total cost of ownership - PaaS providers operate on a pay-per-use model - it also allows the customer to delegate tedious and time-consuming IT chores like system monitoring and maintenance. With that stuff out of the way, PaaS customers can focus their resources on truly adding value for their constituencies.

Even after the well-publicized outages, so many high-profile companies - including Netflix - still use Amazon as their PaaS provider because it does a great job of providing ready-to-use features. AWS isn't 100% reliable, but it can be used with very little up-front investment and scaled as needed. And that is an enormous improvement over the information technology systems of the past.
