System Down! An Application Outage Survival Guide

A major system outage is every business’s worst nightmare, especially if you rely on applications to generate revenue. Depending on the scale of the outage, this can be a very expensive problem: a large eCommerce site can lose millions in sales for every hour of downtime. Furthermore, frequent public outages can damage credibility, causing customers to search for more reliable alternatives.

Unfortunately, outages are inevitable. But an outage doesn’t have to be the end of the world. With a little planning and communication you can significantly mitigate the effects of a major outage on your brand and your business.

Before The Outage

It’s impossible to foresee every scenario that may lead to a disaster for your application, but that doesn’t mean you can’t prepare. Companies that plan for failure respond faster when a problem arises and can often reduce their chances of facing a major outage by preventing smaller problems from snowballing.

  • Invest in IT. If IT is underfunded, chances are they’ll be forced to cut corners and delay important software or hardware upgrades. In other words, they’re incurring technical debt—leaving important stuff to be done later when the time or resources are available. This may make it easier for performance problems to arise and go unnoticed, especially if the tooling for monitoring the application is inadequate or missing altogether.
  • Break down the silos. When developers and operations work together and are involved in every stage of the application lifecycle, it’s much easier for them to troubleshoot problems. ExactTarget, for example, set up monitoring screens displaying dashboards from every department across the organization to help give everyone visibility into the entire application, making it easier to identify and troubleshoot problems.
  • Plan for failure. Netflix knows that the best way to prepare for failure is to experience it. That’s why they constantly simulate failure with what they call their Simian Army, a set of tools that randomly induces certain failure conditions by killing off nodes/availability zones, creating artificial latency and more.
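The failure-injection idea in the last point can be tried on a small scale. The sketch below is not Netflix's Simian Army or any of its tools; it is a minimal illustration, assuming a hypothetical wrapper named with_chaos that makes a call fail or slow down at random, and a hypothetical fetch_price function standing in for a real downstream service. The point is to check that the calling code degrades gracefully when a dependency disappears.

```python
import random
import time


class InjectedFailure(Exception):
    """Raised in place of a real dependency error during a failure drill."""


def with_chaos(func, failure_rate=0.1, max_delay=0.0, rng=None, sleep=time.sleep):
    """Wrap func so that calls sometimes fail or run slowly.

    Hypothetical helper for illustration only: failure_rate is the
    probability of an injected failure, max_delay the upper bound (in
    seconds) of the artificial latency added to each successful call.
    """
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        # Simulate a killed node: fail before the real call is attempted.
        if rng.random() < failure_rate:
            raise InjectedFailure(f"injected failure in {func.__name__}")
        # Simulate artificial latency before the real call.
        if max_delay > 0:
            sleep(rng.uniform(0, max_delay))
        return func(*args, **kwargs)

    return wrapper


def fetch_price(item_id):
    # Stand-in for a call to a real downstream service (assumed name).
    return {"item": item_id, "price": 9.99}


def price_or_fallback(fetch, item_id):
    # The code under test: fall back to a partial answer instead of crashing.
    try:
        return fetch(item_id)
    except InjectedFailure:
        return {"item": item_id, "price": None}


if __name__ == "__main__":
    flaky_fetch = with_chaos(fetch_price, failure_rate=0.3, max_delay=0.05)
    for _ in range(5):
        print(price_or_fallback(flaky_fetch, "sku-1"))
```

Running a drill like this in a test environment first, with a low failure rate, is one way to find out how the application behaves before a real outage forces the question.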

During The Outage

An outage can be a very stressful situation for everyone involved, but it’s important that everyone keeps their wits about them—now is your opportunity to prove to your customers and the world that you can handle disaster calmly and gracefully.

  • Communicate early and often. You may be tempted to pretend nothing’s wrong so as to not draw more attention to yourself, but with websites like downforeveryoneorjustme.com, it’s easy for anyone to call your bluff. Be the one to tell your users what’s happening; don’t make them resort to Twitter to get news and vent their frustrations (or crack jokes).
  • Argue about data, not feelings. Edmunds.com has a “data-driven” DevOps culture that allows them to respond quickly to problems, often without even calling a war room session: a couple of people get together, look at their monitoring tool, and find the problem.
  • Two pairs of eyes are better than one. Application data should not be locked away with a single person or team; the more people who have access and visibility, the better. Care.com unites their tooling between Dev, Ops and QA so that everyone’s looking at the same information. This way it’s much easier to find problems before they bring down the application.

After The Outage

Once you’ve resolved the problem and your app is back up and running, your instinct will be to return to business as usual and pretend nothing happened. However, this is the best time to talk about what went wrong and re-establish credibility with your customer base. All you have to do is be completely honest.

  • Have a blameless postmortem. If people feel like they’ll be punished for speaking up about their own mistakes, they probably won’t, leaving you in the dark and often causing the same problems to happen again and again. Etsy is well known for its blameless postmortems, where it encourages employees to explain their actions and rationales without fear of retribution.
  • Be transparent. When you’re done with the postmortem, publish the results on your blog. This makes you appear more trustworthy as a brand: your end users know that even if an outage happens again, you will be honest about what went wrong. A great example of this is Skype’s postmortem blog post for a 2010 outage.
  • Apologize. Issuing an apology and taking the blame can help you get back your credibility and show your customers that you care. MailChimp did a great job with this. Be careful with your tone, however: if you come across as flippant or accusatory you might end up angering your customers even more, like Dreamhost did after a billing issue.

Outages are a fact of life for any business that depends on web applications. But they don’t have to be a disaster—if you plan well and are transparent about what’s going on you can mitigate the effects of failure and maintain credibility in your users’ eyes.

Image courtesy of Shutterstock.
