The recent three-day outage of Research In Motion’s BlackBerry email service sent a chill around the world. And I’m not just talking about the affected customers. The chill was also felt by practically every IT network service professional watching the headlines in mid-October. They know that if this can happen to a company with as many resources as RIM, it can happen in their department too.
As we close out 2011, we can reflect on (and learn from) the numerous high-profile outages that occurred: Bank of America in March; Amazon EC2, Verizon LTE, Yahoo! Mail and Microsoft in April; and then Apple and Microsoft in August. In analyzing these disasters, I’ve come up with four lessons – they’ll help protect your company’s reputation, technical integrity and customer satisfaction during technical crises.
Kevin Conklin is an executive at Prelert, which reduces the cost, frequency and duration of business critical application disruptions by as much as 90% by adding a layer of self-learning predictive IT analytics software to traditional monitoring solutions such as Microsoft SCOM and Wily Introscope. Prelert customers gain instant, often predictive identification and root cause analysis of problems while eliminating much of the need to define and maintain thresholds, rules, management templates and dashboards.
Lesson #1: Your company’s brand is on the line
IT systems are not just internal systems anymore. Most companies learned their first painful lessons with the advent of web sites and ecommerce. But today, it seems that every company has growing exposure to service outages that leave many customers unhappy. That said, it’s critical that IT and line-of-business executives become more closely aligned.
We must also realize that systems have a tendency to crash at inopportune times. Look at the Verizon LTE network outage in April. The company’s fastest network, the LTE network, was unavailable to customers, and LTE devices could not be activated. The crash happened just 24 hours before the latest 4G-LTE smartphone, the Samsung Droid Charge, was scheduled to launch. The outage delayed the launch by two weeks and no doubt had a significant impact on the device’s sales and the company’s reputation.
Lesson #2: Be proactive
Given the potential losses from network service outages, one might think that IT execs are totally focused on preventing major outages. But in my experience, they’re not. The key obstacle to preventing outages is the infrastructure and application monitoring systems in use today. Many were architected when a company’s IT environment could still be visualized on a couple of PowerPoint slides. Their designs were based on the idea that IT experts would define the performance thresholds, rules and exceptions necessary to identify unacceptable behavior. But today, the typical enterprise application infrastructure is so complex that it defies an IT organization’s ability to fully understand it. The result: unforeseen outages that often take days to resolve.
These monitoring systems are still great for generating the data required to understand a system’s behavior – just ask the operations center that receives tens of thousands of alerts a day. But the real challenge lies in making sense of the alerts and taking the right action to resolve the inevitable issues quickly.
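To make the contrast concrete, here is a minimal sketch of the difference between a hand-defined threshold and a baseline learned from recent data. It is an illustration only, not a description of any vendor’s product: the function names, the latency figures, the five-sample window and the three-sigma rule are all assumptions chosen for the example.

```python
from statistics import mean, stdev


def static_alerts(samples, threshold):
    """Return indices of samples above a fixed, hand-picked threshold."""
    return [i for i, value in enumerate(samples) if value > threshold]


def baseline_alerts(samples, window=5, sigmas=3.0, min_spread=1.0):
    """Return indices of samples that deviate from the mean of the
    preceding `window` samples by more than `sigmas` standard deviations."""
    alerts = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        center = mean(history)
        # Floor the spread so a perfectly flat history does not
        # flag every tiny change as an anomaly.
        spread = max(stdev(history), min_spread)
        if abs(samples[i] - center) > sigmas * spread:
            alerts.append(i)
    return alerts


# Hypothetical response times in milliseconds, with one spike at index 6.
latency_ms = [100, 102, 98, 101, 99, 100, 180, 101, 100, 99]

# A threshold set at 200 ms, sensible for some other system, misses the spike.
print(static_alerts(latency_ms, threshold=200))

# The rolling baseline flags the spike without anyone choosing a number.
print(baseline_alerts(latency_ms))
```

The fixed threshold works only as long as someone keeps it tuned for every metric. The rolling baseline adapts on its own, at the cost of needing some history before it can say anything.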
Lesson #3: When crisis strikes, communicate early and often
Face it – we live in a 24/7 world and your customers know when there is a problem. It’s best not to ignore it and hope they don’t notice.
When the Microsoft cloud crashed in September, the company kept customers in the loop, promising updates at precise times. The official Windows Live Status site read, “We’re aware of a problem with Hotmail that’s affecting some people. We’re investigating and will provide an update by Sept. 9 11:30 p.m…”
RIM failed twice over – responding several hours into the crisis and providing few details to ease customer angst. “We understand the frustrations our customers are experiencing through the delays with the messaging and browsing…I’d like to take this opportunity to apologize unreservedly to all those people affected by this situation. We’re taking this situation extremely seriously and we’re doing everything we can to restore normal operation to our service,” said CTO David Yach.
Although you can’t promise answers, providing scheduled updates will go a long way with customers. And if you don’t acknowledge a problem, you can bet your customers will be tweeting about it.
Lesson #4: Make amends
RIM eventually offered users $100 in premium applications and, in some cases, free technical support for a month. While the cost of supporting these offers is probably high, it is likely worth it. It’s also important to remember that if you don’t compensate customers for their inconvenience, your competitors will. The day that Yahoo! Mail crashed in April 2011, Microsoft wasted no time offering annoyed customers something to make them feel better. The official Hotmail account tweeted, “First 1k #ymail users to [email protected] and send feedback today get HM+ free for 1yr. SwitchToHotmail.com.”
Don’t let your competitors be your customer’s solution to your outage crisis.
With major companies like RIM, Amazon, Microsoft, Yahoo!, Bank of America and many others all experiencing major network outages this past year, it’s time to realize that the question isn’t whether your IT department will someday face a network crisis, but when. Be prepared, and have a plan for how your company will react and respond from both a technical and a public relations perspective to minimize the aftermath.
As we look forward to 2012, we wish you smooth-running networks and fast resolutions to the challenges coming your way.
Photo from Nasa.gov