Hey, Online Services: Why Can't You Keep Up With Demand?

Theoretically, online services shouldn't ever get so mobbed by customers that they can't deliver a game or service, because it should be ridiculously easy to bring on additional capacity to meet demand. And yet here in the real world, exactly these sorts of failures seem to crop up with dismaying regularity.

So it's time for these services to fess up: Why can't Johnny scale?

Scale Fail

The biggest such disaster recently was Electronic Arts's blinkered SimCity launch earlier this month. Because of EA's decision to saddle users with an onerous digital-rights management system through its Origin service, players had to be online at all times and connected to EA's servers, even in single-player mode. The demand was too much, and those SimCity users who could actually connect experienced widespread problems with gameplay.

Or take the less-publicized problems that ComiXology was having early last week. At SXSW, Marvel Comics announced March 10 that they would be releasing over 700 #1 Marvel issues through the digital comic service. Six hours later, ComiXology's servers crashed in flames.

ComiXology CEO David Steinberger sent out a quick e-mail the next day acknowledging the problem, which affected not only the Marvel promotion, but also regular comic sales from other publishers like DC and Image.

"We had believed ourselves prepared — but unfortunately we became overwhelmed by the immense response. We're still struggling to keep our systems up," Steinberger wrote. "The result is that you aren't getting your comics when and where you want."

Given that I am a ComiXology user and it was less than 48 hours until Wednesday — New Comic Day — you can bet I was focused on this news. To their credit, by the time I tested the app Tuesday afternoon, everything seemed to be working fine. But the Marvel promotion has been postponed for now.

Finally, I wanted to check out the mail-sorting service known as Mailstrom, a free beta service that lets you view your email account through different lenses that enable you to clear out your inbox faster. A colleague had recommended it, and the thought of achieving Inbox Zero at long last called to me like a distant dream.

I signed up for the service, and was informed I was 4819th in line. I live in a world of big numbers, so I wasn't worried. Computers, I've heard, are fast.

But two days later, when I checked my status in the queue again, I had risen only 36 places. By my projections, that meant my account would be activated around Dec. 4. I sent a query via Twitter, was chided a bit about projections and told to drop a line to Mailstrom's service e-mail account. I did so, and was told to swing back in a week if I hadn't been activated.

Huh.

To be fair, complaining about a service that is both free and in beta is a little like kicking a puppy. Also for the record, Mailstrom did get me activated on March 14, six days after sign-up.

But the small hiccup I had with Mailstrom just underscored a systemic problem in the online world. Given that businesses are living and operating online these days, why are so many companies failing to meet instances of high demand?

Inefficiencies of Scale

One common perception is that when you have problems of scale, the best thing to do is throw more resources at them until they go away.

This tends to be the default plan for many people. If the shark is really, really big, then sure - "we're going to need a bigger boat."

Sometimes this works. But sometimes it doesn't. In the case of the shark, perhaps a more efficient and less risky plan might have been to stay out of the damn thing's food chain altogether. Alas, there are times when we have to deal with the shark anyway. Is more always better?

Consider the more reality-based problem of the airport gate. At most gates, there are two (sometimes three) terminals behind a desk, which is usually more than enough to handle a planeload of people, since much of the check-in and boarding work is done before the passengers even get to the gate.

But introduce a problem to the situation, such as a delay, and things can bottle up very quickly. Passengers line up to find out what's happening. Or switch to another flight because they're going to miss a connection. Or just generally complain. Whatever the reason, the line can soon become long, slow and aggravating.

If you try the simple solution to this problem (add more resources, this time in the form of gate agents), it simply can't work. Remember, there are only two terminals capable of delivering information or solutions at the gate. Other gates nearby may not be busy, but they have their own passengers to care for, or they're not crewed.
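To put rough numbers on the gate analogy, here is a minimal Python sketch. The function name, the three minutes per passenger and the 120-passenger line are all invented for illustration, not measurements from any airport. The only point it makes is that capacity is capped by whichever resource is scarcer.

```python
import math


def minutes_to_clear(passengers, agents, terminals, minutes_per_passenger=3):
    """Rough time to serve a line when every agent needs a terminal to work.

    All figures are hypothetical; this is a toy model, not a real queueing study.
    """
    # An agent without a terminal cannot help anyone, so the number of
    # passengers served at once is limited by the scarcer of the two.
    active_desks = min(agents, terminals)
    if active_desks == 0:
        raise ValueError("need at least one agent and one terminal")
    # Passengers are served in rounds of `active_desks` at a time.
    rounds = math.ceil(passengers / active_desks)
    return rounds * minutes_per_passenger


# Adding agents while the terminal count stays at two changes nothing.
for agents in (2, 4, 8):
    print(agents, "agents:", minutes_to_clear(120, agents, terminals=2), "minutes")
```

In this toy model, only adding terminals along with the agents shortens the wait, which is the gate's version of fixing the architecture instead of just buying more hardware.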

This is, for all intents and purposes, a pretty fair analogy for what's not working well in today's online environments.

Architecture Matters

Applications were built, for the most part, without true scaling in mind. And why would they have been? If you're in a corporate environment, why develop a program that has to handle a sudden influx of new users or clients hitting your servers? Unless HR suddenly hires 1,000 people overnight, scaling is a slow and gradual process.

Internet commerce changed that. Vendors like Amazon had to invent whole new technologies for data management and storage just to keep up with holiday shopping. Yes, Virginia, there is a Santa Claus… and he was the necessity behind the invention of cloud computing.

But in this age of cloud computing, applications are still not optimized for the cloud. Nor, for that matter, are networks and infrastructure. It does little good to add more servers to a cluster that's bending under the weight of traffic if your app doesn't know to automatically shift to the new resources or if the network between servers vaporizes trying to handle the load.
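The "app doesn't know to shift to the new resources" failure can be sketched in a few lines of Python. Everything here is hypothetical (the class names, the server names, the round-robin policy), and a real load balancer does far more. It only contrasts an app that copies its server list once at startup with one that consults the live pool on every request.

```python
import itertools


class StaticRouter:
    """Copies the server list once at startup; later additions are invisible."""

    def __init__(self, servers):
        # list(...) takes a snapshot, so the router never sees new servers.
        self._cycle = itertools.cycle(list(servers))

    def route(self):
        return next(self._cycle)


class DynamicRouter:
    """Consults the live pool on every request, so new servers get traffic."""

    def __init__(self, pool):
        self._pool = pool  # shared reference to the live pool, not a copy
        self._count = 0

    def route(self):
        server = self._pool[self._count % len(self._pool)]
        self._count += 1
        return server


pool = ["app-1", "app-2"]  # hypothetical server names
static, dynamic = StaticRouter(pool), DynamicRouter(pool)

pool.append("app-3")  # operations adds capacity during a traffic spike

static_hits = {static.route() for _ in range(30)}
dynamic_hits = {dynamic.route() for _ in range(30)}
print("static router used: ", sorted(static_hits))
print("dynamic router used:", sorted(dynamic_hits))
```

The static router keeps hammering the two original servers while the new one sits idle, which is the software equivalent of hiring gate agents who have no terminal to work at.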

"Fundamentally, cloud-optimized architecture is one that favors smaller and loosely coupled components in a highly distributed systems environment, more than the traditional monolithic, accomplish-more-within-the-same-memory-or-process-or-transaction-space application approach," Microsoft Architect David Chou wrote in 2011. His words still hold true today.

Given the many approaches one can take to shore up a web service to prevent a meltdown, why is this still such a problem?

  • Lack of knowledge. We know about the cloud, but we don't know the cloud. Relatively few people do, outside of the people who invented the technology. That's changing, but it still remains a huge barrier to proper cloud use.
  • Non-lateral thinking. Remember the airport problem? Passengers with smartphone apps can avoid the line to find out info or re-book themselves. Or just call the airline. But they don't, because they feel they have to face someone right there to get their problem solved. They are not thinking laterally. (Nor are the airlines. Perhaps they could use apps that can actually run on commodity hardware and deploy crisis teams equipped with tablets or laptops to gates with problems.)
  • Fear of... whatever. Fear of migration. Fear of losing data. Or money. Fear of looking stupid. Pick one. Or all. Because fear of shifting to any new technology is always an issue, and always will be. Solving the "lack of knowledge" problem will help alleviate this issue, but it cannot be ignored.

These are some of the biggest obstacles to getting apps and services out there that won't crash anytime there's more than a stiff breeze blowing. There are others, and they all must be solved. Cloud vendors are in the best position to create effective solutions, but customers can take matters into their own hands and start thinking of lateral solutions to their problems.

There's an iconic capitalist image of the overrun shop owner fending off a crowd of customers all clamoring for a hot product. That scene is manageable, at least after a fashion, because there's still some business-to-customer contact; the shopkeeper can always try to yell over the crowd, after all. Online mob scenes, though, are worse, because they sometimes mean there's no B2C communication at all. It's as if people came to the store and found the doors chained shut.

The solutions are there. It's time to embrace them, before your customers walk away.
