To provide high-availability servers, hardware clustering has been a mainstay. But there's another approach IT needs to consider: fault-tolerant hardware - systems with redundant components, so that a single failure needn't bring the system down. Think of it as a poor man's clustering, especially when you add virtualization to handle the fail-overs.

Fault-tolerant computers are over a third of a century old. One of the market survivors is Stratus Technologies, founded in 1980, currently occupying the last building owned by Digital Equipment Corporation before DEC was sold to Compaq (which was subsequently acquired by HP).

Daniel Dern is an independent technology and business writer, who has written one Internet user guide and thousands of features, reviews and other articles for various technology and business publications, and was the founding Editor-in-Chief at Internet World magazine and editor of Byte.com. His blog can be found at Trying Technology and he can be reached at dern@pair.com.
For mission-critical enterprise applications, downtime can be expensive, to say the least. According to Roy Sanford, CMO at Stratus Technologies, leading industry analysts peg an hour of downtime at $100,000 to $150,000.

"Clustering attempts to preserve application uptime by moving or replicating data across multiple devices," Sanford notes. "From a hardware point of view, fault-tolerance means dual components with nanosecond failover. A fault-tolerant system isn't a cluster; from the point of view of the operating system, it's a single image."

The marriage of clustering and virtualization

Clustering and virtualization do have merits, Sanford points out. "Clustering lets the additional nodes be used for load-balancing, load-sharing and even Massively Parallel Processing. Virtualization enables consolidation. But for high availability, they have limits. For example, if you move a virtual machine from a host with 64GB of RAM to one with only 8GB, there's a good chance the machine will stall or crash."

Within the past year or so, Stratus has expanded its fault-tolerant system capabilities to also support the leading hypervisor environments from VMware, Microsoft and Xen. But, stresses Sanford, even fault-tolerant hardware and operating environments can only do so much on their own. IT also needs to be watching these systems - not just responding to events, but proactively monitoring and managing them, and performing proactive remediation. To facilitate this, Stratus provides remote monitoring and management software.

"It's not just the technology, it's also somebody watching the system who can preemptively adjust the systems," stresses Sanford.

62 seconds of downtime

Thanks to this combination of fault-tolerant hardware and remote monitoring/management, Stratus says its systems are delivering "five nines and an eight" (99.9998 percent) availability - which works out to roughly a minute of unscheduled outages per year, well inside the five-minute budget of conventional "five nines." (There's an up-time meter on Stratus' web site you can check.) In fact, for calendar year 2010, the 8,000 Stratus servers in forty countries averaged only 62 seconds of unscheduled outages.
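The availability figures above are just arithmetic, and it's easy to sanity-check them. Here's a minimal sketch (the function name is illustrative, and a 365-day year is assumed) that converts an availability percentage into expected yearly downtime:

```python
# Convert an availability percentage into expected unscheduled
# downtime per year. Assumes a 365-day year.
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000

def downtime_seconds_per_year(availability_pct: float) -> float:
    """Expected yearly downtime, in seconds, for a given availability %."""
    return SECONDS_PER_YEAR * (1 - availability_pct / 100)

for pct in (99.999, 99.9998):
    print(f"{pct}% availability -> {downtime_seconds_per_year(pct):.0f} s/year")
```

Running this shows that conventional five nines (99.999 percent) allows about 315 seconds - a bit over five minutes - per year, while 99.9998 percent works out to about 63 seconds, consistent with the 62-second figure Stratus reports.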

"If you save an hour a year of unscheduled downtime, these systems pay for themselves," says Sanford.
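Sanford's payback claim follows directly from the analyst figures quoted earlier. A back-of-the-envelope sketch (function and variable names are illustrative):

```python
# Back-of-the-envelope payback check, using the analyst estimates
# quoted earlier: $100,000 to $150,000 per hour of downtime.
COST_PER_HOUR = (100_000, 150_000)  # (low, high) analyst estimates

def avoided_cost(hours_saved_per_year: float) -> tuple:
    """Yearly downtime cost avoided, as a (low, high) range in dollars."""
    return tuple(c * hours_saved_per_year for c in COST_PER_HOUR)

low, high = avoided_cost(1)  # Sanford's one-hour-a-year example
print(f"Avoided cost: ${low:,.0f} to ${high:,.0f} per year")
```

One avoided hour of downtime per year is worth $100,000 to $150,000 by these estimates - in the same ballpark as (or above) the price of the hardware, which is the substance of the payback claim.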

And it's not just the classic mission-critical applications that need serious high availability. "When you consolidate dozens of virtual machines to a single physical machine, that system becomes mission-critical," says Sanford.

Fault-tolerance by itself can't solve all business continuity concerns, since the redundant hardware needs to be physically conjoined, says Sanford. A catastrophic event like a flood, earthquake, or tornado could take out the entire fault-tolerant system.

Next steps

What's the next step? Stratus' Avance software provides high availability through fail-over: it performs dual data writes and hands off to a VM on a remote server, which can be anywhere from the next floor to as much as three miles away.

"For uptime assurance, you need technology that's purpose-built for that, not one that's designed for consolidation or compute-sharing," says Sanford. "And since technologies still fail, without proactive monitoring, you limit the resiliency of your environment."

Because preventing availability problems can be less expensive - and less stressful - than restoring availability.

But you should still be preparing for disaster recovery, of course.