VMware's Gaetan Castelein: Transitioning to Disaster Avoidance

VMware’s disaster recovery guidance for customers is evolving in a direction that focuses less on the disaster itself and more on business continuity. That is changing the very economics of disaster recovery (DR) software, according to VMware infrastructure product manager Gaetan Castelein.

In an interview with ReadWriteWeb, Castelein said that DR used to be a common process mainly for big enterprises. But as businesses everywhere are learning that disaster avoidance processes cut down on costs, the subsequent cost of implementing DR comes down as well – and that brings more businesses into the mix.

Thinking ahead about thinking ahead

“Disaster recovery is absolutely becoming much more ubiquitous,” Castelein tells RWW. “It used to be that organizations would apply disaster recovery just to a small percentage of their applications, and mostly deployed by bigger organizations because of the costs involved. A lot of organizations just relied on backup, and backup to tape – and the problem when you do that is, your recovery times are really long. You can lose a day’s worth of data, and it can take you a week to recover your application in the event of a disaster.”

In other words, larger enterprises use larger data (of course). But tape, which is still in use in financial institutions everywhere, is tape. When you’re measuring backup times with something bigger than an egg timer, the size of the job typically forces administrators to do the job less often. As a result, a business impact analysis would reveal that the impact of a disaster would be greater, on account of the fact that the amount of data lost would be proportional to the increased time between backups.
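The proportionality is easy to put in numbers. The sketch below is a back-of-the-envelope illustration only: the change rate is a hypothetical figure, not one supplied by Castelein or VMware, and the function name is ours.

```python
# Illustrative only: the figures below are assumptions, not numbers from VMware.

def worst_case_data_loss_gb(backup_interval_hours: float,
                            change_rate_gb_per_hour: float) -> float:
    """Everything written since the last backup is what a disaster can destroy."""
    return backup_interval_hours * change_rate_gb_per_hour


CHANGE_RATE_GB_PER_HOUR = 5.0  # hypothetical rate of new or changed data

nightly_tape = worst_case_data_loss_gb(24, CHANGE_RATE_GB_PER_HOUR)
hourly_copy = worst_case_data_loss_gb(1, CHANGE_RATE_GB_PER_HOUR)

print(f"Nightly tape backup: up to {nightly_tape:.0f} GB at risk")
print(f"Hourly copy:         up to {hourly_copy:.0f} GB at risk")
```

Whatever the real change rate is, stretching the interval between backups from one hour to 24 multiplies the exposure by the same factor.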

Redesigning data centers for closer proximity between storage points, and using storage-based applications to manage those points, Castelein says, creates situations where a power outage might result in only an hour’s worth of immediate data loss, which might only take that long to recover.

“When you talk about the need to have more resilience, what we’re seeing is, people now want to go beyond just disaster recovery to get also into the field of disaster avoidance,” he explains. The instigator of this change has been the growing number of occurrences when companies have seen a hurricane coming in on the radar, and rolled the dice. They may even have had DR plans in place, but without having conducted an impact analysis beforehand, they may not have been prepared for which stage of the plan to implement. That leads some customers to conclude that, with a little more skill in observing weather events, they could have spent less time arguing with themselves over whether the hurricane would hit, and more time implementing disaster avoidance contingencies that would have zero impact on the business if the hurricane missed.

“With disaster avoidance, simply, we [see] organizations that are now looking for a solution that enables them to move applications between data centers with no downtime and no data loss,” the product manager tells us, reminding us that the vMotion live migration capability is one of the growing features of VMware’s vSphere suite.

“There’s a really strong analogy to be made between what’s been happening within the data centers and what’s going to happen across data centers in the future,” explains Castelein. “If you look at ten years ago, before virtualization, people deployed applications in hardware silos. Each application had its own system, and in many cases, its own storage within the data center. Once virtualization came along, people started pooling those hardware resources and moving applications between machines, for all sorts of reasons – planned maintenance, load balancing, failure of the system. By now, you do have that extensive mobility within the data center.”

The distance problem

As bandwidth increases between data centers, Castelein believes businesses will start doing live migrations as a matter of principle, including for load balancing, replication, and disaster avoidance. “[T]hat extensive mobility between data centers is going to become a possibility. We’re not quite there today, though; we can’t do what you suggested, that move on Monday morning of two data centers, because now with the downtime, you can’t do it on Monday morning.”

Distance continues to play a role in the equation of mobility between data centers, because even the smallest degree of latency multiplied by a few terabytes becomes an unmanageable amount of time. And the bigger your enterprise is, the broader the roadblock becomes. That runs counter to the usual growth formula: smaller businesses can take advantage of cloud conveniences that disappear for bigger IT shops.
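A rough calculation shows why size works against the bigger shops. This sketch assumes a simple bulk copy and looks only at bandwidth, ignoring the per-request latency that makes matters worse. The link speed, the 70 percent efficiency factor and the data sizes are all hypothetical, chosen for illustration rather than taken from VMware.

```python
# Illustrative only: link speed, efficiency and data sizes are assumptions.

def transfer_hours(data_terabytes: float, link_gbps: float,
                   efficiency: float = 0.7) -> float:
    """Hours needed to push data over a link running at a usable fraction of its rated speed."""
    bits_to_move = data_terabytes * 1e12 * 8            # decimal terabytes -> bits
    usable_bits_per_second = link_gbps * 1e9 * efficiency
    return bits_to_move / usable_bits_per_second / 3600


for terabytes in (0.5, 5, 50):
    hours = transfer_hours(terabytes, link_gbps=1)
    print(f"{terabytes:>5} TB over a 1 Gbps link: about {hours:.1f} hours")
```

The time grows in lockstep with the data, so an operation that fits inside a maintenance window for a small business can stretch into days for a large one on the same link.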

Castelein concedes that this problem is not solvable for big businesses right this moment. “We’re going to get there; it’s something that we’re focused on as an organization at VMware, solving and enabling that use case. There are some specific solutions that allow it in specific circumstances, but it’s not ubiquitous yet.”

For example, one of vSphere’s key technologies for high availability failover is synchronous replication. Here is where VMware runs smack-dab into the laws of physics: “Once you have synchronous replication, data is available on both sites, but it has distance limitations,” says Castelein. “You can only do it for about 50 miles. If you go out for longer distances, you can’t use synchronous replication.”
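The physics can be sketched in a few lines. With synchronous replication, a write is not acknowledged until the remote site confirms it, so every write pays at least one round trip between sites. The estimate below assumes light travels through optical fiber at roughly 200 kilometers per millisecond (about two-thirds of its speed in a vacuum) and that the fiber runs in a straight line. It counts propagation delay only, and it does not explain where any particular product draws its limit. The 50-mile figure is Castelein’s; the other distances are hypothetical.

```python
# Illustrative only: propagation delay over an idealized straight fiber run.

FIBER_KM_PER_MS = 200.0   # approximate speed of light in optical fiber
KM_PER_MILE = 1.609344


def round_trip_ms(distance_miles: float) -> float:
    """Minimum round-trip propagation delay between two sites, in milliseconds."""
    distance_km = distance_miles * KM_PER_MILE
    return 2 * distance_km / FIBER_KM_PER_MS


for miles in (50, 500, 2500):
    delay = round_trip_ms(miles)
    print(f"{miles:>5} miles: at least {delay:.2f} ms added to every synchronous write")
```

Real links add switching, protocol and storage overhead on top of this floor, which no amount of extra bandwidth removes.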

He’s not shrugging his shoulders or sloughing off the problem to Intel; he asserts that software can solve the problems of distance and bandwidth. But only once VMware accomplishes this, he says, can it begin to tackle the problem of live migration over very long distances with zero downtime. While Castelein remains hopeful that these issues will be solved (in our lifetimes), he cautions that the organizational bottlenecks keeping businesses from implementing the policies that make high availability systems easier to administer may only become addressable once those solutions actually exist. Right now, for some, it’s all conjecture.

“I don’t think hardware is the issue,” he tells us. “I think it’s more a question of developing these solutions from a software standpoint. We already have really solid applications, we have synchronous replication available already today. As an organization, what VMware needs to really focus on is, how do we enable that movement of storage? We need to cultivate that, make sure that the VM on the other side gets integrated… that we’re able to move the memory state of the VM. It’s more of a software issue at this point. It’s not that the hardware needs to evolve, it’s that we need more powerful systems, we need to have that software solution on top… It’s not a problem that VMware alone can solve, but I don’t see a blocking factor there. I don’t see something that’s going to make it impossible to solve.”