Businesses are continuing to change how they think about disaster recovery and system maintenance, and the change is being driven, perhaps for the first time since the advent of the Internet itself, by government agencies. In researching how huge, country-wide data networks should plan for the contingency of terrorist attack or natural disaster, agencies such as the U.S. Department of Homeland Security and its E.U. equivalent, ENISA, have begun adopting an emerging concept given an old name: resilience.
It’s the key lesson being taught by cloud technologies such as OpenStack, and governments are learning it: Because failures happen, systems should expect them and overcome them – in advance, if possible – rather than wait for them and react. This lesson is having an impact on key data center technologies, most notably virtualization. Now, VMware is offering advice to its customers that bears the traditional “disaster recovery” moniker, but which is in the midst of altering its tone.
VMware’s revised set of 10 Disaster Recovery Tips, unveiled earlier this month, now places greater emphasis on steps businesses can take to avoid loss and minimize impact when disasters happen. Because disasters will happen.
1. Run a full business impact analysis. As Gaetan Castelein, VMware’s senior product marketing manager for infrastructure, tells RWW, businesses only think they’re running such an analysis, but they usually stop at the beginning.
“When you’re implementing a disaster recovery solution, think of how many people often do it on an application-by-application basis, or for a subset of virtual machines,” says Castelein. “Really, few people actually do a full business impact analysis. What you really need to understand from a business standpoint is, what will be your recovery time objective and your recovery point objective for each business service that is running on your IT infrastructure? So if your data center goes down, how much time will you have to recover this application? How much data are you willing to lose? This is a business decision, and those numbers are going to change on an app-to-app basis.”
2. Identify your application dependency mapping, which refers to all the components and services each application needs in order to run, such as databases and directory services. “Once you’ve figured out that, for example, your HR application can’t be down for two hours, because a customer-facing application must be continuously available, then you’ve got to figure out, what are all the dependencies between those applications? Discuss those facing apps’ needs (for example, access to these two databases, LDAP from a security standpoint) if you can identify all the individual components that are necessary for those services. And based on that, you’ve got to come up with, ‘Here’s my disaster recovery requirement for each individual application.’”
3. Evaluate your potential recovery site, based on its distance from the possible point of disaster and how much bandwidth is available between those two points.
4. Evaluate the differences between your disaster recovery requirements and your disaster avoidance requirements. There’s a chance that the tools you implement to avoid disaster could also mitigate the impact of a disaster event, reducing the outlays necessary for recovery.
5. Evaluate the different classes of disaster recovery solutions. Here, VMware is recognizing it doesn’t have a player in every camp. “There’s many different types of solutions,” says Castelein. “There’s storage-based solutions with storage-based applications; there’s application-level clustering. Obviously, we do a lot in the disaster recovery space with Site Recovery Manager and vSphere. So look at those different solutions that are available out there and see which one is best for your needs.”
6. Design your business continuity solution. The broader concept of resilience is incorporated into continuity tools that help a well-distributed network, for example, maintain full database uptime after a disaster event through strategic replication.
7. Create a solid recovery plan. This is the point in the process where you learn how much of the recovery process can be implemented in software, and how much requires manual intervention. (And how much has relied in the past on divine intervention, for that matter.)
8. Test your recovery plan. Do these drills at least twice a year, VMware recommends.
9. Automate your processes as much as possible. As VMware’s Castelein advises, “Disaster recovery on the one hand is certainly not inexpensive, but it’s a very big investment. And unfortunately without automation, we see all too often that human elements come into play, so the recovery times are much longer than they need to be. If you rely on manual processes, they need to be documented. In many cases, they fall out of sync with what’s actually running in your data center. So getting automation and having a lot of recovery processes executed through software, instead of manual processes, is a big plus.”
10. Use a checklist. VMware’s idea of a “solid” recovery plan is one that can be encapsulated into a NASA-style, well-defined checklist. Conceivably, certain elements on this checklist could be implemented as drills, effectively recovering – and thereby refreshing – some systems which haven’t even been impacted by disaster.
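Tips 1 and 2 fit together in a way that a small worked example can make concrete. The Python sketch below is not VMware's method or anything from Site Recovery Manager; it is a minimal illustration under stated assumptions. The service names, the RTO/RPO figures, and the helper functions (`recovery_order`, `effective_objectives`) are all hypothetical. It records a recovery time objective and recovery point objective for each business service, maps what each one depends on, and then lets every shared component (a database, LDAP) inherit the tightest objectives of anything that relies on it. That inheritance rule is a simplification: in practice a dependency usually has to be back sooner than the service that needs it.

```python
import math
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, List, Tuple


@dataclass
class Service:
    """One business service or supporting component (names are hypothetical)."""
    name: str
    rto_minutes: float = math.inf  # how long it may stay down; inf = not set by the business
    rpo_minutes: float = math.inf  # how much data loss is tolerable; inf = not set
    depends_on: List[str] = field(default_factory=list)


def recovery_order(services: Dict[str, Service]) -> List[str]:
    """Return an order in which dependencies come before the services that need them."""
    graph = {s.name: s.depends_on for s in services.values()}
    return list(TopologicalSorter(graph).static_order())


def effective_objectives(services: Dict[str, Service]) -> Dict[str, Tuple[float, float]]:
    """Give each component the tightest (RTO, RPO) of itself and everything relying on it."""
    eff = {name: (s.rto_minutes, s.rpo_minutes) for name, s in services.items()}
    # Walk dependents first, so a service's own objectives are final
    # before they are pushed down onto its dependencies.
    for name in reversed(recovery_order(services)):
        rto, rpo = eff[name]
        for dep in services[name].depends_on:
            dep_rto, dep_rpo = eff[dep]
            eff[dep] = (min(dep_rto, rto), min(dep_rpo, rpo))
    return eff


# Illustrative figures only; real numbers come out of the business impact analysis.
SERVICES = {
    s.name: s
    for s in [
        Service("storefront", rto_minutes=5, rpo_minutes=1,
                depends_on=["orders_db", "ldap"]),
        Service("hr_portal", rto_minutes=1440, rpo_minutes=240,
                depends_on=["hr_db", "ldap"]),
        Service("orders_db"),
        Service("hr_db"),
        Service("ldap"),
    ]
}

order = recovery_order(SERVICES)
eff = effective_objectives(SERVICES)

for name in order:
    rto, rpo = eff[name]
    print(f"{name}: RTO {rto} min, RPO {rpo} min")
```

The point of the exercise is the one Castelein makes: the shared LDAP service ends up with the storefront's requirements, not the HR portal's, because the tightest business requirement that touches a component sets the bar for it. The same ordered list is also a natural starting point for the checklist in tip 10.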