The Ins and Outs of Cloud-Based Disaster Recovery (DR)

by Ofir Nachmani
February 19, 2015

Outages are inevitable. As we’ve seen over the past few years, every major cloud vendor has experienced at least one, and we can expect more in the future. As cloud consumers, we need to use the cloud’s building blocks and (at least in theory) unlimited resources to build robust, highly available services. Yet important issues, such as SLAs, remain unclear when it comes to consuming resources and services from IaaS vendors.

Today, more than ever, online software service vendors have a lot to lose when their services suffer from performance degradation. They could lose significant revenue as a result of actual outages, and user loyalty can diminish as well. In this article, I will share baseline perceptions and methods of cloud-based DR.

The DR Site

Cloud DR integrators, or solution providers, tend to work with three configurations. Each has advantages and disadvantages, especially when it comes to the balance between an environment’s resilience and the costs involved. The following configurations are ordered from the least to the most costly and robust.

1 – Light Standby (Pilot Light)

The so-called “Pilot Light” DR model allows you to keep no live resources at all, or only a very small number of them (e.g. your database server). Your system stays ready to launch based on a set of configuration and initialization scripts. As long as your data is stored and backed up, you can start your whole environment on a secondary site and scale it out. The scripts automatically build your service’s underlying infrastructure stack until it can replace the main production site. With this method, you decrease DR costs by holding minimal active resources and relying on the on-demand nature of the cloud.

2 – Active x Active Standby

This standby method (the “warm” setup) is similar to the Pilot Light model, but is closer to an active-active structure. It provides a baseline that can absorb the first burst of traffic that is redirected from the production environment when an outage occurs. This requires active web application servers, so that the secondary DR site can quickly handle production traffic. With this standby method, you load balance traffic from the site that is down to the active standby environment, then gradually scale out the secondary site.

3 – Multi Active x Active

The third method involves a hot DR site that includes a real-time replica of what you have in production. If an outage occurs, all traffic is rerouted to this secondary site. As a result, this is the most expensive method.

You need to understand your business requirements (SLA) as well as the potential losses if downtime occurs. If downtime would not affect critical or financial services, you can resort to the light or warm setup. For example, if you only need to keep your development or test environment productive, I assume that the light setup will do.

The appropriate configuration model for your company should be based on business SLA KPIs that are derived from metrics such as RPO (Recovery Point Objective) and RTO (Recovery Time Objective). In any case, your DR site should live separately from production, located under a different cloud account and in a separate region. It is imperative that no dependencies exist between the DR and production environments. In addition, while production environments are restricted to specific providers, DR sites need to be even more secure.
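To make the link between RTO/RPO targets and the three configurations concrete, here is a minimal Python sketch. The threshold values and the function name `choose_dr_configuration` are purely illustrative assumptions, not recommendations from any vendor; your own numbers should come out of your business SLA.

```python
from datetime import timedelta

# Hypothetical thresholds, chosen only for illustration.
# Replace them with values derived from your own business SLA.
HOT_RTO = timedelta(minutes=5)
HOT_RPO = timedelta(minutes=1)
WARM_RTO = timedelta(hours=1)
WARM_RPO = timedelta(minutes=15)


def choose_dr_configuration(rto: timedelta, rpo: timedelta) -> str:
    """Return the cheapest of the three configurations that fits both targets.

    rto: how long the service may stay down (Recovery Time Objective).
    rpo: how much recent data may be lost (Recovery Point Objective).
    """
    # Either target being very tight pushes us to the most robust setup.
    if rto <= HOT_RTO or rpo <= HOT_RPO:
        return "multi active-active"
    if rto <= WARM_RTO or rpo <= WARM_RPO:
        return "active standby"
    return "pilot light"


if __name__ == "__main__":
    # A dev/test environment that tolerates a day of downtime and data loss.
    print(choose_dr_configuration(timedelta(days=1), timedelta(days=1)))
```

The point of the sketch is only that the tighter of the two objectives drives the choice: a service that tolerates an hour of downtime but almost no data loss still ends up on the most expensive configuration.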

Automated Recovery Drill-Downs

To make sure that you’re ready for an event, recovery drill-downs are a must. Traditionally, drill-downs were events that system administrators needed to prepare for. Now, they can be automated to occur much more frequently than in the past. Unlike traditional datacenters, public and private clouds enable you to efficiently exercise your system’s robustness and recovery capabilities via automation.

Automated DR drill-downs in the cloud are supported by scripts that can automatically build your secondary site from scratch and report on the success or failure of a test. Drill-downs should also include simulations of service demand, so that you measure and test the end user experience. At the end of the test, according to the DR site’s configuration, the system should ensure that all standby resources are shut down.
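The drill flow described above (build, simulate demand, report, shut down) can be sketched in Python as follows. This is a skeleton under stated assumptions: `run_drill` and its three callables are hypothetical names, and the real build, load test and teardown steps would call your own provisioning scripts or your cloud provider's tooling. It also assumes a setup in which everything started for the drill is shut down afterwards; a warm or hot site would keep its baseline resources running.

```python
import time
from typing import Callable, Dict


def run_drill(
    build_site: Callable[[], None],
    simulate_load: Callable[[], bool],
    teardown_site: Callable[[], None],
) -> Dict[str, object]:
    """Run one DR drill and return a small success/failure report."""
    report: Dict[str, object] = {
        "built": False,
        "load_test_passed": False,
        "error": None,
    }
    started = time.monotonic()
    try:
        build_site()  # build the secondary site from scratch
        report["built"] = True
        # Build time is a rough figure to compare against your RTO.
        report["build_seconds"] = time.monotonic() - started
        report["load_test_passed"] = simulate_load()
    except Exception as exc:
        report["error"] = repr(exc)
    finally:
        # Shut down drill resources even when the drill itself failed.
        try:
            teardown_site()
            report["torn_down"] = True
        except Exception as exc:
            report["torn_down"] = False
            report["teardown_error"] = repr(exc)
    report["success"] = bool(
        report["built"] and report["load_test_passed"] and report["torn_down"]
    )
    return report


if __name__ == "__main__":
    # Stub steps stand in for real provisioning and load-testing scripts.
    print(run_drill(lambda: None, lambda: True, lambda: None)["success"])
```

Running teardown in a `finally` block is the design choice that matters here: a failed drill should not leave billable standby resources behind.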

It’s important to document recovery drills while they are running and save the output logs in order to analyze their success. Even if your environment replication and recovery succeed, it is important to verify that data is recovered to a state in which it can actually serve your users. Thanks to Uri Wolloch of N2W for this helpful information.
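As a rough illustration of verifying recovered data and keeping the drill output, the Python sketch below compares the newest write found on the recovered site with the moment of the (simulated) outage, checks the gap against the RPO, and saves the result as a JSON log. The function names, the log file naming scheme and the idea of using the last write timestamp as the freshness signal are all assumptions made for this example; a real check would also exercise the data the way your users do.

```python
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path


def check_recovered_data(
    last_recovered_write: datetime, outage_time: datetime, rpo: timedelta
) -> dict:
    """Compare the data actually lost in the drill with the RPO target."""
    data_loss = outage_time - last_recovered_write
    return {
        "data_loss_seconds": data_loss.total_seconds(),
        "rpo_seconds": rpo.total_seconds(),
        "within_rpo": data_loss <= rpo,
    }


def save_drill_log(result: dict, directory: Path) -> Path:
    """Write the drill result to a timestamped JSON file for later analysis."""
    directory.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = directory / f"dr-drill-{stamp}.json"
    path.write_text(json.dumps(result, indent=2))
    return path


if __name__ == "__main__":
    outage = datetime(2015, 2, 19, 12, 0, tzinfo=timezone.utc)
    # The newest record on the recovered site is 10 minutes older than the outage.
    result = check_recovered_data(
        outage - timedelta(minutes=10), outage, timedelta(minutes=15)
    )
    print(save_drill_log(result, Path("drill-logs")))
```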

Final Note

A blurry line exists between DR and cloud migration. Replicating on-premises environments to the public cloud has its benefits, including DR and workload migration. In a hybrid cloud scenario, moving resources between the public and private cloud can serve IT with backup and recovery requirements as well as with the option to “burst out” to the public cloud.

The cloud has changed how IT needs to perceive, plan and build the availability of an online service. We are not there yet, but the cloud can definitely support the notion of SLA-based computing. This means that cloud (especially IaaS) vendors will be able to provide, value and price resources and services based on their SLAs. In addition, although the responsibility for the end user experience is still in the hands of software developers (i.e. IaaS consumers), I believe that IaaS vendors ultimately strive to take DR out of their customers’ hands. Cloud vendors should make more of an effort to build appropriate SLAs for their customers, and provide more features that make the baseline compute, storage and network resources more robust and highly available.

This post is brought to you by VMware vCloud Air Network Services.