Cloud Disaster Recovery Options for VMware Virtual Machines

Introduction

In my day job I am often asked about the use of cloud for disaster recovery. Some organisations only operate out of a single data centre, or building, while others have a dual-site setup but want to explore a third option in the cloud. Either way, using cloud resources for disaster recovery can be a good way to learn and validate different technologies, potentially with a view to further migrations as data centre and hardware contracts expire.

This post takes a look at the different cloud-based disaster recovery options available for VMware workloads. It is not an exhaustive list but provides some ideas. Further work will be needed to build a resilient network architecture depending on the event you are looking to protect against. For example, do you have available network links if your primary data centre is down? Can your users and applications still route to private networks in the cloud? Are your services internet facing, allowing you to make your cloud site the ingress and egress point? As with any cloud resources, in particular if you are building your own services, a shared security model applies which should be fully understood before deployment. Protecting VMware workloads should only form part of your disaster recovery strategy; other dependencies, both technical and process, will also play a part. For more information on considering the bigger picture see Disaster Recovery Strategy – The Backup Bible Review.

Concepts

  • DRaaS (Disaster Recovery as a Service) – A managed service that will typically involve some kind of data replication to a site where the infrastructure is entirely managed by the service provider. The disaster recovery process is built using automated workflows and runbooks, such as scaling out capacity and bringing virtual machines online. An example DRaaS is VMware Cloud Disaster Recovery, which we’ll look at in more detail later on.
  • SaaS (Software as a Service) – An overlay software solution may be able to manage the protection of data and failover, but may not include the underlying infrastructure components as a whole package. Typically the provider manages the hosting, deployment, and lifecycle management of the software, but either the customer or another service provider is responsible for the management and infrastructure of the protected and recovery sites.
  • IaaS and PaaS (Infrastructure as a Service and Platform as a Service) – Various options exist around building disaster recovery solutions based on infrastructure or platforms consumed from a service provider. This approach will generally require more effort from administrators to set up and manage but may offer greater control. An example is installing VMware Site Recovery Manager (self-managed) to protect virtual machines running on VMware-based IaaS. Alternatively, third-party backup solutions could be used with cloud storage repositories and cloud-hosted recovery targets.
  • Hybrid Cloud – The VMware Software Defined Data Centre (SDDC) can run on-premises and overlay cloud providers and hyperscalers, delivering a consistent operating platform. Disaster recovery is one of the common use cases for a hybrid cloud model, as shown in the whiteboard below. Each of the solutions covered in this post is focused around a hybrid cloud deployment of VMware software in an on-premises data centre and in the cloud.
Hybrid Cloud Use Cases

VMware Cloud Disaster Recovery

VMware Cloud Disaster Recovery (VCDR) replicates virtual machines from on-premises to cloud based scale-out file storage, which can be mounted to on-demand compute instances when required. This simplifies failover to the cloud and lowers the cost of disaster recovery. VCDR allows for live mounting of a chosen restore point for fast recovery from ransomware. Recently ransomware has overtaken events like power outages, natural disasters, human error, and hardware failure as the number one cause of disaster recovery events.

VCDR uses encrypted AWS S3 storage with AWS Key Management Service (KMS) as a replication target, protecting virtual machines on-premises running on VMware vSphere. There is no requirement to run the full SDDC / VMware Cloud Foundation (VCF), vSAN, or NSX at the replication source site. If and when required, the scale-out file system is mounted to compute nodes using VMware Cloud (VMC) on AWS, without the need to refactor or change any of the virtual machine file formats. VCDR also includes built-in audit reporting, continuous healthchecks at 30 minute intervals, and test failover capabilities.

VMware Cloud on AWS provides the VMware SDDC as a managed service running on dedicated AWS bare-metal hardware. VMware manage the full infrastructure stack and lifecycle management of the SDDC. The customer sets security and access configuration, including data location. Currently VCDR is only available using VMware Cloud on AWS as the target for cloud compute, with the following deployment options:

  • On Demand – virtual machines are replicated to the scale-out file storage. When disaster recovery is invoked an automated SDDC deployment is initiated. When the SDDC is ready the file system is mounted to the SDDC and virtual machines are powered on. Typically this means a Recovery Time Objective (RTO) of around 4 hours. For services that can tolerate a longer RTO, the benefit of this deployment model is that the customer only pays for the storage used in the scale-out storage, and then pays for the compute on-demand should it ever be needed.
  • Pilot Light – a small VMware Cloud on AWS environment exists, typically 3 hosts. Virtual machines are replicated to the scale-out file storage. When disaster recovery is invoked the file system is instantly mounted to the existing SDDC and virtual machines are powered on. Depending on the number of virtual machines being brought online, the SDDC automatically scales out the number of physical nodes. This brings the RTO down to as little as a few minutes. The customer pays for the minimum VMware Cloud on AWS capacity to be already available, but this can be scaled out on-demand, offering significant cost savings over having an entire secondary infrastructure stack.
VMware Cloud Disaster Recovery
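
As a rough illustration of the trade-off between the two deployment options, the sketch below suggests a model from a required RTO. The 4 hour figure comes from the On Demand description above; the 15 minute figure is only a hypothetical stand-in for "a few minutes", and the function name is my own. Real RTOs should be validated with test failovers.

```python
# Approximate RTO figures; these are assumptions for illustration only.
ON_DEMAND_RTO_HOURS = 4.0     # "around 4 hours" while the SDDC is deployed
PILOT_LIGHT_RTO_HOURS = 0.25  # hypothetical stand-in for "a few minutes"


def suggest_deployment(required_rto_hours: float) -> str:
    """Suggest a VCDR deployment option for a required RTO, in hours."""
    if required_rto_hours >= ON_DEMAND_RTO_HOURS:
        # The service can wait for an SDDC to be deployed on demand.
        return "On Demand"
    if required_rto_hours >= PILOT_LIGHT_RTO_HOURS:
        # An SDDC must already be running to mount the file system quickly.
        return "Pilot Light"
    return "Review other options"
```

For example, `suggest_deployment(8)` points at On Demand, while `suggest_deployment(1)` points at Pilot Light.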

The cloud-based orchestrator behind the service is provided as SaaS, with a connector appliance deployed on-premises to manage and encrypt replication traffic. After breaking replication and mounting the scale-out file system, administrators manage virtual machines using the consistent experience of vSphere and vCenter. Startup priorities can be set to ensure critical virtual machines are started up first. At this point virtual machines are still running in the scale-out file system, and will begin to storage vMotion over to the vSAN datastore provided by the VMware Cloud on AWS compute nodes. The storage vMotion time can vary depending on the amount of data and number of nodes (more nodes, and therefore more physical NICs, provide more network bandwidth); however, the vSAN cache capabilities can help alleviate any performance hit during this time. When the on-premises site is available again replication reverses, only sending changed blocks, ready for failback.

You can try out VCDR using the VMware Cloud Disaster Recovery Hands-On Lab. Additional information can be found at the VMware Cloud Disaster Recovery Solution and VMware Cloud Disaster Recovery Documentation pages.

VMware Site Recovery Manager

VMware Site Recovery Manager (SRM) has been VMware’s main disaster recovery solution for a number of years. SRM enables policy-driven automation of virtual machine failover between sites. Traditionally SRM has been used to protect vSphere workloads in a primary data centre using a secondary data centre also running a VMware vSphere infrastructure. One of the benefits of the hybrid cloud model, utilising VMware software in a cloud provider like AWS, Azure, Google Cloud, or Oracle Cloud, is the consistent experience of the SDDC stack, allowing continuity of solutions like SRM.

SRM in this scenario can be used with an on-premises data centre as the protected site, and a VMware stack using VMware Cloud on AWS, Azure VMware Solution (AVS), Google Cloud VMware Engine (GCVE), or Oracle Cloud VMware Solution (OCVS) as the recovery site. SRM can also be used to protect virtual machines within one of the VMware cloud-based offerings, for example failover between regions, or even between cloud providers. With any of these options Site Recovery Manager can be deployed and managed by the customer. VMware Cloud on AWS also offers a SaaS version of Site Recovery Manager, called VMware Site Recovery, which is covered later in this post.

SRM does require the recovery site to be up and running but can still prove value for money. Using the hybrid cloud model, infrastructure in the cloud can be scaled out on-demand to fulfil failover capacity, reducing the amount of standby hardware required. The difference here is that vSphere Replication is replicating virtual machines to the SDDC vSAN datastore, whereas VCDR replicates to a separate scale-out file system. The minimum number of nodes may be driven by storage requirements depending on the amount of data being protected. The recovery site could also be configured active/active, or run test and dev workloads that can be shut down to reclaim compute capacity. Again, storage overhead is a consideration when deploying this type of model. Each solution will have its place depending on the use case.

SRM allows for centralised recovery plans of VMs and groups of VMs, with features like priority groups, dependencies, shut down and start up customisations, including IP address changes using VMware Tools, and non-disruptive recovery testing. If you’ve used SRM before, the concept is the same whether the recovery site is a VMware cloud-based offering or a normal data centre: an SRM appliance is deployed and registered with vCenter to collect objects like datastores, networks, resource pools, etc. required for failover. If you haven’t used SRM before you can try it for free using either the VMware Site Recovery Manager Evaluation, or VMware Site Recovery Hands-on Lab. Additional information can be found at the VMware Site Recovery Manager Solution and VMware Site Recovery Manager Documentation pages.
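
To make the idea of priority groups and dependencies more concrete, here is a minimal sketch of how a start-up order could be derived. It is not how SRM is implemented; the data structure, VM names, and function name are all hypothetical. It assumes lower priority numbers start first and that dependencies are only honoured between VMs in the same priority group.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def startup_order(vms: dict) -> list:
    """Return VM names in start-up order.

    vms maps a VM name to {"priority": int, "depends_on": [names]}.
    """
    order = []
    # Work through priority groups from lowest number to highest.
    for priority in sorted({vm["priority"] for vm in vms.values()}):
        group = {name for name, vm in vms.items() if vm["priority"] == priority}
        # Map each VM to the VMs it must wait for within the same group.
        graph = {
            name: {dep for dep in vms[name]["depends_on"] if dep in group}
            for name in sorted(group)
        }
        order.extend(TopologicalSorter(graph).static_order())
    return order


# Hypothetical recovery plan: infrastructure first, then the application tier.
example_plan = {
    "dc01": {"priority": 1, "depends_on": []},
    "db01": {"priority": 1, "depends_on": ["dc01"]},
    "app01": {"priority": 2, "depends_on": []},
    "web01": {"priority": 2, "depends_on": ["app01"]},
}
```

Calling `startup_order(example_plan)` places the domain controller and database ahead of the application and web servers, which is the kind of ordering a recovery plan is intended to enforce.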

VMware Site Recovery

VMware Site Recovery is the same product as Site Recovery Manager, described above, but in SaaS form. VMware Site Recovery is a VMware Cloud based add-on for VMware Cloud on AWS. The service can link to Site Recovery Manager on-premises to enable failover to a VMware Cloud on AWS SDDC, or it can provide protection and failover between SDDC environments in different VMware Cloud on AWS regions. At the time of writing VMware Site Recovery is not available with any other cloud providers. As a SaaS solution VMware Site Recovery is naturally easy to enable; it just needs activating in the VMware Cloud portal. You can find out more from the VMware Site Recovery Solution page.

Closing Notes

For more information on the solutions listed see the VMware Disaster Recovery Solutions page, and check in with your VMware account team to understand the local service provider options relevant to you. There are other solutions available from VMware partners and backup providers. Your existing backup solution, for example, may offer a DRaaS add-on, or the capability to back up or replicate to cloud storage which can be used to build out your own disaster recovery solution in the cloud.

The table below shows a high level comparison of the differences between the VMware Cloud Disaster Recovery and Site Recovery Manager offerings. As you can see there is a trade-off between cost and speed of recovery. There are use cases for each solution, and in some cases maybe both side by side. Hopefully in future these products will fully integrate to allow DRaaS management from a single interface or source of truth where multiple Recovery Point Objective (RPO) and RTO requirements exist.

Solution | Service Type | Replication | Failover | RPO | Pricing
VMware Cloud Disaster Recovery | On demand DRaaS | Cloud based file system | Live mount when capacity is available | ~4 hours | Per VM, per TiB of storage, list price is public here. VMC on AWS capacity may be needed*
VMware Site Recovery | Hot DRaaS | Directly to failover capacity | Fast RTOs using pre-provisioned failover capacity | As low as 5 minutes with vSAN at the protected site, or 15 minutes without vSAN | Per VM, list price is public here. vSphere Replication is also needed**
VMware Site Recovery Manager | Self-managed | Directly to failover capacity | Fast RTOs using pre-provisioned failover capacity | As low as 5 minutes with vSAN at the protected site, or 15 minutes without vSAN | Per VM, in packs of 25 VMs. vSphere Replication is also needed**
VMware Cloud Disaster Recovery (VCDR) and Site Recovery Manager (SRM) side-by-side comparison

*VMware Cloud on AWS capacity is needed depending on the deployment model, detailed above. For pilot light a minimum of 3 nodes are running all the time; these can be discounted using 1 or 3 year reserved instances. For on-demand, if failover is required then the VMC capacity is provisioned using on-demand pricing. List price for both can be found here, but VMware also have a specialist team that will work out the sizing for you.

**vSphere Replication is not sold separately but is included in the following versions of vSphere: Essentials Plus, Standard, Enterprise, Enterprise Plus, and Desktop.
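
Since Site Recovery Manager is licensed per VM in packs of 25, the number of packs for a given estate is a simple round-up. The sketch below shows the arithmetic only; the function name is my own, and it makes no assumptions about pricing.

```python
import math

PACK_SIZE = 25  # SRM per-VM licences are sold in packs of 25 VMs


def srm_packs_needed(protected_vms: int) -> int:
    """Return the number of 25-VM packs needed to cover protected_vms."""
    if protected_vms < 0:
        raise ValueError("protected_vms cannot be negative")
    # Round up, because a partially used pack still has to be purchased.
    return math.ceil(protected_vms / PACK_SIZE)
```

So 26 protected virtual machines would need two packs, leaving 24 licences spare for growth.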

VMware Site Recovery Manager 8.x Upgrade Guide

This post will walk through an in-place upgrade of VMware Site Recovery Manager (SRM) to version 8.1, which introduces support for the vSphere HTML5 client and recovery / migration to VMware Cloud on AWS. Read more about what’s new in this blog post. The upgrade is relatively simple but we need to cross-check compatibility and perform validation tests after running the upgrade installer.

Planning

  • The Site Recovery Manager upgrade retains configuration and information such as recovery plans and history but does not preserve any advanced settings
  • Protection groups and recovery plans also need to be in a valid state to be retained, any invalid configurations are not migrated
  • Check the upgrade path here, for Site Recovery Manager 8.1 we can upgrade from 6.1.2 and later
  • If vSphere Replication is in use then upgrade vSphere Replication first, following the steps outlined here
  • Site Recovery Manager 8.1 is compatible with vSphere 6.0 U3 onwards, and VMware Tools 10.1 and onwards, see the compatibility matrices page here for full details
  • Ensure the vCenter and Platform Services Controller are running and available
  • In Site Recovery Manager 8.1 the version number is decoupled from vSphere, however check that you do not need to perform an upgrade for compatibility
  • For other VMware products check the product interoperability site here
  • If you are unsure of the upgrade order for VMware components see the Order of Upgrading vSphere and Site Recovery Manager Components page here
  • Make a note of any advanced settings you may have configured under Sites > Site > Manage > Advanced Settings
  • Confirm you have Platform Services Controller details, the administrator@vsphere.local password, and the database details and password
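
The upgrade path rule in the list above (6.1.2 and later can upgrade to 8.1) can be expressed as a quick version comparison. This is only a sketch with function names of my own choosing, and it assumes plain dotted version strings such as "6.1.2". The official upgrade path and interoperability pages remain the source of truth.

```python
MIN_SOURCE_VERSION = (6, 1, 2)  # earliest SRM version that can upgrade to 8.1
TARGET_VERSION = (8, 1)


def parse_version(text: str) -> tuple:
    """Turn a dotted version string like '6.1.2' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def can_upgrade_to_8_1(current_version: str) -> bool:
    """Return True if current_version is 6.1.2 or later and older than 8.1."""
    version = parse_version(current_version)
    return MIN_SOURCE_VERSION <= version < TARGET_VERSION
```

For example, `can_upgrade_to_8_1("6.5.1")` returns `True`, whereas `can_upgrade_to_8_1("6.1.1")` returns `False`, meaning an intermediate upgrade would be needed first.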

Download the VMware Site Recovery Manager 8.1.0.4 self-extracting installer here to the server and, if you are using storage replication, the updated Storage Replication Adapter (SRA). Review the release notes here, and the SRM upgrade documentation centre here.

Database Backup

Before starting the upgrade make sure you take a backup of the embedded vPostgres database, or the external database. Full instructions can be found here, in summary:

  • Log into the SRM Windows server and stop the VMware Site Recovery Manager service
  • From command prompt run the following commands, replacing the db_username and srm_backup_name parameters, and the install path and port if they were changed from the default settings
cd C:\Program Files\VMware\VMware vCenter Site Recovery Manager Embedded Database\bin
pg_dump -Fc --host 127.0.0.1 --port 5678 --username=db_username srm_db > srm_backup_name
  • If you need to restore the vPostgres database follow the instructions here
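
If you prefer to script the backup, the same pg_dump command can be built and run from Python. This is a minimal sketch, assuming the default install path and port shown above; the function names and the example user name are hypothetical. pg_dump may still prompt for the database password, and the SRM service should be stopped first as described.

```python
import subprocess
from pathlib import PureWindowsPath

# Default location of the embedded database tools; change if customised.
DEFAULT_BIN_DIR = PureWindowsPath(
    r"C:\Program Files\VMware\VMware vCenter Site Recovery Manager Embedded Database\bin"
)


def build_pg_dump_command(db_username, port=5678, host="127.0.0.1",
                          database="srm_db", bin_dir=DEFAULT_BIN_DIR):
    """Build the pg_dump argument list using the same flags as above."""
    return [
        str(bin_dir / "pg_dump"),
        "-Fc",  # custom-format archive
        "--host", host,
        "--port", str(port),
        f"--username={db_username}",
        database,
    ]


def backup_srm_database(db_username, backup_path, **options):
    """Run pg_dump and write its output to backup_path."""
    with open(backup_path, "wb") as backup_file:
        # Equivalent to redirecting stdout with '>' at the command prompt.
        subprocess.run(build_pg_dump_command(db_username, **options),
                       stdout=backup_file, check=True)
```

Passing the arguments as a list avoids quoting problems with the spaces in the install path.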

In addition to backing up the database, check the health of the SRM servers and confirm there are no pending reboots. Log into the vSphere web client and navigate to the Site Recovery section. Verify there are no pending cleanup operations or configuration issues; all recovery plans and protection groups should be in a Ready state.

Process

As identified above, vSphere Replication should be upgraded before Site Recovery Manager. In this instance we are using Nimble storage replication, so the Storage Replication Adapter (SRA) should be upgraded first. Download and run the installer for the SRA upgrade; in most cases it is a simple next, install, finish.

We can now commence the Site Recovery Manager upgrade. It is advisable to take a snapshot of the server and ensure backups are in place first. On the SRM server run the executable downloaded earlier.

  • Select the installer language and click Ok, then Next
  • Click Next on the patent screen, accept the EULA and click Next again
  • Double-check you have performed all pre-requisite tasks and click Next
  • Enter the FQDN of the Platform Services Controller and the SSO admin password, click Next
  • The vCenter Server address is auto-populated, click Next
  • The administrator email address and local host ports should again be auto-populated, click Next
  • Click Yes when prompted to overwrite registration
  • Select the appropriate certificate option, in this case keeping the existing certificate, click Next
  • Check the database details and enter the password for the database account, click Next
  • Configure the service account to run the SRM service, again this will retain the existing settings by default, click Next
  • Click Install, then Finish once complete

Post-Upgrade

After Site Recovery Manager is upgraded log into the vSphere client. If the Site Recovery option does not appear immediately you may need to clear your browser cache, or restart the vSphere client service.

On the summary page confirm both sites are connected. You may need to reconfigure the site pair if you encounter connection problems.

Validate the recovery plan and run a test to confirm there are no configuration errors.

The test should complete successfully.

I can also check the replication status and Storage Replication Adapter status.

Site Recovery Manager 6.x Install Guide

This post will walk through the installation of Site Recovery Manager (SRM) to protect virtual machines from site failure. SRM plugs into vCenter to protect virtual machines replicated to a failover site using array based replication or vSphere Replication. In the event of a site outage, or an outage of components within a site meaning production virtual machines can no longer run there, SRM brings online the replicated datastore and VMs in vSphere, with a range of automated customisation options such as assigning new IP addresses, boot orders, dependencies, running scripts, etc. After a failover SRM can reverse the replication direction and protect virtual machines ready to fail back, all from within the vSphere web client.

Site Recovery Manager now has integration with the HTML5 vSphere client, see VMware Site Recovery Manager 8.x Upgrade Guide for more information.

Requirements

  • SRM is installed on a Windows machine at the protected site and the recovery site. SRM requires an absolute minimum of 2 vCPU, 2 GB RAM and 5 GB disk available; more is recommended for large environments and installations with an embedded database.
  • The Windows server should have User Account Control (UAC) disabled (in the registry, not just set to never notify) as this interferes with the install.
  • Each SRM installation requires its own database, this can be embedded for small deployments, or external for large deployments.
  • A vCenter Server must be in place at both the protected site and the recovery site.
  • SRM supports both embedded and external Platform Services Controller deployments. If the external deployment method is used ensure the vCenter at the failover site is able to connect to the Platform Services Controller (i.e. it isn’t in the primary site). For more information click here.
  • The vCenter Server, Platform Services Controller, and SRM versions must be the same on both sites.
  • You will need the credentials of the vCenter Server SSO administrator for both sites.
  • For vCenter Server 6.0 U2 compatibility use SRM v6.1.1, vCenter Server 6.0 U3 use SRM v6.1.2 and for vCenter Server 6.5 and 6.5 U1 use v6.5 or v6.5.1 of SRM.
  • Check compatibility of other VMware products using the Product Interoperability Matrix.
  • If there are any firewalls between the management components review the ports required for SRM in this KB.
  • SRM can be licensed in packs of 25 virtual machines, or for unlimited virtual machines on a per CPU basis with vCloud Suite. Read more about SRM licensing here.
  • Array based replication or vSphere Replication should be in place before beginning the SRM install. If you are using array based replication contact your storage vendor for their best practices guide and the Storage Replication Adapter, which is installed on the same server as SRM.
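
The version pairings listed in the requirements can be captured in a small lookup, which is handy when checking several sites. This sketch only contains the combinations mentioned in this post. The function and key names are my own, and the Product Interoperability Matrix should be consulted for anything else.

```python
# vCenter Server version -> SRM versions, as listed in the requirements above.
SRM_FOR_VCENTER = {
    "6.0 U2": ["6.1.1"],
    "6.0 U3": ["6.1.2"],
    "6.5": ["6.5", "6.5.1"],
    "6.5 U1": ["6.5", "6.5.1"],
}


def srm_versions_for(vcenter_version: str) -> list:
    """Return SRM versions listed for a vCenter version, or an empty list."""
    return SRM_FOR_VCENTER.get(vcenter_version, [])


def sites_match(protected: dict, recovery: dict) -> bool:
    """Check vCenter, PSC, and SRM versions are the same on both sites.

    Each site is a dict with 'vcenter', 'psc', and 'srm' version strings.
    """
    return all(protected[key] == recovery[key] for key in ("vcenter", "psc", "srm"))
```

An empty list from `srm_versions_for` simply means the combination is not covered here, not that it is unsupported.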

As well as the requirements listed above the following points are best practices which should also be taken into consideration:

  • Small environments can host the SRM installation on the same server as vCenter Server; for large environments SRM should be installed on a different system.
  • For vCenter Server, Platform Services Controller, Site Recovery Manager servers, and vSphere Replication (if applicable) use FQDN where possible rather than IP addresses.
  • Time synchronization should be in place across all management nodes and ESXi hosts.
  • It is best practice to have Active Directory and DNS servers already running at the failover site.
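
As a quick way to review the FQDN recommendation, the sketch below flags any management endpoints that have been recorded as IP addresses. The component names and addresses in the example are hypothetical, and the function names are my own.

```python
import ipaddress


def is_ip_address(value: str) -> bool:
    """Return True if value is an IPv4 or IPv6 address rather than a name."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


def endpoints_using_ip(endpoints: dict) -> list:
    """Return the names of components registered by IP instead of FQDN."""
    return sorted(name for name, address in endpoints.items()
                  if is_ip_address(address))


# Hypothetical inventory of management components.
example_endpoints = {
    "vcenter": "vcenter01.lab.local",
    "srm": "192.168.10.25",
    "vsphere_replication": "vr01.lab.local",
}
```

In this example only the SRM server would be flagged, as it is the only entry recorded with an IP address.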

Installation

In this example we will be installing Site Recovery Manager using Nimble array based replication. There is a vCenter Server with embedded Platform Services Controller already installed at each site. The initial screenshots are from an SRM v6.1.1 install, but I have also validated the process with SRM v6.5.1 and vCenter 6.5 U1.

The virtual machines we want to protect are in datastores replicated by the Nimble array. For more information on the storage array pre-installation steps see the Nimble Storage Integration post referenced below. The Site Recovery Manager install, configuration, and failover guides have no further references to Nimble and are the same for all vendors and replication types.

Part 1 – Nimble Storage Integration with SRM

Part 2 – Site Recovery Manager Install Guide

Part 3 – Site Recovery Manager Configuration and Failover Guide

Installing SRM

The installation is straightforward: download the SRM installer and follow the steps below for each site. We’ll install SRM on the Windows server for the primary / protected site first, and repeat the process for the DR / failover site. We can then pair the two sites together and create recovery plans.

SRM 6.5.1 (vSphere 6.5 U1) Download | Release Notes | Documentation

SRM 6.5 (vSphere 6.5) Download | Release Notes | Documentation

SRM 6.1.2 (vSphere 6.0 U3) Download | Release Notes | Documentation

SRM 6.1.1 (vSphere 6.0 U2) Download | Release Notes | Documentation

Log into the Windows server where SRM will be installed as an administrator, and right click the downloaded VMware-srm-version.exe file. Select Run as administrator. If you are planning on using an external database then the ODBC data source must be configured first. For SQL integrated Windows authentication, make sure you log into the Windows server using the account that has database permissions, both to configure the ODBC data source and to run the SRM installer.

Select the installer language and click Ok.

Click Next to begin the install wizard.

Review the patent information and click Next.

Accept the EULA and click Next.

Confirm you have read the prerequisites located at http://pubs.vmware.com/srm-61/index.jsp by clicking Next.

Select the destination drive and folder, then click Next.

Enter the IP address or FQDN of the Platform Services Controller that will be registered with this SRM instance, in this case the primary site. If possible use the FQDN to make IP address changes easier if required at a later date. Enter valid credentials to connect to the PSC and click Next. If your vCenter Server is using an embedded deployment model then enter your vCenter Server information.

Accept the PSC certificate when prompted. The vCenter Server will be detected from the PSC information provided. Confirm this is correct and click Next. Accept the vCenter certificate when prompted.

Enter the site name that will appear in the Site Recovery Manager interface, and the SRM administrator email address. Enter the IP address or FQDN of the local server, again use the FQDN if possible, and click Next.

In this case as we are using a single protected site and recovery site we will use the Default Site Recovery Manager Plug-in Identifier. For environments with multiple protected sites create a custom identifier. Click Next.

Select Automatically generate a certificate, or upload one of your own if required, and click Next.

Select an embedded or external database server. If you are using an external database you will need a DSN entry configured in ODBC data sources on the local Windows server referencing the external data source. Click Next.

If you opted for the embedded database you will be prompted to enter a new database name and create new database credentials. Click Next.

Configure the account to run the SRM services, if applicable, and click Next.

Click Install to begin the installation.

Site Recovery Manager is now installed. Repeat the process to install SRM on the Windows server in the DR / recovery site, referencing the local PSC and changing the site names as appropriate. If you are using storage based replication you also need to install the Storage Replication Adapter (SRA) on the same server as Site Recovery Manager. In this example I have installed the Nimble SRA, available from InfoSight downloads, which is just a next and finish installer.

After each site installation of SRM you will see the Site Recovery Manager icon appear in the vSphere web client for the corresponding vCenter Server.

Provided the datastores are replicated, either using vSphere Replication or array based replication, we can now move on to pairing the sites and creating recovery plans in Part 3.
