Cloud Disaster Recovery Options for VMware Virtual Machines

Introduction

In my day job I am often asked about the use of cloud for disaster recovery. Some organisations only operate out of a single data centre, or building, while others have a dual-site setup but want to explore a third option in the cloud. Either way, using cloud resources for disaster recovery can be a good way to learn and validate different technologies, potentially with a view to further migrations as data centre and hardware contracts expire.

This post takes a look at the different cloud-based disaster recovery options available for VMware workloads. It is not an exhaustive list but provides some ideas. Further work will be needed to build a resilient network architecture depending on the event you are looking to protect against. For example: do you have available network links if your primary data centre is down? Can your users and applications still route to private networks in the cloud? Are your services internet facing, allowing you to make your cloud site the ingress and egress point? As with any cloud resources, in particular if you are building your own services, a shared security model applies which should be fully understood before deployment. Protecting VMware workloads should only form part of your disaster recovery strategy; other dependencies, both technical and process-related, will also play a part. For more information on considering the bigger picture see Disaster Recovery Strategy – The Backup Bible Review.

Concepts

  • DRaaS (Disaster Recovery as a Service) – A managed service that will typically involve some kind of data replication to a site where the infrastructure is entirely managed by the service provider. The disaster recovery process is built using automated workflows and runbooks, such as scaling out capacity and bringing virtual machines online. An example of DRaaS is VMware Cloud Disaster Recovery, which we’ll look at in more detail later on.
  • SaaS (Software as a Service) – An overlay software solution may be able to manage the protection of data and failover, but may not include the underlying infrastructure components as a whole package. Typically the provider manages the hosting, deployment, and lifecycle management of the software, but either the customer or another service provider is responsible for the management and infrastructure of the protected and recovery sites.
  • IaaS and PaaS (Infrastructure as a Service and Platform as a Service) – Various options exist around building disaster recovery solutions based on infrastructure or platforms consumed from a service provider. This approach will generally require more effort from administrators to set up and manage, but may offer greater control. An example is installing VMware Site Recovery Manager (self-managed) to protect virtual machines running on VMware-based IaaS. Alternatively, third party backup solutions could be used with cloud storage repositories and cloud hosted recovery targets.
  • Hybrid Cloud – The VMware Software Defined Data Centre (SDDC) can run on-premises and on top of cloud providers and hyperscalers, delivering a consistent operating platform. Disaster recovery is one of the common use cases for a hybrid cloud model, as shown in the whiteboard below. Each of the solutions covered in this post focuses on a hybrid cloud deployment of VMware software in an on-premises data centre and in the cloud.
Hybrid Cloud Use Cases

VMware Cloud Disaster Recovery

VMware Cloud Disaster Recovery (VCDR) replicates virtual machines from on-premises to cloud-based scale-out file storage, which can be mounted to on-demand compute instances when required. This simplifies failover to the cloud and lowers the cost of disaster recovery. VCDR allows for live mounting of a chosen restore point for fast recovery from ransomware. Recently ransomware has overtaken events like power outages, natural disasters, human error, and hardware failure as the number one cause of disaster recovery events.

VCDR uses encrypted AWS S3 storage with AWS Key Management Service (KMS) as a replication target, protecting virtual machines running on VMware vSphere on-premises. There is no requirement to run the full SDDC / VMware Cloud Foundation (VCF), vSAN, or NSX at the replication source site. If and when required, the scale-out file system is mounted to compute nodes using VMware Cloud (VMC) on AWS, without the need to refactor or change any of the virtual machine file formats. VCDR also includes built-in audit reporting, continuous health checks at 30-minute intervals, and test failover capabilities.

VMware Cloud on AWS provides the VMware SDDC as a managed service running on dedicated AWS bare-metal hardware. VMware manage the full infrastructure stack and lifecycle of the SDDC. The customer sets security and access configuration, including data location. Currently VCDR is only available using VMware Cloud on AWS as the target for cloud compute, with the following deployment options:

  • On Demand – virtual machines are replicated to the scale-out file storage; when disaster recovery is invoked, an automated SDDC deployment is initiated. When the SDDC is ready the file system is mounted to the SDDC and virtual machines are powered on. Typically this means a Recovery Time Objective (RTO) of around 4 hours. For services that can tolerate a longer RTO, the benefit of this deployment model is that the customer only pays for the scale-out storage consumed, and then pays for the compute on-demand should it ever be needed.
  • Pilot Light – a small VMware Cloud on AWS environment exists, typically 3 hosts. Virtual machines are replicated to the scale-out file storage; when disaster recovery is invoked, the file system is instantly mounted to the existing SDDC and virtual machines are powered on. Depending on the number of virtual machines being brought online, the SDDC automatically scales out the number of physical nodes. This brings the RTO down to as little as a few minutes. The customer pays for the minimum VMware Cloud on AWS capacity to be available at all times, but this can be scaled out on-demand, offering significant cost savings over maintaining an entire secondary infrastructure stack.
VMware Cloud Disaster Recovery

The cloud-based orchestrator behind the service is provided as SaaS, with a connector appliance deployed on-premises to manage and encrypt replication traffic. After breaking replication and mounting the scale-out file system, administrators manage virtual machines using the consistent experience of vSphere and vCenter. Startup priorities can be set to ensure critical virtual machines are started first. At this point virtual machines are still running in the scale-out file system, and will begin to storage vMotion over to the vSAN datastore provided by the VMware Cloud on AWS compute nodes. The storage vMotion time can vary depending on the amount of data and the number of nodes (more nodes, and therefore more physical NICs, provide more network bandwidth), however the vSAN cache capabilities can help alleviate any performance hit during this time. When the on-premises site is available again, replication is reversed, sending only changed blocks, ready for failback.

You can try out VCDR using the VMware Cloud Disaster Recovery Hands-On Lab; additional information can be found at the VMware Cloud Disaster Recovery Solution and VMware Cloud Disaster Recovery Documentation pages.

VMware Site Recovery Manager

VMware Site Recovery Manager (SRM) has been VMware’s main disaster recovery solution for a number of years. SRM enables policy-driven automation of virtual machine failover between sites. Traditionally SRM has been used to protect vSphere workloads in a primary data centre using a secondary data centre also running VMware vSphere infrastructure. One of the benefits of the hybrid cloud model, utilising VMware software in a cloud provider like AWS, Azure, Google Cloud, or Oracle Cloud, is the consistent experience of the SDDC stack, allowing continuity of solutions like SRM.

SRM in this scenario can be used with an on-premises data centre as the protected site, and a VMware stack using VMware Cloud on AWS, Azure VMware Solution (AVS), Google Cloud VMware Engine (GCVE), or Oracle Cloud VMware Solution (OCVS) as the recovery site. SRM can also be used to protect virtual machines within one of the VMware cloud-based offerings, for example failover between regions, or even between cloud providers. Of these different options, Site Recovery Manager can be deployed and managed by the customer, whereas VMware Cloud on AWS also offers a SaaS version of Site Recovery Manager: VMware Site Recovery, which is covered in the next section.

VMware Site Recovery

SRM does require the recovery site to be up and running, but can still prove value for money. Using the hybrid cloud model, infrastructure in the cloud can be scaled out on-demand to fulfil failover capacity, reducing the amount of standby hardware required. The difference here is that vSphere Replication replicates virtual machines to the SDDC vSAN datastore, whereas VCDR replicates to a separate scale-out file system. The minimum number of nodes may be driven by storage requirements, depending on the amount of data being protected. The recovery site could also be configured active/active, or run test and dev workloads that can be shut down to reclaim compute capacity. Again, storage overhead is a consideration when deploying this type of model. Each solution will have its place depending on the use case.

SRM allows for centralised recovery plans of VMs and groups of VMs, with features like priority groups, dependencies, shutdown and startup customisations (including IP address changes using VMware Tools), and non-disruptive recovery testing. If you’ve used SRM before, the concept is the same when using a VMware cloud-based recovery site as with a normal data centre; an SRM appliance is deployed and registered with vCenter to collect objects like datastores, networks, and resource pools required for failover. If you haven’t used SRM before you can try it for free using either the VMware Site Recovery Manager Evaluation, or the VMware Site Recovery Hands-on Lab. Additional information can be found at the VMware Site Recovery Manager Solution and VMware Site Recovery Manager Documentation pages.

VMware Site Recovery

VMware Site Recovery is the same product as Site Recovery Manager, described above, but in SaaS form. VMware Site Recovery is a VMware Cloud based add-on for VMware Cloud on AWS. The service can link to Site Recovery Manager on-premises to enable failover to a VMware Cloud on AWS SDDC, or it can provide protection and failover between SDDC environments in different VMware Cloud on AWS regions. At the time of writing VMware Site Recovery is not available with any other cloud providers. As a SaaS solution VMware Site Recovery is naturally easy to enable; it just needs activating in the VMware Cloud portal. You can find out more from the VMware Site Recovery Solution page.

Closing Notes

For more information on the solutions listed see the VMware Disaster Recovery Solutions page, and check in with your VMware account team to understand the local service provider options relevant to you. There are other solutions available from VMware partners and backup providers. Your existing backup solution, for example, may offer a DRaaS add-on, or the capability to back up or replicate to cloud storage, which can be used to build out your own disaster recovery solution in the cloud.

The table below shows a high-level comparison between the VMware Cloud Disaster Recovery and Site Recovery Manager offerings. As you can see there is a trade-off between cost and speed of recovery; there are use cases for each solution, and in some cases perhaps both side by side. Hopefully in future these products will fully integrate to allow DRaaS management from a single interface or source of truth where multiple Recovery Point Objective (RPO) and RTO requirements exist.

Solution | Service Type | Replication | Failover | RPO | Pricing
VMware Cloud Disaster Recovery | On demand DRaaS | Cloud based file system | Live mount when capacity is available | ~4 hours | Per VM, per TiB of storage, list price is public here. VMC on AWS capacity may be needed*
VMware Site Recovery | Hot DRaaS | Directly to failover capacity | Fast RTOs using pre-provisioned failover capacity | As low as 5 minutes with vSAN at the protected site, or 15 minutes without vSAN | Per VM, list price is public here. vSphere Replication is also needed**
VMware Site Recovery Manager | Self-managed | Directly to failover capacity | Fast RTOs using pre-provisioned failover capacity | As low as 5 minutes with vSAN at the protected site, or 15 minutes without vSAN | Per VM, in packs of 25 VMs. vSphere Replication is also needed**
VMware Cloud Disaster Recovery (VCDR) and Site Recovery Manager (SRM) side-by-side comparison

*VMware Cloud on AWS capacity is needed depending on the deployment model, detailed above. For pilot light a minimum of 3 nodes are running all the time; these can be discounted using 1 or 3 year reserved instances. For on-demand, if failover is required the VMC capacity is provisioned using on-demand pricing. List prices for both can be found here, but VMware also have a specialist team that will work out the sizing for you.

**vSphere Replication is not sold separately but is included in the following versions of vSphere: Essentials Plus, Standard, Enterprise, Enterprise Plus, and Desktop.

AWS FSx File Server Storage for VMware Cloud on AWS

Amazon FSx for Windows File Server is an excellent example of quick and easy native AWS service integration with VMware Cloud on AWS. Hosting a Windows file share is a common setup in on-premises data centres; it might be across Windows Servers or dedicated file-based storage presenting Server Message Block (SMB) / Common Internet File System (CIFS) shares over the network. When migrating virtual machines to VMware Cloud on AWS, an alternative solution may be needed if the data is large enough to impact capacity planning of VMware Cloud hosts, or if it indeed resides on a dedicated storage array.

AWS FSx

FSx is Amazon’s fully managed file storage offering that comes in two flavours: FSx for Windows File Server and FSx for Lustre (for high-performance workloads). This post will focus on FSx for Windows File Server, which provides a managed file share capable of handling thousands of concurrent connections from Windows, Linux, and macOS clients that support the industry-standard SMB protocol.

FSx is built on Windows Server, with AWS managing all the underlying file system infrastructure, and can be consumed by users and compute services such as VMware Cloud on AWS VMs, Amazon WorkSpaces, or Elastic Compute Cloud (EC2). File-based backups are automated and use Simple Storage Service (S3) with configurable lifecycle policies for archiving data. FSx integrates with Microsoft Active Directory, enabling standardised user permissions and migration of existing Access Control Lists (ACLs) from on-premises using tools like Robocopy. As you would expect, file systems can be spun up and down on-demand, with a consumption-based pricing model and different performance tiers of disk. You can read more about the FSx service and additional features such as user quotas and data deduplication in the AWS FSx FAQs.

Example Setup

VMware-Cloud-FSx-Example

In the example above, FSx is deployed to the same Availability Zones as VMware Cloud on AWS for continuous availability. Disk writes are synchronously replicated across Availability Zones to a standby file server. In the event of a service disruption FSx automatically fails over to the standby server. Data is encrypted in transit and at rest, and uses the 25 Gbps Elastic Network Interface (ENI) between VMware Cloud and the AWS backbone network. There are no data egress charges for using the ENI connection, but there may be cross-AZ charges from AWS in multi-AZ configurations. For more information on the connected VPC and services see AWS Native Services Integration With VMware Cloud on AWS.

A reference architecture for Integrating Amazon FSx for Windows Servers with VMware Cloud on AWS is available from VMware, along with a write-up by Adrian Roberts here. AWS FSx allows single-AZ or multi-AZ deployments, with single-AZ file systems supporting Microsoft Distributed File System Replication (DFSR) compatible with your own namespace servers, which is the model used in the VMware reference architecture. At the time of writing custom DNS names are still on the roadmap for multi-AZ. You can see the full table of feature support by deployment type in the Amazon FSx for Windows File Server User Guide.

FSx Setup

To provide user-based authentication, access control, and DNS resolution for FSx file shares, you can use your existing Active Directory domain or deploy AWS Managed Microsoft AD using AWS Directory Services. You will need your Active Directory details ready before starting the FSx deployment, along with the Virtual Private Cloud (VPC) and subnet information to use.

Log into the AWS console and locate FSx under Storage from the Services drop-down. In the FSx splash-screen click Create file system. On this occasion, we are creating a Windows file system.

FSx-Setup-1

Enter the file system details, starting with the file system name, deployment type, storage type, and capacity.

FSx-Setup-2

A throughput capacity value is recommended and can be customised based on the data requirements. Select the VPC, Security Group, and subnets to use. In this example, I have selected the subnets connected to VMware Cloud on AWS as defined in the ENI setup.

FSx-Setup-3

Enter the Active Directory details, including service accounts and DNS servers. If desired, you can make changes to the encryption keys, daily backup window, maintenance window, and add any required resource tags. Review the summary page and click Create file system.

FSx-Setup-4

The file system is created and will show a status of Available once complete.

FSx-Setup-5
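The same deployment can also be scripted. The sketch below uses boto3, the AWS SDK for Python, to create a multi-AZ FSx for Windows File Server file system equivalent to the console walkthrough above. The region, subnet, security group, and directory IDs are placeholders for your own environment, and the sizing values are illustrative only.

```python
import boto3

# Placeholder IDs: replace with the VPC subnets, Security Group, and
# AWS Managed Microsoft AD (or self-managed AD config) from your environment.
fsx = boto3.client("fsx", region_name="eu-west-2")

response = fsx.create_file_system(
    FileSystemType="WINDOWS",
    StorageType="SSD",
    StorageCapacity=1024,                      # GiB
    SubnetIds=["subnet-aaaa1111", "subnet-bbbb2222"],
    SecurityGroupIds=["sg-cccc3333"],
    Tags=[{"Key": "Name", "Value": "vmc-file-share"}],
    WindowsConfiguration={
        "ActiveDirectoryId": "d-1234567890",   # AWS Managed Microsoft AD directory
        "DeploymentType": "MULTI_AZ_1",
        "PreferredSubnetId": "subnet-aaaa1111",
        "ThroughputCapacity": 32,              # MB/s
        "AutomaticBackupRetentionDays": 7,
    },
)

print(response["FileSystem"]["FileSystemId"])
```

If you are joining an existing on-premises domain rather than AWS Managed Microsoft AD, the SelfManagedActiveDirectoryConfiguration block is used in place of ActiveDirectoryId.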

If you’re not using the default Security Group with FSx, then the following ports will need to be defined in rules for inbound and outbound traffic: TCP/UDP 445 (SMB), TCP 135 (RPC), TCP/UDP 1024-65535 (RPC ephemeral port range). There may be additional Active Directory ports required for the domain the file system is being joined to.

Further to the FSx Security Group, the ENI Security Group also needs the SMB and RPC port ranges added as inbound and outbound rules to allow communication between VMware Cloud on AWS and the FSx service in the connected VPC. In any case, when configuring Security Group or firewall rules, the source or destination should be the clients accessing the file system, or, if applicable, any other file servers participating in DFS Replication. AWS Security Groups are accessible in the console under VPC. You can either create a dedicated Security Group or modify an existing ruleset. The Security Group in use by the VMware Cloud ENI can be found under EC2 > ENI.

FSx-Security-Group
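If you prefer to script the rules rather than use the console, the sketch below adds the SMB and RPC ports to a Security Group with boto3. The group ID and source CIDR are placeholders; use the FSx or ENI Security Group and the address range of the clients (or DFS Replication partners), and repeat with authorize_security_group_egress for the outbound direction.

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-2")

# Placeholder values for illustration.
security_group_id = "sg-cccc3333"      # FSx or VMware Cloud ENI Security Group
client_cidr = "192.168.1.0/24"         # clients accessing the file share

rules = [
    {"IpProtocol": "tcp", "FromPort": 445,  "ToPort": 445},    # SMB
    {"IpProtocol": "udp", "FromPort": 445,  "ToPort": 445},    # SMB
    {"IpProtocol": "tcp", "FromPort": 135,  "ToPort": 135},    # RPC endpoint mapper
    {"IpProtocol": "tcp", "FromPort": 1024, "ToPort": 65535},  # RPC ephemeral range
    {"IpProtocol": "udp", "FromPort": 1024, "ToPort": 65535},  # RPC ephemeral range
]

ec2.authorize_security_group_ingress(
    GroupId=security_group_id,
    IpPermissions=[{**rule, "IpRanges": [{"CidrIp": client_cidr}]} for rule in rules],
)
```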

With the SMB ports open for the FSx and ENI Security Groups, remember that the traffic will also hit the VMware Cloud on AWS Compute Gateway. In the VMware Cloud Services Portal add the same rules to the Compute Gateway, and to the Distributed Firewall if you’re using micro-segmentation. The Compute Gateway Firewall is accessible from the Networking & Security tab of the SDDC.

VMC_GW_FW

Virtual Machines in VMware Cloud on AWS will now be able to access the FSx file shares across the ENI using the DNS name for the share or UNC path.

The FSx service in the AWS console provides some options for managing file systems. Storage capacity, throughput, and IOPS can be viewed quickly and added to a CloudWatch dashboard. CloudWatch Logs can also be ingested by vRealize Log Insight Cloud from the VMware Cloud Services Portal.

FSx-Monitoring
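The same metrics can be pulled programmatically if you want to feed them into your own tooling. The sketch below queries the FreeStorageCapacity metric for a file system from CloudWatch using boto3; the region and file system ID are placeholders.

```python
import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-2")

# Placeholder file system ID from the FSx console or create_file_system response.
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/FSx",
    MetricName="FreeStorageCapacity",
    Dimensions=[{"Name": "FileSystemId", "Value": "fs-0123456789abcdef0"}],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,                 # 5-minute data points
    Statistics=["Average"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"], point["Unit"])
```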

Alexa, Add 2 Hosts to my SDDC

Recently I was doing labs for the AWS Developer Associate exam when I remembered a VMware blog I read some time ago about using Amazon Alexa to invoke VMware Cloud Application Programming Interfaces (APIs). The post was Amazon Alexa and VMware Cloud on AWS by Gilles Chekroun, and I decided to give it a go. First up, credit to Gilles for all the code and the process outlined below. The Alexa Developer Console has improved over the last couple of years, so I have included some updated screenshots and tweaks. Finally, this is just a bit of fun!

AlexaExample

Let’s take a look at some of the services involved:

AWS Lambda is a highly scalable serverless compute service, enabling customers to run application code on-demand without having to worry about any of the underlying infrastructure. Lambda supports multiple programming languages and uses functions to execute your code upon specific triggers. Event Sources are supported AWS services, or partner services used to trigger your Lambda functions with an operational event. You only pay for the compute power required when the function or code is running, which provides a cost-optimised solution for serverless environments.

Alexa, named after the Great Library of Alexandria, is Amazon’s Artificial Intelligence (AI) based virtual assistant, allowing users to make voice-initiated requests or ask questions. Alexa works with Echo devices to listen for a wake word, using deep learning technology running on the device, which starts the Alexa Voice Service. The Alexa Voice Service selects the correct Alexa Skill based on user intent. Intents are words, or phrases, users say to interact with skills. Skills can be used to send POST requests to Lambda endpoints, or HTTPS web service endpoints, performing logic and returning a response in JSON format. The JSON is converted to an output which is then relayed back via the Echo device using text-to-speech synthesis. You can read more about using Alexa to invoke Lambda functions at Host a Custom Skill as an AWS Lambda Function in the Alexa Skills Kit documentation.
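To illustrate how the pieces fit together, here is a minimal Python sketch of an Alexa-style Lambda handler. This is not Gilles’ code; the AddHostsIntent name and responses are assumptions for illustration. It simply picks a reply based on the intent name and wraps it in the JSON structure the Alexa Voice Service expects.

```python
# Minimal illustrative Alexa skill handler running as an AWS Lambda function.

def build_response(speech_text, end_session=True):
    """Wrap plain text in the JSON structure the Alexa Voice Service expects."""
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": speech_text},
            "shouldEndSession": end_session,
        },
    }

def lambda_handler(event, context):
    request = event.get("request", {})

    if request.get("type") == "LaunchRequest":
        return build_response("Welcome, what would you like to do with your SDDC?", False)

    if request.get("type") == "IntentRequest":
        intent = request["intent"]["name"]
        if intent == "AddHostsIntent":
            # In a real skill this is where the VMware Cloud API call would be made.
            return build_response("OK, I am adding two hosts to your SDDC.")

    return build_response("Sorry, I didn't understand that request.")
```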

VMware Cloud API Access

VMware Cloud APIs can be accessed at https://vmc.vmware.com/swagger/index.html#/; you need to be authenticated with a vmc.vmware.com account.

VMCAPIs

To use the VMware Cloud APIs, first generate an API token from the Cloud Provider Hub, under My Account, API Tokens.

APIToken

Once an API token has been generated, it can be exchanged for an authentication token, or access token, by using a REST client to POST to:

https://console.cloud.vmware.com/cphub/api/auth/v1/authn/accesstoken

The body content type should be application/json, with {"refreshToken": "your_generated_api_token"} included in the body of the request. A successful 200 message is returned, along with the access token. Further information can be found at Using VMware Cloud Provider Hub APIs from the VMware Cloud Provider Hub API Programming Guide, or the API Explorer.
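As an illustration, the exchange can be done in a few lines of Python with the requests library. The endpoint is the one shown above, the token value is a placeholder, and the exact shape of the response body (key names for the returned access token) should be confirmed against the Programming Guide.

```python
import requests

# Placeholder: the API token generated under My Account > API Tokens.
API_TOKEN = "your_generated_api_token"

response = requests.post(
    "https://console.cloud.vmware.com/cphub/api/auth/v1/authn/accesstoken",
    headers={"Content-Type": "application/json"},
    json={"refreshToken": API_TOKEN},
)
response.raise_for_status()   # expect HTTP 200 on success

# The access token is returned in the response body; check the Programming
# Guide for the exact key name before parsing it out.
print(response.json())
```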

Lambda Function & Alexa Skill

The first step is to log into the Alexa Developer Console and create a new skill. There are built-in skills for some scenarios like smart home interaction. In this instance, I am creating a custom skill.

Alexa1

Next, I add my invocation name, which will be used to call the skill. I then import Gilles’ JSON file to populate the intents, which gives me the basis of some of the Software-Defined Data Centre (SDDC) commands, and add some extra sample dialog.

Alexa2

In the Endpoint section, I take note of the Skill ID. The Skill ID will be used to invoke my Lambda function. Over in the AWS console, I open Lambda and create the function.

Lambda1

I define the trigger as Alexa Skills Kit, and enable Skill ID verification using the Skill ID copied in the previous step.

Lambda2

Since I have CloudTrail enabled, my API calls to Lambda will be forwarded to a CloudWatch Logs stream, which we’ll take a look at shortly. I also add a Simple Notification Service (SNS) topic to email me when the Lambda function is triggered.

Lambda4
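For reference, the SNS topic and email subscription can also be created with boto3 rather than in the console; the topic name and email address below are placeholders, and the subscription needs confirming from the email that AWS sends.

```python
import boto3

sns = boto3.client("sns", region_name="eu-west-2")

# Placeholder topic name and email address.
topic = sns.create_topic(Name="lambda-vmc-alerts")
sns.subscribe(
    TopicArn=topic["TopicArn"],
    Protocol="email",
    Endpoint="me@example.com",   # confirm the subscription from the email received
)
```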

Next, I upload Gilles’ code in zip format, making a couple of tweaks to the available region settings, the org ID, SDDC ID, and API token. The code will then go ahead and exchange that API token for me.

Lambda3

I run a simple test using a pre-configured test event from the Amazon Alexa Start Session event template. Then I make a note of the Amazon Resource Name (ARN) for the Lambda function in the top right corner.

Lambda5
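The built-in template provides the test payload, but a stripped-down LaunchRequest-style event, similar to the sketch below, can also be used to exercise a handler like the one shown earlier outside of Lambda. The session, application, and request IDs are placeholders.

```python
# Illustrative LaunchRequest-style test event (IDs and timestamp are placeholders).
test_event = {
    "version": "1.0",
    "session": {
        "new": True,
        "sessionId": "amzn1.echo-api.session.example",
        "application": {"applicationId": "amzn1.ask.skill.example"},
    },
    "request": {
        "type": "LaunchRequest",
        "requestId": "amzn1.echo-api.request.example",
        "timestamp": "2021-01-01T00:00:00Z",
        "locale": "en-GB",
    },
}

# Invoke the handler sketched earlier directly, outside of Lambda.
print(lambda_handler(test_event, None))
```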

Back in the Alexa Developer Console, I can now set this Lambda ARN as the service endpoint. I save and build my skill model.

Alexa3

In the Test section, I can use the invocation phrase defined by the Alexa Skill to start the demo, and my intents as words to trigger VMware Cloud API calls via Lambda. In the test below, I have added 2 additional hosts to my SDDC.

Alexa4

Back in the AWS console, from the CloudWatch Logs stream, I can see the API calls being made to Lambda.

CloudWatchLogs

In the VMware Cloud Provider Hub, the Adding host(s) task in progress message appears on the SDDC and the status changes to adding hosts. Following notification that the hosts were successfully added, I ask Alexa again what the SDDC status is, and the new capacity of 8 hosts is correctly reported back.
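Under the hood, adding hosts is a single call to the VMware Cloud on AWS API, which is roughly what the Lambda function does on my behalf. The sketch below is a hedged illustration using the access token from the exchange described earlier; the org and SDDC IDs are placeholders, and the endpoint and request body should be confirmed against the API Explorer.

```python
import requests

# Placeholders: your org ID, SDDC ID, and a valid access token.
ORG_ID = "your-org-id"
SDDC_ID = "your-sddc-id"
ACCESS_TOKEN = "your-access-token"

response = requests.post(
    f"https://vmc.vmware.com/vmc/api/orgs/{ORG_ID}/sddcs/{SDDC_ID}/esxs",
    headers={
        "csp-auth-token": ACCESS_TOKEN,
        "Content-Type": "application/json",
    },
    json={"num_hosts": 2},   # request 2 additional hosts for the cluster
)
response.raise_for_status()   # a task is returned that can be polled for completion
print(response.json())
```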