Cloud Disaster Recovery Options for VMware Virtual Machines

Cloud Disaster Recovery Options for VMware Virtual Machines

Introduction

In my day job I am often asked about the use of cloud for disaster recovery. Some organisations only operate out of a single data centre, or building, while others have a dual-site setup but want to explore a third option in the cloud. Either way, using cloud resources for disaster recovery can be a good way to learn and validate different technologies, potentially with a view to further migrations as data centre and hardware contracts expire.

This post takes a look at the different cloud-based disaster recovery options available for VMware workloads. It is not an exhaustive list but provides some ideas. Further work will be needed to build a resilient network architecture depending on the event you are looking to protect against. For example do you have available network links if your primary data centre is down, can your users and applications still route to private networks in the cloud, are your services internet facing allowing you to make your cloud site the ingress and egress point. As with any cloud resources, in particular if you are building your own services, a shared security model applies which should be fully understood before deployment. Protecting VMware workloads should only form part of your disaster recovery strategy, other dependencies both technical and process will also play a part. For more information on considering the bigger picture see Disaster Recovery Strategy – The Backup Bible Review.

Concepts

  • DRaaS (Disaster Recovery as a Service) – A managed service that will typically involve some kind of data replication to a site where the infrastructure is entirely managed by the service provider. The disaster recovery process is built using automated workflows and runbooks; such as scaling out capacity, and bringing online virtual machines. An example DRaaS is VMware Cloud Disaster Recovery which we’ll look at in more detail later on.
  • SaaS (Software as a Service) – An overlay software solution may be able to manage the protection of data and failover, but may not include the underlying infrastructure components as a whole package. Typically the provider manages the hosting, deployment, and lifecycle management of the software, but either the customer or another service provider is responsible for the management and infrastructure of the protected and recovery sites.
  • IaaS and PaaS (Infrastructure as a Service and Platform as a Service) – Various options exist around building disaster recovery solutions based on infrastructure or platforms consumed from a service provider. This approach will generally require more effort from administrators to setup and manage but may offer greater control. An example is installing VMware Site Recovery Manager (self-managed) to protect virtual machines running on VMware-based IaaS. Alternatively third party backup solutions could be used with cloud storage repositories and cloud hosted recovery targets.
  • Hybrid Cloud – The VMware Software Defined Data Centre (SDDC) can run on-premises and overlay cloud providers and hyperscalers, delivering a consistent operating platform. Disaster recovery is one of the common use cases for a hybrid cloud model, as shown in the whiteboard below. Each of the solutions covered in this post is focused around a hybrid cloud deployment of VMware software in an on-premises data centre and in the cloud.
Hybrid Cloud Use Cases

VMware Cloud Disaster Recovery

VMware Cloud Disaster Recovery (VCDR) replicates virtual machines from on-premises to cloud based scale-out file storage, which can be mounted to on-demand compute instances when required. This simplifies failover to the cloud and lowers the cost of disaster recovery. VCDR allows for live mounting of a chosen restore point for fast recovery from ransomware. Recently ransomware has overtaken events like power outages, natural disasters, human error, and hardware failure as the number one cause of disaster recovery events.

VCDR uses encrypted AWS S3 storage with AWS Key Management Service (KMS) as a replication target, protecting virtual machines on-premises running on VMware vSphere. There is no requirement to run the full SDDC / VMware Cloud Foundation (VCF), vSAN, or NSX at the replication source site. If and when required, the scale-out file system is mounted to compute nodes using VMware Cloud (VMC) on AWS, without the need to refactor or change any of the virtual machine file formats. VCDR also includes built-in audit reporting, continuous healthchecks at 30 minute intervals, and test failover capabilities.

VMware Cloud on AWS provides the VMware SDDC as a managed service running on dedicated AWS bare-metal hardware. VMware manage the full infrastructure stack and lifecycle management of the SDDC. The customer sets security and access configuration, including data location. Currently VCDR is only available using VMware Cloud on AWS as the target for cloud compute, with the following deployment options:

  • On Demand – virtual machines are replicated to the scale-out file storage, when disaster recovery is invoked an automated SDDC deployment is initiated. When the SDDC is ready the file system is mounted to the SDDC and virtual machines are powered on. Typically this means a Recovery Time Objective (RTO) of around 4 hours. For services that can tolerate a longer RTO the benefit of this deployment model is that the customer only pays for the storage used in the scale-out storage, and then pays for the compute on-demand should it ever be needed.
  • Pilot Light – a small VMware Cloud on AWS environment exists, typically 3 hosts. Virtual machines are replicated to the scale-out file storage, when disaster recovery is invoked the file system is instantly mounted to the existing SDDC and virtual machines are powered on. Depending on the number of virtual machines being brought online, the SDDC automatically scales out the number of physical nodes. This brings the RTO time down to as little as a few minutes. The customer is paying for the minimum VMware Cloud on AWS capacity to be already available but this can be scaled out on-demand, offering significant cost savings on having an entire secondary infrastructure stack.
VMware Cloud Disaster Recovery

The cloud-based orchestrator behind the service is provided as SaaS, with a connector appliance deployed on-premises to manage and encrypt replication traffic. After breaking replication and mounting the scale-out file system administrators manage virtual machines using the consistent experience of vSphere and vCenter. Startup priorities can be set to ensure critical virtual machines are started up first. At this point virtual machines are still running in the scale-out file system, and will begin to storage vMotion over to the vSAN datastore provided by the VMware Cloud on AWS compute nodes. The storage vMotion time can vary depending on the amount of data and number of nodes (more nodes and therefore physical NICs provides more network bandwidth), however the vSAN cache capabilities can help elevate any performance hit during this time. When the on-premises site is available again replication reverses, only sending changed blocks, ready for failback.

You can try out VCDR using the VMware Cloud Disaster Recovery Hands-On Lab, additional information can be found at the VMware Cloud Disaster Recovery Solution and VMware Cloud Disaster Recovery Documentation pages.

VMware Site Recovery Manager

VMware Site Recovery Manager (SRM) has been VMware’s main disaster recovery solution for a number of years. SRM enables policy-driven automation of virtual machine failover between sites. Traditionally SRM has been used to protect vSphere workloads in a primary data centre using a secondary data centre also running a VMware vSphere infrastructure. One of the benefits of the hybrid cloud model utilising VMware software in a cloud provider like AWS, Azure, Google Cloud, or Oracle Cloud, is the consistent experience of the SDDC stack; allowing continuity of solutions like SRM.

SRM in this scenario can be used with an on-premises data centre as the protected site, and a VMware stack using VMware Cloud on AWS, Azure VMware Solution (AVS), Google Cloud VMware Engine (GCVE), or Oracle Cloud VMware Solution (OCVS) as the recovery site. SRM can also be used to protect virtual machines within one of the VMware cloud-based offerings, for example failover between regions, or even between cloud providers. Of these different options Site Recovery Manager can be deployed and managed by the customer, whereas VMware Cloud on AWS also offers a SaaS version of Site Recovery Manager; VMware Site Recovery, which is covered in the next section.

VMware Site Recovery

SRM does require the recovery site to be up and running but can still prove value for money. Using the hybrid cloud model infrastructure in the cloud can be scaled out on-demand to fulfil failover capacity, reducing the amount of standby hardware required. The difference here is that vSphere Replication is replicating virtual machines to the SDDC vSAN datastore, whereas VCDR replicates to a separate scale-out file system. The minimum number of nodes may be driven by storage requirements depending on the amount of data being protected. The recovery site could also be configured active/active, or run test and dev workloads that can be shut down to reclaim compute capacity. Again storage overhead is a consideration when deploying this type of model. Each solution will have its place depending on the use case.

SRM allows for centralised recovery plans of VMs and groups of VMs, with features like priority groups, dependencies, shut down and start up customisations, including IP address changes using VMware Tools, and non-disruptive recovery testing. If you’ve used SRM before the concept is the same for using a VMware cloud-based recovery site as a normal data centre; an SRM appliance is deployed and registered with vCenter to collect objects like datastores, networks, resource pools, etc. required for failover. If you haven’t used SRM before you can try it for free using either the VMware Site Recovery Manager Evaluation, or VMware Site Recovery Hands-on Lab. Additional information can be found at the VMware Site Recovery Manager Solution and VMware Site Recovery Manager Documentation pages.

VMware Site Recovery

VMware Site Recovery is the same product as Site Recovery Manager, described above, but in SaaS form. VMware Site Recovery is a VMware Cloud based add-on for VMware Cloud on AWS. The service can link to Site Recovery Manager on-premises to enable failover to a VMware Cloud on AWS SDDC, or it can provide protection and failover between SDDC environments in different VMware Cloud on AWS regions. At the time of writing VMware Site Recovery is not available with any other cloud providers. As a SaaS solution VMware Site Recovery is naturally easy to enable, it just needs activating in the VMware Cloud portal. You can find out more from the VMware Site Recovery Solution page.

Closing Notes

For more information on the solutions listed see the VMware Disaster Recovery Solutions page, and check in with your VMware account team to understand the local service provider options relevant to you. There are other solutions available from VMware partners and backup providers. Your existing backup solution for example may offer a DRaaS add-on, or the capability to backup or replicate to cloud storage which can be used to build out your own disaster recovery solution in the cloud.

The table below shows a high level comparison of the difference between VMware Cloud Disaster Recovery and Site Recovery Manager offerings. As you can see there is a trade off between cost and speed of recovery, there are use cases for each solution and in some cases maybe both side by side. Hopefully in future these products will fully integrate to allow DRaaS management from a single interface or source of truth where multiple Recovery Point Objective (RPO) and RTO requirements exist.

SolutionService TypeReplicationFailoverRPOPricing
VMware Cloud Disaster RecoveryOn demand DRaaSCloud based file systemLive mount when capacity is available~4 hoursPer VM, per TiB of storage, list price is public here. VMC on AWS capacity may be needed*
VMware Site RecoveryHot DRaaSDirectly to failover capacityFast RTOs using pre-provisioned failover capacityAs low as 5 minutes with vSAN at the protected site, or 15 minutes without vSANPer VM, list price is public here. vSphere Replication is also needed**
VMware Site Recovery ManagerSelf-managedDirectly to failover capacityFast RTOs using pre-provisioned failover capacityAs low as 5 minutes with vSAN at the protected site, or 15 minutes without vSANPer VM, in packs of 25 VMs. vSphere Replication is also needed**
VMware Cloud Disaster Recovery (VCDR) and Site Recovery Manager (SRM) side-by-side comparison

*VMware Cloud on AWS capacity is needed depending on the deployment model, detailed above. For pilot light a minimum of 3 nodes are running all the time, these can be discounted using 1 or 3 year reserved instances. For on-demand if failover is required then the VMC capacity is provisioned using on-demand pricing. List price for both can be found here, but VMware also have a specialist team that will work out the sizing for you.

**vSphere Replication is not sold separately but is included in the following versions of vSphere: Essentials Plus, Standard, Enterprise, Enterprise Plus, and Desktop.

Featured image by Christina @ wocintechchat.com on Unsplash

Migrate to Microsoft Azure with Azure VMware Solution

Introduction

Azure VMware Solution (AVS) is a private cloud VMware-as-a-service solution, allowing customers to retain VMware related investments, tools, and skills, whilst taking advantage of the scale and performance of Microsoft Azure.

Microsoft announced the latest evolution of Azure VMware Solution in May 2020, with the major news that AVS is now a Microsoft first party solution, endorsed and cloud verified by VMware. Microsoft are a VMware strategic technology partner, and will build and run the VMware Software Defined Data Centre (SDDC) stack and underlying infrastructure for you. The latest availability of Azure VMware Solution by region can be found here.

If you have looked at AVS by Cloud Simple before this is a new offering, consistent in architecture but now sold and supported direct from Microsoft, providing a single point of support and fully manageable from the Azure Portal. Cloud Simple were acquired by Google in late 2019.

Azure VMware Solution Explained

Azure VMware Solution is the VMware Cloud Foundation (VCF) software stack built using dedicated bare-metal Azure infrastructure, allowing you to run VMware workloads on the Microsoft Azure cloud. AVS is designed for end-to-end High Availability with built in redundancy. Microsoft own and manage all support cases, including any that may need input from VMware.

Microsoft are responsible for all the underlying physical infrastructure, including compute, network, and storage, as well as physical security of the data centres and environments. As well as hardware failure remediation and lifecycle management, Microsoft are also responsible for the deployment, patching, and upgrade of ESXi, vCenter, vSAN, NSX-T, and Identity Management. This allows the customer to consume the VMware infrastructure as a service, and rather than spending time fire fighting or applying security updates; IT staff can concentrate instead on application improvements or new projects. Host maintenance and lifecycle activities such as firmware upgrades or predictive failure remediation are all carried out with no disruption or reduction in capacity.

Microsoft’s data centres meet the high levels of perimeter and access security you would expect, with 24×7 security personnel, biometric and visual sign-in processes, strict requirements for visitors with sufficient business justification including booking, location tracking, metal detectors and security screening, security cameras including per cabinet and video archive.

The customer is still responsible for Virtual Machines and everything running within them, which includes the guest OS, software, VMware Tools, etc. Furthermore the customer also retains control over the configuration of vCenter, vSAN, NSX-T, and Identity Management. VMware administrators have full control over where their data is, and who has access to it by using Role Based Access Control (RBAC), Active Directory (AD) federation, customer managed encryption keys, and Software Defined Network (SDN) configuration including gateway and distributed firewalls.

Elevated root access to vCenter Server is also supported with AVS, and that helps to protect existing investments in third party solutions that may need certain vCenter permissions for services like backup, monitoring, Anti-Virus or Disaster Recovery. By providing operational consistency organisations are able to leverage existing VMware investments in both people skills and licensing across the VMware ecosystem, at the same time as reducing the risk in migrating to the cloud.

Connectivity between environments is visualised at a high level in the image below from Microsoft’s AVS documentation page. The orange box symbolises the VCF stack, made up of vSphere, vSAN, and NSX-T.

Azure VMware Solution

Some example scenarios where AVS may be able to resolve IT issues are as follows:

  • Data centre contract is expiring or increasing in cost:
  • Hardware or software end of life or expensive maintenance contracts
  • Capacity demand, scale, or business continuity
  • Security threats or compliance requirements
  • Cloud first strategy or desire to shift to a cloud consumption model
  • Local servers in offices are no longer needed as workforces become more remote

Azure Hybrid Benefit allows existing Microsoft customers with software assurance to bring on-premises Windows and SQL licenses to AVS. Additionally Microsoft are providing extended security updates for Windows and SQL 2008/R2 running on AVS.

There is a clear and proven migration path to AVS without refactoring whole applications and services, or even changing the VM file format or network settings. With AVS a VM can be live migrated from on-premises to Azure. Hybrid Cloud Extension (HCX) is included with AVS and enables L2 networks to be stretched to the cloud. The Azure VMware Solution assessment appliance can be deployed to calculate the number of hosts needed for existing vSphere environments, full details can be found here.

AVS Technical Specification

Azure VMware Solution uses the customers Azure account and subscription to deploy Private Cloud(s), providing a deep level of integration with Azure services and the Azure Portal. It also means tasks and features can be automated using the API. Each Private Cloud contains a vCenter Server, NSX-T manager, and at least 1 vSphere cluster using vSAN. A Private Cloud can have multiple clusters, up to a maximum of 64 hosts. Each vSphere cluster has a minimum host count of 3 and a maximum of 16. The standard node type used in Azure is the AV36, which is dedicated bare metal hardware with the following specifications:

  • CPU: Intel Xeon Gold 6140 2.3 GHz x2, 36 cores/72 hyper-threads
  • Memory: 576 GB
  • Data: 15.36 TB (8 x 1.92 TB SSD)
  • Cache: 3.2 TB (2 x 1.6 TB NVMe)
  • Network: 4 x Mellanox ConnectX-4 Lx Dual Port 25GbE

AVS uses local all-flash vSAN storage with compression and de-duplication. Storage Based Policy Management (SBPM) allows customers to define policies for IOPS based performance or RAID based protection. Storage policies can be applied to multiple VMs or right down to the individual VMDK file. By default vSAN datastore is encrypted and AVS supports customer managed external HSM or KMS solutions as well as integrating with Azure Key Vault.

An AVS Private Cloud requires at least a /22 CIDR block on deployment, which should not overlap with any of your existing networks. You can view the full requirements in the tutorial section of the AVS documentation. Access to Azure services in your subscription and VNets is achieved using an Azure ExpressRoute connection, which is a high bandwidth, low-latency, private connection with automatically provisioned Border Gateway Protocol (BGP) routing. Access to on-premises environments is enabled using ExpressRoute Global Reach. The diagram below shows the traffic flow from on-premises to AVS using ExpressRoute Global Reach. This hub and spoke network architecture also provides access to native Azure services in connected (peered) VNets, you can read the full detail here.

On-Prem to AVS

AVS Native Azure Integration

A great feature of AVS is the native integration with Azure services using Azure’s private backbone network. Although the big selling point is of course operational consistency, eventually applications can be modernised in ways that will provide a business benefit or improved user experience. Infrastructure administrators that no longer have to manage firmware updates and VMware lifecycle management are able to focus on upskilling to Azure.

Deployment of a Private Cloud with AVS takes as little as 2 hours, and some basic Azure knowledge is required  since the setup is done in the Azure Portal, and you’ll also need to create a Resource Group, VNets, subnets, a VNet gateway, and most likely an ExpressRoute too.

Screenshot 2020-09-10 at 11.28.32

To get the full value out of the solution native AWS services can be used alongside Virtual Machines. Some example integrations that can be looked at straight away are; Blob storage offering varying tiers of cost and performant object storage, Azure Files providing large scale SMB file shares with AD authentication options, and Azure Backup facilitating VM backups to an Azure Recovery Services Vault. Additional services like Azure Active Directory (AAD), Azure NetApp File Services, and Azure Application Gateway may also help modernise your environment, along with Azure Log Analytics, Azure Security Center, and Azure Update Manager.

VMware customers using, or interested in, Horizon will also note that the Horizon Cloud on Microsoft Azure service is available, and if configured accordingly will have network line of sight to your VM workloads in AVS, and native Azure services.

For more information on Azure integration see the Azure native integration section of the AVS document page, and the AVS blogs site by Trevor Davis. Further detail on Azure VMware Solution can be found at the product page, FAQ page, or on-demand webinar.