Watch a Failover from Direct Connect to Backup VPN for VMware Cloud on AWS

This post demonstrates a simulated failure of AWS Direct Connect with VMware Cloud (VMC) on Amazon Web Services (AWS). In this setup, a standby VPN has been configured to provide connectivity in the event of a Direct Connect failure. The environment consists of a 6-host stretched cluster in the eu-west-2 (London) region, across Availability Zones eu-west-2a and eu-west-2b.

In this instance, a pair of hosted private Virtual Interfaces (VIFs) is provided by a Cloud Connect service from a single third-party provider. A route-based VPN has been configured as the standby. Direct Connect with VPN as standby was introduced in SDDC v1.7. For more information, see Nico Vibert’s post here.

AWS Direct Connect: “Using AWS Direct Connect, you can establish private connectivity between AWS and your datacenter, office, or colocation environment, which in many cases can reduce your network costs, increase bandwidth throughput, and provide a more consistent network experience than Internet-based connections.”

AWS VPN: “AWS Virtual Private Network (AWS VPN) lets you establish a secure and private tunnel from your network or device to the AWS global network.”

DX_VPN_Setup

Direct Connect Outage

Before beginning, it is worth reiterating that the following screenshots do not represent a process. Provided the backup VPN is configured correctly, the customer/consumer of the service does not need to intervene; in a real-world outage, everything highlighted below happens automatically. You may also want to review further reading: How to Deploy and Configure VMware Cloud on AWS (Part 1), How to Migrate VMware Virtual Machines to VMware Cloud on AWS (Part 2), plus the additional demo post Watch VMware vSphere HA Recover Virtual Machines Across AWS Availability Zones.

The primary and secondary VIFs were taken down by the hosting third party, to help provide evidence of network resilience. At the start of the test in this environment, the VIFs are attached and available, servers in VMware Cloud are reachable from on-premises across the Direct Connect, and the backup VPN is enabled.

DX_VPN_1

DX_VPN_2

After our third-party provider disables the interfaces, the BGP and Direct Connect statuses change to down.

DX_VPN_3

DX_VPN_5

This is confirmed in the AWS console, where both the BGP status and, as a result, the VIF state are down.

DX_VPN_4
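The state shown in the console can also be read programmatically. The following is a minimal sketch, assuming boto3 is installed and credentials for the account owning the VIFs are available; it uses the `directconnect` client's `describe_virtual_interfaces` call. The helper names and the sample response (including its identifiers) are hypothetical and only illustrate the response shape, they are not taken from this environment.

```python
def summarize_vifs(response):
    """Reduce a DescribeVirtualInterfaces response to id, VIF state and BGP status."""
    summary = []
    for vif in response.get("virtualInterfaces", []):
        summary.append({
            "id": vif.get("virtualInterfaceId"),
            "state": vif.get("virtualInterfaceState"),
            "bgp": [peer.get("bgpStatus") for peer in vif.get("bgpPeers", [])],
        })
    return summary


def all_vifs_down(response):
    """True only if at least one VIF is listed and every VIF reports state 'down'."""
    vifs = summarize_vifs(response)
    return bool(vifs) and all(v["state"] == "down" for v in vifs)


def fetch_vifs(region="eu-west-2"):
    """Call the real API; requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the helpers above work without it
    client = boto3.client("directconnect", region_name=region)
    return client.describe_virtual_interfaces()


# Hypothetical sample response (made-up IDs), shaped like the API output.
SAMPLE_RESPONSE = {
    "virtualInterfaces": [
        {"virtualInterfaceId": "dxvif-example1",
         "virtualInterfaceState": "down",
         "bgpPeers": [{"bgpStatus": "down"}]},
        {"virtualInterfaceId": "dxvif-example2",
         "virtualInterfaceState": "down",
         "bgpPeers": [{"bgpStatus": "down"}]},
    ]
}

if __name__ == "__main__":
    for vif in summarize_vifs(SAMPLE_RESPONSE):
        print(vif["id"], vif["state"], vif["bgp"])
    print("all VIFs down:", all_vifs_down(SAMPLE_RESPONSE))
```

Against a live account, `summarize_vifs(fetch_vifs())` would replace the sample data.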

With the Direct Connect down, routes are redistributed using the backup VPN. The Direct Connect BGP hold timer is 90 seconds, and the BGP keepalive is 30 seconds. After 90 seconds, the BGP hold timer on the VIFs expires, and traffic starts to flow through the VPN connection.
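The timer arithmetic can be sketched as follows. This is a rough illustration, assuming the failure is detected purely by BGP hold timer expiry (no BFD or interface-down signalling) and that the hold timer restarts each time a keepalive is received; the function names are hypothetical. Under those assumptions, how long after the failure the peer is declared down depends on where in the keepalive cycle the failure occurs.

```python
def missed_keepalives(hold_time_s, keepalive_s):
    """Number of consecutive keepalive intervals that fit in the hold time."""
    if keepalive_s <= 0 or hold_time_s < keepalive_s:
        raise ValueError("hold time must be at least one positive keepalive interval")
    return hold_time_s // keepalive_s


def detection_window(hold_time_s, keepalive_s):
    """Return (shortest, longest) seconds between failure and hold timer expiry.

    Longest: the link fails just after a keepalive arrived, so the full hold
    time still has to run. Shortest: the link fails just before the next
    keepalive was due, so one keepalive interval has already elapsed.
    """
    if keepalive_s <= 0 or hold_time_s < keepalive_s:
        raise ValueError("hold time must be at least one positive keepalive interval")
    return hold_time_s - keepalive_s, hold_time_s


if __name__ == "__main__":
    # Values from this post: 90 second hold timer, 30 second keepalive.
    print(missed_keepalives(90, 30))
    print(detection_window(90, 30))
```

With the values in this post, three keepalive intervals pass before the hold timer expires, so an outage of up to roughly 90 seconds before traffic moves to the VPN is consistent with these settings.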

In the screenshot below, you can see an on-premises monitoring solution reporting on a server hosted in VMware Cloud on AWS. The server is available over the Direct Connect, drops when we disable the interfaces to simulate a failure, and is then available again over the backup VPN. The test was conducted twice.

VPN_Monitor
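For readers who want to measure the gap themselves, here is a minimal sketch of how outage windows could be derived from periodic reachability samples. The sample data below is invented purely for illustration (a 30 second polling interval is assumed) and does not reproduce the results in the screenshot; the function name is hypothetical.

```python
from datetime import datetime, timedelta


def outage_windows(samples):
    """Turn time-ordered (timestamp, reachable) samples into outage windows.

    Each window is (first_failed_sample, first_recovered_sample); the end is
    None if the target was still unreachable at the last sample.
    """
    windows = []
    down_since = None
    for timestamp, reachable in samples:
        if not reachable and down_since is None:
            down_since = timestamp                     # outage starts
        elif reachable and down_since is not None:
            windows.append((down_since, timestamp))    # outage ends
            down_since = None
    if down_since is not None:
        windows.append((down_since, None))
    return windows


# Invented samples: reachable, three failed polls, then reachable again.
START = datetime(2020, 1, 1, 12, 0, 0)
STATES = [True, True, False, False, False, True, True]
SAMPLES = [(START + timedelta(seconds=30 * i), ok) for i, ok in enumerate(STATES)]

if __name__ == "__main__":
    for began, ended in outage_windows(SAMPLES):
        print("down from", began, "until", ended, "duration", ended - began)
```

Because the samples are discrete, the measured duration is only accurate to within one polling interval on either side.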

Watch VMware vSphere HA Recover Virtual Machines Across AWS Availability Zones

This post demonstrates a simulated failure of an Availability Zone (AZ) in a VMware Cloud on AWS stretched cluster. The environment consists of a 6-host stretched cluster in the eu-west-2 (London) region, across Availability Zones eu-west-2a and eu-west-2b.

The simulation was carried out by the VMware Cloud on AWS back-end support team, to help with gathering evidence of AZ resilience. Failover works using vSphere High Availability (HA). In the event of a host failure, HA traditionally brings virtual machines online on available hosts in the same cluster. In this scenario, when the 3 hosts in AZ eu-west-2a are lost, vSphere HA automatically brings virtual machines online on the remaining 3 hosts in AZ eu-west-2b. High Availability across Availability Zones is facilitated using stretched networks (NSX-T) and storage replication (vSAN).

AWS Terminology: Each Region is a separate geographic area. Each Region has multiple, isolated locations known as Availability Zones. Each Region is completely independent. Each Availability Zone is isolated, but the Availability Zones in a Region are connected through low-latency links. An Availability Zone can be a single data centre or data centre campus.

VMC_Environment

You may also want to review further reading: How to Deploy and Configure VMware Cloud on AWS (Part 1), How to Migrate VMware Virtual Machines to VMware Cloud on AWS (Part 2), plus the additional demo post Watch a Failover from Direct Connect to Backup VPN for VMware Cloud on AWS. For more information on Stretched Clusters for VMware Cloud on AWS, see Overview and Documentation, as well as the following:

VMware FAQ | AWS FAQ | Roadmap | Product Documentation | Technical Overview | VMware Product Page | AWS Product Page | Try first @ VMware Cloud on AWS – Getting Started Hands-on Lab

Availability Zone (AZ) Outage

Before beginning, it is worth reiterating that the following screenshots do not represent a process. The customer/consumer of the service does not need to intervene unless a specific DR strategy has been put in place. In the event of a real-world outage, everything highlighted below happens automatically and is managed and monitored by VMware. You will of course want to be aware of what is happening on the platform hosting your virtual machines, which is why this post gives you a feel for what to expect. It may seem a little underwhelming, as it does just look like a normal vSphere HA failover.

When we start out in this particular environment, the vCenter Server and NSX Manager appliances are located in AZ eu-west-2a.

vcenter-2a

nsx-2a

The AZ failure simulation was initiated by the VMware back-end team. At this point, all virtual machines in Availability Zone eu-west-2a went offline, including the example virtual machines shown in the screenshots above. As expected, within 5 minutes vSphere HA automatically brought the machines online in Availability Zone eu-west-2b. All virtual machines were accessible and working without any further action.

The stretched cluster now shows the hosts in AZ eu-west-2a as unresponsive. The hosts in AZ eu-west-2b are still online and able to run virtual machines.

Host-List

The warning on the hosts located in AZ eu-west-2b is a vSAN warning, raised because there are cluster nodes down. This is expected behaviour in the event of host outages.

eu-west-2b

The vCenter Server and NSX Manager appliances are now located in AZ eu-west-2b.

vcenter-2b

nsx-2b

Availability Zone (AZ) Return to Normal

Once the Availability Zone outage has been resolved and the ESXi hosts have booted, they return as connected in the cluster. As is normal with a vSphere cluster, Distributed Resource Scheduler (DRS) then proceeds to balance resources accordingly.

Host-List-Normal

The vSAN object resync takes place and the health checks all change to green. Again, this happens automatically and is managed and monitored by VMware.

vSAN-1

vSAN-2

Using a third-party monitoring tool, we can see the brief outage during virtual machine failover, and the server down / return to normal email alerts generated for the support team.

Monitoring

This ties in with the vSphere HA events recorded for the ESXi hosts and virtual machines, which we can of course view as normal in vCenter.

VM-Logs