This post demonstrates a simulated failure of an Availability Zone (AZ) in a VMware Cloud on AWS stretched cluster. The environment consists of a 6-host stretched cluster in the eu-west-2 (London) region, spread across Availability Zones eu-west-2a and eu-west-2b.
The simulation was carried out by the VMware Cloud on AWS back-end support team to help gather evidence of AZ resilience. Failover relies on vSphere High Availability (HA): in the event of a host failure, HA traditionally brings virtual machines online on available hosts in the same cluster. In this scenario, when the 3 hosts in AZ eu-west-2a are lost, vSphere HA automatically brings virtual machines online on the remaining 3 hosts in AZ eu-west-2b. High Availability across Availability Zones is facilitated by stretched networks (NSX-T) and storage replication (vSAN).
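To make the failover behaviour concrete, the following is a minimal Python sketch of the idea: every host in one AZ goes offline and its virtual machines are restarted on the surviving hosts. It is a toy model only and does not reflect how vSphere HA is actually implemented. The host names, VM names, and the "least-loaded host" placement rule are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    name: str
    az: str
    online: bool = True
    vms: list = field(default_factory=list)


def fail_az(hosts, failed_az):
    """Mark every host in failed_az offline and restart its VMs elsewhere.

    Toy model: each displaced VM goes to the surviving host that currently
    runs the fewest VMs. This placement rule is an assumption, not the real
    vSphere HA placement logic.
    """
    displaced = []
    for host in hosts:
        if host.az == failed_az:
            host.online = False
            displaced.extend(host.vms)
            host.vms = []

    survivors = [h for h in hosts if h.online]
    if not survivors:
        raise RuntimeError("no surviving hosts to restart VMs on")

    for vm in displaced:
        target = min(survivors, key=lambda h: len(h.vms))
        target.vms.append(vm)
    return displaced


def build_cluster():
    """Build a hypothetical 6-host cluster, 3 hosts per AZ."""
    hosts = [Host(f"esx-a{i}", "eu-west-2a") for i in range(1, 4)]
    hosts += [Host(f"esx-b{i}", "eu-west-2b") for i in range(1, 4)]
    hosts[0].vms = ["vcenter", "nsx-manager"]  # hypothetical placement
    hosts[1].vms = ["app-01"]
    hosts[3].vms = ["app-02"]
    return hosts
```

Calling `fail_az(build_cluster(), "eu-west-2a")` returns the list of displaced virtual machines and leaves every VM assigned to a host in eu-west-2b, which mirrors the outcome described above.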
AWS Terminology: Each Region is a separate geographic area. Each Region has multiple, isolated locations known as Availability Zones. Each Region is completely independent. Each Availability Zone is isolated, but the Availability Zones in a Region are connected through low-latency links. An Availability Zone can be a single data centre or data centre campus.
You may also want to review VMware Cloud on AWS Deployment Planning, and VMware Cloud on AWS Live Migration Demo. For more information on Stretched Clusters for VMware Cloud on AWS see Overview and Documentation, as well as the following external links:
Availability Zone (AZ) Outage
Before beginning, it is worth reiterating that the following screenshots do not represent a process; the customer / consumer of the service does not need to intervene unless a specific DR strategy has been put in place. In the event of a real-world outage, everything highlighted below happens automatically and is managed and monitored by VMware. You will of course want to be aware of what is happening on the platform hosting your virtual machines, which is why this post gives you a feel for what to expect. It may seem a little underwhelming, as it looks just like a normal vSphere HA failover.
When we start out in this particular environment, the vCenter Server and NSX Manager appliances are located in AZ eu-west-2a.
The AZ failure simulation was initiated by the VMware back-end team. At this point all virtual machines in Availability Zone eu-west-2a went offline, including the example virtual machines shown in the screenshot above. As expected, within 5 minutes vSphere HA automatically brought the machines online in Availability Zone eu-west-2b. All virtual machines were accessible and working without any further action.
The stretched cluster now shows the hosts in AZ eu-west-2a as unresponsive. The hosts in AZ eu-west-2b are still online and able to run virtual machines.
The warning on the hosts located in AZ eu-west-2b is a vSAN warning, raised because cluster nodes are down; this is expected behaviour in the event of host outages.
The vCenter Server and NSX Manager appliances are now located in AZ eu-west-2b.
Availability Zone (AZ) Return to Normal
Once the Availability Zone outage has been resolved and the ESXi hosts have booted, they return as connected in the cluster. As is normal with a vSphere cluster, Distributed Resource Scheduler (DRS) then proceeds to balance resources accordingly.
The vSAN object resync takes place and the health checks all change to green. Again, this happens automatically and is managed and monitored by VMware.
Using a third-party monitoring tool, we can see the brief outage during virtual machine failover, and a server down / return to normal email alert generated for the support team.
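If your monitoring tool can export raw probe results, the length of an outage like this one can be worked out from the timestamps. The sketch below is one possible way to do that in Python. The `(timestamp, is_up)` sample format and the function name `outage_windows` are assumptions for illustration and are not tied to any particular monitoring product.

```python
from datetime import datetime, timedelta


def outage_windows(samples):
    """Return (start, end) pairs for each run of failed probes.

    samples: iterable of (timestamp, is_up) tuples sorted by timestamp.
    An outage starts at the first failed probe and ends at the next
    successful one. An outage still open at the end has end=None.
    """
    windows = []
    start = None
    for ts, is_up in samples:
        if not is_up and start is None:
            start = ts  # first failed probe opens a window
        elif is_up and start is not None:
            windows.append((start, ts))  # first good probe closes it
            start = None
    if start is not None:
        windows.append((start, None))
    return windows


def longest_outage(windows):
    """Return the longest closed outage as a timedelta (zero if none)."""
    closed = [end - start for start, end in windows if end is not None]
    return max(closed, default=timedelta(0))
```

The result can then be compared against the expected recovery time, for example by checking that `longest_outage(...)` does not exceed `timedelta(minutes=5)`. Note that the measured duration is only as precise as the probe interval.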
This ties in with the vSphere HA events recorded for the ESXi hosts and virtual machines, which we can of course view as normal in vCenter.