Author Archives: ESXsi

VMware Cloud on AWS Migration Planning & Lessons Learned

This post pulls together the workload migration and lessons learned notes I have made during the evacuation of an on-premise data centre to VMware Cloud (VMC) on AWS (Amazon Web Services). The content is a work in progress and is intended as a generic list of considerations and useful links; it is not a comprehensive guide. Cloud, more so than traditional infrastructure, is constantly changing. Features are implemented regularly and transparently so always validate against official documentation. This post was last updated on September 16th 2019.

Part 1: SDDC Deployment

Part 2: Migration Planning & Lessons Learned

1. Virtual Machine Migrations

The following points should help with the planning of Virtual Machine (VM) workload migrations. An assumption is made that the Software Defined Data Centre (SDDC) is stood up and operational with monitoring, backups, Anti-Virus, etc. in place. Review Part 1: SDDC Deployment for more information. I found the SDDC deployment and getting the environment available was the easy part. Internal processes and complexity of the existing environment are going to determine how quickly you can migrate workloads to the SDDC.

We started by exporting a list of Virtual Machines from each vCenter. From that list we identified the service each VM was running and the service owner or business owner. The biggest surprise here was the number of servers deployed by, or for, people who had left the organisation. These servers were still being hosted, maintained, and patched, but were no longer needed. We were able to decommission more workloads than expected due to years of VM sprawl. Whilst VMware Cloud on AWS isn’t directly responsible for this, the project forced us to evaluate each server we hosted. For remaining workloads we put together a migration flow which identified the following criteria:

  • CPU, RAM, storage requirements: identified a baseline to automatically accept and then anything above our baseline would require a manual check.
  • Network dependencies: is there a large amount of data in transit, is IP retention required, is the VLAN stretched using Hybrid Cloud Extension (HCX), load balancer requirements.
  • Data flows: used vRealize Network Insight to identify potential egress costs and additional service dependencies.
  • Additional application or organisation specific considerations: e.g. data classification, tagging / charge-back model, backups, security, monitoring, DNS, authentication, licensing or support.
  • Service Management considerations: is the service platinum/gold/silver/bronze or unclassified, do the platform Service Level Agreements (SLAs) fulfil the existing SLAs in place for each service, is the proposed migration type (i.e. amount of downtime) taking this into consideration. Involving Service Management right from the start was useful as they were able to advise on internal processes for Service Acceptance and Business Continuity.
  • Service Owner considerations: if the technical criteria above are met then the next step was to meet with service owners and get their buy-in for the migration. We migrated internal services we owned first, and then used that as a success story to onboard other services. This process involved meeting with various departments, presenting the solution and the benefits over their existing hosting (in our case DR and performance improvements), and migrating dev or test workloads first to build confidence.
  • Migration passport: one of our Senior Engineers came up with this concept as a one-pager for each service that was migrated. It consisted of migration details (change ID, date, status), migration scope (server names, locations, and notes), firewall rules, vRNI outputs, and other information such as associated documentation.
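The baseline check in the first criterion can be sketched in a few lines of Python. This is only an illustration of the idea: the threshold values, the `VM` fields, and the function name `triage` are hypothetical and not taken from our actual migration flow.

```python
from dataclasses import dataclass

# Hypothetical baseline: a VM at or below every value is accepted automatically.
BASELINE = {"vcpu": 8, "ram_gb": 32, "storage_gb": 1000}


@dataclass
class VM:
    name: str
    vcpu: int
    ram_gb: int
    storage_gb: int


def triage(vm, baseline=BASELINE):
    """Return 'auto-accept' if every resource is within baseline, else 'manual-check'."""
    within = (
        vm.vcpu <= baseline["vcpu"]
        and vm.ram_gb <= baseline["ram_gb"]
        and vm.storage_gb <= baseline["storage_gb"]
    )
    return "auto-accept" if within else "manual-check"


# Example inventory, standing in for the list exported from vCenter.
inventory = [
    VM("app01", vcpu=4, ram_gb=16, storage_gb=200),
    VM("db01", vcpu=16, ram_gb=128, storage_gb=4000),
]
results = {vm.name: triage(vm) for vm in inventory}
```

In practice the inventory would come from the vCenter export, and anything flagged for a manual check would go through the remaining criteria by hand.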

Each environment is different so these are provided as example considerations only. Use resources such as those outlined below to develop your own migration strategy.

Workload_Mobility

2. Network Design

  • Research the differences and limitations around the different connection types, especially under 1Gbps – Configuring AWS Direct Connect with VMware Cloud on AWS
    • Make sure you understand the terminology around a Virtual Interface (VIF) and the difference between a Standard VIF, Hosted VIF, and Hosted Connection: What’s the difference between a hosted virtual interface (VIF) and a hosted connection? It is important to consider that VMware Cloud on AWS requires a dedicated Virtual Interface (VIF) – or a pair of VIFs for resilience. If you have a standard 1Gbps or 10Gbps connection direct from Amazon then you can create and allocate VIFs for this purpose. If you are using a hosted connection from an Amazon Partner Network (APN) for sub-1G connectivity then you may need to procure additional VIFs, or a dedicated Direct Connect with the ability to have multiple VIFs on a single circuit. This is a discussion you should have with your APN partner.

  • The Virtual Private Cloud (VPC) provided by the shadow AWS account cannot be used as a transit VPC. In other words if you want to connect to private IP addressing of native AWS services you cannot hop via VMware Cloud. In this instance a Transit Gateway can be used.
  • At the time of writing a VPN attachment must be created to connect the SDDC to a Transit Gateway. If Direct Connect is in use then the minimum requirement is 1Gbps.
  • If there is a requirement to connect multiple existing AWS VPCs, or multiple SDDCs, with on-premise networks then definitely check out VMware Cloud on AWS with Transit Gateway Demo.
  • If a backup VPN is in use then you may be able to reduce failover time using Bidirectional Forwarding Detection (BFD), which is automatically enabled by AWS. In our case it was not supported by our third party provider.
  • Use vRealize Network Insight to get an idea of dependencies and data flows that you can use to plan firewall rules and estimate egress or cross-AZ charges. In general my experience with these charges is that they have been minimal, but this depends entirely on your own environment.
  • If you want to update your default route see How to Set the Default Route in VMware Cloud on AWS: Part 1 & Part 2.
  • VMware Cloud on AWS: NSX Networking and Security eBook
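As a rough illustration of the egress estimate mentioned above, the sketch below averages daily outbound volumes (for example, figures read from vRealize Network Insight) and multiplies by a per-GB rate. The function name and the rate used in the example are placeholders, not real AWS pricing; always take the rate from the current price list.

```python
def estimate_monthly_egress_cost(daily_egress_gb, rate_per_gb, days_in_month=30):
    """Estimate a monthly egress charge from a sample of daily outbound volumes (GB)."""
    if not daily_egress_gb:
        return 0.0
    average_daily_gb = sum(daily_egress_gb) / len(daily_egress_gb)
    return average_daily_gb * days_in_month * rate_per_gb


# Three sampled days of outbound traffic and a placeholder rate of 0.05 per GB.
sample_cost = estimate_monthly_egress_cost([10, 20, 30], rate_per_gb=0.05)
```

A short sample like this only gives an order of magnitude; a longer observation window covering backups and month-end jobs will be more representative.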

3. Load Balancing & Security

  • With the acquisition of Avi Networks we can expect Avi Networks services as a paid add-on for VMware Cloud: VMware Cloud on AWS: NSX and Avi Networks Load Balancing and Security.
  • Third party load balancers such as virtual F5 can be deployed in virtual appliance format. If you are planning on using AWS Elastic Load Balancer (ELB) on a private IP address accessible on-premise ensure you have a connectivity method as outlined above.
  • The NSX Distributed Firewall (DFW) feature is included in the price of VMware Cloud, and the paid-for message is removed from SDDC v1.8 onwards. This was announced at VMworld 2019.
  • Another VMworld 2019 announcement was the inclusion of syslog forwarding with the free version of VMware Cloud Log Intelligence (SaaS offering for log analytics), although for troubleshooting NSX DFW logs you still need the paid-for version.
  • If you are using HCX this product uses its own IPSec tunnel and therefore we could not get it working with the private IP address over a backup VPN. It was assumed that HCX would also not work with the private IP address via Transit Gateway either, due to the SDDC VPN requirement, and would need to be reconfigured to use the public IP address.
  • Another HCX consideration is that when you are stretching a network all traffic goes via the HCX Interconnects. This means you are encapsulating everything in port UDP 4500, and essentially bypassing your on-premise firewall rules while the network is stretched. It is important to double check all rules are correct before eventually moving the gateway to VMC.
  • Again if you are using HCX to migrate workloads, remember to remove stretched networks once complete. This involves shutting down the gateway on-premise, removing the L2 stretch, and changing the network in the SDDC to routed. For us the downtime was around 30 seconds. The deployment of HCX in our environment, although covered by vSphere High Availability (HA), didn’t have resilience built in, so we decided to minimise the amount of time the stretched networks were in use by planning a migration strategy around each subnet.
  • If you use NSX Service Deployments for Anti-Virus, i.e. Guest Introspection for agentless AV then you will need to deploy an agent on each VM, as this feature is still currently unavailable.

4. General

  • The Cloud Services Portal (CSP) can be integrated with enterprise federation, allowing you to control access using your organisational policies, and therefore to enforce Multi-Factor Authentication (MFA) and remove access as part of a leavers process. Federation will only work with a tenant; it will not work with a master organisation.
  • It is not possible at the time of writing to easily transfer an SDDC deployed in the root/master organisation into a tenant. The process currently is a redeploy and migrate.
  • Druva offer a product that will back up Virtual Machines from VMware Cloud on AWS direct into an S3 bucket they manage. For a greenfield deployment, if you are not transferring any existing licenses, this could be a good option as you only pay for the capacity you use. Having a backup environment set up in AWS has many benefits but also adds a management overhead and the consideration of replicating between Availability Zones.
  • In general internal support was good once teams were educated on the platform and the slightly different operating model we were implementing. In terms of external support we have not encountered any compatibility issues yet. There was one application vendor with a published KB article stating they support running the application on VMware Cloud on AWS, who then backtracked and said they wouldn’t support it as vSphere was a version not yet GA (6.8 at the time of writing).

 

Kubernetes on vSphere with Project Pacific

There will be more apps deployed in the next 5 years than in the last 40 years (source: Introducing Project Pacific: Transforming vSphere into the App Platform of the Future). In a shift towards application focus, and to address application support complexities between development and operations teams, VMware have announced Project Pacific. One of the drivers behind Project Pacific is to run Kubernetes components natively in vSphere, enabling the development of portable cloud-native apps. This post gives vSphere administrators an introduction to the technology and how it is expected to work.

Kubernetes Introduction

Kubernetes is an open-source orchestration and management tool that provides a simple Application Programming Interface (API). Kubernetes enables containers to run and operate in a production-ready environment at enterprise scale by managing and automating resource utilisation, failure handling, availability, configuration, scale, and desired state. Micro-services can be rapidly published, maintained, and updated through self-service automation. Kubernetes manages containers, and containers package applications and their dependencies into an image that can be distributed and run almost anywhere; applications are often made up of micro-services. Kubernetes makes it easier to run applications across multiple cloud platforms, accelerates application development and deployment, and increases agility, flexibility, and the ability to adapt to change.

Kubernetes uses a cluster of nodes to distribute container instances. The master node is the control plane, containing the API server and scheduling capabilities. Worker nodes act as compute resources for running workloads (known as pods). A pod consists of one or more running container instances. Cluster data is stored in a key-value store called etcd. Kubelet is an agent that runs on each cluster node, ensuring containers are running in a pod. Kubernetes uses a controller to constantly monitor the state of containers and pods; in the event of an issue Kubernetes attempts to redeploy the pod. If the underlying node in a Kubernetes cluster has issues Kubernetes redeploys pods to another node. Availability is addressed by specifying multiple pod instances in a ReplicaSet. Kubernetes ReplicaSets run all the pods active-active; in the event of a replica failing or a node issue Kubernetes self-heals by re-deploying the pod elsewhere. Kubernetes nodes can run multiple pods, and each pod gets a routable IP address to communicate with other pods.
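The self-healing behaviour described above can be pictured as a reconcile loop that compares desired state with actual state. The sketch below is a toy model in Python, not Kubernetes code: the function name `reconcile`, the `web-N` pod names, and the node names are hypothetical, and the real scheduler weighs far more than pod count when placing a pod.

```python
from collections import Counter
from itertools import count


def reconcile(desired, pods, healthy_nodes):
    """Return a pod-to-node mapping with `desired` replicas on healthy nodes.

    pods maps pod name -> node name and represents the current state.
    """
    # Pods on failed nodes are treated as lost.
    running = {pod: node for pod, node in pods.items() if node in healthy_nodes}
    if not healthy_nodes:
        return running

    load = Counter({node: 0 for node in healthy_nodes})
    load.update(running.values())

    ids = count(1)
    while len(running) < desired:
        name = f"web-{next(ids)}"
        if name in pods:
            continue  # replacement pods get a fresh name
        # Toy placement rule: pick the healthy node running the fewest pods.
        target = min(healthy_nodes, key=lambda node: load[node])
        running[name] = target
        load[target] += 1
    return running


before = {"web-1": "node-a", "web-2": "node-b", "web-3": "node-c"}
# node-a fails, so only node-b and node-c remain healthy.
after = reconcile(3, before, ["node-b", "node-c"])
```

Here the replica that was on the failed node is replaced by a new pod on a surviving node, which is the behaviour a ReplicaSet provides for real workloads.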

Kubernetes namespaces are commonly used to provide multi-tenancy across applications or users, and to manage resource quotas. Kubernetes namespaces segment resources for large teams working on a single Kubernetes cluster. Resources can have the same name as long as they belong to different namespaces; think of them as sub-domains and the Kubernetes cluster as the root domain that the namespace gets attached to. Pods are created on a network internal to the Kubernetes nodes. Because pod IP addresses change as pods are redeployed, pods are normally reached through a Service rather than directly. A Service uses either the cluster network, the node network, or a load balancer to map a stable IP address and port to the pods’ IP addresses and ports, allowing pods distributed across nodes to talk to each other reliably.

Kubernetes can be accessed through a GUI known as the Kubernetes dashboard, or through a command-line tool called kubectl. Both query the Kubernetes API server to get or manage the state of various resources like pods, deployments, services, ReplicaSets, etc. Labels assigned to pods can be used to look up pods belonging to the same application or tier. This helps with inventory management and with defining Services. A Service in Kubernetes allows a group of pods to be exposed by a common IP address, helping you define network routing and load balancing policies without having to understand the IP addressing of individual pods. Persistent storage can be defined using Persistent Volume Claims in a YAML file; the storage provider then mounts a volume on the pod.
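To make the label lookup concrete, the sketch below models equality-based label matching in plain Python. It is only an illustration of the idea: the pod names, labels, and the helper `select_pods` are made up, and it does not call the Kubernetes API.

```python
def select_pods(pods, selector):
    """Return names of pods whose labels contain every key/value in the selector."""
    return [
        pod["name"]
        for pod in pods
        if all(pod["labels"].get(key) == value for key, value in selector.items())
    ]


pods = [
    {"name": "shop-web-1", "labels": {"app": "shop", "tier": "web"}},
    {"name": "shop-web-2", "labels": {"app": "shop", "tier": "web"}},
    {"name": "shop-db-1", "labels": {"app": "shop", "tier": "db"}},
    {"name": "blog-web-1", "labels": {"app": "blog", "tier": "web"}},
]

# A Service for the shop web tier would target these pods.
web_tier = select_pods(pods, {"app": "shop", "tier": "web"})
```

A Service uses the same kind of selector to decide which pods sit behind its common IP address, so pods can come and go without the Service definition changing.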

Native vSphere Integration

Project_Pacific

VMware have re-architected vSphere to include a Kubernetes control plane for managing Kubernetes workloads on ESXi hosts. The control plane is made up of a supervisor cluster using ESXi as the worker nodes, allowing workloads or pods to be deployed and run natively in the hypervisor, alongside traditional workloads. This functionality is provided by a new container runtime built into ESXi called CRX. CRX optimises the Linux kernel and hypervisor, and strips some of the traditional heavy config of a Virtual Machine, enabling the binary image and executable code to be quickly loaded and booted. In combination with ESXi’s powerful scheduler, the container runtime is able to produce some of the performance benchmarks VMware have been publishing, such as improvements even over bare metal. In addition the role of the Kubelet agent is handled by a new ‘Spherelet’ within each ESXi host. Kubernetes uses Container Storage Interface (CSI) to integrate with vSAN and Container Network Interface (CNI) for NSX-T to handle network routing, firewall, load balancing, etc. Kubernetes namespaces are built into vSphere and backed using vSphere Resource Pools and Storage Policies.

Developers use Kubernetes APIs to access the Software Defined Data Centre (SDDC) and ultimately consume Kubernetes clusters as a service using the same application deployment tools they use currently. This service is delivered by Infrastructure Operations teams using existing vSphere tools, with the flexibility of running Kubernetes workloads and Virtual Machine workloads side by side.

By applying application focused management Project Pacific allows application level control over policies, quota, and role-based access for Developers. Service features provided by vSphere such as High Availability (HA), Distributed Resource Scheduler (DRS) and vMotion can be applied at application level across Virtual Machines and containers. Unified visibility in vCenter for Kubernetes clusters, containers, and existing Virtual Machines is provided for a consistent view between Developers and Infrastructure Operations alike.

At the time of writing Project Pacific is in tech preview. Additional useful video tutorials can be found at Kubernetes Academy by VMware and Project Pacific at Tech Field Day Extra at VMworld 2019. This post will be updated when more information is released, continued reading can be found as follows:


VMware Cloud on AWS Backup VPN Failover Demo

This post demonstrates a simulated failure of Amazon Direct Connect, with VMware Cloud (VMC) on Amazon Web Services (AWS). In this setup the standby VPN has been configured to provide connectivity in the event of a Direct Connect failure. The environment consists of a 6 host stretched cluster in the eu-west-2 (London) region, across Availability Zones eu-west-2a and eu-west-2b.

In this instance a pair of hosted private Virtual Interfaces (VIFs) are provided by a Cloud Connect service from a single third party provider. A Route-Based VPN has been configured. Direct Connect with VPN as standby was introduced in SDDC v1.7. For more information see Nico Vibert’s post here.

AWS Direct Connect: “Using AWS Direct Connect, you can establish private connectivity between AWS and your datacenter, office, or colocation environment, which in many cases can reduce your network costs, increase bandwidth throughput, and provide a more consistent network experience than Internet-based connections.”

AWS VPN: “AWS Virtual Private Network (AWS VPN) lets you establish a secure and private tunnel from your network or device to the AWS global network.”

DX_VPN_Setup

Direct Connect Outage

Before beginning it is worth re-iterating that the following screenshots do not represent a process. Providing the backup VPN is configured correctly then the customer / consumer of the service does not need to intervene; in the event of a real world outage everything highlighted below happens automatically. You may also want to review VMware Cloud on AWS Deployment Planning, and additional demo posts: VMware Cloud on AWS Stretched Cluster Failover Demo and VMware Cloud on AWS Live Migration Demo.

Taking down the primary and secondary VIFs was carried out by the hosting third party, to help with providing evidence of network resilience. When we start out in this particular environment the VIFs are attached and available. Servers in VMware Cloud are contactable from on-premise across the Direct Connect. The backup VPN is enabled.

DX_VPN_1 DX_VPN_2

Following disabling of the interfaces by our third party provider the BGP status changes to down, along with the Direct Connect status for both VIFs.

DX_VPN_3 DX_VPN_5

This is confirmed in the AWS console as both the BGP status and therefore the VIF state are down.

DX_VPN_4

With the Direct Connect down routes are redistributed using the backup VPN. The Direct Connect BGP hold timer is 90 seconds and the BGP keep alive is 30 seconds. After 90 seconds the VIF(s) BGP hold time expires and traffic starts to flow through the VPN connection.
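The timer values above give a rough feel for how long failure detection takes. The sketch below is a simplified model which assumes the hold timer restarts each time a keepalive is received and that the link can fail at any point within a keepalive interval; the function name is hypothetical, and the model ignores timer jitter and route convergence after detection.

```python
def detection_window(hold_seconds=90, keepalive_seconds=30):
    """Return (shortest, longest) time from link failure to hold timer expiry.

    Shortest: the link fails just before the next keepalive was due.
    Longest: the link fails immediately after a keepalive was received.
    """
    return (hold_seconds - keepalive_seconds, hold_seconds)


shortest, longest = detection_window()
```

With the 90 second hold timer and 30 second keepalive quoted above, this simple model puts detection somewhere between roughly 60 and 90 seconds after the interfaces go down.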

In the screenshot below you can see an on-premise monitoring solution reporting on a server hosted in VMware Cloud on AWS. The server is available over the Direct Connect, drops, and is then available over the backup VPN after we disable the interfaces to simulate a failure. The test was conducted twice.

VPN_Monitor

vCenter Server External PSC Converge Tool

This post gives an overview of the vCenter Server converge process using the HTML5 vSphere client. The converge functionality was added to the GUI with vSphere 6.7 U2, and enables consolidation of external Platform Services Controller (PSC) into the embedded deployment model. This was previously achieved in vSphere 6.5 onwards using a CLI tool.

Following an upgrade of 4 existing vCenter Servers with external PSC nodes I log into the vSphere client. From the drop-down menu I click Administration, then on the left hand task pane under Deployment I select System Configuration. The starting topology is as follows:

PSC_4

You can view a VMware produced tutorial below, or the documentation here.

As the vCenter Server appliances do not need internet access I need to mount the ISO I used for the vCenter upgrade, see here for more information. This step is not required if internet connectivity exists.

For each vCenter Server with external PSC I select Converge to Embedded.

PSC_1

Next I confirm the Single Sign-On (SSO) details and click Converge.

PSC_3

If I am logged into the vCenter Server being converged I will be kicked out while services are restarted.

PSC_5

Alternatively if I am logged into another vCenter Server in linked mode I can monitor progress.

PSC_6

Once all 4 vCenter Servers have been converged I check that each of the vCenter Servers is using the embedded PSC. To do this, SSH to the vCenter appliance and, in the shell, run:

/usr/lib/vmware-vmafd/bin/vmafd-cli get-ls-location --server-name localhost

The command should return the vCenter Server for the lookup service, and not the external PSC node. Once you are happy there are no outstanding connections to the external PSC nodes remove them by selecting them individually and clicking Decommission PSC.
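If you have several vCenter Servers to check, the comparison can be scripted. The sketch below assumes the command output is a single lookup service URL (collected however you prefer, for example over SSH); the hostnames and helper names are hypothetical.

```python
from urllib.parse import urlparse


def lookup_service_host(command_output):
    """Extract the hostname from a lookup service URL (assumed to be the whole output)."""
    return urlparse(command_output.strip()).hostname or ""


def is_converged(command_output, vcenter_fqdn):
    """True when the lookup service is served by the vCenter Server itself."""
    return lookup_service_host(command_output) == vcenter_fqdn.lower()


# Hypothetical outputs: the first points at the vCenter, the second at an old PSC.
ok = is_converged("https://vc01.example.local:443/lookupservice/sdk\n", "vc01.example.local")
stale = is_converged("https://psc01.example.local/lookupservice/sdk\n", "vc01.example.local")
```

Any vCenter Server that still reports an external PSC hostname should be investigated before the PSC nodes are decommissioned.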

PSC_7 PSC_8

With the converge process now complete and the PSC nodes decommissioned, the topology is as desired with all vCenter Servers running embedded PSC.

PSC_9

At this point I needed to re-register any external appliances (such as NSX Manager) or third party services that are pointing at the lookup service URL, or referencing the old external PSC node. I also cleaned up DNS as part of the decommission process.

VMware Cloud on AWS Deployment Planning

This post pulls together the notes I have made during the planning of a VMware Cloud (VMC) on AWS (Amazon Web Services) deployment, and migrations of virtual machines from traditional on-premise vSphere infrastructure. It is intended as a generic list of considerations and useful links, and is not a comprehensive guide. Cloud, more so than traditional infrastructure, is constantly changing. Features are implemented regularly and transparently so always validate against official documentation. This post was last updated on August 6th 2019.

Part 1: SDDC Deployment

1. Capacity Planning

You can still use existing tools or methods for basic capacity planning, but you should also consult the VMware Cloud on AWS Sizer and TCO Calculator provided by VMware. There is a What-If Analysis built into both vRealize Business and vRealize Operations, which is similar to the sizer tool and can also help with cost comparisons. Additional key considerations are:

  • Egress costs are now a thing! Use vRealize Network Insight to understand…

How VMware is Accelerating NHS Cloud Adoption

This post provides an overview of how the UK National Health Service (NHS) can benefit from VMware Cloud (VMC) on Amazon Web Services (AWS).

In November 2014 the National Information Board and Department of Health and Social Care published the Personalised Health and Care 2020 paper, outlining a framework to support the NHS with making better use of data and technology to improve health and care services. The paper endorsed the use of cloud services, backing up the UK Government cloud first strategy, introduced in 2013.

In January 2018 NHS Digital released guidance for NHS and social care data: off-shoring and the use of public cloud services, along with tools for identifying and assessing data risk classification, and a cloud security one page overview. The paper states that ‘NHS and social care organisations can safely put health and care data, including non-personal data and confidential patient information, into the public cloud’. NHS and Social care providers may use cloud computing services for NHS data, providing it is hosted in the UK, or European Economic Area (EEA), or in the US where covered by Privacy Shield. Steps for understanding the data type, assessing migration risks, and implementing and monitoring data protection controls are also included in the documentation.

The Information Governance (IG) report for Amazon Web Services was updated in 2018; the score approves Amazon Web Services to host and process NHS patient data. VMware Cloud on AWS leverages Amazon’s infrastructure to provide an integrated cloud offering, delivering a highly scalable and secure solution for NHS organisations to migrate workloads and extend their on-premise infrastructure.

The NHS can implement Secure by Design services with VMware Cloud on AWS

  • NHS organisations must be aware of the shared security model that exists between VMware, delivering the service; Amazon Web Services (the IaaS provider), delivering the underlying infrastructure; and customers, consuming the service.
  • The NHS organisation is in complete control of the location of its data. VMware do not back up or archive customer data and therefore it is up to the NHS organisation to implement this functionality.
  • Micro-segmentation can be used to protect applications by ring-fencing virtual machines in a zero trust architecture. The risks of legacy operating systems can be mitigated by isolating them from the rest of the network.
  • NHS organisations can use Role Based Access Control (RBAC) and Multi-Factor Authentication (MFA) to control access to cloud resources. NHS organisations are in control of inbound and outbound firewall rules and can opt to route all traffic internally on private addressing.
  • VMware Cloud on AWS meets a number of security standards such as NIST, ISO, and CIS. Standard Amazon policies for physical security and secure disposal apply. Amazon use self-encrypting disks and manage the keys using Amazon Key Management Service (KMS).
  • VMware implement a number of stringent security controls, for example MFA-generated time-based credentials for support staff (all logged and monitored by a Security Operations Centre (SOC)), vSAN based encryption, and industry-leading commercial solutions to secure, store, and control access to tokens, secrets, passwords, etc. Full details can be found in the VMware Cloud Services on AWS Security Overview.

Additional benefits of VMware Cloud on AWS to the wider NHS, are as follows:

  • The NHS can save time and money by reducing physical or data centre footprint

    • NHS Digital reached an agreement in May 2019 to offer other NHS organisations discounted access to cloud services to help accelerate their journey to the cloud. In addition, a favourable pricing structure is in place for reserved instances should organisations commit for 1 or 3 years.
    • Commissioning new space in a data centre, or even just new hardware, can be a lengthy process. With VMware Cloud an entire virtual data centre can be deployed in around 90 minutes. Extending capacity on demand takes as little as 15 minutes.
  • The NHS can protect existing investments and move to the cloud

    • Existing VMware workloads can be migrated to VMware Cloud on AWS, and back if needed, in minutes without the need to refactor applications.
    • NHS technical staff continue to use the same tools and management capabilities that they currently use day to day.
    • In most cases where products such as Monitoring, Backups, and Anti-Virus, are licensed per host or per number of Virtual Machines (VMs) organisations can adopt a Bring Your Own Licensing (BYOL) approach.
  • The NHS can improve service performance and availability

    • vSAN replication and stretched networks can enhance Disaster Recovery (DR) capabilities. The Stretched-Cluster deployment provides vSphere High Availability (HA) across 2 Amazon Availability Zones within a region with a 99.99% availability commitment. Additional DR services such as Site Recovery Manager (SRM) add-ons are also available.
    • In many cases replacing aging servers and storage infrastructure with the latest hardware and flash based vSAN can yield significant application performance benefits.
    • Physical host capacity can be scaled out dynamically and then back in when it is no longer required. NHS organisations can take advantage of easily spinning up environments to test or develop without having to manually install and configure additional hardware.
  • The NHS has private access to native AWS services

    • VMware Cloud on AWS has a private link into Amazon’s backbone network of services, ranging from storage, database, and network services, to Internet of Things (IoT), Artificial Intelligence (AI) and Machine Learning. Developers can take advantage of various managed container services, or serverless platforms.
    • Since VMware Cloud resides in Amazon’s data centres hybrid configurations can be securely implemented, for example using Amazon’s Elastic Load Balancer with the back end servers in VMC, or Amazon’s Relational Database Service with the application servers in VMC.
  • NHS technical staff will have more time to proactively make improvements to systems and processes

    • Hardware maintenance such as firmware updates, failure remediation, and upgrades are all handled by VMware, as are software updates to the hypervisor and infrastructure management layer.
    • NHS technical staff are responsible for securing applications inside the virtual machine, e.g. operating system updates and firewall configuration, ensuring that Amazon Secure by Design best practices are followed.
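To put the 99.99% availability commitment for stretched clusters into perspective, the arithmetic below converts an availability percentage into the downtime it allows over a year. This is plain arithmetic with a hypothetical function name; how availability is actually measured and credited is defined by the service level agreement, not by this sketch.

```python
def allowed_downtime_minutes(availability_percent, days=365):
    """Minutes of downtime permitted over `days` at the given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * days * 24 * 60


stretched_cluster = allowed_downtime_minutes(99.99)  # roughly 52.6 minutes a year
```

Comparing this figure with the SLAs already agreed for each service is a quick way to see whether the platform commitment is sufficient.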

In summary VMware Cloud on AWS enables NHS organisations to seamlessly extend or migrate data centre workloads to the cloud, whilst enhancing security and availability options. In the example shown below an existing VMware vSphere environment has been extended to VMware Cloud on AWS, giving organisations the flexibility to run their workloads on the most suited platform. This approach is secure and easy for operational teams who may not yet have an established cloud governance process in place.

Additional notes on this design: The Internet Gateway for VMC is not in use. All routes are advertised internally and controlled using on-premise firewalls; in other words all ingress and egress traffic is via the on-premise data centres. Access to native AWS services uses the 25Gbps Elastic Network Interfaces (ENI) and is secured using the gateway firewall and Amazon Security Groups.

NHS_SDDC

Further Reading: VMware Cloud on AWS Deployment Planning | VMware Cloud on AWS Evaluation Guide | VMware Cloud On AWS On-Boarding Handbook | VMware Cloud on AWS Operating Principles

VMware Cloud on AWS FAQs | Resources | Documentation | Factbook

VMware Cloud on AWS Stretched Cluster Failover Demo

This post demonstrates a simulated failure of an Availability Zone (AZ), in a VMware Cloud on AWS stretched cluster. The environment consists of a 6 host stretched cluster in the eu-west-2 (London) region, across Availability Zones eu-west-2a and eu-west-2b.

The simulation was carried out by the VMware Cloud on AWS back-end support team, to help with gathering evidence of AZ resilience. Failover works using vSphere High Availability (HA); in the event of a host failure HA traditionally brings virtual machines online on available hosts in the same cluster. In this scenario when the 3 hosts in AZ eu-west-2a are lost, vSphere HA automatically brings virtual machines online on the remaining 3 hosts in AZ eu-west-2b. High Availability across Availability Zones is facilitated using stretched networks (NSX-T) and storage replication (vSAN).
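For HA to restart everything in the surviving Availability Zone, the remaining hosts need enough headroom to carry the full workload. The sketch below is a simplified capacity check; the per-host capacity and demand figures are invented for illustration, and it ignores vSAN and HA overheads.

```python
def survives_az_loss(hosts_per_az, host_capacity, demand):
    """True if the workload still fits after losing the largest Availability Zone."""
    remaining_hosts = sum(hosts_per_az.values()) - max(hosts_per_az.values())
    return demand <= remaining_hosts * host_capacity


cluster = {"eu-west-2a": 3, "eu-west-2b": 3}

# Hypothetical figures: 512 GB of usable RAM per host.
fits = survives_az_loss(cluster, host_capacity=512, demand=1400)
too_big = survives_az_loss(cluster, host_capacity=512, demand=2000)
```

The same check can be repeated for CPU and storage; in a 3+3 stretched cluster it amounts to keeping utilisation at or below half of the total capacity.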

AWS Terminology: Each Region is a separate geographic area. Each Region has multiple, isolated locations known as Availability Zones. Each Region is completely independent. Each Availability Zone is isolated, but the Availability Zones in a Region are connected through low-latency links. An Availability Zone can be a single data centre or data centre campus.

VMC_Environment

You may also want to review VMware Cloud on AWS Deployment Planning, and VMware Cloud on AWS Live Migration Demo. For more information on Stretched Clusters for VMware Cloud on AWS see Overview and Documentation, as well as the following external links:

VMware FAQ | AWS FAQ | Roadmap | Product Documentation | Technical Overview | VMware Product Page | AWS Product Page | Try first @ VMware Cloud on AWS – Getting Started Hands-on Lab

Availability Zone (AZ) Outage

Before beginning it is worth re-iterating that the following screenshots do not represent a process; the customer / consumer of the service does not need to intervene unless a specific DR strategy has been put in place. In the event of a real world outage everything highlighted below happens automatically and is managed and monitored by VMware. You will of course want to be aware of what is happening on the platform hosting your virtual machines, and that is why this post will give you a feel of what to expect. It may seem a little underwhelming as it does just look like a normal vSphere HA failover.

When we start out in this particular environment the vCenter Server and NSX Manager appliances are located in AZ eu-west-2a.

vcenter-2a

nsx-2a

The AZ failure simulation was initiated by the VMware back-end team. At this point all virtual machines in Availability Zone eu-west-2a went offline, including the example virtual machines shown in the screenshots above. As expected, within 5 minutes vSphere HA automatically brought the machines online in Availability Zone eu-west-2b. All virtual machines were accessible and working without any further action.

The stretched cluster now shows the hosts in AZ eu-west-2a as unresponsive. The hosts in AZ eu-west-2b are still online and able to run virtual machines.

Host-List

The warning on the hosts located in AZ eu-west-2b is a vSAN warning because there are cluster nodes down; this is still expected behaviour in the event of host outages.

eu-west-2b

The vCenter Server and NSX Manager appliances are now located in AZ eu-west-2b.

vcenter-2b

nsx-2b

Availability Zone (AZ) Return to Normal

Once the Availability Zone outage has been resolved, and the ESXi hosts are booted, they return as connected in the cluster. As normal with a vSphere cluster Distributed Resource Scheduler (DRS) will then proceed to balance resources accordingly.

Host-List-Normal

The vSAN object resync takes place and the health checks all change to green. Again this is something that happens automatically, and is managed and monitored by VMware.

vSAN-1

vSAN-2

Using a third party monitoring tool we can see the brief outage during virtual machine failover, and a server down / return to normal email alert generated for the support team.

Monitoring

This ties in with the vSphere HA events recorded for the ESXi hosts and virtual machines which we can of course view as normal in vCenter.

VM-Logs