How to Migrate VMware Virtual Machines to VMware Cloud on AWS

This post pulls together the workload migration planning and lessons-learned notes made during a real-life customer use case: evacuating an on-premise data centre to VMware Cloud (VMC) on AWS (Amazon Web Services). The content is a work in progress and is intended as a generic list of considerations and useful links for both VMware and AWS; it is not a comprehensive guide. Cloud, more so than traditional infrastructure, is continuously changing. Features are implemented regularly and transparently, so always validate against official documentation. This post was last updated on September 16th 2019.

Part 1: SDDC Deployment

Part 2: Migration Planning & Lessons Learned

See Also: VMware Cloud on AWS Security One Stop Shop

1. Virtual Machine Migrations

The following points should help with the planning of Virtual Machine (VM) workload migrations to VMware Cloud on AWS. An assumption is made that the Software-Defined Data Centre (SDDC) is stood up and operational with monitoring, backups, Anti-Virus, etc. in place. Review Part 1: SDDC Deployment for more information. I found the SDDC deployment and getting the environment available was the easy part. Internal processes and complexity of the existing environment are going to determine how quickly you can migrate workloads to the SDDC.

We started by exporting a list of Virtual Machines from each vCenter. For each VM, we then identified the service it was running and the service owner or business owner. The biggest surprise here was the number of servers deployed by, or for, people who had left the organisation. These servers were still being hosted, maintained, and patched, but were no longer needed. We were able to decommission more workloads than expected due to years of VM sprawl. While VMware Cloud on AWS isn’t directly responsible for this, the project forced us to evaluate each server we hosted. For the remaining workloads, we put together a migration flow which identified the following criteria:

  • CPU, RAM, storage requirements: we specified a baseline that was accepted automatically; anything above our baseline required a manual check.
  • Network dependencies: is there a large amount of data in transit, is IP retention required, is the VLAN stretched using Hybrid Cloud Extension (HCX), are there load balancer requirements.
  • Data flows: used vRealize Network Insight to identify potential egress costs and additional service dependencies.
  • Additional application or organisation specific considerations: e.g. data classification, tagging / charge-back model, backups, security, monitoring, DNS, authentication, licensing or support.
  • Service Management considerations: is the service platinum/gold/silver/bronze or unclassified, do the platform Service Level Agreements (SLAs) fulfil the existing SLAs in place for each service, is the proposed migration type (i.e. the amount of downtime) taking this into consideration. Involving Service Management right from the start was useful as they were able to advise on internal processes for Service Acceptance and Business Continuity.
  • Service Owner considerations: once the technical criteria above were met, the next step was to meet with service owners and get their buy-in for the migration. We migrated internal services we owned first and then used that as a success story to onboard other services. This process involved meeting with various departments, presenting the solution and its benefits over their existing hosting (in our case DR and performance improvements), and migrating dev or test workloads first to build confidence.
  • Migration passport: one of our Senior Engineers came up with this concept as a one-pager for each service that was migrated. It consisted of migration details (change ID, date, status), migration scope (server names, locations, and notes), firewall rules, vRNI outputs, and other information such as associated documentation.

Each environment is different, so these are provided as example considerations only. Use resources such as those outlined below to develop your own migration strategy.
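
As a rough illustration of the baseline check described in the first criterion above, the sketch below triages a VM record against a set of thresholds. The threshold values, the `VmRecord` fields, and the `triage` function are all hypothetical and exist only for this example; substitute figures that suit your own environment.

```python
from dataclasses import dataclass

# Hypothetical baseline values, for illustration only.
BASELINE = {"vcpus": 8, "ram_gb": 64, "storage_gb": 2048}


@dataclass
class VmRecord:
    """One row from a (hypothetical) vCenter inventory export."""
    name: str
    vcpus: int
    ram_gb: int
    storage_gb: int


def triage(vm: VmRecord, baseline: dict = BASELINE) -> str:
    """Return 'auto-accept' when every resource is within the baseline,
    otherwise 'manual-check' followed by the resources that exceed it."""
    over = [key for key, limit in baseline.items() if getattr(vm, key) > limit]
    if not over:
        return "auto-accept"
    return "manual-check: " + ", ".join(over)


if __name__ == "__main__":
    for vm in (VmRecord("web01", 4, 16, 200), VmRecord("db01", 16, 128, 200)):
        print(vm.name, "->", triage(vm))
```

Only the first, purely technical criterion lends itself to this kind of automation; the service management and service owner considerations still need a conversation.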

Workload_Mobility

2. Network Design

  • Updated Feb 2020 – see also AWS Native Services Integration With VMware Cloud on AWS
  • Research the differences and limitations around the different VMware on AWS connection types, especially under 1Gbps – Configuring AWS Direct Connect with VMware Cloud on AWS
  • Make sure you understand the terminology around a Virtual Interface (VIF) and the difference between a Standard VIF, Hosted VIF, and Hosted Connection: What’s the difference between a hosted virtual interface (VIF) and a hosted connection? It is important to consider that VMware Cloud on AWS requires a dedicated Virtual Interface (VIF) – or a pair of VIFs for resilience. If you have a standard 1Gbps or 10Gbps connection direct from Amazon then you can create and allocate VIFs for this purpose. If you are using a hosted connection from an Amazon Partner Network (APN) for sub-1G connectivity then you may need to procure additional VIFs, or a dedicated Direct Connect with the ability to have multiple VIFs on a single circuit. This is a discussion you should have with your APN partner.
  • The Virtual Private Cloud (VPC) provided by the shadow AWS account cannot be used as a transit VPC. In other words, if you want to connect to private IP addressing of native AWS services, you cannot hop via VMware Cloud. In this instance, a Transit Gateway can be used.
  • At the time of writing, a VPN attachment must be created to connect the SDDC to a Transit Gateway. If Direct Connect is in use, then the minimum requirement is 1Gbps.
  • If there is a requirement to connect multiple existing AWS VPCs, or multiple SDDCs, with on-premise networks then definitely check out VMware Cloud on AWS with Transit Gateway Demo.
  • If a backup VPN is in use, then you may be able to reduce failover time using Bidirectional Forwarding Detection (BFD), which is automatically enabled by AWS. In our case, it was not supported by our third-party provider.
  • Use vRealize Network Insight to get an idea of dependencies and data flows that you can use to plan firewall rules and estimate egress or cross-AZ charges. In my experience these charges have been minimal; this depends entirely on your own environment, but should be considered when calculating overall VMware on AWS pricing.
  • If you want to update your default route see How to Set the Default Route in VMware Cloud on AWS: Part 1 & Part 2.
  • VMware Cloud on AWS: NSX Networking and Security eBook
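
The egress estimate mentioned in the vRealize Network Insight point above comes down to simple arithmetic once you have average flow volumes. The sketch below is a minimal example under stated assumptions: the flow names and volumes are made up, and `rate_per_gb` is a placeholder rather than a real AWS price, so take the current figure from the official pricing pages.

```python
def estimate_monthly_egress(flows_gb_per_day, rate_per_gb, days=30):
    """Estimate monthly egress volume and cost.

    flows_gb_per_day: mapping of flow name -> average GB per day leaving the SDDC
    rate_per_gb:      price per GB (placeholder, NOT a real AWS rate)
    Returns a (total_gb, estimated_cost) tuple.
    """
    total_gb = sum(flows_gb_per_day.values()) * days
    return total_gb, round(total_gb * rate_per_gb, 2)


# Hypothetical flows, e.g. taken from a network flow report.
example_flows = {"app-to-onprem-db": 12.0, "file-sync": 40.0}
total_gb, cost = estimate_monthly_egress(example_flows, rate_per_gb=0.02)
print(f"{total_gb:.0f} GB/month, estimated cost {cost:.2f}")
```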

3. Load Balancing & Security

  • Updated Feb 2020 – see also VMware Cloud on AWS Security One Stop Shop
  • With the acquisition of Avi Networks, we can expect Avi Networks services as a paid add-on for VMware Cloud: VMware Cloud on AWS: NSX and Avi Networks Load Balancing and Security.
  • Third-party load balancers such as virtual F5 can be deployed in a virtual appliance format. If you are planning on using AWS Elastic Load Balancer (ELB) on a private IP address accessible on-premise ensure you have a connectivity method as outlined above.
  • The NSX Distributed Firewall (DFW) feature is included in the price of VMware Cloud. The paid-for message is removed from SDDC v1.8 onwards; this was announced at VMworld 2019.
  • Another VMworld 2019 announcement was the inclusion of syslog forwarding with the free version of VMware Cloud Log Intelligence (SaaS offering for log analytics). However, for troubleshooting NSX DFW logs you still need the paid-for version.
  • HCX uses its own IPSec tunnel, and therefore we could not get it working with the private IP address over a backup VPN. It was assumed that HCX would not work with the private IP address via Transit Gateway either, due to the SDDC VPN requirement, and would need to be reconfigured to use the public IP address.
  • Another HCX migration consideration is that when you are stretching a network, all traffic goes via the HCX Interconnects. This means you are encapsulating everything in UDP port 4500, and essentially bypassing your on-premise firewall rules while the network is stretched. It is essential to double-check all rules are correct before eventually moving the gateway to VMC.
  • Again, if you are doing VMware HCX migrations, remember to remove stretched networks once complete. This involves shutting down the gateway on-premise, removing the L2 stretch, and changing the network in the SDDC to routed; for us the downtime was around 30 seconds. The deployment of HCX in our environment, although covered by vSphere High Availability (HA), didn’t have resilience built-in; therefore we decided to minimise the amount of time the stretched networks were in use by planning a migration strategy around each subnet.
  • If you use NSX Service Deployments for Anti-Virus, i.e. Guest Introspection for agentless AV, then you will need to deploy an agent on each VM, as this feature is currently unavailable.

4. General

  • The Cloud Services Portal (CSP) can be integrated with enterprise federation, allowing you to control access using your organisational policies and therefore, ideally, to enforce Multi-Factor Authentication (MFA) and remove access as part of a leavers process. Federation will only work with a tenant; it will not work with a master organisation.
  • It is not possible at the time of writing to easily transfer an SDDC deployed in the root/master organisation into a tenant. The process currently is a redeploy and migrate.
  • Druva offers a product that will back up Virtual Machines from VMware Public Cloud direct into an S3 bucket they manage. For a greenfield deployment where you are not transferring any existing licenses, this could be a good option as you only pay for the capacity you use. Having a backup environment set up in AWS has many benefits but also adds a management overhead and the consideration of replicating between Availability Zones.
  • In general internal support was good once teams were educated on the platform and the slightly different operating model we were implementing. In terms of external support, we have not encountered any compatibility issues yet. There was one application vendor with a published KB article stating they support running the application on VMware Cloud on AWS, who then retracted support stating the vSphere version being run was not GA.

Understand VMware Tanzu, Pacific, and Kubernetes for VMware Administrators

This post was last updated 26/10/2019 and provides an overview of VMware Tanzu and Project Pacific.

Peanut Butter & Jelly VMware and Kubernetes

There will be more apps deployed in the next 5 years than in the last 40 years (source: Introducing Project Pacific: Transforming vSphere into the App Platform of the Future). The VMware strategy of late has seen a significant shift towards cloud-agnostic software and the integration of cloud-native application development. In November 2018 VMware announced the Acquisition of Heptio to help accelerate enterprise adoption of Kubernetes on-premise and across multi-cloud environments. In May and August 2019 respectively, VMware announced its intent to Acquire Bitnami and Pivotal Software, following the successful launch of Pivotal Container Service (PKS) which was later re-branded VMware Enterprise PKS.

To help better address application support complexities between development and operations teams, VMware has now announced VMware Tanzu:

“In Swahili, ’tanzu’ means the growing branch of a tree. In Japanese, ’tansu’ refers to a modular form of cabinetry. At VMware, Tanzu represents our growing portfolio of solutions to help you build, run and manage modern apps.”

VMware Tanzu is a portfolio of capabilities that empowers cloud-native development by enabling build, run, and manage operations across platforms. Using VMware Tanzu Mission Control, Kubernetes clusters can be created and managed from a single control point.

Another key announcement alongside VMware Tanzu was code-named Project Pacific, which enables IT operators and developers to build and run modern applications with VMware vSphere and native Kubernetes. Project Pacific is focused on re-architecting vSphere for Kubernetes containers to run alongside VMware Virtual Machines (VMs) in ESXi, enabling the development of portable cloud-native applications and micro-services, while protecting existing investments in products and skills. You can review the press release of all products in the VMware Tanzu portfolio here, and the split of build, run, manage products here.

Introduction to Kubernetes

Kubernetes is an open-source orchestration and management tool that provides a simple Application Programming Interface (API), exposing a set of capabilities for defining workloads and services. Kubernetes enables containers to run and operate in a production-ready environment at an enterprise scale by managing and automating resource utilisation, failure handling, availability, configuration, scale, and desired state. Micro-services can be rapidly published, maintained, and updated.

Containers package applications and their dependencies into a distributable image that can run almost anywhere, simplifying the application path to live. Kubernetes makes it easier to run applications across multiple cloud platforms, accelerates application development and deployment, and increases agility, flexibility, and the ability to adapt to change.

For VMware administrators with little exposure to DevOps, the following high-level resources can help set a foundation understanding of Kubernetes, and why VMware are making some of these critical changes in architecture and strategy. You can try Kubernetes for yourself using the Kubernetes Academy by VMware, or a Kind Way to Learn Kubernetes.

Kubernetes for Executives: “Containers encapsulate an application in a form that’s portable and easy to deploy. Containers can run on any compatible system—in any cloud—without changes. Containers consume resources efficiently, enabling high density and utilization. Kubernetes makes it possible to deploy and run complex applications requiring multiple containers by clustering physical or virtual resources for application hosting. Kubernetes is extensible, self-healing, scales applications automatically, and is inherently multi-cloud.”

 

Introduction to Project Pacific (Run)

Kubernetes uses a cluster of nodes to distribute container instances. The master node forms the control plane, containing the API server and scheduling capabilities. Worker nodes act as compute resources for running workloads (known as pods). VMware has re-designed vSphere to include a Kubernetes control plane for managing Kubernetes workloads on ESXi hosts. The control plane is made up of a supervisor cluster using ESXi as the worker nodes, allowing workloads or pods to be deployed and run natively in the hypervisor, alongside traditional Virtual Machine workloads. This new functionality is provided by a new container runtime built into ESXi called CRX. CRX optimises the Linux kernel and hypervisor, and strips some of the traditional heavy config of a Virtual Machine, enabling the binary image and executable code to be quickly loaded and booted. In combination with ESXi’s powerful scheduler, the container runtime produces some of the performance benchmarks VMware have been publishing, such as improvements even over bare metal.

To ensure containers are running in pods, an agent called a Kubelet runs on Kubernetes cluster nodes. With the supervisor cluster, the role of the Kubelet agent is handled by a new ‘Spherelet’ running on each ESXi host. Pods are created on a network internal to the Kubernetes nodes. Pod IP addresses are not stable, as they change whenever a pod is recreated, so pods are not normally addressed directly; instead a Service is created. A Service in Kubernetes allows a group of pods to be exposed by a common IP address, helping define network routing and load balancing policies without having to understand the IP addressing of individual pods.
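
To make the Service concept a little more concrete, here is a minimal sketch that builds a standard Kubernetes Service manifest as a Python dictionary and prints it as JSON. The service name, label, and port numbers are made up for the example, and nothing here is specific to Project Pacific.

```python
import json


def service_manifest(name, selector, port, target_port):
    """Build a minimal Kubernetes Service manifest.

    selector:    labels identifying the pods behind the Service
    port:        port exposed on the Service's common IP address
    target_port: port the containers in those pods listen on
    """
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "selector": selector,
            "ports": [
                {"protocol": "TCP", "port": port, "targetPort": target_port}
            ],
        },
    }


# Hypothetical example: expose every pod labelled app=web on port 80.
manifest = service_manifest("web", {"app": "web"}, 80, 8080)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, as kubectl accepts JSON as well as YAML manifests.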

Another of the great features of Kubernetes is namespaces. Namespaces are commonly used to provide multi-tenancy across applications or users, and to manage resource quotas (backed in this instance by vSphere Resource Pools). Kubernetes namespaces segment resources for large teams working on a single Kubernetes cluster. Resources can have the same name as long as they belong to different namespaces, think of them as sub-domains and the Kubernetes cluster as the root domain the namespace gets attached to. Multiple namespaces can exist within the supervisor cluster, with different storage policies assigned to them, for persistent storage, etc.
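
As a hedged illustration of namespaces and quotas in generic Kubernetes terms, the sketch below generates a Namespace manifest and a matching ResourceQuota. The namespace name and quota figures are hypothetical, and how Project Pacific maps these onto vSphere Resource Pools is not shown here.

```python
import json


def namespace_with_quota(namespace, cpu_requests, memory_requests, max_pods):
    """Return a Namespace manifest plus a ResourceQuota scoped to it."""
    ns = {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": namespace},
    }
    quota = {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        # The quota object lives inside the namespace it constrains.
        "metadata": {"name": f"{namespace}-quota", "namespace": namespace},
        "spec": {
            "hard": {
                "requests.cpu": cpu_requests,
                "requests.memory": memory_requests,
                "pods": str(max_pods),
            }
        },
    }
    return [ns, quota]


# Hypothetical tenant: a development team limited to 4 CPUs, 8Gi RAM, 20 pods.
for doc in namespace_with_quota("team-dev", "4", "8Gi", 20):
    print(json.dumps(doc, indent=2))
```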

Kubernetes can be accessed through a GUI known as the Kubernetes dashboard, or through a command-line tool called kubectl. Both query the Kubernetes API server to get or manage the state of various resources like pods, deployments, and services. Labels assigned to pods can be used to look up pods belonging to the same application, tier, or service. With Project Pacific, developers use Kubernetes APIs to access the Software-Defined Data Centre (SDDC) and ultimately consume Kubernetes clusters as a service using the same application deployment tools they use currently. This service is delivered by Infrastructure Operations teams using existing vSphere tools, with the flexibility of running Kubernetes workloads and Virtual Machine workloads side by side.
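
The label lookup mentioned above can be pictured as a simple filter. The sketch below is a toy model of equality-based label selection in plain Python; the pod names and labels are invented, and a real cluster performs this matching on the API server.

```python
def select_pods(pods, selector):
    """Return the names of pods whose labels include every key/value pair
    in the selector (equality-based matching only)."""
    return [
        pod["name"]
        for pod in pods
        if all(pod["labels"].get(key) == value for key, value in selector.items())
    ]


# Hypothetical pods for illustration.
pods = [
    {"name": "shop-web-1", "labels": {"app": "shop", "tier": "frontend"}},
    {"name": "shop-web-2", "labels": {"app": "shop", "tier": "frontend"}},
    {"name": "shop-db-1", "labels": {"app": "shop", "tier": "database"}},
    {"name": "blog-web-1", "labels": {"app": "blog", "tier": "frontend"}},
]

print(select_pods(pods, {"app": "shop", "tier": "frontend"}))
```

Against a real cluster, the equivalent query would be `kubectl get pods -l app=shop,tier=frontend`.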

By applying application-focused management, Project Pacific allows application-level control over policies, quota, and role-based access for Developers. Service features provided by vSphere such as High Availability (HA), Distributed Resource Scheduler (DRS) and vMotion can be applied at application level across Virtual Machines and containers. Unified visibility in vCenter for Kubernetes clusters, containers, and existing Virtual Machines is provided for a consistent view between Developers and Infrastructure Operations alike.

The following resources provide further reading on Project Pacific for enabling Kubernetes on vSphere.

Project_Pacific

Introduction to VMware Tanzu Mission Control (Manage)

VMware Tanzu Mission Control brings together Kubernetes clusters providing operator consistency for deployment, configuration, security, and policy enforcement across multiple clouds while maintaining developer independence and self-service.

VMware Tanzu Mission Control is a Software as a Service (SaaS) control plane offering, allowing administrators to deploy, monitor, and manage all Kubernetes clusters from a single point of control. The beauty of this approach is that lifecycle management, access management, health and diagnostics, security and configuration policies, quota management, and backup or restore capabilities are all consolidated into a single toolset. Kubernetes clusters running on vSphere, VMware Enterprise or Essential PKS, Public Cloud (AWS, Microsoft Azure, Google Cloud Platform), and managed services or other implementations can all be attached to VMware Tanzu Mission Control. New Kubernetes clusters can also be deployed to all of these platforms from the Tanzu Mission Control interface.

For more information on VMware Tanzu Mission Control, see the product page here, and Introducing VMware Tanzu Mission Control to Bring Order to Cluster Chaos. If you are attending VMworld Europe 2019 have a look through VMware Tanzu Sessions in the content catalog and also Explore Kubernetes at VMworld 2019. At the time of writing, VMware Tanzu and Project Pacific are in tech preview; this post will be updated when more information is released. Please use the comments section below if you feel any key elements are missing or not explained clearly. There are some additional useful video tutorials available from the Project Pacific at Tech Field Day Extra at VMworld 2019.

 

Watch a Failover from Direct Connect to Backup VPN for VMware Cloud on AWS

This post demonstrates a simulated failure of Amazon Direct Connect, with VMware Cloud (VMC) on Amazon Web Services (AWS). In this setup, the standby VPN has been configured to provide connectivity in the event of a Direct Connect failure. The environment consists of a six-host stretched cluster in the eu-west-2 (London) region, across Availability Zones eu-west-2a and eu-west-2b.

In this instance, a pair of hosted private Virtual Interfaces (VIFs) are provided by a Cloud Connect service from a single third-party provider. A Route-Based VPN has been configured. Direct Connect with VPN as standby was introduced in SDDC v1.7. For more information, see Nico Vibert’s post here.

AWS Direct Connect: “Using AWS Direct Connect, you can establish private connectivity between AWS and your datacenter, office, or colocation environment, which in many cases can reduce your network costs, increase bandwidth throughput, and provide a more consistent network experience than Internet-based connections.”

AWS VPN: “AWS Virtual Private Network (AWS VPN) lets you establish a secure and private tunnel from your network or device to the AWS global network.”

DX_VPN_Setup

Direct Connect Outage

Before beginning, it is worth re-iterating that the following screenshots do not represent a process. Providing the backup VPN is configured correctly, then the customer/consumer of the service does not need to intervene; in the event of a real-world outage, everything highlighted below happens automatically. You may also want to review further reading: How to Deploy and Configure VMware Cloud on AWS (Part 1), How to Migrate VMware Virtual Machines to VMware Cloud on AWS (Part 2), plus additional demo post Watch VMware vSphere HA Recover Virtual Machines Across AWS Availability Zones.

Taking down the primary and secondary VIFs was carried out by the hosting third party, to help with providing evidence of network resilience. When we start out in this particular environment, the VIFs are attached and available. Servers in VMware Cloud are contactable from on-premise across the Direct Connect. The backup VPN is enabled.

DX_VPN_1 DX_VPN_2

Following disabling of the interfaces by our third-party provider, the BGP and Direct Connect status changes to down.

DX_VPN_3 DX_VPN_5

This is confirmed in the AWS console, where the BGP status, and therefore the VIF state, is down.

DX_VPN_4

With the Direct Connect down, routes are redistributed using the backup VPN. The Direct Connect BGP hold timer is 90 seconds, and the BGP keepalive is 30 seconds. After 90 seconds, the VIF(s) BGP hold time expires, and traffic starts to flow through the VPN connection.
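
The timer arithmetic can be sketched as follows. This is a simplified model, not a description of how AWS or VMware implement failover: it assumes the hold timer restarts each time a keepalive is received, and it ignores the time routes take to converge over the VPN. The function name is hypothetical.

```python
HOLD_TIME_S = 90   # Direct Connect BGP hold timer, as quoted above
KEEPALIVE_S = 30   # BGP keepalive interval, as quoted above


def seconds_until_hold_expiry(seconds_since_last_keepalive):
    """Seconds between the link failing and the BGP hold timer expiring.

    Assumption: the hold timer restarts on every keepalive received, so a
    failure just after a keepalive waits the full hold time, while a failure
    just before the next keepalive waits slightly over 60 seconds.
    """
    if not 0 <= seconds_since_last_keepalive < KEEPALIVE_S:
        raise ValueError("expected a value within one keepalive interval")
    return HOLD_TIME_S - seconds_since_last_keepalive


for elapsed in (0, 15, 29):
    print(f"failure {elapsed}s after last keepalive -> "
          f"hold timer expires {seconds_until_hold_expiry(elapsed)}s later")
```

Under these assumptions the outage seen by a monitoring tool would be up to 90 seconds plus any convergence time, which is where BFD, if supported by your provider, can help.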

In the screenshot below you can see an on-premise monitoring solution reporting on a server hosted in VMware Cloud on AWS. The server is available over the Direct Connect, drops when the interfaces are disabled to simulate a failure, and is then available again over the backup VPN. The test was conducted twice.

VPN_Monitor