This post demonstrates a simulated failure of AWS Direct Connect with VMware Cloud (VMC) on Amazon Web Services (AWS). In this setup, a standby VPN has been configured to provide connectivity in the event of a Direct Connect failure. The environment consists of a six-host stretched cluster in the eu-west-2 (London) region, spanning Availability Zones eu-west-2a and eu-west-2b.
In this instance, a pair of hosted private Virtual Interfaces (VIFs) is provided by a Cloud Connect service from a single third-party provider, and a Route-Based VPN has been configured as the backup. Direct Connect with VPN as standby was introduced in SDDC v1.7. For more information, see Nico Vibert’s post here.
AWS Direct Connect: “Using AWS Direct Connect, you can establish private connectivity between AWS and your datacenter, office, or colocation environment, which in many cases can reduce your network costs, increase bandwidth throughput, and provide a more consistent network experience than Internet-based connections.”
AWS VPN: “AWS Virtual Private Network (AWS VPN) lets you establish a secure and private tunnel from your network or device to the AWS global network.”
Direct Connect Outage
Before beginning, it is worth reiterating that the following screenshots do not represent a process to follow. Provided the backup VPN is configured correctly, the customer/consumer of the service does not need to intervene; in the event of a real-world outage, everything highlighted below happens automatically. You may also want to review further reading: How to Deploy and Configure VMware Cloud on AWS (Part 1), How to Migrate VMware Virtual Machines to VMware Cloud on AWS (Part 2), plus additional demo posts: Watch VMware vSphere HA Recover Virtual Machines Across AWS Availability Zones and Watch a Virtual Machine Live Migration to VMware Cloud on AWS.
The primary and secondary VIFs were taken down by the hosting third party to help provide evidence of network resilience. At the start of the test in this particular environment, the VIFs are attached and available, servers in VMware Cloud are contactable from on-premises across the Direct Connect, and the backup VPN is enabled.
After our third-party provider disables the interfaces, both the BGP status and the Direct Connect status change to down.
This is confirmed in the AWS console, where the BGP status, and therefore the VIF state, is shown as down.
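The same check can be scripted rather than read from the console. The following is a minimal sketch, assuming the response shape of the Direct Connect `DescribeVirtualInterfaces` API (available in boto3 as `describe_virtual_interfaces`). The helper name `vif_health`, the VIF IDs, and the sample response are hypothetical and are not taken from this environment.

```python
def vif_health(response):
    """Summarise VIF state and BGP peer status from a
    DescribeVirtualInterfaces-style response dictionary."""
    health = {}
    for vif in response.get("virtualInterfaces", []):
        state = vif.get("virtualInterfaceState", "unknown")
        bgp = [peer.get("bgpStatus", "unknown") for peer in vif.get("bgpPeers", [])]
        health[vif["virtualInterfaceId"]] = {
            "state": state,
            "bgp": bgp,
            # Healthy only if the VIF is available and every BGP peer is up.
            "healthy": state == "available" and bool(bgp) and all(s == "up" for s in bgp),
        }
    return health


# Hypothetical sample shaped like the API output during the simulated outage.
sample = {
    "virtualInterfaces": [
        {
            "virtualInterfaceId": "dxvif-primary",
            "virtualInterfaceState": "down",
            "bgpPeers": [{"bgpStatus": "down"}],
        },
        {
            "virtualInterfaceId": "dxvif-secondary",
            "virtualInterfaceState": "down",
            "bgpPeers": [{"bgpStatus": "down"}],
        },
    ]
}

for vif_id, info in vif_health(sample).items():
    print(vif_id, info["state"], info["bgp"], "healthy" if info["healthy"] else "UNHEALTHY")

# Against a live account (assumes boto3 is installed and credentials are configured):
#   import boto3
#   client = boto3.client("directconnect", region_name="eu-west-2")
#   print(vif_health(client.describe_virtual_interfaces()))
```

Keeping the parsing separate from the API call means the logic can be exercised against a saved response without touching a live connection.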
With the Direct Connect down, routes are redistributed using the backup VPN. The Direct Connect BGP hold timer is 90 seconds, and the BGP keepalive interval is 30 seconds. Once the 90-second BGP hold time on the VIF(s) expires, traffic starts to flow through the VPN connection.
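To put rough numbers on the failover delay, the sketch below works through the timer arithmetic. It assumes standard BGP behaviour, in which the hold timer restarts each time a keepalive arrives, so the delay observed from the moment of failure depends on how long ago the last keepalive was received. The function name `detection_window` is hypothetical, and the result is an estimate rather than a measured figure.

```python
def detection_window(hold_time_s, keepalive_s):
    """Return (shortest, longest) delay in seconds between a link failure
    and BGP hold timer expiry, assuming the hold timer restarts on every
    keepalive received from the peer."""
    if keepalive_s <= 0 or hold_time_s < keepalive_s:
        raise ValueError("expected 0 < keepalive <= hold time")
    # Longest case: the link fails just after a keepalive arrived,
    # so the full hold time must elapse.
    longest = hold_time_s
    # Shortest case: the link fails just before the next keepalive was due,
    # so one keepalive interval has already been used up.
    shortest = hold_time_s - keepalive_s
    return shortest, longest


shortest, longest = detection_window(hold_time_s=90, keepalive_s=30)
print(f"Failover expected roughly {shortest}-{longest} seconds after the failure")
```

With the 90-second hold timer and 30-second keepalive described above, the drop seen by a monitoring tool would be expected to last somewhere between about 60 and 90 seconds, plus any time needed for routes to converge over the VPN.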
In the screenshot below, you can see an on-premises monitoring solution reporting on a server hosted in VMware Cloud on AWS. The server is available over the Direct Connect, drops when we disable the interfaces to simulate a failure, and is then available again over the backup VPN. The test was conducted twice.
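If your monitoring tool can export its probe history, the outage length can be calculated instead of being read off a graph. The following is a minimal sketch; the function name `outage_windows`, the 30-second probe interval, and the sample data are all hypothetical and do not reproduce the results of the test above.

```python
def outage_windows(samples):
    """Given (timestamp_seconds, reachable) probe results sorted by time,
    return a list of (first_failed_probe, first_recovered_probe) pairs.
    An outage still in progress at the end is reported with None."""
    windows = []
    start = None
    for timestamp, reachable in samples:
        if not reachable and start is None:
            start = timestamp              # outage begins
        elif reachable and start is not None:
            windows.append((start, timestamp))  # outage ends
            start = None
    if start is not None:
        windows.append((start, None))
    return windows


# Hypothetical probe log: one probe every 30 seconds.
probes = [(0, True), (30, True), (60, False), (90, False),
          (120, False), (150, True), (180, True)]

for began, ended in outage_windows(probes):
    if ended is None:
        print(f"Outage began at {began}s and is ongoing")
    else:
        print(f"Outage from {began}s to {ended}s ({ended - began}s)")
```

The measured duration is only as precise as the probe interval, so a 30-second poll can overstate or understate the real outage by up to one interval at each end.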