This post demonstrates a simulated failure of AWS Direct Connect with VMware Cloud (VMC) on Amazon Web Services (AWS). In this setup, a standby VPN has been configured to provide connectivity in the event of a Direct Connect failure. The environment consists of a 6-host stretched cluster in the eu-west-2 (London) region, across Availability Zones eu-west-2a and eu-west-2b.
In this instance, a pair of hosted private Virtual Interfaces (VIFs) is provided by a Cloud Connect service from a single third-party provider, and a Route-Based VPN has been configured as the standby. Direct Connect with VPN as standby was introduced in SDDC v1.7. For more information see Nico Vibert’s post here.
AWS Direct Connect: “Using AWS Direct Connect, you can establish private connectivity between AWS and your datacenter, office, or colocation environment, which in many cases can reduce your network costs, increase bandwidth throughput, and provide a more consistent network experience than Internet-based connections.”
AWS VPN: “AWS Virtual Private Network (AWS VPN) lets you establish a secure and private tunnel from your network or device to the AWS global network.”
Direct Connect Outage
Before beginning, it is worth reiterating that the following screenshots do not represent a process. Provided the backup VPN is configured correctly, the customer / consumer of the service does not need to intervene; in the event of a real-world outage, everything highlighted below happens automatically. You may also want to review VMware Cloud on AWS Deployment Planning, and additional demo posts: VMware Cloud on AWS Stretched Cluster Failover Demo and VMware Cloud on AWS Live Migration Demo.
The primary and secondary VIFs were taken down by the hosting third party to help provide evidence of network resilience. At the start of the test in this particular environment, the VIFs are attached and available, servers in VMware Cloud are contactable from on-premises across the Direct Connect, and the backup VPN is enabled.
Once our third-party provider disables the interfaces, the BGP status changes to down, along with the Direct Connect status for both VIFs.
This is confirmed in the AWS console, where the BGP status, and therefore the VIF state, shows as down for both VIFs.
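The same check can be scripted rather than read from the console. The following is a minimal sketch, assuming the response shape returned by the Direct Connect DescribeVirtualInterfaces API (for example via boto3's `describe_virtual_interfaces()`); the helper names and the VIF IDs in the sample data are hypothetical and were not part of the original test.

```python
# Sketch only: summarise VIF and BGP state from a DescribeVirtualInterfaces-style
# response. With boto3 the response could be fetched like this (assumption, not
# something the original test did):
#   response = boto3.client("directconnect").describe_virtual_interfaces()

def summarize_vifs(response):
    """Map each VIF ID to (VIF state, list of BGP peer statuses)."""
    summary = {}
    for vif in response.get("virtualInterfaces", []):
        bgp = [peer.get("bgpStatus", "unknown") for peer in vif.get("bgpPeers", [])]
        state = vif.get("virtualInterfaceState", "unknown")
        summary[vif["virtualInterfaceId"]] = (state, bgp)
    return summary


def all_bgp_sessions_down(response):
    """True when at least one VIF is listed and no BGP peer reports 'up'."""
    summary = summarize_vifs(response)
    if not summary:
        return False
    return all(status != "up" for _, peers in summary.values() for status in peers)


# Hypothetical sample data standing in for a real API response.
sample_response = {
    "virtualInterfaces": [
        {"virtualInterfaceId": "dxvif-example1", "virtualInterfaceState": "down",
         "bgpPeers": [{"bgpStatus": "down"}]},
        {"virtualInterfaceId": "dxvif-example2", "virtualInterfaceState": "down",
         "bgpPeers": [{"bgpStatus": "down"}]},
    ]
}

if __name__ == "__main__":
    print(summarize_vifs(sample_response))
    print(all_bgp_sessions_down(sample_response))
```

Keeping the parsing separate from the API call means the logic can be exercised against saved responses without needing AWS credentials.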
With the Direct Connect down, routes are redistributed using the backup VPN. The Direct Connect BGP hold timer is 90 seconds and the BGP keepalive interval is 30 seconds. Once the 90-second hold time expires on the VIFs' BGP sessions, traffic starts to flow through the VPN connection.
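To make the timer arithmetic concrete, here is a small sketch of the detection window. It assumes standard BGP behaviour, where the hold timer restarts each time a keepalive is received, and assumes no faster failure detection (such as BFD) is in use. Under those assumptions the session is declared down somewhere between roughly hold time minus keepalive interval and the full hold time after the failure, depending on when the last keepalive arrived. The function name is hypothetical, and route convergence after detection is not included.

```python
def detection_window(hold_time_s=90, keepalive_s=30):
    """Return (earliest, latest) seconds after a failure at which BGP declares the peer down.

    Assumes the hold timer was last reset by a keepalive received somewhere
    between 0 and keepalive_s seconds before the failure.
    """
    if keepalive_s <= 0 or hold_time_s < keepalive_s:
        raise ValueError("expected 0 < keepalive_s <= hold_time_s")
    earliest = hold_time_s - keepalive_s  # failure just before the next keepalive was due
    latest = hold_time_s                  # failure just after a keepalive was received
    return earliest, latest


if __name__ == "__main__":
    earliest, latest = detection_window()
    print(f"BGP peer declared down between ~{earliest}s and {latest}s after failure")
```

With the 90-second hold timer and 30-second keepalive quoted above, this gives a window of about 60 to 90 seconds, so 90 seconds is best read as the upper bound.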
In the screenshot below you can see an on-premises monitoring solution reporting on a server hosted in VMware Cloud on AWS. The server is available over the Direct Connect, drops when we disable the interfaces to simulate a failure, and is then available again over the backup VPN. The test was conducted twice.
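If the monitoring tool can export its probe results, the length of each drop can be worked out from the raw samples. The sketch below assumes a simple list of (timestamp in seconds, reachable) pairs; the function name and the sample values are hypothetical and are not measurements from this test.

```python
def outage_windows(samples):
    """Return (start, end) pairs for each run of failed probes.

    samples: iterable of (timestamp_seconds, reachable) in time order.
    end is the timestamp of the first successful probe after the outage,
    or None if the data ends while the target is still unreachable.
    """
    windows = []
    start = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts                    # first failed probe of a new outage
        elif ok and start is not None:
            windows.append((start, ts))   # first good probe closes the outage
            start = None
    if start is not None:
        windows.append((start, None))
    return windows


# Hypothetical probe data: one probe every 10 seconds, one simulated drop.
example_samples = [(0, True), (10, True)] \
    + [(t, False) for t in range(20, 110, 10)] \
    + [(110, True), (120, True)]

if __name__ == "__main__":
    for start, end in outage_windows(example_samples):
        duration = "ongoing" if end is None else f"{end - start}s"
        print(f"outage from t={start}s, duration {duration}")
```

The measured duration is only as precise as the probe interval, so a 10-second interval can overstate or understate each outage by up to one interval.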