Storage Connectivity Loss with VMCP

This post looks at VM Component Protection (VMCP) and how it helps protect vSphere 6 environments from storage connectivity loss. When a host loses access to a storage device, it marks the device as being in one of the following states:

PDL (Permanent Device Loss)

A device is marked as permanently lost if the storage array responds with a SCSI sense code indicating that the device is unavailable. This can happen when a LUN fails, or when it is unmapped at the storage array while still in use by vSphere. Because the array and the host can still communicate, the array is able to issue SCSI sense codes describing the state of the device. At this point the host stops sending I/O requests and labels the device permanently unavailable.

APD (All Paths Down)

If no PDL SCSI sense code is returned, the device is marked as All Paths Down (APD) and the ESXi host continues to send I/O requests until it receives a response. This can happen when a fibre channel switch or HBA fails. The ESXi host cannot determine whether the device loss is permanent or transient, so it indefinitely retries virtual machine I/O from the hostd agent. In vSphere 5.x an APD timeout was introduced for non-virtual machine I/O.
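
To make the distinction concrete, here is a minimal Python sketch of the rule described above: an inaccessible device is treated as PDL only when the array returned a recognised sense code, and as APD otherwise. The function name and the contents of PDL_SENSE_CODES are my own illustration (I have included only the "logical unit not supported" code as an example), not the logic ESXi actually runs.

```python
from enum import Enum
from typing import Optional, Tuple

# (sense key, additional sense code, qualifier) as returned by the array.
SenseCode = Tuple[int, int, int]


class DeviceState(Enum):
    PDL = "Permanent Device Loss"
    APD = "All Paths Down"


# Illustrative set only: sense codes treated here as "device permanently gone".
PDL_SENSE_CODES = {
    (0x5, 0x25, 0x00),  # LOGICAL UNIT NOT SUPPORTED
}


def classify_device_loss(sense: Optional[SenseCode]) -> DeviceState:
    """Classify an inaccessible device from the sense code, if any was returned.

    None means the array did not answer at all (for example a failed
    switch or HBA), so the host cannot tell whether the loss is permanent.
    """
    if sense is not None and sense in PDL_SENSE_CODES:
        return DeviceState.PDL
    return DeviceState.APD
```

Calling classify_device_loss(None) models the APD case, where nothing comes back from the array and the host keeps retrying.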

VMCP (VM Component Protection)

VMCP is a high availability feature, introduced in vSphere 6.0, that helps detect and respond to PDL and APD events. If a device enters a permanent device loss state, vSphere can take one of the following actions:

  • Do nothing (disabled)
  • Issue an event to notify administrators
  • Restart the virtual machines on a host which still has access to the storage

If a device enters an all paths down state, vSphere can take one of the following actions:

  • Do nothing (disabled)
  • Issue an event to notify administrators
  • Restart the virtual machines on other hosts only if there is sufficient capacity to do so (conservative)
  • Restart the virtual machines on other hosts regardless of the response from the HA master (aggressive)

It is also possible to configure a delayed VM failover for APD and automatically reset a virtual machine if APD recovers before the VM failover timeout. This is useful for applications which may become unstable after a storage outage.
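
The response options above can be modelled as a small decision function. This is only a sketch of the behaviour as described in this post: the response names ("event", "restart_conservative" and so on), the parameters and the returned action strings are hypothetical labels of mine, not vSphere identifiers.

```python
from typing import Optional

PDL_RESPONSES = ("disabled", "event", "restart")
APD_RESPONSES = ("disabled", "event", "restart_conservative", "restart_aggressive")


def pdl_action(response: str) -> str:
    """Map a configured PDL response to the action taken."""
    if response not in PDL_RESPONSES:
        raise ValueError(f"unknown PDL response: {response}")
    return {"disabled": "none", "event": "notify", "restart": "restart"}[response]


def apd_action(
    response: str,
    capacity_confirmed: bool,
    recovered_after_s: Optional[float],
    failover_delay_s: float,
    reset_on_recovery: bool,
) -> str:
    """Map a configured APD response to the action taken.

    recovered_after_s is how long the APD lasted, or None if it never cleared.
    """
    if response not in APD_RESPONSES:
        raise ValueError(f"unknown APD response: {response}")
    if response == "disabled":
        return "none"
    if response == "event":
        return "notify"
    # APD cleared before the failover delay elapsed: no failover, optional reset.
    if recovered_after_s is not None and recovered_after_s < failover_delay_s:
        return "reset" if reset_on_recovery else "none"
    # Aggressive restarts regardless; conservative needs confirmed capacity.
    if response == "restart_aggressive" or capacity_confirmed:
        return "restart"
    return "none"
```

For example, apd_action("restart_conservative", False, None, 180, False) leaves the virtual machine where it is, while the same call with "restart_aggressive" restarts it elsewhere.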

Speaking from personal experience, a storage connectivity loss can be troublesome to identify and keep on top of, especially when intermittent. VMCP can’t fix underlying issues with your storage array or at the fabric layer, but it can quickly do the legwork of determining which hosts still have access to the storage and automatically bring virtual machines back online where possible.

Configuring VMCP

In a vSphere 6.x environment, VMCP is configured in the vSphere HA options on the Manage tab at cluster level. As this is a new feature, it needs to be switched on within the vSphere Web Client. First tick the box to enable VM Component Protection, then configure the relevant responses for datastores in PDL and APD states.
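
If you keep your intended cluster settings in a script or a runbook, a quick sanity check along these lines can catch the obvious mistakes before you touch the client. It is a rough sketch that assumes responses have no effect until VMCP itself is enabled; the VmcpSettings class and its field names are hypothetical and do not correspond to the vSphere API.

```python
from dataclasses import dataclass
from typing import List

VALID_PDL = {"disabled", "event", "restart"}
VALID_APD = {"disabled", "event", "restart_conservative", "restart_aggressive"}


@dataclass
class VmcpSettings:
    """Hypothetical record of the choices made in the vSphere HA options."""
    enabled: bool = False
    pdl_response: str = "disabled"
    apd_response: str = "disabled"

    def problems(self) -> List[str]:
        """Return a list of human-readable issues; empty means it looks sane."""
        found = []
        if self.pdl_response not in VALID_PDL:
            found.append(f"unknown PDL response: {self.pdl_response}")
        if self.apd_response not in VALID_APD:
            found.append(f"unknown APD response: {self.apd_response}")
        responses_set = (
            self.pdl_response != "disabled" or self.apd_response != "disabled"
        )
        if not self.enabled and responses_set:
            found.append("responses are set but VMCP itself is not enabled")
        return found
```

Note that the conservative option is only accepted for APD here, matching the lists earlier in the post.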

