This post covers the importance of using Storage Policies with VMware Cloud on AWS to ensure the most efficient consumption of the available vSAN capacity. As a VMware Cloud on AWS customer, this is something we initially overlooked, leaving the default policy in place for some time. The result was that we burned through storage capacity more quickly than expected.
It is important to be aware that VMware requires 30% free/slack space to keep vSAN operational. Whilst compute features such as Elastic DRS can be disabled, clusters will automatically scale out when a storage threshold is hit, in order to maintain the integrity of vSAN and the associated Service Level Agreements (SLAs). Customers should monitor their datastore and capacity usage to avoid the unexpected charge of additional hosts being added.
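To make the slack-space requirement concrete, here is a minimal sketch of the arithmetic. The host count, capacities, and current usage below are hypothetical illustrations, not values returned by the VMC console or API:

```python
# Rough headroom check against the 30% slack-space requirement described above.
# All figures are hypothetical examples; check the VMC console for real numbers.

RAW_CAPACITY_TIB = 4 * 10.0   # e.g. four i3.metal hosts at ~10TiB raw each
SLACK_FRACTION = 0.30         # VMware keeps 30% free to keep vSAN operational

usable_before_scale_out = RAW_CAPACITY_TIB * (1 - SLACK_FRACTION)

used_tib = 26.5               # hypothetical current consumption
headroom_tib = usable_before_scale_out - used_tib

print(f"Raw capacity:                      {RAW_CAPACITY_TIB:.1f} TiB")
print(f"Usable before scale-out threshold: {usable_before_scale_out:.1f} TiB")
print(f"Headroom remaining:                {headroom_tib:.1f} TiB")
```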
First of all, this excellent blog post by Glenn Sizemore identifies how Storage Policies influence usable capacity in VMware Cloud on AWS. When capacity planning, you should also review the Storage Capacity and Data Redundancy section of the VMware Cloud documentation.
SDDC Storage Options
When deploying the SDDC, the customer has the option of deploying a Stretched Cluster. Although a standard cluster provides High Availability (HA) between hosts, offering 99.9% availability, it is restricted to a single Availability Zone (AZ) within a region. A Stretched Cluster is spread across 2 Availability Zones within a region, backed by a 99.99% availability SLA, with a third AZ acting as a witness. At the time of writing, it is not possible to mix cluster types within an SDDC.
There are currently 3 methods of consuming storage with VMware Cloud on AWS:
- Direct Attached NVMe
- i3.metal instances provide fixed capacity in the form of NVMe SSDs with high IOPS. This storage type is suitable for most use cases, including workloads with high transaction rates such as databases, high-speed analytics, and virtual desktops. Each i3.metal instance offers 36 CPU cores, 512GiB RAM, and 10TiB of direct-attached, high-IOPS NVMe vSAN storage.
- Elastic vSAN
- r5.metal instances provide dynamic capacity using Amazon Elastic Block Store (EBS), suitable for high or changing capacity needs and lower transaction rates: data warehousing, batch processing, disaster recovery, etc. Each r5.metal instance offers 48 CPU cores, 768GiB RAM, and 15-35TiB of EBS-backed, cloud-native Elastic vSAN built on General Purpose SSD (gp2) volumes.
- External Storage
- Finally, storage can be scaled with external storage from a Managed Service Provider (MSP) such as Faction, with others such as Rackspace and NetApp to follow. At the time of writing, the VMware Cloud public roadmap also lists ‘External datastore and guest OS storage access for both DX and ENI connected 3rd party storage’ as developing, although this may change.
- There are additional instance types in development with further storage options.
- Management servers are stored on the first cluster provisioned (Cluster0), with datastores acting as administrative delegation points. The management datastore is managed by VMware, and the workload datastore is for the customer to consume.
- Data-at-rest encryption is provided by vSAN using the AWS Key Management Service (KMS). VMware Operations owns the KMS relationship and the Customer Master Key (CMK). The service is FIPS 140-2 compliant with full auditing; although this is sufficient for most use cases, there is currently no supported option for customers who must own the KMS relationship themselves.
- Storage efficiencies can typically be achieved at a ratio of around 1.5:1 using vSAN deduplication and compression, as in the quick example below.
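A quick worked example of how that ratio multiplies out (the raw figure is hypothetical):

```python
# Effective capacity at a typical 1.5:1 dedupe/compression ratio (illustrative only)
raw_tib = 15.0        # hypothetical raw vSAN capacity
dedupe_ratio = 1.5
effective_tib = raw_tib * dedupe_ratio
print(f"{raw_tib} TiB raw -> ~{effective_tib} TiB effective at {dedupe_ratio}:1")
```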
With VMware Cloud on AWS, the customer sets the desired end state and VMware manage the configuration. This means the technical details are not that in-depth; the customer tells VMware how many hosts they want in the cluster, and which policies to apply. Storage Policies can be applied to dynamic groups using tagging, to individual Virtual Machines (VMs), or to individual Virtual Machine Disks (VMDKs), allowing for granular, object-based configuration.
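For illustration, here is a minimal pyVmomi sketch of attaching an existing storage policy to a VM at reconfigure time. It assumes you already know the policy's profile ID (normally retrieved via the SPBM API or from the vSphere UI); the hostname, credentials, VM name, and profile ID below are all placeholders, and this is a sketch rather than production code:

```python
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

# Placeholder connection details -- substitute your own SDDC vCenter values.
ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.sddc.example.com",
                  user="cloudadmin@vmc.local",
                  pwd="********", sslContext=ctx)
content = si.RetrieveContent()

# Locate the VM by name using a container view.
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.VirtualMachine], True)
vm = next(v for v in view.view if v.name == "example-vm")  # hypothetical VM name
view.DestroyView()

# Attach the storage policy to the VM home object. The profile ID is a
# placeholder. Note that per-VMDK assignment would instead use a deviceChange
# entry with a profile set on the individual virtual disk.
profile = vim.vm.DefinedProfileSpec(profileId="aa6d5a82-1c88-45da-85d3-example")
spec = vim.vm.ConfigSpec(vmProfile=[profile])
task = vm.ReconfigVM_Task(spec=spec)

Disconnect(si)
```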
Storage Policies can be used to define settings such as disk stripes, IOPS limits, space or cache reservations, and availability. In this particular use case, we are interested in weighing the availability options against space efficiency (see the capacity calculator sketch after this list):
- No data redundancy: requires 1 host; 100GB of data used writes 100GB of data on the back end.
- RAID1 FTT1: requires 3 hosts; 100GB of data used writes 200GB of data on the back end. In this scenario, vSAN adds a second copy of the data, plus a witness copy to prevent a split-brain situation. The object stays available with protection against 1 failed component, such as a host or disk. Although storage consumption is doubled, reads are load balanced across both copies to improve performance; writes, however, must still be committed synchronously to each copy.
- RAID1 FTT2: requires 5 hosts; 100GB of data used writes 300GB of data on the back end. You can increase the number of failures to tolerate, but storage consumption continues to grow with it.
- RAID5 or RAID6 with Erasure Coding: with an erasure coding policy, instead of storing a complete copy, data is broken up into multiple segments with parity. We can lose a chunk of that data (one for RAID5, two for RAID6) without suffering data loss; however, there is additional I/O associated with managing the parity. Furthermore, in the event of a failure the data has to be rebuilt, meaning a potential compute and I/O overhead during this time. Despite this, the policy proves useful for space efficiency where workloads are not hugely performance intensive.
- RAID5 FTT1: RAID5 configuration with erasure coding and failures to tolerate set to 1; expect this to require approximately 1.3x capacity.
- RAID6 FTT2: RAID6 configuration with erasure coding and failures to tolerate set to 2; expect this to require approximately 1.5x capacity.
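To make the trade-offs easier to compare, here is a small back-of-the-envelope calculator built from the per-policy multipliers above. The multipliers and the optional 1.5:1 dedupe ratio are the estimates quoted in this post, not values from any VMware API:

```python
# Back-of-the-envelope raw-capacity calculator using the multipliers above.
# Figures are the estimates quoted in this post, not API-derived values.

POLICY_MULTIPLIER = {
    "No data redundancy": 1.0,
    "RAID1 FTT1": 2.0,
    "RAID1 FTT2": 3.0,
    "RAID5 FTT1": 1.3,
    "RAID6 FTT2": 1.5,
}

def raw_gb_consumed(logical_gb: float, policy: str,
                    dedupe_ratio: float = 1.0) -> float:
    """Estimate back-end consumption for a given policy.

    dedupe_ratio=1.5 models the typical vSAN dedupe/compression saving.
    """
    return logical_gb * POLICY_MULTIPLIER[policy] / dedupe_ratio

for policy in POLICY_MULTIPLIER:
    print(f"{policy:20s} 100GB logical -> "
          f"{raw_gb_consumed(100, policy):.0f}GB raw")
```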
Each of the capacity estimates above is within a vSAN fault domain, which equates to an Availability Zone for VMC on AWS. If you are using a Stretched Cluster, further attributes can be applied concerning the Availability Zone (or site) location of the data. You can choose from the following options when configuring Storage Policies:
- Dual-site mirroring (stretched cluster)
- None, keep data on primary (stretched cluster)
- None, keep data on secondary (stretched cluster)
When using dual-site mirroring, the amount of storage consumed is doubled. For example, a RAID5 FTT1 policy using dual-site mirroring would require approximately 2.6x capacity, and a RAID6 FTT2 policy would require 3x capacity; in other words, 100GB of data would consume 300GB on disk but be resilient across fault domains/sites.
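A quick standalone check of that arithmetic, reusing the estimated single-site multipliers from earlier:

```python
# Dual-site mirroring doubles the single-site multiplier (estimates from this post).
for policy, single_site in [("RAID5 FTT1", 1.3), ("RAID6 FTT2", 1.5)]:
    logical_gb = 100
    raw_gb = logical_gb * single_site * 2   # mirrored across both AZs
    print(f"{policy} + dual-site mirroring: {logical_gb}GB -> ~{raw_gb:.0f}GB raw")
```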
- In a Stretched Cluster, there is a feature called read locality, which keeps reads within the same AZ. Remember, though, that writes must be committed synchronously across both AZs.
- Not all of your workloads will need vSphere HA protection across AZs; for example, domain controllers, backup proxies, dev/test workloads, or workloads where failover is provided in the application stack, such as SQL Always On.