This post covers the importance of using Storage Policies with VMware Cloud on AWS to ensure the most efficient consumption of the available vSAN capacity. As a VMware Cloud on AWS customer, this is something we initially overlooked, allowing the default policy to remain in place for some time. The end result was burning through the storage capacity faster than expected.
It is important to note here that VMware recommends keeping 30% free (slack) space to keep vSAN operational. Compute features such as Elastic DRS can be disabled, but to maintain the integrity of vSAN and the associated Service Level Agreements (SLAs), clusters will automatically scale out if a storage threshold is hit. Customers should monitor their datastore and capacity usage to avoid unexpected charges from additional hosts being added.
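As a rough illustration of the 30% slack guideline, the check below reports how much free space a datastore has and how much more can be written before dipping into the recommended slack. This is only a sketch: the function name and the example figures are hypothetical, and it does not model the actual threshold at which VMware adds a host, which this post does not state.

```python
RECOMMENDED_SLACK = 0.30  # fraction of capacity VMware recommends keeping free


def slack_report(capacity_tib: float, used_tib: float) -> dict:
    """Summarise free space against the 30% slack recommendation.

    capacity_tib and used_tib are values you would read from your own
    datastore monitoring; nothing here talks to vCenter.
    """
    if capacity_tib <= 0:
        raise ValueError("capacity_tib must be positive")
    free_fraction = (capacity_tib - used_tib) / capacity_tib
    # Space left before the free space drops below the recommended slack.
    headroom_tib = capacity_tib * (1 - RECOMMENDED_SLACK) - used_tib
    return {
        "free_fraction": round(free_fraction, 4),
        "below_recommended_slack": free_fraction < RECOMMENDED_SLACK,
        "headroom_tib": round(headroom_tib, 2),
    }


if __name__ == "__main__":
    # Hypothetical cluster: 40 TiB of capacity with 30 TiB consumed.
    print(slack_report(40.0, 30.0))
```

A negative `headroom_tib` means the datastore is already eating into the recommended slack.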
First of all, this excellent blog post by Glenn Sizemore explains how Storage Policies influence usable capacity in VMware Cloud on AWS. When capacity planning, you should also review the Storage Capacity and Data Redundancy section of the VMware Cloud documentation.
SDDC Storage Options
When deploying the SDDC, the customer has the option of deploying a Stretched Cluster. Although a single cluster provides High Availability (HA) between hosts with a 99.9% availability guarantee, it is restricted to a single Availability Zone (AZ) within a region. A Stretched Cluster is spread across 2 Availability Zones within a region, with a third AZ acting as the witness, and is backed by a 99.99% availability guarantee. At the time of writing, it is not possible to mix cluster types within an SDDC.
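To put the two availability figures in perspective, the arithmetic below converts a percentage into the maximum downtime it would allow over a year. This is a simplification that assumes a 365-day year and a straightforward reading of the percentage; the actual SLA terms and measurement period are defined by VMware and are not covered here.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year


def max_downtime_hours(availability_percent: float) -> float:
    """Hours per year a service could be down at the given availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for label, pct in (("single-AZ cluster", 99.9), ("stretched cluster", 99.99)):
        hours = max_downtime_hours(pct)
        print(f"{label}: {pct}% allows about {hours:.2f} hours ({hours * 60:.0f} minutes) per year")
```

Under those assumptions, 99.9% works out to roughly 8.76 hours a year, while 99.99% is a little under an hour.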
There are currently 3 methods of consuming storage with VMware Cloud on AWS:
- Direct Attached NVMe
- i3.metal instances provide fixed capacity in the form of NVMe SSDs with high IOPS. This storage type is suitable for most use cases, including workloads with high transaction rates such as databases, high-speed analytics, and virtual desktops. The i3.metal instances offer 36 CPU, 512GiB RAM, and 10TiB direct-attached NVMe high IOPS vSAN storage.
- Elastic vSAN
- r5.metal instances provide dynamic capacity using Amazon Elastic Block Store (EBS), suitable for high or changing capacity needs and lower transaction rates, such as data warehousing, batch processing, and disaster recovery. The r5.metal instances offer 48 CPU, 768GiB RAM, and 15-35TiB of EBS storage, providing cloud-native Elastic vSAN made up of General Purpose SSDs (gp2).
- External Storage
- Finally, storage can be scaled with external storage from a Managed Service Provider (MSP) like Faction, who are closely followed by others such as Rackspace and NetApp. Currently, the VMware Cloud public roadmap also lists ‘External datastore and guest OS storage access for both DX and ENI connected 3rd party storage’ as developing, although this may change.
- There are additional instance types in development with further storage options.
- Management servers are stored on the first cluster provisioned (Cluster0) with datastores as administrative delegation points. The management datastore is managed by VMware, and the workload datastore is for the customer to consume.
- Data-at-rest encryption is provided by vSAN using the AWS Key Management Service (KMS). VMware Operations own the KMS relationship and Customer Master Key (CMK). The service is FIPS 140-2 compliant with full auditing. Although this is sufficient for most use cases, there currently isn’t a supported option for customers who must own the KMS relationship themselves.
With VMware Cloud on AWS, the customer sets the desired end state and VMware manages the configuration. This means the technical details are not that in-depth; the customer tells VMware how many hosts they want in the cluster, and the policies to apply. Storage Policies can be applied to one or many Virtual Machines (VMs), Virtual Machine Disks (VMDKs), or VMDKs for container persistent volumes. In other words, they are applied at the object level, rather than to an entire datastore.
Storage Policies can be used to define things like disk stripes, IOPS limits, space or cache reservation, and availability. In this particular use case, we are interested in weighing up the availability options with space efficiency:
- No data redundancy: requires 1 host, 100GB of data used writes 100GB of data in the back end.
- Tolerate 1 host failure: RAID1, requires 3 hosts, 100GB of data used writes 200GB of data in the back end. In this scenario, vSAN adds a second copy of the data, and a witness copy to prevent a split-brain situation. We can lose any 1 of the 3 hosts, and the object stays available. Although the storage consumption has doubled, reads are load balanced to accelerate performance; writes still need to be synchronously committed.
- Tolerate 2 host failures: RAID1, requires 5 hosts, 100GB of data used writes 300GB of data in the back end. You get the picture: with mirroring, storage consumption can get quite high.
- Erasure Coding: with an Erasure Coding policy, rather than storing a complete second copy, the data is broken up into multiple data segments with parity. We can lose any 1 chunk of that data and not suffer data loss; however, there is additional I/O associated with managing the parity copy. Furthermore, in the event of a failure, the data has to be rebuilt, meaning the potential for a compute and I/O overhead during this time. Despite this, the policy proves useful for space efficiency where workloads are not hugely performance intensive.
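The trade-off between availability and space efficiency in the list above can be sketched as a small lookup. The mirroring figures (host counts and 1x/2x/3x consumption) come straight from this post. The erasure coding entry is an assumption added for comparison: it uses the commonly documented vSAN RAID5 layout of 3 data segments plus 1 parity segment, needing 4 hosts and roughly 1.33x consumption. The policy keys and function name are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    description: str
    min_hosts: int
    multiplier: float  # back-end GB written per GB of data used


POLICIES = {
    "none": Policy("No data redundancy", 1, 1.0),
    "raid1_ftt1": Policy("RAID1, tolerate 1 host failure", 3, 2.0),
    "raid1_ftt2": Policy("RAID1, tolerate 2 host failures", 5, 3.0),
    # Assumption, not from the post: RAID5 (3 data + 1 parity), 4 hosts minimum.
    "raid5_ftt1": Policy("RAID5 erasure coding, tolerate 1 host failure", 4, 4 / 3),
}


def backend_gb(used_gb: float, policy_key: str, hosts: int) -> float:
    """Back-end capacity consumed by used_gb of data under a given policy."""
    policy = POLICIES[policy_key]
    if hosts < policy.min_hosts:
        raise ValueError(
            f"{policy.description} needs at least {policy.min_hosts} hosts, got {hosts}"
        )
    return used_gb * policy.multiplier


if __name__ == "__main__":
    for key, policy in POLICIES.items():
        consumed = backend_gb(100, key, hosts=6)
        print(f"{policy.description}: 100GB used -> {consumed:.0f}GB in the back end")
```

Because policies are applied per object, the same calculation can be run VM by VM (or VMDK by VMDK) and summed, rather than assuming one multiplier for the whole datastore.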
If you are using a Stretched Cluster, further attributes can be applied concerning the Availability Zone (or site) location of the data. You can choose from the following options when configuring Storage Policies:
- None (standard cluster)
- Dual-site mirroring (stretched cluster)
- None, keep data on primary (stretched cluster)
- None, keep data on secondary (stretched cluster)
- In a Stretched Cluster, there is a feature called read locality, which keeps reads within the same AZ. Remember though that writes must be synchronous across both AZs.
- Data transfer fees are $0.02/GB for cross-AZ traffic. Tools like Live Optics can be used to predict application reads and writes.
- Not all your workloads will need vSphere HA protection across AZs; examples include developer workloads, or workloads where failover is provided in the application stack, such as SQL Always On.
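For a back-of-the-envelope view of the cross-AZ charge, the sketch below multiplies a daily write volume by the $0.02/GB rate quoted above. It assumes that only writes from VMs mirrored across both sites cross the AZ boundary (reads stay local thanks to read locality), and it ignores resync traffic and any detail of how AWS actually meters the transfer. The function name and the sample figures are hypothetical; real write volumes would come from a tool such as Live Optics.

```python
CROSS_AZ_RATE_USD_PER_GB = 0.02  # rate quoted in this post


def monthly_cross_az_cost(
    write_gb_per_day: float,
    days: int = 30,
    rate_usd_per_gb: float = CROSS_AZ_RATE_USD_PER_GB,
) -> float:
    """Estimated monthly cross-AZ charge for synchronously mirrored writes."""
    return round(write_gb_per_day * days * rate_usd_per_gb, 2)


if __name__ == "__main__":
    # Hypothetical workload writing 500GB per day under a dual-site policy.
    print(f"${monthly_cross_az_cost(500):.2f} per month")
```

Workloads pinned to a single site with one of the "None, keep data on..." options would not incur this mirrored-write traffic, which is one reason to match the policy to what each workload actually needs.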