How Can CloudHealth by VMware Help Me?

Introduction

Over the past 12 months we have seen further growth within the cloud, as many organisations scale or create new digital services in response to the coronavirus pandemic. Improved speed and agility have allowed businesses to pivot where traditional siloed infrastructure may have caused them to stall.

As the usage of cloud services expands, standardising and consolidating cloud tooling becomes important for financial management, operational governance, and security and compliance. Visibility into distributed system architectures across many accounts or subscriptions, or even multi-cloud, is another key challenge. For some customers cloud workloads are not optimised or configured to best standards; many will spend more than their anticipated budget, and others may accidentally expose data or services.

Those with an established cloud strategy may decide to implement a Cloud Centre of Excellence (CCoE), responsible for cloud operations, security, and financial management. The CCoE will navigate the security and configuration landscape of cloud assets, automating response and remediation to configuration drift or threats. As the team grows in maturity, optimisations are made continuously and automatically, in line with the key drivers of the business. This is where CloudHealth comes in.

CloudHealth by VMware is a multi-cloud SaaS solution managing more than $11B of public cloud spend for over 10,000 customers. CloudHealth accelerates business transformation in the cloud by providing a single platform solution for visibility into AWS, Microsoft Azure, Google Cloud Platform, Oracle Cloud Infrastructure, VMware Cloud on AWS, and on-premises VMware-based environments. The key functionality is broken down into the two products we’ll look at below.

CloudHealth Multicloud Platform

CloudHealth takes data from cloud platforms, data centres, and third party tools for application, security, and configuration management. Data is ingested and aggregated using CloudHealth’s integrated data layer, which performs analysis on usage, performance, cost, and security posture. CloudHealth becomes a single source for multi-cloud management across environments, strengthening security and compliance, consolidating management, and improving collaboration between previously siloed teams of people and tools.

Data and assets can be categorised by tags or other metadata, and viewed in logical business groups known as perspectives. Perspectives provide a breakdown for cost allocation using dynamic groups such as line of business, department, cost centre, or project. The output can be used to identify trends and build dashboards and reports. This approach simplifies financial management, saves time, aids with budgeting and forecasting, and encourages accountability through accurate chargeback or showback.
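To make the idea of tag-based cost allocation concrete, here is a minimal Python sketch. It is not the CloudHealth API or data model; the line items, field names, and the `allocate_costs` helper are all hypothetical and only illustrate grouping spend by a tag such as department, with untagged resources collected into a single bucket.

```python
from collections import defaultdict

# Hypothetical cost line items; the field names are illustrative only
# and do not reflect the CloudHealth data model or API.
line_items = [
    {"resource": "i-0a1", "cost": 120.0, "tags": {"department": "finance"}},
    {"resource": "i-0b2", "cost": 80.0, "tags": {"department": "engineering"}},
    {"resource": "vol-0c3", "cost": 15.5, "tags": {}},
    {"resource": "i-0d4", "cost": 40.0, "tags": {"department": "finance"}},
]


def allocate_costs(items, tag_key, unallocated="unallocated"):
    """Sum cost per value of tag_key; untagged items share one bucket."""
    totals = defaultdict(float)
    for item in items:
        group = item["tags"].get(tag_key, unallocated)
        totals[group] += item["cost"]
    return dict(totals)


showback = allocate_costs(line_items, "department")
for group, total in sorted(showback.items()):
    print(f"{group}: {total:.2f}")
```

The size of the unallocated bucket is a useful signal in its own right, since chargeback or showback is only as accurate as the tagging behind it.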

CloudHealth Cost Dashboard

Whilst visibility is great, to really have a positive impact on operations we need to know what to do with the data collected. CloudHealth presents back cost optimisation recommendations and security risks, but can also carry out remediation actions automatically.

Cost optimisation is where you can save money. Using AWS as an example, savings are typically found in EC2 instances that are oversized or on an inefficient purchase plan, Elastic IP addresses or EBS volumes that are not attached to any resources, and snapshots that have not been deleted. In the physical on-premises world all of these issues were common as part of VM sprawl; they impacted capacity planning and resource consumption but were mostly hidden or swallowed as part of the wider infrastructure cost. As organisations shift from large capital investments to ongoing revenue and consumption-based pricing, oversized or unused resources literally convert to money going out of the door every single month.
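The kind of check behind these recommendations can be sketched in a few lines of Python. This is an illustration only: the inventory records, field names, and the 90-day snapshot threshold are assumptions made for the example, not the AWS or CloudHealth data format or the rules CloudHealth actually applies.

```python
from datetime import date

# Hypothetical inventory records; field names are illustrative and are
# not the AWS or CloudHealth API response format.
VOLUMES = [
    {"id": "vol-1", "attached_to": "i-0a1", "monthly_cost": 10.0},
    {"id": "vol-2", "attached_to": None, "monthly_cost": 25.0},
]
ELASTIC_IPS = [
    {"id": "eip-1", "associated_with": None, "monthly_cost": 3.6},
    {"id": "eip-2", "associated_with": "i-0a1", "monthly_cost": 0.0},
]
SNAPSHOTS = [
    {"id": "snap-1", "created": date(2020, 1, 15), "monthly_cost": 4.0},
    {"id": "snap-2", "created": date(2021, 2, 1), "monthly_cost": 2.0},
]


def find_waste(volumes, elastic_ips, snapshots, today, max_snapshot_age_days=90):
    """Return (kind, resource id, monthly cost) for each likely-unused resource."""
    findings = []
    for volume in volumes:
        if volume["attached_to"] is None:
            findings.append(("unattached volume", volume["id"], volume["monthly_cost"]))
    for ip in elastic_ips:
        if ip["associated_with"] is None:
            findings.append(("unassociated Elastic IP", ip["id"], ip["monthly_cost"]))
    for snap in snapshots:
        # The age threshold is an assumed policy, not a CloudHealth default.
        if (today - snap["created"]).days > max_snapshot_age_days:
            findings.append(("old snapshot", snap["id"], snap["monthly_cost"]))
    return findings


findings = find_waste(VOLUMES, ELASTIC_IPS, SNAPSHOTS, today=date(2021, 3, 1))
for kind, resource_id, cost in findings:
    print(f"{kind}: {resource_id} ({cost:.2f} per month)")
```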

CloudHealth Health Check

Recommendations and actions are where CloudHealth carries out remediation for incorrectly configured or under-utilised resources. Policies can also be used to define desired states and ensure operational compliance. For example, an organisation may want to report on untagged resources, connected accounts, or open ports. The number of available actions currently appears to only cover AWS and Azure, but with support recently added for Oracle Cloud Infrastructure, and Google Cloud Platform before that, hopefully this functionality will continue to be built out.
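As a rough illustration of a desired-state policy such as reporting on untagged resources, the Python sketch below checks a set of resources against a list of required tags. The tag names, resource records, and `missing_tags` helper are hypothetical; CloudHealth policies are defined in the platform itself rather than in code like this.

```python
# Hypothetical policy: every resource must carry these tags.
REQUIRED_TAGS = {"owner", "cost-centre"}

# Hypothetical resource records, not a CloudHealth or cloud provider format.
resources = [
    {"id": "i-0a1", "tags": {"owner": "alice", "cost-centre": "1001"}},
    {"id": "i-0b2", "tags": {"owner": "bob"}},
    {"id": "vol-0c3", "tags": {}},
]


def missing_tags(resource, required=REQUIRED_TAGS):
    """Return the required tags that are absent from a resource, sorted."""
    return sorted(required - set(resource["tags"]))


# Map each non-compliant resource to the tags it is missing.
violations = {r["id"]: missing_tags(r) for r in resources if missing_tags(r)}
for resource_id, tags in violations.items():
    print(f"{resource_id} is missing: {', '.join(tags)}")
```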

CloudHealth Remediation Actions

At the time of writing CloudHealth is priced based on cloud spend, and can be purchased as a 1, 2, or 3 year prepaid commitment, or with variable pricing based on the previous month’s cloud spend. A free trial is available here to uncover ROI in your own environment.

Where VMware environments are in use with vRealize Operations, the CloudHealth management pack for vRealize Operations can be installed. Bringing CloudHealth dashboards and perspectives into vROps allows IT ops teams to track on-premises infrastructure and public cloud costs from a single interface. The CloudHealth management pack for vROps can be downloaded from the VMware Marketplace; instructions are here.

CloudHealth Secure State

By default CloudHealth provides real-time information on security risk exposure, but for deep-dive visibility and remediation those who are serious about security will want to look at Secure State. CloudHealth Secure State is available with CloudHealth or standalone, and currently supports AWS, Azure, and GCP.

Dashboards within CloudHealth Secure State enable at-a-glance checks on security posture and compliance. There are over 700 built-in security rules and compliance frameworks that can be used as security guardrails, with the ability to add custom rules and frameworks on top.

As systems become distributed over multiple accounts, subscriptions, or even clouds, the dynamics of securing an organisation’s assets shift significantly. Previously all services were contained within a data centre, secured firstly using perimeter firewalls and then with micro-segmentation. IT teams were generally in control and had visibility throughout the corporate network. Nowadays a developer or user responsible for a service can potentially open applications or data to the public, either on purpose or by accident. Cloud security guardrails form an important baseline for security posture and cloud strategy. Security guardrails are made up of critical must-have configurations in policies with auto-remediation actions attached. They help avoid mistakes or configuration drift, ultimately reducing security risk.
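A guardrail pairs a rule with a remediation action. The Python sketch below shows that pairing on an in-memory example: the rule flags ingress from anywhere on a sensitive port, and the action removes the offending entry. The port list, record layout, and function names are assumptions for illustration and are not how Secure State rules are written.

```python
# Assumed set of ports that should never be open to the internet.
SENSITIVE_PORTS = {22, 3389}


def is_violation(rule):
    """Rule: ingress from any address on a sensitive port is not allowed."""
    return rule["cidr"] == "0.0.0.0/0" and rule["port"] in SENSITIVE_PORTS


def enforce(security_group):
    """Remediation: remove offending ingress rules and return what was removed."""
    removed = [r for r in security_group["ingress"] if is_violation(r)]
    security_group["ingress"] = [
        r for r in security_group["ingress"] if not is_violation(r)
    ]
    return removed


# Hypothetical security group record, not a cloud provider API format.
group = {
    "id": "sg-123",
    "ingress": [
        {"port": 443, "cidr": "0.0.0.0/0"},
        {"port": 22, "cidr": "0.0.0.0/0"},
        {"port": 22, "cidr": "10.0.0.0/8"},
    ],
}

removed = enforce(group)
print(f"removed {len(removed)} rule(s) from {group['id']}")
```

Running the rule continuously, rather than once, is what protects against configuration drift: a change that reintroduces the open port is detected and reverted on the next evaluation.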

CloudHealth Secure State gives further visibility into resource relationships and context, using the Explore UI. Explore enables a powerful model of multi-cloud or account architectures, with visual topology diagrams of complex environments. Cyber security analysts or operations centres can drill down into individual resources with all interoperable components and dependencies already mapped out.

CloudHealth Secure State Dashboard
CloudHealth Secure State Compliance

Cloud Disaster Recovery Options for VMware Virtual Machines

Introduction

In my day job I am often asked about the use of cloud for disaster recovery. Some organisations only operate out of a single data centre, or building, while others have a dual-site setup but want to explore a third option in the cloud. Either way, using cloud resources for disaster recovery can be a good way to learn and validate different technologies, potentially with a view to further migrations as data centre and hardware contracts expire.

This post takes a look at the different cloud-based disaster recovery options available for VMware workloads. It is not an exhaustive list but provides some ideas. Further work will be needed to build a resilient network architecture depending on the event you are looking to protect against. For example, do you have available network links if your primary data centre is down? Can your users and applications still route to private networks in the cloud? Are your services internet facing, allowing you to make your cloud site the ingress and egress point? As with any cloud resources, in particular if you are building your own services, a shared security model applies which should be fully understood before deployment. Protecting VMware workloads should only form part of your disaster recovery strategy; other dependencies, both technical and process, will also play a part. For more information on considering the bigger picture see Disaster Recovery Strategy – The Backup Bible Review.

Concepts

  • DRaaS (Disaster Recovery as a Service) – A managed service that will typically involve some kind of data replication to a site where the infrastructure is entirely managed by the service provider. The disaster recovery process is built using automated workflows and runbooks, such as scaling out capacity and bringing virtual machines online. An example DRaaS is VMware Cloud Disaster Recovery, which we’ll look at in more detail later on.
  • SaaS (Software as a Service) – An overlay software solution may be able to manage the protection of data and failover, but may not include the underlying infrastructure components as a whole package. Typically the provider manages the hosting, deployment, and lifecycle management of the software, but either the customer or another service provider is responsible for the management and infrastructure of the protected and recovery sites.
  • IaaS and PaaS (Infrastructure as a Service and Platform as a Service) – Various options exist around building disaster recovery solutions based on infrastructure or platforms consumed from a service provider. This approach will generally require more effort from administrators to set up and manage but may offer greater control. An example is installing VMware Site Recovery Manager (self-managed) to protect virtual machines running on VMware-based IaaS. Alternatively, third party backup solutions could be used with cloud storage repositories and cloud hosted recovery targets.
  • Hybrid Cloud – The VMware Software Defined Data Centre (SDDC) can run on-premises and overlay cloud providers and hyperscalers, delivering a consistent operating platform. Disaster recovery is one of the common use cases for a hybrid cloud model, as shown in the whiteboard below. Each of the solutions covered in this post is focused around a hybrid cloud deployment of VMware software in an on-premises data centre and in the cloud.
Hybrid Cloud Use Cases

VMware Cloud Disaster Recovery

VMware Cloud Disaster Recovery (VCDR) replicates virtual machines from on-premises to cloud based scale-out file storage, which can be mounted to on-demand compute instances when required. This simplifies failover to the cloud and lowers the cost of disaster recovery. VCDR allows for live mounting of a chosen restore point for fast recovery from ransomware. Recently ransomware has overtaken events like power outages, natural disasters, human error, and hardware failure as the number one cause of disaster recovery events.

VCDR uses encrypted AWS S3 storage with AWS Key Management Service (KMS) as a replication target, protecting virtual machines on-premises running on VMware vSphere. There is no requirement to run the full SDDC / VMware Cloud Foundation (VCF), vSAN, or NSX at the replication source site. If and when required, the scale-out file system is mounted to compute nodes using VMware Cloud (VMC) on AWS, without the need to refactor or change any of the virtual machine file formats. VCDR also includes built-in audit reporting, continuous health checks at 30-minute intervals, and test failover capabilities.

VMware Cloud on AWS provides the VMware SDDC as a managed service running on dedicated AWS bare-metal hardware. VMware manage the full infrastructure stack and lifecycle management of the SDDC. The customer sets security and access configuration, including data location. Currently VCDR is only available using VMware Cloud on AWS as the target for cloud compute, with the following deployment options:

  • On Demand – virtual machines are replicated to the scale-out file storage. When disaster recovery is invoked an automated SDDC deployment is initiated. When the SDDC is ready the file system is mounted to the SDDC and virtual machines are powered on. Typically this means a Recovery Time Objective (RTO) of around 4 hours. For services that can tolerate a longer RTO the benefit of this deployment model is that the customer only pays for the storage used in the scale-out storage, and then pays for the compute on-demand should it ever be needed.
  • Pilot Light – a small VMware Cloud on AWS environment exists, typically 3 hosts. Virtual machines are replicated to the scale-out file storage. When disaster recovery is invoked the file system is instantly mounted to the existing SDDC and virtual machines are powered on. Depending on the number of virtual machines being brought online, the SDDC automatically scales out the number of physical nodes. This brings the RTO down to as little as a few minutes. The customer is paying for the minimum VMware Cloud on AWS capacity to be already available but this can be scaled out on-demand, offering significant cost savings over having an entire secondary infrastructure stack.
VMware Cloud Disaster Recovery
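The choice between the two deployment options largely comes down to the RTO a service can tolerate. A minimal Python sketch of that decision is shown below, using the typical 4 hour figure quoted above as the threshold. The function name and threshold are illustrative assumptions; real recovery times depend on the environment and should be validated with test failovers.

```python
from datetime import timedelta

# Typical On Demand recovery time quoted above; actual results will vary.
ON_DEMAND_RTO = timedelta(hours=4)


def choose_deployment(required_rto):
    """Pick the cheapest VCDR deployment option that can meet the required RTO."""
    if required_rto >= ON_DEMAND_RTO:
        # Storage-only cost until a disaster is declared.
        return "On Demand"
    # A small SDDC is kept running so the file system can be mounted immediately.
    return "Pilot Light"


print(choose_deployment(timedelta(hours=8)))     # On Demand
print(choose_deployment(timedelta(minutes=30)))  # Pilot Light
```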

The cloud-based orchestrator behind the service is provided as SaaS, with a connector appliance deployed on-premises to manage and encrypt replication traffic. After breaking replication and mounting the scale-out file system, administrators manage virtual machines using the consistent experience of vSphere and vCenter. Startup priorities can be set to ensure critical virtual machines are started up first. At this point virtual machines are still running in the scale-out file system, and will begin to storage vMotion over to the vSAN datastore provided by the VMware Cloud on AWS compute nodes. The storage vMotion time can vary depending on the amount of data and number of nodes (more nodes, and therefore more physical NICs, provide more network bandwidth), however the vSAN cache capabilities can help alleviate any performance hit during this time. When the on-premises site is available again replication reverses, only sending changed blocks, ready for failback.

You can try out VCDR using the VMware Cloud Disaster Recovery Hands-On Lab, additional information can be found at the VMware Cloud Disaster Recovery Solution and VMware Cloud Disaster Recovery Documentation pages.

VMware Site Recovery Manager

VMware Site Recovery Manager (SRM) has been VMware’s main disaster recovery solution for a number of years. SRM enables policy-driven automation of virtual machine failover between sites. Traditionally SRM has been used to protect vSphere workloads in a primary data centre using a secondary data centre also running a VMware vSphere infrastructure. One of the benefits of the hybrid cloud model utilising VMware software in a cloud provider like AWS, Azure, Google Cloud, or Oracle Cloud, is the consistent experience of the SDDC stack, allowing continuity of solutions like SRM.

SRM in this scenario can be used with an on-premises data centre as the protected site, and a VMware stack using VMware Cloud on AWS, Azure VMware Solution (AVS), Google Cloud VMware Engine (GCVE), or Oracle Cloud VMware Solution (OCVS) as the recovery site. SRM can also be used to protect virtual machines within one of the VMware cloud-based offerings, for example failover between regions, or even between cloud providers. With any of these options Site Recovery Manager can be deployed and managed by the customer, whereas VMware Cloud on AWS also offers a SaaS version of Site Recovery Manager, VMware Site Recovery, which is covered in the next section.

VMware Site Recovery

SRM does require the recovery site to be up and running but can still prove value for money. Using the hybrid cloud model infrastructure in the cloud can be scaled out on-demand to fulfil failover capacity, reducing the amount of standby hardware required. The difference here is that vSphere Replication is replicating virtual machines to the SDDC vSAN datastore, whereas VCDR replicates to a separate scale-out file system. The minimum number of nodes may be driven by storage requirements depending on the amount of data being protected. The recovery site could also be configured active/active, or run test and dev workloads that can be shut down to reclaim compute capacity. Again storage overhead is a consideration when deploying this type of model. Each solution will have its place depending on the use case.

SRM allows for centralised recovery plans of VMs and groups of VMs, with features like priority groups, dependencies, shut down and start up customisations, including IP address changes using VMware Tools, and non-disruptive recovery testing. If you’ve used SRM before, the concept is the same for using a VMware cloud-based recovery site as a normal data centre: an SRM appliance is deployed and registered with vCenter to collect objects like datastores, networks, resource pools, etc. required for failover. If you haven’t used SRM before you can try it for free using either the VMware Site Recovery Manager Evaluation, or VMware Site Recovery Hands-on Lab. Additional information can be found at the VMware Site Recovery Manager Solution and VMware Site Recovery Manager Documentation pages.
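To show how priority groups and dependencies combine into a start-up order, here is a small Python sketch using the standard library `graphlib` module (Python 3.9 or later). It is a simplified model and not SRM's implementation: the VM names, priority numbers, and the tie-breaking rule are assumptions made for the example.

```python
from graphlib import TopologicalSorter

# Hypothetical recovery plan: lower priority numbers start first, and a VM
# never starts before the VMs it depends on.
vms = {
    "db01": {"priority": 1, "depends_on": []},
    "dns01": {"priority": 1, "depends_on": []},
    "app01": {"priority": 2, "depends_on": ["db01"]},
    "web01": {"priority": 3, "depends_on": ["app01"]},
}


def start_order(plan):
    """Return VM names in an order that honours dependencies, then priority."""
    sorter = TopologicalSorter(
        {name: spec["depends_on"] for name, spec in plan.items()}
    )
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    order = []
    while sorter.is_active():
        # Everything returned here has all of its dependencies started.
        ready = sorted(sorter.get_ready(), key=lambda n: (plan[n]["priority"], n))
        order.extend(ready)
        sorter.done(*ready)
    return order


print(start_order(vms))
```

A circular dependency, for example two VMs that each wait on the other, is rejected up front rather than discovered halfway through a failover.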

VMware Site Recovery

VMware Site Recovery is the same product as Site Recovery Manager, described above, but in SaaS form. VMware Site Recovery is a VMware Cloud based add-on for VMware Cloud on AWS. The service can link to Site Recovery Manager on-premises to enable failover to a VMware Cloud on AWS SDDC, or it can provide protection and failover between SDDC environments in different VMware Cloud on AWS regions. At the time of writing VMware Site Recovery is not available with any other cloud providers. As a SaaS solution VMware Site Recovery is naturally easy to enable; it just needs activating in the VMware Cloud portal. You can find out more from the VMware Site Recovery Solution page.

Closing Notes

For more information on the solutions listed see the VMware Disaster Recovery Solutions page, and check in with your VMware account team to understand the local service provider options relevant to you. There are other solutions available from VMware partners and backup providers. Your existing backup solution, for example, may offer a DRaaS add-on, or the capability to back up or replicate to cloud storage which can be used to build out your own disaster recovery solution in the cloud.

The table below shows a high level comparison of the differences between the VMware Cloud Disaster Recovery and Site Recovery Manager offerings. As you can see there is a trade-off between cost and speed of recovery. There are use cases for each solution, and in some cases maybe both side by side. Hopefully in future these products will fully integrate to allow DRaaS management from a single interface or source of truth where multiple Recovery Point Objective (RPO) and RTO requirements exist.

Solution | Service Type | Replication | Failover | RPO | Pricing
VMware Cloud Disaster Recovery | On demand DRaaS | Cloud based file system | Live mount when capacity is available | ~4 hours | Per VM, per TiB of storage, list price is public here. VMC on AWS capacity may be needed*
VMware Site Recovery | Hot DRaaS | Directly to failover capacity | Fast RTOs using pre-provisioned failover capacity | As low as 5 minutes with vSAN at the protected site, or 15 minutes without vSAN | Per VM, list price is public here. vSphere Replication is also needed**
VMware Site Recovery Manager | Self-managed | Directly to failover capacity | Fast RTOs using pre-provisioned failover capacity | As low as 5 minutes with vSAN at the protected site, or 15 minutes without vSAN | Per VM, in packs of 25 VMs. vSphere Replication is also needed**
VMware Cloud Disaster Recovery (VCDR) and Site Recovery Manager (SRM) side-by-side comparison

*VMware Cloud on AWS capacity is needed depending on the deployment model, detailed above. For pilot light a minimum of 3 nodes are running all the time; these can be discounted using 1 or 3 year reserved instances. For on-demand, if failover is required then the VMC capacity is provisioned using on-demand pricing. List price for both can be found here, but VMware also have a specialist team that will work out the sizing for you.

**vSphere Replication is not sold separately but is included in the following versions of vSphere: Essentials Plus, Standard, Enterprise, Enterprise Plus, and Desktop.

Featured image by Christina @ wocintechchat.com on Unsplash

How to Upgrade to vRealize Operations Manager 8.3

Introduction

Recently I installed vRealize Operations Manager 8.2 in my home lab environment. Less than a week later 8.3 was released – of course it was! The new version has some extra features like 20-second peak metrics and VMware Cloud on AWS objects, but what I’m interested to look at is the new Cloud Management Assessment (CMA). The vSphere Optimisation Assessment (VOA) has been around for a while to show the value of vRealize Operations (vROps) and optimise vSphere environments. The CMA is the next logical step in extending that capability out into VMware Cloud and vRealize Cloud solutions. You can read more in the What’s New in vRealize Operations 8.3 blog. This post walks through the steps required to upgrade vRealize Operations Manager from 8.2 to 8.3.

vRealize Operations Manager 8.3 Upgrade Guide

The upgrade process is really quick and easy for a single node standard deployment. The upgrade may take longer if you have multiple distributed nodes that the software update needs pushing out to, or if you need to clone any custom content. If you are upgrading from vROps 8.1.1 or earlier you will need to upgrade End Point Operations Management agents using the steps detailed here. The agent builds for 8.3 and 8.2 are the same.

Before upgrading vRealize Operations Manager we’ll run the Upgrade Assessment Tool, a non-intrusive read-only software package that produces a report showing system validation checks and any removed or discontinued metrics. The latter point is important to make sure you don’t lose any customisation like dashboards or management packs as part of the upgrade. Here are some additional points for the upgrade:

  • Take a snapshot or backup of the existing vRealize Operations Manager before starting
  • Check the existing vRealize Operations Manager is running on ESXi 6.5 U1 or later, and managed by vCenter 6.5 or later
  • Check the existing vRealize Operations Manager is running at least hardware version 11
  • You can upgrade to vROps 8.3 from versions 7.0 and later; check the available upgrade paths here
  • If you are using any other VMware solutions check product interoperability here
  • If you need to back up and restore custom content review the Upgrade, Backup and Restore section of the vRealize Operations 8.3 documentation here
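The version-related checks in the list above can be expressed as a simple pre-flight script. The Python sketch below is only an illustration: the `env` dictionary and its keys are hypothetical values you would fill in by hand, and nothing here queries vCenter or vROps. It does not replace the official upgrade path and interoperability checks.

```python
def major_minor(version):
    """Turn a 'major.minor[.patch]' string such as '6.7.0' into (6, 7)."""
    return tuple(int(part) for part in version.split(".")[:2])


def precheck(env):
    """Return a list of problems that should be fixed before upgrading."""
    problems = []
    if not env["snapshot_taken"]:
        problems.append("take a snapshot or backup first")
    # ESXi is compared on version first, then update level (6.5 U1 minimum).
    if (major_minor(env["esxi_version"]), env["esxi_update"]) < ((6, 5), 1):
        problems.append("ESXi must be 6.5 U1 or later")
    if major_minor(env["vcenter_version"]) < (6, 5):
        problems.append("vCenter must be 6.5 or later")
    if env["hardware_version"] < 11:
        problems.append("virtual hardware must be version 11 or later")
    if major_minor(env["vrops_version"]) < (7, 0):
        problems.append("vROps must be 7.0 or later to upgrade to 8.3")
    return problems


# Hypothetical home lab values, entered manually.
lab = {
    "snapshot_taken": True,
    "esxi_version": "6.7",
    "esxi_update": 3,
    "vcenter_version": "6.7",
    "hardware_version": 11,
    "vrops_version": "8.2",
}

for problem in precheck(lab) or ["all checks passed"]:
    print(problem)
```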

vROps 8.3 Upgrade Checks

First, download the vRealize Operations Manager upgrade files. You’ll need the Virtual Appliance Upgrade for 8.x or 7.x pak file, and the Upgrade Assessment Tool pak file. The vRealize Operations 8.3 release notes can be found here.

Browse to the FQDN or IP address of the vRealize Operations Manager master node /admin, and log in with the admin credentials.

vRealize Operations Manager admin login

From the left-hand navigation pane browse to Software Update. Click Install Software Update and upload the Upgrade Assessment Tool pak file. Follow the steps to accept the End User License Agreement (EULA) and click Install.

Check the status of the software bundle from the Software Update tab. Once complete, click Support and Support Bundles. Highlight the bundle and click the download icon to obtain a copy of the report.

vRealize Operations Manager support bundle

Extract the downloaded zip file and expand the apuat-data and report folders. Open index.html.
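If you prefer to script this step, the Python sketch below extracts a downloaded bundle and locates the report's index.html. The function name and the example file name are hypothetical, and the search simply looks for the first index.html under the extracted folder rather than assuming an exact folder layout.

```python
import zipfile
from pathlib import Path


def extract_report(bundle_zip, dest_dir):
    """Extract the support bundle and return the path to the report's index.html."""
    dest = Path(dest_dir)
    with zipfile.ZipFile(bundle_zip) as archive:
        # Only extract bundles downloaded from your own vROps instance.
        archive.extractall(dest)
    matches = sorted(dest.rglob("index.html"))
    if not matches:
        raise FileNotFoundError("no index.html found in the extracted bundle")
    return matches[0]


# Example usage (the file name is a placeholder):
# report = extract_report("support-bundle.zip", "assessment-report")
# print(report.resolve().as_uri())  # paste into a browser to open the report
```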

vRealize Operations Manager system validation

System validation checks and impacted components can be viewed. For any impacted components you can drill down into the deprecated metric and view any applicable replacements.

vRealize Operations Manager content validation

vROps 8.3 Upgrade Process

Following system and content validation checks the next step is to run the installer itself. Navigate back to the Software Update tab and click Install Software Update. Upload the vRealize Operations Manager 8.3 upgrade pak file.

vRealize Operations Manager software update

When upload and staging is complete click Next.

vRealize Operations Manager software upload

Accept the End User License Agreement (EULA) and click Next.

vRealize Operations Manager EULA

Review the update information and click Next.

vRealize Operations Manager update information

Click Install to begin the software update.

vRealize Operations Manager install software update

You can monitor the upgrade process from the Software Update page; however, after about 5 minutes you will be logged out.

vRealize Operations Manager update in progress

After logging back in, it takes around a further 15-20 minutes before the update is finalised and the cluster is brought back online. Refresh the System Status and System Update pages when complete.
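Rather than refreshing the page by hand, the wait can be modelled as a polling loop with a timeout. The Python sketch below is generic: `get_state` is a hypothetical callable you would supply yourself, and the "ONLINE" state string is an assumption for the example, not a documented vROps value.

```python
import time


def wait_until_online(get_state, timeout_s=1800, interval_s=30,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll get_state() until it returns "ONLINE" or the timeout expires.

    get_state is a caller-supplied (hypothetical) function returning the
    current cluster state as a string. Returns True if the cluster came
    online within the timeout, otherwise False.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            if get_state() == "ONLINE":
                return True
        except ConnectionError:
            # The admin interface may be unreachable while services restart.
            pass
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Example with a stand-in state source that comes online on the third poll.
states = iter(["OFFLINE", "STARTING", "ONLINE"])
print(wait_until_online(lambda: next(states), sleep=lambda seconds: None))
```

The default timeout of 30 minutes leaves some headroom over the 15-20 minutes observed here; larger multi-node clusters may need longer.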

vRealize Operations Manager update complete

I can now log back into vROps. The Cloud Management Assessment can be accessed from the Quick Start page by expanding View More, selecting Run Assessments and clicking VMware vRealize Cloud Management Assessment.

vRealize Operations Manager Cloud Management Assessments