vRealize Operations Capacity Shows 100% Cluster Utilisation

Recently we were examining a vSphere cluster where vRealize Operations Manager was showing 100% CPU utilisation, with zero capacity remaining. However, the usage of all resources in the cluster was generally low. We know that the cluster capacity is based on demand rather than usage. CPU demand is the amount of CPU resources a virtual machine would use if there were no CPU contention or limit. Sometimes, this can cause a little confusion when we look at the utilisation metrics of the cluster.
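To make the difference between usage and demand concrete, here is a minimal sketch. It is a toy model and not the formula vRealize Operations uses: the `VmSample` fields, the figures, and the 100% cap on the reported demand are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class VmSample:
    name: str
    wanted_mhz: float  # CPU the guest would consume if nothing held it back
    limit_mhz: float   # hypothetical cap imposed by a limit or by contention


def cluster_utilisation(samples, capacity_mhz):
    """Return (usage_pct, demand_pct) for one sampling interval."""
    # Usage is what each VM actually received; demand is what it asked for.
    usage = sum(min(s.wanted_mhz, s.limit_mhz) for s in samples)
    demand = sum(s.wanted_mhz for s in samples)
    usage_pct = 100.0 * usage / capacity_mhz
    # Assumption: the reported figure is capped at 100% of capacity.
    demand_pct = min(100.0, 100.0 * demand / capacity_mhz)
    return usage_pct, demand_pct


# Made-up figures: two VMs that want far more CPU than they are given.
samples = [
    VmSample("sql-01", wanted_mhz=6000, limit_mhz=2000),
    VmSample("sql-02", wanted_mhz=5000, limit_mhz=2000),
]
usage_pct, demand_pct = cluster_utilisation(samples, capacity_mhz=10000)
print(f"usage {usage_pct:.0f}%, demand {demand_pct:.0f}%")
```

With these numbers the cluster looks only 40% used, yet a demand-based view reports it as full.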

This type of behaviour is actually expected because of how vRealize Operations interprets the data. When virtual machines have latency sensitivity set to high, all of the CPU is requested by the virtual machine in order to reserve it. Since vRealize Operations Manager cannot differentiate between latency sensitivity reservations and legitimate CPU requests, we see CPU and/or memory contention alerts. More information can be found in the KB article Virtual Machine(s) Workload badge reports constant 100+ score in VMware vRealize Operations Manager (2145552). The KB article suggests that if latency sensitivity cannot be set back to normal, then a custom group can be created to disable the alerts.

This scenario is well documented. However, what if latency sensitivity is not enabled or configured beyond the default setting, but the symptoms are the same? In this case, the cluster is dedicated to running SQL workloads.

Using the metrics view of the cluster under the Environment tab, we can see high peaks in the CPU co-stop and CPU ready values every night. The discrepancy seems to be caused by the virtual machines claiming all available CPU resource at a specific time. Whilst this might sound environment-specific, there are a number of scenarios where this could be the case and a workaround is needed.

Beyond changing the behaviour of the virtual machines, some available options are as follows:

  • Action the rightsize recommendations to ensure we are not over-allocating CPU resources
  • Follow the steps outlined in the KB article above to ignore/disable the alerts
  • Follow the steps outlined below to set a maintenance schedule, disregarding metrics where the peak is at a consistent time every day or night
  • In the capacity policy, change the setting of the time remaining calculations

Updating how the time remaining is calculated may be a last resort, but can provide a slightly different interpretation of the data. You can see the description of each setting, and how the associated projection graph changes, in the screenshots below. The default policy uses conservative capacity planning, which takes the higher values, whereas aggressive uses the average values of resource utilisation.

To update this setting, either change the default policy or create a new policy to assign to specific objects like a cluster. Follow the policy-based steps outlined below, disregarding the maintenance schedule. You can find out more information on how remaining time is calculated in the blog Rightsizing VMs with vRealize Operations.
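The difference between the two settings can be sketched with a simple linear projection. This is only an illustration under stated assumptions: the real capacity engine is more sophisticated, and the function names, the least-squares trend, and the sample figures are all made up for the example.

```python
from statistics import mean


def linear_fit(ys):
    """Least-squares slope and intercept for ys sampled at x = 0, 1, 2, ..."""
    n = len(ys)
    x_mean = (n - 1) / 2
    y_mean = mean(ys)
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def days_remaining(daily_samples, usable_capacity, mode):
    """Days until the trend crosses usable_capacity, or None if it never does."""
    # Conservative follows the daily peaks, aggressive follows the daily means.
    pick = max if mode == "conservative" else mean
    series = [pick(day) for day in daily_samples]
    slope, intercept = linear_fit(series)
    current = intercept + slope * (len(series) - 1)
    if current >= usable_capacity:
        return 0.0
    if slope <= 0:
        return None
    return (usable_capacity - current) / slope


# Made-up utilisation (%): a quiet daytime sample and a growing nightly peak.
history = [[20, 80], [20, 82], [20, 84], [20, 86]]
conservative = days_remaining(history, usable_capacity=100, mode="conservative")
aggressive = days_remaining(history, usable_capacity=100, mode="aggressive")
print(conservative, aggressive)
```

In this toy data set the peak-based projection runs out of capacity in about a week, while the average-based projection gives well over a month, which is why the same cluster can look very different under each setting.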

vRealize Operations Conservative Capacity Policy
vRealize Operations Aggressive Capacity Policy

Setting a Maintenance Schedule

The following steps walk through creating a maintenance schedule with an associated capacity policy. You can also change the time remaining calculations from the capacity policy, with or without a maintenance schedule. The screenshots are from vROps 8.6, but the process should be similar in earlier 8.x versions.

  • First, create the maintenance schedule. From the left hand navigation pane, expand Configure and select Maintenance Schedules.
  • Click Add. Enter the name, time zone, and time configuration of the schedule. Click Save.
  • Next, we need to create a policy. From the Configure menu again, select Policies.
  • Click Add. Enter the name, and select a policy to clone. Click Create Policy.
  • Select the policy from the list, and click Edit Policy.
  • Select the Capacity block, and then choose the object type.
vRealize Operations Capacity Policy

Here, if required, you can change the policy for time remaining calculations, mentioned above, and manually change the alert thresholds. When considering the time remaining calculations, the default conservative policy takes the highest resource utilisation to project the time remaining before it crosses the usable capacity threshold. The aggressive policy uses the mean resource utilisation to project the time remaining before that average crosses the usable capacity threshold. Both policies have their uses; aggressive may be better suited to smaller organisations wanting to sweat hardware assets.

  • Make any desired changes to the policy per the description above. Scroll down to Maintenance Schedule and select the schedule created earlier. Click Save.
  • Next, select Groups and Objects. Choose a custom group or object to apply the policy to, and click Save.
vRealize Operations Assigned Policy
  • Now that the policy is configured and assigned to an object, it is active and in use.
vRealize Operations Active Policy
  • When we check back on the maintenance schedule we can now see the linked policy.
vRealize Operations Maintenance Schedule
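The effect of a nightly maintenance window on a peak-based calculation can be sketched as below. This is not how vRealize Operations implements schedules internally; the helper names, timestamps, and values are hypothetical and only illustrate disregarding samples inside a recurring time window.

```python
from datetime import datetime, time


def in_window(ts, start, end):
    """True if ts falls inside a daily window; handles windows crossing midnight."""
    t = ts.time()
    if start <= end:
        return start <= t < end
    return t >= start or t < end


def peak_outside_window(samples, start, end):
    """Highest value among samples not covered by the maintenance window."""
    kept = [value for ts, value in samples if not in_window(ts, start, end)]
    return max(kept) if kept else None


# Made-up CPU demand samples (%), including a job that runs around midnight.
samples = [
    (datetime(2021, 10, 1, 14, 0), 35.0),
    (datetime(2021, 10, 1, 23, 30), 98.0),  # nightly job
    (datetime(2021, 10, 2, 0, 30), 97.0),   # nightly job, after midnight
    (datetime(2021, 10, 2, 9, 0), 42.0),
]
window_start, window_end = time(23, 0), time(1, 0)
raw_peak = max(value for _, value in samples)
filtered_peak = peak_outside_window(samples, window_start, window_end)
print(raw_peak, filtered_peak)
```

Once the 23:00 to 01:00 window is excluded, the highest remaining sample drops from 98% to 42%, so a conservative projection is no longer driven by the nightly job.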

There are additional ways of setting maintenance schedules; the example above is relevant to the described use case of disregarding metrics during a certain time interval. You can also manually enter maintenance mode through both the vROps UI and API (see Maintenance Mode for vRealize Operations Objects, Part 1 by Thomas Kopton), or create dynamic groups containing hosts in maintenance mode (see Maintenance Mode for vRealize Operations Objects, Part 2).
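For the API route, the sketch below only builds the HTTP requests and does not send them. The endpoint paths, the `duration` query parameter, and the `vRealizeOpsToken` authorisation scheme reflect my understanding of the vROps Suite API and should be treated as assumptions to check against the API documentation for your version. The host name, resource ID, and token are placeholders.

```python
import json
import urllib.request


def token_request(base_url, username, password):
    """Build (not send) a request for an API token. Path is an assumption."""
    body = json.dumps({"username": username, "password": password}).encode()
    return urllib.request.Request(
        f"{base_url}/suite-api/api/auth/token/acquire",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "Accept": "application/json"},
    )


def maintenance_request(base_url, resource_id, token, duration_minutes=None, leave=False):
    """Build (not send) a request to enter or leave maintenance for one object."""
    url = f"{base_url}/suite-api/api/resources/{resource_id}/maintained"
    if duration_minutes is not None and not leave:
        url += f"?duration={duration_minutes}"  # assumed to be in minutes
    return urllib.request.Request(
        url,
        method="DELETE" if leave else "PUT",
        # Assumption: token scheme name as used by vROps 8.x.
        headers={"Authorization": f"vRealizeOpsToken {token}", "Accept": "application/json"},
    )


# Placeholder values for illustration only.
enter = maintenance_request("https://vrops.example.com", "abc-123", "TOKEN", duration_minutes=60)
leave = maintenance_request("https://vrops.example.com", "abc-123", "TOKEN", leave=True)
print(enter.get_method(), enter.full_url)
print(leave.get_method(), leave.full_url)
```

Each request could then be sent with `urllib.request.urlopen`, ideally against a test object first.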

Multi-Cloud Management with vRealize Operations

This post will take a look at how vRealize Operations (vROps) can provide a single monitoring and visibility tool into your on-premises data centre, native public cloud services, and hybrid cloud platforms like VMware Cloud on AWS, or Azure VMware Solution. vRealize Operations provides VMware customers with monitoring and alerting, troubleshooting and remediation, dashboards and reporting, performance and capacity management, cost visibility and comparison, and security compliance.

vROps for Cloud-First

The vRealize Operations Manager instance itself can either be self-hosted (on-premises) where the customer is responsible for lifecycle management, hosting and availability, or Software-as-a-Service (SaaS). When using SaaS, vRealize Operations Cloud is hosted and maintained by VMware, and consumed as a service by the customer. Whilst the self-managed vRealize Operations is packaged into Standard, Advanced, and Enterprise editions, vROps Cloud comes in one edition only which has feature parity with enterprise, plus some additional capabilities like near-real-time 20 second monitoring. You can compare features between Standard, Advanced, Enterprise, and Cloud editions in the vRealize Operations Solution Brief.

In the UK, the closest locality for vROps Cloud is currently Frankfurt. You can review compliance and data processing information in the VMware Cloud Trust Centre. When looking at public cloud or hybrid cloud, including SaaS options, you may also want to review VMware’s award-winning sustainability initiatives, including a commitment to net zero carbon emissions by 2030 across VMware global operations, all VMware Cloud solutions and VMware Cloud Provider Partners.

vROps also now integrates with CloudHealth, providing advanced financial management and optimisation recommendations for native cloud resources in Azure, AWS, Google Cloud Platform, and Oracle Cloud Platform. As well as overall cost savings, finance teams can use CloudHealth with resource tagging to bill individual departments for the exact capacity they have used. This empowers service or application owners to look after their digital assets and only use resources or hold data that they really need. The power of CloudHealth can be brought into vROps using the new management pack.

Hybrid Cloud Examples

The example below shows a customer with a hybrid cloud setup. In this scenario they may choose to host big data services in the Microsoft Azure cloud, and VMware workloads across on-premises and Azure VMware Solution. The hyperscaler is interchangeable and could be AWS, Google Cloud, Oracle Cloud, or a combination of cloud providers. Using vRealize Operations we are able to provide a consistent operating model across platforms from a single SaaS based UI.

When onboarding with vRealize Operations Cloud, the primary contact on the account will receive an activation email to enable the subscription. A Cloud Customer Success Manager will carry out the activation steps with you. Once onboarded, rolling updates are carried out automatically for new features. You can also take a look at the vRealize Operations Cloud Solution Overview.

vRealize Operations with Azure

The cloud proxy is a virtual appliance, supplied as an OVA, deployed to the vCenter Server. This proxy forms a tunnel using HTTPS to send data to the SaaS based control plane. The appliance requires outbound HTTPS access to a set of URLs, which can be found in the vRealize Operations Cloud Documentation.
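Before deploying the proxy, it can be handy to confirm that outbound port 443 is open from the target network. The sketch below is a generic TCP reachability check, not a VMware tool. It does not validate TLS or web proxy behaviour, and the host names you pass in must come from the documentation, as none are assumed here.

```python
import socket


def can_reach(host, port=443, timeout=3.0, connect=socket.create_connection):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with connect((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_endpoints(hosts, **kwargs):
    """Map each host name to whether it accepted a connection."""
    return {host: can_reach(host, **kwargs) for host in hosts}
```

Calling `check_endpoints([...])` with the documented URLs' host names, from a machine on the same network as the proxy, returns a dictionary of reachable and unreachable hosts. The `connect` parameter exists only so the check can be exercised without real network access.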

The same cloud proxy model can be used for Azure VMware Solution. There are some points to be aware of with Azure VMware Solution, such as limited visibility into management VMs (as this is part of a managed service). None of these are problematic, and they are listed in the Known Limitations section of the documentation. If you are running an ‘on-premises’ or self-managed version of vRealize Operations, instead of the SaaS version, then at this time the vRealize Operations Manager appliance cannot run directly on Azure VMware Solution.

Native Azure services can be added using an Azure AD app registration with service principal/client secret. Instructions can be found in the Configuring Microsoft Azure section of the documentation, and you can also find a list of Supported Azure Services for vROps. Again, this doesn’t have to be Microsoft Azure; it could be AWS.

AWS works slightly differently in that, when configuring VMware Cloud on AWS for use with vRealize Operations Cloud, the integration happens through an API token, since both solutions are native to the VMware Cloud Services Portal (CSP). See Configuring VMC on AWS in vROps Cloud.

Native AWS services can be added using an IAM generated access key and secret. Instructions can be found in the VMware documentation under Add a Cloud Account for AWS, and you can also find a list of Supported AWS Services for vROps.

vRealize Operations with AWS

Additional Resources

VMware Hands-on-Labs are a fantastic free resource giving access to sandpit environments with step by step instructions for nearly all VMware solutions. Some example Hands-on-Labs for vROps are listed below, along with further video and written documentation.

  • HOL-2101-91-CMP – Getting Started with vRealize Operations – Lightning Lab
  • HOL-2101-06-CMP – vRealize Operations Advanced Topics
  • HOL-2101-04-CMP – vRealize Operations – Optimize and Plan vSphere Capacity and Costs
vRealize Operations Troubleshooting Workbench

The following sessions are available at VMworld 2021, and if you’re reading this after the event the sessions will also be made available on-demand.

  • A Big Update on vRealize Operations [MCL1277] Technical level 100
  • vROps Dashboarding 101 and Beyond [VMTN2843] Technical level 200
  • Manage Public Cloud with CloudHealth and vRealize [MCL1247] Technical level 100
  • An End-to-End Demo of Taming Public Clouds with CloudHealth and vRealize [MCL1439] Technical level 300 (Tech+ pass)
  • Track Sustainability Goals in Datacenter with vRealize Operations [VMTN2802] Technical level 200
  • Accelerate Your VDI Management with vRealize Operations [MCL1899] Business level 100
  • Next-Gen Infra and Apps Operations Management with vROps – Design Studio [UX2539]
  • Consistent Cloud Operations with vCenter and vRealize Operations [MCL2611] Technical level 100
  • An End-to-End Demo – Operationalizing VMware Cloud Foundation with vRealize [MCL1442] Technical level 300 (Tech+ pass)
  • A Cloud Management Journey from Monolith to Modern Apps with vRealize Suite [GWS-HOL-2201-08-CMP] Technical level 200 (Tech+ pass)
  • Design Principles: Cloud Architecture Design and Operations [MCL2151] Technical level 200
  • Get Close to 100% Automation to Get to True Cloud Operations at Scale [MCL2023] Technical level 300 (Tech+ pass)
vRealize Operations ESXi Configuration Dashboard