Super Metric

I use around 50 super metrics in a typical engagement. By and large, that is good enough for my customers’ requirements. On the other hand, I have seen folks like Ronald Buder and Brandon Gordon, who build very advanced formulas and would like to have more capabilities. Here are 3 enhancements that go a long way in making super metrics more useful:

Ability to specify a condition

Prior to 6.3, a super metric applies to every member of the group or the parent object. If you are counting the number of VMs in a cluster, it will give you all VMs.

With 6.3, you can add a condition. You can count only VMs that are powered on, or VMs with >8 vCPU. As another example, you can count how many VMs in a datastore have latency above a certain number.
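The effect of a condition can be pictured with a small Python sketch. This illustrates the logic only and is not super metric syntax; the `VM` class, its fields, and the sample inventory are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    powered_on: bool
    vcpus: int


def count_where(vms, condition):
    """Count only the members that satisfy the condition."""
    return sum(1 for vm in vms if condition(vm))


# Hypothetical cluster members.
cluster = [
    VM("web-01", True, 2),
    VM("db-01", True, 16),
    VM("test-01", False, 12),
]

all_vms = len(cluster)  # without a condition: every member is counted
powered_on = count_where(cluster, lambda vm: vm.powered_on)
large_vms = count_where(cluster, lambda vm: vm.vcpus > 8)
```

Without a condition you only get `all_vms`. With one, the same member list yields a different count per condition.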

Ability to have IF THEN ELSE

Prior to 6.3, a super metric works with 1 formula. You cannot apply Formula 1 for Condition A and Formula 2 if Condition A is not met. A use case here is checking VM Uptime. If VMware Tools is running, you use the Tools heartbeat to decide that the VM is up. If VMware Tools is not running, you use VM utilization to decide.

The IF THEN ELSE can be combined with AND, OR and NOT. This enables you to build a more comprehensive logic.

You can chain it to create IF THEN ELSEIF.
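Here is a rough Python sketch of the VM Uptime use case, written as an IF THEN ELSEIF chain. It is not super metric syntax. The function name, the parameters, and the 50 MHz threshold are my assumptions for illustration; pick a threshold that suits your environment.

```python
def vm_is_up(tools_running, heartbeat_ok, cpu_usage_mhz, usage_threshold_mhz=50):
    """Decide whether a VM is up, choosing the formula based on a condition."""
    if tools_running:
        # Formula 1: trust the VMware Tools heartbeat.
        return heartbeat_ok
    elif cpu_usage_mhz is None:
        # No utilization data at all: treat the VM as down.
        return False
    else:
        # Formula 2: fall back to VM utilization (threshold is an assumption).
        return cpu_usage_mhz > usage_threshold_mhz


print(vm_is_up(tools_running=True, heartbeat_ok=True, cpu_usage_mhz=0))
print(vm_is_up(tools_running=False, heartbeat_ok=False, cpu_usage_mhz=200))
```

The point is that the heartbeat is ignored when Tools is not running, which a single formula cannot express.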

Ability to combine expression

You can have AND, OR, NOT. Enough said 🙂

Ability to compare. You can have less than, less than or equal to, greater than, greater than or equal to.

The where clause cannot point to another object, but it can point to a different metric in the same object. For example, you cannot count the number of VMs in a cluster with CPU Contention metric > SLA of that cluster. The phrase “SLA of that cluster” belongs to the cluster object, not the VM object.

The right operand must also be a number. It cannot be another super metric or variable.

The where clause cannot be combined with AND, OR, NOT. This means you cannot have “where VM CPU > 4 and VM RAM > 16”. The reason is that the ‘where’ clause calculation runs on the vR Ops node where the data is retrieved, while the other operators (AND, OR, NOT) run on the node where the super metric expression is executed. Those operators are executed after all the data has been retrieved, and the retrieved data does not contain the metric values of each member object, only the aggregated values of these objects.
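To see why this split prevents “where A and B”, here is a simplified Python model of the flow described above. It is not how vR Ops is implemented internally; the function, the node names, and the per-VM values are hypothetical.

```python
def node_count_where(local_values, threshold):
    """Runs on the node holding the data: one metric compared with one number."""
    return sum(1 for value in local_values if value > threshold)


# Hypothetical per-VM values, held on two different data nodes.
node_a = {"vcpu": [2, 8, 16], "ram_gb": [8, 32, 64]}
node_b = {"vcpu": [4, 12], "ram_gb": [32, 16]}

# Each node returns only an aggregate (a count), never the per-VM values.
vms_over_4_vcpu = sum(node_count_where(n["vcpu"], 4) for n in (node_a, node_b))
vms_over_16_gb = sum(node_count_where(n["ram_gb"], 16) for n in (node_a, node_b))

# The node evaluating the expression sees two counts only. It cannot tell
# which VMs satisfy both conditions, so a per-VM AND is not possible here.
print(vms_over_4_vcpu, vms_over_16_gb)
```

Both counts come back as plain numbers, and nothing in them says whether they refer to the same VMs.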

As expected, you will find the new operators in the super metric editor, as shown below.

The following screenshot, courtesy of Brandon Gordon, shows a brief description of the operators:

The [x,y,z] array has actually been available since an earlier release. What is new is that x, y, and z can be independent expressions, and all their results are put into the array. They are no longer limited to a constant or a metric.

Resource Alias

The name of a resource is rather long. If you have a lot of resources in the formula, the whole formula can be hard to read. You can now give the resource a name. Here is an example:

Notice that CPUload includes the depth clause and where clause, not just the metric.

VM Memory metrics

I discuss the limitation in sizing VM RAM in this blog. In a nutshell, the hypervisor does not have visibility into how the Guest OS manages its RAM. Some applications, such as JVM and databases, manage their own RAM. The Guest OS does not have visibility into how the app manages its RAM. This is why RAM sizing is best done at Guest OS and App levels.

vR Ops 6.3 brings Guest OS RAM metrics. Yes, it is agentless. There is no need to deploy agents on every VM. How does it work then, if there is no network connection to the VM? VMware Tools comes to the rescue! vR Ops talks to vCenter, which in turn talks to ESXi via the management network. The new version of VMware Tools pulls these additional counters. ESXi retrieves them and passes them to vCenter.

This feature has actually been available since vSphere 6.0 Update 1. Yes, that means you need a minimum of ESXi 6.0U1, vCenter 6.0U1, and the VM must be running Tools from ESXi 6.0U1. You do not need to upgrade to vSphere 6.0 Update 2.

The table shows a variety of VMs with the Guest OS data. I’ve added the Active RAM from hypervisor as a comparison.

Here is the list of metrics. I’m using the internal name as the table above already has the friendly name.

Internal name

Description

guest|mem.free_latest

This is one of the 3 major counters for capacity & performance monitoring. The other 2 counters are Page-in Rate and Commit Ratio.
In Windows, this is the Free Memory counter. This excludes the cached memory. If this number drops to a low number, Windows is running out of Free RAM. While that number varies per application and use case, I’d generally keep this number > 500 MB for a server VM and >100 MB for a VDI VM. I set a lower number for VDI because they add up. If you have 10K users, that’s 1 TB of RAM.

guest|mem.needed_latest

The amount of memory needed by the Guest OS. Below this amount, the Guest OS may swap.
This is Total RAM - Free RAM. It includes the Cached RAM, which is Standby + Modified.
The Standby memory (which can be significant on Windows, less so on Linux) can be split into 3: FreeAndZero, Cold and Hot. MemNeeded will count the hot part of the buffer cache as being required by the OS.

guest|page.inRate_latest

The rate of reads going through the underlying paging/cache system. It includes not just swapfile I/O, but cacheable reads as well (double, pages/s). This is the rate at which the Guest OS brings memory back from disk to DIMM. A page that was paged out earlier has to be brought back before it can be used. This creates a performance issue, as the application waits longer because disk is much slower than RAM.
Windows does not page out any Large Pages. A process can have concurrent mixed usage of Large and non-Large pages in Windows. The page size isn’t a system-wide setting that all processes use. The same is likely true for Linux Huge Pages.

guest|page.outRate_latest

The opposite of the above. This is not as important as the above. Just because a block of memory is moved to disk does not mean the application experiences a memory problem. In many cases, the page that was moved out is an idle page.

guest|page.size_latest

Size of the page. In Windows, this is 4 KB by default.
This is not the size of the pagefile.sys in c:\.

guest|mem.physUsable_latest

Physically Usable Memory
Based on a sample of 9 VMs (Windows and Linux), this looks like VM Configured RAM - Hardware Used. Since Hardware Used is near 0, this value is near the Configured RAM.

guest|swap.spaceRemaining_latest

The amount of swap space remaining, taking into account the possibility of swapfile growth where possible. A low remaining value will trigger paging. If the system is configured to run without a swapfile, this will return zero.

guest|hugePage.size_latest

Current size of Huge Page.
This should be 2 MB in Windows.

guest|hugePage.total_latest

Total number of Huge Pages.
This is Linux specific.

guest|contextSwapRate_latest

Context Swap Rate per second in Windows/Linux

guest|mem.activeFileCache_latest

Active File Cache Memory. This is the actively in-use subset of the file cache. Unused file cache and non-file backed anonymous buffers (mallocs etc) are not included.
This seems to be the Cache Bytes counter in Windows.
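The rules of thumb in the table above can be written down as a small Python sketch. The function names and inputs are mine, not vR Ops names; the 500 MB and 100 MB floors are the ones I suggested for Free Memory, and you should adjust them to your applications.

```python
def free_memory_too_low(free_mb, is_vdi):
    """Flag a VM whose Guest OS free memory is not above the suggested floor."""
    floor_mb = 100 if is_vdi else 500
    return free_mb <= floor_mb


def memory_needed_mb(total_mb, free_mb):
    """Memory needed as described above: Total RAM - Free RAM."""
    return total_mb - free_mb


# Why the VDI floor is lower: it adds up across desktops (decimal units).
vdi_users = 10_000
vdi_floor_total_tb = vdi_users * 100 / 1_000_000

print(free_memory_too_low(free_mb=350, is_vdi=False))
print(memory_needed_mb(total_mb=8192, free_mb=1024))
print(vdi_floor_total_tb)
```

A server VM with 350 MB free gets flagged, while a VDI VM with the same value does not.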

Let’s compare them with the RAM counters from Windows. The list below is from Windows 10 Performance Monitor.

I’m not sure if they are enabled by default. If not, it’s a matter of enabling from the Policy, as shown below:

This is what it looks like in VM object. Finally! 🙂

Reduction in Metrics

This is one of my favourites, as I do have customers who struggle with the long list of metrics. This should also improve vR Ops scalability. The example below is from an ESXi Host. Quite a number of the capacity metrics are now hidden, as they are not needed by default.

The reduction can be seen in the Self Monitoring, which has improved a lot in 6.3 also. You can see the number of metrics dropped on the following chart.

The reduction translates into less resource utilisation (CPU, Disk, RAM). I’ve added CPU as an example. Notice the load is also less spiky.

Drill down via Line chart

One popular use case is the ability to automatically plot the values of all the children when you select a parent. There are many examples of this, such as:

You select a cluster, and you want to automatically have a line chart of all its ESXi CPU Demand. If you have 8 hosts in that cluster, then you get 8 line charts.

You select a data center, and you want to automatically have a line chart of the number of VMs in each of its clusters, to get a sense of VM growth among clusters.

See the following screenshot. Can you notice how it’s done?

Hint: it’s done differently than in other widgets.

The way you do this is by knowing the relationship among objects. You choose the metrics you want to display, not the parent. In the following example, I need to show the ESXi CPU contention on all ESXi hosts in a cluster. So I pick the ESXi object, not the cluster object.

You do not have to specify the relationship (parent, child, self, etc.). vRealize Operations actually automatically figures out the relationship. Unlike other widgets, where you must specify, the View Widget has that intelligence built-in. Nice!
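Conceptually, the behaviour is similar to the Python sketch below. This is only my mental model, not how the View widget is implemented; the inventory, object names, and function are hypothetical.

```python
# Hypothetical inventory: object name -> (object type, parent name)
inventory = {
    "cluster-A": ("Cluster", "dc-1"),
    "cluster-B": ("Cluster", "dc-1"),
    "esxi-01": ("Host", "cluster-A"),
    "esxi-02": ("Host", "cluster-A"),
    "esxi-09": ("Host", "cluster-B"),
    "vm-01": ("VM", "esxi-01"),
}


def descendants_of_type(selected, wanted_type):
    """Find objects of the view's type that sit anywhere below the selection."""
    found = []
    for name, (obj_type, _parent) in inventory.items():
        if obj_type != wanted_type:
            continue
        current = name
        # Walk up the parent chain until we hit the selection or run out.
        while current in inventory:
            current = inventory[current][1]
            if current == selected:
                found.append(name)
                break
    return sorted(found)


# The view is defined on the Host type; the user selects a cluster.
print(descendants_of_type("cluster-A", "Host"))
```

Each name returned would get its own line in the chart. Selecting `dc-1` with a view defined on `Cluster` works the same way.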

Can you spot a performance issue that happened in the past in the selected cluster below?

The above screenshot shows that one of the ESXi hosts experienced a spike in CPU Contention. It touched 9%, which is a high number, as the number at ESXi level is the average of all its VMs. One of the VMs was likely experiencing a much higher number, as most VMs have low CPU Contention. Most VMs have a low value because your ESXi has enough cores to serve quite a number of VMs.
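A quick bit of arithmetic shows how much an average can hide. The sketch below assumes the host value is a plain average of its VMs and that all but one VM sit at the same low value. The VM count and the 1% typical value are made-up numbers, not taken from the screenshot.

```python
def implied_outlier(host_average, vm_count, typical_value):
    """Value one VM must have if the other VMs all sit at typical_value."""
    return host_average * vm_count - typical_value * (vm_count - 1)


# Hypothetical host: 10 VMs, 9 of them at 1% contention, host average 9%.
print(implied_outlier(host_average=9.0, vm_count=10, typical_value=1.0))
```

With these assumed numbers, the single busy VM would be at 81% contention while the host only shows 9%.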

Properties now accompany metrics

One widget customers use heavily is the Object List widget. It can list any object along with its metrics. In 6.3, you can now list its properties too. This makes it a lot more useful.

Heat Map: Zoom and Grouping

I use heat maps a lot, especially in Configuration and Capacity use cases. They are also useful in a NOC (big screen or projector). They are not so useful in performance, as they can only show the latest value. Since vR Ops collects data every 5 minutes, anything beyond the latest 5-minute sample cannot be shown.

The other limitation of Heat Map, which is addressed in 6.3, is scalability. When you have lots of objects, it can be difficult to see. 6.3 groups the objects, and allows you to drill down.

I then drilled down into the selected group. It reveals a lot more objects.

Sportier looking

I’m a big fan of UI and UX. While underlying architecture matters, the human experience is what we see every time we deal with the system. There are 3 UI enhancements that I spotted as I compared 6.2.1 widgets with 6.3 widgets.

Scoreboard widget

The Scoreboard widget now provides more visual themes than just 2 themes. This is useful when you have multiple Scoreboard widgets in 1 screen. You can use 1 theme for VM and another theme for Infrastructure objects. They help in differentiating objects easily.

There is a small usability enhancement. When you choose Fixed View, the size controls do not appear as it’s not relevant. Choose Fixed Size and they will appear.

Scoreboard Health widget

Here is what it looks like in 6.2. Notice the font for the object name is not so clear. It does not work well if you need to show it in the NOC (big screen or projector). The other problem is that long names are truncated. The names of some objects, such as Disk Device and NSX port group, are very long.

Notice the border? Yup, I’m not a big fan either 🙂 Personally, I prefer not to see the border. I use this widget to see a lot of objects, so the border does get in the way.

Here is what it looks like in 6.3. I definitely find this more usable. Thank you UX team!

Forensic widget

I use forensic widget to quickly know where an object spends 95% of its time. The chart below shows that the ESXi has barely any CPU stress. 95% of the time, the value is not even 0.002%. Once you get used to this widget, it’s a great complement to other visualisation.

As you can see above, in 6.2 the UI is looking a little dated.

This is what it looks like in 6.3. Notice the grid lines make it easier to read. There is also peak and low, so it’s easier to see the minimum and maximum.

GUI Editor for XML interaction

No more manually modifying the XML file and figuring out what the metric names are! There is now a wizard that guides you along the way.

Once you select the Adapter Kind, the wizard automatically moves into the Resource Kind. No more typing!

Maintenance Schedule

The maintenance schedule has more flexibility. A few limitations in 6.2 that were addressed in 6.3:

You cannot specify the start date. You can only specify the start time.

You cannot specify the expiry date on this schedule. Often you want to schedule only for a fixed period, such as a few months or weeks.

You cannot specify the number of runs. Sometimes you want to specify that you only need to run this a few times.

As a comparison, here is what the maintenance schedule editor looks like in 6.2:

6.3 addresses the above limitations, as you can see in the following screenshot.

Note: The new Maintenance Scheduler is not backward compatible. All previously created maintenance schedules will no longer be available and should be created again.
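To make the three additions concrete (start date, expiry date, number of runs), here is a minimal Python sketch that lists the run dates of a weekly schedule. It is not the vR Ops scheduler; the function name, parameters, and dates are assumptions for illustration.

```python
from datetime import date, timedelta


def weekly_runs(start, expires=None, max_runs=None):
    """List weekly run dates from a start date, bounded by expiry or run count."""
    if expires is None and max_runs is None:
        raise ValueError("give an expiry date, a number of runs, or both")
    runs = []
    current = start
    while True:
        if expires is not None and current > expires:
            break
        if max_runs is not None and len(runs) >= max_runs:
            break
        runs.append(current)
        current += timedelta(weeks=1)
    return runs


# Start on a chosen date and stop after 3 runs.
print(weekly_runs(date(2016, 9, 5), max_runs=3))
# Start on a chosen date and stop at an expiry date.
print(weekly_runs(date(2016, 9, 5), expires=date(2016, 9, 20)))
```

With only a start time, as in 6.2, neither of these bounded schedules can be expressed.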

New VM properties

VM folder and VM Datastore are now available via the View widget. If a VM has >1 datastore, it will show all of them, separated by commas. If you have a nested folder, it will show all of them too.
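If you export the view and process it elsewhere, the comma-separated value is easy to split back into a list. A small Python sketch follows; the datastore names are made up, and I am assuming the separator is a comma with optional spaces around it.

```python
def split_property(value):
    """Split a comma-separated property value into a clean list of names."""
    return [item.strip() for item in value.split(",") if item.strip()]


print(split_property("ds-gold-01, ds-gold-02"))
print(split_property("ds-silver-01"))
```

This breaks if a datastore name itself contains a comma, so check your naming convention first.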

Steps (Details)

Follow the names exactly. They are hardcoded in the dashboards.
Names are Case Sensitive!
If you do not follow them, the import will work, but you will get an hourglass icon.

Part 1: Policy and Metrics

Import the policy. Choose Skip import to ensure nothing is overwritten. You will actually not overwrite anything as the file you import is a dummy policy. All it has is super metrics.

It should take around 1 minute. You will get this when done.

The purpose of the policy import is merely to import the super metrics. We have to enable them manually. If you are curious about the super metrics you are getting, the list looks something like this:

Once imported, enable the super metrics in your base policy. Yes, you can bulk enable by selecting multiple lines (as shown below). Use the Actions menu to enable them all.

After you import the Performance SLA super metrics, review their settings. Do adjust the SLA accordingly if you know the performance of your IaaS. If you are running the Balanced power management policy, change the CPU SLA to 10, 20, 30 accordingly.

Create 1 policy for each Tier. This has to be based on your active policy, so the inheritance works properly. In the example below, my base policy is called OneCloud Default Policy. Make sure you choose the right one.

You must use the following names for the Policy:

Tier 1

Tier 2

Tier 3

Enable the correct SLA for each tier. In the example below, I’m enabling Tier 2. From the big red number 1, you can see I’m editing a policy named Tier 2. You can see it’s being selected in the background, behind the dialog box.

See the big red number 2: It shows the Performance SLA that should belong to Tier 2. As a result, I only enabled them (see the big red number 3). The easiest is to specify “Tier 2” in the filter, so only Tier 2 super metrics are shown.

I do not enable the super metrics for Tier 1 (see the big red number 4).

Here is another example. This time I’m using version 6.6:

Click Save to end the editing.

Part 2: Group Type and Group

Create these group types carefully:

Class of Service

VM Types

Tenants

Multi-tier Applications

Single-tier Applications

Application Tier

Your group import will fail if you do not have the group type.
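Since the names are matched exactly, it is worth double-checking them before importing. The Python sketch below compares what you have typed against the required list. The `existing` list is a hypothetical example of what you might have created; vR Ops does not provide this script.

```python
REQUIRED_GROUP_TYPES = [
    "Class of Service",
    "VM Types",
    "Tenants",
    "Multi-tier Applications",
    "Single-tier Applications",
    "Application Tier",
]


def missing_names(existing, required):
    """Return required names with no exact, case-sensitive match."""
    return [name for name in required if name not in existing]


# Hypothetical state after typing the names by hand.
existing = [
    "Class of Service",
    "VM types",  # wrong case, so it does not count
    "Tenants",
    "Multi-tier Applications",
    "Single-tier Applications",
    "Application Tier",
]

print(missing_names(existing, REQUIRED_GROUP_TYPES))
```

Here `VM types` is reported as missing `VM Types`, which is the kind of slip that leads to a failed group import.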

If you mistyped a name and saved it, do not edit it to correct it. Delete it, and create a new one. The reason is that the key is not updated when you edit, only the label.
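The key versus label behaviour can be modelled in a few lines of Python. This is a simplified model of what I described above, not the actual vR Ops data structure; the class and the misspelled name are hypothetical.

```python
class GroupType:
    """Simplified model: the key is fixed at creation, the label can change."""

    def __init__(self, name):
        self.key = name    # set once, used when matching imported groups
        self.label = name  # what you see in the UI

    def rename(self, new_label):
        self.label = new_label  # the key stays as originally typed


mistyped = GroupType("Tenant")  # should have been "Tenants"
mistyped.rename("Tenants")      # looks right in the UI now

registry = {mistyped.key: mistyped}
print("Tenants" in registry)    # the import still cannot find it

# Delete and recreate instead, so key and label agree.
registry = {gt.key: gt for gt in [GroupType("Tenants")]}
print("Tenants" in registry)
```

After the rename the label reads correctly, yet a lookup by key still fails until the group type is recreated.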

Once created, import the groups.

For the Service Tiers groups, you need to associate them to the correct policy. To do that, edit the group, and choose the respective policy. The following example shows for Tier 2.

Do the same steps for Tier 1 (Gold) and Tier 3 (Bronze).

BTW, you can also assign the policy to its associated group via the policy library. Your choice. Below is an example. Use the green plus sign, as I circle it below:

You know the policy is associated when it appears in the Active Policies. The screenshot below shows I’ve activated all 3 Tiers.

Part 3: View and Dashboard

Import the view, then the dashboard. Choose Overwrite if you’re importing for the 2nd time, or have the old OYW views/dashboards.

The list shown below is partial. There are >100 in total. I use the View widget as it is flexible.

Import the Dashboards. You can import them in any order. When you are done, it looks something like this.

XML Files

Recreate the XML files. They cannot be imported. I use copy and paste, even for the file names.

Once imported, take your well deserved coffee break! If you have a large environment, it can take an hour for all the dashboards, super metrics, policies, and groups to be applied. During the process, you may see the known error while trying to open a dashboard. Just wait an hour or so.

When things go wrong

If your dashboard has an hourglass icon, it’s likely because a metric or object is missing. The root cause is likely a missing group.

You should not need to do any of these things. But if things go wrong, there are a couple of things you can check. First, ensure each Policy actually applies to the correct object. For example, you can see below that I’ve applied the policy named Tier 2 to a group called Tier 2. Under the Assigned Groups column, it shows it’s being applied to 1 group and it impacts 302 objects.

The same goes with super metrics. In the following example, a super metric is being applied to Tier 2 policy. It’s not applied to other policies, as it does not make sense.

If the import fails, you will see the error message. Simply rename the duplicate object, then reimport.

You cannot re-import over an existing object. The reason is the ID remains the same. Delete the existing object, then reimport. It is safe to delete.

Hope you find the material useful. If you do, go back to the Main Page. It gives you the big picture so you can see how everything fits together.

I try not to duplicate videos already made by others, and will link to theirs instead if I find theirs to be relevant. In general, the videos are applicable to all 6.x versions as the features I used are fairly basic.

My video has no sound for 2 reasons. I tend to make wrong pronunciation and my England isn’t exactly clear. I added music from YouTube as compensation, hope you like it 🙂

The workshop does not cover Installation & Configuration as there are many materials covering it, plus I’m only given around 4 hours to cover both vRealize Operations and Log Insight 🙂

Here are the videos so far:

How to determine if a VM slowness is not caused by your shared infrastructure

Matthias Eisner shows you how to use Tags to group objects. He shows how to create tags, and he creates a custom dashboard to show an application. Justin, our UI Architect, also shares about Custom Dashboard.