Journey to the Virtual World


[Update: Gal Zellermayer from the vRealize Operations product team corrected me. In End Point, you can only change the interval at the resource level, not per individual metric. Thank you, Gal, for the expertise!]

Following the previous blogs, where I covered the End Point agent installation, I will now cover how to enable additional metrics to be collected. vRealize Operations 6.1 comes with hundreds of metrics and properties. Not all of them are enabled, meaning their data is not being collected. You can customise what vRealize Operations collects by modifying the policy. Go to the Policy Library screen, as shown below.

From there, edit the policy you want by selecting it, then click on the edit (pencil) icon on top. A large dialog box, called Edit Monitoring Policy, opens. From here, go to Step 5. Collect Metrics and Properties. You will have something that looks like the following:

In my instance, it has 44,075 metrics, properties and supermetrics. That’s a lot of information that vRealize Operations can potentially collect and analyse for you. Certainly, you do not need most of them. In a large-scale implementation of vRealize Operations, I recommend you disable what you do not need. This will improve both performance and usability.

Back to End Point Operations. Click on the Object Type drop down. From there, go to EP Ops Adapter. Expand it, as I have shown below. You can see AIX, HPUX, etc. Scroll down until you see the Guest OS that you need. In my case, I’m interested in Windows, and I have selected that.

Can you see how many metrics and properties vRealize Operations has for Windows?

Yes, that’s 460. That’s a lot of information. We now have great visibility inside Microsoft Windows.

Browse through the list for what you are after. They are Windows counters, although the name may be different from what Windows calls it. You can also use the Filter to narrow the list. Enable the metrics or properties that you like. Properties are typically used for Configuration, while metrics are used for Performance or Capacity.

You can enable multiple lines at the same time. In the following example, I have selected a mix of properties and metrics. Go to Actions, choose State, then Enable.

Go ahead and enable what you need. You can move to the next set of metrics without leaving the dialog box. At the end, click the Save button to close the dialog box.

Once you enable them, they will appear on the Windows objects within minutes. In the following screenshot, I have added Memory Commit Limit and Memory Committed Bytes, as I think they are good indicators of whether Windows needs more RAM.

You can also monitor the EP Agent itself. From the screenshot, you can see that it’s collecting more metrics now. I actually added something like 100 metrics, as I’m curious to see what happens to the agent performance 🙂 The good news is the JVM Free Memory remains constant. It did not drop drastically. The EP Agent uses Java. I’ve also verified that the JVM Total Memory is 24 MB. So we’re good here.

BTW, isn’t the icon cool? 🙂

At this juncture, you may ask which counter is missing. One that I can think of is CPU Run Queue.


As I shared in my book, it is critical to know what those counters in vCenter and vRealize Operations mean. This enables you to pick the right counters for the right purpose. It also leads to correct interpretation.

In this article, let’s take storage disk space capacity. I added the word space because storage has two kinds of capacity: IOPS capacity and disk space.

Let’s start with vCenter, as that’s the source and foundation. I’m using a datastore cluster, which has 3 datastores. Each has 1 TB, mapped to a 1 TB LUN. Let’s verify what contributes to the Free space column.

To do that, let’s add all the VMs. Hmm… they do not add up to what I saw at Datastore level. Something does not tally.

Can you guess 4 reasons contributing to this discrepancy?

Let’s browse the datastore. We found the first reason. I have non-VM objects. In this case, I have ISO files.

I mentioned that there are 4 reasons. Can you guess the other 3? The following screenshot explains the next 2 reasons.

The following screenshot shows the 4th reason. That particular VM has its CDROM coming from another datastore.

Once I addressed the above 4 reasons, the totals tally. This confirms what I thought: the Free column is based on Thin Provisioning.

Now that we know exactly what values we have in vCenter, we can go to vRealize Operations. We then pick the metrics that match what we have in vCenter. This normally involves some trial and error. Here are the counters you should use:

Let’s review the counters further. I added a 200 GB thin provisioned vmdk and a 100 GB thick provisioned vmdk, so the total is 300 GB. vRealize Operations reflected this, as shown above. The Used Space (GB) metric went up by 100 GB, proving that it is based on Thin Provisioning. The Total Provisioned Consumer Space (GB) metric went up by 300 GB.
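To make the arithmetic explicit, here is a minimal Python sketch of that test. It is only a model of the reasoning, not how vRealize Operations computes anything. The `Vmdk` class and the two helper functions are names I made up for illustration, and I am assuming a freshly created thin disk has nothing written to it yet.

```python
from dataclasses import dataclass


@dataclass
class Vmdk:
    """Hypothetical model of a virtual disk, for illustration only."""
    provisioned_gb: float
    thin: bool
    written_gb: float = 0.0  # assumption: a new disk has no data written yet


def used_space_gb(disk: Vmdk) -> float:
    # A thick disk reserves its full size up front.
    # A thin disk only consumes what has actually been written.
    return disk.written_gb if disk.thin else disk.provisioned_gb


def provisioned_space_gb(disk: Vmdk) -> float:
    # The provisioned number ignores thin vs thick and counts the full size.
    return disk.provisioned_gb


# The two disks from the test: 200 GB thin and 100 GB thick.
new_disks = [Vmdk(200, thin=True), Vmdk(100, thin=False)]

used_delta = sum(used_space_gb(d) for d in new_disks)
provisioned_delta = sum(provisioned_space_gb(d) for d in new_disks)

print(f"Used Space went up by {used_delta:.0f} GB")
print(f"Provisioned space went up by {provisioned_delta:.0f} GB")
```

The thin disk contributes nothing to the used number until the Guest OS writes to it, which is why only the thick 100 GB shows up there.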

Do not use the following counters as the collection is less frequent:

Disk Space | Freespace (GB)

Disk Space | Total Used (GB)

Disk Space | Provisioned Space (GB)

As you can see below, their values are correct, but they are not updated as frequently.

Summary:

To see the total capacity in your datastore, use Capacity | Total Capacity (GB)

To see the space consumed in your datastore, use Capacity | Used Space (GB)

If you prefer to see the consumption number in %, use Capacity | Used Space (%)

To see the free space in your datastore, use Capacity | Available Space (GB)

Now, the above is based on Thin Provision numbers. If you are doing your planning based on the Thick Provision number, use Capacity | Total Provisioned Consumer Space (GB). But take note that this number does not include non-VM files (e.g. ISO) and VMs that are not registered to vCenter. The following screenshot proves that it does not.
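If it helps to see how the summary counters relate to each other, here is a small Python sketch. It is my own illustration with made-up sample values. I am assuming Available Space is simply Total Capacity minus Used Space, and that the percentage is Used over Total. The function name and the dictionary keys are hypothetical, not vRealize Operations identifiers.

```python
def datastore_summary(total_capacity_gb, used_space_gb, provisioned_consumer_gb):
    """Derive the thin-based and thick-based views from three input numbers."""
    if total_capacity_gb <= 0:
        raise ValueError("total capacity must be positive")
    return {
        # Thin Provision view: what is physically consumed right now.
        "available_space_gb": total_capacity_gb - used_space_gb,
        "used_space_pct": 100 * used_space_gb / total_capacity_gb,
        # Thick Provision view: what has been promised to the VMs.
        # This can exceed 100% when the datastore is overcommitted.
        "provisioned_pct": 100 * provisioned_consumer_gb / total_capacity_gb,
    }


# Sample values only: a 1 TB datastore, 256 GB consumed, 1536 GB provisioned.
summary = datastore_summary(1024, 256, 1536)
for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```

In this made-up example the datastore is only 25% full on the Thin Provision view, but 150% committed on the Thick Provision view. That gap is why you need to decide which number you plan on.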

The above works well for a datastore. What about at the Datastore Cluster level, since this is where you should be doing your capacity management?

There are fewer counters, so we need to use super metrics.

To see the total capacity in your datastore cluster, do a super metric to Sum (Datastore: Capacity | Total Capacity (GB))

To see the space consumed, based on Thin Provision, use Disk Space | Total Used (GB)

To see the free space, based on Thin Provision, do a super metric to
Sum (Datastore: Capacity | Available Space (GB))

To see the total space consumed, based on Thick Provision, do a super metric to
Sum (Datastore: Capacity | Total Provisioned Consumer Space (GB))
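To show what those Sum super metrics do conceptually, here is a short Python sketch. This is not super metric syntax and it does not call any vRealize Operations API. It simply rolls up per-datastore values to the cluster. The datastore names and all the numbers are invented for illustration; only the 3 x 1 TB layout matches my lab.

```python
# Hypothetical per-datastore values for a cluster of three 1 TB datastores.
datastores = [
    {"name": "ds-01", "Total Capacity (GB)": 1024, "Available Space (GB)": 400,
     "Total Provisioned Consumer Space (GB)": 900},
    {"name": "ds-02", "Total Capacity (GB)": 1024, "Available Space (GB)": 300,
     "Total Provisioned Consumer Space (GB)": 1200},
    {"name": "ds-03", "Total Capacity (GB)": 1024, "Available Space (GB)": 500,
     "Total Provisioned Consumer Space (GB)": 700},
]


def sum_metric(children, metric_name):
    """Add up one metric across all child datastores of the cluster."""
    return sum(ds[metric_name] for ds in children)


cluster_total = sum_metric(datastores, "Total Capacity (GB)")
cluster_free = sum_metric(datastores, "Available Space (GB)")
cluster_provisioned = sum_metric(datastores, "Total Provisioned Consumer Space (GB)")

print("Cluster total capacity (GB):", cluster_total)
print("Cluster free space, thin (GB):", cluster_free)
print("Cluster provisioned, thick (GB):", cluster_provisioned)
```

The super metric does the same roll-up inside vRealize Operations, so the result becomes a counter on the Datastore Cluster object that you can chart and alert on.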

Storage

If you look at the ESXi and VM metric groups for storage in the vCenter performance chart, it is not clear at first glance how they relate to one another. You have storage network, storage adapter, storage path, datastore, and disk metric groups that you need to check. How do they impact one another?

I have created the following diagram to explain the relationship. The beige boxes are what you are likely to be familiar with. You have your ESXi host, and it can have NFS Datastore, VMFS Datastore, or RDM objects. The blue colored boxes represent the metric groups.

NFS and VMFS datastores differ drastically in terms of counters, as NFS is file-based while VMFS is block-based. For NFS, it uses the vmnic, and so the adapter type (FC, FCoE, or iSCSI) is not applicable. Multipathing is handled by the network, so you don’t see it in the storage layer. For VMFS or RDM, you have more detailed visibility of the storage. To start off, each ESXi adapter is visible and you can check the counters for each of them. In terms of relationship, one adapter can have many devices (disk or CDROM). One device is typically accessed via two storage adapters (for availability and load balancing), and it is also accessed via two paths per adapter, with the paths diverging at the storage switch. A single path, which will come from a specific adapter, can naturally connect one adapter to one device. The following diagram shows the four paths:

A storage path takes data from ESXi to the LUN (the term used by vSphere is Disk), not to the datastore. So if the datastore has multiple extents, there are four paths per extent. This is one reason why I did not use more than one extent, as each extent adds four paths. If you are not familiar with extents, Cormac Hogan explains them well in this blog post.

For VMFS, you can see the same counters at both the Datastore level and the Disk level. Their value will be identical if you follow the recommended configuration to create a 1:1 relationship between a datastore and a LUN. This means you present an entire LUN to a datastore (use all of its capacity).

The following screenshot shows how we manage the ESXi storage. Click on the ESXi you need to manage, select the Manage tab, and then the Storage subtab. In this subtab, we can see the adapters, devices, and the host cache. The screen shows an ESXi host with the list of its adapters. I have selected vmhba2, which is an FC HBA. Notice that it is connected to 5 devices. Each device has 4 paths, so I have 20 paths in total.
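The path counts are simple multiplication, and a small Python sketch makes it easy to see how quickly they grow. This is just arithmetic on the typical layout described earlier (two adapters per device, two paths per adapter). The function names and default values are my own, so change them to match your environment.

```python
def paths_per_device(adapters_per_device=2, paths_per_adapter=2):
    """Typical layout: 2 adapters x 2 paths per adapter = 4 paths per device."""
    return adapters_per_device * paths_per_adapter


def total_paths(device_count, paths_each=None):
    """Total number of paths for a given number of devices (LUNs)."""
    if paths_each is None:
        paths_each = paths_per_device()
    return device_count * paths_each


def datastore_paths(extent_count):
    """A path goes to the LUN, so every extent brings its own set of paths."""
    return extent_count * paths_per_device()


print("Paths per device:", paths_per_device())
print("Paths for 5 devices:", total_paths(5))
print("Paths for a datastore with 3 extents:", datastore_paths(3))
```

With the defaults, 5 devices give the 20 paths from my screenshot, and a datastore with 3 extents would already need 12 paths on its own.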

Let’s move on to the Storage Devices tab. The following screenshot shows the list of devices. Because NFS is not a disk, it does not appear here. I have selected one of the devices to show its properties.

If you click on the Paths tab, you will be presented with the information shown in the next screenshot, including whether a path is active. Note that not all paths carry I/O; it depends on your configuration and multipathing software. Because each LUN typically has four paths, path management can be complicated if you have many LUNs.

The story is quite different on the VM layer. A VM does not see the underlying shared storage. It sees local disks only. So regardless of whether the underlying storage is NFS, VMFS, or RDM, it sees all of them as virtual disks. You lose visibility into the physical adapter (for example, you cannot tell how many IOPS on vmhba2 are coming from a particular VM) and the physical paths (for example, how many disk commands traveling on that path are coming from a particular VM). You can, however, see the impact at the Datastore level and the physical Disk level. The Datastore counter is especially useful. For example, if you notice that your IOPS is higher at the Datastore level than at the virtual Disk level, this means you have a snapshot. The snapshot IO is not visible at the virtual Disk level as the snapshot is stored on a different virtual disk.
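The snapshot hint can be expressed as a simple comparison. The Python sketch below is only an illustration of that reasoning. The function name and the 5% tolerance are my own assumptions (two counters sampled separately rarely match exactly), and the IOPS numbers are made up.

```python
def snapshot_io_suspected(datastore_iops, virtual_disk_iops, tolerance=0.05):
    """Return True when a VM's Datastore-level IOPS is clearly higher than
    its virtual Disk-level IOPS, which hints at snapshot IO.

    tolerance is a hypothetical allowance for sampling noise between counters.
    """
    return datastore_iops > virtual_disk_iops * (1 + tolerance)


# Made-up readings for two VMs.
vm_a = snapshot_io_suspected(datastore_iops=500, virtual_disk_iops=420)
vm_b = snapshot_io_suspected(datastore_iops=500, virtual_disk_iops=495)

print("VM A snapshot IO suspected:", vm_a)
print("VM B snapshot IO suspected:", vm_b)
```

Treat a True result as a prompt to go and check the VM for snapshots, not as proof on its own.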

Network

My apologies, but I cannot publish the information on Network, as the publisher has not released those pages for free. The information is covered in my book.