Monthly Archives: April 2015

Network

We will cover the network in this blog post. It applies to all tiers, as you should not have dropped packets in any tier, and your network utilisation should be in a healthy range. As the network is normally shared, it’s also easier to monitor per physical data center.

It’s coming to 2017. You should be on 10 GE, as the chance of ESXi saturating 1 GE is not something you can ignore. The chance of ESXi saturating 2x 10 GE links is quite low, unless you run vSphere FT and VSAN (or another form of distributed storage).

To help you monitor, you can create the following:

A line chart showing the maximum network dropped packets at the physical data center level. I use a physical data center as the hosts in it eventually share the same core switches.

A line chart showing the maximum and average ESXi vmnic utilization at the same level as above.

To recap, we need to create the following:

A line chart showing the maximum network dropped packets in the physical DC.

A line chart showing the maximum and average ESXi vmnic utilization in the physical DC.

I use physical data center, not virtual data center. Can you guess why?

It’s easier to manage the network per physical data center. Unless your network is stretched, problems do not span across data centers. Review this excellent article by Ivan, one of the networking industry’s authorities.

The problem is how to choose the ESXi hosts from the same data center. It is possible for a physical data center to have multiple vCenter servers. On the other hand, it is also possible for the vRealize Operations World object, or even a single vCenter, to span multiple physical data centers. So you need to manually determine the right object, so that you get all the ESXi hosts in that physical data center. For example, if you have 1 vRealize Operations instance managing 2 physical data centers, you definitely cannot use the World object, as it spans both data centers.

The screenshot below shows the super metric formula to get the maximum network dropped packets at a vCenter data center object. Notice I use depth=3, as the data center object is 3 levels above the ESXi host object.

I did a preview of the super metric. As you can see above, it’s a flat line of 0. That’s what you should expect: no dropped packets at all from any host in your data center.
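To make the idea concrete, here is a minimal Python sketch of what such a roll-up computes. It is not the vRealize Operations super metric itself. The data center names, host names, sample counts and the `max_dropped_packets` helper are all hypothetical, chosen only to illustrate taking the maximum across every host in a physical data center.

```python
from typing import Dict


def max_dropped_packets(hosts_by_dc: Dict[str, Dict[str, int]]) -> Dict[str, int]:
    """Return the highest dropped-packet count among the hosts of each data center."""
    # default=0 covers a data center with no hosts reporting yet.
    return {dc: max(hosts.values(), default=0) for dc, hosts in hosts_by_dc.items()}


# Hypothetical sample: dropped packets per ESXi host, grouped by physical data center.
sample = {
    "DC-East": {"esxi-01": 0, "esxi-02": 0, "esxi-03": 0},
    "DC-West": {"esxi-11": 0, "esxi-12": 7},
}

result = max_dropped_packets(sample)
for dc, worst in result.items():
    status = "healthy" if worst == 0 else "investigate"
    print(f"{dc}: max dropped packets = {worst} ({status})")
```

Because the healthy value is 0, taking the maximum is enough: a single host with drops lifts the whole data center line off the floor.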

Dropped packets are much easier to track, as you expect 0 everywhere. Utilization is harder. If your ESXi has a mix of 10G and 1G vmnics, generally speaking you would expect the 10G to dominate the data. This is where consistent design & standards matter. Without them, you need to apply a different formula for each configuration of ESXi host.

Let’s look at the Maximum first, then the Average. As I shared in this blog, you want to ensure that not a single vmnic is saturated. This means you need to track it at the vmnic level, not the ESXi host level. Tracking at the ESXi host level, as shown in the following screenshot, can hide the data at the vmnic level. Take an example. Your ESXi has 8 x 1 Gb NICs. You are seeing a throughput of 4 Gbps. At the ESXi host level, it’s only 50% utilized. But that 4 Gbps is unlikely to be spread evenly. There is a potential that one vmnic is saturated while the others are hardly utilized.
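The 8 x 1 Gb example can be worked through in a few lines of Python. This is only an illustrative sketch: the per-vmnic throughput figures are made-up values that happen to add up to 4 Gbps, and the variable names are my own.

```python
# Hypothetical per-vmnic throughput (Mbps) on one ESXi host with 8 x 1 Gb NICs.
VMNIC_CAPACITY_MBPS = 1000

throughput_mbps = {
    "vmnic0": 950, "vmnic1": 900, "vmnic2": 850, "vmnic3": 800,
    "vmnic4": 200, "vmnic5": 150, "vmnic6": 100, "vmnic7": 50,
}

# Host-level view: total throughput against total capacity.
host_throughput = sum(throughput_mbps.values())              # 4000 Mbps = 4 Gbps
host_capacity = VMNIC_CAPACITY_MBPS * len(throughput_mbps)   # 8000 Mbps
host_util_pct = 100 * host_throughput / host_capacity

# vmnic-level view: utilization of each NIC on its own.
vmnic_util_pct = {
    name: 100 * mbps / VMNIC_CAPACITY_MBPS for name, mbps in throughput_mbps.items()
}
busiest_vmnic = max(vmnic_util_pct, key=vmnic_util_pct.get)

print(f"Host level: {host_util_pct:.0f}% utilized")
print(f"Busiest vmnic: {busiest_vmnic} at {vmnic_util_pct[busiest_vmnic]:.0f}%")
```

With these sample numbers the host looks half idle, yet one vmnic is running close to its 1 Gb limit.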

As I shared in this blog, the super metric formula you need to copy-paste is

The above is based on 4 vmnics per ESXi. If you have 2x 10 Gb, then you just need vmnic0 and vmnic1. If you have 6 vmnics, then you have to add vmnic4 and vmnic5.

The above will give you the value per ESXi host. You then need to apply it per physical data center. Please review this blog post.

OK, the above will get us the maximum. We then apply the same approach for the average. The great thing about taking the average at the individual vmnic level is that you do not have to worry about how many vmnics an ESXi host has. If you use the data at the ESXi host level, as shown in the screenshot below, you need to divide the number by the number of vmnics.
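Here is a small Python sketch of the maximum and average taken at the vmnic level across a data center. It assumes you already have a utilization percentage per vmnic. The host names and values are hypothetical, and pooling every vmnic into one list is my own simplification rather than the exact super metric.

```python
from statistics import mean

# Hypothetical vmnic utilization (%) per ESXi host in one physical data center.
# The hosts deliberately have a different number of vmnics.
vmnic_util_pct_by_host = {
    "esxi-01": [40.0, 60.0],
    "esxi-02": [10.0, 20.0, 30.0, 40.0],
}

# Pool every vmnic in the data center, regardless of which host owns it.
all_vmnics = [pct for host in vmnic_util_pct_by_host.values() for pct in host]

dc_max_pct = max(all_vmnics)   # the single busiest vmnic
dc_avg_pct = mean(all_vmnics)  # average across all vmnics, no per-host division needed

print(f"Maximum vmnic utilization: {dc_max_pct:.1f}%")
print(f"Average vmnic utilization: {dc_avg_pct:.1f}%")
```

Note that pooling gives a host with more vmnics more weight in the average. If you prefer every host to count equally, average per host first and then average those results.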

Once you have the Maximum and Average, you want to ensure that the Maximum is not near your physical limit, and the Average is showing a healthy utilization. A number near the physical limit means you have a capacity risk. A number with low utilization means you over-provisioned the hardware.

BTW, there is 1 physical NIC that is not monitored in the above. Can you guess which one?

Yes, it’s the iLO NIC. It does not show up as a vmnic. The good thing is that generally there is very little traffic there, and certainly no data traffic.

If you landed on this Part 4 directly, I’d recommend that you review Part 1 first.

Compute Tier 2 and 3 (lowest)

To recap, you need to create line charts showing the following:

The maximum CPU contention and average CPU contention for all VMs in the cluster

The maximum RAM contention and average RAM contention for all VMs in the cluster

The total number of VMs left in the cluster.

The screenshot below shows the super metric formula to get the Maximum CPU Contention of all the VMs in the cluster. To create the Average CPU Contention super metric, you just need to replace the string Max with Avg in the formula.

The screenshot below shows the super metric formula to get the Maximum Memory Contention of all the VMs in the cluster. To create the Average RAM Contention super metric, you just need to replace the string Max with Avg in the formula.
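As a rough illustration of what the Max and Avg variants return, here is a minimal Python sketch. It is not the super metric syntax. The VM names, the contention percentages and the `max_and_avg` helper are hypothetical.

```python
from statistics import mean
from typing import Dict, Tuple


def max_and_avg(contention_pct_by_vm: Dict[str, float]) -> Tuple[float, float]:
    """Return (maximum, average) contention across all VMs in a cluster."""
    values = list(contention_pct_by_vm.values())
    return max(values), mean(values)


# Hypothetical contention (%) per VM in one cluster.
cpu_contention = {"vm-a": 0.5, "vm-b": 2.0, "vm-c": 6.5}
ram_contention = {"vm-a": 0.0, "vm-b": 0.0, "vm-c": 3.0}

cpu_max, cpu_avg = max_and_avg(cpu_contention)
ram_max, ram_avg = max_and_avg(ram_contention)

print(f"CPU contention: max {cpu_max}%, avg {cpu_avg}%")
print(f"RAM contention: max {ram_max}%, avg {ram_avg}%")
```

The maximum tells you whether any single VM is suffering, while the average tells you whether the cluster as a whole is under pressure.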

That’s all you need to get the first 2 line charts, out of the 5 that you need.

To get the “Total number of VMs left in the cluster”, refer to Part 3, as it is the same formula. You just have a different threshold.

Here is what the resultant dashboard looks like:

In the next post, I will cover the network. It applies to all tiers, as you should not have dropped packets in any tier, and your network utilisation should be in a healthy range. As the network is normally shared, it’s also easier to monitor per physical data center.

Part 1 explained a new concept, where we use Contention as the basis of Capacity Management. Part 2 provided the super metric equation for each chart. Part 3 will provide examples of the super metric formulas and dashboard screenshots.

Compute Tier 1 (no over-commit)

To recap, we are implementing the dashboard shown here. We need to create line charts showing the following:

The total number of vCPU left in the cluster.

The total number of vRAM left in the cluster.

The total number of VMs left in the cluster.

The screenshot below shows the super metric formula to get the total number of vCPU left in the cluster.

Supply = No of Physical Cores in Cluster x ((No of Hosts – 1) / No of Hosts)

Demand = No of running vCPU in cluster

I have to assume there is 1 HA host in the cluster. If you have 2, replace 1 with 2 in the formula above.
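The Supply and Demand arithmetic above can be sketched in Python as follows. I am assuming that "vCPU left" is simply Supply minus Demand, which fits a tier with no over-commit (1 vCPU per physical core). The `vcpu_left` function, its parameter names and the sample cluster are hypothetical.

```python
def vcpu_left(physical_cores: int, hosts: int, running_vcpus: int, ha_hosts: int = 1) -> float:
    """Remaining vCPU capacity in a no-over-commit cluster.

    Supply = physical cores x ((hosts - HA hosts) / hosts)
    Demand = running vCPUs
    """
    if hosts <= ha_hosts:
        raise ValueError("The cluster needs more hosts than HA hosts")
    supply = physical_cores * (hosts - ha_hosts) / hosts
    return supply - running_vcpus


# Hypothetical cluster: 8 hosts x 16 cores = 128 cores, 90 running vCPUs.
left_with_1_ha = vcpu_left(physical_cores=128, hosts=8, running_vcpus=90)
left_with_2_ha = vcpu_left(physical_cores=128, hosts=8, running_vcpus=90, ha_hosts=2)

print(f"vCPU left with 1 HA host: {left_with_1_ha:.0f}")
print(f"vCPU left with 2 HA hosts: {left_with_2_ha:.0f}")
```

Changing `ha_hosts` from 1 to 2 is the code equivalent of replacing 1 with 2 in the formula.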

I have to calculate the supply manually as vRealize Operations does not have a metric for No of Hosts – HA. Actually, it does, but the metric cannot be enabled.

If you find the formula complex, you can actually split it into 2 super metrics first. Work out Supply, then work out Demand. Let me use RAM as an example.

The screenshot below shows the super metric formula to get the total RAM supply. It is the total RAM in the cluster, after we take HA into account. I have to divide the number by 1024, then again by 1024, to convert from KB to GB.
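For reference, the RAM supply calculation can be sketched in Python like this. It assumes the cluster RAM is reported in KB and that 1 host is reserved for HA. The `ram_supply_gb` function and the sample cluster are hypothetical, not taken from the screenshot.

```python
def ram_supply_gb(total_ram_kb: int, hosts: int, ha_hosts: int = 1) -> float:
    """Usable cluster RAM in GB after setting aside the HA hosts."""
    usable_kb = total_ram_kb * (hosts - ha_hosts) / hosts
    # KB -> MB -> GB: divide by 1024 twice.
    return usable_kb / 1024 / 1024


# Hypothetical cluster: 4 hosts x 256 GB each, expressed in KB.
total_kb = 4 * 256 * 1024 * 1024
supply_gb = ram_supply_gb(total_kb, hosts=4)

print(f"RAM supply after HA: {supply_gb:.0f} GB")
```

Previewing the result against a number you can work out by hand (here, 3 of the 4 hosts at 256 GB each) is a quick way to verify the unit conversion.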

Notice I always preview it. It’s important to build the habit of always verifying that your formula is correct.

Once the Supply side is done, I work on the Demand side. Demand here does not refer to the Demand metric in vRealize or vCenter. It is simply the word demand in the dictionary sense, which is request/order/need/want. It’s demand as in “supply & demand.” The following screenshot shows the demand.

Once I verified that both are correct, it’s a matter of combining them together.