vSphere visibility for Storage Team

Ask any Storage Team and Platform Team whether the collaboration between them can be improved by a mile, and you are likely to get a nod. One reason is the lack of common visibility. You need to see the same thing if you want to collaborate. The Storage Team does not always get access to vSphere. Even if they do, the vCenter UI is not designed for the Storage Team. It is designed for the VMware Admin.

vRealize Operations and Log Insight can bridge that gap by providing a set of read-only, purpose-built dashboards that answer common questions such as:

When a VM Owner complains, can we rule out storage as the cause within 1 minute?

No ping pong between VM Owner, vSphere Admin, Storage Admin

Is the Storage serving all the VMs well?

If not, who are affected, when and how bad? Read or Write?

The answer has to be tier based, as Tier 1 VM expects lower latency than Tier 3

What’s the total demand hitting the array? Are they growing fast?

Who are the heavy hitters among the VMs?

When & where are we running out of capacity?

How much disk space can be reclaimed? From which VMs?

What have we got?

Are they consistently configured?

The questions above cover the main areas of SDDC Operations, such as performance, capacity, configuration and availability. They enable joint troubleshooting, capacity planning, performance monitoring. For better collaboration, add Blue Medora TVS, so you can analyze physical arrays and fabrics, and then correlate back with vSphere.

Overview

The first dashboard provides overall visibility to Storage team. It gives insight into the SDDC by showing relevant objects.

It quickly shows a summary of key information.

It shows VMs, datastores, datastore clusters, compute clusters, and datacenters. It shows their relationships, which you can interact with and drill down into.

It shows all the VMs, where they are located, how much space they are allocated, and how they are using it.

Limitations & Customisation:

No RDM. Customers are moving away from it, and not many were using it to begin with.

Performance

The purpose of the vSphere platform is to run VMs. So long as it is providing good service to the VMs, we don’t have to explain what is underneath. Whether 10,000 IOPS at the VM level translates into 8,000 at the hypervisor level due to caching at the host is not as important as whether the VMs are being served well (as defined by the SLA).

So we need to find out the latency of every single VM in the vCenter. This is nearly impossible to establish in vCenter, as you have to go through a lot of VMs. vR Ops helps by using super metrics.
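As a rough illustration of the per-VM latency check described above, the sketch below finds the worst latency across all VMs and lists the VMs that exceed the limit for their tier. It is plain Python, not vR Ops super metric syntax. The tier limits, the `VmSample` type, and the sample VM names are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical per-tier latency limits in milliseconds (assumed values,
# not taken from any product default). Tier 1 expects the lowest latency.
TIER_LATENCY_LIMIT_MS = {1: 10.0, 2: 20.0, 3: 30.0}


@dataclass
class VmSample:
    """One latency sample for a VM (hypothetical structure)."""
    name: str
    tier: int
    read_latency_ms: float
    write_latency_ms: float


def worst_latency_ms(vms):
    """Highest read or write latency seen across all VMs."""
    return max(max(vm.read_latency_ms, vm.write_latency_ms) for vm in vms)


def vms_breaching_tier(vms, limits=TIER_LATENCY_LIMIT_MS):
    """Return (name, 'read' or 'write', latency) for each VM over its tier limit."""
    breaches = []
    for vm in vms:
        limit = limits[vm.tier]
        if vm.read_latency_ms > limit:
            breaches.append((vm.name, "read", vm.read_latency_ms))
        if vm.write_latency_ms > limit:
            breaches.append((vm.name, "write", vm.write_latency_ms))
    return breaches


# Made-up sample data for illustration only.
samples = [
    VmSample("db-01", 1, 4.0, 12.5),
    VmSample("web-01", 2, 6.0, 8.0),
    VmSample("test-01", 3, 25.0, 28.0),
]

print(worst_latency_ms(samples))
print(vms_breaching_tier(samples))
```

Note that the Tier 3 VM has the highest latency in absolute terms, yet only the Tier 1 VM is flagged. This is why the answer has to be tier based.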

Overall Performance

This set of dashboards answers questions such as:

What’s the overall performance, for each cluster and datastore? No point troubleshooting a VM or ESXi if the overall array is heavily hit.

What’s the total demand hitting the storage system? Who are the heavy hitters?

When a cluster is not performing, do we know when and which VMs were affected? Looking at cluster is useful as that’s where the demand comes from.

Note that Total Latency is not Read latency plus Write latency. In IP storage, read and write also do not map to Rx and Tx. Both are counted as Tx, as the ESXi host is the one sending the packets.

Datastore Performance

As this is for the Storage Team, we can drill down to a specific datastore. The dashboard provides detailed line charts of the datastore latency, throughput, outstanding IO and IOPS.

It also shows the VMs in the datastore, and whether any of them is generating a lot of IOPS (a villain VM).

Heavy Hitters

The performance problem could be caused by high overall load. The dashboard shows you the total IOPS and total throughput. If the numbers are high, you can drill down to see if there are heavy hitters.

What is a heavy hitter? It has to be defined.

What: IOPS or Throughput?

A VM with large block size can generate high throughput without doing excessive IOPS.
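The point about block size can be made concrete with a little arithmetic: throughput is IOPS multiplied by block size. The sketch below, using made-up VM names and numbers, ranks the same three VMs by IOPS and by throughput to show that the two definitions of a heavy hitter can give different answers.

```python
def throughput_mb_per_s(iops, block_size_kb):
    """Throughput in MB/s, given IOPS and the average block size in KB."""
    return iops * block_size_kb / 1024.0


def top_n(vms, metric, n=2):
    """Names of the n VMs with the highest value of the given metric function."""
    ranked = sorted(vms, key=metric, reverse=True)
    return [vm["name"] for vm in ranked[:n]]


# Hypothetical VMs; the IOPS and block sizes are illustrative only.
vms = [
    {"name": "oltp-01", "iops": 4000, "block_kb": 4},
    {"name": "file-01", "iops": 1000, "block_kb": 64},
    {"name": "backup-01", "iops": 500, "block_kb": 256},
]

by_iops = top_n(vms, lambda vm: vm["iops"])
by_throughput = top_n(vms, lambda vm: throughput_mb_per_s(vm["iops"], vm["block_kb"]))

print(by_iops)
print(by_throughput)
```

Here the backup VM has the lowest IOPS but the highest throughput (500 x 256 KB = 125 MB/s), while the OLTP VM leads on IOPS yet moves under 16 MB/s.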

Storage as a Service

When a VM owner complains, can we rule out within 1 minute whether Storage is the issue?

Using the following dashboard, you select or browse for the VM in question. Its key storage properties and KPIs are shown automatically. We use line charts because the problem might have happened in the past and no longer be present. You can also verify whether it is a one-off or a recurring issue.

The VM’s datastore will be shown automatically. The VM in the screenshot has its VMDK files in 3 different datastores. You can click on each, and the performance will be automatically shown. This lets you verify if the underlying datastore was able to cope or not.

Limitation and Customisation:

The dashboard does not show Throughput. Throughput matters more with large block sizes. Block sizes of 4 to 32 KB should not be a problem when IOPS is low.

The dashboard does not show Outstanding IO. This is useful to tell whether the underlying infrastructure is unable to keep up with the IO being issued.

Consider adding snapshot information. Latency for a VM with snapshots will be higher, as each IO has to go through multiple operations.
