Journey to the Virtual World

What you architect is SDDC. What you hand over to the CIO as a business result is IaaS. We can assess whether the architecture is good based on the actual result in production. Does it result in fire-fighting and blame-storming? Or do you have peaceful operations?

The litmus test below helps you assess the maturity of your IaaS.

Do your customers blame your infrastructure?

If the answer is yes, take a step back and ask yourself why. There is a high chance you’re relying on complaints in your operations, so you actually encourage them. No complaint, no problem. That is Complaint-based Operations.

The reason you rely on complaints is that you don’t have other means. You have not defined the performance of your IaaS.

A sign of mature operations is that you have a Performance SLA. It is per VM, measured every 5 minutes.
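To make the idea of a per-VM Performance SLA concrete, here is a minimal sketch. The metric names and the thresholds are my assumptions for illustration, not product defaults. It counts how many 5-minute samples of a single VM stayed within the SLA.

```python
from dataclasses import dataclass
from typing import List

SAMPLE_MINUTES = 5  # one sample per 5-minute collection cycle


@dataclass
class Sample:
    # Hypothetical metric names, chosen for this sketch only.
    cpu_contention_pct: float
    disk_latency_ms: float


# Hypothetical SLA thresholds; set your own per Service Tier.
SLA = {"cpu_contention_pct": 2.0, "disk_latency_ms": 10.0}


def sample_meets_sla(s: Sample) -> bool:
    """A sample passes only if every metric is within its threshold."""
    return (s.cpu_contention_pct <= SLA["cpu_contention_pct"]
            and s.disk_latency_ms <= SLA["disk_latency_ms"])


def sla_compliance_pct(samples: List[Sample]) -> float:
    """Percentage of 5-minute samples in which the VM was served within SLA."""
    if not samples:
        raise ValueError("no samples to evaluate")
    passed = sum(1 for s in samples if sample_meets_sla(s))
    return 100.0 * passed / len(samples)


# One hour = 12 samples; one of them breaches the CPU threshold.
one_hour = [Sample(0.5, 3.0)] * 11 + [Sample(6.0, 3.0)]
print(f"SLA compliance: {sla_compliance_pct(one_hour):.1f}%")
```

With a number like this per VM, you no longer need to wait for a complaint to know whether a VM was served well.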

Is your IaaS cheaper than both VMware on Amazon and Amazon?

If not, your CIO may question your business value. The reason for having an in-house architect is that you can deliver a lower cost, even after taking your salary into account.

Does Help Desk provide a good first level defense?

If Help Desk simply passes through to the next level, you need to look at why.

Help Desk is your first line of defence. They are not as technical as you are. Equip them with a simple dashboard so they can handle a VM Owner complaint:

Is the problem caused by IaaS not serving the VM well?

If yes, which part of the Infra: CPU, RAM, Disk, Network?

If not, how to prove it convincingly?
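The three questions above can be sketched as a small triage function. The metric names and limits below are hypothetical, picked only to show the logic. An empty result is the evidence that the infrastructure was serving the VM within the agreed limits.

```python
from typing import Dict, List

# Hypothetical per-VM contention limits; not product defaults.
LIMITS: Dict[str, float] = {
    "cpu_ready_pct": 2.5,
    "ram_contention_pct": 1.0,
    "disk_latency_ms": 10.0,
    "network_dropped_pct": 0.1,
}

# Maps each hypothetical metric to the part of the infrastructure it reflects.
COMPONENT: Dict[str, str] = {
    "cpu_ready_pct": "CPU",
    "ram_contention_pct": "RAM",
    "disk_latency_ms": "Disk",
    "network_dropped_pct": "Network",
}


def triage(vm_metrics: Dict[str, float]) -> List[str]:
    """Return the infrastructure parts that are not serving the VM well.

    A metric that is missing from vm_metrics is treated as 0 (no contention).
    """
    return [COMPONENT[name] for name, limit in LIMITS.items()
            if vm_metrics.get(name, 0.0) > limit]


blamed = triage({"cpu_ready_pct": 8.0, "disk_latency_ms": 4.0})
print(blamed or "IaaS is serving this VM within the limits")
```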

Can you justify new infrastructure when utilization is not yet high?

This is not referring to additional money that comes with new project. This is referring to existing clusters/storage.

Capacity is measured on utilization and performance. A cluster’s capacity is full if it can’t serve its VMs well. Since it takes time to buy hardware, you need an early warning to detect this performance degradation.
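As a rough sketch of that idea, the function below classifies a cluster using both utilization and performance. The input names and all the limits are assumptions for illustration; the point is only that a cluster can be "full" while utilization is still moderate.

```python
def capacity_status(utilization_pct: float,
                    vms_not_served_well_pct: float,
                    util_limit_pct: float = 80.0,
                    perf_limit_pct: float = 5.0) -> str:
    """Classify a cluster on both utilization and performance.

    vms_not_served_well_pct is the share of VMs breaching their
    performance thresholds (a hypothetical input for this sketch).
    """
    if vms_not_served_well_pct >= perf_limit_pct:
        # The cluster cannot serve its VMs well, whatever the utilization.
        return "full"
    if (vms_not_served_well_pct >= perf_limit_pct / 2
            or utilization_pct >= util_limit_pct):
        # Degradation has started, or utilization is getting high.
        return "early warning"
    return "ok"


print(capacity_status(utilization_pct=55.0, vms_not_served_well_pct=6.0))
print(capacity_status(utilization_pct=55.0, vms_not_served_well_pct=3.0))
```

The "early warning" state is what gives you time to start the purchase before the cluster is actually full.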

Do you struggle with many over-provisioned VMs?

This is an indicator that you’re operating as a System Builder as opposed to a Service Provider.

As a System Builder, you’re meddling with each System (read: Application). You size them, and argue with application team.

As a Service Provider, you’re not “in the way”. IT simply uses an effective pricing model to drive the right behaviour. Does AWS block you when you buy a 40-CPU EC2 VM when you only need 2 CPUs?

Does Troubleshooting mean all hands on deck?

Do you have a process that is followed by all teams (network, storage, server, OS, application)? Does that process end with Root Cause Analysis?

As part of RCA, do you set up an alert so the issue can be detected faster if it happens again?

There are more questions, but I thought we would start with those. If you want to see details, download this.

Dashboard: Performance

That’s the key question that you need to answer. You need to show if the clusters are coping well. Show how the clusters are performing in terms of CPU, RAM and Disk.

The above dashboard is per Service Tier. Do you know why?

Yes, the threshold differs for each tier. What is acceptable in Tier 3 may not be acceptable in Tier 1.

The good thing about a line chart is that it provides visibility beyond the present time. You can show the last 6 hours and still get good detail. Showing more than 24 hours results in a visualization that is too static, which does not suit the NOC use case.

Limitation & customisation:

You need 1 widget per Service Tier.

If you only have a few clusters, you can show multiple Service Tiers in 1 dashboard. 1 row per tier results in simpler visualisation.

In an environment with >10 clusters, we can group them into Service Tiers. Focus more on the highest tier (Tier 1).

In an environment with >100 clusters, we need another grouping in between. Group the Tier 1 clusters by physical location.

When a cluster is unable to cope, is it because it’s having high utilization? I show CPU, RAM and Disk here. You can add Network, as you know the physical limit of the ESXi vmnic.

Disk is tricky for 2 reasons:

It is not in %. There is no 100% for IOPS or Throughput. The good thing is that when you architected the array or vSAN, you did have a certain IOPS number in mind, right? 😉 Well, now you can get the storage vendor to prove that it does indeed perform as per the PowerPoint 😉 If not, you get free hardware if they promised fast performance that would meet the business requirement.

You need to show both IOPS and Throughput. While they are related, you can have low IOPS and high throughput, and vice versa.
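The relationship is simply throughput = IOPS × block size. A quick sketch shows why you need both numbers. The two workloads and their block sizes are made-up examples.

```python
def throughput_mb_per_s(iops: float, block_size_kb: float) -> float:
    """Throughput in MB/s (1 MB = 1024 KB) for a given IOPS and block size."""
    return iops * block_size_kb / 1024.0


# Made-up workloads for illustration.
backup = throughput_mb_per_s(iops=500, block_size_kb=1024)   # few, large IOs
oltp = throughput_mb_per_s(iops=20000, block_size_kb=4)      # many, small IOs

print(f"Backup: 500 IOPS   -> {backup:.1f} MB/s")
print(f"OLTP:   20000 IOPS -> {oltp:.1f} MB/s")
```

The backup job issues 40x fewer IOs yet moves far more data, so a chart showing only IOPS would miss it.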

If the cluster utilization is high, the next dashboard drills into each ESXi.

We can also see if they are unbalanced. In theory, they should not be, if you set DRS to Fully Automated and pick at least the middle sensitivity level (3). DRS in vSphere 6.5 also considers Network, so you can still end up with unbalanced CPU/RAM.

With the dashboard above, we can tell if ESXi CPU Utilisation is healthy or not.

A low value does not mean the VM performs fast. A VM is concerned with the VM Contention metric, not ESXi Utilization. A low value means we over-invested. It is not healthy, as we waste money and power.

A high value means we risk performance (read: contention).

For ESXi, go with a higher core count. You save on licenses if you can reduce the socket count.

We can also tell if ESXi RAM Utilisation is healthy or not.

Customers tend to overspend on RAM and underspend on CPU. The reason is this.

For RAM, we have 2 metrics:

Active RAM

Mapped RAM

The value you want is somewhere in between Active RAM and Mapped RAM.

In the dashboard, the 3 widgets have different ranges. The ranges I set are 30 – 90, 50 – 100 and 10 – 90.

Why not 0 – 100?

It is not 100% because you want to cater for HA. Your ESXi should not run near 100%, because when HA restarts the VMs of a failed host, the demand on the remaining hosts would go beyond 100%, meaning performance will be badly affected.
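The arithmetic behind the HA headroom can be sketched as follows. This assumes identical hosts and a load that is spread evenly, which is a simplification; the function names are mine.

```python
def max_safe_utilization_pct(hosts: int, failures_to_tolerate: int = 1) -> float:
    """Highest cluster utilization that still fits after the given host failures."""
    if hosts <= failures_to_tolerate:
        raise ValueError("need more hosts than failures to tolerate")
    return 100.0 * (hosts - failures_to_tolerate) / hosts


def utilization_after_failure_pct(current_pct: float, hosts: int,
                                  failures: int = 1) -> float:
    """Utilization of the surviving hosts once they absorb the failed hosts' load."""
    if hosts <= failures:
        raise ValueError("no surviving hosts")
    return current_pct * hosts / (hosts - failures)


print(max_safe_utilization_pct(hosts=8))                       # 87.5
print(round(utilization_after_failure_pct(90.0, hosts=8), 1))  # 102.9
```

An 8-node cluster running at 90% looks fine today, but it cannot absorb the loss of one host. That is why the upper end of the range stops short of 100.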

If the cluster or ESXi utilization is high, is it because there are VMs generating excessive workloads?

The dashboard above answers if we have VMs that dominate the shared environment.

CPU: show a heat map showing all VMs, sized by CPU Demand in GHz (not %), color by contention

RAM: show a heat map showing all VMs, sized by Active RAM, color by contention

Storage: show a heat map showing all VMs, sized by IOPS color by latency.

At a glance, we can tell the workload distribution among the VMs. We can also tell if they are being served well or not.

Limitation & customisation:

You need 1 widget per Service Tier.

You can change the threshold anytime. If you want brand new storage from Finance, set the max to 1 ms 😉

I’d focus on the new dashboards, since I was the one designing them. If the dashboards are not meeting your requirements, you now know exactly where to complain 😉 The dashboards were reviewed extensively by Product Managers (Monica Sharma, Ronit Halachmi Bekel) and Sunny Dua.

The dashboards in 6.4 are a subset of the dashboards in the Operationalize Your World program. Around 20% made it. They are also simplified. The reason is we wanted the dashboards to pass The 5-second Test, and to be applicable to the SMB segment. We also wanted to get more feedback from real-life environments before bringing additional and more advanced dashboards. So do let me know at e1@vmware.com.

The following screenshot shows the 2 sets when you are running 6.4. You end up with both sets. They can co-exist, and the cost is some metrics are duplicated. As part of porting the dashboards to 6.4, we converted the super metrics into regular metrics so it’s simpler for you.

Did we remove the old dashboards from 6.3? Nope, we did not. We simply moved them. Can you guess where they are on the screenshot above?

Yes, we moved them under the “Other” folder. In the future, we might deprecate them as we enhance the UI and dashboards.

You will notice we have grouped the dashboards into Infrastructure and VM. This is in line with the Dining Area and Kitchen shared in Operationalize Your World. We wanted to draw your attention to the fact that you should ensure the VMs are served well before you look at the kitchen.

We’ve placed some dashboards outside the folder for your convenience. We’ve also created a Read Me dashboard. We call it Getting Started. It explains the new dashboards.

Technically, the dashboard has only 1 Text Widget. The adventurous among you will ask me if you can clone and tailor it for your company. The answer is yes, you can. But no, it is not supported. All the texts and images are in this directory:

Operations Overview dashboard

We designed this dashboard to answer a few frequently asked questions on your day to day operations.

What have we got? If this number changes, or is not what you expect, you want to probe why.

What’s the Health of my environment? Environments can vary in size, so we group them by vSphere Data Center. The dashboard lists all your data centers. Select one, and you can see its uptime and alerts. You expect the uptime to be 100% and the alerts to be no higher than what you see in normal operations.

Just because your vSphere is healthy does not mean the VMs are being served well. This is where the Top-N comes in. As this dashboard is your daily dashboard, the numbers should be within your expectation.

Cluster Performance dashboard

If the VMs are not being served well, you need to investigate why. This is where the following dashboard comes into play. It lets you see which clusters are not performing well. The heatmap shows the cluster by alert. Start with the reddest cluster.

Select the cluster you want to probe. Its performance counters will be automatically shown. You can see if it’s serving its VMs well. We are using line chart, so you can see the past and check if there is any strange spike.

Heavy Hitter VMs dashboard

One possible cause of a performance problem is a Villain VM. Your vSphere environment is a shared environment. It can take as few as 1-2 VMs to create a performance problem in a cluster with 500 VMs.

This dashboard answers if there is any abnormal spike generated by any VM. It tracks both Storage and Network. From the following example, we can easily see there are both excessive storage workload and very high network throughput. They happened at different times and were caused by different VMs. The dashboard quickly shows the 2 villain VMs. You can see their workload is >10x that of the second highest VM.

In Operationalize Your World, we enhanced this dashboard by adding details and split it into 2 (Storage and Network). This is to facilitate collaboration with your peers.
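The ">10x the second highest VM" observation can be turned into a simple check. This is only a sketch of the idea, not how the dashboard is built; the VM names, the numbers and the 10x ratio as a cut-off are assumptions.

```python
from typing import Dict, Optional


def find_villain(workload_by_vm: Dict[str, float],
                 ratio: float = 10.0) -> Optional[str]:
    """Return the top VM if its workload exceeds `ratio` times the runner-up."""
    ranked = sorted(workload_by_vm.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) < 2:
        return None  # nothing to compare against
    (top_vm, top), (_, second) = ranked[0], ranked[1]
    return top_vm if top > ratio * second else None


# Made-up IOPS per VM at the time of a spike.
iops = {"db01": 52000.0, "web01": 3100.0, "web02": 2900.0}
print(find_villain(iops))  # db01
```

Run the same check on network throughput separately, since the storage villain and the network villain are often different VMs.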

Datastore Performance dashboard

Cluster covers compute. What about storage? The Datastore Performance dashboard lets you see the performance of all your datastores. You can use the view to select a datastore. Its performance charts will be automatically shown. From the line chart, you could see if the datastore was having difficulty serving its VMs.

You can drill down to see each VM in the datastore. Select a VM, and its IOPS and Latency will be automatically plotted.

VM Performance Troubleshooting dashboard

There are 2 ends of the performance problem spectrum:

Whole house on fire

A small number of VMs were hit

The first few dashboards help you answer the first use case. This dashboard helps you look at a single VM. It’s a big dashboard, so I’ve added visual sections. It has 3 sections.

Section 1

This is where you select the VM. You can search, filter or simply browse.

The selected VM’s key properties, its alerts and how it fits into the larger environment are automatically shown. If a VM is part of a Resource Pool, check if the Resource Pool is limiting it.

Section 2

We display both the VM KPI and the IaaS KPI.

The IaaS KPI is the set of 4 key metrics that show how the IaaS serves this VM. If these are high, there is a good chance your IaaS capacity is full. It is struggling to serve all its VMs.

Section 3

You are verifying if the underlying IaaS was able to serve the VM.

Can you guess why we show Cluster instead of ESXi?

Hint: the performance problem may have happened in the past.

In Operationalize Your World, we expanded this dashboard into a set of dashboards.

VM Usage dashboard

A common request from a VM Owner is to get the utilization and properties of their VM. This is what this simple dashboard is for. You simply select the VM, and the key information is automatically shown. We are using a line chart again, as the data that the VM Owner wants to know can be in the past.

If you need something more advanced, with self service, review the tenant dashboard here.

Capacity Dashboards

The twin brother of performance is capacity. While they are different, they are closely related. We’ve provided 2 dashboards to get you going:

Cluster capacity

Datastore capacity

The Cluster capacity dashboard lets you see a cluster’s utilization in 3 areas: CPU, RAM and Disk.

The model we use here is based on utilization. It does not take into account Availability Policy and Performance SLA. It is also based on the Demand model, which is not suitable for a Tier 1 cluster.

The datastore capacity lets you see quickly which datastore is running out of space, and which datastores are hardly used. It uses the red color to show low capacity, and dark grey to show wastage. What you want to see is balanced usage across all datastores.
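The color logic described above can be sketched like this. Both cut-offs are assumptions I picked for the example; the dashboard's actual thresholds are whatever you configure in the widget.

```python
def classify_datastore(used_gb: float, capacity_gb: float,
                       low_free_pct: float = 10.0,
                       wasted_below_used_pct: float = 20.0) -> str:
    """Return 'red' (running out of space), 'dark grey' (hardly used) or 'balanced'."""
    if capacity_gb <= 0:
        raise ValueError("capacity must be positive")
    used_pct = 100.0 * used_gb / capacity_gb
    if 100.0 - used_pct < low_free_pct:
        return "red"
    if used_pct < wasted_below_used_pct:
        return "dark grey"
    return "balanced"


# Made-up datastores: (used GB, capacity GB).
for name, (used, cap) in {"ds01": (950, 1000), "ds02": (100, 1000),
                          "ds03": (600, 1000)}.items():
    print(name, classify_datastore(used, cap))
```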

Configuration Dashboards

The last set of dashboards covers Configuration. We focus on configuration that needs attention, rather than simply listing all configuration. Take the VM configuration dashboard, shown below. It highlights whether you have large VMs, how large they are, and how many there are of each size.

You can also customize the filter. Simply edit the view widget.

It also highlights configuration that you need to watch. A VM with more than one vNIC should get your attention, as it can bridge your network.
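The large-VM summary boils down to a count per size. Here is a minimal sketch, assuming you have exported the vCPU count per VM. The 8 vCPU cut-off for "large" and the VM names are my assumptions.

```python
from collections import Counter
from typing import Dict

LARGE_VCPU = 8  # hypothetical cut-off for a "large" VM


def large_vm_counts(vcpus_by_vm: Dict[str, int],
                    min_vcpu: int = LARGE_VCPU) -> Dict[int, int]:
    """Map each large vCPU size to the number of VMs of that size, largest first."""
    counts = Counter(v for v in vcpus_by_vm.values() if v >= min_vcpu)
    return dict(sorted(counts.items(), reverse=True))


# Made-up inventory.
inventory = {"app01": 2, "db01": 16, "db02": 16, "bi01": 8, "sap01": 32}
print(large_vm_counts(inventory))  # {32: 1, 16: 2, 8: 1}
```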

We apply the same principle to the ESXi Configuration dashboard. For example, it shows the BIOS version. You want to keep the version consistent and minimal.

The Cluster Configuration dashboard highlights inconsistent configuration among the member ESXi hosts in the cluster.

The Network configuration dashboard lets your peers in the Network team quickly understand the virtual network. It lists all the distributed virtual switches. Once you select one, it automatically lists all the port groups and ESXi hosts in that switch. It also lists all the VMs. You can control and customise all these lists.

Hope you found them useful. We have intentionally kept them simple in 6.4. If you are running 6.3 or later, and you need more advanced dashboards, download from here.
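Checking for inconsistency among the ESXi hosts in a cluster can be sketched as below. The setting names and values are made up for the example, and this is not how the dashboard itself is implemented.

```python
from typing import Dict, Set


def inconsistent_settings(hosts: Dict[str, Dict[str, str]]) -> Dict[str, Set[str]]:
    """Return every setting that has more than one distinct value in the cluster.

    A setting missing on some hosts is not flagged by this sketch;
    only differing values are.
    """
    seen: Dict[str, Set[str]] = {}
    for settings in hosts.values():
        for key, value in settings.items():
            seen.setdefault(key, set()).add(value)
    return {key: values for key, values in seen.items() if len(values) > 1}


# Made-up cluster: the BIOS version differs, the NTP setting does not.
cluster = {
    "esx01": {"bios_version": "2.8", "ntp": "configured"},
    "esx02": {"bios_version": "2.6", "ntp": "configured"},
    "esx03": {"bios_version": "2.8", "ntp": "configured"},
}
print(inconsistent_settings(cluster))
```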