Monthly Archives: September 2016

In this post, I would like to show how you should monitor your vRealize Automation (vRA). Before we start, if you have vRA and vRealize Operations Manager (vR Ops), go here and enjoy this great blog post. But if you don’t have vR Ops, you can use our free tool, which was designed to help monitor all components of vRA. At the end of this post, I will show you how you can integrate this tool with vR Ops.

You can get your vRA Endpoint name on the Infrastructure tab, as shown below.

Run the following command to get the final results:

java -jar vrealize-productiontest-1.6.0.1.jar run --oobList VRA

Once completed, you should see results like those below:

vRealize Production Test will generate a full HTML report, located in the report directory from where you ran the command. In my case, it was C:\inetpub\wwwroot\html-reports\report_all\VRPTReport.html.

The final result looks like:

TIP: If you follow the steps from the start of this blog, you can access your report remotely, as shown above.
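If you want to script the run, for example from a scheduled task, here is a minimal Python sketch. It only wraps the command shown earlier in this post. The function names and the workdir parameter are my own and are not part of the tool.

```python
import subprocess
from typing import List

# Jar name and arguments are copied from the command shown in this post.
JAR = "vrealize-productiontest-1.6.0.1.jar"


def build_command(jar: str = JAR) -> List[str]:
    """Return the argument list for the vRA run shown above."""
    return ["java", "-jar", jar, "run", "--oobList", "VRA"]


def run_production_test(workdir: str) -> int:
    """Run the test from workdir (hypothetical parameter) and return the exit code.

    The HTML report ends up relative to the directory the command runs from,
    so workdir decides where the report is written.
    """
    completed = subprocess.run(build_command(), cwd=workdir)
    return completed.returncode
```

Passing the arguments as a list avoids shell quoting problems on Windows paths.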

In the next post, I will show you how you can automate these tests and integrate them with vR Ops.

Hope you find it useful. Do reach out via LinkedIn / Twitter. Thanks for reading!

To ensure that your customers are happy, there are a few proofs you must be able to show:

Are the VMs up?

This is the #1 Job. It is more important than security and performance. If the VM is dead, there is nothing to talk about 🙂

Are they fast?

Just because they are up does not mean they are fast!

Is your IaaS serving them well?

If not, which VMs are hit? By what and when?

Who are the victims?

Who’s causing the problem?

Who are the villains?

We saw performance degradation on a cluster of 500 VMs when just 1-2 VMs ran IOmeter.

VMs with excessive usage hurt the business.

When a VM Owner complains, can your Help Desk add value within 1 minute?

Who has time to play the corporate ping-pong game when there is so much to do?

We know we have the Over Provisioning disease.

But how bad is it exactly? It impacts both Performance and Capacity.

Can you right-size VMs without impacting performance?

Are there any configuration issues you need to be aware of?

Let’s go through the dashboards that answer those questions, starting from Question 1.

Are the VMs up?

The dashboard helps in the following areas:

What’s the overall uptime? The CIO may ask you for the overall uptime over time. You can provide a line chart showing the aggregate uptime across all the VMs.

What’s the Uptime for each VM per month? The table on the dashboard is grouped by month. It’s showing Sep 2016. All VMs are showing 100%, which is what you want to see before you go for lunch or holiday 🙂

What’s the VM availability now? The heat map provides an easy visualisation. You just expect green for all VMs.

If a VM’s Uptime is <100%, when was it down and for how long? You can click on the heat map, and a line chart will be shown automatically. What you want to see is a straight line.
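To make the uptime numbers concrete, here is a rough sketch of the arithmetic, not how vR Ops computes it internally. I assume each VM has one up/down sample per collection interval. The VM names and data are made up.

```python
from typing import Dict, List


def uptime_percent(samples: List[bool]) -> float:
    """Percentage of collection intervals in which the VM was up."""
    if not samples:
        raise ValueError("no samples")
    return 100.0 * sum(samples) / len(samples)


def aggregate_uptime(per_vm: Dict[str, List[bool]]) -> float:
    """Average of the per-VM uptime percentages (the single line for the CIO)."""
    values = [uptime_percent(samples) for samples in per_vm.values()]
    return sum(values) / len(values)


def vms_below_target(per_vm: Dict[str, List[bool]], target: float = 100.0) -> List[str]:
    """Names of VMs whose uptime is below the target, sorted by name."""
    return sorted(name for name, samples in per_vm.items()
                  if uptime_percent(samples) < target)


# Made-up data: vm-b was down for 1 of 4 intervals.
example = {
    "vm-a": [True, True, True, True],
    "vm-b": [True, True, False, True],
}
```

With this data, vm-b is the only VM you would need to click on in the heat map.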

Are they fast?

The dashboard helps in the following areas:

Is your IaaS serving them well? If not, when does it fail to deliver?

If you do not define “well”, you have not quantified “fast”. If you have not defined it, you have not set a measurable expectation. That’s not a position you want to take, unless you enjoy performance troubleshooting 🙂

Who are the victims?

Once you know which cluster has the problem, and the time & type of problem, you can drill down.

Your IaaS can fail to deliver different resources at different times. For example, it has a CPU performance issue at 12:35 pm and a Disk performance issue at 10:40 pm. The performance line chart shows you the correlation, if any. In the above example, the selected cluster has a Storage performance issue, but is doing well on RAM.

During the same time interval, different VMs can be hit by different problems. If your IaaS fails to deliver on CPU and Disk at 12:35 pm, VM 007 can be hit with a CPU problem while VM 747 can be hit with a Disk problem. This is why you need to be able to see each resource (CPU, RAM, Disk, Network) independently.

[e1: there is a known bug that prevents you from having 4 equal columns]

This dashboard depends on the previous dashboard. You select a cluster, then navigate to this dashboard. It will only show VMs from that cluster. You can see which VMs are hit by what (CPU, RAM, Disk, Network). This lets you take the appropriate action before the VM Owner complains.
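The idea of checking each resource independently can be sketched as follows. The metric names and threshold values are hypothetical. They are placeholders for whatever your own SLA defines, not values taken from vR Ops.

```python
from typing import Dict, List

# Hypothetical thresholds; replace them with the values from your own SLA.
THRESHOLDS = {
    "cpu_contention_pct": 2.0,
    "ram_contention_pct": 1.0,
    "disk_latency_ms": 20.0,
    "net_dropped_packets": 0.0,
}


def victims(snapshot: Dict[str, Dict[str, float]]) -> Dict[str, List[str]]:
    """Map each resource metric to the VMs that exceed its threshold.

    snapshot holds one reading per VM for the same time interval.
    A metric missing from a VM's reading is treated as 0.
    """
    hit: Dict[str, List[str]] = {metric: [] for metric in THRESHOLDS}
    for vm, metrics in sorted(snapshot.items()):
        for metric, limit in THRESHOLDS.items():
            if metrics.get(metric, 0.0) > limit:
                hit[metric].append(vm)
    return hit


# Made-up readings for the 12:35 pm example in the text.
at_1235 = {
    "VM 007": {"cpu_contention_pct": 8.5, "disk_latency_ms": 4.0},
    "VM 747": {"cpu_contention_pct": 0.4, "disk_latency_ms": 35.0},
}
```

Here VM 007 shows up only under CPU and VM 747 only under Disk, which is exactly why a single combined number would hide the difference.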

The packet drop counter can be unreliable if you are not at the right patch level. See this KB. This issue is resolved in:

ESXi 6.0, Patch ESXi-6.0.0-20160804001-standard

ESXi 5.5, Patch ESXi550-201312401-BG: Updates esx-base,

Who are the villains?

Which VMs were generating excessive workload? When and for how long?

You can see it by tracking the maximum workload generated by any VM on a line chart. The example below shows excessive IOPS. It jumped to 13,212 IOPS when the average did not even touch 15 IOPS.

VMs can only generate excessive workload on IOPS and Network. A VM can’t abuse CPU and RAM, as it can’t go beyond its configuration. The dashboard tracks IOPS and Network. Once you see a peak, you use the Top-N to list the VMs.
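Here is a small sketch of the two steps: compare the maximum against the average, then list the Top-N. The VM names and all readings except the 13,212 peak are made up for illustration.

```python
from typing import Dict, List, Tuple


def max_and_average(iops_by_vm: Dict[str, float]) -> Tuple[float, float]:
    """Return (maximum, average) IOPS across all VMs for one interval."""
    values = list(iops_by_vm.values())
    return max(values), sum(values) / len(values)


def top_n(iops_by_vm: Dict[str, float], n: int = 3) -> List[Tuple[str, float]]:
    """Return the n VMs generating the most IOPS, highest first."""
    ranked = sorted(iops_by_vm.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Made-up readings for one interval: one VM dwarfs the rest.
readings = {"vm-a": 10.0, "vm-b": 13212.0, "vm-c": 12.0, "vm-d": 14.0}
```

A large gap between the maximum and the average is the signal to look at. The Top-N then names the VM behind the peak.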

For details on how this dashboard helped a customer who was hit by IOmeter, see this.

When a VM Owner complains

A VM Owner only cares about her VM. The fact that you have 1001 other VMs is irrelevant. As a result, the fact that your VMware cluster is working hard at 100% utilization is also irrelevant. That’s why the following dashboard does not show other VMs or your Infrastructure.

Using the dashboard, a Help Desk operator can search for the VM, or browse the list. For every VM, we show the key properties, such as number of vCPUs, RAM size, CPU Contention, RAM Contention, etc. The columns can be customised.

Once found, he simply selects that VM. How well your IaaS platform serves it will be automatically shown. The dashboard uses line charts, not a single number, so you can see if there is any pattern.

Below is all the dashboard shows! That’s all, because it’s about Monitoring, not Troubleshooting.

This is the most important feature of this dashboard, as it allows you to clear performance issues quickly.

That SLA line is dynamic. It varies, depending on which tier the VM belongs to.

To change the threshold, simply change the value in the super metric named Performance SLA. There is no need to modify policy.

For networking, the general expectation is 0 dropped packets, hence there is no need for an SLA line. We show both TX and RX instead, so you can see more precisely where the issue is.
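The tier-dependent SLA line can be pictured with a small sketch. The tier names and threshold values are hypothetical. In vR Ops the real value lives in the Performance SLA super metric, and this code only illustrates the comparison.

```python
from typing import Dict, List, Tuple

# Hypothetical CPU contention limits (%) per tier; not values from vR Ops.
SLA_BY_TIER: Dict[str, float] = {"gold": 1.0, "silver": 3.0, "bronze": 5.0}

Sample = Tuple[str, float]  # (timestamp, metric value)


def breaches(samples: List[Sample], tier: str) -> List[Sample]:
    """Return the samples that sit above the SLA line of the given tier."""
    limit = SLA_BY_TIER[tier]
    return [(ts, value) for ts, value in samples if value > limit]


# Made-up CPU contention readings for one VM.
cpu_contention = [("12:30", 0.5), ("12:35", 4.2), ("12:40", 2.0)]
```

The same readings breach the gold line twice, the silver line once and the bronze line never, which is why the line has to move with the tier.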

The example below clearly shows the IaaS unable to meet its promise on CPU, but doing well on RAM. It failed for around 20 minutes on Disk. You don’t even have to wait for the VM Owner to complain. You can be proactive and discuss the need for additional hardware or an upgrade.

The above dashboard clearly tells you if you are serving your customer well. It’s suitable for a Help Desk Operator. All they need to see is whether it’s above the threshold or not. Once you have operationalized IaaS, this dashboard is the easy part. You can actually make it self-service if you have a formal agreement.

What if you need to find out why? In other words, you move from monitoring to troubleshooting. From this dashboard, you can navigate to the VM Troubleshooting dashboard.

Troubleshooting is a world by itself. The diagram below shows a partial list of what can cause a performance issue.

A performance problem has only 2 main causes:

The VM itself

The Infra is unable to serve the VM.

For the VM, here are some possible reasons:

Utilization is high. It does not have enough capacity.

The VM is too big. Processes ping-pong among the vCPUs, and the context switch rate is very high.

The app does not scale well. It’s not able to take advantage of all the vCPUs, and the load is concentrated on just a few.

For Infrastructure, look for signs that it’s heavily loaded or too small for the VM:

vCPU too big relative to Host cores?

Was there vMotion at the time of issue?

Do take note that what is considered “high” is relative. This is where performance troubleshooting is not just science, but also art. Also, not all counters indicate a performance problem. Examples:

A high number of processes inside the Guest OS does not correlate to a performance issue if they are mostly idle and do not cause a lot of context switches. On the other hand, you can have a process ping-ponging among the vCPUs even though there aren’t many processes running.

VM RAM being ballooned out does not mean the VM experiences performance degradation. A RAM performance problem only happens when the CPU wants to access a page and has to wait for RAM. It has to wait if the page is not available in RAM because it was ballooned out, swapped, compressed, etc. So track the swap in, not the swap out.

We can’t show all the metrics and possibilities shown above. Here is what we can do. You can customize it to show more. You should also build a custom Log Insight dashboard to complement this.

The VM is selected from the previous dashboard (Single VM Monitoring). Related Datastores are automatically shown, along with their KPIs. The related ESXi cannot be shown automatically, as the ESXi where the VM is running now might not be the ESXi where it was running at the time of the issue. On the dashboard, choose the metric Parent Host manually, as shown above. You can see if the VM was on a different ESXi.

Compute: ESXi

CPU Contention, RAM Contention

CPU Demand, RAM Consumed, RAM Active

Storage: Datastore

Read Latency, Write Latency

Outstanding IO requests

Network:

We are not showing network because the Network should have 0 dropped packets, plus in general it’s hard to saturate 2x 10 GE.

Again, line charts are used, not a single number, because they give you a lot more info.

Do note that VM CPU Workload can exceed 100% as it accounts for CPU contention and overhead.

Over Provisioning disease

If you take all the large VMs in your environment, and plot the maximum utilization among them, what do you expect?

You are right. It depends whether they are over provisioned or not. If they are, the max among them will be low. The average will be even lower.

In a healthy, right-sized environment, there is bound to be at least 1 VM with high utilization at any given time. This is especially true in a large environment.

The line charts below show the Max and Average utilization among the large VMs. We can easily tell the degree of over-provisioning.
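As a rough illustration of what those two lines represent, here is a sketch that computes the Max and Average across the large VMs at each point in time. It assumes every VM has a reading for the same timestamps. The VM names and numbers are made up.

```python
from typing import Dict, List, Tuple


def max_and_avg_series(util_by_vm: Dict[str, List[float]]) -> Tuple[List[float], List[float]]:
    """For each timestamp, return the max and the average utilization (%) across VMs.

    Every list in util_by_vm must cover the same timestamps in the same order.
    """
    series = list(util_by_vm.values())
    maxima = [max(point) for point in zip(*series)]
    averages = [sum(point) / len(point) for point in zip(*series)]
    return maxima, averages


# Made-up CPU utilization (%) for two large VMs over three intervals.
large_vms = {
    "big-1": [10.0, 20.0, 30.0],
    "big-2": [30.0, 20.0, 10.0],
}
```

If even the Max line stays low, as it does here, no large VM is using what it was given.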

The line chart does not show the VMs. That’s where the table comes in. It shows the max utilization of each VM in a given period.

The table does not show relative comparison among these large VMs. If you want to expose the largest VMs, the heat map shows that. The larger the VM, the larger the box.

What about undersized? Generally speaking, this is not your problem. But if you want to answer “Which VMs hit high CPU usage when?”, you can use the following dashboard:

The above is what you want to see, indicating only 2 VMs had the problem in the past >1 month. In an environment where many VMs are undersized, you will see something like this. Notice this is not 2 months. This is just 6 hours, and each bar is only 10 minutes!

Right-sizing VM without impacting performance

The previous dashboard gives you the overall situation. To right-size, you need to deal with individual VMs. This gives you the confidence that performance will not be affected.

You can select any of the large VMs, starting from the one with the least utilization. The dashboard below will automatically list the VM utilization.

Each vCPU of the VM is listed in the table. It shows the maximum utilisation of each individual vCPU in the timeline you are interested in.

It shows an analysis of the utilization of the VM. The Forensic chart shows the utilization level that the VM stays at or below 95% of the time. You expect that number to be >80%, as a VM shouldn’t be spending 95% of the time doing just 20% utilization. The Forensic chart also shows you the remaining 5%, so you can be convinced.
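One way to read the 95% figure is as a 95th percentile of the utilization samples. That reading is my assumption for this sketch, and the nearest-rank method used here may differ from what the Forensic chart does internally. The sample data is made up.

```python
import math
from typing import List


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample that covers pct% of the data."""
    if not values:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Made-up CPU utilization (%): 19 quiet samples and a single spike.
samples = [15.0] * 19 + [90.0]
```

For this VM the 95th percentile is 15% while the peak is 90%. It spends almost all of its time idle, so it is a right-sizing candidate, and the remaining 5% shows how short the spike was.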

Most VM Owners will ask for a detailed line chart showing each vCPU utilisation. The line chart below will be automatically shown when a VM is selected. It retains a 5-minute granularity.

RAM right-sizing is more challenging, as you need Guest OS metrics, not VM metrics. vR Ops 6.3 sports the ability to pull this data using just VMware Tools.

Are they configured consistently?

Are there any “bad” configurations we need to know about? The dashboard lists VM configurations that need attention:

Do I have large VMs? If yes, what are their configurations? We cover CPU, RAM and Disk separately.

Do I have VMs connected to >1 network? They can bridge your networks, so this should be reserved for Networking VMs only.

Do I have VMs with large snapshots? If yes, which VM and how big?

Do I have VMs with old virtual hardware? If yes, which versions and how many?
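The checks above boil down to simple filters over the VM inventory. Here is a minimal sketch, assuming you have already exported the inventory somehow. The VmConfig fields and the default limits are hypothetical, so pick limits that match your own standards.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VmConfig:
    """Hypothetical inventory record for one VM."""
    name: str
    network_count: int
    snapshot_gb: float
    hardware_version: int


def needs_attention(vm: VmConfig,
                    max_snapshot_gb: float = 50.0,
                    min_hardware_version: int = 10) -> List[str]:
    """Return the list of configuration findings for one VM (empty if clean)."""
    findings = []
    if vm.network_count > 1:
        findings.append("connected to >1 network")
    if vm.snapshot_gb > max_snapshot_gb:
        findings.append("large snapshot")
    if vm.hardware_version < min_hardware_version:
        findings.append("old virtual hardware")
    return findings


# Made-up inventory entries.
router = VmConfig("edge-01", network_count=2, snapshot_gb=0.0, hardware_version=11)
legacy = VmConfig("app-legacy", network_count=1, snapshot_gb=120.0, hardware_version=8)
```

The dashboard gives you the same answers visually. A script like this is only handy if you want the list in a report or a ticket.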

Hope you find the material useful. If you do, go back to the Main Page. It gives you the big picture so you can see how everything fits together. If you already know how it all fits, you can go straight to download here.

Further Reading

Hope you find the blog useful. For more info, you can refer to chapters 4 – 7 in this book.

A common requirement among customers is to have a set of vRealize Operations dashboards to help them manage their VMware IaaS platform. They want a suite of inter-connected dashboards, not individual dashboards.

In the past several years, we have developed around 50 dashboards to help you operate your VMware SDDC. The dashboards form 1 story. We group each dashboard into the 4 pillars of SDDC Operations.

The set of dashboards also goes beyond the vSphere Admin, and provides dashboards for the Storage Team, Network Team and NOC Team. However, it does not yet provide complete coverage for every role and every purpose. The table below shows the coverage. No means there is no dashboard yet.

Different roles in the team are interested in what’s relevant to them first, which is why the dashboards are tailored for each. Here are the dashboards provided, grouped by role and purpose.

There are naturally more dashboards for the Platform Team. The team was known as the Server Team in the old days of the physical world. They have evolved into the Platform Team, which is typically where the VMware Admin and Architect belong.

They have 2 interfaces in the company:

upstream: to VM Owners, application team.

downstream: to Storage Team, Network Team

In addition, they also deal with IT Management (CIO, etc), Help Desk and Security/Compliance team.

You may notice in the above picture that some dashboards are in grey. That means they are not available. Need MP means it needs a Management Pack. We have not included MP as part of this solution. You should get vSphere under control first before extending coverage. Need feedback means I’m yet to see a use case for it. Every dashboard answers a question, and has to be complementary to other dashboards.

The tools we use to manage VMware SDDC are vRealize Operations and Log Insight. We do live demos during events, and customers ask for a copy that they can import into their environment. This blog provides the steps to import.

Here is what they look like in vRealize Operations 6.4. vR Ops 6.3 is the minimum requirement, as the dashboards use features new in 6.3.
