Monthly Archives: April 2017

The CIO, Head of Global Infrastructure, and other senior IT management have different dashboard requirements than technical folks.

Generally they want:

big picture, not details.

exception. Things that need their attention.

less technical info. Ideally, present in business terms, not IT.

a portal that is easy to access. They may not want to login to vR Ops. If they do, they may forget their password. [e1: vR Ops 6.5 cannot do login-less yet]

UI that is easy to understand. So keep each dashboard to a specific question.

system that is easy to use. So keep the interaction, clicking, zooming, sorting, etc. minimal.

That’s what they want from you.

What do you want from them?

You show them something so you can get help (e.g. budget, resource). Here are some goals:

Show transparency. Giving visibility into live environment to senior management.

Prove that you do need additional hardware.

Prove the wastage you have been talking about for months.

What do you not want to show? Urgent issues are something you should not display. This is not about hiding information from the CIO. It is about giving you the time and space to do your job. If there is an active fire that requires your full concentration, you do not want to be interrupted by the CIO asking why the dashboard is showing red!

I covered dashboard best practice in this post. Read that first, as this blog builds upon that.

Done? Great!

We take the same approach we did when planning dashboards for specific roles (e.g. Storage team, Network team). We ask a set of questions.

If we implement the above, we will end up with at least 5 dashboards. I’ve combined some of them. I see a wide variety of requirements, so you will customise them anyway 🙂

Basic Visibility

How many VMs are in our cloud? What’s their CPU, RAM, Disk allocation? This gives you the size of the environment the IaaS is supporting.

How much CPU, RAM, Disk do we have? Is it enough to support the above requirement?

You should also give the history of VM growth. What is enough today may not be enough in 3 months.

In the dashboard above, I’ve added Availability information. As VMs can be powered off intentionally by the application team, you should only report availability for Tier 1 VMs. Tier 3 VMs, especially those in Test and Dev, can be rebooted frequently and hence will give misleading information.

Performance

The dashboard below shows all VMs. In a large environment, the heat map will automatically combine VMs with the same value (read: color & size).

Every VM is represented by a box. Each box takes a value between 0 and 3.

Green = 0. The VM is served well.

Yellow = 1. One of the IaaS services is not delivered as per the Performance SLA. We track CPU, RAM and Disk. If your SLA states 10 ms disk latency, then the VM has to get 10 ms or better.

Orange = 2. Two of the IaaS services are not delivered.

Red = 3. All 3 services are not delivered.
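The 0 to 3 scoring above can be sketched in a few lines of Python. This is only an illustration, not how vR Ops or the super metric behind the heat map actually computes it. The counter names (`cpu_ready_pct`, `ram_contention_pct`) and the CPU and RAM thresholds are assumptions made up for the example; only the 10 ms disk latency figure comes from the text.

```python
from dataclasses import dataclass


@dataclass
class VmReading:
    # Hypothetical counters; substitute whatever your Performance SLA tracks.
    cpu_ready_pct: float
    ram_contention_pct: float
    disk_latency_ms: float


@dataclass
class Sla:
    cpu_ready_pct: float = 5.0       # assumed threshold, not from the post
    ram_contention_pct: float = 1.0  # assumed threshold, not from the post
    disk_latency_ms: float = 10.0    # the 10 ms example from the post


# Index = number of services that missed the SLA.
COLORS = ["green", "yellow", "orange", "red"]


def sla_score(vm: VmReading, sla: Sla) -> int:
    """Count how many of the 3 IaaS services (CPU, RAM, Disk) breach the SLA."""
    breaches = [
        vm.cpu_ready_pct > sla.cpu_ready_pct,
        vm.ram_contention_pct > sla.ram_contention_pct,
        vm.disk_latency_ms > sla.disk_latency_ms,
    ]
    return sum(breaches)


vm = VmReading(cpu_ready_pct=1.0, ram_contention_pct=0.0, disk_latency_ms=12.0)
print(sla_score(vm, Sla()), COLORS[sla_score(vm, Sla())])
```

Because the score is just a count of breaches, it does not tell you which service failed. That is fine for a CIO view, and the technical team can drill down from there.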

The VMs are grouped by Datacenter, then Cluster. This lets you see which Datacenter or Cluster isn’t coping well.

The above shows the VMs. What about applications? An Application spans multiple tiers and multiple VMs. Just because a VM does not perform does not mean the whole application is affected. As this is for Senior Management, we’re only showing the Tier 1 applications.

Capacity

The CIO is not in charge of capacity management. He just needs to know the decision you want him to make (which is to approve a hardware purchase, or get VM Owners to rightsize). For that, he needs to know whether you are running out of capacity, and that existing capacity is not wasted.

How is it growing? This can be taken care of by having a projection. This projection should take into account committed projects too.

Capacity is more complex than performance. Just because a vSphere cluster is running at low utilization does not mean it can serve the VMs well. See this for a detailed explanation.

Capacity is best presented with a line chart, as this lets you see the trend. For an environment with <10 clusters, you can fit all the clusters on the screen. For a large environment, you need to make a trade-off:

show live data. You can be detailed, as you’re only showing a single data point.

show historical data. You can’t be detailed, as you’re showing more than one data point.

Here is an example with historical data. Notice we cannot show details, and the screen only accommodates <10 clusters.

Here is an example where we only show live data. We can show a lot more clusters, and for each we can show CPU, RAM, Disk and Network.

You may run out of capacity. But if you have a lot of wastage, you may have sufficient capacity after you reclaim it. See this for details.
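To make the reasoning concrete, here is a rough sketch of a projection that accounts for committed projects and reclaimable wastage. It assumes simple linear growth, which is not how vR Ops projects capacity, and the function name and all the numbers are invented for illustration.

```python
def months_until_full(capacity, used, monthly_growth,
                      committed=0.0, reclaimable=0.0):
    """Months until demand reaches capacity, assuming linear growth.

    committed   : capacity already promised to approved projects
    reclaimable : wastage you expect to get back by rightsizing
    All values share one unit (e.g. GB of RAM). Numbers are illustrative.
    """
    effective_used = used + committed - reclaimable
    headroom = capacity - effective_used
    if headroom <= 0:
        return 0.0            # already out of capacity
    if monthly_growth <= 0:
        return float("inf")   # no growth, so it never fills up
    return headroom / monthly_growth


# Same cluster, three different stories for the CIO.
print(months_until_full(1000, 800, 50))                      # usage only
print(months_until_full(1000, 800, 50, committed=100))       # plus projects
print(months_until_full(1000, 800, 50, committed=100,
                        reclaimable=200))                    # after reclaim
```

The point for a CIO dashboard is the difference between the last two numbers: that gap is what reclaiming wastage buys you before a hardware purchase is needed.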

Configuration

Do we have “bad” configuration? Examples are old & unsupported versions of Windows, Linux, ESXi, VMware Tools, etc.

How uniform is our environment? Some complexity is required to optimize cost (hardware, software) and performance. However, there is a cost to complexity.

Do we have outdated and unsupported products?

If your CIO does not appreciate the complexity, showing it to the CIO is good for you. It will result in appreciation of your expertise & effort, as the job is certainly easier when the complexity is low. Complexity increases when you have a wide variety of things.

Factors impacting complexity:

Number of ESXi versions. The more variants, the more complex.

Number of ESXi CPU versions.

Number of brands. The more vendors, the more complex, as you need to learn their products and spend time with their teams.
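The factors above boil down to counting distinct variants. Here is a minimal sketch that assumes you have exported a host inventory into a list of dictionaries. The field names and sample values are made up for the example.

```python
def variant_counts(hosts, keys):
    """Return the number of distinct values per attribute across all hosts."""
    return {key: len({host[key] for host in hosts}) for key in keys}


# Hypothetical inventory; in practice this would come from your own export.
hosts = [
    {"esxi_version": "6.0", "cpu_model": "E5-2680 v3", "vendor": "VendorA"},
    {"esxi_version": "6.0", "cpu_model": "E5-2680 v4", "vendor": "VendorA"},
    {"esxi_version": "6.5", "cpu_model": "E5-2680 v4", "vendor": "VendorB"},
]

counts = variant_counts(hosts, ["esxi_version", "cpu_model", "vendor"])
print(counts)
```

A lower count per attribute means a more uniform environment, so the three numbers make a compact complexity summary for a dashboard.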


If you have a lot of super metrics, backing them up can be a challenge. You cannot do a bulk export to back them up. It’s easy to run into version control issues if you manually export each one.

Replicating in another instance (e.g. your test/dev) is tedious as you need to import one by one.

A workaround is to use a Policy as the vehicle for bulk export/import.

For backup purpose, export is all you need.

For restore into the same environment (where you exported it earlier), you can use the same XML file. You don’t need to customise it.

For replicating in another environment, you likely need to modify the XML file. This is because the policy file contains other settings, such as alerts. It’s safer if your exported policy does not contain those other settings.

I’ll show you how to trim the policy file, so it has only super metrics. In this way, it’s safe to import into any environment, as it won’t modify anything. The XML file only contains super metrics, that’s all.

The policy file is just a long XML file. In the example below, it has >5000 lines! Notice it has Alerts and Custom Profile.

To delete an entire section, simply use the keyboard to highlight it, then delete. See below, where I selected Custom Profile.

I’ve also deleted the alert section. The file is shorter now 🙂

We still have some irrelevant lines (line 4 – 1187 in my case). Delete them too. You’ll end up with something like this.

I’ll expand the content so you know what exactly the supermetric section contains.

BTW, do not copy/paste your super metric from vR Ops UI into the XML file. The expression is not 100% identical.
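If you would rather not trim a 5000-line file by hand, the same idea can be sketched with Python's standard XML library. This is only an illustration of the approach: the element names below (`PolicyContent`, `Alerts`, `CustomProfiles`, `SuperMetrics`) are placeholders invented for the example, not the real export schema, so check the tag names in your own exported file first. The manual step of removing the other irrelevant lines still applies.

```python
import xml.etree.ElementTree as ET


def keep_only(xml_text: str, keep_tag: str) -> str:
    """Drop every top-level section except the one named keep_tag."""
    root = ET.fromstring(xml_text)
    for child in list(root):      # copy the list, as we remove while iterating
        if child.tag != keep_tag:
            root.remove(child)
    return ET.tostring(root, encoding="unicode")


# Toy stand-in for an exported policy file. Tag names are hypothetical.
sample = (
    "<PolicyContent>"
    "<Alerts><Alert id='a1'/></Alerts>"
    "<CustomProfiles><Profile id='p1'/></CustomProfiles>"
    "<SuperMetrics><SuperMetric name='Ops. Example'/></SuperMetrics>"
    "</PolicyContent>"
)

trimmed = keep_only(sample, "SuperMetrics")
print(trimmed)
```

Because the super metric section is carried over untouched from the export, the expressions stay in the exported form and nothing is pasted in from the UI.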

Once done, you can import it safely into another vR Ops instance. The import is much faster too! Here is what it looks like:

And if you go to Super Metrics, they are all there 🙂

To enable them, go to your default policy (the one marked with a tiny D on the column), and edit it. Go to section 6, and find your super metrics. In the Operationalize Your World case, they are all prefixed with “Ops.”

Do not enable them for all objects. It will slow down your system. It also makes things unnecessarily complex, as the formulas don’t apply to every object.

One common mistake I see in the field is oversized vRealize Operations. I guess the thinking that bigger is better is hard to let go. It does not help that the official sizing guide is conservative. It is conservative for a good reason: there is a wide permutation of vR Ops deployments.

So if your deployment is a simple one, with no management pack and no End Point Operations, there is a good chance that you are better off with a smaller deployment. So how do you check?

I’ll use an actual example and run through my thought process. The example below is from a real production environment, not a lab. The environment is mid-size, around 3000 VMs on 300 ESXi hosts.

The environment is heavy on vCenter folders, vR Ops custom groups, super metrics and alerts. It also has integration with ticketing system. The result is 6000 objects and 10 million metrics. The actual collection is 5500 objects.

To see the above break down, go to Cluster Management screen, as shown below. What can you tell from it?

The vR Ops has 5 nodes. It’s clustered and well balanced. Each node handles around 1100 objects. It’s also using Remote Collector to offload the 5-minute processing. As you’ll see later, that strategy pays off well.

You can see the breakdown of Objects being collected. The number 211 was made of 205 + 6.

Now that we know what the deployment is, we can see each node. From the screen below, you can see again that Remote Collector has a subset of the full node. It only has 4 main modules (Collector, Suite API, Watchdog and Admin UI). There is no Persistence and Databases there.

From the above, we can see the full metrics and properties of each node. We can also drill down into each module.

Remember the 5500 objects being collected? Let’s see the history. I’m plotting since Day 1. This is a new vR Ops, so it only goes back to 1 March.

Notice it starts from 0, as that’s when we deployed it. It was a phased deployment. We registered more vCenters, so the number of VMs, objects and metrics went up. The CPU Usage didn’t jump accordingly, indicating it has more than enough CPU to handle the extra load. In other words, the additional load was too small to make a difference.

Since the 5 nodes are well balanced, let’s take 1 of them, so we can dive deeper. I added Guest OS RAM this time around.

We see the similar jump in objects and metrics. That’s expected by now. The impact on CPU was also minimal.

The spike you see in CPU is actually a daily occurrence. We will show later on that it happens at midnight. The daily spike eventually became higher. I’m not sure exactly what it is, but it’s a daily calculation (e.g. capacity or DT). It’s not super metrics or groups, as these are calculated every 5 minutes.

The additional load was actually decent. It was in fact 2x the load, as you can see below. I used a more detailed chart, and you can see here the sharp jump as we added a few vCenters. Each vCenter in turn brings all its objects.

The sharp jump makes a tiny difference in CPU Usage. From the pattern below, you won’t believe that there was 2x load. To me, the extra load was absorbed by Remote Collector.

The RAM pattern was puzzling. I don’t know why. BTW, this counter is from Guest OS, not from VM level. I do expect memory to be fully used, as it’s just a form of cache. I just don’t know why Free RAM went higher ahead of the addition.

Let’s look at Storage. The pattern matches CPU. Read is higher, because at night vR Ops does its capacity and DT calculations, and that means it’s reading a lot of data. The absolute number was low though. 1000 IOPS for 1100 objects means roughly 1 IOPS per object.
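The back-of-the-envelope numbers used in this walkthrough are easy to reproduce. The figures below are the ones quoted in this post (5500 collected objects, 5 nodes, about 1000 IOPS on the node examined); the variable names are just for illustration.

```python
objects_collected = 5500   # objects actually being collected
nodes = 5                  # nodes in the vR Ops cluster
peak_iops_per_node = 1000  # approximate IOPS seen on the node examined

objects_per_node = objects_collected / nodes
iops_per_object = peak_iops_per_node / objects_per_node

print(f"{objects_per_node:.0f} objects per node")
print(f"{iops_per_object:.2f} IOPS per object")
```

Plugging in the numbers from your own Cluster Management screen gives a quick sanity check before you start reading the charts.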

I said earlier we would dive into the CPU. Here is a 7-day chart. You can see there is a daily peak at midnight. But what about the 2nd peak, the one I marked with “?”

To answer that, we have to zoom into that period. Here is what it looks like. It turned out there was a problem. Notice there was no collection. So when we rectified the problem, vR Ops had to catch up.

From the chart, we can also see that the daily calculation does not last >15 minutes. The burst was short.
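The missing collection behind the second peak was spotted by eye on the chart. If you export the timestamps of a metric, a gap like that can also be found with a small script. This sketch assumes the 5-minute collection cycle mentioned earlier; the function name, the tolerance and the sample timestamps are made up for the example.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, interval=timedelta(minutes=5),
              tolerance=timedelta(minutes=1)):
    """Return (last_seen, next_seen) pairs where collection was missing."""
    ordered = sorted(timestamps)
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > interval + tolerance:
            gaps.append((prev, cur))
    return gaps


# Hypothetical samples: regular 5-minute collection with one 30-minute hole.
start = datetime(2017, 4, 1, 10, 0)
samples = [start + timedelta(minutes=5 * i) for i in range(4)]        # 10:00-10:15
samples += [start + timedelta(minutes=45 + 5 * i) for i in range(3)]  # 10:45-10:55

for last_seen, next_seen in find_gaps(samples):
    print(f"No collection between {last_seen:%H:%M} and {next_seen:%H:%M}")
```

A CPU burst that starts right after such a gap is more likely catch-up work than a sizing problem.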