Journey to the Virtual World


Part 1 explained a new concept, where we added the Availability SLA and Performance SLA as the basis of Capacity Management. In this part, I will provide the formula behind each chart. We will cover Tier 1, followed by Tiers 2 and 3.

You should be performing capacity planning at Cluster level, not Data Center or Host level.

Compute (Tier 1)

To recap, we do not have over-subscription in Tier 1; we only have it in Tiers 2 and 3. As a result, the formulas become simpler, as we are essentially following the Allocation model.

Availability Policy

Super Metric Formula: Maximum number of allowed VMs in 1 cluster – Number of VMs in the cluster

Apply the Availability policy at the cluster level, since it makes more sense there. Applying it at the ESXi Host level is less useful because of HA. Yes, the chance of a single host going down is higher than that of an entire cluster going down. However, HA will reboot the VMs, and VM owners may not even notice if the service is not affected. On the other hand, if a cluster goes down, it is a major issue.

The limitation of this formula is that it assumes a fixed cluster size. This is a fair assumption, as you should keep things consistent. If for some reason you have, say, 3 cluster sizes (e.g. 8, 10 and 12 hosts), then you need 3 super metrics.
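As an illustration, the availability metric can be sketched in Python. The per-size VM caps below are hypothetical examples, not numbers from this article; substitute your own policy:

```python
# Hypothetical policy: maximum allowed VMs per cluster, keyed by cluster
# size (number of ESXi hosts). One entry per cluster size mirrors the
# "one super metric per cluster size" point above.
MAX_VMS_PER_CLUSTER = {8: 64, 10: 80, 12: 96}  # assumed caps

def remaining_vm_slots(host_count: int, current_vm_count: int) -> int:
    """Availability metric: maximum allowed VMs in the cluster minus
    the number of VMs currently in it."""
    return MAX_VMS_PER_CLUSTER[host_count] - current_vm_count
```

Under this assumed policy, an 8-host cluster already running 50 VMs has `remaining_vm_slots(8, 50) == 14` slots left.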

CPU

Supply: Total physical cores of all ESXi Hosts – HA buffer

We can count either physical cores or physical threads. One is conservative, the other aggressive; the ideal number is around 1.5× the physical cores. My recommendation: take the cores, not the threads, because this is Tier 1, your highest and best tier.

Threshold: 10% of your capacity, as it takes time to buy a cluster (which also needs storage). You are also not aiming to run your ESXi hosts at 100% utilization.

You do not have to build the threshold (which is actually your buffer) into the super metric formula, because it is dynamic. Once it is hard-coded in the super metric, changing it does not change the history. It is dynamic because it depends on the business situation: if a large project is going live in a few weeks, your buffer needs to cater for it. This is why we need to stay close to the business. It is also something you should know from your actual experience in your company; you have that gut feel and estimate.

Demand: Total vCPU for all the VMs.

If you are using virtual threads in your VMs, count them as if they were full vCPUs. For example, a VM with 2 vCPUs and 2 threads per core should be counted as 4 vCPUs.
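A minimal sketch of the Tier 1 CPU supply and demand, assuming a uniform cluster and expressing the HA buffer as a number of hosts (the function names and parameters are illustrative):

```python
def tier1_cpu_supply(cores_per_host: int, host_count: int, ha_hosts: int) -> int:
    """Supply: total physical cores of all ESXi hosts minus the HA
    buffer. We count cores, not threads, since this is Tier 1."""
    return cores_per_host * (host_count - ha_hosts)

def tier1_cpu_demand(vms: list) -> int:
    """Demand: total vCPUs of all VMs. Each (vcpu, threads_per_core)
    pair counts threads as full vCPUs, so (2, 2) contributes 4."""
    return sum(vcpu * threads for vcpu, threads in vms)
```

For example, a 10-host cluster of 20-core hosts with 1 host held as HA buffer gives `tier1_cpu_supply(20, 10, 1) == 180` cores of supply.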

RAM

Supply: Total physical RAM of all ESXi Hosts – HA buffer

There is no need to include the ESXi VMkernel RAM, as it is negligible. If you are using VSAN and NSX, you can add some buffer. You do not need to include virtual appliances, as they take the form of VMs and hence are already included in the Demand.

Demand: Total vRAM for all the VMs
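The RAM chart follows the same shape. The sketch below also shows the buffer as a reporting threshold kept outside the super metric, as discussed above for CPU (all names and the 10% default are illustrative):

```python
def tier1_ram_headroom(ram_per_host_gb: int, host_count: int,
                       ha_hosts: int, vm_vram_gb: list) -> float:
    """Supply minus demand, in GB. Supply excludes the HA buffer;
    VMkernel overhead is ignored as negligible."""
    supply = ram_per_host_gb * (host_count - ha_hosts)
    return supply - sum(vm_vram_gb)

def buffer_breached(supply_gb: float, demand_gb: float,
                    buffer_pct: float = 0.10) -> bool:
    """True when demand eats into the last buffer_pct of capacity.
    The buffer stays outside the super metric so it can change with
    the business situation without rewriting history."""
    return demand_gb > supply_gb * (1 - buffer_pct)
```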

Network

The ESXi vmnic utilization has to stay below your physical capacity. Ideally, keep a buffer so the cluster can handle spikes from network-intensive events.

Summary

The formulas above are all you need for Tier 1.

In an emergency, as a temporary measure, you can still deploy VMs while waiting for your new cluster to arrive. This is because you have the HA buffer, and ESXi hosts are known for their high uptime.

Tier 2 and 3

Tiers 2 and 3 are different, as there is over-subscription. Since we overcommit CPU and RAM, we can no longer use the allocation model; we need to take performance into account.

Super Metric Formula: Maximum (VM CPU Contention) in the cluster

Super Metric Formula: Average (VM CPU Contention) in the cluster

Super Metric Formula: Maximum (VM RAM Contention) in the cluster

Super Metric Formula: Average (VM RAM Contention) in the cluster
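The four contention super metrics reduce to a maximum and an average over the VMs in the cluster. A minimal sketch (the sample contention values in the usage note are made up):

```python
def contention_summary(vm_contention_pct: list) -> tuple:
    """Return (maximum, average) contention across all the VMs in the
    cluster. Apply it once to CPU contention and once to RAM contention."""
    return (max(vm_contention_pct),
            sum(vm_contention_pct) / len(vm_contention_pct))
```

For example, `contention_summary([1.0, 3.0, 2.0])` returns `(3.0, 2.0)`: the worst VM is at 3% contention while the cluster averages 2%, which is why you chart both.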

For the total number of VMs left in the cluster, see Tier 1. It is the same formula, just a different policy; you naturally have a higher threshold.

For the ESXi vmnic utilization, see Tier 1. The identical formula is used.

Conclusion

Indeed, a few line charts are all you need to manage capacity. I am aware this is not a fully automated solution. However, my customers found it logical and easy to understand. It follows an 80/20 principle, where you are given the 20% room to make the judgement call as the expert.

Infrastructure as a Service (IaaS) is something that I love doing. It is both a privilege and a pleasure to be able to see customers progress in their virtualization journey. I work closely with the Infrastructure team in their transformation from systems builder to service provider. It is a massive change, covering strategy, technology, people and process. No wonder it is far from easy.

Based on 6+ years observing this transformation, I think there is a critical area that is overlooked. I drew a diagram below to convey the message. Can you guess which area it is?


I see a lot of VMware Admins/Engineers/Architects at end-user environments who do not extend their influence beyond architecture. I think that is a lost opportunity, because Operations and Architecture are like Yin and Yang, or a Möbius strip.

I shared the idea that, as the creators of the platform, we have to take an interest in how it is operated. It was an impromptu presentation at our VMUG Singapore back in mid-2014, hence no slides.

The restaurant business provides a good analogy for our Infra-as-a-Service business. We (Virtualisation Architects/Engineers/Admins) are the Chef. In the end-user environment where you work, you are the expert in producing what your customers want. You architect and design a solid platform where your customers can confidently run their VMs. If there is an issue, you often get involved, restoring their confidence in your creation. You are seen as the VMware guy, or the virtualization expert. Yes, you may engage VMware PSO or an SI, but they do not work for the company; you are the employee. As far as your customers are concerned, the buck stops with you.

You do not sell hardware or software. You charge your customers per VM. In fact, to ensure that your customers order the right kind of VM, you need to charge per vCPU, per vRAM and per vDisk. The chargeback model is something I very rarely see us discuss; we tend to stay in technical discussions. We need to realise we are no longer just System Builders. We are Service Providers. By not extending our circle of influence into how the App Team should pay for our service, we created the issues we have today (oversized VMs, dormant VMs, VM sprawl). We need to “step out from the kitchen” from time to time. We need to be like the chef who steps out into the dining area, building relationships with his customers and explaining the reasoning behind his cooking.

As the Architect/Engineer, we are the best people to determine how much it should be charged. We built this thing; we know the costs, and we know the capacity. Not convinced? Put it this way: would you rather someone else determine how much your creation is worth?

We all know that IT exists because of the Business. It starts with the Business. Some of the issues we have are caused by an unsuitable chargeback model and incorrect Service Tiering. The VM in the Tier 1 (mission critical) platform cannot cost the same as the VM in the Tier 3 (non-production) platform. I would make sure there is a distinct difference in quality between Tier 1, Tier 2 and Tier 3, so it is easy for the business to choose. Need a good example? Review this.

Using the restaurant analogy, say you cook fried rice. It is your dish, so you need to determine its price. You also need to be able to justify why you have normal fried rice and special fried rice, and why the special one costs a lot more for the same portion.

To me, the Chargeback model and the Service Tiering serve as Key Drivers of our Architecture. I will not consider my architecture complete unless I include these 2 in my design. We are architecting to meet the business requirements, which are “defined” in the chargeback model (e.g. the business wants a $100 VM per month, not a $100K VM per month) and the service tiering (e.g. the business wants 99.999% availability and 3% CPU Contention).

As shared, I see a chance for us to STEP UP and STEP OUT.

Step out of the kitchen and network with your customers (the App team). Educate and fix the problem at the source. Step up from pure IT architecture to business architecture. Architect your pricing strategy and service tiering.

The good thing about pricing is that your benchmark is already set.

Azure, AWS, Google and many other service providers have, to a certain extent, set the benchmark. Your private cloud cannot be too far from it. Too low, and you will likely make a loss (it is almost impossible to beat their efficiency). Too high, and you will get complaints. Another source of benchmark is physical.

If you are pricing your VDI, the cost of a PC sets your benchmark. You can be higher, but not by a huge gap. A PC costs $800 with Windows, a 3-year warranty and a 17” monitor. Add your IT desktop cost, and you have your benchmark.