A couple of months ago I joined the Office of the CTO of the Cloud Platform Business Unit and started reporting directly to the CTO, Kit Colbert. Kit asked me to select a few areas to focus on. One of these areas is running Kubernetes on vSphere.
I’ve increased my focus on Kubernetes, as this architecture becomes increasingly important in the datacenter. When talking to customers, two questions I ask is, what is the current ratio of VMs to containers in your data center and what is the most popular format of deployment today? The common response is respectively 90% VMs and net-new is 90% containers. Today’s trend moves away from installing shrink-wrapped software and more towards custom building revenue-critical applications by their development teams.
The standard tool for developers is container-based infrastructure. Kubernetes is the defacto choice of orchestration of containers and consists of many infrastructure-focused options. The operations that interest me are the high availability and resource management operations. It appears these operations replace HA and DRS processes when glancing over them, but when looking more closely they strongly augment each other.
At VMworld in Las Vegas, Michael Gasch and I presented the session “Deep Dive: The Value of Running Kubernetes on vSphere” (CNA1553BU). If you are not going to VMworld Europe, I recommend watching the video recording, if you are going to VMworld Europe I recommend you to sign up.
One thing you can expect from me is more Kubernetes focused articles. One of the
things that I noticed is that many articles are written by cloud-native natives
for cloud-native natives. I.e. they rely on extensive previous exposure to this
ecosystem. I’m trying to cover some of the challenges I have faced and the
quirks I notice as a “newcomer”.

Today someone asked if the congestion threshold of SIOC is related to which host latency threshold? Is it the Device average (DAVG), Kernel Average (KAVG) or Guest Average (GAVG)?
Well actually it’s none of the above. DAVG, KAVG and GAVG are metrics in a host-local centralized scheduler that has complete control over all the requests to the storage system. SIOC main purpose is to manage shared storage resources across ESXi hosts, providing allocation of I/O resources independent of the placement of virtual machines accessing the shared datastore. And because it needs to regulate and prioritize access to shared storage that spans multiple ESXi hosts, the congestion threshold is not measured against a host-side latency metric. But to which metric is it compared? In essence the congestion threshold is compared with the weighted average of D/AVG per host, the weight is the number of IOPS on that host. Let’s expand on this a bit further.Average I/O latency
To have an indication of the load of the datastore on the array, SIOC uses the average I/O latency detected by each host connected to that datastore. Average latency across hosts is used to cope with the variety of workloads, the characteristic of the active workloads, such as read versus writes, I/O size and degree of sequential I/Os in addition to array behavior such as block location, caching policies and I/O scheduling.
To calculate and normalize the average latency across hosts, each host writes its average device latency and number of I/Os for that datastore in a file called IORMSTATS.SF stored on the same datastore.
A common misconception about SIOC is that it’s compute cluster based. The process of determining the datastore-wide average latency really reveals the key denominator – hosts connected to the datastore – . All hosts connected to the datastore write to the IORMSTATS.SF file, regardless of cluster membership. Other than enabling SIOC, vCenter is not necessary for normal operations. Each connected host reads the IORMSTATS.SF file each 4 seconds and locally computes the datastore-wide average to use for managing the I/O stream. Therefor cluster membership is irrelevant.Datastore wide normalized I/O latency
Back to the process of computing the datastore wide normalized I/O latency. The average device latencies of each host are normalized by SIOC based on the I/O request size. As mentioned before, not all storage related workloads are the same. Workloads issuing I/Os with a large request size result in longer device latencies due to way storage arrays process these workloads. For example, when using a larger I/O request size such as 256KB, the transfer might be broken up by the storage subsystem into multiple 64KB blocks. This operation can lead to a decline of transfer rate and throughput levels, increasing latency. This allows SIOC to differentiate high device latency from actual I/O congestion at the device itself.Number of I/O requests complete per second
At this point SIOC has normalized the average latency across hosts based on I/O size, next step is to determine the aggregate number of IOPS accessing the datastore. As each host reports the number of I/O requests complete per second, this metric is used to compare and prioritize the workloads.
I hope this mini-deepdive into the congestion thresholds explains why the congestion threshold could never be solely related to a single host-side metric . Because the datastore-wide average latency is a normalized value, the latency observed of the datastore per individual host may be different than the latency SIOC reports per datastore.
.

vSwitch configuration and load-balancing policy selection are major parts of a virtual infrastructure design. Selecting a load-balancing policy can have impact on the performance of the virtual machine and can introduce additional requirements at the physical network layer. Not only do I spend lots of time discussing the various options during design sessions, it is also an often discussed topic during the VCDX defense panels.
More and more companies seem to use IP-hash as there load balancing policy. The main argument seems to be increased bandwidth and better redundancy. Even when the distributed vSwitch is used, most organizations still choose IP-hash over the new load balancing policy “Route based on physical NIC load”. This article compares both load-balancing policies and lists the characteristics, requirements and constraints of both load-balancing policies.IP-Hash
The main reason for selecting IP-Hash seems to be increased bandwidth as you aggregate multiple uplinks, unfortunately adding more uplinks does not proportionally increase the available bandwidth for the virtual machines.How IP-Hash works
Based on the source and destination IP address together the VMkernel distributes the load across the available NICs in the vSwitch. The calculation of outbound NIC selection is described in KB article 1007371. To calculate the IP-hash yourself convert both the source and destination IP-addresses to a Hex value and compute the modulo over the number of available uplinks in the team. For example
Virtual Machine 1 opens two connections, one connection to a backup server and one connection server to an application server.

Virtual Machine

IP-Address

Hex Value

VM1

164.18.1.84

A4120154

Backup Server

164.18.1.160

A41201A0

Application Server

164.18.1.195

A41201C3

The vSwitch is configured with two uplinks.
Connection 1: VM1 > Backup Server (A4120154 Xor A41201A0 = F4) % 2 = 0
Connection 2: VM1 > Application Server (A4120154 Xor A41201C3 = 97) % 2 = 1
IP-Hash treats each connection between a source and destination IP address as a unique route and the vSwitch will distributed each connection across the available uplinks in the vSwitch. However due to the pNIC to vNIC affiliation, any connection is on a per flow basis. A flow can’t overflow to another uplink; this means that a connection is still limited to the speed of a single physical NIC. A real-world user case for IP-hash would be a backup server which requires a lot of bandwidth across multiple connections other than that; there are very few workloads that require bandwidth that can’t be satisfied by a single adapter.Complexity –In order for IP-hash to function correctly additional configuration at the network layer is required:

EtherChannel: IP-hash needs to be configured on the vSwitch if EtherChannel technology is used at the physical switch layer. With EtherChannel the switch will load balance connections over multiple ports in the EtherChannel. Without IP-hash, the VMkernel only expects to receive information on a specific MAC address on a single vNIC. Resulting in some sessions go through to the virtual machine while other sessions will be dropped. When IP-hash is selected, then the VMkernel will accept inbound mac addresses on both active NICs

EtherChannel configuration: As vSphere does not support dynamic link aggregation (LACP), none of the members can be set up to auto-negotiate membership and therefore physical switches have to be configured with static EtherChannel.

Switch configuration: vSphere supports EtherChannel from one switch to the vSwitch. This switch can be a single switch or a stack of individual switches that act as one, but vSphere does not support EtherChannel from two separate – non stacked – switches, when the EtherChannel connect to the same vSwitch.

Additional overhead – For each connection the VMkernel needs to select the appropriate uplink. If a virtual machine is running a front-end application and communicates 95% of its time to the backend database, the IP-Hash calculation is almost pointless. The VMkernel needs to perform the math for every connection and 95% of the connections will use the same uplink because the Algorithm will always result in the same hash.Utilization-unaware – It is possible that a second virtual machine is assigned to use the same uplink as the virtual machine that is already saturating the link. Let’s use the first example and introduce a new virtual machine VM3. Due to the backup window, VM3 connects to the backup server.

Virtual Machine

IP-Address

Hex Value

VM3

164.18.1.86

A4120156

Connection 3: VM3> Backup Server (A4120156 Xor A41201A0 = F6) % 2 = 0
Due to IP-HASH load balancing policy being unaware of utilization it will not rebalance if the uplink is saturated or if virtual machine are added or removed due to power-on or (DRS) migrations. DRS is unaware of network utilization and does not initiate a rebalance if a virtual machine cannot send or receive packets due to physical NIC saturation. In worst-case scenario DRS can migrate virtual machines to other ESX servers, leaving all the virtual machine that are saturating a NIC while the other virtual machines utilizing the other NICs are migrated. Admitted it’s a little bit of a stretch, but being aware of this behavior allows you to see the true beauty of the Load-Based Teaming team policy.Possible Denial of Service –Due to the pNIC-to-vNIC affiliation per connection a misbehaving virtual machine generating many connections can cause some sort of denial of service on all uplinks on the vSwitch. If this application would connect to a vSwitch with “Port-ID” or “based on physical load” only one uplink would be affected.Network failover detection Beacon Probing – Beacon probe does not work correctly if EtherChannel is used. ESX broadcast beacon packets out of all uplinks in a team. The physical switch is expected to forward all packets to other ports. In EtherChannel mode, the physical switch will not send the packets because it’s considered as one link. No beacon packets will be received and can interrupt network connections. Cisco switches will report flapping errors. See KB article 1012819.Route based on physical NIC Load
VMware vSphere 4.1 introduced a new load-balancing policy available on distributed vSwitches. Route based on physical NIC load, also known as Load Based Teaming (LBT) takes the virtual machine network I/O load into account and tries to avoid congestion by dynamically reassigning and balancing the virtual switch port to physical NIC mappings.How LBT works
Load Based Teaming maps vNICs to pNICs and remaps the vNIC-to-PNIC affiliation if the load exceeds specific thresholds on an uplink.
LBT uses the same initial port assignment as the “originating port id” load balancing policy, resulting in the first vNIC being affiliated to the first pNIC, the second vNIC to the second pNIC, etc. After initial placement, LBT examines both ingress and egress load of each uplink in the team and will adjust the vNIC to pNIC mapping if an uplink is congested. The NIC team load balancer flags a congestion condition if an uplink experiences a mean utilization of 75% or more over a 30-second period.Complexity – LBT requires standard Access or Trunk ports. LBT does not support EtherChannels. Because LBT is moving flows among the available uplinks of the vSwitch, it may create packets re-ordering. Even though the reshuffling process is not done often (worst case scenario every 30 seconds) it is recommended to enable PortFast or TrunkFast on the switch ports.Additional overhead – The VMkernel will examine the congestion condition after each time window, this calculation creates a minor overhead opposed to using the static load-balancing policy “originating port-id”.Utilization aware – vNIC to pNIC mappings will be adjusted if the VMkernel detects congestion on an uplink. In the previous example both VM1 and VM3 shared the same connection due to the IP-hash calculation. Both connections can share the same physical NIC as long as the utilization stays below the threshold. It is likely that both vNICs are mapped to separate physical NICs.
In the next example a third virtual machine is powered up and is mapped to NIC1. Utilization of NIC1 exceeds the mean utilization of 70% over a period of more than 30 seconds. After identifying congestion LBT remaps VM2 to NIC2 to decrease the utilization of NIC1.
Although LBT is not integrated in DRS it can be viewed as complimentary technology next to DRS. When DRS migrates virtual machines onto a host, it is possible that congestion is introduced on a particular physical NIC. Due to vNIC to pNIC mapping based on actual load, LBT actively tries to avoid congestion at physical NIC level and attempts to reallocate virtual machines. By remapping vNiCs to pNICs it will attempt to make as much bandwidth available to the virtual machine, which ultimately benefits the overall performance of the virtual machine.Recommendations
When using distributed virtual Switches it is recommended to use Load-Based teaming instead of IP-hash. LBT has no additional requirements on the physical network layer, reduces complexity and is able to adjust to fluctuating workloads. Due to the remapping of vNICs to pNICs based on actual load, LBT attempts to allocate as much bandwidth possible where IP-hash just simply distributes connections across the available physical NICs.
Get notification of these blogs postings and more DRS and Storage DRS information by following me on Twitter: @frankdenneman

vSphere introduced the HA admission control policy “Percentage of Cluster Resources Reserved”. This policy allows the user to specify a percentage of the total amount of available resources that will stay reserved to accommodate host failures. When using vSphere 4.1 this policy is the de facto recommended admission control policy as it avoids the conservative slots calculation method.Reserved failover capacity
The HA Deepdive page explains in detail how the “percentage resources reserved” policy works, but to summarize; the CPU or memory capacity of the cluster is calculated as followed;The available capacity is the sum of all ESX hosts inside the cluster minus the virtualization overhead, multiplied by (1-percentage value).
For instance; a cluster exists out of 8 ESX hosts, each containing 70GB of available RAM. The percentage of cluster resources reserved is set to 20%. This leads to a cluster memory capacity of 448GB (70GB+70GB+70GB+70GB+70GB+70GB+70GB+70GB) * (1 – 20%). 112GB is reserved as failover capacity. Although the example zooms in on memory, the percentage set applies both CPU and memory resources.
Once a percentage is specified, that percentage of resources will be unavailable for active virtual machines, therefore it makes sense to set the percentage as low as possible. There are multiple approaches for defining a percentage suitable for your needs. One approach, the host-level-approach is to use a percentage that corresponds with the contribution of one or host or a multiplier of that. Another approach is the aggressive approach which sets a percentage that equals less than the contribution of one host. Which approach should be used?Host-level
In the previous example 20% was used to be reserved for resources in an 8-host cluster. This configuration reserves more resources than a single host contributes to the cluster. High Availability’s main objective is to provide automatic recovery for virtual machines after a physical server failure. For this reason, it is recommended to reserve resource equal to a single host or a multiplier of that.
When using the per-host level of granularity in an 8-host cluster (homogeneous configured hosts), the resource contribution per host to the cluster is 12.5%. However, the percentage used must be an integer (whole number). Using a conservative approach it is better to round up to guarantee that the full capacity of one host is protected, in this example, the conservative approach would lead to a percentage of 13%.Aggressive approach
I have seen recommendations about setting the percentage to a value that is less than the contribution of one host to the cluster. This approach reduces the amount of resources reserved for accommodating host failures and results in higher consolidation ratios. One might argue that this approach can work as most hosts are not fully loaded, however it eliminates the guarantee that after a failure all impacted virtual machines will be recovered.
As datacenters are dynamic, operational procedures must be in place to -avoid or reduce- the impact of a self-inflicted denial of service. Virtual machine restart priorities must be monitored closely to guarantee that mission critical virtual machines will be restarted before virtual machine with a lower operational priority. If reservations are set at virtual machine level, it is necessary to recalculate the failover capacity percentage when virtual machines are added or removed to allow the virtual machine to power on and still preserve the aggressive setting.Expanding the cluster
Although the percentage is dynamic and calculates capacity at a cluster-level, when expanding the cluster the contribution per host will decrease. If you decide to continue using the percentage setting after adding hosts to the cluster, the amount of reserved resources for a fail-over might not correspond with the contribution per host and as a result valuable resources are wasted. For example, when adding four hosts to an 8-host cluster while continue using the previously configured admission control policy value of 13% will result in a failover capacity that is equivalent to 1.5 hosts. The following diagram depicts a scenario where an 8 host cluster is expanded to 12 hosts; each with 8 2GHz cores and 70GB memory. The cluster was originally configured with admission control set to 13% which equals to 109.2 GB and 24.96 GHz. If the requirement is to be able to recover from 1 host failure 7,68Ghz and 33.6GB is “wasted”.Maximum percentage
High availability relies on one primary node to function as the failover coordinator to restart virtual machines after a host failure. If all five primary nodes of an HA cluster fail, automatic recovery of virtual machines is impossible. Although it is possible to set a failover spare capacity percentage of 100%, using a percentage that exceeds the contribution of four hosts is impractical as there is a chance that all primary nodes fail.
Although configuration of primary agents and configuration of the failover capacity percentage are non-related, they do impact each other. As cluster design focus on host placement and rely on host-level hardware redundancy to reduce this risk of failing all five primary nodes, admission control can play a crucial part by not allowing more virtual machines to be powered on while recovering from a maximum of four host node failure.
This means that maximum allowed percentage needs to be calculated by summing the contribution per host x 4. For example the recommended maximum allowed configured failover capacity of a 12-host cluster is 34%, this will allow the cluster to reserve enough resources during a 4 host failure without over allocating resources that could be used for virtual machines.

Lately the question about setting CPU affinity is rearing its ugly head again. Will it offer performance advantages for the virtual machine? Yes it can, but only in very specific cases. Additional settings and changes to the virtual infrastructure are required to obtain a performance increase over the default scheduling techniques. Setting CPU affinity by itself will not result in any performance gain, but usually a performance decrease.What does CPU affinity do?
By setting a CPU affinity on the virtual machine you are limiting the available CPUs on which the virtual machine can run. It does not dedicate that CPU to that virtual machine and therefore does not restrict the CPU scheduler from using that CPU for other virtual machines.When will CPU-affinity help?
Under a controlled environment some specific workloads can benefit from using CPU affinity. When the virtual machine workload is cache bound and has a larger cache footprint than the available cache of one CPU it can profit from aggregated caches. However, if this workload has high intra-thread communications and is running on specific CPU architectures setting CPU affinity can have the opposite effect and become detrimental to the performance of the application.
CPU-affinity can also be used to isolate a physical CPU to a virtual CPU. But requires a lot of changes and increases management. It will never dedicate the physical CPU to the virtual machine as the VMkernel schedules all its processes across all available CPUs regardless of any custom setting a virtual machine has. Furthermore the scheduling overhead stays the same whether CPU-affinity is set on the virtual machine or not.
To determine if you application fit this description can be a challenge and maintaining such configurations usually result in a nightmare. Generally CPU-affinity is only used for simulations and load testing and it is better left unused for every other cases. Setting CPU-affinity results in less choice for the CPU scheduler to schedule the virtual machine, but there is more to it as well:Controlled environment
Already mentioned but this cannot be stressed enough, CPU affinity does not equal isolation of a physical CPU. In other words, when a virtual machine is pinned to a physical CPU it does not control or own that CPU. The VMkernel CPU scheduler still considers that physical CPU a valid CPU to schedule other virtual machines on. If isolation of a CPU is the end-goal, than all other residing virtual machines on the host (and virtual machine that will be created in the future) must be configured with CPU affinity as well and the specific CPU(s) assigned to the virtual machine must excluded from all other virtual machines.
Setting CPU affinity results in manual CPU micro management and can be a nightmare to maintain. To make it worse, think of the impact a migration will have, the administrator needs to configure the virtual machines on the destination host to exclude the CPU from all active virtual machines as well.Virtual Machine worlds
A virtual machine is made of multiple worlds (threads), besides the vCPU world, worlds are active for the virtual machine MKS subsystem, CD-ROM and VMX file. Although the vCPU world generates the greater part of the CPU load, sometimes a physical CPU is required to run the other worlds. If CPU affinity is set, then all the worlds that constitute the virtual machine can only run on the specified CPUs. If set incorrectly, it can reduce the throughput of the virtual machine as the worlds must compete between each other for CPU time. Therefore it is recommended to add an additional CPU for these worlds. For example; configure a CPU affinity setting that contains 3 physical CPUs for a 2 vCPU virtual machine.Resource entitlements
As CPU affinity will not automatically isolate the CPU for that specific virtual machine, shares and reservations needs to be set to guarantee a specific performance level. Because the scheduler will attempt to maintain fairness for all virtual machines it is possible that other virtual machines will be scheduled on the set of CPU specified in the affinity set of the virtual machine. Adjust the shares and reservations of the virtual machine accordingly to ensure priority over other active virtual machines. Be aware that CPU reservations are friendly; although the vCPU is guaranteed a specific portion of physical resources, it might happen that an external thread/interloper (other virtual machine) is using the vCPU; this thread will not instantly be de-scheduled. Even when the waiting virtual machine has a 100% CPU reservation configured.
To make it worse, in the case when multiple virtual machines are affinity-bound to the same processor it is possible that the CPU scheduler cannot meet the specified reservation. Be aware that admission control ignores affinity, so multiple virtual machines can have a full reservation equal to a full core but still need to compete with other affinity bound virtual machines. More information about how CPU reservations work can be found in the article: “Reservations and CPU Scheduling”.CPU reservations and HA admission control
If the virtual machine with the reservation is running in a HA cluster with a “Host failures cluster tolerates” admission control policy, the CPU reservation will influence the Slot size of the Cluster and can therefore impact the consolidation ratio of the cluster. More info about slot-sizes can be found on the HA deepdive.CPU affinity and DRS clusters.
Because vMotion is not allowed if a virtual machine is configured with CPU affinity, that virtual machine cannot be placed in a DRS cluster with automation mode set to fully automated. If a virtual machine needs to be configured with CPU affinity, the administrator has three choices:

Place the virtual machine on a stand-alone host

Set DRS automation level to manual / partially automated

Set Virtual machine automation mode to manual / partially automated

Stand-alone host
If the virtual machine is placed on the stand-alone host the performance of the virtual machine depends on the level of contention and the virtual machine resource entitlement. During resource contention it can only fall back on its resource entitlement and hopefully gain a higher priority than the other residing virtual machines. If the virtual machine was located on an ESX host in a DRS cluster, the virtual machine could have been migrated to receive its resource entitlement on another host. By choosing CPU-affinity, you are betting only on one horse, the local CPU scheduler of one host instead of leveraging the full suite of resource management vSphere delivers today.

DRS set to Manual or partially automated
If the DRS automation level is set to manual or partially automated, the cluster will not automatically load balance virtual machines and DRS will recommend migrations. These recommendations must be applied manually by the administrator. DRS imbalance calculation will be invoked every 300 seconds but is also triggered if the cluster detects resource demand and supply changes, as well as changes in the resource settings in the cluster. As you can imagine, this behavior will create an incredible load on the administrator to let the cluster operate as efficiently as possible if he wants to ensure that the virtual machines are receiving their resource entitlements.

Set Virtual machine automation mode to manual / partially automated
By changing the automation mode on VM-level, the virtual machine can still be placed inside a fully automated DRS cluster. Although DRS will not automatically migrate this virtual machine, it can migrate other virtual machines to ensure every virtual machine will receive its resource entitlement. However additional measures (shares and reservations) must be taken to guarantee the virtual machine enough physical resources.

CPU architectures
Today new CPU architectures, such as the Intel Nehalem and AMD Opteron’s offer a variety of on-die caches, multiple cores \ logical CPUs and an optimized local\remote memory subsystem. These features can either helpful or be detrimental to the performance of a virtual machine with CPU affinity.

Cache level
If a virtual machine is spanned across two processors (packages) it effectively results in having two L3 caches available to the virtual machine. Today’s CPU architectures offer dedicated L1 and L2 cache per core and a shared last-level L3 cache for all cores inside the CPU package. Because access to Last level cache is faster than (normal) memory, it makes sense to span the virtual machine across two processor packages to increase the amount of available L3 cache.

However the inter-socket communication speed can reduce –or remove- the positive effect of having low-latency cache available and if the workload can fit inside one cache (small cache footprint) and uses intensive intra-thread communication, than placement in one processor packaged is to be preferred over spanning multiple packages.

HyperThreading
If a virtual machine is running on a HyperThreading-enabled system it is best to set the CPU-affinity to logical CPUs not belonging to the same core. The HT threads on a core are translated by the VMkernel as logical CPUs and are consecutively numbers, for example Core 1 contains LCPU0 and LCPU1, Core 2 contains LCPU2 and LCPU3, etc. If CPU-affinity is set to logical CPUs belonging to the same core, both vCPUs of the virtual machine need to compete with each other for physical CPU resources. By scheduling a virtual machine on logical CPUs of different cores, it doesn’t have to compete and can benefit the vCPUs’ throughput because the VMkernel allows the vCPU to use the entire Cores’ resources if only one logical CPU residing on the core is active.

NUMA
If CPU affinity is set on a virtual machine running in a NUMA architecture (Intel Nehalem and AMD Opteron) the virtual machine is treated as a NON-NUMA client and gets excluded from NUMA scheduling. Therefore the NUMA scheduler will not set a memory affinity for the virtual machine to its current NUMA node and the VMkernel can allocate memory from every available NUMA node in the system Therefore the virtual machine may end up running on a different NUMA node than were its memory is residing, resulting in unnecessary memory latency and possibly higher %Ready time as the instruction must wait until the memory is fetched from a remote node.

Bottomline
The bottomline is that almost in every case CPU affinity is better left unused. Scheduling threads is very complex, scheduling threads belonging to multiple virtual machines with different priorities, activity, progress and still considering optimal use of the underlying CPU and memory architecture is mind-blowing complex. The CPU scheduler is aware of all these components and together with the global scheduler (DRS) it can see to it that the virtual machine will receive its resource entitlement. If the virtual machine must have access to physical resources at any time, other mechanisms such as resource allocation settings will have a better effect than using the advanced setting CPU-affinity.