Host Deep Dive

In my previous post, I mentioned part 3 of the Host Deep Dive session at VMworld 2018. The ‘3’ is because we ran parts 1 and 2 at VMworld 2016 and 2017. We had the chance to try a new way of discussing host resource settings by creating levels of performance tuning. The feedback we received while test-driving this session at the London and Indianapolis VMUGs was really positive.

As we have always stated, the out-of-the-box experience of VMware vSphere is good enough for 80-90% of common virtual infrastructures. We like to show how you can gradually increase performance and reduce latency with advanced vSphere tuning. That’s why we came up with the Pyramid of Performance Optimization. Delivering the content this way allows for a better understanding of when to apply certain optimizations. We will start with the basics and work our way up to the settings that squeeze the maximum performance out of vSphere ESXi hosts.

Due to session time constraints, we will focus on compute (CPU and memory) and virtual networking. The following pyramids show the subjects we will discuss in our session.

Pyramid of Performance Optimization – Compute:

Pyramid of Performance Optimization – Networking:

We will go through all these levels in detail. We very much look forward to VMworld and hope to see you there! Be sure to reserve your seat for this session!

Modern physical NICs (pNICs) have several offloading capabilities. If you are running VMware NSX, which uses VXLAN, you could benefit from the VXLAN offloading feature. VXLAN offloading allows you to use TCP offloading mechanisms like TCP Segmentation Offload (TSO) and Checksum Offload (CSO), because the pNIC is able to ‘look into’ encapsulated VXLAN packets. That results in lower CPU utilization and a possible performance gain. But how do you determine what is actually supported by your pNIC and by the driver in use in ESXi?

It is recommended to follow these three steps to fully verify if the VXLAN offload feature you are looking for is supported and enabled.

Step 1: Check the support of the pNIC chipset
Step 2: Check the support of the driver module
Step 3: Check if the driver module needs configuration

The first step is to check the vendor information about the supported features on their pNIC product. Let’s take the combination of a 10GbE Broadcom QLogic 57810 NIC and the VXLAN offload feature as an example. Looking at the datasheet of the QLogic 57810 NIC, it clearly states that VXLAN offloading is supported.

The (Dynamic) NetQueue feature in ESXi, which is enabled by default if the physical NIC (pNIC) supports it, allows incoming network packets to be distributed over different queues. Each queue gets its own ESXi thread for packet processing, and each ESXi thread can consume up to one CPU core.

However, (Dynamic) NetQueue and VXLAN are not the best of friends when it comes to distributing network I/O over multiple queues. That is because of the way Virtual Tunnel End Points (VTEP) are set up. Within a VMware NSX implementation, each ESXi host in the cluster contains at least one VTEP, dependent upon the NIC load balancing mode chosen. The VTEP is the component that provides the encapsulation and decapsulation for the VXLAN packets. That means all VXLAN network traffic from a VM perspective will traverse the VTEP and the receiving VTEP on another ESXi host.

Therein lies the problem when it comes to NetQueue and the ability to distribute network I/O streams over multiple queues. A VTEP will always have the same MAC address, and the VTEP network will have a fixed VLAN tag. MAC address and VLAN tag are the filters most commonly supported by pNICs with VMDq and NetQueue enabled. Because every VXLAN packet for a VTEP matches the same filter, the number of queues that can be used is seriously restricted, which can limit the network performance of your VXLAN networks. VMware NSX now supports multiple VTEPs per ESXi host. This helps slightly: more VTEPs per ESXi host means more MAC addresses, so NetQueue has more combinations to filter on. Still, it is far from perfect when it comes to the desired network I/O parallelism using multiple queues and CPU cores. To overcome that challenge, some pNICs support distributing traffic over queues by filtering on inner (encapsulated) MAC addresses. RSS (Receive Side Scaling) can do that for you.
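To make the filtering problem a bit more tangible, here is a small toy model in Python. It is not how NetQueue or RSS is implemented in ESXi; it simply hands each distinct filter key its own queue, round-robin, and counts the packets per queue. The packet fields, MAC addresses, VLAN ID and the number of queues are all made-up values for illustration.

```python
from collections import Counter


def queues_used(packets, key_fields, num_queues=4):
    """Toy model: give each distinct filter key a queue (round-robin)
    and return the number of packets that land in each queue."""
    key_to_queue = {}
    load = Counter()
    for pkt in packets:
        key = tuple(pkt[field] for field in key_fields)
        if key not in key_to_queue:
            key_to_queue[key] = len(key_to_queue) % num_queues
        load[key_to_queue[key]] += 1
    return dict(load)


# Hypothetical VXLAN traffic: eight VMs behind one VTEP (one outer MAC, one VLAN).
packets = [
    {
        "outer_mac": "00:50:56:aa:aa:01",        # VTEP MAC, identical for every packet
        "vlan": 100,                             # fixed VLAN tag of the VTEP network
        "inner_mac": f"00:50:56:00:00:{i:02x}",  # MAC of the VM inside the tunnel
    }
    for i in range(8)
]

# Filtering on outer MAC + VLAN: every packet shares one key, so one queue.
outer = queues_used(packets, ("outer_mac", "vlan"))

# Filtering on the inner MAC: eight keys spread over the four queues.
inner = queues_used(packets, ("inner_mac",))

print("outer MAC + VLAN:", outer)
print("inner MAC:       ", inner)
```

With the outer MAC and VLAN tag as the filter, all eight packets end up in a single queue. Filtering on the inner MAC spreads the same traffic over all four queues, which is the kind of parallelism you are after.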

VMworld is upon us. The schedule builder went live and boy am I excited about VMworld 2017!

SER1872BU

This year Frank and I are presenting the successor to last year’s session at both VMworlds. We are listed in the schedule builder as SER1872BU – vSphere 6.5 Host Resources Deep Dive: Part 2. We are planning to bring even more ESXi epicness, with a slight touch of vSAN and networking information that allows you to prep ESXi to run NFV workloads that drive IoT innovations. Last year we were lucky to have packed rooms in Vegas and Barcelona.

The enthusiasm about our book witnessed so far shows us there is still a lot of love out there for the ESXi hypervisor and ‘under-the-hood’ tech! We are working hard on having an awesome session ready for you!

VMworld Session Picks

This year I want to learn more about NFV, IoT and Edge as I find innovation in these areas intriguing. I found some sessions that look to be very interesting. I supplemented these with talks held by industry titans about various topics. If my schedule lets me, I want to see the following sessions:

Leading the 5G and IoT Revolution Through NFV [FUT3215BU] by Constantine Polychronopoulos

vSAN at the Edge: HCI for Distributed Applications Spanning Retail to IoT [STO2839GU] by Kristopher Groh

Looking at the ESXi VMkernel network path, you will notice it consists of Netpoll threads and Tx threads. Netpoll threads receive traffic from an ESXi host perspective, whereas Tx threads transmit data from a VM to another VM or to a physical component.

By default, each VM is armed with only one Tx thread. As network packets are transmitted from the VM towards the pNIC layer via the VMkernel, ESXi consumes CPU cycles. These cycles, or CPU time, are accounted to the VM itself. Tx threads are identified in esxtop in the CPU view as NetWorld-VM-XXX. This ensures that you have a clear picture of the cost of transmitting large numbers of network packets from that specific VM. It gives you a better understanding of whether a VM is constrained by the amount of CPU time that is spent on transmitting data.

Again, only one Tx thread is spun up by default. That correlates with one CPU core. This is why the NetWorld will not exceed roughly 100% of %USED.

In the screenshot above, the VM in question was running the transmit side of the packet-generator test. The NetWorld-VM-69999 world was constantly running at up to 100%. This is a clear example of a VM being constrained by having only one Tx thread. A relatively quick solution is to add an additional Tx thread. You can add more as needs require. Looking at the network view in esxtop, you will be able to see which vNIC is processing the largest amount of network I/O. In this specific case, we knew exactly which vNIC was in need of extra network processing power.

Additional Tx threads

You can add an additional Tx thread per vNIC. This is configured as an advanced parameter in the VM configuration. The ethernetX.ctxPerDev = 1 advanced setting is used for this. The ‘X’ stands for the vNIC for which the parameter is set. You can configure every vNIC with its own Tx thread. However, that creates unnecessary Tx threads in your VM and potentially consumes CPU time inefficiently, because not every vNIC is likely to require its own Tx thread. It really is a setting that is driven by demand. If the workload running in your VMs has a large appetite for network I/O, take a closer look at which vNIC could benefit from an additional Tx thread.
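As a minimal sketch of the ‘driven by demand’ idea, the Python helper below builds the advanced setting only for the vNICs you select instead of for every vNIC in the VM. The function names are hypothetical, and the quoted key = "value" rendering is an assumption about how the entry would look in the VM configuration file. Apply the setting through your usual tooling.

```python
def ctx_per_dev_settings(vnic_indexes, value=1):
    """Build ethernetX.ctxPerDev entries for the selected vNICs only.

    vnic_indexes: iterable of vNIC numbers (the 'X' in ethernetX).
    value: the ctxPerDev value to set (1, as used in this post).
    """
    settings = {}
    for idx in sorted(set(vnic_indexes)):
        if idx < 0:
            raise ValueError(f"vNIC index must be non-negative, got {idx}")
        settings[f"ethernet{idx}.ctxPerDev"] = str(value)
    return settings


def as_config_lines(settings):
    """Render settings as key = "value" lines (assumed config file syntax)."""
    return [f'{key} = "{val}"' for key, val in settings.items()]


# Example: only the second vNIC (ethernet1) carries the heavy transmit load.
selected = ctx_per_dev_settings([1])
for line in as_config_lines(selected):
    print(line)
```

Passing only the busy vNICs keeps the number of extra Tx threads, and the CPU time they may consume, to what the workload actually needs.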

Once the additional Tx thread(s) are configured, you want to verify that they are active. Additional Tx threads will appear in esxtop in the CPU view as NetWorld-Dev-<id>-Tx. Because each one is added as a separate world, the CPU usage associated with each thread gives a clear overview of which NetWorld is processing the majority of network I/O.
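If you collect world names from esxtop yourself, a small Python filter like the one below can separate the default per-VM Tx thread from the additional per-vNIC Tx threads. It only relies on the two naming patterns mentioned in this post. Treating <id> as purely numeric is an assumption, and apart from NetWorld-VM-69999 the sample names are made up.

```python
import re

# Patterns based on the world names described in this post.
VM_TX = re.compile(r"^NetWorld-VM-(\d+)$")
DEV_TX = re.compile(r"^NetWorld-Dev-(\d+)-Tx$")  # assumes <id> is numeric


def classify_world(name):
    """Return which kind of Tx world a name looks like, or 'other'."""
    if DEV_TX.match(name):
        return "additional vNIC Tx thread"
    if VM_TX.match(name):
        return "default VM Tx thread"
    return "other"


# Hypothetical world names as they might be listed in the esxtop CPU view.
worlds = [
    "NetWorld-VM-69999",
    "NetWorld-Dev-67108870-Tx",
    "vmx-vcpu-0:testvm",
]

for world in worlds:
    print(f"{world}: {classify_world(world)}")
```

If no name matches the NetWorld-Dev pattern after you have set ctxPerDev, the additional Tx thread is most likely not active yet.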

In this screenshot, you will notice that the additional Tx thread is active and processing network I/O. This is one way to determine if your advanced setting is working correctly. You can also use a net-stats command to do so.

More information…

…can be found in the vSphere 6.5 Host Resources Deep Dive book that is available on Amazon!