New VMware Fling to improve Network/CPU performance when using Promiscuous Mode for Nested ESXi

I wrote an article a while back, Why is Promiscuous Mode & Forged Transmits required for Nested ESXi?, and the primary motivation behind it was an observation a customer made while using Nested ESXi. The customer was performing networking benchmarks on their physical ESXi hosts, which happened to be hosting a couple of Nested ESXi VMs as well as regular VMs. The customer concluded in their blog that running Nested ESXi VMs on their physical ESXi hosts actually reduced overall network throughput.

UPDATE (04/24/17) - Please have a look at the new ESXi Learnswitch which is an enhancement to the existing ESXi dvFilter MAC Learn module.

UPDATE (11/30/16) - A new version of the ESXi MAC Learning dvFilter has just been released to support ESXi 6.5; please download v2 for that ESXi release. If you have ESXi 5.x or 6.0, you will need to use the v1 version of the Fling, as v2 is not backward compatible. You can find all the details on the Fling page here.

This initially did not click until I started to think a bit more about the implications of enabling Promiscuous Mode, which I suspect not many of us are aware of. At a very high level, Promiscuous Mode allows for proper network connectivity for our Nested VMs running on top of a Nested ESXi VM (for the full details, please refer to the blog article above). So why is this a problem, and how does it lead to reduced network performance as well as increased CPU load?

The diagram below will hopefully help explain why. Here, I have a single physical ESXi host that is connected to either a VSS (Virtual Standard Switch) or VDS (vSphere Distributed Switch), and I have a portgroup with Promiscuous Mode enabled which contains both Nested ESXi VMs as well as regular VMs. Let's say we have 1000 network packets destined for our regular VM (highlighted in blue); one would expect that the red boxes (representing the packets) will be forwarded only to our regular VM, right?

What actually happens is shown in the next diagram: every Nested ESXi VM, as well as every other regular VM within the Promiscuous Mode-enabled portgroup, will receive a copy of those 1000 network packets on each of its vNICs, even though the packets were not originally intended for them. Performing these shadow copies of the network packets and forwarding them down to the VMs is a very expensive operation. This is why the customer was seeing reduced network performance as well as increased CPU utilization to process all the additional packets that would eventually be discarded by the Nested ESXi VMs.

This really solidified in my head when I logged into my own home lab system, on which I run anywhere from 15-20 Nested ESXi VMs at any given time in addition to several dozen regular VMs, just like any home/development/test lab would. I launched esxtop, set the refresh cycle to 2 seconds, and switched to the networking view. At the time I was transferring a couple of ESXi ISOs for my kickstart server and realized that ALL my Nested ESXi VMs got a copy of those packets.

As you can see from the screenshot above, every single one of my Nested ESXi VMs was receiving ALL traffic from the virtual switch, which adds up to a lot of resources being wasted on my physical ESXi host that could otherwise be used for running other workloads.

I decided at this point to reach out to engineering to see if there was anything we could do to reduce this impact. I initially thought about using NIOC, but then realized it was primarily designed for managing outbound traffic, whereas the Promiscuous Mode traffic is all inbound, and it would not actually get rid of the traffic. After speaking to a couple of Engineers, it turns out this issue had been seen in our R&D Cloud (Nimbus), which provides IaaS capabilities to the R&D organization for quickly spinning up both virtual and physical instances for development and testing.

Christian Dickmann was my go-to guy for Nimbus, and it turns out he had seen this particular issue before. Not only had he seen this behavior, he also had a nice solution to the problem in the form of an ESXi dvFilter that implemented MAC Learning! As many of you know, our VSS/VDS does not implement MAC Learning, since the platform already knows which MAC Addresses are assigned to a particular VM.

I got in touch with Christian and was able to validate his solution in my home lab using the latest ESXi 5.5 release. At that point, I knew I had to get this out to the larger VMware community and started working with Christian and our VMware Flings team to see how we could get this released as a Fling.

Today, I am excited to announce the ESXi MAC Learning dvFilter Fling, which is distributed as an installable VIB for your physical ESXi host and provides support for ESXi 5.x & ESXi 6.x.

Note: You will need to enable Promiscuous Mode either on the VSS/VDS or specific portgroup/distributed portgroup for this solution to work.
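If you prefer to do this from the command line, the security policy can be toggled with ESXCLI for a standard vSwitch or portgroup (for a VDS you would use the vSphere Web Client or API instead). A minimal sketch; the vSwitch name vSwitch0 and portgroup name "Nested-ESXi" are placeholders for your environment:

```shell
# Enable Promiscuous Mode (and Forged Transmits) on an entire standard vSwitch
esxcli network vswitch standard policy security set \
    --vswitch-name=vSwitch0 \
    --allow-promiscuous=true \
    --allow-forged-transmits=true

# Or scope it to a single portgroup to limit the exposure
esxcli network vswitch standard portgroup policy security set \
    --portgroup-name="Nested-ESXi" \
    --allow-promiscuous=true \
    --allow-forged-transmits=true
```

Scoping the policy to a dedicated portgroup, rather than the whole vSwitch, keeps the promiscuous flooding away from your regular VMs.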

You can download the MAC Learning dvFilter VIB here, or you can install it directly from a URL.

To install the VIB, run the following ESXCLI command if you have the VIB uploaded to your ESXi datastore:
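As a sketch of that install command, assuming the VIB was uploaded to a datastore (the datastore and VIB filename below are placeholders; check the Fling page for the exact filename):

```shell
# Install the MAC Learning dvFilter VIB from a datastore path.
# The VIB is not signed by the usual depot certificates, so force acceptance with -f.
esxcli software vib install \
    -v /vmfs/volumes/datastore1/vmware-esx-dvfilter-maclearn-0.1-ESX-5.0.vib -f
```

Note that `esxcli software vib install -v` requires an absolute path (or a URL, which is how the install-from-URL option mentioned above works).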

A system reboot is not necessary and you can confirm the dvFilter was successfully installed by running the following command:

/sbin/summarize-dvfilter

You should be able to see the new MAC Learning dvFilter listed at the very top of the output.

For the new dvFilter to work, you will need to add two Advanced Virtual Machine Settings to each of your Nested ESXi VMs, and this is on a per-vNIC basis, which means you will need N entries if you have N vNICs on your Nested ESXi VM.
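The two entries per vNIC are the dvFilter's name and its failure policy. A sketch for a VM's first vNIC (ethernet0), using the filter4 slot that the Fling instructions reference; you would repeat the pair for ethernet1, ethernet2, and so on:

```
ethernet0.filter4.name = "dvfilter-maclearn"
ethernet0.filter4.onFailure = "failOpen"
```

With failOpen, the vNIC keeps passing traffic if the filter cannot be attached; failClosed would cut it off instead.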

This can be done online, without rebooting the Nested ESXi VMs, if you leverage the vSphere API. Another way is to shut down your Nested ESXi VM and use either the "legacy" vSphere C# Client or the vSphere Web Client, or, for those who know how, to append the entries to the .VMX file and reload it, as that is where the configuration is persisted on disk.
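For the append-and-reload approach, here is a minimal sketch from the ESXi Shell of the physical host. The VM name, datastore path, and vmid are placeholders, and the VM should be powered off while you edit its .vmx:

```shell
# Path to the Nested ESXi VM's configuration file (placeholder path)
VMX=/vmfs/volumes/datastore1/Nested-ESXi-VM/Nested-ESXi-VM.vmx

# Append the two dvFilter entries for the first vNIC
echo 'ethernet0.filter4.name = "dvfilter-maclearn"' >> "$VMX"
echo 'ethernet0.filter4.onFailure = "failOpen"' >> "$VMX"

# Look up the VM's ID, then ask hostd to reload the updated .vmx
vim-cmd vmsvc/getallvms | grep Nested-ESXi-VM
vim-cmd vmsvc/reload <vmid>
```

The vim-cmd reload step is what makes hostd pick up the edited file without re-registering the VM.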

I normally provision my Nested ESXi VMs with four vNICs, so I have four corresponding entries. To confirm the settings are loaded, we can re-run the summarize-dvfilter command, and we should now see our Virtual Machine listed in the output along with each vNIC instance.
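If you have many VMs and do not want to scan the full output, you can filter for the relevant entries (a sketch; a couple of lines of leading context is usually enough to show which port/vNIC each filter instance is attached to):

```shell
# Show only the dvfilter-maclearn instances, with surrounding context lines
/sbin/summarize-dvfilter | grep -B 2 dvfilter-maclearn
```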

Once I started to apply this change across all my Nested ESXi VMs, using a script I had written for setting Advanced VM Settings, I immediately saw the decrease in network traffic on ALL my Nested ESXi VMs. For those of you who wish to automate this configuration change, you can take a look at this blog article, which includes both a PowerCLI & vSphere SDK for Perl script that can help.

I highly recommend that anyone who uses Nested ESXi ensure this VIB is installed on all their ESXi hosts! As a best practice, you should also isolate your other workloads from your Nested ESXi VMs, which will allow you to limit which portgroups must have Promiscuous Mode enabled.

Comments

Thanks for sharing this. This will be useful not just for folks doing nested ESXi or Hyper-V, but also for those experimenting with Open vSwitch.
BTW, are the MAC learning parameters customisable? Also interested to know what the default MAC address aging time of the vib is.

William, this is very interesting. I have a similar lab environment, but in my lab I’m using NSX-mh as the ROOT networking layer, and my nested ESXi hosts are connected to virtual networks through the NSX vSwitch (NVS), not the VDS or VSS. In NSX, virtual networks have a mode where they can perform MAC learning, and that’s how I configure the ports where my ESXi hosts connect.

By using the vSphere API, you can dynamically add these configurations to a given VM, and they would take effect immediately. You can edit the .VMX file manually, but that would require downtime, as mentioned in the article. I provide a link to an article which goes into more detail on leveraging the vSphere API; I recommend you give that a read 🙂

Yes, you can add these to a VM Template prior to converting the VM to a template.

Thanks for the reply - one other question (sorry if I have missed it anywhere): do we need to apply this to any other nested hypervisors? E.g. if we have any nested Hyper-V hosts running? These would also be using promiscuous mode.

Great post, but what I am wondering is whether there is an easy way to test (AKA send a continuous large ping packet from one machine to another). I want to test before and after so that I can verify things are set up correctly. Unless I missed it, I do not see the command example on esxtop. That is a new command to me.

Hi William, thanks for the post. If we install the VIB, does the standard switch/distributed switch learn MACs for all VMs, or just for the VMs (nested ESXi) where I apply the filter by adding the VM parameters?

Yes, it is still needed if you do not want the performance impact of Promiscuous Mode and I believe it is still compatible with ESXi 6.0, I’ve not had a chance to personally test it but afaik, nothing has changed on this front.

I wish to take advantage of the MAC-learning ability in a nested switch topology wherein traffic from VMs on the outer vswitches to VMs on the inner vswitches need not go through a physical switch. (The outer vswitch learns the MACs of the VMs on the inner vswitches.)

Would configuring the above-described VM options on the vNICs of a transparent bridge connecting the outer and inner vswitch enable the outer vswitch to learn the MACs of the VMs on the inner vswitch?

>>”The MAC Learn dvFilter is specific to how Nested ESXi works”
That explains why I was still seeing traffic from outer-vswitch VMs to inner-vswitch VMs going to the uplink of the physical switch after trying the steps to use the fling.

We’re working to implement this in a lab, and our host dvSwitch has multiple uplinks. We’ve found that we cannot have the uplinks configured in an Active-Active configuration and get reliable connectivity to our Nested environments, but if we have only one Uplink connected to our dvSwitch, or if we have them in an Active-Standby configuration, the nested environment functions just fine.

Can you think of why we would see issues when implementing the Active-Active uplink configuration on the parent dvSwitch? While we have a functional alternative solution, it’s more a matter of interest in understanding the nuts and bolts behind the solution.

Any insights would be greatly appreciated… we have the Active-Standby solution functional in 5.5 and will be trying in 6.0 very soon.

Interesting 🙂
I’m starting to use ESXi now for a lab environment, and one thing that struck me as odd is the poor remote (NFS) performance. My “storage” is able to support ~100MB/s locally (Linux based), and a VM running on Workstation over W7 via a 1Gb link gets ~50MB/s, but the same VM on the ESXi gets 15MB/s 🙁
This did increase the performance to ~22MB/s, but I guess there is more room to grow…

I really would appreciate a description of how this works. It seems it learns, at each port, which traffic should be filtered (well, it does not say exactly, but it does say it works in conjunction with promiscuous mode, that port memory is limited, and that forgetting, aka aging, is not implemented).
Any way to see the learnt MACs? And how would it work with the same MAC living on many ports?
What about trunks? Also, does it have to be filter4? 🙂
Thanks!

I fail to get this working in my VNCI vCloud environment. The filter installed OK and works for a standalone (as in not managed by vCloud) nested ESXi, but as soon as I apply the .vmx advanced settings, it renders the nested ESXi offline.
Any ideas?

Hi Carlos,
yes, the VIB is installed on the physical ESXi host and not installed on the nested ESXi.
To me it looks like it does not work in conjunction with the vsla-fence filter, which is added by vCloud Director.

I see, I misinterpreted your “managed nested ESX” phrase.
The hard part is that there are not many docs about how to deal with this. There is a post somewhere that shows how to do network sniffing in between filter taps. But if the insertion of this filter is what breaks connectivity… I don’t see that there is much that can be done.
Good luck,
-Carlos

FYI, I changed dvfilter-maclearn to filter0 (instead of 4). Seems to work with vsla-fence at filter1 – for now.
So @Carlos, it can be another filter position, but who knows what implications that has 😉

OneBit,
nice to know. I’ve been unable to find much info (let alone tools) to play with the filter chain,
agent management, policies (failopen/failclose), fast/slow path, etc.
So far I thought that filter order was irrelevant if you cleared the whole chain, but it seems it is not.
(The order of the elements alters the product here.)
Thanks for the feedback!

I run a fairly large nested ESXi environment in a dev/test setting, with 25 physical ESXi hosts that have roughly 700 nested ESXi hosts running within them. This VIB greatly improved network performance in this environment; however, there is one flaw that really creates problems. Within our environment we are constantly creating and destroying these nested ESXi environments, both to ensure a clean environment and to avoid needing licences for all these virtual ESXi hosts.

The problem is that after about 3-4 months the MAC learning tables fill up, and the % of dropped packets reported by esxtop skyrockets into the 30-40% range. At this point network performance becomes severely degraded. The only solution I have found is to put each host into maintenance mode, one at a time, and then reboot it. This clears the learning table and allows the VIB to function properly again. It is a very time-consuming process, though, since with the 25 physical hosts we have, we can really only put 1 or 2 into maintenance mode at a time before we run out of memory on the other hosts in the cluster. I have been hoping that this VIB would either get updated to drop MAC addresses from the table that have not been used, or at least provide a command to flush the whole table when it becomes full, so we don’t have to reboot the physical hosts.

The MAC Learn table is actually on a per-vNIC basis and not at any global pESXi level. The table is cleared each time a VM is powered off and re-created when it is powered back on. If the ESXi VMs are running for long periods of time, this could explain why you’re seeing this behavior. Is this the case when you say 3-4 months’ time?

The 3-4 months’ time is just an estimate as to how long after I reboot our physical hosts we start seeing the problem again. Since the ESXi eval license is only good for 60 days, ESXi VMs don’t stick around any longer than that before they are powered down, deleted, and re-created. There might be some exceptions to this, though, for testing environments we have licensed for longer-term testing. I am going to look around and see if I can find some VMs that have been recently created and are exhibiting this behavior. Can you elaborate on why a VM that has been running for a long period of time would eventually start seeing these high percentages of dropped rx packets?

Thank you very much for this fling! I am a one-person IT consultant and have my own vSphere Essentials bundle to handle all my stuff. I use OpenVPN as my primary VPN to connect my laptop or iPhone to my network, and it requires promiscuous mode on the host side, as it bridges network connections together in Windows, creating a new bridge NIC with a new MAC address that ESXi didn’t know about. I had noticed that whenever I was backing up VMs from my ESXi box over the network to a USB 3 drive on a separate computer, the Windows VM running the OpenVPN server would go to almost 100% CPU and become very sluggish. Of course this was because it was seeing about 700mbps worth of extraneous packets while I was backing up other VMs. Your fling solved this problem quite nicely!

However, it would probably be a better fix, security-wise, in the long run if the ESXi developers would give users a way to add additional MAC addresses to a virtual network adapter through the VMX configuration file, as that way promiscuous mode wouldn’t be necessary in the first place unless the VM was truly intended to be authorized to monitor traffic from other VMs on the vSwitch.

William, it may be helpful to clarify that even with the dvfilter and VMX settings, you still need to enable Promiscuous mode and Forged Transmits on each portgroup… unless I messed up something else, that was the only way I got it to work…

Is this Fling and its related dvFilter VM settings strictly relevant to VM vNICs only? If I were to pass Virtual Functions of 1 or more 10GbE physical adapters through to nested ESXi instances for certain key functions (iSCSI… possibly vMotion) using SR-IOV, would the dvFilter settings be helpful? Or even possible? Since they’re fundamentally a hardware passthrough, I don’t know whether the filter Name and onFailure advanced config entries would be accepted upon changes being applied, and even if they were, I have no idea whether they’d provide any benefit or instead actually prove detrimental. Have you (or has anyone else reading this) tried this out? I’ll get to it myself in the next few days and will try to circle back to report my results/findings. In the meantime, any insight you might have from your own experience would be greatly appreciated. Thanks for your terrific blog; since first coming upon it a few days ago it’s quickly become one of my main “go to” sources for learning vSphere’s innards!

Yes, this is only for a VM’s vNIC. I don’t believe a Virtual Function would show up the same way in the VMX param (e.g. ethernetX), but you could give it a shot and see if it works based on the key that shows up, and replace the filter name? I’ve not had anyone ask about this, so I can’t comment any further.

Thank you, William! Yeah, I have a tendency to imagine my way into some unusual corner cases with my experiments. In any case, I don’t think an SR-IOV virtual function of my particular 10GbE hardware (Intel X552 on a Xeon D based board) would work if passed to a nested ESXi instance anyway. I looked through the PCI ID adjustments the support VIB would make and there seems to only be driver support for the physical function PCI ID’s, not the 8086:15a8 that I think my VF’s show up as. So my experiment wouldn’t have been possible anyway. But thank you for the clarity you provided; you have far more experience than I have and I’m grateful for your time.

I understand the problem with Promiscuous Mode and understand this fling helps the vSwitch to learn, but how does this improve performance when Promiscuous Mode is still required? I couldn’t find an explanation of how it works!

ONLY FOR VMKs?? Why? Surely all of the nested ESXi hosts’ vNICs are connected to a vSwitch?? This makes ZERO sense; would you mind clarifying? And please also mention a way to remove the installed VIBs and identify which exact ones get installed.

I understand that enabling Promiscuous Mode on vSwitches/portgroups causes performance overhead.
I don’t understand how this dvfilter-maclearn solves the Promiscuous Mode performance problem.
It’s likely that I am not reading it correctly… or maybe it wasn’t clear.
Are you saying that this dvfilter-maclearn will allow Promiscuous Mode to be enabled at the vSwitchPortID level for each vNIC, and will not forward to all other vSwitchPorts within the same portgroup?

The other question I have is… from your other article I read: http://www.virtuallyghetto.com/2011/05/how-to-query-for-macs-on-internal.html
Based on that article, the vSwitch (ESXi) will maintain the MAC Address Table for forwarding purposes.
As per the screenshots within that article, my understanding is that the vSwitch will automatically create and allocate a separate virtual port for each MAC Address coming from a source VM (& its Guest OS) (especially in a nested environment).
Can you confirm this is correct? (The vSwitch will automatically create and allocate a separate virtual port for each MAC Address coming from a source VM.)
If true, that’s different from a physical switch’s MAC Address Table.
If true, why does it work this way? What’s the underlying reason VMware programmed it this way?

Author

William Lam is a Staff Solutions Architect working in the VMware Cloud on AWS team within the Cloud Platform Business Unit (CPBU) at VMware. He focuses on Automation, Integration and Operation of the VMware Software Defined Datacenter (SDDC).