Speeding up Open vSwitch with partial hardware offloading

Open vSwitch (OVS) can use the kernel datapath or the userspace datapath. There are interesting developments in the kernel datapath using hardware offloading through the TC Flower packet classifier, but in this article, the focus will be on the userspace datapath accelerated with the Data Plane Development Kit (DPDK) and its new feature—partial flow hardware offloading—to accelerate the virtual switch even more.

This article explains how the virtual switch worked before, how it works now, and why the new feature can potentially save resources while improving the packet processing rate.

DPDK-accelerated OVS with and without flow hardware offloading

Let’s start by reviewing how DPDK-accelerated OVS works without flow hardware offloading. One or more userspace threads are responsible for constantly polling the network card for new packets, classifying them, and executing the respective actions. The demand for higher speeds never stops, so every one of these stages needs to be as fast as possible.
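To make the poll, classify, and execute cycle concrete, here is a minimal Python sketch of one iteration of such a loop. It is only an illustration, not OVS or DPDK code: the names `rx_burst`, `poll_once`, and `BURST_SIZE`, and the dictionary-based packets, are assumptions made for this example.

```python
from collections import deque

BURST_SIZE = 32  # hypothetical batch size, chosen only for illustration


def rx_burst(queue, max_packets=BURST_SIZE):
    """Fetch up to max_packets from a receive queue without blocking."""
    batch = []
    while queue and len(batch) < max_packets:
        batch.append(queue.popleft())
    return batch


def poll_once(rx_queue, classify, execute):
    """Run one loop iteration: fetch a batch, classify each packet, act on it."""
    batch = rx_burst(rx_queue)
    for packet in batch:
        actions = classify(packet)  # find the flow rule's actions
        execute(packet, actions)    # apply them (forward, drop, ...)
    return len(batch)


# Toy usage: three packets, a trivial classifier, and a recording executor.
rx_queue = deque({"dst_port": port} for port in (80, 443, 80))
sent = []


def classify(packet):
    return "output:1" if packet["dst_port"] == 80 else "drop"


def execute(packet, actions):
    sent.append((packet["dst_port"], actions))


processed = poll_once(rx_queue, classify, execute)
```

A real polling thread repeats this iteration forever, even when the queue is empty, which is why it keeps a processor core busy regardless of the traffic load.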

DPDK provides optimized methods to query for new packets, fetch them, and send them out if needed. This is the I/O part. Next comes packet classification, which comprises three stages in sequence.

The first stage a received packet goes through is called the EMC (Exact Match Cache). It is the fastest mechanism, as you would expect, but it also has limitations. The basic idea is to calculate a value (a hash) that is specific to a packet and use that value to search the cache for the flow rule containing the actions to be executed.

However, computing that hash value for each packet is expensive, so here comes the first example of hardware offloading, provided the network card supports it, which most do nowadays. Since version 2.5.0, OVS-DPDK uses the RSS hash provided by the network card to search for the flow in the cache. That leaves extra cycles for the next packets!

As mentioned above, however, the cache has its limitations. One is hash collisions: handling them requires parsing the packet headers to make sure the lookup finds the correct flow. The cache also can’t be too big or too small, so depending on the use case and traffic pattern, it might not be very efficient. There have been improvements in this area, for example, the “Conditional EMC Insert,” but that is a topic for another article.
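The lookup and its collision problem can be sketched in a few lines of Python. This is a simplified model, not the OVS implementation: the `FlowKey` fields, the `ExactMatchCache` class, the slot count, and the action strings are all assumptions chosen for illustration. The hash is passed in as an argument to mirror the case where the network card has already computed it.

```python
from typing import NamedTuple


class FlowKey(NamedTuple):
    """Header fields identifying a flow (a simplified, hypothetical set)."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    proto: int


class ExactMatchCache:
    """Fixed-size cache indexed by a packet hash (illustrative only)."""

    def __init__(self, num_slots=1024):  # arbitrary size for the example
        self.num_slots = num_slots
        self.slots = [None] * num_slots  # each slot: (key, actions) or None

    def insert(self, rss_hash, key, actions):
        # A new entry simply overwrites whatever occupied the slot.
        self.slots[rss_hash % self.num_slots] = (key, actions)

    def lookup(self, rss_hash, key):
        entry = self.slots[rss_hash % self.num_slots]
        if entry is None:
            return None  # miss: fall through to the next stage
        cached_key, actions = entry
        # Different flows can share a hash, so the parsed headers
        # must still be compared before trusting the entry.
        return actions if cached_key == key else None


emc = ExactMatchCache(num_slots=4)
flow_a = FlowKey("10.0.0.1", "10.0.0.2", 1000, 80, 6)
flow_b = FlowKey("10.0.0.3", "10.0.0.4", 2000, 443, 6)

emc.insert(0x1234, flow_a, "output:2")
hit = emc.lookup(0x1234, flow_a)        # same flow, same hash
collision = emc.lookup(0x1234, flow_b)  # different flow, same hash
```

The key comparison in `lookup` is the cost the hash alone cannot remove: even with a hash supplied by the hardware, the headers have to be parsed to build `key`.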

The ultimate goal for OVS-DPDK today is to push all the per-packet processing work (matching the packets to a specific flow rule and executing the corresponding actions) to the network cards. That would free system resources such as the main processors and memory to do other work and improve the packet processing speed, while the virtual switch would remain responsible for managing the cards and related tasks, for example, providing flow statistics. That’s called Flow Hardware Offload, and it is not available yet. But since OVS 2.10, experimental partial hardware offloading has been available. It is disabled by default, and for now, it is limited to certain network cards and flows.

The idea with the experimental partial hardware offloading is that OVS-DPDK pushes flow rules along with unique marks to the network card, and the card matches packets belonging to each flow rule and marks them accordingly. The virtual switch then uses each unique mark to find the specific flow rule and executes the necessary actions in software. Although it seems a lot like the EMC described above, in this case, some expensive tasks are executed in the network card. For example, the virtual switch does not need to parse all the packet headers as it did before, because the mark is guaranteed to be unique, and it avoids falling back to another level of cache in software when the number of flows is higher than the EMC can handle.
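The mark-based path can be sketched as follows. This is a toy model under stated assumptions, not the OVS or DPDK API: the `MarkOffload` class and its methods are hypothetical, the network card is simulated by a dictionary, and rules match on an exact tuple of header fields, whereas real hardware rules are more flexible.

```python
class MarkOffload:
    """Toy model of partial offload: the NIC marks, software acts."""

    def __init__(self):
        self.next_mark = 1
        self.mark_to_actions = {}  # software side: mark -> actions
        self.nic_rules = {}        # stand-in for the NIC: match -> mark

    def offload_flow(self, match, actions):
        """Assign a unique mark to a flow and push the rule to the 'NIC'."""
        if match in self.nic_rules:
            return self.nic_rules[match]  # already offloaded
        mark = self.next_mark
        self.next_mark += 1
        self.mark_to_actions[mark] = actions
        self.nic_rules[match] = mark
        return mark

    def nic_receive(self, headers):
        """Simulate the hardware: return the flow's mark, or None."""
        return self.nic_rules.get(headers)

    def classify(self, mark):
        """Software side: a direct lookup by mark, with no header parsing."""
        return self.mark_to_actions.get(mark)


switch = MarkOffload()
web = ("10.0.0.1", "10.0.0.2", 80)
dns = ("10.0.0.1", "10.0.0.9", 53)
web_mark = switch.offload_flow(web, "output:2")
dns_mark = switch.offload_flow(dns, "output:3")

# Packet arrives: the 'hardware' attaches the mark, software maps it to actions.
actions = switch.classify(switch.nic_receive(web))
# A packet with no offloaded rule carries no mark and takes the normal path.
unmarked = switch.nic_receive(("10.0.0.7", "10.0.0.8", 22))
```

Because every mark maps to exactly one flow, `classify` needs no key comparison, in contrast with a hash-indexed cache where collisions must be ruled out.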

In summary, OVS-DPDK leverages the network card’s hardware support for the flow MARK action to skip some very costly CPU operations on the host. This way, OVS-DPDK can process even more packets or potentially reduce the number of processors bogged down with networking operations.