Open vSwitch – Red Hat Developer Blog

By Flavio Bruno Leitner

Open vSwitch (OVS) can use the kernel datapath or the userspace datapath. There are interesting developments in the kernel datapath using hardware offloading through the TC Flower packet classifier, but in this article, the focus will be on the userspace datapath accelerated with the Data Plane Development Kit (DPDK) and its new feature—partial flow hardware offloading—to accelerate the virtual switch even more.

This article explains how the virtual switch worked before and how it works now, and why the new feature can potentially save resources while improving the packet processing rate.

DPDK-accelerated OVS with and without flow hardware offloading

Let’s start by reviewing how DPDK-accelerated OVS works without flow hardware offloading. One or more userspace threads are responsible for constantly polling the network card for new packets, classifying them, and executing the respective actions. The demand for higher speeds never stops, and in order to be faster, each stage needs to do its part.

DPDK provides optimized methods to query for new packets, fetch any that arrived, and send them out if needed. That’s the I/O part. Next comes packet classification, which comprises three stages in sequence.

The first stage, used when a packet is received, is called the EMC (Exact Match Cache). It is the fastest mechanism, as you would expect, but it also has limitations. The basic idea is to calculate a value (hash) that is specific to a packet and, with that value, search the cache for the flow rule that contains the actions to be executed.

However, it is an expensive task to compute that hash value for each packet, and here comes the first example of hardware offloading, provided the network card supports it, which most do nowadays. Since version 2.5.0, OVS-DPDK has used the RSS hash provided by the network card to search for the flow in the cache. Now we have extra cycles to get to the next packets!

As said above, however, the cache has its limitations, such as dealing with hash collisions, which requires parsing the packet headers to make sure the correct flow is found. The cache also can’t be too big or too small, so depending on the use case and traffic pattern, it might not be very efficient. There have been improvements in this area, for example, the “Conditional EMC Insert,” but that is a topic for another article.

The ultimate goal for OVS-DPDK today is to push all the per-packet processing work (matching the packets to a specific flow rule and executing the corresponding actions) to the network cards. That would free system resources such as the main processors and memory for other work and improve the packet processing speed, while the virtual switch would be responsible for managing the cards and related tasks, for example, providing flow statistics. That’s called Flow Hardware Offload, and it is not there yet. But since OVS 2.10, experimental partial hardware offloading has been available. It is disabled by default, and for now, it is limited to certain network cards and flows.
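If you want to experiment with it, the feature is toggled through the usual OVSDB configuration knob; a minimal sketch (the restart step and service name depend on your distribution):

$ ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
$ systemctl restart openvswitch    # ovs-vswitchd must be restarted to pick this up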

The idea with the experimental partial hardware offloading is that OVS-DPDK pushes flow rules along with unique marks to the network card, and the card matches packets belonging to each flow rule and marks them accordingly. The virtual switch then uses each unique mark to find the specific flow rule and executes the necessary actions in software. Although it seems a lot like the EMC described above, in this case some expensive tasks are executed in the network card. For example, the virtual switch does not need to parse all the packet headers as it did before, because the mark is guaranteed to be unique, nor does it need to fall back to another, slower level of software cache when the number of flows is higher than the EMC can handle.

In summary, OVS-DPDK leverages the network card’s support for the flow MARK action to skip some very costly CPU operations on the host. This way, OVS-DPDK can process even more packets or potentially reduce the number of processors bogged down with networking operations.

By Mark Michelson

OVN (Open Virtual Network) is a subcomponent of Open vSwitch (OVS). It allows for the expression of overlay networks by connecting logical routers and logical switches. Cloud providers and cloud management systems have been using OVS for many years as a performant method for creating and managing overlay networks.

Lately, OVN has come into its own because it is being used more in Red Hat products. The result has been increased scrutiny of OVN in real-world scenarios. This has resulted in new features being added to OVN and, more importantly, has led to tremendous changes to improve performance.

In this article, I will discuss two game-changing performance improvements that have been added to OVN in the past year, and I will discuss future changes that we may see in the coming year.

Recent improvements

ovn-nbctl daemonization

One of our first performance targets was to determine the feasibility of supporting clusters the size of OpenShift Online. Based on some expected numbers, we set up tests where we would build the cluster to the expected size and then simulate the creation and deletion of pods at scale to see how OVN performed. Based on our initial testing, we were not very happy with the results. As the scale grew, making changes to the cluster took longer and longer.

After isolating components and profiling them, we finally had a working theory on what was causing the problem: ovn-nbctl. ovn-nbctl is a command-line utility that allows for interaction with the OVN northbound database. This command is the main mechanism by which OpenShift builds its overlay network. I was able to create a shell script that mimicked the setup from our scale tests but that focused solely on the ovn-nbctl calls. Here is that script:
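(Reproduced here as a sketch: the switch, port, and per-iteration call counts follow the description below, while the addresses, names, and exact ACL matches are illustrative.)

#!/bin/bash
# Sketch of the scale test; addresses, names, and ACL matches are illustrative.
ovn-nbctl lr-add lr0
for sw in $(seq 0 158); do
    ovn-nbctl ls-add sw$sw
    ovn-nbctl lrp-add lr0 lr0-sw$sw 00:00:00:00:$(printf %02x $sw):01 10.$sw.0.1/24
    ovn-nbctl lsp-add sw$sw sw$sw-lr0 -- lsp-set-type sw$sw-lr0 router \
        -- lsp-set-options sw$sw-lr0 router-port=lr0-sw$sw
done

for sw in $(seq 0 158); do
    for p in $(seq 0 91); do
        port=sw$sw-p$p
        ip=10.$sw.0.$((p + 10))
        as=as_${sw}_$((p / 2))
        # Five ovn-nbctl invocations per simulated pod:
        ovn-nbctl lsp-add sw$sw $port
        ovn-nbctl lsp-set-addresses $port "0a:00:00:$(printf %02x $sw):$(printf %02x $p):01 $ip"
        if [ $((p % 2)) -eq 0 ]; then
            ovn-nbctl create address_set name=$as addresses=$ip
        else
            ovn-nbctl add address_set $as addresses $ip
        fi
        ovn-nbctl acl-add sw$sw to-lport 1000 "outport == \"$port\" && ip4.src == \$$as" allow
        ovn-nbctl acl-add sw$sw to-lport 900 "outport == \"$port\" && ip4" drop
    done
done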

The first loop in the script creates 159 logical switches and connects all of them to a logical router. The next set of nested loops simulates the operations performed when adding a pod to an OpenShift cluster:

A switch port is added to one of the switches.

The switch port’s address is added to an address set. Each address set consists of two addresses. Therefore, alternating runs through the loop will either create a new address set or add the switch port’s address to the address set created during the previous loop iteration.

ACLs are created for the new port. One ACL allows traffic to the port from other addresses in its address set. The other drops all other traffic.

The loops result in 159 switches with 92 ports, totaling 14,628 logical switch ports. Each iteration of the loop calls ovn-nbctl five times, meaning there are a total of 73,460 invocations of ovn-nbctl.

When we run the script, this is the result:

$ time ./scale.sh
real 759m27.270s
user 500m40.805s
sys 36m5.682s

It’s hard to draw conclusions from that time alone, but I think it’s fair to say that taking over 12 hours to complete is not good. So let’s have a look at how long each iteration of the inner loop takes.

As you can see in the graph, as the test continues, the amount of time it takes to complete a loop iteration increases. Towards the end, an iteration takes over seven seconds to complete! Why is this?

To understand the issue, let’s start by taking a closer look at how OVSDB clients and servers work. OVSDB clients and servers communicate using JSONRPC. To prevent the need for raw JSONRPC from being embedded within client code, OVS provides a C-based IDL (interface definition language). The IDL has two responsibilities. First, at compile time, the IDL reads the schema for the databases and generates native C code to allow for programs to read and manipulate the database data in a type-safe way. Second, at runtime, the IDL acts as a translator between the C structures and JSONRPC.

A typical OVSDB client starts by determining which database it is interested in, which tables in that database it is interested in, and which columns’ values within those tables it is interested in. The client formulates a request to the server asking for the current values of those databases, tables, and columns. The server responds with the current values encoded as JSON. The client’s IDL then translates the JSON into native C structures. The client code can now read the contents of the database by examining C structures. If the client wants to make a modification to the database, it can make its own C data and pass it to the IDL for processing. The IDL then converts this to a JSONRPC database transaction to send to the server.

The IDL also aids the database client for messages originating from the server. If the data in the database changes, the server can send an update JSONRPC message to the client. The IDL then uses this JSONRPC to modify the C data with the updated information.

This works well for long-running OVSDB clients. After the initial dump of the current database, interaction between the client and server happens in small chunks. The problem with ovn-nbctl is that it is a short-lived process. ovn-nbctl starts up, requests all data from the northbound database, processes that JSON data into C data, and then usually performs a single operation before exiting. Through our profiling, what we found was that the majority of time was spent by ovn-nbctl processing the initial database dump at startup. This makes sense when you consider the amount of data being processed as the test reaches its conclusion.

The solution we created was to make ovn-nbctl have the option of being a long-running process. ovn-nbctl has been outfitted with an option to allow the OVSDB client portion of it to run in the background continuously. Further invocations of ovn-nbctl pass the command to the daemon process. The daemon process then passes the result of the command back to the CLI. By doing it this way, the OVSDB client only requires a single database dump from the server, followed by gradual updates of the content. ovn-nbctl processes can run much faster since they no longer require a dump of the entire database every time.

The solution was initially implemented by Jakub Sitnicki of Red Hat and then improved upon by Ben Pfaff of VMware.

The actual mechanism for running ovn-nbctl as a daemon is simple. You can run the following command:

$ export OVN_NB_DAEMON=$(ovn-nbctl --pidfile --detach)

By setting the OVN_NB_DAEMON variable, any further calls to ovn-nbctl will connect to the daemon process.
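For example, a session might look like this (a sketch; stopping the daemon through the control socket path stored in OVN_NB_DAEMON is one option, and the details vary by version):

$ export OVN_NB_DAEMON=$(ovn-nbctl --pidfile --detach)
$ ovn-nbctl ls-add sw0                # served by the long-running daemon
$ ovn-nbctl ls-list
$ ovs-appctl -t $OVN_NB_DAEMON exit   # stop the daemon when done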

With this improvement in place, we modified the previous script to have the daemonization line at the beginning. Running the modified script results in the following:

That is considerably faster! It’s over 99% faster, in fact. Here is a graph of each loop iteration.

That is a lot flatter than what we previously saw. There is still a slight increase over time, due to certain ovn-nbctl commands becoming more complex as the total data size grows, but the scale of the growth is much smaller than in the previous run.

Port groups

Another bottleneck seen in OVN testing was a tremendous slowdown when ACLs were heavily used. Here’s a script that illustrates the issue well:
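(Reproduced as a sketch: names and exact addresses are illustrative, with the address range sized to match the description that follows.)

#!/bin/bash
# Sketch of the ACL scale test; names and exact addresses are illustrative.
NUM_PORTS=${1:-100}

addrs=""
for net in 1 2 3 4; do
    for host in $(seq 0 254); do
        addrs="$addrs,10.0.$net.$host"
    done
done
ovn-nbctl create address_set name=set1 addresses="${addrs#,}"

ovn-nbctl ls-add ls0
for i in $(seq 1 $NUM_PORTS); do
    ovn-nbctl --wait=hv lsp-add ls0 lsp$i
    ovn-nbctl acl-add ls0 to-lport 1000 "outport == \"lsp$i\" && ip4.src == \$set1" allow
    ovs-vsctl add-port br-int lsp$i -- set Interface lsp$i external_ids:iface-id=lsp$i
done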

Let’s discuss what’s happening in the script. First, an address set called set1 is created that has 1,020 IP addresses in it (10.0.1.0 through 10.0.4.255). We then create a logical switch ls0 and add NUM_PORTS ports to it. NUM_PORTS defaults to 100, but this can be changed to any value by passing an argument to the script. In our testing, we use 1000 for NUM_PORTS. In addition to creating each port, we also add a new ACL that allows traffic to go to this new port from any of the addresses in our address set. There are two other important lines here as well:

The ovn-nbctl lsp-add command contains --wait=hv. This means that the command will block until it has been processed by ovn-controller processes running on all hypervisors. In our case, we are running in the OVS sandbox, so there is only one hypervisor. That means just waiting for a single ovn-controller to finish.

There is an ovs-vsctl add-port command to bind the logical switch port to the OVS br-int bridge. This makes it so that OpenFlow can be generated by ovn-controller for this particular logical switch port.

This particular script is not far-fetched. The number of addresses in the address set may be larger than a typical deployment, but this pattern is commonly used by deployments when using ACLs. They create ACLs that are mostly identical aside from the logical switch port that the ACL applies to.

Let’s see what happens when we run this script and create 1,000 logical switch ports.

The script takes over two hours to complete. Like before, it’s hard to gauge anything based on a time alone. Let’s look at what happens when we time each iteration in the loop.

Like with the ovn-nbctl issue before, we can see that the time increases steadily as the test goes on. As the network gets more ACLs, it takes longer for the loop to complete. Why is this?

The answer has to do with the way that ovn-controller generates OpenFlow. Despite the fact that our ACLs reference a 1,020-member address set as a single unit, that does not translate directly into OpenFlow. Instead, the address set has to be expanded into each individual address, and each individual address is evaluated in a separate flow. When we add our first port, ovn-controller generates OpenFlow similar to the following for our ACL in OpenFlow table 44:
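The flows looked roughly like this (a sketch with cookies and counters trimmed; one flow per address in set1):

table=44, priority=2002,ip,reg15=0x1,metadata=0x1,nw_src=10.0.1.0 actions=resubmit(,45)
table=44, priority=2002,ip,reg15=0x1,metadata=0x1,nw_src=10.0.1.1 actions=resubmit(,45)
...
table=44, priority=2002,ip,reg15=0x1,metadata=0x1,nw_src=10.0.4.254 actions=resubmit(,45)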

Table 44 is in the logical switch egress pipeline, and it is where flows pertaining to to-lport ACLs are written.

metadata is an identifier for the datapath of the packet. In this case, 0x1 refers to our logical switch.

reg15 is an OpenFlow register that stores the output port number. In this case, 0x1 refers to the first logical switch port we have added. The flow checks the output port because our ACL was a to-lport ACL. If it had been a from-lport ACL, then we would have checked the input port instead, and we would do it in a different OpenFlow table.

nw_src is the network layer source address.

The resubmit(,45) action allows for processing to continue at OpenFlow table 45. The action is “resubmit” in this case because our ACL had an “allow” action on it. If it had been “drop,” then the flow would have actions=drop instead.

So in this case, we’ve created 1,020 flows, one for each address in our address set. Now let’s see what happens when we add our second port:
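Roughly (trimmed as before):

table=44, priority=2002,ip,reg15=0x2,metadata=0x1,nw_src=10.0.1.0 actions=resubmit(,45)
table=44, priority=2002,ip,reg15=0x2,metadata=0x1,nw_src=10.0.1.1 actions=resubmit(,45)
...
table=44, priority=2002,ip,reg15=0x2,metadata=0x1,nw_src=10.0.4.254 actions=resubmit(,45)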

We now have added an identical set of flows, except that reg15 matches 0x2 instead of 0x1. In other words, we’ve created another 1,020 flows for the new port we added. You may notice a pattern forming. The number of flows generated in table 44 is approximately the number of ports multiplied by the number of addresses in the address set. Let’s count the total number of flows in our table after completing the script:
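In the sandbox, ovs-ofctl can do the counting (the exact totals depend on the run):

$ ovs-ofctl dump-flows br-int | wc -l
$ ovs-ofctl dump-flows br-int table=44 | wc -l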

There are over a million flows, and table 44 accounts for 99% of the total. Bear in mind that in our script, we have --wait=hv present when adding our logical switch port. This means that we have to wait for ovn-controller to generate every flow every time that we add a new switch port. Since the table is growing by 1,020+ flows every iteration, it eventually starts to take tens of seconds to add a single port.

The solution to this issue is to try to minimize the number of flows generated. There’s actually a built-in construct in OVS’s OpenFlow implementation called a conjunctive match. A conjunctive match allows for OpenFlow rules that exist in the same table and that have the same resulting action to be combined into a more compact form. If these rules have similar match criteria, then a list of flows pertaining to the first part of the common match criteria can be made, followed by a list of flows pertaining to the second part, and so on until all parts are matched.

Consider the previous set of flows. All of the flows exist in table 44 and have the same action of resubmitting to table 45. All of the flows match first on a port number, and then they all match on a set of IP addresses. What we want is a conjunctive match of two parts. The first part will match on every valid port, and the second part will match on every IP address in our address set. Each time we add a new port, the first part of the conjunctive match will have a new flow added to it, but the second part will remain the same. By doing this, the number of flows in table 44 could be approximated by the number of ports plus the number of addresses in the address set.

To get a conjunctive match generated, several ideas were presented. The one that won out was for the addition of a feature called port groups. A port group is a simple construct that allows for a number of logical switch ports to be referred to by a single collective name. References to port groups can be made in ACLs in place of where you would normally refer to a single port. Port groups have other nice features, but those lie outside the scope of this particular article.

Port groups were initially implemented by Han Zhou of eBay, but loads of other contributors have fleshed out the feature and improved it since.

Now let’s rewrite the script using port groups. The major difference will be that instead of defining new ACLs for every port we add, we will define a single ACL that refers to a port group. As we create new logical switch ports, we will add the port to the port group. Having all ports and IP addresses expressed in a single ACL should allow for the conjunctive match to be created. Here is the resulting script:
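(Again as a sketch with the same illustrative names; here pg-set-ports is called with the accumulated port list.)

#!/bin/bash
# Sketch of the port group variant; names and addresses are illustrative.
NUM_PORTS=${1:-100}

addrs=""
for net in 1 2 3 4; do
    for host in $(seq 0 254); do
        addrs="$addrs,10.0.$net.$host"
    done
done
ovn-nbctl create address_set name=set1 addresses="${addrs#,}"

ovn-nbctl ls-add ls0
ovn-nbctl pg-add pg1
# A single ACL covers every port in the group.
ovn-nbctl acl-add pg1 to-lport 1000 'outport == @pg1 && ip4.src == $set1' allow

ports=""
for i in $(seq 1 $NUM_PORTS); do
    ovn-nbctl --wait=hv lsp-add ls0 lsp$i
    ports="$ports lsp$i"
    ovn-nbctl pg-set-ports pg1 $ports
    ovs-vsctl add-port br-int lsp$i -- set Interface lsp$i external_ids:iface-id=lsp$i
done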

As you can see, the construction is mostly similar. Notice that we refer to port group pg1 using the @ sign when defining our ACL. This lets OVN know that we are referring to a port group and not a single port in our ACL. Let’s see what happens when we run this script and create 1,000 logical switch ports.

Now the script takes only about 6 minutes to complete. That’s a 95% improvement! Here is a graph of each iteration of the loop.

The time is still increasing on each iteration, but if you look at the scale on the y-axis, you can see the times are much lower. To put it in perspective, these are the two graphs superimposed on each other.

You can see the blue bars starting to appear in the bottom right of the graph if you squint just so…

Let’s take a look at the flows generated at each step. After adding one switch port, the OpenFlow looks like this:
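An approximation of those flows (trimmed as before):

table=44, priority=2002,conj_id=2,ip,metadata=0x1 actions=resubmit(,45)
table=44, priority=2002,ip,metadata=0x1,nw_src=10.0.1.0 actions=conjunction(2,1/2)
table=44, priority=2002,ip,metadata=0x1,nw_src=10.0.1.1 actions=conjunction(2,1/2)
...
table=44, priority=2002,ip,metadata=0x1,nw_src=10.0.4.254 actions=conjunction(2,1/2)
table=44, priority=2002,reg15=0x1,metadata=0x1 actions=conjunction(2,2/2)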

And there’s the conjunctive match we were after! The top line creates the conjunctive match. The ID of the conjunctive match is 2, and it states that if all requirements of the conjunctive match are met, then the action is to resubmit to table 45. The rest of the lines define the requirements of the conjunctive match with their conjunction actions. The 2 before the comma indicates they are a requirement for the conjunctive match with ID 2. The numbers after the comma indicate the requirement number and total number of requirements of the conjunctive match. All of the flows that end with 1/2 in their conjunction action pertain to the source IP address. All flows with 2/2 in their conjunction action pertain to the output logical switch port. The conjunctive match results in the same logical set of actions to take but expresses it in a more compact way.

After loading our network up with 1,000 switch ports, let’s examine the total number of OpenFlow flows generated.

That is three orders of magnitude fewer flows than before. Now table 44 accounts for only 22% of the total flows. This makes a world of difference in the time it takes for ovn-controller to generate the flows.

Mixing the optimizations

One thing you may have noticed in our ACL generation script is that we did not use a daemonized ovn-nbctl. We discussed in the first section how great an optimization that is, so let’s see how that makes a difference here.

There’s a 31% speedup, but it’s not due to the same reasons we saw with conjunctive matches earlier. Since there is no --wait=hv in the script, we are not waiting for ovn-controller to generate flows. The big reason for the speedup is because using port groups requires fewer calls to ovn-nbctl than we previously used.

Future improvements

Incremental processing

One thing you may have noticed from the port groups section was the fact that ovn-controller has to generate the entire OpenFlow table every time it reads the contents of the southbound database. ovn-northd works similarly: it always reads the entire northbound database in order to generate a complete new set of southbound data. Operating this way has some distinct advantages.

The code is easy to reason about.

The code is resilient in the case of temporary connectivity failures.

What overshadows these advantages is that this method is slow, computationally expensive, and it gets worse as the size of the dataset increases. A huge performance boost can be gained by processing the changes to the database rather than all content in the database.

This is a difficult thing to do in practice. There have been multiple attempts to refactor ovn-controller to process results incrementally. The most common issue with these implementations is increased difficulty in maintaining the code. A common problem seen throughout all attempts is that C doesn’t provide the easiest way of implementing an incremental processing engine.

Having examined the problems with past attempts and understanding the best way to move forward, engineers at VMware have started an effort to rewrite portions of OVN in a different language than C. They have created a language called Differential Datalog, commonly abbreviated DDlog. DDlog at its core is an incremental database processing engine. This is exactly what is needed in order to get more performant processing. Ben Pfaff sent a good e-mail to the ovs-discuss mailing list with a summary of the project.

So what sort of benefits can we expect from incremental processing? Han Zhou of eBay put together a proof-of-concept C version of incremental processing for ovn-controller. In tests run by him and me, we found around a 90% reduction in CPU usage by ovn-controller. We also found about a 90% speedup in ovn-controller‘s general operation. Unlike the optimizations discussed in the previous sections, this doesn’t apply to specific use cases; rather, it provides an optimization for ALL operations by ovn-controller. DDlog’s experimental results on ovn-northd show similar improvements. A presentation by Leonid Ryzhyk provides some graphs illustrating the immense speedup of DDlog’s incremental computation over the current C implementation at all cluster sizes.

The conversion to DDlog is a work in progress. The intent is to have the DDlog implementation of ovn-northd finished and integrated into OVN by the release of version 2.11. Once the implementation drops, we encourage everyone to deploy it and tell us about the performance improvements you see. For those of you who are more hesitant about deploying a rewritten OVN component in your live environments, don’t worry! The C implementation of ovn-northd will still be present, and you can choose to continue using it instead. But you’ll be missing out on the amazing performance improvements of the DDlog implementation.

Other future improvements

Incremental processing is the foundation on which all future improvements are based. However, even with incremental processing, there are some future smaller improvements that we can visualize. Here is a brief list of possible improvements.

Incremental flow processing

ovn-controller currently creates a collection of all desired flows that it wants to install in OVS. It also maintains a collection of all flows currently installed in OVS. On each iteration, ovn-controller must determine the difference between the two collections and then send appropriate modification messages to OVS. With incremental processing in place, it naturally follows that we can incrementally calculate the flows to install as well. When testing Han Zhou’s C implementation of incremental processing, the comparison between desired and installed flows was the new top user of CPU time.

Pass incremental changes directly

Once incremental processing is put in place, ovn-northd will calculate some change to make to the southbound database and make that change. ovn-controller will take the southbound database contents, determine what has changed, and act on those incremental changes. If ovn-northd has already calculated what the changes to the southbound database are, perhaps there could be a way to pass the changes directly between ovn-northd and ovn-controller. This could eliminate some repetitive processing in the two daemons. It’s also possible to save some hard drive space that the southbound database would take up.

Of the possible improvements that could be made, this likely would be the most difficult to get right. This is because multiple ovn-controller daemons connect to the OVN southbound database. Therefore, trying to calculate a universal delta of changes means that it’s easy for an ovn-controller to miss an update.

Even if this cannot be done across the board for all southbound data, perhaps something could be done specifically for the southbound Logical_Flow table. That table tends to grow larger than any other southbound table.

Better conjunctive match generation

We touched previously on how conjunctive matches greatly lessen the number of flows installed by ovn-controller. However, the expression parser in ovn-controller has difficulty generating conjunctive matches in some situations where it really should be able to. Fixing this requires more complex analysis of the resulting flows than currently exists in ovn-controller. If the expression parser could be made smarter, there might be further savings in the number of flows generated.

Centralize expression parsing

Currently, ovn-controller reads through logical flows from the southbound database and parses the expressions in order to generate flows for OVS. While there are some logical flows whose resulting parsed form will differ between hypervisors, most of the parsed expressions will be exactly the same no matter where they are parsed. Perhaps some computing power on the hypervisors could be saved by parsing expressions centrally instead.

Final thoughts

OVN development is entering an exciting time. Being able to improve performance on such a grand scale is a great sign that the software is maturing. With the introduction of incremental processing, I believe that control plane performance concerns will completely disappear. OVN will become a natural fit for anyone who wants to use OVS in their environment, adding very little overhead.

I suspect that as word gets out about the performance improvements, adoption of OVN will increase even more. Increased adoption means that in addition to no longer needing to focus on performance, we can expect to see a bevy of new features added to OVN in the near future.

If you are interested in contributing, I strongly encourage you to get involved now. This could be the beginning of a golden age of new OVN features to add.

Open Virtual Network (OVN) is a subproject of Open vSwitch (OVS), a performant, programmable, multi-platform virtual switch. OVN adds to OVS’s existing capabilities support for overlay networks by introducing virtual network abstractions such as virtual switches and routers. Moreover, OVN provides native methods for setting up Access Control Lists (ACLs) and network services such as DHCP. Many Red Hat products, such as Red Hat OpenStack Platform and Red Hat Virtualization, are now using OVN, and Red Hat OpenShift Container Platform will be using OVN soon.

In this article, I’ll cover how OVN ARP/ND_NS actions work, the main limitations in the current implementation, and how to overcome those. First, I’ll provide a brief overview of OVN’s architecture to facilitate the discussion:

OVN architecture

An OVN deployment consists of several components:

The OVN/CMS plugin (for example, Neutron) is the CMS interface component for OVN. The plugin’s main purpose is to translate the CMS’s notion of the logical network configuration into an intermediate representation composed of logical switches and routers that can be interpreted by OVN.

The OVN northbound database (NBDB) is an OVSDB instance responsible for storing the network representation received from the CMS plugin. The OVN northbound database has only two clients: the OVN/CMS plugin and the ovn-northd daemon.

The ovn-northd daemon connects to the OVN northbound database and to the OVN southbound database. It translates the logical network configuration in terms of conventional network concepts, taken from the OVN northbound database, into logical datapath flows in the OVN southbound database.

The OVN southbound database (SBDB) is also an OVSDB database, but it is characterized by a quite different schema with respect to the northbound database. In particular, instead of familiar networking concepts, the southbound database defines the network in terms of match-action rule collections called logical flows. The logical flows, while conceptually similar to OpenFlow flows, use logical concepts, such as virtual machine instances, instead of physical ones, such as physical Ethernet ports. Specifically, the southbound database includes three data types:

Physical network data, such as the VM’s IP address and tunnel encapsulation format

Logical network data, such as packet forwarding mode

The binding relationship between the physical network and logical network

L2 address resolution problem

In a typical OVN deployment, the overlay network is connected to an external network through a localnet port (ext-localnet, in this case).

Whenever a device belonging to the overlay network (for example, PC1) tries to reach an external device (for example, PC-EXT), it forwards the packet to the OVN logical router (LR0). If LR0 has not already resolved the L2/L3 address correspondence for PC-EXT, it will send an ARP frame (or a Neighbor Discovery message for IPv6 traffic) for PC-EXT. The current OVN implementation employs the ARP action to perform L2 address resolution. In other words, OVN will instruct OVS to perform a “packet in” action whenever it needs to forward an IP packet to an unknown L2 destination. The ARP action replaces the IPv4 packet being processed with an ARP frame that is forwarded on the external network to resolve the PC-EXT MAC address. The corresponding IPv4/IPv6 OVN SBDB rules look like the following:
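(Sketched here; table numbers and exact matches vary by OVN version.)

$ ovn-sbctl lflow-list lr0 | grep arp_request
  table=9 (lr_in_arp_request ), priority=100, match=(eth.dst == 00:00:00:00:00:00 && ip4), action=(arp { eth.dst = ff:ff:ff:ff:ff:ff; arp.spa = reg1; arp.tpa = reg0; arp.op = 1; output; };)
  table=9 (lr_in_arp_request ), priority=100, match=(eth.dst == 00:00:00:00:00:00 && ip6), action=(nd_ns { nd.target = xxreg0; output; };)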

The main drawback of the described processing is the loss of the first packet of each connection (as shown in the following ICMP traffic), which introduces latency in TCP connections established with devices not belonging to the overlay network:
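For example (addresses and timings are illustrative; note the missing icmp_seq=1):

$ ping -c 3 192.168.100.10
PING 192.168.100.10 (192.168.100.10) 56(84) bytes of data.
64 bytes from 192.168.100.10: icmp_seq=2 ttl=64 time=0.512 ms
64 bytes from 192.168.100.10: icmp_seq=3 ttl=64 time=0.487 ms

--- 192.168.100.10 ping statistics ---
3 packets transmitted, 2 received, 33% packet loss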

Proposed solution: Add buffering support for IP packets

In order to overcome this limitation, a solution adding buffering support for IP packets has been proposed: incoming IP frames that do not yet have a resolved L2 address are queued and re-injected into ovs-vswitchd as soon as the neighbor discovery process is completed.

Repeating the above tests proves that even the first ICMP echo request is received by PC-EXT:
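For example (same illustrative setup as above):

$ ping -c 3 192.168.100.10
PING 192.168.100.10 (192.168.100.10) 56(84) bytes of data.
64 bytes from 192.168.100.10: icmp_seq=1 ttl=64 time=1.042 ms
64 bytes from 192.168.100.10: icmp_seq=2 ttl=64 time=0.496 ms
64 bytes from 192.168.100.10: icmp_seq=3 ttl=64 time=0.479 ms

--- 192.168.100.10 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss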

Future development

A possible future enhancement to the described methodology could be to use the developed IP buffering infrastructure to queue packets waiting for given events and then send them back to ovs-vswitchd as soon as the requested message has been received. For example, we could rely on the IP buffering infrastructure to queue packets destined for an OpenShift pod that has not yet completed its bootstrap phase. Stay tuned!

By Matteo Croce

Introduction

Networks are fun to work with, but often they are also a source of trouble. Network troubleshooting can be difficult, and reproducing the bad behavior that is happening in the field can be painful as well.

Luckily, there are some tools that come to our aid: network namespaces, virtual machines, tc, and netfilter. Simple network setups can be reproduced with network namespaces and veth devices, while more-complex setups require interconnecting virtual machines with a software bridge and using standard networking tools, like iptables or tc, to simulate the bad behavior. If you have an issue with ICMP replies generated because an SSH server is down, iptables -A INPUT -p tcp --dport 22 -j REJECT --reject-with icmp-host-unreachable in the correct namespace or VM can do the trick.

This article describes using eBPF (extended BPF), an extended version of the Berkeley Packet Filter, to troubleshoot complex network issues. eBPF is a fairly new technology and the project is still in an early stage, with documentation and the SDK not yet ready. But that should improve, especially with XDP (eXpress Data Path) being shipped in Red Hat Enterprise Linux 8, which you can download and run now.


The problem

I was debugging an Open vSwitch (OVS) network issue affecting a very complex installation: some TCP packets were scrambled and delivered out of order, and the throughput between VMs was dropping from a sustained 6 Gb/s to an oscillating 2–4 Gb/s. After some analysis, it turned out that the first TCP packet of every connection with the PSH flag set was sent out of order: only the first one, and only one per connection.

I tried to replicate the setup with two VMs, and after many man pages and internet searches, I discovered that neither iptables nor nftables can mangle TCP flags, while tc can, but only by overwriting the flags, breaking new connections and TCP in general.

I probably could have dealt with it using a combination of iptables mark, conntrack, and tc, but then I thought: this could be a job for eBPF.

What is eBPF?

eBPF is an extended version of the Berkeley Packet Filter. It adds many improvements to BPF; most notably, it allows writing memory instead of just reading it, so it can also edit packets in addition to filtering them.

eBPF is often referred to simply as BPF, while the original BPF is referred to as cBPF (classic BPF), so the word BPF can represent both, depending on the context. Here, I’m always referring to the extended version.

Under the hood, eBPF uses a very simple bytecode VM that can execute small portions of bytecode and edit some in-memory buffers. eBPF comes with some limitations, to prevent it from being used maliciously:

Loops are forbidden, so the program will exit in a definite time.

It can’t access memory other than the stack and a scratch buffer.

Only kernel functions in a whitelist can be called.

An eBPF program can be loaded into the kernel in many ways, enabling a plethora of debugging and tracing use cases. In this case, we are interested in how eBPF works with the networking subsystem. There are two ways to use an eBPF program:

Attached via XDP to the very early RX path of a physical or virtual NIC

Attached via tc to a qdisc just like a normal action, in ingress or egress

In order to create an eBPF program to attach, it is enough to write some C code and convert it into bytecode.

As a simple XDP example, consider a program that changes the TTL of received ICMP echo replies, namely pongs, to a random number. Stripped of include statements, helpers, and all the unnecessary code, its main function receives a struct xdp_md, which contains two pointers to the packet start and end.

To compile our code into eBPF bytecode, a compiler with support for it is needed. Clang supports it and produces eBPF bytecode by specifying bpf as the target at compile time:

$ clang -O2 -target bpf -c xdp_manglepong.c -o xdp_manglepong.o

The command above produces a file that seems to be a regular object file, but if inspected, you’ll see that the reported machine type will be Linux eBPF rather than the native one of the OS:
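For example (output abridged; the exact machine string varies with the binutils version):

$ readelf -h xdp_manglepong.o | grep Machine
  Machine:                           Linux BPF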

Every received packet goes through the eBPF program, which may apply some transformation and then decides whether to drop the packet or let it pass.

How eBPF can help

Going back to the original network issue, I needed to mangle some TCP flags, only one packet per connection, and neither iptables nor tc allows doing that. Writing C code for this scenario would be very easy: set up two VMs linked by an OVS bridge and simply attach the eBPF program to one of the two VMs’ virtual devices.

This looks like a nice solution, but you must take into account that XDP only supports handling of received packets, and attaching eBPF in the rx path of the receiving VM will have no effect on the switch.

To properly address this, eBPF has to be loaded using tc and attached in the egress path within the VM, as tc can load and attach eBPF programs to a qdisc just like any other action. In order to mangle packets leaving the host, an egress qdisc is needed to attach eBPF to.
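Attaching the program with tc looks roughly like this (the device, object, and section names here are assumptions):

$ tc qdisc add dev vnet0 clsact
$ tc filter add dev vnet0 egress bpf da obj tc_manglepsh.o sec mangle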

There are small differences between the XDP and tc APIs when loading an eBPF program: the default section name differs, the argument of the main function has a different structure type, and the return values are different, but this is not a big issue. I wrote a similar small program that, when attached as a tc action, mangles the flags of roughly one TCP packet in ten.

tcpdump confirms that the new eBPF code is working: about 1 of every 10 TCP packets has the PSH flag set. With just 20 lines of C code, we selectively mangled the TCP packets leaving a VM, replicating an error that happened in the field, all without recompiling any driver and without even rebooting! This greatly simplified the validation of the Open vSwitch fix, in a manner that was impossible with other tools.

Conclusion

eBPF is a fairly new technology, and the community has strong opinions about its adoption. It’s also worth noting that eBPF-based projects like bpfilter are becoming more popular, and as a consequence, various hardware vendors are starting to implement eBPF support directly in their NICs.

While eBPF is not a silver bullet and should not be abused, I think it is a very powerful tool for network debugging and it deserves attention. I am sure it will play a really important role in the future of networks.


By Numan Siddique

In this article, I discuss external connectivity in Open Virtual Network (OVN), a subproject of Open vSwitch (OVS), using a distributed gateway router. OVN provides external connectivity in two ways:

A logical router with a distributed gateway port, which is referred to as a distributed gateway router in this article

A logical gateway router

In this article, I will cover the first option.

Setup details

Let’s first talk about the deployment details. I will take an example setup with five nodes: three controller nodes and two compute nodes. The tenant VMs are created on the compute nodes. The controller nodes run the OVN database servers in active/passive mode.

Note: The ovn-nbctl/ovn-sbctl commands below should be run on the node where the OVN database servers are running. Alternatively, you can pass the --db option with the IP address/port.

Chassis in OVN

In OVN terminology, each node is referred to as a chassis. A chassis is nothing but a node where the ovn-controller service is running. For a chassis to act as a gateway chassis, it should be capable of providing external (north/south) connectivity to the tenant traffic. It also requires the following configuration:

Configure ovn-bridge-mappings, which provides a list of key-value pairs that map a physical network name to a local OVS bridge that provides connectivity to that network.

ovs-vsctl set open . external-ids:ovn-bridge-mappings=provider:br-provider

Create the provider OVS bridge and add the interface that provides external connectivity to it:
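For example (eth1 here stands in for the NIC attached to the provider network):

$ ovs-vsctl add-br br-provider
$ ovs-vsctl add-port br-provider eth1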

Notice the network_name=provider. The network_name should match a name defined in ovn-bridge-mappings. When a localnet port is defined in a logical switch, the ovn-controller running on a gateway chassis creates an OVS patch port between the integration bridge and the provider bridge so that logical tenant traffic can leave for and enter from the physical network.
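For reference, the localnet port on the public logical switch is created along these lines (a sketch using the names from this article):

$ ovn-nbctl ls-add public
$ ovn-nbctl lsp-add public ext-localnet
$ ovn-nbctl lsp-set-type ext-localnet localnet
$ ovn-nbctl lsp-set-addresses ext-localnet unknown
$ ovn-nbctl lsp-set-options ext-localnet network_name=provider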

At this point, the tenant traffic from the logical switches sw0 and sw1 still cannot enter the public logical switch, since there is no association between it and the logical router lr0.

We still need to schedule the distributed gateway port lr0-public to a gateway chassis. What does scheduling mean here? It means that the chassis selected to host the gateway router port provides the centralized external connectivity. The north-south tenant traffic is redirected to this chassis, and it acts as a gateway. This chassis applies all the NAT rules before sending out the traffic via the patch port to the provider bridge. It also means that when someone pings 172.168.0.200 or sends an ARP request for 172.168.0.200, the gateway chassis hosting this port responds with the ping and ARP replies.

Scheduling the gateway router port

This can be done in two ways:

Non-high-availability (non-HA) mode: The gateway router port is configured to be scheduled on a single gateway chassis. If the gateway chassis hosting this port goes down for some reason, the external connectivity is completely broken until the CMS (cloud management system) detects this and reschedules it to another gateway chassis.

HA mode: The gateway router port is configured to be scheduled on a set of gateway chassis. The gateway chassis configured with a high priority claims the gateway router port. If this gateway chassis goes down for some reason, the next higher priority gateway chassis claims the gateway router port.

Scheduling in non-HA mode

Select a gateway chassis where you want to schedule the gateway router port. Let’s schedule on controller-0. There are two ways to do it. Run one of the following commands:
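For example (the priority value is illustrative; the second form is the older options-based method):

$ ovn-nbctl lrp-set-gateway-chassis lr0-public controller-0 20

or

$ ovn-nbctl set logical_router_port lr0-public options:redirect-chassis=controller-0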

You can always delete a gateway chassis’ association to the distributed router port by running the following command:

ovn-nbctl lrp-del-gateway-chassis lr0-public controller-1

To support HA, OVN uses the Bidirectional Forwarding Detection (BFD) protocol. It configures BFD on the tunnel ports. When a gateway chassis hosting a distributed gateway port goes down, all the chassis detect that (thanks to BFD), and the next-highest-priority gateway chassis claims the port. For more details, refer to the OVN man pages: man ovn-nb, man ovn-northd, and man ovn-controller.

Chassis redirect port

In the output of ovn-sbctl show, you can see Port_Binding "cr-lr0-public". What is cr-lr0-public? For every gateway router port scheduled, ovn-northd internally creates a logical port of type chassisredirect. This port represents an instance of the distributed gateway port that is scheduled on the selected chassis.

What happens when a VM sends external traffic?

Now let’s briefly see what happens when a VM associated with the logical port (let’s say sw0-port0) sends a packet to destination 172.168.0.110 from the OVN logical datapath pipeline perspective. Let’s assume the VM is running on compute-0 and the chassis redirect port is scheduled on controller-0. 172.168.0.110 could be associated with a physical server or a VM that is reachable via the provider network.

On the compute chassis (compute-0), the following occurs:

When the VM sends the traffic, the ingress pipeline of logical switch sw0 is run.

From the logical switch pipeline, the packet enters the ingress router pipeline via the lr0-sw0 port, as the packet needs to be routed.

The ingress router pipeline is run, the routing decision is made, and the outport is set to lr0-public.

Because the distributed gateway port lr0-public is represented by the chassis redirect port cr-lr0-public, which is scheduled on controller-0, the packet is sent to controller-0 via the tunnel port.

On the controller-0 chassis, the following occurs:

controller-0 receives the traffic on the tunnel port and runs the egress router pipeline, which applies the NAT rules.

The packet then enters the pipeline of the public logical switch and is delivered to the ext-localnet port, leaving for the physical network through the patch port to the provider bridge.

Conclusion

This article provides an overview of a distributed gateway router in OVN, how it is created and what happens when a VM sends external traffic. Hopefully this will be helpful in understanding external connectivity support in OVN and troubleshooting any issues related to it.

By Mark Michelson

In part one of this series, we explored the dynamic IP address management (IPAM) capabilities of Open Virtual Network. We covered the subnet, ipv6_prefix, and exclude_ips options on logical switches. We then saw how these options get applied to logical switch ports whose addresses have been set to the special “dynamic” value. OVN, a subproject of Open vSwitch, is used for virtual networking in a number of Red Hat products, such as Red Hat OpenStack Platform and Red Hat Virtualization, and it will be used by Red Hat OpenShift Container Platform in a future release.

In this part, we’re going to explore some of the oversights and downsides in the feature, how those have been corrected, and what’s in store for OVN in future versions.

Subnet changes

Let’s start by creating a simple logical switch with a couple of logical switch ports that use dynamic addresses:
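(A sketch; the subnet value and the resulting addresses are illustrative, and part one covers these options in detail.)

$ ovn-nbctl ls-add sw
$ ovn-nbctl set logical_switch sw other_config:subnet=192.168.0.0/24
$ ovn-nbctl lsp-add sw sw-p1 -- lsp-set-addresses sw-p1 dynamic
$ ovn-nbctl lsp-add sw sw-p2 -- lsp-set-addresses sw-p2 dynamic
$ ovn-nbctl --bare --columns=dynamic_addresses list logical_switch_port
0a:00:00:00:00:01 192.168.0.2
0a:00:00:00:00:02 192.168.0.3

Now let’s change the subnet:

$ ovn-nbctl set logical_switch sw other_config:subnet=10.0.0.0/24
$ ovn-nbctl --bare --columns=dynamic_addresses list logical_switch_port
0a:00:00:00:00:01 192.168.0.2
0a:00:00:00:00:02 192.168.0.3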

Huh? The dynamic addresses didn’t update. Prior to OVS version 2.10, the dynamic addresses would not automatically update if the subnet, ipv6_prefix, or exclude_ips was updated. If you wanted the dynamic addresses to update, you needed to clear the dynamic_addresses from the affected logical switch ports. The easiest way to clear the dynamic_addresses on all switch ports on switch sw is the following:
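(A sketch using ovn-nbctl’s generic database commands; output illustrative.)

$ for port in $(ovn-nbctl --bare --columns=ports list logical_switch sw); do
      ovn-nbctl clear logical_switch_port $port dynamic_addresses
  done
$ ovn-nbctl --bare --columns=dynamic_addresses list logical_switch_port
0a:00:00:00:00:03 10.0.0.3
0a:00:00:00:00:04 10.0.0.2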

There; that’s better. There are a couple of things to note here. First, the order in which IP addresses get assigned to the switch ports is not always predictable. The final octet of the IP addresses assigned to the switch ports was swapped from what it had previously been. Also, the MAC addresses have been updated on each switch port. When we cleared the dynamic_addresses, the MAC address assignments on the switch ports were lost. As a result, ovn-northd assigned new MAC addresses to the ports. Unfortunately, if you are using dynamic MAC addresses, this is unavoidable.

The good news is that starting with OVS 2.10.0, this is no longer necessary. Updating subnet, ipv6_prefix, or exclude_ips on a logical switch will automatically update the dynamic_addresses on all logical switch ports. The even better news is that only the affected values are updated, so in this particular case, the MAC addresses on each switch port stay the same.

Conflicting addresses

Let’s take our switch from the previous section and add a third switch port to it:
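For example, giving the new port a fixed address that collides with sw-p2’s current dynamic address from the sketch above:

$ ovn-nbctl lsp-add sw sw-p3 -- lsp-set-addresses sw-p3 "00:ac:00:ff:01:01 10.0.0.2"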

Oops—our new switch port has an address that conflicts with one of our dynamic addresses. This will result in errors when packets are sent. There are a couple of ways to clear this up.

One way to fix this is by clearing the dynamic_addresses of sw-p2, and then sw-p2 will get a new dynamic address assigned to it. As mentioned in the previous section, this also means that sw-p2 will get assigned a new MAC address.

The other way is to use ovn-nbctl lsp-set-addresses on sw-p3 so that it has an address that doesn’t conflict.

Starting with OVS version 2.10.0, this conflict can no longer occur. Instead, sw-p2 will automatically have its IP address updated to the next available address in the subnet. The code makes the assumption that statically assigned addresses are always correct and that dynamic addresses are “wrong” and need to be updated in the case of a conflict.

Starting with OVS version 2.11.0, it will be more difficult to cause this type of conflict. Watch what happens when we try the following with the current master of OVS:
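Retrying the same conflicting assignment, ovn-nbctl now refuses it and prints an error (the exact message varies by version):

$ ovn-nbctl lsp-add sw sw-p3 -- lsp-set-addresses sw-p3 "00:ac:00:ff:01:01 10.0.0.2"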

The error message indicates that the conflict is detected by ovn-nbctl and the conflicting address is not set on sw-p3. It still is possible to set a conflicting address on sw-p3 by dropping down a level:
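For example, the generic database commands skip the higher-level validation (a sketch):

$ ovn-nbctl set logical_switch_port sw-p3 addresses='"00:ac:00:ff:01:01 10.0.0.2"'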

Doing this will still result in the conflicting address being set in the northbound database, and it will result in sw-p2 being assigned a new IP address.

Other fixed problems

In this final section, we’ll examine some more minor things that were fixed in the 2.10 series of OVS. These are much less likely to happen than the issues explored in the previous two sections, and they are similar in nature. Here’s a brief summary:

Prior to 2.10, if the MAC address on a switch port changes from being statically assigned to dynamically assigned, the MAC address would not be updated. In 2.10+, the MAC address is dynamically assigned.

Prior to 2.10, if the IPv6 address is dynamically assigned and the MAC address on the port changes, then the IPv6 address is not updated. In 2.10+, when the MAC address is changed, the IPv6 address is recalculated too.

The future of IPAM in OVN

IPAM offers a handy way to have IP addresses and MAC addresses automatically get assigned to your logical switch ports. In part 1, we explored the basics of enabling IPAM in OVN, and in this part, we saw some downsides that have been fixed recently. But what is still to come? New developments are focused not so much on fixing issues as on adding features.

One improvement in the pipeline is to allow the pool of assignable MAC addresses to be configured. As we have seen in these posts, OVN assigns MAC addresses that start with “0a”. But what about deployments where you want OVN to assign MAC addresses, but you want to pick the range of MAC addresses to be assigned? This is currently being developed. One idea is to provide a start and end address, allowing OVN to assign addresses from that range. Another idea is to allow an Organizational Unique Identifier (OUI) to be configured and have OVN assign addresses using this OUI as a prefix.

Another improvement is to provide consistent pairings of IPv4 addresses and MAC addresses. Currently, OVN assigns MAC and IPv4 addresses independently of each other. However, it would be friendlier to ARP tables to pair the same IPv4 address with the same MAC address each time.

Both of the above ideas are currently in development, with a target of being available in the 2.11 series of OVS. I’m sure those of you reading these blog posts have ideas for further features that could be added. If you do, feel free to leave a comment on this post with your suggestion.

By Eelco Chaudron

When most people deploy an Open vSwitch configuration for virtual networking using the NORMAL rule, that is, using L2 learning, they do not think about configuring the size of the Forwarding DataBase (FDB).

When hardware-based switches are used, the FDB size is generally rather large, and a large FDB is a key selling point. However, for Open vSwitch, the default FDB size is rather small: in version 2.9 and earlier, it is only 2K entries. Starting with version 2.10, the default FDB size was increased to 8K entries. Note that for Open vSwitch, each bridge has its own FDB table, for which the size is individually configurable.

This blog post explains the effects of configuring too small an FDB table, how to identify which bridge is suffering from one, and how to configure the FDB table size appropriately.

Effects of too small an FDB table

When the FDB table is full and a new entry needs to be added, an older entry is removed to make room for the new one.[1] This is called FDB wrapping. If a packet is then received from the MAC address whose entry was removed, another entry is removed to make room, and the source MAC address of the packet will be re-added.

When more MAC addresses exist in the network than can be held in the configured FDB table size and all the MAC addresses are seen frequently, a lot of ping/ponging in the table can happen.

The more ping/ponging there is, the more CPU resources are needed to maintain the table. In addition, if traffic is received from evicted MAC addresses, the traffic is flooded out of all ports.

[1] The algorithm for removing older entries in Open vSwitch is as follows. On the specific bridge, the port with the most FDB entries is found and the oldest entry is removed.

Open vSwitch–specific manifestations of too small an FDB table

In addition to the FDB table updates, Open vSwitch also has to clean up the flow table when an FDB entry is removed. This is done by the Open vSwitch revalidator thread. Because this flow table cleanup takes quite a few CPU cycles, the first indication you might have of an FDB table wrapping issue is high revalidator thread utilization. The following example shows a high revalidator thread utilization of around 83% (deduced by adding the percentages shown in the %CPU column) on an idle system:
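(Thread names and numbers are illustrative; output abridged.)

$ top -H -p $(pidof ovs-vswitchd)
  PID USER      PR  NI    VIRT    RES  %CPU %MEM     TIME+ COMMAND
 3065 openvsw+  10 -10 4799500 331660  41.8  2.1  27:01.22 revalidator12
 3066 openvsw+  10 -10 4799500 331660  41.2  2.1  26:48.15 revalidator13
 3035 openvsw+  10 -10 4799500 331660   0.3  2.1   0:11.90 handler10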

Troubleshooting an FDB wrapping issue

Let’s figure out if the high revalidator thread CPU usage is related to the FDB requesting a cleanup. This can be done by inspecting the coverage counters. The following shows all coverage counters (that have a value higher than zero) related to causes for the revalidator running:
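For example (the rates here are illustrative):

$ ovs-appctl coverage/show | grep rev_
rev_reconfigure            0.0/sec     0.000/sec        0.0003/sec   total: 1
rev_port_toggled           0.0/sec     0.000/sec        0.0000/sec   total: 1
rev_mac_learning          20.2/sec    19.867/sec       18.2500/sec   total: 65712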

In the above output, you can see that rev_mac_learning has triggered the revalidation process about 20 times per second. This is quite high. In theory, it could still happen due to the normal FDB aging process, although in that specific case the last minute/hour values should be lower.

However, normal aging can be isolated by using the same coverage counters:
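For example (again with illustrative rates):

$ ovs-appctl coverage/show | grep mac_learning
mac_learning_learned     1836.4/sec  1840.033/sec     1837.2197/sec   total: 6663201
mac_learning_expired     1836.2/sec  1839.967/sec     1837.1708/sec   total: 6662340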

As you can see, there are mac_learning_learned and mac_learning_expired counters. In the above output, you can see a lot of new MAC addresses have been learned: around 1,836 per second. For an FDB table with the size of 2K, this is extremely high and would indicate we are replacing FDB entries.

If you are running Open vSwitch v2.10 or newer, it has additional coverage counters:

mac_learning_evicted: Shows the total number of evicted MAC entries, that is, entries moved out due to the table being full

mac_learning_moved: Shows the total number of “port moved” MAC entries, that is, entries where the MAC address moved to a different port

Now, how can you determine which bridge has an FDB wrapping issue? For v2.9 and earlier, it’s a manual process of dumping the FDB table a couple of times, using the command ovs-appctl fdb/show, and comparing the entries.

For v2.10 and higher a new command was introduced, ovs-appctl fdb/stats-show, which shows all the above statistics on a per-bridge basis:

$ ovs-appctl fdb/stats-show ovs0
Statistics for bridge "ovs0":
Current/maximum MAC entries in the table: 8192/8192
Total number of learned MAC entries : 52779
Total number of expired MAC entries : 8192
Total number of evicted MAC entries : 36395
Total number of port moved MAC entries : 1

NOTE: The statistics can be cleared with the command ovs-appctl fdb/stats-clear, which can be used, for example, to get a per-second rate:
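A sketch of that approach, using the ovs0 bridge from above:

$ ovs-appctl fdb/stats-clear ovs0
$ sleep 60
$ ovs-appctl fdb/stats-show ovs0    # divide the totals by 60 for per-second rates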

Fixing the FDB table size

With Open vSwitch, you can easily adjust the size of the FDB table, and it’s configurable per bridge. The command to do this is as follows:

ovs-vsctl set bridge <bridge> other-config:mac-table-size=<size>

When you change the configuration, take note of the following:

The number of FDB entries can be from 10 to 1,000,000.

The configuration is active immediately.

The current entries are not flushed from the table.

If a smaller number is configured than the number of entries currently in the table, the oldest entries are aged out. You can see this in the expired MAC entries statistics.
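For example, to give the ovs0 bridge from the earlier output room for 50,000 MAC entries (the value here is only an illustration):

$ ovs-vsctl set bridge ovs0 other-config:mac-table-size=50000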

Why not change the default to 1 million and stop worrying about this? Resource consumption: each entry in the table allocates memory. Although Open vSwitch allocates memory only when an entry is in use, changing the default to a value that is too high could become a problem, for example, when someone performs a MAC flooding attack.

So what would be the correct size to configure? This is hard to tell and depends on your use case. As a rule of thumb, you should configure your table a bit larger than the average number of active MAC addresses on your bridge.

Simple script to see FDB wrapping effects

If you would like to experiment with the counters, the following reproducer script from Jiri Benc lets you reproduce the effects of FDB wrapping.
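The sketch below is a minimal stand-in for that script. It assumes root privileges, Open vSwitch 2.10 or newer (for fdb/stats-show), and the mausezahn (mz) packet generator; any tool that can send frames with arbitrary source MAC addresses will do.

#!/bin/sh
# Minimal FDB-wrapping reproducer (a sketch, not the original script).
# Create a bridge with a deliberately tiny FDB and feed it frames from
# more source MACs than the table can hold.

ovs-vsctl add-br fdbtest
ovs-vsctl set bridge fdbtest other-config:mac-table-size=10

# Attach one end of a veth pair to the bridge; inject on the other end.
ip link add veth0 type veth peer name veth1
ip link set veth0 up
ip link set veth1 up
ovs-vsctl add-port fdbtest veth0

# Send one broadcast frame from each of 100 different source MACs.
i=0
while [ $i -lt 100 ]; do
    mac=$(printf "02:00:00:00:%02x:%02x" $((i / 256)) $((i % 256)))
    mz veth1 -a "$mac" -b bcast -c 1 -p 64
    i=$((i + 1))
done

# Inspect the counters described in the troubleshooting section.
ovs-appctl coverage/show | grep mac_learning
ovs-appctl fdb/stats-show fdbtest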

Now you can use the counter commands in the previous troubleshooting section to see the FDB table wrapping information and then set the size of the FDB appropriately.

Additional Open vSwitch and Open Virtual Network resources

Many of Red Hat's products, such as Red Hat OpenStack Platform and Red Hat Virtualization, are now using Open Virtual Network (OVN), a subproject of Open vSwitch. Red Hat OpenShift Container Platform will be using OVN soon. You can find more virtual networking articles on the Red Hat Developer blog.

Mark Michelson

Some background

For those unfamiliar, Open Virtual Network (OVN) is a subproject of Open vSwitch (OVS), a performant, programmable, multi-platform virtual switch. OVN provides the ability to express an overlay network as a series of virtual routers and switches. OVN also provides native methods for setting up Access Control Lists (ACLs), and it natively provides services such as DHCP. The components of OVN program OVS on each of the hypervisors in the network. Many of Red Hat's products, such as Red Hat OpenStack Platform and Red Hat Virtualization, are now using OVN. Red Hat OpenShift Container Platform will be using OVN soon.

Looking around the internet, it’s pretty easy to find high-quality tutorials on the basics of OVN. However, when it comes to more-advanced topics, it sometimes feels like the amount of information is lacking. In this tutorial, we’ll examine dynamic addressing in OVN. You will learn about IP address management (IPAM) options in OVN and how to apply them.

Static addressing

One of the first things you’ll learn when starting with OVN is how to create logical switches and logical switch ports. You’ll probably see something like this:
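$ ovn-nbctl ls-add sw
$ ovn-nbctl lsp-add sw sw-p1
$ ovn-nbctl lsp-set-addresses sw-p1 "00:ac:00:ff:01:01 192.168.0.1"

(The MAC and IP values above are arbitrary examples.)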

The first line creates a logical switch called sw. The second line adds a switch port called sw-p1 to sw. The final line sets the MAC and IP address of sw-p1. It’s pretty simple, but it requires you to manually keep track of IP addresses for the switch ports. Is there a way we could create a switch port without having to manually add the MAC and IP addresses?

Dynamic addressing

If you dig a bit deeper, you can find tutorials on the web that describe how to set up OVN to provide IP addresses using DHCP. This saves you some configuration steps on the VMs, but it doesn’t help any on the OVN side. You still have to specify an IP address on the logical switch port.

But is there some way that you can actually have OVN dynamically assign addresses to switch ports? If you scour the ovn-nb manpage, you might be able to piece together the way to do it.

This tutorial seeks to clear the air, so you can know exactly what tools are available to you and how to use them.

For our demonstration, we will use a very simple logical switch with two ports.

Switch configuration

Let's start with the relevant options you can set, and then we'll look at some examples that use these options. All of these are set as other_config on logical switches; a configuration sketch follows the list.

subnet: This is an IPv4 subnet, specified as a network address and mask. For example, 10.0.0.0/8 or 10.0.0.0/255.0.0.0.

exclude_ips: This is a list of IPv4 addresses that should not be assigned to switch ports. You can either comma-separate individual addresses, or you can specify a range of addresses using two dots (..).

ipv6_prefix: This is an IPv6 network address of 64 bits. If you provide a longer address size than 64 bits, those bits past the first 64 are ignored. The IPv6 address provided on each switch port is an EUI-64 address using the specified prefix and the MAC address of the port.
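As a sketch, configuring the switch sw with these options might look like the following (the subnet matches the examples below; the IPv6 prefix is an arbitrary documentation prefix):

$ ovn-nbctl set Logical_Switch sw other_config:subnet=192.168.0.0/24
$ ovn-nbctl set Logical_Switch sw other_config:ipv6_prefix=2001:db8::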

Once you have these options set on your switch, it’s then a matter of setting your switch ports up to make use of these options. You can do this in one of two ways.

Method #1:

$ ovn-nbctl lsp-set-addresses port 00:ac:00:ff:01:01 dynamic

Method #2:

$ ovn-nbctl lsp-set-addresses port dynamic

With method #1, you specify the MAC address, and with method #2, you allow OVN to allocate the MAC address for you.
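The result can be inspected in the northbound database; the output below is illustrative:

$ ovn-nbctl --columns=name,dynamic_addresses list Logical_Switch_Port
name                : "port1"
dynamic_addresses   : "0a:00:00:00:00:01 192.168.0.2"

name                : "port2"
dynamic_addresses   : "00:ac:00:ff:01:01 192.168.0.3"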

Notice the dynamic_addresses for the two switch ports. This database column is automatically populated by ovn-northd based on the IPAM configuration on the logical switch. For port1, we specified dynamic for the addresses, so OVN created a MAC address and an IP address for us. You can recognize OVN-allocated MAC addresses because they always start with 0a.

port1 was assigned the IPv4 address 192.168.0.2, and port2 was assigned 192.168.0.3. Why does the addressing start with .2 instead of .1? OVN reserves the first address of a subnet for the router that the switch attaches to. In our case, there is no router, so no switch port was assigned 192.168.0.1. The current algorithm of ovn-northd assigns addresses consecutively within the subnet.

For the rest of this tutorial, we will work knowing that this is how ovn-northd operates. However, since the documentation does not state that this is how dynamic IPv4 addressing works, it may be risky to rely on this behavior in your application. A change in OVS versions may result in a change in addressing.

The IPv6 addresses for each port are EUI-64 addresses. The first 64 bits of the address are the ipv6_prefix that we configured. The rest of the address is derived from the MAC address. When configuring an ipv6_prefix, keep in mind that even though only the first 64 bits of the address are used, OVN expects a valid IPv6 address to be provided. Therefore, if you are providing 64 bits, be sure to end the address with :: so that OVN will process it as expected.
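To make that concrete, here is a hypothetical worked example: with ipv6_prefix 2001:db8:: and port MAC 0a:00:00:00:00:01, the EUI-64 interface identifier is formed by splitting the MAC in half, inserting ff:fe in the middle (giving 0a:00:00:ff:fe:00:00:01), and flipping the universal/local bit of the first octet (0a becomes 08). The resulting address is 2001:db8::800:ff:fe00:1.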

Excluding IP addresses

Let's take a closer look at the exclude_ips option by setting it and then adding more ports to see what happens.
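A sketch of the setup, with values matching the description below:

$ ovn-nbctl set Logical_Switch sw other_config:exclude_ips="192.168.0.4 192.168.0.6..192.168.0.100"
$ ovn-nbctl lsp-add sw port3
$ ovn-nbctl lsp-set-addresses port3 dynamic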

Before continuing, let's take a closer look at the syntax. First, we specified one IP address: 192.168.0.4. This means that this individual IP address will not be dynamically assigned. Next, we specified a range of IP addresses using two dots (..). This is a lot more practical than spelling out the 95 IP addresses from 192.168.0.6 to 192.168.0.100. The quotation marks around the string are necessary so that the shell does not interpret the space between the entries as separate arguments to ovn-nbctl.

Only IPv4 addresses can be specified in exclude_ips. Since IPv6 addresses are derived from the port’s MAC address, there is no point in specifying any excluded addresses.

Based on the pattern we had previously seen, we might expect port3 to have IP address 192.168.0.4. However, that address is in our excluded set of IP addresses. Let's see what port3 has been assigned:
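$ ovn-nbctl --columns=dynamic_addresses list Logical_Switch_Port port3
dynamic_addresses   : "0a:00:00:00:00:03 192.168.0.101"

(The MAC address shown is illustrative.)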

It got assigned 192.168.0.101 since all addresses between 192.168.0.6 and 192.168.0.100 are in our excluded set.

Our final logical switch thus has port1 at 192.168.0.2, port2 at 192.168.0.3, and port3 at 192.168.0.101.

What’s to come

With all this, you should have the tools you need to set up IP addresses on your logical switches without the need to keep track of assigned addresses in your application.

But there’s more to this than what I have presented here. In part 2 of this blog series, we’ll look at some of the downsides in the IPAM implementation of OVN, and we will delve into improvements that are coming in an upcoming version of OVN.

Eelco Chaudron

The most common problem when people are trying to deploy an Open vSwitch with Data Plane Development Kit (OvS-DPDK) solution is that the performance is not as expected. For example, they are losing packets. This is where our journey for this series of blogs will start.

This first blog is about Poll Mode Driver (PMD) thread core affinity. It covers how to configure thread affinity and how to verify that it’s set up correctly. This includes making sure no other threads are using the CPU cores.

Dedicate CPU Cores to the PMD Threads

PMD threads are the threads that handle the receiving and processing of packets from the assigned receive queues. They do this in a tight loop, and anything interrupting these threads can cause packets to be dropped. That is why these threads must run on dedicated CPU cores; that is, no other threads in the system should run on these cores. This also applies to various Linux kernel tasks, which must be kept off these cores.

Let's assume you would like to use CPU cores 1 and 15 (a single hyper-thread pair) for your PMD threads. Because each core corresponds to one bit in the mask, this translates into a pmd-cpu-mask of 0x8002 (bit 1 plus bit 15).
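The mask is set through the Open vSwitch database:

$ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x8002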

To manually accomplish the isolation you have to do the following:

Use the Linux kernel command line option isolcpus to isolate the PMD cores from the general SMP balancing and scheduling algorithms. For the example above, you would use the following: isolcpus=1,15. Please note that the isolcpus= parameter is deprecated in favor of cpusets. For more information check the kernel documentation.

Reducing the number of clock tick interrupts can be done with the combined nohz=on nohz_full=1,15 command-line options. This reduces the times the PMD threads get interrupted for servicing timer interrupts. More details on this subject can be found here: NO_HZ.txt

For the above to work correctly we need another command-line option, rcu_nocbs=1,15, or else the kernel will still interrupt the thread; details are in the same document: NO_HZ.txt.
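Putting these options together for our example, the kernel command line would include something like the following:

isolcpus=1,15 nohz=on nohz_full=1,15 rcu_nocbs=1,15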

NOTE: For the above kernel options you might need to add additional cores that also need isolation. For example, cores assigned to one or more virtual machines and the cores configured by the dpdk-lcore-mask.

To make all of the above more convenient, you can use the tuned profile called cpu-partitioning. There is a somewhat older blog post on tuned that might be helpful. In short, this is how you configure it:
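A sketch of the steps on a RHEL/Fedora-style system (package and file names may differ elsewhere):

# yum install tuned-profiles-cpu-partitioning
# echo "isolated_cores=1,15" >> /etc/tuned/cpu-partitioning-variables.conf
# tuned-adm profile cpu-partitioning
# reboot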

Kevin Traynor

Introduction

This article is about debugging out-of-memory issues with Open vSwitch with the Data Plane Development Kit (OvS-DPDK). It explains the situations in which you can run out of memory when using OvS-DPDK and it shows the log entries that are produced in those circumstances. It also shows some other log entries and commands for further debugging.

When you finish reading this article, you will be able to identify that you have an out-of-memory issue, and you'll know how to fix it. Spoiler: usually, having some more memory on the relevant NUMA node works. The article is based on OvS 2.9.

Background

As is normal with DPDK-type applications, it is expected that hugepage memory has been set up and mounted. For further information, see the OvS-DPDK documentation on how to set up huge pages.

The next step is to specify the amount of memory pre-allocated for OvS-DPDK. This is done using the Open vSwitch Database (OVSDB). In the case below, 4GB of huge-page memory is pre-allocated on NUMA node 0 and NUMA node 1.
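That pre-allocation looks like this (dpdk-socket-mem takes a comma-separated list of megabytes per NUMA node, so "4096,4096" pre-allocates 4GB on each of nodes 0 and 1):

$ ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem="4096,4096"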

Any issues at this stage can be fixed by correctly setting up huge pages and requesting pre-allocation of an appropriate amount.

Adding a Port or Changing the MTU

These situations are grouped together because they can both result in a new pool of buffers being requested for a port. Where possible, these pools of buffers will be shared and reused, but that is not always possible due to differing port NUMA nodes or MTUs.

For new requests, the size of each buffer is fixed (MTU-based) but the number of buffers can be variable and OvS-DPDK will retry for a lower number of buffers if there is not enough memory for initial requests.

When DPDK cannot provide the requested memory to any one of the requests, it reports the following:

|dpdk|ERR|RING: Cannot reserve memory

While that may look serious, it's nothing to worry about, because OvS handles this and simply retries with a lower amount. If, however, the retries do not work, then something like the following will be in the log:
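In OvS 2.9, the failure is reported along these lines (the port name, MTU, and socket will match your configuration):

|netdev_dpdk|ERR|Insufficient memory to create memory pool for netdev dpdk0, with MTU 9000 on socket 0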

If you were changing the MTU, the MTU change fails but the port will continue to operate with the previous MTU.

How can you fix these errors? The general guide would be just to give OvS-DPDK more memory on the relevant NUMA node, or stick with a lower MTU.

Starting a VM

It doesn’t seem obvious why you would run out of memory when starting a VM, as opposed to when you are adding a vhost port for it (previous section). The key is vhost NUMA reallocation.

When a VM is started, DPDK checks the NUMA node of the memory shared from the guest. This may result in requesting a new pool of buffers from the same NUMA node. But of course, there might be no memory pre-allocated with dpdk-socket-mem on that NUMA node, or else there might be insufficient memory left.

The fix for this is having enough memory on the relevant NUMA node, or changing the libvirt/QEMU settings so VM memory is from a different NUMA node.
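As a sketch of the latter, guest memory can be pinned to a specific NUMA node (node 0 here) in the libvirt domain XML:

<numatune>
  <memory mode='strict' nodeset='0'/>
</numatune>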

Runtime, Adding a Port, or Adding Queues

Didn’t we already cover adding a port? Yes, we did; however, this section is for when we get a requested pool of buffers, but some time later that proves to be insufficient.

This might be because there are many ports and queues sharing a pool of buffers and by the time some buffers are reserved for Rx queues, some are in flight processing and some are waiting to be returned from Tx queues, so there just aren’t enough buffers to go around.

For example, when this occurs while using a physical NIC, the log will contain buffer-allocation failures reported for the port's receive queues; the exact messages depend on the DPDK driver in use.

Wrap-up

If you have read this far, it probably means you've hit an issue with OvS-DPDK. Sorry to hear that. Hopefully, after reading the guide above, you'll be able to identify whether the issue was due to running out of memory, and you'll know how to fix it.