By Drew Gallatin

In the summer of 2015, the Netflix Open Connect CDN team decided to take on an ambitious project. The goal was to leverage the new 100GbE network interface technology just coming to market in order to be able to serve at 100 Gbps from a single FreeBSD-based Open Connect Appliance (OCA) using NVM Express (NVMe)-based storage.

At the time, the bulk of our flash storage-based appliances were close to being CPU limited serving at 40 Gbps using a single-socket Xeon E5–2697v2. The first step was to find the CPU bottlenecks in the existing platform while we waited for newer CPUs from Intel, newer motherboards with PCIe Gen3 x16 slots that could run the new Mellanox 100GbE NICs at full speed, and for systems with NVMe drives.

Fake NUMA

Normally, most of an OCA’s content is served from disk, with only 10–20% of the most popular titles being served from memory (see our previous blog, Content Popularity for Open Connect for details). However, our early pre-NVMe prototypes were limited by disk bandwidth. So we set up a contrived experiment where we served only the very most popular content on a test server. This allowed all content to fit in RAM and therefore avoid the temporary disk bottleneck. Surprisingly, the performance actually dropped from being CPU limited at 40 Gbps to being CPU limited at only 22 Gbps!

After doing some very basic profiling with pmcstat and flame graphs, we suspected that we had a problem with lock contention. So we ran the DTrace-based lockstat lock profiling tool that is provided with FreeBSD. Lockstat told us that we were spending most of our CPU time waiting for the lock on FreeBSD’s inactive page queue. Why was this happening? Why did this get worse when serving only from memory?

A Netflix OCA serves large media files using NGINX via the asynchronous sendfile() system call. (See NGINX and Netflix Contribute New sendfile(2) to FreeBSD). The sendfile() system call fetches the content from disk (unless it is already in memory) one 4 KB page at a time, wraps it in a network memory buffer (mbuf), and passes it to the network stack for optional encryption and transmission via TCP. After the network stack releases the mbuf, a callback into the VM system causes the 4K page to be released. When the page is released, it is either freed into the free page pool, or inserted into a list of pages that may be needed again, known as the inactive queue. Because we were serving entirely from memory, NGINX was advising sendfile() that most of the pages would be needed again — so almost every page on the system went through the inactive queue.

The problem here is that the inactive queue is structured as a single list per non-uniform memory access (NUMA) domain, and is protected by a single mutex lock. By serving everything from memory, we moved a large percentage of the page release activity from the free page pool (where we already had a per-CPU free page cache, thanks to earlier work by Netflix’s Randall Stewart and Scott Long, and Isilon’s Jeff Roberson) to the inactive queue. The obvious fix would have been to add a per-CPU inactive page cache, but the system still needs to be able to find the page when it needs it again, and pages are hashed to the per-NUMA queues in a predictable way.

The ultimate solution we came up with is what we call “Fake NUMA”. This approach takes advantage of the fact that there is one set of page queues per NUMA domain. All we had to do was to lie to the system and tell it that we have one Fake NUMA domain for every 2 CPUs. After we did this, our lock contention nearly disappeared and we were able to serve at 52 Gbps (limited by the PCIe Gen3 x8 slot) with substantial CPU idle time.
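To make the effect concrete, here is a toy Python model of how adding fake domains spreads page releases over more queue locks. It is a sketch, not FreeBSD code: the 24-CPU count, the 48 GiB memory size, and the address-range mapping in `domain_for_page` are all assumptions made for illustration.

```python
from collections import Counter

PAGE_SIZE = 4096
NCPUS = 24                 # hypothetical CPU count
MEM_BYTES = 48 * 2**30     # hypothetical 48 GiB of RAM

def domain_for_page(phys_addr, mem_bytes, ndomains):
    """Map a physical address to a domain by contiguous address range.

    Assumption: every domain owns an equal slice of physical memory.
    The mapping is deterministic, so a page can always be found again.
    """
    return phys_addr * ndomains // mem_bytes

def busiest_lock_share(page_addrs, mem_bytes, ndomains):
    """Fraction of all page releases that land on the busiest queue lock."""
    counts = Counter(domain_for_page(a, mem_bytes, ndomains) for a in page_addrs)
    return max(counts.values()) / len(page_addrs)

# 12,000 released pages spread evenly across physical memory.
mem_pages = MEM_BYTES // PAGE_SIZE
released = [(i * mem_pages // 12000) * PAGE_SIZE for i in range(12000)]

one_domain_share = busiest_lock_share(released, MEM_BYTES, 1)
fake_numa_share = busiest_lock_share(released, MEM_BYTES, NCPUS // 2)
print(one_domain_share, fake_numa_share)
```

With a single domain every release hits the same lock. With one fake domain per 2 CPUs, the busiest lock in this model sees only about one twelfth of the releases.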

Pbufs

After we had newer prototype machines, with an Intel Xeon E5 2697v3 CPU, PCIe Gen3 x16 slots for the 100GbE NIC, and more disk storage (4 NVMe or 44 SATA SSD drives), we hit another bottleneck, also related to a lock on a global list. We were stuck at around 60 Gbps on this new hardware, and we were constrained by pbufs.

FreeBSD uses a “buf” structure to manage disk I/O. Bufs that are used by the paging system are statically allocated at boot time and kept on a global linked list that is protected by a single mutex. This was done long ago, for several reasons, primarily to avoid needing to allocate memory when the system is already low on memory, and trying to page or swap data out in order to be able to free memory. Our problem is that the sendfile() system call uses the VM paging system to read files from disk when they are not resident in memory. Therefore, all of our disk I/O was constrained by the pbuf mutex.

Our first problem was that the list was too small. We were spending a lot of time waiting for pbufs. This was easily fixed by raising the kern.nswbuf tunable, which increases the number of pbufs allocated at boot time. However, this update revealed the next problem, which was lock contention on the global pbuf mutex. To solve this, we changed the vnode pager (which handles paging to files, rather than the swap partition, and hence handles all sendfile() I/O) to use the normal kernel zone allocator. This change removed the lock contention, and boosted our performance into the 70 Gbps range.

Proactive VM Page Scanning

As noted above, we make heavy use of the VM page queues, especially the inactive queue. Eventually, the system runs short of memory and these queues need to be scanned by the page daemon to free up memory. At full load, this was happening roughly twice per minute. When this happened, all NGINX processes would go to sleep in vm_wait() and the system would stop serving traffic while the pageout daemon worked to scan pages, often for several seconds. This had a severe impact on key metrics that we use to determine an OCA’s health, especially NGINX serving latency.

The basic system health can be expressed as follows (I wish this was a cartoon):

This problem is actually made progressively worse as one adds NUMA domains, because there is one pageout daemon per NUMA domain, but the page deficit that it is trying to clear is calculated globally. So if the VM pageout daemon decides to clean, say, 1 GB of memory and there are 16 domains, each of the 16 pageout daemons will individually attempt to clean 1 GB of memory.
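The arithmetic behind that over-cleaning is simple enough to write down. This is a back-of-the-envelope sketch, and the function name is ours, not a kernel symbol.

```python
GB = 10**9

def total_pageout_target(global_deficit_bytes, ndomains):
    """Total memory the daemons try to clean when each of the per-domain
    pageout daemons independently targets the full global deficit."""
    return global_deficit_bytes * ndomains

combined_target = total_pageout_target(1 * GB, 16)
print(combined_target // GB, "GB")
```

In the 16-domain example, a 1 GB deficit turns into 16 GB of combined cleaning work.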

To solve this problem, we decided to proactively scan the VM page queues. In the sendfile path, when allocating a page for I/O, we run the pageout code several times per second on each VM domain. The pageout code is run in its lightest-weight mode in the context of one unlucky NGINX process. Other NGINX processes continue to run and serve traffic while this is happening, so we can avoid bursts of pager activity that block traffic serving. Proactive scanning allowed us to serve at roughly 80 Gbps on the prototype hardware.
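The control flow of "scan a little, a few times per second, from the allocation path" can be sketched as a per-domain rate limiter. This is a minimal Python model, not the kernel change itself: the class name, the 0.1 second interval, and the `light_scan` placeholder are assumptions for illustration.

```python
import time

class ProactiveScanner:
    """Rate-limits lightweight page queue scans, one timer per VM domain."""

    def __init__(self, ndomains, interval_s=0.1, clock=time.monotonic):
        self.interval_s = interval_s      # assumed: roughly 10 scans per second
        self.clock = clock
        self.next_scan = [0.0] * ndomains
        self.scans = [0] * ndomains

    def on_page_alloc(self, domain):
        """Called from the page allocation path; returns True if this
        caller was the unlucky one that ran the scan."""
        now = self.clock()
        if now < self.next_scan[domain]:
            return False                  # scanned recently, nothing to do
        self.next_scan[domain] = now + self.interval_s
        self.light_scan(domain)
        return True

    def light_scan(self, domain):
        # Placeholder for the lightest-weight pageout pass on one domain.
        self.scans[domain] += 1
```

Only the caller that crosses the interval boundary pays for the scan. Every other allocation returns immediately, which matches the idea that the remaining NGINX processes keep serving traffic.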

RSS Assisted LRO

TCP Large Receive Offload (LRO) is the technique of combining several packets received for the same TCP connection into a single large packet. This technique reduces system load by reducing trips through the network stack. The effectiveness of LRO is measured by the aggregation rate. For example, if we are able to receive four packets and combine them into one, then our LRO aggregation rate is 4 packets per aggregation.

The FreeBSD LRO code will, by default, manage up to 8 packet aggregations at one time. This works really well on a LAN, when serving traffic over a small number of really fast connections. However, we have tens of thousands of active TCP connections on our 100GbE machines, so our aggregation rate was rarely better than 1.1 packets per aggregation on average.

Hans Petter Selasky, Mellanox’s 100GbE driver developer, came up with an innovative solution to our problem. Most modern NICs will supply a Receive Side Scaling (RSS) hash result to the host. RSS is a standard developed by Microsoft wherein TCP/IP traffic is hashed by source and destination IP address and/or TCP source and destination ports. The RSS hash result will almost always uniquely identify a TCP connection. Hans’ idea was that rather than just passing the packets to the LRO engine as they arrive from the network, we should hold the packets in a large batch, and then sort the batch of packets by RSS hash result (and original time of arrival, to keep them in order). After the packets are sorted, packets from the same connection are adjacent even when they arrive widely separated in time. Therefore, when the packets are passed to the FreeBSD LRO routine, it can aggregate them.

With this new LRO code, we were able to achieve an LRO aggregation rate of over 2 packets per aggregation, and were able to serve at well over 90 Gbps for the first time on our prototype hardware for mostly unencrypted traffic.

An RX queue containing 1024 packets from 256 connections would have 4 packets from the same connection in the ring, but the LRO engine would not be able to see that the packets belonged together, because it maintained just a handful of aggregations at once. After sorting by RSS hash, the packets from the same connection appear adjacent in the queue, and can be fully aggregated by the LRO engine.
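The 1024-packet example can be reproduced with a small simulation. This is a toy model under stated assumptions, not the FreeBSD LRO code: the engine tracks at most 8 aggregations, evicts the least recently used one when full (our assumption), and ignores every real-world limit on aggregation size. `toy_rss_hash` is a stand-in for the NIC-supplied hash.

```python
from collections import OrderedDict

def lro_aggregation_rate(packets, max_active=8):
    """packets: iterable of (rss_hash, arrival_seq) tuples.
    Returns the average number of packets per aggregation."""
    active = OrderedDict()        # rss_hash -> packets in the open aggregation
    aggregations = 0
    total = 0
    for rss_hash, _arrival in packets:
        total += 1
        if rss_hash in active:
            active[rss_hash] += 1
            active.move_to_end(rss_hash)
            continue
        if len(active) >= max_active:
            active.popitem(last=False)    # flush the least recently used entry
            aggregations += 1
        active[rss_hash] = 1
    aggregations += len(active)           # flush whatever is still open
    return total / aggregations if aggregations else 0.0

def toy_rss_hash(conn_id):
    # Hypothetical stand-in for the NIC's hash; unique for each conn_id here.
    return (conn_id * 2654435761) % 2**32

# 1024 packets from 256 connections, arriving interleaved.
ring = [(toy_rss_hash(i % 256), i) for i in range(1024)]

unsorted_rate = lro_aggregation_rate(ring)
sorted_rate = lro_aggregation_rate(sorted(ring))   # by hash, then arrival
print(unsorted_rate, sorted_rate)
```

In this model the interleaved ring never aggregates anything, because each connection's slot is evicted long before its next packet shows up. Sorting by (hash, arrival) puts each connection's 4 packets back to back while preserving their order, so all of them are combined.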

New Goal: TLS at 100 Gbps

So the job was done. Or was it? The next goal was to achieve 100 Gbps while serving only TLS-encrypted streams.

By this point, we were using hardware which closely resembles today’s 100GbE flash storage-based OCAs: four NVMe PCIe Gen3 x4 drives, 100GbE Ethernet, and a Xeon E5v4 2697A CPU. With the improvements described in the Protecting Netflix Viewing Privacy at Scale blog entry, we were able to serve TLS-only traffic at roughly 58 Gbps.

In the lock contention problems we’d observed above, the cause of any increased CPU use was relatively apparent from normal system level tools like flame graphs, DTrace, or lockstat. The 58 Gbps limit was comparatively strange. As before, the CPU use would increase linearly as we approached the 58 Gbps limit, but then as we neared the limit, the CPU use would increase almost exponentially. Flame graphs just showed everything taking longer, with no apparent hotspots.

We finally had a hunch that we were limited by our system’s memory bandwidth. We used the Intel® Performance Counter Monitor Tools to measure the memory bandwidth we were consuming at peak load. We then wrote a simple memory thrashing benchmark that used one thread per core to copy between large memory chunks that did not fit into cache. According to the PCM tools, this benchmark consumed the same amount of memory bandwidth as our OCA’s TLS-serving workload. So it was clear that we were memory limited.

At this point, we became focused on reducing memory bandwidth usage. To assist with this, we began using the Intel VTune profiling tools to identify memory loads and stores, and to identify cache misses.

Read Modify Write

Because we are using sendfile() to serve data, encryption is done from the virtual memory page cache into connection-specific encryption buffers. This preserves the normal FreeBSD page cache in order to allow serving of hot data from memory to many connections. One of the first things that stood out to us was that the ISA-L encryption library was using half again as much memory bandwidth for memory reads as it was for memory writes. From looking at VTune profiling information, we saw that ISA-L was somehow reading both the source and destination buffers, rather than just writing to the destination buffer.

We realized that this was because the AVX instructions used by ISA-L for encryption on our CPUs worked on 256-bit (32-byte) quantities, whereas the cache line size was 512 bits (64 bytes), thus triggering the system to do read-modify-writes when data was written. The problem is that the CPU will normally access the memory system in 64-byte cache line-sized chunks, reading an entire 64 bytes to access even just a single byte. In this case, the CPU needed to write 32 bytes of a cache line, but using read-modify-writes to handle those writes meant that it was reading the entire 64-byte cache line in order to be able to write that first 32 bytes. This was especially silly, because the very next thing that would happen would be that the second half of the cache line would be written.

After a quick email exchange with the ISA-L team, they provided us with a new version of the library that used non-temporal instructions when storing encryption results. Non-temporals bypass the cache, and allow the CPU direct access to memory. This meant that the CPU was no longer reading from the destination buffers, and so this increased our bandwidth from 58 Gbps to 65 Gbps.

In parallel with this optimization, the spec for our final production machines was changed from using lower cost DDR4–1866 memory to using DDR4–2400 memory, which was the fastest supported memory for our platform. With the faster memory, we were able to serve at 76 Gbps.

VTune Driven Optimizations

We spent a lot of time looking at VTune profiling information, re-working numerous core kernel data structures to have better alignment, and using minimally-sized types to be able to represent the possible ranges of data that could be expressed there. Examples of this approach include rearranging the fields of kernel structs related to TCP, and re-sizing many of the fields that were originally expressed in the 1980s as “longs” which need to hold 32 bits of data, but which are now 64 bits on 64-bit platforms.
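The effect of field ordering and field width on structure size can be shown with Python's ctypes module, which lays structures out using the platform's C alignment rules. Both structures below are hypothetical and their field names are invented for illustration; they are not FreeBSD's TCP structures.

```python
import ctypes

class PaddedStats(ctypes.Structure):
    """Hypothetical layout: narrow fields interleaved with 64-bit counters."""
    _fields_ = [
        ("state", ctypes.c_uint8),
        ("bytes_acked", ctypes.c_int64),   # a 'long' on a 64-bit platform
        ("flags", ctypes.c_uint8),
        ("rtt_ticks", ctypes.c_int64),
    ]

class CompactStats(ctypes.Structure):
    """Same information, assuming 32 bits is enough for each counter."""
    _fields_ = [
        ("bytes_acked", ctypes.c_uint32),
        ("rtt_ticks", ctypes.c_uint32),
        ("state", ctypes.c_uint8),
        ("flags", ctypes.c_uint8),
    ]

CACHE_LINE = 64
for cls in (PaddedStats, CompactStats):
    size = ctypes.sizeof(cls)
    print(cls.__name__, size, "bytes,", CACHE_LINE // size, "per cache line")
```

Grouping fields by size removes the padding between them, and narrowing the counters shrinks the structure further, so more instances fit in each 64-byte cache line.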

Another trick we use is to avoid accessing rarely used cache lines of large structures. For example, FreeBSD’s mbuf data structure is incredibly flexible, and allows referencing many different types of objects and wrapping them for use by the network stack. One of the biggest sources of cache misses in our profiling was the code to release pages sent by sendfile(). The relevant part of the mbuf data structure looks like this:

The problem is that ext_arg2 fell in the 3rd cache line of the mbuf, and was the only thing accessed in that cache line. Even worse, in our workload ext_arg2 was almost always NULL. So we were paying to read 64 bytes of data for every 4 KB we sent, where that pointer was NULL virtually all the time. After failing to shrink the mbuf, we decided to augment the ext_flags to save enough state in the first cache line of the mbuf to determine if ext_arg2 was NULL. If it was, then we just passed NULL explicitly, rather than dereferencing ext_arg2 and taking a cache miss. This gained almost 1 Gbps of bandwidth.
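The flag trick can be modeled by tracking which cache lines a free routine touches. This is a toy Python model: the byte offsets, the `EXT_FLAG_ARG2_NULL` bit, and the `ToyMbuf` class are hypothetical and do not reflect the real mbuf layout.

```python
CACHE_LINE = 64
EXT_FLAG_ARG2_NULL = 0x1          # hypothetical flag bit

class ToyMbuf:
    # Hypothetical byte offsets: ext_flags in line 0, ext_arg2 in line 2.
    OFFSETS = {"ext_flags": 24, "ext_arg2": 136}

    def __init__(self, ext_arg2=None):
        self.ext_arg2 = ext_arg2
        # Record "arg2 is NULL" in the first cache line when it is set.
        self.ext_flags = EXT_FLAG_ARG2_NULL if ext_arg2 is None else 0
        self.lines_touched = set()

    def read(self, field):
        """Read a field and remember which cache line that access hit."""
        self.lines_touched.add(self.OFFSETS[field] // CACHE_LINE)
        return getattr(self, field)

def free_arg_naive(m):
    m.read("ext_flags")
    return m.read("ext_arg2")          # always touches the 3rd cache line

def free_arg_flagged(m):
    if m.read("ext_flags") & EXT_FLAG_ARG2_NULL:
        return None                    # known NULL, skip the far cache line
    return m.read("ext_arg2")
```

When the pointer is NULL, the flagged path stays entirely within the first cache line. When it is not, both paths behave the same.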

Getting Out of Our Own Way

VTune and lockstat pointed out a number of oddities in system performance, most of which came from the data collection that is done for monitoring and statistics.

The first example is a metric monitored by our load balancer: TCP connection count. This metric is needed so that the load balancing software can tell if the system is underloaded or overloaded. The kernel did not export a connection count, but it did provide a way to export all TCP connection information, which allowed user space tools to calculate the number of connections. This was fine for smaller scale servers, but with tens of thousands of connections, the overhead was noticeable on our 100GbE OCAs. When asked to export the connections, the kernel first took a lock on the TCP connection hash table, copied it to a temporary buffer, dropped the lock, and then copied that buffer to userspace. Userspace then had to iterate over the table, counting connections. This both caused cache misses (lots of unneeded memory activity), and lock contention for the TCP hash table. The fix was quite simple. We added per-CPU lockless counters that tracked TCP state changes, and exported a count of connections in each TCP state.
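The counter scheme can be sketched as one row of counters per CPU, where each CPU only ever writes its own row and a reader sums the rows. This is a Python model of the idea, not the kernel code: the class name and the abbreviated state list are assumptions.

```python
TCP_STATES = ("CLOSED", "SYN_RECEIVED", "ESTABLISHED", "FIN_WAIT_1", "TIME_WAIT")

class PerCpuStateCounters:
    def __init__(self, ncpus):
        # One private row per CPU, so updates never contend on shared state.
        self.counts = [dict.fromkeys(TCP_STATES, 0) for _ in range(ncpus)]

    def state_change(self, cpu, old_state, new_state):
        """Record a transition on whichever CPU happens to process it."""
        if old_state is not None:
            self.counts[cpu][old_state] -= 1
        if new_state is not None:
            self.counts[cpu][new_state] += 1

    def totals(self):
        """Sum the rows; only the sums are meaningful."""
        return {s: sum(row[s] for row in self.counts) for s in TCP_STATES}

counters = PerCpuStateCounters(ncpus=2)
counters.state_change(0, None, "ESTABLISHED")          # opened on CPU 0
counters.state_change(1, "ESTABLISHED", "TIME_WAIT")   # closed on CPU 1
print(counters.totals())
```

A single row can go negative when a connection changes state on a different CPU than the one that opened it. The per-state sum is still correct, and the sum is all the load balancer needs.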

Another example is that we were collecting detailed TCP statistics for every TCP connection. The goal of these statistics is to monitor the quality of customers’ sessions. The detailed statistics were quite expensive, both in terms of cache misses and in terms of CPU. On a fully loaded 100GbE server with many tens of thousands of active connections, the TCP statistics consumed 5–10% of the CPU. The solution to this problem was to only keep detailed statistics on a small percentage of connections. This dropped CPU used by TCP statistics to below 1%.

These changes resulted in a speedup of 3–5 Gbps.

Mbuf Page Arrays

The FreeBSD mbuf system is the workhorse of the network stack. Every packet which transits the network is composed of one or more mbufs, linked together in a list. The FreeBSD mbuf system is very flexible, and can wrap nearly any external object for use by the network stack. FreeBSD’s sendfile() system call, used to serve the bulk of our traffic, makes use of this feature by wrapping each 4K page of a media file in an mbuf, each with its own metadata (free function, arguments to the free function, reference count, etc).

The drawback to this flexibility is that it leads to a lot of mbufs being chained together. A single 1 MB HTTP range request going through sendfile can reference 256 VM pages, and each one will be wrapped in an mbuf and chained together. This gets messy fast.

At 100 Gbps, we’re moving about 12.5 GB/s of 4K pages through our system unencrypted. Adding encryption doubles that to 25 GB/s worth of 4K pages. That’s about 6.25 million mbufs per second. When you add in the extra 2 mbufs used by the crypto code for TLS metadata at the beginning and end of each TLS record, that works out to another 1.6M mbufs/sec, for a total of about 8M mbufs/second. With roughly 2 cache line accesses per mbuf, that’s 128 bytes * 8M, which is 1 GB/s (8 Gbps) of data that is accessed at multiple layers of the stack (alloc, free, crypto, TCP, socket buffers, drivers, etc).
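The numbers above can be rechecked with a few lines of arithmetic. Two assumptions are needed to reproduce the round figures: "4K" is treated as 4,000 bytes, and a TLS record is taken to span 4 pages (16,000 bytes), which is consistent with 6 mbufs per record (4 pages plus 2 metadata mbufs).

```python
PAGE_BYTES = 4_000            # assumed decimal "4K", to match the round numbers
TLS_RECORD_BYTES = 16_000     # assumed: 4 pages per TLS record
CACHE_LINE_BYTES = 64

def mbuf_budget(link_bits_per_sec=100_000_000_000):
    line_rate = link_bits_per_sec // 8             # bytes/s on the wire
    page_traffic = 2 * line_rate                   # plaintext plus ciphertext pages
    page_mbufs = page_traffic // PAGE_BYTES        # one mbuf per page
    tls_records = line_rate // TLS_RECORD_BYTES
    tls_meta_mbufs = 2 * tls_records               # header and trailer mbufs
    total_mbufs = page_mbufs + tls_meta_mbufs
    touched_bytes = total_mbufs * 2 * CACHE_LINE_BYTES   # 2 cache lines per mbuf
    return {
        "page_mbufs": page_mbufs,
        "tls_meta_mbufs": tls_meta_mbufs,
        "total_mbufs": total_mbufs,
        "touched_gbps": touched_bytes * 8 / 1e9,
    }

budget = mbuf_budget()
print(budget)
```

Under those assumptions the totals come out to roughly 7.8M mbufs per second and 8 Gbps of mbuf metadata traffic, in line with the rounded figures in the text.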

To reduce the number of mbufs in transit, we decided to augment mbufs to allow carrying several pages of the same type in a single mbuf. We designed a new type of mbuf that could carry up to 24 pages for sendfile, and which could also carry the TLS header and trailing information in-line (reducing a TLS record from 6 mbufs down to 1). That change reduced the above 8M mbufs/sec down to less than 1M mbufs/sec. This resulted in a speed up of roughly 7 Gbps.
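A quick calculation shows how much chain length the multi-page mbuf saves on the 1 MB range request mentioned earlier. The helper name is ours, and the sketch only counts page-carrying mbufs for an unencrypted request, ignoring headers and TLS framing.

```python
import math

PAGE_BYTES = 4096
PAGES_PER_NEW_MBUF = 24       # capacity of the new mbuf type, from the text

def mbufs_for_request(request_bytes, pages_per_mbuf=1):
    """Number of mbufs needed to carry the pages of one sendfile request."""
    pages = math.ceil(request_bytes / PAGE_BYTES)
    return math.ceil(pages / pages_per_mbuf)

one_mb = 1024 * 1024
old_chain = mbufs_for_request(one_mb)
new_chain = mbufs_for_request(one_mb, PAGES_PER_NEW_MBUF)
print(old_chain, new_chain)
```

The chain for a 1 MB request shrinks from 256 mbufs to 11.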

This was not without some challenges. Most specifically, FreeBSD’s network stack was designed to assume that it can directly access any part of an mbuf using the mtod() (mbuf to data) macro. Given that we’re carrying the pages unmapped, any mtod() access will panic the system. We had to augment quite a few functions in the network stack to use accessor functions to access the mbufs, teach the DMA mapping system (busdma) about our new mbuf type, and write several accessors for copying mbufs into uios, etc. We also had to examine every NIC driver in use at Netflix and verify that they were using busdma for DMA mappings, and not accessing parts of mbufs using mtod(). At this point, we have the new mbufs enabled for most of our fleet, with the exception of a few very old storage platforms which are disk, and not CPU, limited.

Are We There Yet?

At this point, we’re able to serve 100% TLS traffic comfortably at 90 Gbps using the default FreeBSD TCP stack. However, the goalposts keep moving. We’ve found that when we use more advanced TCP algorithms, such as RACK and BBR, we are still a bit short of our goal. We have several ideas that we are currently pursuing, which range from optimizing the new TCP code to increasing the efficiency of LRO to trying to do encryption closer to the transfer of the data (either from the disk, or to the NIC) so as to take better advantage of Intel’s DDIO and save memory bandwidth.