What's New in FreeBSD 7.0

FreeBSD is back to its incredible performance and now takes advantage of multi-core/multi-CPU systems very well... so well that some benchmarks on both Intel and AMD systems showed release 7.0 being faster than Linux 2.6 when running PostgreSQL or MySQL.

Federico Biancuzzi interviewed two dozen developers to discuss all the cool details of FreeBSD 7.0: networking and SMP performance, SCTP support, the new IPSEC stack, virtualization, monitoring frameworks, ports, storage limits and a new journaling facility, what changed in the accounting file format, jemalloc, ULE, and more.

Networking

It seems network performance is much better in 7.0!

Andre Oppermann: In general it can be said that the FreeBSD 7.0 TCP stack is three to five times faster (either in increased speed where not maxed out, or reduced CPU usage). It is no problem to fill either 1 Gb/s or 10 Gb/s links to the max.

How did you reach these results?

Andre Oppermann: Careful analysis and lots of code profiling.

What type of improvements would TCP socket buffers auto-sizing offer? In which context would this feature show its best result?

Andre Oppermann: In a world of big files (video clips, entire DVDs, etc.), fast network connections (ADSL, VDSL, cable, fiber to the home), and global distribution, the traditional TCP default configuration was hitting the socket buffer limit. Because TCP offers reliable data transport, it has to keep a buffer of data sent until the remote end has acknowledged receiving it. It takes the ACK a full round trip (the RTT, as seen with ping) to make it back. Thus for fast connections over large distances, like from the U.S.A. to Europe or Asia, we need large socket buffers to keep all the unacknowledged data around.

FreeBSD had a default 32 K send socket buffer. That supports a maximal transfer rate of only slightly more than 2 Mbit/s on a 100 ms RTT transcontinental link, or just above 1 Mbit/s at 200 ms. With TCP send buffer auto-scaling at its default settings, it supports 20 Mbit/s at 100 ms and 10 Mbit/s at 200 ms (socket buffer at 256 KB) per TCP connection. That's an improvement of a factor of 10. If you have very fast Internet connections very far apart, you may want to adjust the defaults further upwards.

The nice thing about socket buffer auto-tuning is the conservation of kernel memory, which is in somewhat limited supply. The tuning happens based on actual measured connection parameters and is adjusted dynamically. For example, an SSH session on a 20 Mbit/s, 100 ms link will not adjust upwards because the initial default parameters are completely sufficient and do not slow down the session. On the other hand, a 1 GB file transfer on the same connection will cause the tuning to kick in and quickly increase the socket buffers to the max.

The socket buffer auto-tuning was extensively tested on the European half of ftp.FreeBSD.ORG. From there I was able to download a full ISO image at close to 100 Mbit/s (my local connection speed) with auto-tuning. Before, it would only go up to around 30 Mbit/s.
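The arithmetic behind those numbers is the classic bandwidth-delay product: a TCP sender can have at most one send buffer's worth of unacknowledged data in flight per round trip. A quick sketch (function name is mine, for illustration):

```python
# Bandwidth-delay product: the send buffer bounds TCP throughput,
# because at most one buffer's worth of data can be unacknowledged
# ("in flight") per round trip: throughput <= buffer_size / RTT.

def max_throughput_mbps(buffer_bytes, rtt_ms):
    """Maximum TCP throughput (Mbit/s) for a given send buffer and RTT."""
    return (buffer_bytes * 8) / (rtt_ms / 1000) / 1_000_000

# The old 32 KB default on a 100 ms transcontinental link:
print(max_throughput_mbps(32 * 1024, 100))   # ~2.6 Mbit/s
# The auto-tuned 256 KB buffer on the same link:
print(max_throughput_mbps(256 * 1024, 100))  # ~21 Mbit/s
# ...and at 200 ms:
print(max_throughput_mbps(256 * 1024, 200))  # ~10.5 Mbit/s
```

These match the figures quoted above: roughly 2 Mbit/s for the old default, and roughly 20/10 Mbit/s at 100/200 ms with auto-scaling.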

A few more performance relevant things I've changed/added to 7.0:

parallelized TCP syncache

sendfile(2) system call heavily optimized and extended to use TSO; 5.7 times better performance

a cached next-send offset pointer into the send socket buffer's mbuf chain, preventing long chain walks on large socket buffers; 2 times better performance

All this stuff accumulates quite a bit. ;-) And FreeBSD wasn't bad at all before. It just became even better than it was.
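The next-send offset optimization above can be illustrated with a toy linked list (class and method names are mine, not the kernel's): instead of walking the mbuf chain from the head on every transmission to find the first unsent byte, the stack keeps a cached (mbuf, offset) pointer to where the last send left off.

```python
# Toy model of an mbuf chain: a singly linked list of byte buffers.
# Names are illustrative; the real code lives in the kernel's socket
# buffer handling, not in Python.

class Mbuf:
    def __init__(self, data):
        self.data = data
        self.next = None

class SendBuffer:
    def __init__(self, chunks):
        self.head = None
        tail = None
        for c in chunks:
            m = Mbuf(c)
            if tail is None:
                self.head = m
            else:
                tail.next = m
            tail = m
        # Cached position of the next byte to send: (mbuf, offset-in-mbuf).
        self.sndptr = (self.head, 0)

    def next_chunk_naive(self, sent):
        """O(chain length): walk from the head to absolute offset `sent`."""
        m, off = self.head, sent
        while m and off >= len(m.data):
            off -= len(m.data)
            m = m.next
        return (m, off)

    def next_chunk_cached(self):
        """O(1): resume from the cached pointer."""
        return self.sndptr

    def advance(self, n):
        """Move the cached pointer forward by n freshly sent bytes."""
        m, off = self.sndptr
        off += n
        while m and off >= len(m.data):
            off -= len(m.data)
            m = m.next
        self.sndptr = (m, off)
```

With a large socket buffer (hundreds of mbufs), the naive walk is repeated on every send; the cached pointer makes each lookup constant-time, which is where the quoted 2x improvement comes from.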

Other than that, I've done a lot of code overhaul and refactoring, primarily in tcp_input.c and tcp_output.c, to make them more readable and maintainable again. This work is still ongoing. It has already drawn increased interest from network researchers who have to modify the code for their experimental features; the cleanup makes it much more accessible again.

Direct dispatch of inbound network traffic. What is it?

Robert Watson: Direct dispatch is a performance enhancement for the network stack. In older versions of the BSD network stack, work is split over several threads when a packet is received:

The interrupt thread runs the device driver interrupt code and link layer code to pull a packet out of the hardware. It then determines what protocol will process it and passes it on to...

The netisr thread, a network kernel thread that does IP and TCP processing, then either sends it out another interface for packet forwarding or delivers it to a socket for local delivery. It then wakes up...

The user thread, which will read the packet data out of the socket.

Direct dispatch allows the ithread to perform full protocol processing through socket delivery. This results in significantly reduced latency by avoiding enqueue/dequeue and a context switch. It can also introduce new opportunities for parallelism: there's one ithread per device, so rather than a single thread doing all IP and TCP processing for input, it now happens in multiple device-specific threads. Finally, it eliminates a possible drop point -- when the "netisr queue" overflowed, we would drop packets -- now the queue doesn't exist, the drop point is pushed back into hardware. This means we don't do link layer processing for a packet unless we will also do IP layer processing, so when the system is under very high load, we don't waste CPU on packets that would otherwise be dropped later because TCP/IP can't keep up.

Like all optimizations, it comes with some trade-offs, so you can disable it and restore netisr processing for the input path using a sysctl. However, for many workloads it can result in a significant performance improvement, especially where latency is an issue.
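The drop-point argument can be sketched with a toy model (queue limit and function names are mine, and the model ignores timing): with a bounded netisr-style queue, the ithread spends link-layer work on packets that may be dropped at the queue anyway, while direct dispatch processes each accepted packet to completion and leaves overload drops to the hardware ring.

```python
from collections import deque

NETISR_QUEUE_LIMIT = 4  # illustrative; the real limit is tunable

def ip_tcp_process(pkt):
    """Stand-in for IP + TCP input processing."""
    return pkt.upper()

def netisr_model(packets):
    """Two-stage model: the ithread enqueues, a netisr thread drains later.

    Link-layer work is already done for every arriving packet, even the
    ones dropped when the queue is full -- that work is wasted.
    """
    queue, drops = deque(), 0
    for pkt in packets:
        if len(queue) >= NETISR_QUEUE_LIMIT:
            drops += 1          # dropped after link-layer processing
        else:
            queue.append(pkt)
    delivered = [ip_tcp_process(p) for p in queue]
    return delivered, drops

def direct_dispatch_model(packets):
    """Single-stage model: full protocol processing in the ithread.

    No software queue exists to overflow; under overload, packets are
    instead dropped in the hardware ring before any CPU is spent.
    """
    return [ip_tcp_process(p) for p in packets], 0
```

In a burst of six packets against a four-entry queue, the netisr model delivers four and wastes work on two drops; the direct-dispatch model delivers everything it picked up from the hardware.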

You added support for TSO (TCP segmentation offload) and LRO (large receive offload) hardware on gigabit and faster cards. Does this mean that when these features are active the hardware partially bypasses FreeBSD's TCP/IP stack? What about bugs in hardware?

With TSO only a small part of the TCP stack is bypassed. It's the part where a large amount of data from a socket write is split up into network-MTU-sized packets. TSO can handle writes of up to 64 KB. We hand over this large chunk and tell the network card to chop it up into smaller packets for the wire. All the headers are prepared by our TCP stack; the TSO hardware in the network card then only has to increment the TCP header fields for each packet sent until all are done. This process is rather straightforward. TSO is only used for standard bulk sending of data. All special cases, like retransmits and so on, are handled completely within our stack. Bugs in hardware can and do happen. We've done extensive testing and found a specific network card where we had to disable TSO because it wasn't correctly implemented.
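The header manipulation the TSO engine performs can be sketched in a few lines (the function name is mine, and the header is reduced to just a sequence number): the stack hands down one large chunk plus a prototype header, and the hardware emits MSS-sized packets, bumping the sequence number for each one.

```python
def tso_segment(seq_start, payload_len, mss):
    """Yield (seq, length) for each on-the-wire packet the NIC would emit.

    The stack prepares one prototype TCP header for the whole chunk;
    the TSO engine only advances the sequence number per segment.
    """
    seq, remaining = seq_start, payload_len
    while remaining > 0:
        length = min(mss, remaining)
        yield (seq, length)
        seq += length
        remaining -= length

# A maximal 64 KB socket write over a standard Ethernet MSS of 1460 bytes:
segments = list(tso_segment(seq_start=1000, payload_len=64 * 1024, mss=1460))
print(len(segments))              # 45 wire packets from a single handoff
print(segments[0], segments[-1])  # (1000, 1460) ... (65240, 1296)
```

One handoff from the stack thus replaces 45 trips through the segmentation code, which is where the sendfile(2)/TSO speedups quoted earlier come from.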

LRO is actually not implemented in the network card hardware but in the device driver. All modern gigabit and higher-speed network cards batch up received packets and issue only one interrupt for them. The driver then sends them up our network stack. In many cases, especially in LAN environments, a large number of successive packets belong to the same connection. Instead of handing up each packet individually, LRO will perform some checks on the packets (port and sequence numbers among others) and merge successive packets together into one. The TCP stack then sees it as one large packet instead of many small ones. The performance benefit is the reduced overhead for entering the TCP stack.
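The merge check can be sketched like this (a simplified model, with names of my choosing: each packet is a (flow, sequence number, payload) tuple, and "flow" stands in for the address/port 4-tuple; real LRO also inspects flags, ACK numbers, and TCP options):

```python
def lro_merge(packets):
    """Merge successive in-order TCP segments of the same flow into one.

    A packet is merged into its predecessor only if it belongs to the
    same flow AND starts at exactly the next expected sequence number;
    anything else (another flow, a gap, reordering) breaks the run.
    """
    merged = []
    for flow, seq, payload in packets:
        if merged:
            m_flow, m_seq, m_payload = merged[-1]
            if flow == m_flow and seq == m_seq + len(m_payload):
                merged[-1] = (m_flow, m_seq, m_payload + payload)
                continue
        merged.append((flow, seq, payload))
    return merged

# Three back-to-back segments of one connection collapse into one packet:
print(lro_merge([("A", 0, b"xx"), ("A", 2, b"yy"), ("A", 4, b"zz")]))
```

The stack then pays the cost of TCP input processing once for the merged packet instead of once per wire packet.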

Could you tell us more about the new rapid spanning tree and link aggregation support?

Andrew Thompson: The FreeBSD bridge now supports the rapid spanning tree protocol which provides much faster spanning tree convergence. The new topology will be active in under a second in most cases compared to 30-60 seconds for legacy STP. This makes it an excellent choice for Ethernet redundancy and is the standard used by modern switches. Progress is being made to implement the VLAN-aware MST protocol extension.

The link aggregation support came from the trunk(4) framework on OpenBSD and was extended to include NetBSD's LACP/802.3ad (IEEE aggregation protocol standard) implementation. This framework allows for different operating modes to be selected including Failover, EtherChannel, and (of course) LACP, for the purpose of providing fault-tolerance and/or high-speed links.

Andrew Thompson: The wireless networking code has had a major update for 7.0. The most visible change is in scanning: it has been split out to support background scanning, which updates the scan cache during inactivity so the client can roam to the strongest AP. The scanning policies have also been modularized.

The new code has working 802.11n support, although no drivers have been released yet. Changes have also been made to allow future vap support, which gives multi-bss/multi-sta on supporting hardware; the vap work is ongoing and may be released this year. Benjamin Close added the new wpi(4) driver for the Intel 3945 wireless card, and the new USB drivers zyd(4) and rum(4) were ported over by Weongyo Jeong and Kevin Lo respectively.

The majority of the net80211 work was done by Sam Leffler with contributions from Kip Macy, Sepherosa Ziehau, Max Laier, Kevin Lo, myself, and others.

Beyond the big profiling work in the networking subsystem, you also added an implementation of the Stream Control Transmission Protocol (SCTP). Would you like to explain to us what it is and who could take advantage of it?

Randall Stewart: SCTP is a general-purpose transport protocol developed in the IETF. It is basically a "next-gen" TCP. It can be used almost anywhere you would use TCP, but there are some differences.

There are various tutorials you can find on SCTP. There is also an RFC (3286). These can give the interested person an "overview" of the protocol.

And an introduction with a really nice comparison of SCTP, TCP, and UDP can be found online.

Currently, if you are in Europe and you send an SMS message, you are using SCTP; or if you are in China and you make a phone call, you are using SCTP. SCTP is at the bottom of the "IP over SS7" stack known as sigtran. There are other places it is used as well; I know about some of them (not all) :-). For instance, it's a required part of the IPFIX protocol, which is the standardized version of "reliable netflow." You may also see it used in some instances for web access. This is still in its early stages, but it can provide some enormous benefits. The web server at www.sctp.org, for example, does both SCTP and TCP.

The University of Delaware's PEL lab (Protocol Engineering Lab) is doing some interesting work in pushing this forward; they have some very interesting videos showing the differences between TCP and SCTP. There is other information on their main web site as well (patches, for instance, for both Firefox and Apache).

Basically you can think of SCTP as a "super TCP" that adds a LOT of features, so that applications can do more with less work. So why did we put it in FreeBSD? Well, let me turn the question around: why would you expect FreeBSD NOT to have the "next generation" version of TCP available in its stack?

I believe we are actually the first to make it part of the shipping kernel. In Linux you can enable it as a module, but there are extra steps you must take. In FreeBSD it's just there, like TCP.
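From an application's point of view, SCTP's one-to-one style is deliberately TCP-like: on a system with kernel SCTP support, the switch can be as small as changing the protocol argument at socket creation. A sketch (the helper name is mine; one-to-many style sockets and SCTP-specific options require further setsockopt calls not shown here):

```python
import socket

def make_transport_socket(use_sctp=False):
    """Create a one-to-one style stream socket over TCP or SCTP.

    SCTP's one-to-one (SOCK_STREAM) style mimics TCP on purpose, so
    existing applications can switch with a one-line change. Creating
    the SCTP variant requires kernel SCTP support; without it,
    socket() raises OSError.
    """
    proto = socket.IPPROTO_SCTP if use_sctp else socket.IPPROTO_TCP
    return socket.socket(socket.AF_INET, socket.SOCK_STREAM, proto)

# IANA assigns SCTP protocol number 132:
print(socket.IPPROTO_SCTP)  # 132
```

After creation, the usual connect/send/recv calls work the same way over either transport.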

How does the new in-kernel Just-In-Time compiler for Berkeley Packet Filter programs work?

Jung-uk Kim: Berkeley Packet Filter (BPF) is a simple filter machine (src/sys/net/bpf_filter.c) which executes a filter program. The in-kernel BPF JIT compiler turns this filter program into a series of native machine instructions when the filter program is loaded. Then, instead of emulating the filter machine, the precompiled code is executed to evaluate each packet. In layman's terms:

JVM : Java JIT compiler ~= BPF : BPF JIT compiler

Please see bpf(4) and bpf(9) for more information.
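The idea can be mimicked in miniature (the two-instruction machine and all names here are invented for illustration; real BPF has a much richer instruction set): either interpret the filter program instruction by instruction for every packet, or translate it once at load time into a native callable and run that instead.

```python
# A toy two-instruction filter machine: ("ldb", offset) loads a byte
# from the packet, ("jeq", value) accepts the packet if the loaded
# byte equals value.

def interpret(program, packet):
    """Emulate the filter machine, one instruction per step, per packet."""
    acc = 0
    for op, arg in program:
        if op == "ldb":
            acc = packet[arg]
        elif op == "jeq":
            return acc == arg
    return False

def jit_compile(program):
    """Translate the program once, at load time, into a plain function.

    Subsequent packets run the generated code directly, with no
    per-instruction dispatch loop -- the same trade the in-kernel
    BPF JIT makes with native machine instructions.
    """
    ops = []
    for op, arg in program:
        if op == "ldb":
            ops.append(f"acc = packet[{arg}]")
        elif op == "jeq":
            ops.append(f"return acc == {arg}")
    src = "def filt(packet):\n    acc = 0\n    " + "\n    ".join(ops)
    namespace = {}
    exec(src, namespace)
    return namespace["filt"]

# Accept packets whose first byte is 0x45 (an IPv4 version/IHL byte):
prog = [("ldb", 0), ("jeq", 0x45)]
fast = jit_compile(prog)
print(interpret(prog, b"\x45\x00"), fast(b"\x45\x00"))  # True True
```

Both paths give identical answers; the compiled one simply skips the emulation overhead on every packet, which is the whole point of the JIT.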

The BPF JIT compiler for i386 was ported from WinPcap 3.1, from the NetGroup at Politecnico di Torino. amd64 support was then added by me. According to the change log, this feature first showed up in WinPcap 3.0 alpha 2.

Could you tell us more about the migration from KAME IPsec to Fast IPsec?

Bjoern A. Zeeb: In November 2005 KAME announced "mission completed" on their excellent, highly appreciated contributions to the FreeBSD Project, developing and deploying an IPv6/IPsec reference implementation. As a consequence maintainership of their code was handed over to the different BSD projects.

FreeBSD already had a second IPsec implementation, done by Sam Leffler, which was derived from KAME and OpenBSD work but is fully SMP safe. That means it can exploit the multiple cores/CPUs of modern hardware to improve performance. Fast IPsec also uses the crypto(4) framework, supporting crypto accelerator cards, and supports features like the virtual enc(4) interface, which permits filtering of IPsec traffic.

George Neville-Neil (you can find his BSDTalk interview on IPsec online) implemented and committed IPv6 support for Fast IPsec, which was the last missing key feature.

With FreeBSD 7 and onwards, Fast IPsec, now simply called IPSEC, is the only implementation supported. To use IPsec, people will need to update their kernel configuration files according to the information given in the ipsec(4) manual page.

With the current implementation FreeBSD is well positioned, and we look forward to integrating more contributed work over the coming months to implement new standards, further improve scaling, and more.