Last year in an online discussion someone used in-kernel TCP stacks as
a canonical example of code that you can't apply modern testing
practices to. Now, that might be true, but if so the operative phrase
there is "in-kernel", not "TCP stack". When the
TCP implementation is just a normal user-space application,
there's no particular reason it can't be written in a way
that's testable and amenable to a test driven development
approach.

The first versions of Teclo's
TCP stack were written as a classic monolithic systems application
with lots of explicit and implicit global state, not as something that
could be treated as a library let alone something you could reasonably
mock out parts of. As such it was totally unsuited for any kind of
automated testing. The best you could do was run the system in various
kinds of simulated network environments and check that it was getting
roughly the same speeds from one release to the next. We've also got
over 50 configuration parameters for tweaking the behavior of the TCP
algorithms, which would make for a hell of a test matrix. Repeated
manual testing of all these parameters would probably require a person
to do nothing but run those tests full time.

This was clearly not tenable in the long term, so getting some kind of
deterministic and automatable tests up was a high priority. Soon after
we were finished with the rush to ship a first version, we refactored
things a bit for better testability and got at least some rudimentary
tests running.

How we write tests

What would make a TCP implementation particularly tricky to test?

Our TCP flow record has over 70 state variables and 10 timers (some of
which can interact with each other). And we need two of those records
for a single TCP connection, with the state of one flow potentially
affecting the behavior of the other [1]. With this much interlinked state it is hard
to feel confident about any testing that tries to artificially set up
only the relevant variables. Even if such setup is done correctly right
now, it would be very easy for those assumptions to break as the code
changes, invalidating the tests.

In general the appropriate unit of testing here is the TCP stack as a
whole, rather than e.g. somehow trying to test a feature like zero
window probing in isolation just by calling a method that implements
that feature. The latter would be an absurd idea, since interesting
TCP features end up being a lot more cross-cutting than that. Of
course I don't mean testing the application as a whole either. We
chop the application off at the core event loop, which would
normally handle polling the NICs for packets, updating the system's
idea of the current time between packets, running timers when
appropriate, and occasionally receiving RPC messages from the
management system. All of this detail is luckily irrelevant for
testing the core TCP algorithms.

Instead, for testing we create an instance of the TCP stack that
replaces the normal NIC-based IO backend with a callback-based one.
The test driver injects packets directly into the TCP stack by calling
the appropriate entry point. When the TCP stack wants to emit a
packet, that triggers a callback in the test driver, and we can check
whether the contents of the packet are what was expected. Finally,
the other entry point to the TCP stack is implicit, through timer
callbacks. To handle this case, we need to replace the wall clock
based time source with a virtual one, and give the test driver the
responsibility for triggering timers.
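Concretely, the seams might look something like this (a sketch with
made-up names, not the actual API):

    #include <cstdint>
    #include <functional>

    struct packet;   // raw bytes plus header accessors, defined elsewhere

    struct virtual_clock {
        uint64_t now_us = 0;   // microseconds since the start of the test
    };

    class tcp_stack_under_test {
    public:
        // Packets the stack wants to emit are handed to this callback
        // instead of a NIC transmit queue.
        using emit_fn = std::function<void(int iface, const packet&)>;

        tcp_stack_under_test(virtual_clock* clock, emit_fn on_emit);

        // The test driver injects input packets directly.
        void input(int iface, const packet& p);

        // Run any timers that are due according to the virtual clock.
        void run_timers();
    };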

The second problem is expressing the test cases in a convenient
manner. The first thought here was expressing the test cases as pcap
format trace files [2]. The trace files would theoretically have
exactly the right information: the exact packet contents and
microsecond-accurate timing information for the test driver to work
from. This approach turns out not to be good for much. Artificial
test cases are very hard to create and update, debugging test failures
is painful, and testing for anything other than the packets that are
output (e.g. counters) is impossible [3]. No, the test cases really
need to be expressed in code.

For that to work, there needs to be a simple way of describing packets
both for the purposes of generating packets as well as comparing
output packets to expectations. Now, we happened to have some code
around for pretty-printing many kinds of packets as JSON. That sounded
like a perfect tool for the job; we'd just need a bit of code to do
the reverse operation.
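The JSON looked roughly like this (a reconstructed illustration of the
general shape, not the exact output):

    {
        "src_ip": "10.0.0.1",
        "src_port": 33617,
        "dst_ip": "10.0.0.2",
        "dst_port": 80,
        "flags": ["SYN"],
        "seq": 1283814342,
        "ack": 0,
        "window": 14600,
        "options": { "mss": 1460, "wscale": 7, "sack_ok": true }
    }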

Right... That just won't work. First, it's way too verbose, since even
a simple test will involve tens of packets. You could eliminate some
of the verbosity in the packet generation by using lots of
defaulting, but that doesn't really work for checking outputs against
expects. This kind of 1:1 mapping to raw fields in the packet is also
not what's really needed. Things like advertised windows and sequence
ranges are pretty much the things I care most about when specifying a
test case. But the actual advertised window can't be determined from a
single packet; you need to know the window scaling factor that's in
use, which is only available in the SYN. Likewise the starting
sequence number of a segment is in the TCP header, but the ending
sequence number is implicit.

What we need is something designed for humans to read. The obvious
choice here was to pattern it after the tcpdump output
format since we read packet dumps in that format every day.
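In that style, a packet description ends up looking something like
this (an illustrative line; the exact syntax details are approximate):

    10.0.0.1.33617 > 10.0.0.2.80: S 0:0(0) win 14600 <mss 1460,sackOK,wscale 7>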

That's still a bit chubby, but the way this works in practice is that
we make a PacketGenerator object with defaults for the fields that are
generally going to be constant over the lifetime of the connection
(just generally, though; if they need to change, that's no problem):
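(What follows is a sketch with made-up names rather than the exact
API.)

    // The generator holds per-connection defaults; each generate()
    // call spells out only what varies from packet to packet.
    PacketGenerator client;
    client.set_src("10.0.0.1", 33617);
    client.set_dst("10.0.0.2", 80);
    client.set_wscale(7);

    inject(LAN, client.generate("S 0:0(0) win 14600 <mss 1460,sackOK,wscale 7>"));
    inject(LAN, client.generate(". 1:1(0) ack 1 win 14600"));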

What about expects then? In the normal mode of operation we simply
push the string representation of expected packets to a per-interface
queue. When the TCP stack tries to emit a packet, it ends up instead
in a callback in the test driver. The callback pretty-prints the
packet, and compares it to the first string in the queue for the
output interface. If the strings don't match, or if the queue is
empty, the test fails (and thanks to having readable string
representations it's generally completely obvious where in the test
the problem was, and what the difference between expected and actual
results was). To eliminate the repetition in the pretty-printed
representation, we also typically have a small macro that fills in
the layer 2 / layer 3 information.
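A minimal sketch of that mechanism (pretty_print() and fail_test()
are stand-ins for whatever the real harness uses):

    #include <deque>
    #include <map>
    #include <string>

    struct packet;                                // as in the earlier sketch
    std::string pretty_print(const packet& p);    // the same pretty-printer
    void fail_test(const std::string& msg);       // aborts the current test

    // Per-interface queues of expected packet strings.
    std::map<int, std::deque<std::string>> expect_queues;

    void expect(int iface, const std::string& s) {
        expect_queues[iface].push_back(s);
    }

    // Installed as the stack's emit callback in tests.
    void on_emit(int iface, const packet& p) {
        auto& q = expect_queues[iface];
        std::string actual = pretty_print(p);
        if (q.empty())
            fail_test("unexpected packet: " + actual);
        else if (q.front() != actual)
            fail_test("expected: " + q.front() + "\n  actual: " + actual);
        else
            q.pop_front();
    }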

The last bit of basic functionality is manipulating time. This is very
simple, as long as it's easy to substitute some kind of a virtual
clock for the real clock: just move the clock forward by the requested
amount, run any timers, and check that the expect queues are
empty. The only tricky bit here is advancing the clock in minimum
timer quanta rather than all at once. This matters since a timer
getting run might cause a timer (either the same or a different one)
to be (re)scheduled.
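In sketch form, assuming the stack can report its next pending timer
deadline (the accessor name is made up, and clock and stack are the
harness objects from the earlier sketch):

    #include <algorithm>
    #include <cstdint>

    // Step the virtual clock one timer deadline at a time, so that
    // timers (re)scheduled by other timers fire at the right moments.
    void advance_time(uint64_t delta_us) {
        uint64_t target = clock.now_us + delta_us;
        while (clock.now_us < target) {
            // next_timer_deadline() is assumed to return a time strictly
            // in the future (or UINT64_MAX when no timer is pending).
            clock.now_us = std::min(target, stack.next_timer_deadline());
            stack.run_timers();
        }
        check_expect_queues_empty();   // no expected packet left unemitted
    }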

There are a few cases where the textual representation is
insufficient. For example, maybe some header field that needs to be
changed is too obscure to be worth including in the parser and the
pretty-printer. For cases like this the packet structure returned by
generate() can be modified before being injected. Likewise there's
another version of expect that takes a callback function for doing
arbitrary checks on the packet, rather than just a string comparison.
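For example, something along these lines (the field accessors are
made up for illustration):

    // Poke an obscure header field directly before injecting...
    packet* p = client.generate(". 1:1461(1460) ack 1 win 14600");
    p->ip_header()->tos = 0x10;   // a field not worth teaching the parser
    inject(LAN, p);

    // ...and check an output with an arbitrary predicate instead of a
    // string comparison.
    expect_fn(WAN, [](const packet& p) {
        return p.tcp_header()->urg_ptr == 0;
    });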

Finally, it turns out that when testing edge cases of TCP behavior
it's often very convenient to run a bunch of alternate scenarios
starting from some particular socket state. A normal solution here
might be to make the state cloneable, but that's something we actively
don't want to do in the normal application, and maintaining the
copying code would be fragile and an unnecessary hassle. Instead, for
testing we have little BEGIN_FORK and END_FORK macros that run a block
of code in a forked process and quit the parent process if the child
process errors out, with the alternate scenarios each running in their
own forked process. It's not an ideal setup, since forking makes the
experience of using tools like gdb or valgrind a bit rough.
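The macros are essentially a thin wrapper around fork() and waitpid();
a simplified sketch of the idea:

    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    // Run the enclosed block in a child process so the parent's TCP
    // state is untouched; if the child fails, take the parent down
    // with it. (Error handling for fork() elided.)
    #define BEGIN_FORK { pid_t fork_pid_ = fork();            \
                         if (fork_pid_ == 0) {
    #define END_FORK       exit(0);                           \
                         } else {                             \
                             int status_;                     \
                             waitpid(fork_pid_, &status_, 0); \
                             if (!WIFEXITED(status_) ||       \
                                 WEXITSTATUS(status_) != 0)   \
                                 exit(1);                     \
                         } }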

This also makes for pretty large tests. A typical test (containing
several subtests through the fork hack) is around 100-150 lines
long. Unsurprisingly the tests end up a lot longer than the code being
tested. Code coverage of the relevant files is at about 93%, which is
good enough (most of the code that isn't covered is probably never
executed in production; it's old experimental features hidden behind
flags not enabled by default, paranoid error checking code for
situations that would be very hard or impossible to write a test
to trigger, etc).

What we can and can't test

One implied objection to testing TCP implementations is that you only
really test completely trivial things, and most of the trouble comes
from the nature of TCP as a system with complex and distributed
state. So what kinds of tests can you express using this setup? Let's
use the earlier example of zero windows. Here are cases you might want
to test for, all easy enough to do (and some of which we really want
to test multiple times with different configuration parameters); a
sketch of one of them follows the list:

Receiving a zero window in the SYNACK, with the window getting opened by a separate ACK only once the 3WHS finishes.

Receiving a zero window in the SYNACK, with the window never getting opened.

The window starting at a reasonable value, but shrinking to zero during the connection, then opening up again (either naturally or as a
reaction to zero window probing).

Probes getting sent at the expected timeouts if the zero window condition persists for too long.

Advertising a zero window yourself to one of the endpoints when buffers
are full. Check that anything sent in excess of the advertised window is
properly dropped.

Correctly reacting to zero window probes sent by that endpoint.
(Both the "still zero" and "I have some space now" cases).
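As promised, here's roughly how one of these scenarios might be
written with the primitives described above (the persist timeout and
the exact strings are illustrative, as is the seconds() helper):

    // The remote endpoint closes its window mid-connection; check that
    // data is held back and a probe goes out on the persist timeout
    // (an illustrative 5 seconds here).
    inject(WAN, server.generate(". 1:1(0) ack 1001 win 0"));

    // Nothing may be emitted while the window is zero; advance_time()
    // also checks that the expect queues stay empty.
    advance_time(seconds(1));

    // Then the persist timer fires and a one-byte probe is emitted.
    expect(WAN, "1001:1002(1) ack 1 win 14600");
    advance_time(seconds(4));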

Are these kinds of tests interesting or useful? I'd like to think
so. At least that list is a mix of things we got wrong at one time or
another, things we've seen others get wrong, and tests done just in
case. Thinking about the test cases also gets you into an adversarial
mode of thought, where it's easier to see the cases that were left
unhandled.

Of course this kind of testing has its limits, and couldn't
possibly detect all failures. As I've written
earlier, TCP
is harder than it looks mainly because of the bizarre
interoperability failures. Unit testing can catch algorithm bugs
during development, but will at best act as a regression test for
problems encountered with endpoints that behave in completely unexpected
ways. Nor does this help at all with testing some other parts that are
on the critical traffic path like our custom device drivers.
But you can't let the perfect be the enemy of the good.

Even when there are corners of the system that you can't test,
I've still found unit testing and a semi-TDD approach [4] to be hugely
valuable in this problem space, and I've found myself leaning on
writing the test cases before the code much more heavily than in any
previous project. In fact, if we've got a bug report and a theory
about what could be going on, the first step is writing a test case to
verify or disprove the theory. It's just an order of magnitude faster
to set up a fully controlled test case with this system than it would
be to try to recreate the hypothetical network conditions required for
the bug to manifest.

There are some nice side benefits in addition to the typical gains
from testing. One is that we get IPv6 test coverage essentially for
free: we can run the same tests twice, once with the packet generator
making IPv4 packets and once with it generating IPv6 ones. It mostly
just requires a bit of finesse with the packet pretty-printing /
parsing to account for the different IP address size.
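Concretely, that can be as simple as parameterizing the suite over the
address family (a sketch with made-up helper names):

    #include <sys/socket.h>   // AF_INET, AF_INET6

    // Run the same test bodies once per address family; only the
    // generator's address defaults differ.
    for (int family : {AF_INET, AF_INET6}) {
        PacketGenerator client = make_generator(family);   // hypothetical
        run_zero_window_tests(client);
        // ... and the rest of the suite ...
    }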

Conclusion

Anyway, I'm really happy with this setup for low level network
programming. If it truly is the case that in-kernel TCP stacks are
untestable, maybe that's just another reason to get networking out of
the OS and into userspace.

Footnotes

[1] Our
TCP stack is part of a transparent performance
enhancing proxy. It splits every TCP connection in two parts without
terminating the connections. The TCP connection is only taken over by
the proxy after the initial handshake finishes, so both endpoints end
up having a compatible view of the TCP options and sequence numbers
used for the connection. This means that we essentially run a separate
and full TCP stack for both halves of the connection, but e.g. the
amount of data that has been acked on one half affects how much window
space we want to advertise on the other half.

[2] One file per interface for
the inputs, one file per interface for expected outputs, have the test
driver compare actual outputs to expected
ones.

[3]
Not just guessing: I know it's basically useless, since I later
implemented this model for creating regression tests for issues we
already had example traces for. Theoretically this allowed creating
new test cases with almost zero effort, but the tests were so annoying
to validate and maintain that we only ever made 4 of them. This is
odd, because in past lives this general form of testing has been my
tool of choice over lovingly handcrafted artisanal unit tests.

[4]
Semi-TDD, since the diehards
wouldn't be happy with testing essentially a single static entry point for
klocs and klocs of code.