Code Coverage: Challenges For Open Source Projects

08 August, 2014

In this blog post I'll pose a few challenges to open source projects. These challenges,
among other goals, are designed to push automated test suites to maximize code coverage.
For motivation, I'll present code coverage stats from the OpenSSL and OpenSSH test suites,
and contrast this with what I'm trying to achieve in the fwknop project.

With major bugs like Heartbleed and "goto fail", there is clearly a renewed need for better
automated testing. As Mike Bland demonstrated, the
problem isn't so much about technology since unit tests can be built for both Heartbleed
and "goto fail". Rather, the problem is that the whole software engineering life cycle
needs to embrace better testing.

1) Publish your code coverage stats

Just as open source projects publish source code, test suite code coverage results should
be published too. The mere act of publishing code coverage - just as open source itself -
is a way to engage developers and users alike to improve such results. Clearly displayed
should be all branches, lines, and functions that are exercised by the test suite, and
these results should be published for every software release. Without such stats, how can
users have confidence that the project test suite is comprehensive? Bugs still crop up
even with good code coverage, but many bugs will be squashed during the
development of a test suite that strives to exercise every line of source code.

For a project written in C and compiled with gcc, the lcov tool provides a nice front-end for
gcov coverage data. lcov was used to generate the following code coverage reports on
OpenSSL-1.0.1h, OpenSSH-6.6p1, and fwknop-2.6.3 using the respective test suite in each project:

The numbers speak for themselves. To be fair, fwknop has a much smaller code base than OpenSSL
or OpenSSH - less than 1/10th the size of OpenSSL and about 1/5th the size of OpenSSH in terms
of lines reported by gcov. So, presumably it's a lot easier to reach higher levels of code
coverage in fwknop than the other two projects. Nevertheless, it is shocking that the line
coverage in the OpenSSL test suite is below 50%, and not much better in OpenSSH. What are the odds that
the other half of the code is bug free? What are the odds that changes in newer versions won't
break assumptions made in the untested code? What are the odds that one of the next security
vulnerabilities announced in either project stems from this code? Of course, test suites are
not a panacea, but there are almost certainly bugs lying in wait within the
untested code. It is easier to have confidence in code that has at least some test
suite coverage than code that has zero coverage - at least as far as the test suite
is concerned (I'm not saying that other tools are not being used).
Both OpenSSL and OpenSSH use tools beyond their respective test suites to try and maintain
code quality - OpenSSL uses Coverity for example -
but these tools are not integrated with test suite results and do not contribute to the code
coverage stats above. What I'm suggesting is that the test suites themselves should get much
closer to 100% coverage, and this may require the integration of infrastructure like a custom
fuzzer or fault injection library (more on this below).

On another note, given that an explicit design goal of the fwknop test suite is to maximize
code coverage, and given the results above, it is clear there is significant work left to do.

2) Make it easy to automatically produce code coverage results

Neither OpenSSL nor OpenSSH makes it easy to automatically generate code coverage stats like
those shown above. One can always Google for the appropriate CFLAGS and LDFLAGS settings,
recompile, and run lcov, but you shouldn't have to. This should
be automatic and built into the test suite as an option. If your project is using autoconf,
then there should be a top level --enable-code-coverage switch (or similar) to the
configure script, and the test suite should take the next steps to produce the
code coverage reports. Without this, there is unnecessary complexity and manual work,
and this affects users and developers alike. My guess is this lack of automation is one reason
code coverage for OpenSSL and OpenSSH is not better. Of course, it takes a lot of
effort to develop a test suite with comprehensive code coverage support, but automation is
low hanging fruit.
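
As a rough sketch of the manual steps such a switch would hide, the small C file below can be
built with gcc's --coverage flag and then summarized with gcov and lcov. The file name
cov_demo.c, the parse_port() function, and the COV_DEMO_MAIN guard are illustrative
assumptions and are not taken from OpenSSL, OpenSSH, or fwknop.

```c
/*
 * cov_demo.c - hypothetical file used only to illustrate gcov/lcov.
 *
 * Assumed manual workflow (gcc, gcov, and lcov installed):
 *   gcc --coverage -O0 -DCOV_DEMO_MAIN -c cov_demo.c
 *   gcc --coverage cov_demo.o -o cov_demo
 *   ./cov_demo                  (writes the .gcda run-time data)
 *   gcov -b cov_demo.c          (line and branch summary)
 *   lcov --capture --directory . --output-file cov.info
 *   genhtml cov.info --output-directory cov_html
 */
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Parse a TCP/UDP port number; returns the port, or -1 on any error. */
int parse_port(const char *str)
{
    char *end = NULL;
    long val;

    if (str == NULL || *str == '\0')
        return -1;              /* error path: empty input */

    errno = 0;
    val = strtol(str, &end, 10);

    if (errno != 0 || *end != '\0')
        return -1;              /* error path: not a clean decimal number */

    if (val < 1 || val > 65535)
        return -1;              /* error path: out of range */

    return (int)val;
}

#ifdef COV_DEMO_MAIN
/* A deliberately thin "test suite": only the success path is exercised. */
int main(void)
{
    assert(parse_port("62201") == 62201);
    return 0;
}
#endif
```

Because the stand-in test only feeds valid input, the coverage report should mark the three
error returns as never executed, which is exactly the kind of gap that published stats expose.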

If you want to generate the code coverage reports above, here are two trivial scripts - one for
OpenSSH and another for OpenSSL. This one works for OpenSSH:

3) Integrate a fuzzer into your test suite

If your test suite does not achieve 99%/95% function/line coverage, architectural changes
should be made to reach these goals and beyond. This will likely require that the test suite
drive a fuzzer against your project and measure how well it exercises the code base.

Looking at code coverage results for older versions of the fwknop project was an eye opener.
Although the test suite had hundreds of tests, there were large sections of code that were
not exercised. It was for this reason the 2.6.3 release concentrated on more comprehensive
automated test coverage. However, achieving better coverage was not a simple matter of
executing fwknop components with different configuration files or command line arguments -
it required the development of a dedicated SPA packet fuzzer along with a special macro,
-DENABLE_FUZZING, built into the source code to allow the fuzzer to reach portions
of code that would have otherwise been more difficult to trigger due to encryption and
authentication requirements. This is similar to the strategy proposed in Michal Zalewski's
fuzzer American Fuzzy Lop (see the "Known Limitations" section).
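
The general idea behind such a macro can be sketched as follows. This is not fwknop's actual
code: only the ENABLE_FUZZING name comes from the text above, while handle_packet(),
tag_is_valid(), parse_payload(), and the toy XOR tag are hypothetical stand-ins for real
authentication and parsing routines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a real authentication check (e.g. an HMAC):
 * the last byte must equal the XOR of all preceding bytes. */
int tag_is_valid(const uint8_t *pkt, size_t len)
{
    uint8_t x = 0;
    size_t i;

    for (i = 0; i + 1 < len; i++)
        x ^= pkt[i];
    return len > 0 && pkt[len - 1] == x;
}

/* Hypothetical parser: the code a fuzzer actually needs to reach.
 * Format: one length byte followed by exactly that many payload bytes. */
static int parse_payload(const uint8_t *body, size_t len)
{
    if (len < 1)
        return -1;
    if ((size_t)body[0] != len - 1)
        return -1;
    return (int)body[0];
}

/* Returns the payload length on success, -1 on a parse error,
 * and -2 when the packet is empty or its tag does not verify. */
int handle_packet(const uint8_t *pkt, size_t len)
{
    assert(pkt != NULL);

    if (len == 0)
        return -2;

#ifndef ENABLE_FUZZING
    /* Normal builds: random fuzzer input almost never gets past here. */
    if (!tag_is_valid(pkt, len))
        return -2;
#endif

    /* Fuzzing builds skip the tag check so mutated input reaches the parser. */
    return parse_payload(pkt, len - 1);
}
```

A build with -DENABLE_FUZZING should only ever be used by the test suite, since it removes the
very check that protects the parser in production.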

The main point is that fwknop was changed to support fuzzing driven by the test suite as a
way to extend code coverage. It is the strong integration of fuzzing into the test
suite that provides a powerful testing technique, and looking at code coverage results allows
you to measure it.

Incidentally, the fwknop packet fuzzer, driven by the test suite, triggered a double free()
bug that was fixed in 2.6.3.

4) Further extend your code coverage with fault injection

Any C project leveraging libc functions should implement error checking against function
return values. The canonical example is checking to see whether malloc() returned
NULL, and if so this is usually treated as an unrecoverable error like so:
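
A minimal sketch of that pattern, assuming a hypothetical clean_up() hook and copy_or_die()
helper (neither is taken from any real project):

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical cleanup hook: release resources, then exit with failure. */
static void clean_up(int exit_code)
{
    /* ... close sockets, remove PID files, zero out key material ... */
    exit(exit_code);
}

/* Duplicate a string, treating allocation failure as unrecoverable. */
char *copy_or_die(const char *src)
{
    size_t len;
    char *buf;

    assert(src != NULL);
    len = strlen(src) + 1;

    buf = malloc(len);
    if (buf == NULL) {
        /* Rarely reached in practice, so rarely covered by a test suite. */
        fprintf(stderr, "[*] Could not allocate %zu bytes\n", len);
        clean_up(EXIT_FAILURE);
    }

    memcpy(buf, src, len);
    return buf;
}
```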

Some projects elect to write a "safe_malloc()" wrapper for malloc() or other libc functions so
that error handling can be done in one place, but it is not feasible to do this for every libc
function.
So, how to verify whether error conditions are properly handled at run
time? For malloc(), NULL is typically returned under extremely high memory
pressure, so it is hard to trigger this condition and still have a functioning system let
alone a functioning test suite. In other words, in the example above, how can the test suite
achieve code coverage for the clean_up() function? Other examples include filesystem
or network function errors that are returned when disks fill up, or a network communication
is blocked, etc.

What's needed is a mechanism for triggering libc faults artificially, without requiring the
underlying conditions to actually exist that would normally cause such faults. This is where
a fault injection library like libfiu comes in.
Not only does it support fault injection at run time against libc functions without the need to
link against libfiu (a dedicated binary "fiu-run" takes care of this), but it can
also be used to trigger faults in arbitrary non-libc functions within a project to see how
function callers handle errors. In fwknop, both strategies are used by the test suite, and
this turned up a number of bugs.
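
To make the second strategy concrete, here is a hand-rolled stand-in for a named fault point.
It deliberately does not use libfiu's API: fault_arm(), fault_hit(), get_username(), and the
"get_username/malloc" tag are all hypothetical names chosen for this sketch, which only shows
how a test can force an error path that would otherwise need real memory exhaustion.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Name of the currently armed fault point, or NULL if none is armed. */
static const char *armed_fault = NULL;

/* Arm a fault point by name; pass NULL to disarm. */
void fault_arm(const char *name)
{
    armed_fault = name;
}

/* Returns 1 if the named fault point is armed, 0 otherwise. */
static int fault_hit(const char *name)
{
    return armed_fault != NULL && strcmp(armed_fault, name) == 0;
}

/* Hypothetical function under test: copies a user name into a fresh buffer.
 * Returns 0 on success, -1 on allocation failure. */
int get_username(const char *src, char **out)
{
    size_t len;
    char *buf;

    assert(src != NULL && out != NULL);
    len = strlen(src) + 1;

    /* Pretend malloc() failed whenever the test suite asks for it. */
    buf = fault_hit("get_username/malloc") ? NULL : malloc(len);
    if (buf == NULL) {
        *out = NULL;
        return -1;              /* error path, now reachable on demand */
    }

    memcpy(buf, src, len);
    *out = buf;
    return 0;
}
```

A test would call fault_arm("get_username/malloc"), confirm that get_username() returns -1
and that its caller cleans up correctly, and then disarm the fault with fault_arm(NULL).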

Full disclosure: libfiu does not yet support code coverage when executing a
binary under fiu-run because there are problems interacting with libc functions necessary
to write out the various source_file.c.gcno and source_file.c.gcda coverage files. This
issue is being worked on for an upcoming release of libfiu. So, in the context of fwknop,
libfiu is used to trigger faults directly in fwknop functions to see how calling functions
handle errors, and this strategy is compatible with gcov coverage results. The fiu-run tool
is also used, but more from the perspective of trying to crash one of the fwknop binaries
since we can't (yet) see code coverage results under fiu-run. Here is an example fault
introduced into the fko_get_username() function:

With the fault set (there is a special command line argument --fault-injection-tag on
the fwknopd server command line to enable the fault), the error handling code seen at the
end of the example below is executed via the test suite. For proof of error handling execution,
see the full coverage report (look at line 240).

Once again, it is the integration of fault injection with the test suite and
corresponding code coverage reports that extends testing efforts in a powerful way.
libfiu offers many nice features, including thread safety, the ability to enable a fault
injection tag relative to other functions in the stack, and more.

5) Negative valgrind findings should force tests to fail

So far, a theme in this blog post has been better code coverage through integration.
I've attempted to make the case for the integration of fuzzing and fault injection with project
test suites, and code coverage stats should be produced for both styles of testing.

A third integration effort is to leverage valgrind. That is, the test suite should run tests
underneath valgrind when possible (speed and memory usage may be constraints here depending
on the project). If valgrind discovers a memory leak, double free(), or other problem, this
finding should automatically cause corresponding tests to fail. In some cases valgrind
suppressions will need to be created if a project depends on libraries or other
code that is known to have issues under valgrind, but findings within project sources should
cause tests to fail. For projects heavy on the crypto side, there are some instances where
code is very deliberately built in a manner that triggers a valgrind error (see Tonnerre
Lombard's write up on the old
Debian OpenSSL vulnerability), but these are not common occurrences and suppressions can
always be applied. The average valgrind finding in a large code base should cause test
failure.
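
One way a test harness might enforce this is sketched below, assuming a POSIX system with
valgrind on the PATH. The helper names run_command() and test_passes_under_valgrind() and the
exit code 99 are choices made for this sketch, not part of any project's test suite. The
valgrind options used (--quiet, --leak-check=full, --error-exitcode) are standard ones.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork, exec argv[0] via PATH, and return the child's exit status.
 * Returns -1 if fork/waitpid fails or the child was killed by a signal;
 * a failed exec shows up as exit status 127. */
int run_command(char *const argv[])
{
    int status = 0;
    pid_t pid;

    assert(argv != NULL && argv[0] != NULL);

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);             /* only reached if exec failed */
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Run a test binary (taking no arguments) under valgrind.
 * Returns 1 only if the binary exited 0 and valgrind reported no errors. */
int test_passes_under_valgrind(char *test_binary)
{
    char *argv[] = {
        "valgrind",
        "--quiet",
        "--leak-check=full",        /* report leaks as errors */
        "--error-exitcode=99",      /* 99 is an arbitrary marker value */
        test_binary,
        NULL
    };

    return run_command(argv) == 0;
}
```

With --error-exitcode set, valgrind exits with that value when it finds errors and otherwise
passes through the program's own exit status, so a single comparison against zero covers both
ordinary test failures and valgrind findings. Suppression files can be added to the argument
list for known issues in third-party libraries.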

Although running tests under valgrind will not expand code coverage, valgrind is a powerful
tool that should be tightly integrated with your test suite.