An informative guide on using AFL and libFuzzer.

The point of security software is to make a system more secure. When developing software, one definitely doesn’t want to introduce new points of failure, or to increase the attack surface of the system the software is running on. So naturally, we take secure coding practices and software quality seriously. One good example of how we strive to improve software quality and security at F-Secure is our Vulnerability Reward Program that’s been running for almost two years now. And it’s still running, so do participate if you have a chance! Earlier this year, we posted an article detailing what we learned during the first year. It goes without saying that we have many processes in-house to catch potential bugs and vulnerabilities in our software. In this article, we’d like to explain one of the many processes we use in-house to find vulnerabilities before they reach our customers, and our dear bug bounty hunters.

One method for bug hunting that has proven to be very effective is a technique called fuzzing, where the target program is injected with unexpected or malformed data, in order to reveal input handling errors leading to, for example, an exploitable memory corruption. To create fuzz test cases, a typical fuzzer will either mutate existing sample inputs, or generate test cases based on a defined grammar or ruleset. An even more effective way of fuzzing is coverage guided fuzzing, where program execution paths are used to guide the generation of more effective input data for test cases. Coverage guided fuzzing tries to maximize the code coverage of a program, such that every code branch present in the program is tested. With the emergence of open source coverage guided fuzzing tools such as American Fuzzy Lop (AFL), LLVM libFuzzer, and HonggFuzz, using coverage guided fuzzing has never been easier or more approachable. You no longer need to master arcane arts, spend countless hours writing test case generator rules, or collect input samples that cover all functionality of the target. In the simplest cases you can just compile your existing tool with a different compiler, or isolate the functionality you want to fuzz, write just a few lines of code, and then compile and run the fuzzer. The fuzzer will execute thousands or even tens of thousands of test cases per second, and collect a set of interesting results from triggered behaviors in the target.

If you want to get started with coverage guided fuzzing yourself, here are a couple of examples showing how you’d fuzz libxml2, a widely used XML parsing and toolkit library, with two fuzzers we prefer in-house: AFL and LLVM libFuzzer.

Fuzzing with AFL

Using AFL for a real world example is straightforward. On Ubuntu 16.04 Linux you can start fuzzing libxml2 through its xmllint utility with just seven commands.

First we install AFL and get the source code of libxml2-utils.

$ apt-get install -y afl
$ apt-get source libxml2-utils

Next we configure the libxml2 build to use the AFL compilers and compile the xmllint utility.

$ cd libxml2/
$ ./configure CC=afl-gcc CXX=afl-g++
$ make xmllint

Lastly we create a sample file with content “<a></a>” for AFL to start with and run afl-fuzz.
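These last two steps might look like the sketch below. The directory names in and out match the paths used in this article. The TARGET path is an assumption: a libtool build may leave the real executable under .libs/ and make ./xmllint a wrapper script, so adjust it to your build tree.

```shell
# Seed directory holding one tiny, well-formed XML document.
mkdir -p in
printf '<a></a>' > in/sample

# Path to the AFL-instrumented binary (assumption: adjust to your build tree).
TARGET=${TARGET:-./xmllint}

# Start fuzzing. afl-fuzz substitutes @@ with the path of each generated input.
# Wrapped in a function so that nothing starts until you call start_afl.
start_afl() {
    afl-fuzz -i in -o out -- "$TARGET" @@
}
```

Calling start_afl opens the AFL status screen and keeps running until you stop it with Ctrl-C.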

AFL will continue fuzzing indefinitely, writing inputs that trigger new code coverage in ./out/queue/, crash triggering inputs in ./out/crashes/ and inputs causing hangs in ./out/hangs/. For more information on how to interpret AFL’s status screen, see: http://lcamtuf.coredump.cx/afl/status_screen.txt

Fuzzing with LLVM libFuzzer

Let’s now fuzz libxml2 with the LLVM libFuzzer. To start fuzzing, you’ll first need to introduce a target function, LLVMFuzzerTestOneInput, that receives the fuzzed input buffer from libFuzzer. The code looks like this.
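A minimal sketch of such a target function follows, assuming the fuzzed buffer should go through libxml2's xmlReadMemory parser entry point. That choice, the silent error handler and the file name xml_fuzzer.c are assumptions made for illustration.

```shell
# Write the fuzz target to xml_fuzzer.c (hypothetical file name).
cat > xml_fuzzer.c <<'EOF'
#include <stddef.h>
#include <stdint.h>
#include <libxml/parser.h>
#include <libxml/xmlerror.h>

/* Swallow parser diagnostics so they do not flood the fuzzer's log. */
static void ignore_errors(void *ctx, const char *msg, ...) {
    (void)ctx;
    (void)msg;
}

/* libFuzzer calls this function once for every generated input. */
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
    xmlSetGenericErrorFunc(NULL, ignore_errors);
    xmlDocPtr doc = xmlReadMemory((const char *)data, (int)size,
                                  "noname.xml", NULL, 0);
    if (doc != NULL)
        xmlFreeDoc(doc);
    return 0;
}
EOF
```

The function parses the buffer, frees whatever document came out of it, and returns 0. Crashes and sanitizer reports raised during the call are what the fuzzer is looking for.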

Before compiling our target function, we need to compile all dependencies with clang and -fsanitize-coverage=trace-pc-guard, to enable SanitizerCoverage coverage tracing. It is a good idea to also use -fsanitize=address,undefined in order to enable both the AddressSanitizer (ASAN) and the UndefinedBehaviorSanitizer (UBSAN), which catch many bugs that otherwise might be hard to find.
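The build could be sketched as follows. It assumes the target function lives in xml_fuzzer.c (a hypothetical file name), that a prebuilt libFuzzer archive is available at the path in LIBFUZZER_A, and that the libxml2 tree offers the usual libtool target libxml2.la. All three are assumptions, so treat this as a starting point rather than a recipe.

```shell
# Flags from the paragraph above: coverage tracing plus ASAN and UBSAN.
FUZZ_CFLAGS="-g -O1 -fsanitize-coverage=trace-pc-guard -fsanitize=address,undefined"

# Location of a prebuilt libFuzzer archive (assumption: set this for your setup).
LIBFUZZER_A=${LIBFUZZER_A:-./libFuzzer.a}

# Build libxml2 as an instrumented static library. Run inside the source tree.
build_libxml2() {
    ./configure --disable-shared CC=clang CFLAGS="$FUZZ_CFLAGS" &&
    make libxml2.la
}

# Compile the target function and link it with libxml2 and libFuzzer.
# $FUZZ_CFLAGS is left unquoted on purpose so that it splits into separate flags.
# Extra libraries (for example -lz or -llzma) depend on how libxml2 was configured.
build_fuzzer() {
    clang $FUZZ_CFLAGS -I include -c xml_fuzzer.c -o xml_fuzzer.o &&
    clang++ $FUZZ_CFLAGS xml_fuzzer.o .libs/libxml2.a "$LIBFUZZER_A" -o xml_fuzzer
}
```

After build_libxml2 and build_fuzzer succeed, running ./xml_fuzzer ./output starts fuzzing and stores new inputs in ./output. Recent clang releases bundle libFuzzer and accept -fsanitize=fuzzer, which replaces both the separate coverage flag and the archive.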

We didn’t use any sample inputs, so libFuzzer starts by generating random data in order to find inputs that trigger new code paths in our libxml2 target function. All inputs that trigger new coverage are stored as sample files in ./output. As libFuzzer runs in-process, if a bug is found, it saves the test case and exits. On a high-end laptop, a single instance of libFuzzer reached over 5000 executions per second, slowing down to around 2000 once it started to generate test cases with more coverage. For more information on how to interpret the output see: http://llvm.org/docs/LibFuzzer.html#output

Creating a corpus

If your target is fast, meaning hundreds or even thousands of executions per second, you can try generating a base corpus out of thin air. With coverage guided fuzzing it is possible to do this even with more complex formats, as the AFL author Michał Zalewski did with JPEG files, but to save time, you should get a good representation of typical files for the application that are as small as possible. The smaller the files, the faster they are to fuzz.

AFL does not give any additional flags to tinker with when generating a corpus out of thin air. Just give it a small sample input, for example “<a></a>” as an XML sample, and run AFL like you normally would.

With libFuzzer you have more flags to experiment with. For example, for XML you might want to try with ‘-only_ascii=1’. One good technique for most formats is to execute multiple short runs while incrementing the maximum sample size of the fuzzer on each round and then merge all the results to form the output corpus.

With this approach, we first collect interesting inputs with maximum length of 4 bytes, the second run analyses the 4 byte inputs and uses those as a base for 8 byte inputs and so on. This way we discover “easy” coverage with faster smaller inputs and when we move to larger files we have a better initial set to start with.
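The rounds described above can be sketched as a small shell function. The binary name xml_fuzzer, the list of lengths, the 120 second default per round and the .work directory suffix are all assumptions chosen for illustration, not the values behind the numbers in this article.

```shell
FUZZER=${FUZZER:-./xml_fuzzer}   # libFuzzer binary (hypothetical name)

# generate_corpus OUT_DIR [SECONDS_PER_ROUND]
generate_corpus() {
    out=$1
    secs=${2:-120}
    work=${out}.work
    mkdir -p "$out" "$work"
    # Each round starts from what earlier rounds left in $work and may add
    # inputs up to the new length limit.
    for len in 4 8 16 32 64 128 256 512; do
        "$FUZZER" -max_len="$len" -max_total_time="$secs" -only_ascii=1 "$work" || return 1
    done
    # Keep only the inputs that contribute coverage.
    "$FUZZER" -merge=1 "$out" "$work"
}
```

For example, generate_corpus corpus 60 spends one minute on each length and leaves the merged result in ./corpus. The function stops early if a round exits with an error, which is also what happens when libFuzzer finds a crash.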

To get some numbers for this technique, we did three runs with the example script.

On average, running the corpus generation script took about 18 minutes on our laptop. LibFuzzer was still frequently discovering new coverage at the end of iterations where -max_len was larger than 8 bytes, which suggests that, for those lengths, libFuzzer should be allowed to run longer.

For comparison, we also took the libFuzzer with default settings and ran it for three rounds, which took about 18 minutes.

From these results we see that our runs with the corpus generation script on average executed more test cases and generated a larger set of files that triggers more coverage and features than the set generated with the default values. This is due to the size of test cases generated by libFuzzer using default settings. Previously libFuzzer used a default -max_len of 64 bytes, but at the time of writing libFuzzer was just updated to have a default -max_len of 4096 bytes. In practice, sample sets generated by this script have worked very well as starting points for fuzzing, but no data has been collected on how the effects differ in comparison to the default settings in long continuous fuzzing.

Corpus generation out of thin air is an impressive feat, but if we compare these results to the coverage from the W3C XML test suite we see that it is a good idea to also include sample files from different sources in your initial corpus, as you’ll get much better coverage before you’ve even fuzzed the target.

Merging our generated corpus into the W3C test suite increased the block coverage to 18727, so not that much, but we still got a total of 83972 features, increasing the total throughput of these test cases. Both improvements are most probably due to small samples triggering error conditions that were not covered by the W3C test suite.

Trimming your corpus

After fuzzing the target for a while, you’ll end up with a huge set of fuzzed files. A lot of these files are unnecessary, and trimming them to a much smaller set will provide you with the same code coverage of the target. To achieve this, both projects provide corpus minimization tools.

AFL gives you the afl-cmin shell script that you can use to minimize your corpus. For the previous example, to minimize the corpus generated in the ./out directory, you can generate a minimized set of files into the ./output_corpus directory.
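A minimal sketch of that step, assuming the AFL-instrumented binary is reachable through TARGET (adjust the path to your build tree):

```shell
TARGET=${TARGET:-./xmllint}   # AFL-instrumented binary (assumption)

# minimize_corpus IN_DIR OUT_DIR
# afl-cmin keeps a subset of IN_DIR that still exercises the same coverage.
minimize_corpus() {
    afl-cmin -i "$1" -o "$2" -- "$TARGET" @@
}
```

For the earlier AFL run, the collected inputs sit in the queue subdirectory, so the call would be minimize_corpus out/queue output_corpus.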

AFL also offers another tool, afl-tmin, that can be used to minimize individual files while maintaining the same coverage as observed initially. Be aware that running afl-tmin on a large set of files can take a very long time, so first do a couple of iterations with afl-cmin before trying afl-tmin.
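Since afl-tmin works on one file at a time, a small loop is needed to cover a whole directory. The sketch below assumes the same TARGET binary as before; the function name is made up for this example.

```shell
TARGET=${TARGET:-./xmllint}   # AFL-instrumented binary (assumption)

# tmin_dir IN_DIR OUT_DIR: run afl-tmin on every regular file in IN_DIR.
tmin_dir() {
    mkdir -p "$2"
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        afl-tmin -i "$f" -o "$2/$(basename "$f")" -- "$TARGET" @@
    done
}
```

Running tmin_dir output_corpus output_tmin after afl-cmin keeps the expensive per-file step limited to the files that survived corpus minimization.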

LibFuzzer doesn’t have an external trimming tool – it has the corpus minimization feature, called merge, built-in.

LibFuzzer merge is a little easier to use since it looks for files recursively from any number of input directories. Another nice feature in libFuzzer merge is the -max_len flag. Using -max_len=X, libFuzzer will only use the first X bytes from each sample file, so you can collect random samples without caring about their sizes. Without the max_len flag, libFuzzer uses a default maximum length of 1048576 bytes when doing a merge.
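A merge invocation can be wrapped as shown in this sketch. The binary name xml_fuzzer and the MAX_LEN environment variable are conventions invented for the example; -merge=1 and -max_len are the libFuzzer flags discussed above.

```shell
FUZZER=${FUZZER:-./xml_fuzzer}   # libFuzzer binary (hypothetical name)

# merge_corpus OUT_DIR IN_DIR...
# Set MAX_LEN to make libFuzzer read only the first MAX_LEN bytes of each file.
merge_corpus() {
    out=$1
    shift
    mkdir -p "$out"
    if [ -n "${MAX_LEN:-}" ]; then
        "$FUZZER" -merge=1 -max_len="$MAX_LEN" "$out" "$@"
    else
        "$FUZZER" -merge=1 "$out" "$@"
    fi
}
```

For example, MAX_LEN=1024 merge_corpus minimized ./output ./downloaded_samples writes the coverage-adding inputs into ./minimized, where the second input directory name is only a placeholder.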

With libFuzzer merge, you can use the same technique as you did to generate a corpus out of thin air.

With this trimming strategy libFuzzer will first collect new coverage triggering 2 byte chunks from each input sample, then merge those samples to 4 byte chunks, and so on, until you have the optimized set out of all the different length chunks.
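The staged merge described above could be sketched as follows. The binary name and the upper bound of 1024 bytes are assumptions; pick the largest length that makes sense for your format.

```shell
FUZZER=${FUZZER:-./xml_fuzzer}   # libFuzzer binary (hypothetical name)

# staged_merge OUT_DIR IN_DIR...
# Merge the inputs repeatedly, doubling the length limit on every round.
staged_merge() {
    out=$1
    shift
    mkdir -p "$out"
    for len in 2 4 8 16 32 64 128 256 512 1024; do
        "$FUZZER" -merge=1 -max_len="$len" "$out" "$@" || return 1
    done
}
```

Each round only adds inputs that bring coverage the output directory does not have yet, so short prefixes that already reach a code path are preferred over longer files reaching the same path.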

A simple merge won’t always help you with performance issues. Sometimes your fuzzer can stumble upon very slow code paths, causing collected samples to degrade your fuzzing throughput. If you don’t mind sacrificing a few code blocks for performance, libFuzzer can easily be used to remove samples that are too slow from your corpus. When libFuzzer is run with a list of files as an argument instead of a folder, it will execute every file individually and print out the execution time for each file.
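One way to act on those timings is to filter the log for samples above a threshold. The sketch below assumes that each timing line has the form "Executed <path> in <N> ms" and that sample paths contain no spaces; check both against the output of your libFuzzer version before relying on it.

```shell
# slow_samples LIMIT_MS
# Reads a libFuzzer log on stdin and prints the path of every sample whose
# "Executed <path> in <N> ms" line reports more than LIMIT_MS milliseconds.
slow_samples() {
    awk -v limit="$1" '
        $1 == "Executed" && $NF == "ms" && ($(NF - 1) + 0) > (limit + 0) { print $2 }
    '
}
```

libFuzzer writes its log to stderr, so a pipeline such as ./xml_fuzzer corpus/* 2>&1 | slow_samples 100 lists the samples that took longer than 100 ms. Review the list before deleting or moving anything.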

Running both fuzzers in parallel

Now that you have a good base corpus, and you know how to maintain it, you can kick off some continuous fuzzing runs. You could run your favorite fuzzer alone, or run both fuzzers separately, but if you’ve got enough hardware available you can also easily run multiple fuzzers simultaneously on the same corpus. That way you get to combine the best of both worlds while the fuzzers can share all the new coverage they find.

It’s easy to implement a simple script that will run both fuzzers simultaneously, while restarting the fuzzers every hour to refresh their sample corpus.

Because the example script only runs one hour per iteration, AFL is used in “quick & dirty mode” to skip all the deterministic steps. Even one large file can cause AFL to spend hours, or even days, on deterministic steps, so it’s more reliable to run AFL without them when running on a time budget. Deterministic steps can be run manually, or automatically on another instance that copies new samples to ‘./libfuzzer_output’.
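One round of such a script might look like the sketch below. It assumes a non-empty shared ./corpus directory, the coreutils timeout command, and the hypothetical binary names used in TARGET and FUZZER. Each round gets its own AFL output directory so that afl-fuzz never has to reuse an old one.

```shell
TARGET=${TARGET:-./xmllint}       # AFL-instrumented binary (assumption)
FUZZER=${FUZZER:-./xml_fuzzer}    # libFuzzer binary (hypothetical name)
AFL_FUZZ=${AFL_FUZZ:-afl-fuzz}

# fuzz_round ROUND_NUMBER [SECONDS]
# Runs both fuzzers side by side on ./corpus, then folds their finds back in.
fuzz_round() {
    round=$1
    secs=${2:-3600}
    afl_out=afl_output_$round
    mkdir -p corpus libfuzzer_output

    # AFL in "quick & dirty" mode (-d), stopped by timeout after $secs seconds.
    timeout "$secs" "$AFL_FUZZ" -d -i corpus -o "$afl_out" -- "$TARGET" @@ >/dev/null 2>&1 &

    # libFuzzer writes new inputs to libfuzzer_output and reads corpus as seeds.
    "$FUZZER" -max_total_time="$secs" libfuzzer_output corpus &

    # Block until both background fuzzers have finished.
    wait

    # Add every input that brings new coverage to the shared corpus.
    "$FUZZER" -merge=1 corpus libfuzzer_output "$afl_out/queue"
}
```

A driver loop such as n=1; while fuzz_round "$n"; do n=$((n + 1)); done keeps restarting both fuzzers with the refreshed corpus. The loop ends when the merge step fails, for example because libFuzzer hit a crash while replaying an input.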

Dictionaries

You have your corpus, and you’re happily fuzzing and trimming. Where do you go from here?

Both AFL and libFuzzer support user-provided dictionaries. These dictionaries should contain keywords, or other interesting byte patterns, that would be hard for the fuzzer to determine. For some useful examples, take a look at Google libFuzzer’s XML dictionary and this AFL blog post about dictionaries.
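Both tools read the same simple file format: one quoted token per line, optionally prefixed with a name. The entries below are a handful of XML tokens picked for illustration, not the contents of any published dictionary.

```shell
# Write a tiny XML dictionary: one quoted token per line, name= prefix optional.
cat > xml.dict <<'EOF'
# Lines starting with a hash are comments.
xml_decl="<?xml"
cdata_start="<![CDATA["
cdata_end="]]>"
comment_start="<!--"
doctype="<!DOCTYPE"
EOF
```

AFL takes the file with -x, as in afl-fuzz -x xml.dict followed by the usual arguments, while libFuzzer takes it with -dict=xml.dict.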

Both AFL and libFuzzer also collect dictionary entries during execution. AFL collects them when performing deterministic fuzzing steps, while libFuzzer’s approach is based on instrumentation.

When running libFuzzer with a time or test case limit, libFuzzer will output a recommended dictionary upon exit. This feature can be used to collect interesting dictionary entries, but it is recommended to do manual sanity checks over all automatically collected entries. libFuzzer builds those dictionary entries as it discovers new coverage, so those entries often build up towards the final keyword.

We tested dictionaries with three 10 minute runs: without a dictionary, with the recommended dictionary from the first run, and with Google’s libFuzzer XML dictionary. Results can be seen from the table below.

Surprisingly, there was no significant difference between the results from the run without dictionary and the run with recommended dictionary from the first run, but with a “real” dictionary there is a dramatic change in the amount of coverage discovered during the run.

Dictionaries can really change the effectiveness of fuzzing, at least on short runs, so they are worth the investment. Shortcuts, like the libFuzzer recommended dictionary, can help, but you still need to do the extra manual effort to leverage the potential in dictionaries.

Fuzzing experiment

Our goal was to do a weekend long run on a couple of laptops. We ran two instances of AFL and libFuzzer, fuzzing the above example. The first instance was started without any corpus, and the second one with a trimmed corpus from the W3C XML Test Suite. The results could then be compared by performing a dry run for the minimized corpus from all four sets. Results from these fuzzers are not directly comparable since the fuzzers use different instrumentation to detect executed code paths and features. libFuzzer measures two things for assessing new sample coverage: block coverage, that is, isolated blocks of code visited, and feature coverage, that is, a combination of different code path features like transitions between code blocks and hit counts. AFL doesn’t offer a direct count for the observed coverage, but we use overall coverage map density in our comparisons. The map density indicates how many branch tuples we have hit, in proportion to how many tuples the coverage map can hold.

Our first run didn’t go quite as expected. After 2 days and 7 hours we were reminded about the downsides of using deterministic fuzzing on large files. Our afl-cmin minimized corpus contained a couple of samples over 100kB that caused AFL to slow down to a crawl after processing less than 38% of the first round. It would have taken days for AFL to get through a single file, and we had four of those in our sample set, so we decided to restart the instances after removing all samples over 10kB. Sadly, on Sunday night at 11PM, “backup first” wasn’t the first thing on our minds and the AFL plot data was accidentally overwritten, so no cool plots from the first round. We managed to save the AFL UI before aborting.

Full results of our 2 day fuzzing campaign can be found from the image/table below.

We had actually never tried to pit these fuzzers against each other before. Both fuzzers were surprisingly even in our experiment. Starting from the W3C samples, the difference between discovered coverage, as measured by libFuzzer, was only 1.4%. Both fuzzers also found pretty much the same coverage. When we merged all the collected files from the four runs, and the original W3C samples, the combined coverage was only 1.5% higher than the coverage discovered by libFuzzer alone. Another notable thing is that without initial samples, even after 2 days, neither libFuzzer nor AFL had discovered more coverage than our previous demonstration in generating a corpus out of thin air did repeatedly in 10 minutes.

We also generated a chart from coverage discovery during the libFuzzer fuzzing run with the W3C samples.

Which one should I use?

As we detailed, AFL is really simple to use, and can be started with virtually no setup. AFL takes care of handling found crashes and stuff like that. However, if you don’t have a ready command line tool like xmllint, and would need to write some code to enable fuzzing, it often makes sense to use libFuzzer for superior performance.

In comparison to AFL, libFuzzer has built-in support for sanitizers, such as AddressSanitizer and UndefinedBehaviorSanitizer, which help in finding subtle bugs during fuzzing. AFL has some support for sanitizers, but depending on your target there might be some serious side effects. The AFL documentation suggests running the fuzzer without sanitizers and running the output queue separately with a sanitizer build, but there is no actual data available to determine whether that technique can catch the same issues as ASAN-enabled fuzzing. For more info about AFL and ASAN you can check docs/notes_for_asan.txt from the AFL sources.
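Replaying the AFL queue against a sanitizer build can be sketched like this. The binary name xmllint_asan is hypothetical. The sketch sets ASAN_OPTIONS=abort_on_error=1 so that an ASAN report ends the process with a signal, which keeps sanitizer findings apart from the ordinary non-zero exit codes xmllint returns for malformed XML.

```shell
ASAN_TARGET=${ASAN_TARGET:-./xmllint_asan}   # sanitizer build (hypothetical name)

# replay_queue DIR
# Prints every input in DIR that makes the sanitizer build die on a signal.
replay_queue() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        ASAN_OPTIONS=abort_on_error=1 "$ASAN_TARGET" "$f" >/dev/null 2>&1
        status=$?
        # Exit statuses above 128 mean the process was killed by a signal.
        if [ "$status" -gt 128 ]; then
            echo "$f"
        fi
    done
}
```

Running replay_queue out/queue after an AFL session lists the queue entries worth a closer look. UBSAN reports do not abort by default, so this sketch only surfaces ASAN findings and plain crashes.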

In many cases however it makes sense to run both fuzzers, as their fuzzing, crash detection and coverage strategies are slightly different.

If you end up using libFuzzer, you really should check out Google’s great libFuzzer tutorial.