Push the Red Button

Wednesday, October 17, 2018

I suspect a lot of people in academia end up having a lot of ideas and projects that went nowhere for any number of reasons – maybe there were insurmountable technical challenges, maybe the right person to work on it never materialized, or maybe it just got crowded out by other projects and never picked back up. Here are a couple of mine. For each I'll try to indicate why it fell by the wayside, and whether I think it could be resurrected (if you're interested in doing some idea necromancy, let me know! :)).

Detecting Flush+Flush

Among the flurry of microarchitectural side channel attacks that eventually culminated in the devastating Spectre and Meltdown attacks was one that has received relatively little attention: Flush+Flush. The base of the attack is the observation that clflush takes a different amount of time depending on whether the address to be flushed was already in the cache or not. Gruss et al. had a nice paper on this variant of the attack at DIMVA 2016.

The interesting thing about Flush+Flush to me is not its speed (which I believe has since been surpassed) but the fact that it is stealthy: unlike Flush+Reload, which causes an unusually large number of cache misses in the attacking process, Flush+Flush only causes cache misses in the victim process. So you can tell that an attack is happening, but you can't tell which process is actually carrying it out --- which might be useful if you want to stop the attack without taking down the whole machine.

The key idea of the project was that even if you have only a global detector of whether an attack is going on system-wide, you can convert this into something that detects which process is attacking with a simple trick. Assuming you have control of the OS scheduler, just pause half the processes on the system and see if the attack stops. If it does, then the attacker must have been in the half you paused; you can now repeat this procedure on those processes and find the attacker via binary search. There are some wrinkles here: what if the attacker is carrying out the attack from multiple cooperating processes? What if the attacker probabilistically chooses whether or not to continue the attack at each time step?

It's a pretty simple idea but I think it may have applications beyond detecting flush-flush. For example, I recently complained on twitter that it's currently hard to tell what's causing my Mac's fan to spin. I have a detector (fan is on); if I can systematically freeze and unfreeze processes on the system, I can convert that to something that tells me . It's also very similar to the well known ideas of code bisection and (my favorite name) Wolf Fence Debugging. So if this defense is ever implemented I hope it becomes known as the wolf fence defense :)

This project failed primarily due to personnel issues: a couple different students took it on as a semester or internship project, but didn't manage to finish it up before they graduated or moved on to other things. It also turned out to be more difficult than expected to reliably carry out Flush+Flush attacks; at the time we started the project, a lot less was known about the Intel microarchitecture, and we were tripped up by features like hardware prefetching. We were also hampered by the lack of widely available implementations of hardware performance counter-based detections, and our own implementation was prone to false positives (at least in this we are not alone: a recent SoK paper has pointed out that many of these HPCs are not very reliable for security purposes).

I think this could still make a nice little paper, particularly if some of the more complex scenarios and advanced attackers were taken into account. When I discussed this with a more theoretically minded friend, he thought there might be connections to compressed sensing; meanwhile, a game theoretically oriented colleague I spoke to thought it sounded like an instance of a Multi-Armed Bandit problem. So there's still plenty of meat to this problem for an interested researcher.

Reverse engineering the precise detection technique could have a lot of benefits. First, as a matter of principle, I think defensive techniques must be attacked if we are to rely on them; the fact that this has been used in the wild for more than 15 years across a wide range of devices without any rigorous public analysis is really surprising to me. Second, there is a fun application if we can pin down precisely what makes the currency detector fire: we could create something that placed a currency watermark on arbitrary documents, making them impossible to scan or edit on devices that implement the currency detector! We could even, perhaps, imagine making T-shirts that trigger the detector when photographed :) I believe this idea has been floated before with the EURion constellation, but based on Murdoch's research we know that the EURion is not the only feature used (and it may not be used at all by modern detectors).

Our technical approach to this problem was to use dynamic taint analysis and measure the amount of computation performed by Photoshop in each pixel of the input image. I now think this was a mistake. First, many image analyses compute global features (such as color histograms) over the whole image; taint analysis on these will only tell you that every pixel contributes to the decision [1]. A more profitable approach might be something like differential slicing, which pins down the causal differences between two closely related execution traces. This would hopefully help isolate the code responsible for the detection for more manual analysis.

As with the Flush+Flush detection above, this project could still be viable if approached by someone with strength in binary analysis and some knowledge of image processing and watermarking algorithms.

1. This also made me start thinking about variants of taint analysis that would be better suited to image analyses. One possibility is something based on quantitative information flow; Steven McCamant et al. had some success at using this to analyze the security of image redactions. Even with traditional dynamic taint analysis, though, I think it might be possible to try tainting the results of intermediate stages of the computation (e.g., in the histogram example, one could try tainting each bucket and seeing how it contributed to the eventual decision, or for an FFT one could apply the taint analysis after the transformation into the frequency domain).

Monday, March 19, 2018

Summary: recently published results on the LAVA-M synthetic bug dataset are exciting. However, I show that much simpler techniques can also do startlingly well on this dataset; we need to be cautious in our evaluations and not rely too much on getting a high score on a single benchmark.

A New Record

The LAVA synthetic bug corpora have been available now for about a year and a half. I've been really excited to see new bug-finding approaches (particularly fuzzers) use the LAVA-M dataset as a benchmark, and to watch as performance on that dataset steadily improved. Here's how things have progressed over time.

Performance on the LAVA-M dataset over time. Note that because the different utilities have differing numbers of bugs, this picture presents a slightly skewed view of how successful each approach was by normalizing the performance on each utility. Also, SBF was only evaluated on base64 (where it did very well), and Vuzzer's performance on md5sum is due to largely to a problem in the LAVA-M dataset.

You may notice that Angora (Chen and Chen, to appear at Oakland ’18), on the far right, has come within striking distance of perfect performance on the dataset! This is a great result, and the authors pulled it by combining several really neat techniques (I mention a few of them in this Twitter thread). They also managed to do it in just 10 minutes for base64, md5sum, and uniq, and 45 minutes for who. Kudos to the authors!

My feeling is that the effective “shelf life” of this dataset is pretty much up – current techniques are very close to being able to cover everything in this dataset. This is perhaps not too surprising, since the coreutils are quite small, and the trigger mechanism used by the original LAVA system (a comparison against a 4 byte magic number) is a bit too simple since you can often solve it by extracting constants from the binary. Luckily, we have been working on new bug injection techniques and new corpora, and we will hopefully have some results to announce soon – watch this space :)

Covering your Baseline

When talking about progress in bug-finding, it is important to figure out a good baseline. One such baseline, of course, is the results we provided in the original paper (the first two bars on the graph). However, during the summer following the paper's publication, I pointed out two fairly simple ways to strengthen the baseline for fuzzing: using a constant dictionary and using a compiler pass to split up integer comparisons. The former is built in to AFL, and the latter was implemented in an AFL fork by laf-intel.

What does our baseline look like when we use these techniques? To test this, I ran fuzzers on the LAVA-M dataset for 1 hour per program. This is a very short amount of time, but the intent here is to see what's possible with minimal effort and simple techniques. For the dictionary, I used a simple script that extracts strings and also runs objdump to extract integer constants. I tested:

AFL with a dictionary

laf-intel with AFL's "fidgety" mode (the -d option) and LAF_SPLIT_COMPARES=1

The same laf-intel configuration, but with a 256KiB map size rather than the default 64KiB.

The laf-intel configurations were suggested by Caroline Lemieux and relayed to me by Kevin Laeufer, who also pointed out that one really ought to be doing multiple runs of each fuzzer since the instrumentation as well as some fuzzing stages are nondeterministic – advice I have, alas, ignored for now. The larger map size helps reduce hash collisions with laf-intel – splitting up comparisons adds a large number of new branches to the program, so fuzzing performance may suffer with too small a map.

Here's what our new baseline results look like when put next to published work.

In this light, published tools are still much stronger than our original weak baseline, but (aside from Angora) are not better than AFL with a dictionary. I suspect that this says more about deficiencies in our dataset than about the tools themselves (more on this later).

Who's Who

Because the who program has so many bugs, and published tools seem to have had some trouble with it, I also wanted to see how different techniques would fare when given 24 hours to chew on it. I should note here, for clarity, that I do not think 2000+ bugs in who is a reasonable benchmark by which to judge bug-finding tools. You should not use the results of this test to decide what you will use to fuzz your software. But it does help us characterize what works well on LAVA-style bugs in a very simple, small program that takes structured binary input.

AFL, fidgety mode, with the laf-intel pass, with all four combinations of {dict,nodict} x {64KiB,256KiB} map.

You can see the results below; I've included Angora in the chart since it's the best-performing of the non-baseline approaches. Note, though, that Angora only ran for 45 minutes – it's possible it would have done better if allowed to run for 24 hours.

The top of the chart, at 2136, represents the total number of (validated) bugs in who. We can see that the combination of our techniques comes close to finding all of the bugs (93.7%), and any variant of AFL that was able to use a dictionary does very well. The only exception to this trend is AFL with default settings – this is because by default, AFL starts with deterministic fuzzing. With a large dictionary, this can take an inordinately long time: we noticed that after 24 hours, default AFL still hadn't finished a full cycle, and was only 12% of the way through the stage. Fidgety mode (the -d option) bypasses deterministic fuzzing and does much better here.

Notes on Seed Selection

When we put together the LAVA-M benchmark, one thing we thought would be important is to include the seed file we used for our experiments. The performance of mutation-based fuzzers like AFL can vary significantly depending on the quality of their input corpus, so to reproduce our results you really need to know what seed we used. Sadly, I almost never see papers on fuzzing actually talk about their seeds. (Perhaps they usually start off with no seeds? But this is an unrealistic worst-case for AFL.)

In the experiments above, I used the corpus-provided seeds for all programs except md5sum. In the case of md5sum, the program is run with "md5sum -c", which causes it to go compute and check the MD5 of each file listed in the input file. The seed we used in the LAVA-M corpus lists 20 files, which slows down md5sum significantly as it has to compute 20 MD5 sums. Replacing the seed with one that only lists a single, small file allows AFL to achieve more than 10 times as many executions per second. In retrospect, our seed choice when running the experiments for the paper were definitely sub-optimal.

Conclusion

The upshot? We can almost completely solve the LAVA-M dataset using stock AFL with a dictionary. The only utility that this doesn't work for is "who"; however, a few simple tricks (a dictionary, constant splitting, and skipping the deterministic phase of AFL with "-d") and 24 hours suffice to get us to very high coverage (1731/2136 bugs, or 81%). So although the results of fuzzers like Angora are exciting, we should be cautious not to read too much into their performance on LAVA-M and instead also ask how they perform at finding important bugs in programs we care about. (John Regehr points out that the most important question to ask about a bug-finding paper is “Did you report the bugs, and what did they say?”)

To be honest, I think this is not a great result for the LAVA-M benchmark. If a fairly simple baseline can find most of its bugs, then we have miscalibrated its difficulty: ideally we want our bug datasets to be a bit beyond what the best techniques can do today. This also confirms our intuition that it's time to create a new, harder dataset of bugs!

Acknowledgements

Thanks to Caroline Lemieux and Kevin Laeufer for feedback on a draft of this post, which fixed several mistakes and provided several interesting insights (naturally, if you see any more errors – that's my fault). Additional thanks are due to Kevin Laeufer (again), Alex Gantman, John Regehr, Kristopher Micinski, Sam Tobin-Hochstadt, Michal Zalewski (aka lcamtuf), and Valentin Manès (aka Jiliac) for a lively Twitter argument that turned into some actual useful science.

Thursday, October 20, 2016

Every year the NYU School of Engineering hosts Cyber Security Awareness Week (CSAW) – the largest student-run security event in the country. This year, we're trying something new that combines two of my favorite things: security and open source.

The inaugural Security: Open Source (SOS) workshop, held this November 10 at NYU Tandon will feature the creators of some really cool new security tools talking about their projects. It's happening the day before one of the best CTF challenges out there, so we're expecting an audience that's not afraid of technical detail :)

What will you hear about at SOS? Here some of the cool speakers and topics:

Félix Cloutier will tell us about his open-source decompiler, fcd. This is a great example of incorporating cutting edge academic research into an open-source tool that anyone can use. Félix is also a former CSAW CTF competititor.

Mike Arpaia, co-founder of Kolide, will talk about osquery, a new open-source operating system instrumentation framework and toolset he created while at Facebook. Mike will talk about his experience managing an open-source security project and how to make it successful.

Patrick Hulin from MIT Lincoln Laboratory will talk about a new differential debugging technique he's devised. Patrick is one of the lead developers on PANDA, and he'll talk about how he used another great open-source tool, Mozilla rr, to automatically do root-cause debugging on devilishly tricky record/replay bugs.

Jamie Levy, one of the core developers on the Volatility memory forensics framework, will talk about taking memory forensics to the next level. Jamie is one of the most talented forensic investigators and developers I know and this should be a great talk!

Jonathan Salwan and Romain Thomas from Quarkslab will present a deep dive on Triton, their exciting binary analysis platform that combines symbolic execution and dynamic taint analysis, and demonstrate how it can be used to defeat virtualization-based obfuscation techniques.

Andrew Dutcher of UCSB will talk about angr, their Python-based binary analysis platform that aims to bring together tons of state-of-the-art analyses under one roof. They've recently used it to get third place in the DARPA Cyber Grand Challenge, and it's become a popular tool for CTF players around the world.

SOS will take place in the Pfizer Auditorium at the NYU Tandon School of Engineering in Brooklyn from 10:30am-5:30pm on November 10, the day before the CSAW CTF.

LAVA-1 is a corpus consisting of 69 versions of the "file" utility, each of which has had a single bug injected into it. Each bug is a named branch in a git repository. The triggering input can be found in the file named CRASH_INPUT. To run the validation, you can use validate.sh, which builds each buggy version of file and evaluates it on the corresponding triggering input.

LAVA-M is a corpus consisting of four GNU coreutils programs (base64, md5sum, uniq, and who), each of which has had a large number of bugs added. Each injected, validated bug is listed in the validated_bugs file, and the corresponding triggering inputs can be found in the inputs subdirectory. To run the validation, you can use the validate.sh script, which builds the buggy utility and evaluates it on triggering and non-triggering inputs.

For both corpora, the "backtraces" subdirectory contains the output of gdb's backtrace command for each bug.

Thursday, July 21, 2016

Using one of the test cases from the previous post, I examine what affects AFL's ability to find a bug placed by LAVA in a program. Along the way, I found what's probably a harmless bug in AFL, and some interesting factors that affect its performance. Although its interface is admirably simple, AFL can still require some tuning, and unexpected things can determine its success or failure on a bug.

I started off with the toy program we looked at in the previous post, with a single bug added. The bug added by LAVA will trigger whenever the first four bytes of a float-type file_entry are set to 0x6c6175de or 0xde75616c, and will cause printf to be called with an invalid format string, crashing the program.

After verifying that the bug could be triggered reliably, I compiled it with afl-gcc and started a fuzzing run. To get things started, I used a well-formed input file for the program that contained both int and float file_entry types:

Because I'm lucky enough to have a 24 core server sitting around, I gave it 24 cores (one using -M and the rest using -S) and let it run for about 4 and a half days, fully expecting that it would find the input in that time.

This did not turn out so well.

Around 20 billion executions later, AFL had found zilch.

At this point, I turned to Twitter, where John Regehr suggested that I look into what coverage AFL was achieving. I realized that I actually had no idea how AFL's instrumentation worked, and that this would be a great opportunity to find out.

@moyix you should look at the actual coverage achieved by afl's test cases and see what is going on

Diving Into AFL's Instrumentation

The basic afl-gcc and afl-clang tools are actually very simple. They wrap gcc and clang, respectively, and modify the compile process to emit an intermediate assembly code file (using the -S option). Finally they do some simple string matching (in C, ew) to find out where to add in calls to AFL's coverage logging functions. You can get AFL to save the assembly code it generates using the AFL_KEEP_ASSEMBLY environment variable, and see exactly what it's doing. (There's actually also a newer way of getting instrumentation that was added recently using an LLVM pass; more on this later.)

Left, the original assembly code. Right, the same code after AFL's instrumentation has been added.

After looking at the generated assembly, I noticed that the code corresponding to the buggy branch of the if statement wasn't getting instrumented. This seemed like it could be a problem, since AFL can't try to use coverage to reach a part of the program if there's no logging to tell it that an input has caused it to reach that point.

Looking into the source code of afl-as, the program that instruments the assembly code, I noticed a curious bit of code:

AFL skips labels following p2align directives in the assembly code.

According to the comment, this should only affect programs compiled under OpenBSD. However, the branch I wanted instrumented was being affected by this even though I was running under Linux, not OpenBSD, and there were no jump tables present in the program.

The .L18 block should be instrumented by AFL, but won't be because it's right after an alignment statement.

Since I'm not on OpenBSD, I just commented out this if statement. As an alternate workaround, you can also add "-fno-align-labels -fno-align-loops -fno-align-jumps" to the compile command (at the cost of potentially slower binaries). After making the change I restarted, once again confident AFL would soon find my bug.

Alas, it was not to be. Another 17 hours of fuzzing on 24 cores yielded nothing, and so I went back to the drawing board. I am still fairly sure I found a real bug in AFL, but fixing it didn't help find the bug I was interested in. (Note: it's possible that if I had waited four days again it would have found my bug. On the other hand, AFL's cycle counter had turned green, indicating that it thought there was little benefit in continuing to fuzz.)

5.2 billion executions, no crashes :(

“Unrolling” Constants

Thinking about what would be required to find the bug by AFL, I realized that its chances of hitting our failing test case were pretty low. AFL will only prioritize a test case if it has seen that it leads to new coverage. In the case of our toy program, it would have to guess one of the two exact 32-bit trigger values at exactly the right place in the file, and the odds of this happening are pretty slim.

So how was AFL able to generate a CDATA tag out of thin air? It turns out that libxml2 has a set of macros that expand out some string comparisons into character-by-character comparisons that use simple if statements. This allows AFL to discover valid strings character by character, since each correct character will add new coverage, and cause further fuzzing to be done with that input.

We can also apply this to our test program. Rather than checking for the fixed constant 0x6c6175de, we can compare each byte individually. This should allow AFL to identify the trigger value one byte at a time. The new code looks like this:

The monolithic if statement has been replaced by 4 individual branches.

Once we make this change and compile with afl-gcc, AFL finds a crash in just 3 minutes on a single CPU!

AFL has found the bug!

This also makes me wonder if it might be worthwhile to implement a compiler pass that breaks down large integer comparisons into byte-sized chunks that AFL can deal with more easily. For string comparisons, one can already substitute in an inline implementation of strcmp/memcmp; an example is available in the AFL source.

A Hidden Coverage Pitfall

While investigating the coverage issues, I noticed that AFL has a new compiler: afl-clang-fast. This module, contributed by László Szekeres, performs instrumentation as an LLVM pass rather than by modifying the generated assembly code. As a result, it should be less brittle and allow for more instrumentation options; from what I can tell it's slated to become the default compiler for AFL at some point.

However, I discovered that its instrumentation is not identical to the instrumentation done by afl-as. Whereas afl-as instruments each x86 assembly conditional branch (that is, any of the instructions starting with "j" aside from "jmp"), afl-clang-fast works at the level of LLVM basic blocks, which are closer to the blocks of code found in the original source. And since by default AFL adds -O3 to the compile command, multiple conditional checks may end up getting merged into a single basic block.

As a result, even though we have added multiple if statements to our source, the generated LLVM looks more like our original statement – the AFL instrumentation is only placed in the innermost if body, and so AFL is forced to try and guess the entire 32-bit trigger at once again.

Using the LLVM instrumentation mode, AFL is no longer able to find our bug.

We can tell AFL not to enable the compiler optimizations, however, by setting the AFL_DONT_OPTIMIZE environment variable. If we do that and recompile with afl-clang-fast, the if statements do not get merged, and AFL is able to find the trigger for the bug in about 7 minutes.

So this is something to keep in mind when using afl-clang-fast: the instrumentation does not work in quite the same way as the traditional afl-gcc mode, and in some special cases you may need to use AFL_DONT_OPTIMIZE in order to get the coverage instrumentation that you want.

Making AFL Smarter with a Dictionary

Although it's great that we were able to get AFL to generate the triggering input that reveals the bug by tweaking the program, it would be nice if we could somehow get it to find the bugs in our original programs.

AFL is having trouble with our bugs because they require it to guess a 32-bit input all at once. The search space for this is pretty large: even supposing that it starts systematically flipping bits in the right part of the file, it's going to take an average of 2 billion executions to find the right value. And of course, unless it has some reason to believe that working on that part of the file will get improved coverage, it won't be focusing on the right file position, making it even less likely it will find the right input.

However, we can give AFL a leg up by allowing it to pick inputs that aren't completely random. One of AFL's features is that it supports using a dictionary of values when fuzzing. This is basically just a set of tokens that it can use when mutating a file instead of picking values at random. So one classic trick is to take all of the constants and strings found in the program binary and add them to the dictionary. Here's a quick and dirty script that extracts the constants and strings from a binary for use with AFL:

Once we give AFL a dictionary, it finds 94% of our bugs (149/159) within 15 minutes!

Now, does this mean that LAVA's bugs are too easy to find? At the moment, probably yes. In the real world, the triggering conditions will not always be something you can just extract with objdump and strings. The key improvement needed in LAVA is a wider variety of triggering mechanisms, which is something we're working on.

Conclusion

By looking in detail at a bug we already knew was there, we found out some very interesting facts about AFL:

Its ability to find bugs is strongly related to the quality of its coverage instrumentation, and that instrumentation can vary due both to bugs in AFL and inherent differences in the various compile-time passes AFL supports.

The structure of the code also heavily influences AFL's behavior: seemingly small differences (making 4 one-byte comparisons vs one 4-byte comparison) can have a huge effect.

Seeding AFL with even a naïve dictionary can be devastatingly effective.

In the end, this is precisely what we hoped to accomplish with LAVA. By carefully examining cases where current bug-finding tools have trouble on our synthetic bugs, we can better understand how they work and figure out how to make them better at finding real bugs as well.

Thanks

Thanks to Josh Hofing, Kevin Chung, and Ryan Stortz for helpful feedback and comments on this post, and of course Michal Zalewski for making AFL.

Monday, July 11, 2016

This is the second in a series of posts about evaluating and improving bug detection software by automatically injecting bugs into programs. Part one, which discussed the setting and motivation, is available here.
Now that we understand why we might want to automatically add bugs to programs, let's look at how we can actually do it. We'll first investigate an existing approach (mutation testing), show why it doesn't work very well in our scenario, and then develop a more sophisticated injection technique that tells us exactly how to modify the program to insert bugs that meet the goals we laid out in the introductory post.

A Mutant Strawman that Doesn't Work

One way of approaching the problem of bug injection is to just pick parts of the program that we think are currently correct and then mutate them somehow. This, essentially, is the idea behind mutation testing: you use some predefined mutation operators that mangle the program somehow and then declare that it is now buggy.

For example, we could take every instance of strncpy and change it to strcpy. Presumably, this would add lots of potential buffer overflows to a program that previously had none.

Unfortunately, this method has a couple problems. First, it is likely that many such changes will break the program on every input, which would make the bug trivial to find. The following program will always fail if strncpy is changed to strcpy:

We also face the opposite problem: if the bug doesn't trigger every time, we won't necessarily know how to trigger it when we want to. This will make it hard to prove that there really is a bug, and violates one of the requirements we described last time: each bug must come with a triggering input that proves the bug exists. If we wanted to find the triggering input for a given mutation, we'd have to find an input that reaches our mutant, which is actually a large part of what makes finding bugs hard!

Dead, Uncomplicated and Available Data

Instead of doing random, local mutations, LAVA first tries to characterize the program's behavior on some concrete input. We'll run the program on an input file, and then try to see where that input data reaches in the program. This solves the triggering program because we will know a concrete path through the program, and the input needed to traverse that path. Now, if we can place bugs in code along that path, we will be able to reach them using the concrete input we know about.

We need a couple other properties. Because we want to create bugs that are triggered only for certain values, we will want the ability to manipulate the input of the program. However, doing so might cause the program to take a different path, and the input data may get transformed along the way, making it difficult to predict what value it will have when we actually want to use it to trigger our bug.

To resolve this, we will try to find parts of the program's input data that are:

Dead: not currently used much in the program (i.e., we can set to arbitrary values)

Uncomplicated: not altered very much (i.e., we can predict their value throughout the program's lifetime)

Available in some program variables

We'll call data that satisfies these three properties a DUA. DUAs try to capture the notion of attacker-controlled data: a DUA is something that can be set to an arbitrary value without changing the program's control flow, is available somewhere along the program path we're interested in, and whose value is predictable.

Measuring Liveness and Complication with Dynamic Taint Analysis

Having defined these properties, we need some way to measure them. We'll do that using a technique called dynamic taint analysis1. You can think of dynamic taint analysis like a PET scan or a barium swallow, where a radionuclide is introduced into a patient, allowed to propagate throughout the body, and then a scan checks to see where it ends up. Similarly, with taint analysis, we can mark some data, allow it to propagate through the program, and later query to see where it ended up. This is an extremely useful feature in all sorts of reverse engineering and security tasks.

Like a PET scan, dynamic taint analysis works by seeing where marked input ends up in your program.

To find out where input data is available, we can taint the input data to the program – essentially assigning a unique label to each byte of the program's input. Then, as the program runs, we'll propagate those labels as data is copied around the program, and query any variables in scope as the program runs to see if they are derived from some portion of the input data, and if so, from precisely which bytes.

Next, we want to figure out what data is currently unused. To do so, we'll extend simple dynamic taint analysis by checking, every time there's a branch in the program, whether the data used to decide it was tainted, and if so, which input bytes were used make the decision. At the end, we'll know exactly how many branches in the program each byte of the input was used to decide. This measure is known as liveness.

Liveness measures how many branches use each input byte.

Finally, we want some measure of how complicated the data in each tainted program variable is. We can do this with another addition to the taint analysis. In standard taint analysis, whenever data is copied or computed in the program, the taint system checks if the source operands are tainted and if so propagates the taint labels to the destination. If we want to measure how complicated a piece of data is – that is, how much it has been changed since it was first introduced to the program – we can simply add a new rule that increments a counter whenever an arithmetic operation on tainted data occurs. That is, if you have something like c = a + b; then the taint compute number (TCN)of c is tcn(c) = max(tcn(a),tcn(b)) + 1.

TCN measures how much computation has been done on a variable at a given point in the program.

On the implementation side, all this is done using PANDA, our platform for dynamic analysis. PANDA's taint system allows us to taint an input file with unique byte labels. To query the state of program variables, we use a clang tool that modifies the original program source code2 to add code that asks PANDA to query and log the taint information about a particular program variable. When we run the program under PANDA, we'll get a log telling us exactly which program variables were tainted, how complicated the data was, and how live each byte of input is.

PANDA's taint system allows us to find DUAs in the program.

After running PANDA, we can pick out the variables that are uncomplicated and derived from input bytes with low liveness. These are our DUAs, approximations of attacker controlled data that can be used to create bugs.

Finding Attack Points

With some DUAs in hand, we now have the raw material we need to create our bugs. The last missing piece is finding some code we want to have an effect on. These are places where we can use the data from a DUA to trigger some buggy effect on the program, which we call attack points (ATP). In our current implementation, we look for places in the program where pointers are passed into functions. We can then use the DUA to modify the pointer, which will hopefully cause the program to perform an out of bounds read or write – a classic memory safety violation.

Because we want the bug to trigger only under certain conditions, we will also add code at the attack point that checks if the data from the DUA has a specific value or is in a specific range of values. This gives us some control over how much of the input space triggers the bug. The current implementation can produce both specific-value triggers (DUA == magic_value) and range-based triggers of varying sizes (x < DUA < y).

Each LAVA bug, then is just a pair (DUA, ATP) where the attack point occurs in the program trace after the DUA. If there are many DUAs and many attack points, then we will be able to inject a number of bugs roughly proportional to the product of the two. In large programs like Wireshark, this adds up to hundreds of thousands of potential bugs for a single input file! In our tests, multiple files increased the number of bugs roughly linearly, in proportion to the amount of coverage achieved by the extra input. Thus, with just a handful of input files on a complex program you can easily reach millions of bugs.

Our "formula" for injecting a bug. Any (DUA, ATP) pair where the DUA occurs before the attack point is a potential bug we can inject.

Modifying the Source Code

The last step is to modify the source code to add our bug. We will insert code in two places:

At the DUA site, to save a copy of the input data to a global variable.

At the attack point, to retrieve the DUA's data, check if it satisfies the trigger condition, and use it to corrupt the pointer.

By doing so, we create a new data flow between the place where our attacker-controlled data is available and the place where we want to manifest the bug.

A Toy Example

To see LAVA in action, let's step through a full example. Have a look at this small program, which parses and prints information about a very simple binary file format:

We start by instrumenting the source code to add taint queries. The queries will be inserted to check taint on program variables, and, for aggregate data structures, the members inside each structure. The result is a bit too long to include inline, since it quadruples the size of the original program, but you can see it in this gist.

When we compile and run that program on some input inside of PANDA with taint tracking enabled, we get information about taint compute numbers and the liveness of each byte of the input. For example, here's the liveness map for a small (88 byte) input:

Liveness map for the input to our toy program. The bytes with a white background are completely dead – they can be set to arbitrary values without affecting the behavior of the program.

LAVA's analysis finds 82 DUAs and 8 attack points, for a total of 407 potential bugs. Not all of these bugs will be viable: because we want to measure the effect of liveness and taint compute number, the current implementation does not impose limits on how live or complicated the DUAs used in bugs are.

To make sure that an injected bug really is a bug, we do two tests. First, we run the modified program on a non-triggering input, and verify that it runs correctly. This ensures that we didn't accidentally break the program in a way we weren't expecting. Second, we run it on the triggering input and check that it causes a crash (a segfault or bus error). If it passes both tests we deem it a valid bug. This could miss some valid bugs, of course – not all memory corruptions will cause the program to crash – but we're interested mainly in bugs that we can easily prove are real. Another approach might be to run the buggy program under Address Sanitizer and check to see if it flags any memory errors. After validation, we find that LAVA is able to inject 159 bugs into the toy program, for a yield of around 39%.

Let's look at an example bug (I've cleaned up the source a little bit by hand to make it easier to read; programmatically generated code is not pretty):

On lines 6–15, after parsing the file header, we add code that saves off the value of the reserved field3, which our analysis correctly told us was dead, uncomplicated, and available in head.reserved. Then, on line 20, we retrieve the value and conditionally add it to the pointer ent that is being passed to consume_record (checking the value in both possible byte orders, because endianness is hard). When consume_record tries to access fields inside the file_entry, it crashes. In this case, the DUA and attack point were in the same function, and so the use of a global variable was not actually necessary, but in a larger program the DUA and attack point could be in different functions or even different compilation units.

If you like, you can download all 407 buggy program versions, along with the original source code and triggering inputs. Note that the current implementation does not make any attempt to hide the bugs from human eyes, so you will very easily be able to spot them by looking at the source code.

Next Time

Having developed a bug injection system, we would like to know how well it performs. In the next post, we'll examine questions of evaluation: how many bugs can we inject, and how do the liveness and taint compute measures influence the number of viable bugs? How realistic are the bugs? (much more complicated than it may first appear!) And how effective are some common bug-finding techniques like symbolic execution and fuzzing? We'll explore all these and more.

1 Having worked with dynamic program analysis for so long, I sometimes forget how ridiculous the term "dynamic taint analysis" is. If you're looking for another way to say the same thing, you can use "information flow" but dynamic taint analysis is the name that seems to have stuck.↩2 Getting taint information by instrumenting the source works, but has a few drawbacks. Most notably, it causes a huge increase in the size of the source program, and slows it down dramatically. We're currently finishing up a new method, pri_taint, which can do the taint queries on uninstrumented programs as long as they have debug symbols. This should allow LAVA to scale to larger programs like Firefox.↩3 The slightly weird ({ }) construct is a non-standard extension to C called a statement expression. It allows multiple statements to be executed in a block with control over what the block as a whole evaluates to. It's a nice feature to have available for automatically generated code, as it allows you to insert arbitrary statements in the middle of an expression without worrying about messing up the evaluation.↩

Tuesday, June 7, 2016

This is the first in a series of posts about evaluating and improving bug detection software by automatically injecting bugs into programs. You can find part two, with technical details of our bug injection technique, here.

In this series of posts, I'm going to describe how to automatically put bugs in programs, a topic on which we just published a paper at Oakland, one of the top academic security conferences. The system we developed, LAVA, can put millions of bugs into real-world programs. Why would anyone want to do this? Are my coauthors and I sociopaths who just want to watch the world burn? No, but to see why we need such a system requires a little bit of background, which is what I hope to provide in this first post.

I am sure this will come as a shock to most, but programs written by humans have bugs. Finding and fixing them is immensely time consuming; just how much of a developer's time is spent debugging is hard to pin down, but estimates range between 40% and 75%. And of course these errors can be not only costly for developers but catastrophic for users: attackers can exploit software bugs to run their own code, install malware, set your computer on fire, etc.

Great, so by now we must surely have solved the problem, right? Well, not so fast. We should probably check to see how well these tools and techniques work. Since they're detectors, the usual way would be to measure the false positive and false negative rates. To measure false positives, we can just run one of these tools on our program, go through the output, and decide whether we think each bug it found is real.

The same strategy does not work for measuring false negatives. If a bug finder reports finding 42 bugs in a program, we have no way of knowing whether that's 99% or 1% of the total. And this seems like the piece of information we'd most like to have!

Heartbleed: detectable with static analysis tools, but only after the fact.

To measure false negatives we need a source of bugs so that we can tell how many of them our bug-finder detects. One strategy might be to look at historical bug databases and see how many of those bugs are detected. Unfortunately, these sorts of corpora are fixed in size – there are only so many bugs out there, and analysis tools will, over time, be capable of detecting most of them. We can see how this dynamic played out with Heartbleed: shortly after the bug was found, Coverity and GrammaTech quickly found ways to improve their software so that it could find Heartbleed.

Let me be clear – it's a good thing that vendors can use test cases like these to improve their products! But it's bad when these test cases are in short supply, leaving users with no good way of evaluating false negatives and bug finders with no clear path to improving their techniques.

This is where LAVA enters the picture. If we can find a way to automatically add realistic bugs to pre-existing programs, we can both measure how well current bug finding tools are doing, and provide an endless stream of examples that bug-finding tools can use to get better.

LAVA: Large-scale Automated Vulnerability Analysis

Goals for Automated Bug Corpora

So what do we want out of our bug injection? In our paper, we defined five goals for automated bug injection, requiring that injected bugs

Be cheap and plentiful

Span the execution lifetime of a program

Be embedded in representative control and data flow

Come with a triggering input that proves the bug exists

Manifest for a very small fraction of possible inputs

The first goal we've already discussed – if we want to evaluate tools and enable "hill climbing" by bug finders we will want a lot of bugs. If it's too expensive to add a bug, or if we can only add a handful per program, then we don't gain much by doing it automatically – expensive humans can already add small numbers of bugs to programs by hand.

The next two relate to whether our (necessarily artificial) bugs are reasonable proxies for real bugs. This is a tricky and contentious point, which we'll return to in part three. For now, I'll note that the two things called out here – occurring throughout the program and being embedded in "normal" control and data flow – are intended to capture the idea that program analyses will need to do essentially the same reasoning about program behavior to find them as they would for any other bugs. In other words, they're intended to help ensure that getting better at finding LAVA bugs will make tools better at understanding programs generally.

The fourth is important because it allows us to demonstrate, conclusively, that the bugs we inject are real problems. Concretely, with LAVA we can demonstrate an input for each bug we inject that causes the program to crash with a segfault or bus error.

The final property is critical but not immediately obvious. We don't want the bugs we inject to be too easy to find. In particular, if a bug manifests on most inputs, then it's trivial to find it – just run the program and wait for the crash. We might even want this to be a tunable parameter, so that we could specify what fraction of the input space of a program causes a crash and dial the difficulty of finding the right input up or down.

Ethics of Bug Injection

A common worry about bug injection is that it could be misused to add backdoors into legitimate software. I think these worries are, for the most part, misplaced. To see why, consider the goals of a would-be attacker trying to sneak a backdoor into some program. They want:

A way to get the program to do something bad on some secret input.

Not to get caught (i.e., to be stealthy, and for the bugs to be deniable).

Looking at (1), it's clear that one bug suffices to achieve the goal; there's no need to add millions of bugs to a program. Indeed, adding millions of bugs harms goal (2) – it would require lots of changes to the program source, which would be very difficult to hide.

In other words, the benefit that LAVA provides is in adding lots of bugs at scale. An attacker that wants to add a backdoor can easily do it by hand – they only need to add one, and even if it takes a lot of effort to understand the program, that effort will be rewarded with extra stealth and deniability. Although the bugs that LAVA injects are realistic in many ways, they do not look like mistakes a programmer would have naturally made, which means that manual code review would be very likely to spot them.

(There is one area where LAVA might help a would-be attacker – the analysis we do to locate portions of the program that have access to attacker controlled input could conceivably speed up the process of inserting a backdoor by hand. But this analysis is quite general, and is useful for far more than just adding bugs to programs.)

The Road Ahead

The next post will discuss the actual mechanics of automated bug injection. We'll see how, using some new taint analyses in PANDA we can analyze a program to find small modifications that cause attacker-controlled input to reach sensitive points in the program and selectively trigger memory safety errors when the input is just right.

Once we understand how LAVA works, the final post will be about evaluation: how can we tell if LAVA succeeded in its goals of injecting massive numbers of realistic bugs? And how well do current bug-finders fare at finding LAVA bugs?