A large fraction of the costs of
maintaining software applications is associated with detecting and
fixing errors introduced by patches. Patches are prone to introducing
failures [4, 10], and as a result users are often
reluctant to upgrade their software to the most recent
version [3], relying instead on older
versions which typically expose a reduced set of features and
frequently remain susceptible to critical bugs and security vulnerabilities.
To properly test a patch, each line of code in the patch, and all the
new behaviours introduced by the patch, should be covered by at least
one test case. However, the large effort involved in coming up with
relevant test cases means that such thorough testing rarely happens in
practice.

The Software Reliability Group at Imperial College London has invested a significant amount of effort in the
last few years on devising techniques and tools for comprehensively testing
software patches. The main focus of our work has been on developing dynamic
symbolic execution techniques that automatically detect bugs and augment
program test suites. In this article, we discuss two interrelated projects,
KATCH [7] and Shadow [9], whose objectives and relationship are depicted in
Figure 1. In a nutshell, KATCH aims to generate program inputs that cover the
lines of code added or modified by a patch, while Shadow further modifies
existing test inputs in order to trigger the new behaviour introduced by a
patch.

It is rather surprising how much code is added or modified in mature
software projects without a single test that exercises it [8]. In response to this poor
culture of testing software patches, we have designed KATCH, a system whose
goal is to automatically generate inputs that exercise the lines of code of a
patch. KATCH is based on symbolic execution [2], a program analysis
technique that can systematically explore a program’s possible executions.
Symbolic execution replaces regular program inputs with symbolic variables that
initially represent any possible value. Whenever the program executes a
conditional branch instruction that depends on symbolic data, the possibility of
following each branch is analysed and execution is forked for each feasible
branch.
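The forking behaviour described above can be illustrated with a toy sketch. This is not the actual KATCH/KLEE implementation: predicates over a single symbolic integer x stand in for branch conditions, and a brute-force search over a small domain stands in for a constraint solver.

```python
def feasible(constraints, domain=range(0, 50)):
    """Toy 'solver': return a concrete witness satisfying all
    path constraints, or None if the path is infeasible."""
    for x in domain:
        if all(c(x) for c in constraints):
            return x
    return None

def explore(branches, constraints=()):
    """Fork at each branch that depends on symbolic data, keeping
    only feasible paths; return (path-constraints, witness) pairs."""
    if not branches:
        return [(constraints, feasible(constraints))]
    cond, rest = branches[0], branches[1:]
    paths = []
    for pred in (cond, lambda x, c=cond: not c(x)):  # then / else side
        extended = constraints + (pred,)
        if feasible(extended) is not None:  # prune infeasible branches
            paths += explore(rest, extended)
    return paths

# A program with two symbolic branches yields up to four feasible paths.
paths = explore([lambda x: x % 2 == 0, lambda x: x > 10])
```

Each returned path carries a concrete witness input, which is exactly what a symbolic executor emits as a test case for that path.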

In its standard instantiation, symbolic execution aims to achieve high
coverage of the entire program. Instead, the challenge for KATCH is to focus on
the patch code in order to quickly generate inputs that cover it. KATCH
accomplishes this goal by starting from existing test cases that partially
cover or come close to covering the patch (such test cases often exist, so why
not make use of them?), and combining symbolic execution with targeted
heuristics that aim to direct the exploration toward uncovered parts of the
patch.
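One simplified way to picture the seed-selection step is as a distance computation over the control-flow graph: among the existing tests, prefer the one whose execution gets closest to the uncovered patch block. The sketch below is hypothetical (the block names, the `reached` map, and plain BFS distance are all illustrative assumptions; KATCH's actual heuristics are richer).

```python
from collections import deque

def cfg_distance(cfg, start, target):
    """BFS distance in a control-flow graph (adjacency dict) from a
    block reached by a test to the uncovered patch block."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for succ in cfg.get(node, ()):
            if succ not in seen:
                seen.add(succ)
                queue.append((succ, dist + 1))
    return float("inf")  # target unreachable from this block

def pick_seed(reached, cfg, target):
    """Pick the existing test whose execution comes closest to the
    target, as a starting point for directed exploration."""
    return min(reached, key=lambda t: cfg_distance(cfg, reached[t], target))

# Illustrative CFG and a map from each test to the block it reaches.
cfg = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["patch"]}
reached = {"test1": "a", "test2": "d"}
```

Here `pick_seed` would choose test2, since it already executes a block adjacent to the patch.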

We evaluated KATCH on 19 programs
from the GNU diffutils, binutils and findutils tool suites, a set of
mature and widely-used programs installed on virtually all UNIX-based
distributions. We included every patch written over an
extended period of time, amounting to approximately six
years cumulatively across all programs. Such an unbiased selection is essential if
one is to show both the strengths and weaknesses of a technique.

The results are presented in Table 1, which shows the number of basic blocks
across all patches for each application suite, and the fraction of basic blocks exercised
by the existing test suite and the existing test suite plus KATCH.
It also shows the number of crash bugs revealed by KATCH.
The highlights of our results are: (1) KATCH can significantly increase the patch coverage in
these applications in a completely automatic fashion, and (2) there remain
significant limitations, both engineering and fundamental: for instance, patch coverage for
binutils still stands at only 33% of basic blocks. Nevertheless, the additional coverage achieved
by KATCH was enough to reveal 14 previously-unknown crash bugs in these
programs.

Now consider a hypothetical scenario in which KATCH can generate a test input for
every single line in a patch. Would that be sufficient to test
the patch adequately? Let us illustrate the question with a simple example. Consider
the two versions of code below, in which the second version changes only x % 2
to x % 3, and let us assume that this statement can be executed at most
once by a deterministic program and that the variable x is only referenced
here:
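The original listing is not reproduced here; the following sketch is consistent with the change described above (the placeholder return values "then" and "else" stand in for whatever the two branches actually do, and the explicit == 0 test is an assumption consistent with the inputs discussed next):

```python
def old_version(x):
    if x % 2 == 0:        # old branch condition
        return "then"
    return "else"

def new_version(x):
    if x % 3 == 0:        # patched branch condition
        return "then"
    return "else"
```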

Let us assume that the developers chose inputs x = 6 and x = 7 to test
the patch. Do these inputs comprehensively test it? One's first
reaction might be to say yes, since they cover each side of the branch in both
the old and the new version. In fact, however, these are poor choices:
each input follows the same side of the branch in both versions, making
the program behave identically before and after the patch. Inputs such as x = 8 or x = 9 are
much better choices, as they trigger different behaviour at the code level: e.g. x =
8 follows the ‘then’ side in the old version, but the ‘else’ side in the new
version.

Shadow symbolic execution is a new technique that assists developers with
the generation of such inputs. Our tool Shadow starts with an input that
exercises the patch, either constructed manually or synthesised by KATCH, and
generates new inputs that trigger different behaviour at the code level in the
unpatched and patched versions. Like KATCH, Shadow is based on symbolic
execution, this time augmented with the ability to perform a “four-way fork”, as
shown in Figure 2. Whenever we reach a branch condition that evaluates to
semantically-different expressions in the two versions (say, old in the old
version and new in the new version), instead of forking execution into two
paths based on the behaviour of the new version, as in standard symbolic
execution, we fork into up to four paths. On two of these paths, the two
versions behave identically (denoted by same in the figure): both versions
take either the then branch (new ∧ old) or the else branch (¬new ∧ ¬old). On the
other two paths, the executions of the two versions diverge (denoted by diff in the
figure): either the new version takes the then branch and the old version
the else branch (new ∧ ¬old), or vice versa (¬new ∧ old). The
last two cases are the ones of interest to Shadow: when it encounters them, it
generates a new input that triggers the divergent behaviour between the two
versions.
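For the running example (assuming the old condition is x % 2 == 0 and the new one x % 3 == 0), the four regions can be made concrete by checking which side of the branch each version takes. Shadow works symbolically, with a constraint solver picking inputs in each region; this brute-force classification only illustrates the partition.

```python
def old_cond(x):
    return x % 2 == 0

def new_cond(x):
    return x % 3 == 0

def fork_case(x):
    """Classify an input into one of the four-way-fork regions."""
    o, n = old_cond(x), new_cond(x)
    if n and o:
        return "same: both then"            # new ∧ old
    if not n and not o:
        return "same: both else"            # ¬new ∧ ¬old
    if n and not o:
        return "diff: new then, old else"   # new ∧ ¬old
    return "diff: old then, new else"       # ¬new ∧ old
```

The "diff" regions are exactly the ones for which Shadow emits new test inputs.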

Shadow is not a fully automatic technique, requiring developers to create a single
unified program in which the two versions are merged via ‘change’ annotations. For
instance, in our example, the two versions would be unified by creating the if
statement if change(x % 2, x % 3), in which the first argument represents the
code expression from the old version and the second argument the corresponding
expression from the new version. While mapping program elements across versions
is a difficult task [5], we found that in practice the process can be
made sufficiently precise and can be (partially) automated using predefined
patterns.
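The unified-program idea can be sketched as follows. This is only an illustration: Shadow's actual annotations are C macros, both expressions are tracked side by side during shadow execution rather than selected by a flag, and the == 0 form of the condition is an assumption consistent with the earlier discussion.

```python
RUNNING_NEW = True  # which version this execution represents (illustrative flag)

def change(old_expr, new_expr):
    """Toy stand-in for the change() annotation: select the old or new
    expression depending on which version is being executed."""
    return new_expr if RUNNING_NEW else old_expr

def unified(x):
    # The unified if statement from the example, with both versions'
    # conditions merged via the change() annotation.
    if change(x % 2 == 0, x % 3 == 0):
        return "then"
    return "else"
```

With RUNNING_NEW toggled, the same unified source executes as either the old or the new version, which is what lets a single symbolic run reason about both.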

We evaluated Shadow on the 22 patches
from the GNU Coreutils programs included in the CoREBench suite of regression bugs. Like the benchmarks
used to evaluate KATCH, these are mature, widely-used applications
available in most UNIX distributions. We chose the CoREBench patches
because they are known to introduce bugs, and furthermore, the bug
fixes are known.

After applying Shadow, we were able to generate inputs that trigger code-level
divergences for all but one patch. We were further able to generate inputs for which
the two versions generate different outputs, as well as inputs that abort or trigger
memory errors in the new version. Some sample inputs generated by Shadow are
shown in Table 2.

While Shadow was not successful in all cases, the results are promising in terms of
its ability to find regression bugs and augment the test suite in a meaningful
way. Furthermore, even the generated inputs that expose expected divergences are
excellent candidates for addition to the program test suite, and can serve as
useful documentation of program changes.

As most readers would agree, continuously changing a program without having a
single input that exercises those changes is unsustainable. On the other hand, as all
developers know, comprehensively testing program patches is a difficult, tedious, and
time-consuming task. Fortunately, program analysis techniques such as symbolic
execution are becoming increasingly scalable, and can be effectively extended to
generate inputs that exercise program changes (as we did in KATCH) and trigger
different behaviour across versions (as we did in Shadow). More work is
required to scale these approaches to large software systems and integrate them
into the development cycle, but initial results are promising. To stimulate
research in this space, we make our benchmarks and our systems available
for comparison (see https://srg.doc.ic.ac.uk/projects/katch/ and
https://srg.doc.ic.ac.uk/projects/shadow/). More details on our techniques
can be found in our publications [1, 6, 7, 9].