Abstracts for George C. Necula

The EECS Research Summary for 2003

A Compatible Representation for CCured Metadata

CCured is a program transformation system that provides memory safety guarantees for C programs using a combination of static verification and run-time checks [1]. In order to support run-time checks, the instrumented program must store metadata with certain pointers; thus, one-word pointers in the original program are represented as multi-word pointers in the transformed program.

Unfortunately, this multi-word representation is incompatible with the original C representation used by libraries that were not compiled with CCured. To solve this problem, we have designed an alternative representation for CCured metadata in which metadata for a given data structure is stored in a separate data structure that has a similar shape. For each value in the original program, our transformed program contains two values: one for data and one for metadata. Each operation in the original program is split as well, producing one operation on the data value and a parallel operation on the metadata value. An inference algorithm allows us to minimize the number of values within the transformed program that must use this compatible representation.

We are currently using this representation to improve the usability of CCured when applied to large, real-world programs. In addition, we are investigating possible applications of this technique outside the realm of CCured.

CCured is a program transformation system that adds memory safety and type safety guarantees to C programs by verifying statically that memory errors cannot occur and by inserting run-time checks where static verification is insufficient. Programs instrumented by CCured run with modest (typically 10-50%) slowdown, due to the effective combination of static and dynamic checking.

Perhaps the greatest potential impact of CCured is in the domain of security-critical software. Memory safety is an absolute prerequisite for security, and it is the failure of memory safety that is most often to blame for insecurity in deployed software. Further, CCured's relatively modest performance cost makes it plausible for security-critical production systems to use binaries compiled with CCured's run-time checks enabled. We have used CCured to guarantee memory safety for several real-world, security-critical programs, such as bind, sendmail, OpenSSL, and an FTP daemon.

Recent extensions to CCured include:
support for object-oriented programming styles using physical subtyping of structs and run-time type information in pointers; and
support for interoperability with legacy code by changing the representation of metadata. See the related abstract "A Compatible Representation for CCured Metadata," also in this Research Summary.

Elkhound: A Fast, Practical GLR Parser Generator

Scott McPeak
(Professor George C. Necula)

Elkhound is a parser generator based on the Generalized LR (GLR)
parsing algorithm. Because it uses GLR, Elkhound can parse with
any context-free grammar, including those that are ambiguous
or require unbounded lookahead. Due to a novel improvement to the GLR
algorithm, wherein the parser can choose between GLR and ordinary LR
on a token-by-token basis, Elkhound parsers are about as fast as
LALR(1) parsers on deterministic portions of the input. Unlike existing
GLR implementations, Elkhound allows the user to associate arbitrary
action code with reductions, and to directly manage subtree sharing
and merging.

To demonstrate Elkhound's effectiveness, we used it to build a parser
for C++, a language notorious for being difficult to parse. Our C++
parser is small (3500 lines), efficient, maintainable, and elegantly
handles even the most troublesome constructs, delaying
disambiguation until semantic analysis when necessary.

Use of Randomization in Program Analysis

Analyzing programs is provably hard, and we have learned to pay a price
in lost completeness (or precision) and in the complexity and running
time of the analysis. This project explores how randomization can be
used to obtain simplicity and efficiency at the expense of making
soundness probabilistic.

We have used randomized techniques for the problem of discovering
affine equalities in a program [1]. This problem arises in several
program analyses, including compiler optimizations (e.g., constant
propagation, common sub-expression elimination, elimination of
conditionals), discovery of loop invariants, and translation validation.
The key idea of the randomized algorithm for this problem is to
execute a code fragment on a few random inputs, but in such a way that
all paths are covered on each run. This makes it possible to rule out
invalid relationships even with very few runs. The algorithm is based
on two main techniques. First, both branches of a conditional are
executed on each run, and at join points we perform an affine
combination of the joining states. Second, in the branches of an
equality conditional we adjust the data values on the fly to reflect
the truth value of the guarding Boolean expression. This increases the
number of affine equalities that the analysis discovers.
The algorithm is simpler to implement than alternative deterministic
versions, has better computational complexity, and has an extremely
small probability of error for even a small number of runs. This
algorithm is an example of how randomization can provide a trade-off
between the cost and complexity of program analysis, and a small
probability of unsoundness. We are currently exploring if these
ideas can be extended to handle other analyses involving inequalities,
non-linear assignments, memory expressions, and dependent conditionals.

It is also interesting to see if randomized techniques can be helpful
in the context of theorem proving. We have demonstrated this for the
congruence closure problem involving uninterpreted functions and the
theory of linear equalities, and have yet to explore whether these ideas
be extended to handle a combination of other convex theories too.

Verification and Analysis of Reactive Systems

Existing software systems are becoming larger and more complex without becoming safer. Many separate theorem-proving, model-checking, and program analysis techniques tackle the problem of verifying that a program meets a specification. Each of these three methods alone can address only certain subsets of this larger problem. Although a few projects have attempted to combine these three techniques, much unifying work remains to be done. In addition, such techniques often rely on simplifying assumptions (e.g., about pointer arithmetic or type safety) that may not hold in practice, especially for code written in lower-level languages.

While previous approaches concentrated on stand-alone programs with a single entry point, my focus is on reactive systems. Such systems exist in various states over time (e.g., device drivers, web server modules, database extensions) and respond to requests. In such systems both the interfaces between modules and the internal behavior of modules are important. Finally, such techniques can be extended to produce more detailed and useful reports than the traditional error trace.

I propose a two-phase approach for verifying that such a system meets a specification. In the first phase the system is made type- and memory-safe by a combination of static analyses and run-time checks. In essence, this ensures that the system adheres to a universal specification of memory safety and paves the way for more complicated analyses that assume memory-safe programs. Existing techniques usually generate false positives or false negatives in the presence of legacy code that violates such assumptions.

In the second phase the transformed system can be checked against a more detailed specification. This checking process should verify local properties, the behavior of the system over time, and the correct use of interfaces. The verification will use techniques from program analysis, model checking, and theorem proving with an emphasis on automation. When the specification permits it, an error report should contain not just a path demonstrating the error but also detailed suggestions on how to deal with the problem in the original source. Few existing techniques address either what to do when a bug is found or how to model systems that exist over time. If the system does adhere to the specification, I propose to emit a machine-checkable proof of the fact.