Archive for the ‘Verification’ Category

Matt Might gave a nice intro to QuickCheck via testing red-black trees recently. Of course, QuickCheck has been around for over a decade now, but it’s still useful (if underused–why aren’t you QuickChecking your programs!?).

SmartCheck focuses on efficiently shrinking and generalizing large counterexamples. I thought it’d be fun to try some of Matt’s examples with SmartCheck.

The kinds of properties Matt Checked really aren’t in the sweet spot of SmartCheck, since the counterexamples are so small (Matt didn’t even have to define instances for shrink!). SmartCheck focuses on shrinking and generalizing large counterexamples.

SmartCheck is only interesting for failed properties, so let’s look at an early example in Matt’s blog post where something goes wrong. A lot of the blog post focuses on generating sufficiently constrained arbitrary red-black trees. In the section entitled, “A property for balanced black depth”, a property is given to check that the path from the root of a tree to every leaf passes through the same number of black nodes. An early generator for trees fails to satisfy the property.

To get the code to work with SmartCheck, we derive Typeable and Generic instances for the datatypes, and use GHC Generics to automatically derive instances for SmartCheck’s typeclass. The only other main issue is that SmartCheck doesn’t support a `forall` function like in QuickCheck. So instead of a call to QuickCheck such as

> quickCheck (forAll nrrTree prop_BlackBalanced)

We change the arbitrary instance to be the nrrTree generator.

Because it is so easy to find a small counterexample, SmartCheck’s reduction algorithm does a little bit of automatic shrinking, but not too much. For example, a typical minimal counterexample returned by SmartCheck looks like

T R E 2 (T B E 5 E)

which is about as small as possible. Now onto generalization!

There are three generalization phases in SmartCheck, but we’ll look at just one, in which a formula is returned that is universally quantified if every test case fails. For the test case above, SmartCheck returns the following formula:

forall values x0 x1:
T R E 2 (T B x1 5 x0)

Intuitively, this means that for any well-typed trees chosen that could replace the variables x0 and x1, the resulting formula is still a counterexample.

The benefit to developers is seeing instantly that those subterms in the counterexample probably don’t matter. The real issue is that E on the left is unbalanced with (T B E 5 E) on the right.

One of the early design decisions in SmartCheck was focus on structurally shrinking data types and essentially ignore “base types” like Int, char, etc. The motivation was to improve efficiency on shrinking large counterexamples.

But for a case like this, generalizing base types would be interesting. We’d hypothetically get something like

further generalizing the counterexample. It may be worth adding this behavior to SmartCheck.

SmartCheck’s generalization begins to bridge the gap from specific counterexamples to formulas characterizing counterexamples. The idea is related to QuickSpec, another cool tool developed by Claessen and Hughes (and SmallBone). Moreover, it’s a bridge between testing and verification, or as Matt puts it, from the 80% to the 20%.

I listened to a Freakonomics podcast the other day in which a story about hand-washing at Cedars-Sinai Medical Center was told. It turns out that of all the staff, the doctors had the lowest rate of hand-washing (around 65%, if I recall correctly) in the hospital. This was surprising, since the doctors supposedly have the best education and so should know the most about the dangers of bacterial infection.

The chief of the staff tried different means to increase hygiene. Incentives like gift cards were tried. Then he tried another approach: shame. Doctors had their hands cultured so that they could see the bacteria on their hands. Images of the cultures were even used as a hospital-wide screensaver on the computers.

I listened to this wondering if there are any parallels to formal verification engineers and researchers. Ostensibly, we are the experts on the perils of software bugs. Academic papers often begin with some story about bugs in the wild, motivating the new formal verification technique described therein.

But to what extent do we practice what we preach? How many of us might write a Python script with no unit tests? C/C++ programs without running Lint/CppCheck? Compile without fixing all the -Wall warnings the compiler emits?

These kinds of activities represent the lowest rung of software assurance; they’re the “hand washing of software assurance”, if you will. I’m certainly guilty myself of not always practicing good “software hygiene”. The justifications you might give for failing to do so is that you’re just writing a program for personal use, or you’re writing prototype software, “just for research”. I could imagine doctors telling themselves something similar: “I always scrub thoroughly before a major surgery, but it’s not such a big deal for a simple office visit.” But this justification can become a slippery slope.

It could be an interesting empirical study to analyze the software quality of open-source formal verification tools to see just how buggy they are (spoiler alert: they are buggy). What I’m interested in are not deep flaws that might be caught via theorem-proving, but simple ones that could be caught with a little better test coverage or lightweight static analysis.

I’ve been working on a Haskell library for testing Haskell programs I call SmartCheck. SmartCheck is focused on testing algebraic data and generalizing counterexamples found. Below is the README for SmartCheck, which I have located on GitHub (I haven’t put it on Hackage yet). The following is a high-level explanation that doesn’t go into details about the algorithms or implementation (that’s another post!).

I’d be interested in feedback on

Real-world examples to try SmartCheck on.

Whether there are other interesting ways to generalize counterexamples.

If there’s similar work out there I should know about (in addition to QuickCheck and SmallCheck.

Your experiences, if you try the library out.

Thanks!

Synopsis

SmartCheck is a smarter QuickCheck, a powerful testing library for Haskell. The purpose of SmartCheck is to help you more quickly get to the heart of a bug and to quickly discover each possible way that a property may fail.

SmartCheck is useful for debugging programs operating on algebraic datatypes. When a property is true, SmartCheck is just like QuickCheck (SmartCheck uses QuickCheck as a backend). When a property fails, SmartCheck kicks into gear. First, it attempts to find a minimal counterexample to the property is a robust, systematic way. (You do not need to define any custom shrink instances, like with QuickCheck, but if you do, those are used. SmartCheck usually can do much better than even custom shrink instances.) Second, once a minimal counterexample is found, SmartCheck then attempts to generalize the failed value d by replacing d‘s substructures with new values to make d', and QuickChecking each new d'. If for each new d' generated, the property also fails, we claim the property fails for any substructure replaced here (of course, this is true modulo the coverage of the tests).

SmartCheck executes in a real-eval-print loop. In each iteration, all values that have the “same shape” as the generalized value are removed from possible created tests. The loop can be iterated until a fixed-point is reached, and SmartCheck is not able to create any new values that fail the property.

A typical example

In the package there is an examples directory containing a number of examples. Let’s look at the simplest, Div0.hs.

Because SmartCheck performs data-generic operations using GHC.Generics we have to derive Typeable and Generic. To use GHC.Generics, you also need the following pragmas: and the single automatically-derived instance:

In this example, we won’t redefine any of QuickCheck’s standard arguments, but it’s certainly possible. the treeShow field tells SmartCheck whether you want generalized counterexamples shown in a tree format or printed as a long string (the default is the tree format).

Oh, that’s confusing, and for such a simple property and small datatype! SmartCheck takes the output from QuickCheck and tries systematic shrinking for the one failed test-case, kind of like SmallCheck might. We get the following reduced counterexample:

Ok, that’s some progress! Now SmartCheck attempts to generalize this (local) minimal counterexample. SmartCheck has two generalization steps that we’ll explain separately although SmartCheck combines their results in practice (you can turn off each kind of generalization in the flags). First, SmartCheck tries to generalize values in the shrunk counterexample. SmartCheck returns

Ahah! We see that for any possible subvalues x0, the above value fails. Our precondition divSubTerms did not account for the possibility of a non-terminal divisor evaluating to 0; we only pattern-matched on constants.

In addition, SmartCheck tries to do something I call constructor generalization. For a datatype with a finite number of constructors, the idea is to see if for each subvalue in the counterexample, there is are subvalues that also fail the property, using every possible constructor in the datatype. So for example, for our counterexample above

We find another counterexample; this time, the main constructor is addition.

We might ask SmartCheck to find another counterexample:

...
*** Extrapolating ...
*** Could not extrapolate a new value; done.

At this point, SmartCheck can’t find a newly-shaped counterexample. (This doesn’t mean there aren’t more—you can try to increase the standard arguments to QuickCheck to allow more failed test before giving up (maxDiscard) or increasing the size of tests (maxSize). Or you could simply just keep running the real-eval-print loop.)

Looking to make a statement this holiday season? You could try to win the office “ugly holiday sweater” contest. Or, you could play “Jingle Bells” on your Arduino microcontroller, using Haskell. This post is about the latter.

We’re going to write this small program using the Copilot embedded domain-specific language (on Hackage and the source on Github). Copilot is a stream language that allows you to generate embedded C code from programs written essentially as Haskell lists (using Atom as a backend for the C code generation). This post is about how to use Copilot/Haskell (v. 1.0) to make your embedded C programming easier and more likely to be correct. Here’s what we’re going to do—please don’t look too closely at my soldering, and turn the volume up, since a piezo speaker isn’t loud:

(For the impatient, the Haskell file is here, and the generated .c and .h files are here and here, respectively.)

We’re going to essentially recreate this C/Wiring program, plus flash some LEDs, but hopefully in a easier, safer way. We need to manage three tasks:

There are two classes of variables: Copilot variables that will refer to streams (infinite lists), and external variables, which can refer to data from C (including the return values of functions, global variables, and arrays). The constructors are mnemonics for the type of the variables; for example, varI32 is a variable that will refer to a stream of Int32s. Similarly, extArrI32 is an external variable referring to a C array of Int32s (i.e., int32_t). Notice the idx argument—it is the stream of values from which the index into the array is drawn (constants can also be used for indexes).

And that’s basically it. There are two parts to the program, the definition of Copilot streams, which manage data-flow and control, and triggers, which call an external C function when some property is true. Copilot streams look pretty much like Haskell lists, except that functions are automatically lifted to the stream level for convenience. Thus, instead of writing,

x = [0] ++ map (+1) x

in Copilot, you simply write

x .= [0] ++ x + 1

Similarly for constants, so the Copilot stream definition

playNote .= true

lifts the constant true to an infinite stream of true values. The function mux is if then else—mux refers to a 2-to-1 multiplexer. So that means that the stream odd takes the value of green when idx is odd, and red otherwise, where green and red refer to the pins controlling the respective LEDs.

Just to round out the description of the other defined streams, idx is the index into the C arrays containing the notes and beats, respectively—that’s why we perform modular arithmetic. The stream duration tells us how long to hold a note; 300 is a magic “tempo” constant.

Now for the triggers. Each of our triggers “fires” whenever the stream playNote is true; in our case, because it is a constant stream of trues, this happens on each iteration. So on each iteration, the C functions digitalWrite and playTone are called with the respective function arguments (‘<>‘ separates arguments). The function digitalWrite is a function that is part of the Wiring language, which is basically C with some standard libraries, from which digitalWrite comes. We’ll write playTone ourselves in a second.

The C Code

We need a little C code now. We could write this directly, but we’ll just do this in Haskell, since there’s so little we need to write—the Copilot program handles most of the work. But a caveat: it’s a little ugly, since we’re just constructing Haskell strings. Here are a few functions (included with Copilot) to make this easier, and here are some more. (If someone properly writes a package to write ad-hoc C code from Haskell, please leave a comment!)

First, we need more magic constants to give the frequency associated with notes (a space is a rest).

The other main piece of C code we need to write is the function playtone. The piezo speaker is controlled by pulse width modulation, basically meaning we’ll turn it on and off really fast to simulate an analogue signal. Here is it’s definition (using a little helper Haskell function to construct C functions):

HIGH, LOW, digitalWrite, and delayMicroseconds are all part of the Wiring standard library. That ifdef is for verification purposes, which we’ll describe in just a bit.

Besides a little more cruft, that’s it!

Test, Build, Verify

“Jersey Shore” may have introduced you to the concept of gym, tan, laundry, but here we’ll stick to test, build, verify. That is, first we’ll test our program using the Copilot interpreter, then we’ll build it, then we’ll prove the memory safety of the generated C program.

This calls the Copilot interpreter, saying to unroll cycleSong 20 times. Because the Copilot program samples some external C values, we need to provide that data to the interpreter. Fortunately, we have those arrays already defined as Haskell lists. Executing this, we get the following:

Good, it looks right. (period isn’t a Copilot variable but just keeps track of what round we’re on.)

Build. To build, we generate the C code from the Copilot program, then we’ll use a cross-compiler targeting the ATmega328. The easiest way (I’ve found) is via Homin Lee’s Arscons. Arscons is based on Scons, a Python-based build system. Arscons makes three assumptions: (1) the program is written as a Wiring program (e.g., there’s a loop() function instead of a main() function is the main difference), (2) the extension of the Wiring program is .pde, and (3) the directory containing the XXX.pde is XXX. For us, all that really means is that we have to change the extension of the generated program from .c to .pde. So we define

Verify. Is the generated C program memory safe? Wait… What do I mean by ‘memory safe’? I’ll consider the program to be memory safe if the following hold:

No arithmetic underflows or overflows.

No floating-point not-a-numbers (NaNs).

No division by zero.

No array bounds underflows or overflows.

No Null pointer dereferences.

Of course this is an approximates memory-safety, but it’s a pretty good start. If the compiler is built correctly, we should be pretty close to a memory-safe program. But we want to check the compiler, even though Haskell’s type system gives us some evidence of guarantees already. Furthermore, the compiler knows nothing about arbitrary C functions, and it doesn’t know how large external C arrays are.

We can prove that the program is memory safe. We call out to CBMC, a C model-checker developed primarily by Daniel Kröning. This is whole-program analysis, so we have to provide the location of the libraries. We define

which calls cbmc on our generated C program. Let me briefly explain the arguments. First we give the name of the C program.

Then we say how many times to unroll the loops. This requires a little thinking. We want to unroll the loops enough times to potentially get into a state where we might have an out of bounds array access (recall that the Copilot stream idx generates indexes into the arrays). The length of the C arrays are given by length notesLst. When compiling the Copilot program (calling the module’s main function, a periodic schedule is generated for the program). From the schedule, we can see that idx is updated every fourth pass through the loop. So we unwind it enough loop passes for the counter to have the opportunity to walk off the end of the array, plus a few extra passes for setup. This is a minimum bound; you could of course over-approximate and unroll the loop, say, 1000 times.

Regarding loop unrolling, remember that #ifdef from the definition of playtone()? We include that to reduce the difficulty of loop unrolling. playtone() gets called on every fourth pass through the main loop, and unrolling both loops is just too much for symbolic model-checking (at least on my laptop). So we give ourselves an informal argument that the loop in playtone() won’t introduce any memory safety violations, and the #ifdef gives us one iteration through the loop if we’re verifying the system. A lot of times with embedded code, this is a non-issue, since loops can just be completely unrolled.

The -D flag defines a preprocessor macro, and the -I defines a include path. Finally, the --function flag gives the entry point into the program. Because we generated a Wiring program which generates a while(1) loop for us through macro magic, we have to create an explicit loop ourselves for verification purposes.

If you’re interested in seeing what things look like when they fail, change the idx stream to be

I was recently interviewed by Flight International magazine, one of the oldest aviation news magazines. Their reporter, Stephen Trimble, was writing on the Air Force’s Chief Scientist’s recent report stating that new software verification and validation techniques are desperately needed.

I’ve recently got a few technical papers out the door involving Haskell, physical-layer protocols, SMT, security modeling, and run-time verification of embedded systems (phew!). One of the benefits of industrial research is getting your hands involved in a lot of different research projects.

This paper is about using Haskell to model physical-layer protocols and using QuickCheck to test them. Physical-layer protocols are used to transmit bits from one clock-domain to another and are used in ethernet, credit card swipers, CD players, and so on. The gist of the paper is that even though Haskell is pure & lazy, it works great for modeling and testing real-time protocols and even for computing reliability statistics. I presented it at the Haskell Symposium in September ’09, which was associated with ICFP. (The talk video is online!) The paper is a short experience report—indeed, it is the only experience report that was accepted at the symposium. The Haskell Symposium was an entertaining and friendly environment for presenting.

This paper actually precedes the Haskell paper, but it extends the results by describing how to formally verify physical-layer protocols using SMT solvers and k-induction (we use SRI’s SAL tool in this work). The paper is a journal article accepted at Formal Aspects of Computing. You’ll find at least two things interesting about this article: (1) For all the excitement about SMT, there don’t seem to be a lot of great examples demonstrating its efficacy—the problems described in this paper were (laboriously!) verified using theorem-provers by others previously, and our approach using SMT is much more automated. (2) We provide a nice general model of cross clock-domain circuits and particularly metastability.

So if you can verify physical-layer protocols, why model them in Haskell and QuickCheck them (as we did above)? There are at least two reasons. First, if you’re using SMT, then your timing constraints need to be linear inequalities to be decidable. For systems that with nonlinear constraints, QuickCheck might be your only recourse. Second, QuickCheck gives you concrete counterexamples and test-cases that you can use to test implementations (SMT solvers often return symbolic counterexamples).

This paper describes a simple model for analyzing information flow in a system (where a “system” could be a program, a network, an OS, etc.). The main portion of the paper describes heuristics based on graph algorithms for deciding what sort of information flow policies you might want to enforce in your system. In general, there’s been a lot of work on analyzing access control policies but not so much work in figuring out what kind of policy you should have in the first place (if you know of such work, please tell me!). The paper isn’t deep, and it’s also preliminary insofar as I don’t describe building a complex system using the techniques. Still, there’s a small (Haskell) script available that implements the algorithms described; I’d love to see these analyses find their way into a tool to help system designers build secure systems.

Finally, this report describes the field of run-time monitoring (or run-time verification) as it applies to safety-critical real-time embedded software. Run-time monitoring compliments formal verification since when a system is too complicated to verify a priori, it can be monitored at run-time to ensure it conforms to its specification. Not a lot of work has been done on monitoring software that’s hard real-time, distributed, or fault-tolerant—which ironically could benefit the most from run-time monitoring. The report should serve as a nice, gentle introduction. The report should be published soon as a NASA Contractor Report—the work was done under a NASA-sponsored project for which I’m the PI.

Don’t hesitate to give me feedback on any of these papers. Ok, time to fill up the queue again…

Here’s a simple challenge-problem for model-checking Boolean functions: Suppose you want to compute some Boolean function , where represents 0 or more Boolean arguments.

Let , , , range over 2-ary Boolean functions, (of type ), and suppose that is a fixed composition of , , , . (By the way, I’m going to talk about functions, but you can think of these as combinatorial circuits, if you prefer.)

Our question is, “Do there exist instantiations of , , , such that for all inputs, ?

What is interesting to me is that our question is quantified and of the form, “exists a forall b …”, and it is “higher-order” insofar as we want to find whether there exist satisfying functions. That said, the property is easy to encode as a model-checking problem. Here, I’ll encode it into SRI’s Symbolic Analysis Laboratory (SAL) using its BDD engine. (The SAL file in its entirety is here.)

To code the problem in SAL, we’ll first define for convenience a shorthand for the built-in type, BOOLEAN:

Now we need an application function that takes an element f from B2ARY and two Boolean arguments, and depending on f, applies the appropriate 2-ary Boolean function:

app(f: B2ARY, a: B, b: B): B =
IF f = False THEN FALSE
ELSIF f = Nor THEN NOT (a OR b)
ELSIF f = NorNot THEN NOT a AND b
ELSIF f = NotA THEN NOT a
ELSIF f = AndNot THEN a AND NOT b
ELSIF f = NotB THEN NOT b
ELSIF f = Xor THEN a XOR b
ELSIF f = Nand THEN NOT (a AND b)
ELSIF f = And THEN a AND b
ELSIF f = Eqv THEN NOT (a XOR b)
ELSIF f = B THEN b
ELSIF f = NandNot THEN NOT a OR b
ELSIF f = A THEN a
ELSIF f = OrNot THEN a OR NOT b
ELSIF f = Or THEN a OR b
ELSE TRUE
ENDIF;

Let’s give a concrete definition to f and say that it is the composition of five 2-ary Boolean functions, f0 through f4. In the language of SAL:

Now, we’ll define a module m; modules are SAL’s building blocks for defining state machines. However, in our case, we won’t define an actual state machine since we’re only modeling function composition (or combinatorial circuits). The module has variables corresponding the function inputs, function identifiers, and a Boolean stating whether f is equivalent to its specification (we’ll label the classes of variables INPUT, LOCAL, and OUTPUT, to distinguish them, but for our purposes, the distinction doesn’t matter).

Notice we’ve universally quantified the free variables in spec and the definition of f.

Finally, all we have to do is state the following theorem:

instance : THEOREM m |- NOT equiv;

Asking whether equiv is false in module m. Issuing

$ sal-smc FindingBoole.sal instance

asks SAL’s BDD-based model-checker to solve theorem instance. In a couple of seconds, SAL says the theorem is proved. So spec can’t be implemented by f, for any instantiation of f0 through f4! OK, what about

spec(b0: B, b1: B, b2: B, b3: B, b4: B, b5: B): B =
TRUE;

Issuing

$ sal-smc FindingBoole.sal instance

we get a counterexample this time:

f0 = True
f1 = NandNot
f2 = NorNot
f3 = And
f4 = Xor

which is an assignment to the function symbols. Obviously, to compute the constant TRUE, only the outermost function, f0, matters, and as we see, it is defined to be TRUE.

By the way, the purpose of defining the enumerated type B2ARY should be clear now—if we hadn’t, SAL would just return a mess in which the value of each function f0 through f4 is enumerated:

I recently presented a position paper at a workshop addressing software challenges for unpiloted air vehicles (UAVs). The paper is coauthored with Don Stewart and John Van Enk. From a software perspective, UAVs (and aircraft, in general) are fascinating. Modern aircraft are essentially computers that fly, with a pilot for backup. UAVs are computers that fly, without the backup.

Some day in the not-so-distant future, we may have UPS and FedEx cargo planes that are completely autonomous (it’ll be a while before people are comfortable with a pilot-less airplane). These planes will be in the commercial airspace and landing at commercial airports. Ultimately, UAVs must be transparent: an observer should not be able to discern whether an airplane is human or computer controlled by its behavior.

You won’t be surprised to know a lot of software is required to make this happen. To put things in perspective, Boeing’s 777 is said to contain over 2 million lines of newly-developed code; the Joint Strike Fighter aircraft is said to have over 5 million lines. Next-generation UAVs, with pilot AI, UAV-to-UAV and UAV-to-ground communications, and arbitrary other functionality, will have millions more lines of code. And the software must be correct.

In our paper, we argue that the only way to get a hold on the complexity of UAV software is through the use of domain-specific languages (DSLs). A good DSL relieves the application programmer from carrying out boilerplate activities and providers her with specific tools for her domain. We go on and advocate the need for lightweight DSLs (LwDSLs), also known as embedded DSLs. A LwDSL is one that is hosted in a mature, high-level language (like Haskell); it can be thought of as domain-specific libraries and domain-specific syntax. The big benefit of a LwDSL is that a new compiler and tools don’t need to be reinvented. Indeed, as we report in the paper, companies realizing the power of LwDSLs are quietly gaining a competitive advantage.

Safety-critical systems, like those on UAVs, combine multiple software subsystems. If each subsystem is programmed in its own LwDSL hosted in the same language, then it is easy to compose testing and validation across subsystem boundaries. (Only testing within each design-domain just won’t fly, pun intended.)

The “LwDSL approach” won’t magically make the problems of verifying life-critical software, but “raising the abstraction level” must be our mantra moving forward.