[This is the second part of a two-post series. The first part is here.]

Recently, I talked about using program verifiers in teaching proof techniques in undergraduate CS, and my own experience with using the Dafny verifier in an algorithms class. One thing that that post was missing was a student perspective. So I asked three of my students — Julie Eisenberg, Rebecca Smith, and Matthew Dunn-Rankin — what they thought of Dafny. In this post, I summarize their insightful feedback and ponder about its implications.

The takeaway message for me is that verification tools such as Dafny are an excellent way of introducing students to formal proofs. At the same time, some care is needed as we integrate these tools with our pedagogy. In particular, the automated reasoning capabilities of Dafny aren’t an unadulterated blessing, and we must make sure that students understand how to do formal proofs fully manually. We ought to also invest more effort on tools that make the proof process more productive, for example by giving users more feedback on why an attempt at a formal proof failed.

Verification as structured proving

One of the benefits of using program verifiers to teach proofs is that they connect proving to programming, something all CS students know and (likely) love. Just as programming languages ask that you code using a fixed set of language constructs, program verifiers ask you to do proofs using a set of structured proof primitives. Rebecca argued that this could make verification tools especially useful for students who are not naturally theory-inclined:

For context, I am not a theory person, I don’t particularly like proofs in general, and I really don’t have much intuition for them… So, it’s a major point in Dafny’s favor that despite all that, I really liked using Dafny! In fact, the assignment where we used Dafny was my favorite homework of the entire class; I actually enjoyed working on it to the point that once I started I just wanted to continue, and I ended up doing the whole thing in one night about a week before it was actually due…

I think the fact that I liked Dafny so much was due to a combination of factors. Mostly, I think I just liked the structure that it imposed on proofs. I guess this wasn’t so much an artifact of Dafny itself (since we also did proofs of this form by hand in class) but of the specific proof technique of breaking things down into four very concrete pieces (pre-conditions, post-conditions, ranking functions, and loop invariants). To me, this makes the proof much more accessible/easier to reason about, since I could break things down into these very clear-cut categories. But more specific to Dafny, I do think that the particular language that Dafny used was very intuitive — both in the sense that it had nice lightweight syntax, and in that there was a nice correlation between the four concrete pieces of the proof and the four keywords/primitives in the Dafny language (requires, ensures, decreases, invariant, respectively). It felt very logical, so I do think it was helpful in teaching me to think about proofs in this structured way.

Also,… it was refreshing to work on proofs in an environment that felt more familiar, in that it kind of felt like I was just writing code.

Verification for gamification

Julie made the valuable point that verification systems had a “game-like” quality, which could be used to make proofs more appealing to students:

[Dafny] eased me comfortably into the process of verifying programs, because I could kind of guess and check until I got it right. It made it more exciting to write the rules into the program, because it felt like a game to make Dafny go green. I think it could be really useful for engaging more students and getting them interested in proving programs, even freshmen.

Too much automation?

In my last post, I argued that one of the biggest advantages of semi-automated verification tools is that they can use tactics to automate the lower-level details of proofs. Both Rebecca and Julie argued that this could be a double-edged sword. To quote Julie:

The downside [of Dafny] is that all the work that Dafny did is still magic to me. I quickly got to the point where I could easily place the rules correctly into the program in Dafny, but I was really ready to learn how to prove the algorithms myself… I felt like having Dafny had sheltered me from all the interesting stuff that was going on behind the scenes… I’d recommend that professors use Dafny to introduce the topic [of proofs of programs], and to let students play with their ideas and attempt to prove programs while having fun, but then in an advanced course also teach the fundamental techniques that are required in order for tools like Dafny to work.

And here’s Rebecca’s take:

More so than other proofs, doing a proof in Dafny felt a little bit like solving a puzzle, in that (for me at least) it ended up being a somewhat non-linear process where I would start by adding annotations for everything that I knew for sure, and then try to build up from there until things came together. While for me this made it fun, I think this could also be seen as a potential downside, in that it sometimes allowed me to get away with “intuitively guessing”. I don’t mean that I was just randomly guessing, but rather that there were definitely times when, rather than using the methodical, working-backwards approach that you taught in class, I would instead begin with an annotation that I felt was on the right track and end up incrementally tweaking it until it worked. So, in this sense, I do think there’s a danger that a program like Dafny could make it too easy to arrive at the correct result without really understanding the process. Which, from an educational standpoint, I’m sure is not the desired result!

Not enough feedback

Matt thought that Dafny did not offer enough feedback during the proof process:

It wasn’t always clear how to make my program check out. I don’t remember the details of this frustration, but my vague memory is that it would say things like “this assertion is unproven” and I’m like “okaaay, what do you need from me?” I wonder to what extent that’s an issue with the language and to what extent it’s my inexperience. Clearer error messages would be nice, but, then again, I’m not actually sure that it’s possible to be any clearer… the ability to inspect what the prover can infer about any given line would be really cool.

He got around this issue by developing a method for “debugging” proofs:

One small technique I developed near the end of the course was to include little assert statements in the program to sorta move the bar back: if my overall program wasn’t checking out, I’d add an assertion near the end of the program and confirm that the error moved to that line instead.

The takeaway

A clear takeaway from this feedback is that semi-automated program verifiers can be a good “gateway” to proofs. The issues that my students had with Dafny open up some interesting avenues for research.

Proof debugging: Specifically, building better debugging tools for computer-assisted proofs of programs is an interesting research question. Such a debugger would automatically find reasons why an attempt at a formal proof failed, and also suggest ways in which this mistake could be fixed. This research can perhaps build on existing work on automatic explanation of type errors and automated repair of programs.

Proof automation levels: The problem of too much automation is more of a user interface issue. One fix to this is to have automatic proof tactics not be the default but instead be enabled by conscious choice. This is already the case in proof assistants like Coq. As for Dafny, automation there takes several forms. First, Dafny uses a boolean reasoning engine (an SMT solver) to check the validity of verification conditions. Second, it automatically computes the weakest preconditions for straight-line code. Third, it uses some elementary heuristics to infer certain basic invariants and ranking functions. We could then imagine different “levels” of proof automation where some or all of these techniques are permitted. Students could start at a basic level where all sorts of automation are permitted, then progress to advanced levels where they are not.

Making proofs robust:While my students didn’t bring this up, automation can also raise robustness problems: some reasoning tasks are just not automated easily, and heuristics for these problems can break on changes to the problem instance. For instance, if a program uses nonlinear integer arithmetic, then the problem of checking its verification conditions is undecidable. Dafny’s SMT solver (Z3) does have a heuristic for solving nonlinear arithmetic, but this heuristic can break easily. If it does, Dafny currently gives up on the proof. Perhaps it should ask the user for guidance instead. As I will explain in a later post, the use of appropriate language abstractions — notably, functional rather than imperative constructs — can also make automated reasoning more robust. Instructors could use this fact when choosing a programming language for their class.

A longer-term challenge: I note that that even if we fixed the above issues, there’s no way I would be able to use a verification tool as an automated tutor for the entirety of my course. This is because many of the proof techniques that I teach can’t be effectively mechanized using existing verification technology. For example, I would very much like to mechanize arguments such as the correctness of Dijkstra’s shortest path algorithm or the optimality of greedy interval scheduling. However, in today’s deductive verification systems, proofs of these arguments would just be too long and involved.

I imagine that to get around these difficulties, we would have to develop libraries of proved facts about basic data structures like graphs. We would probably also benefit from databases of “proof strategies” that encode common proof techniques and impose structural constraints on the proofs that a student needs to explore. How to do this well enough so that the result can be used in a class like mine is an open and interesting research challenge. I intend to invest at least a few of my research cycles on this question, and I hope y’all do too!