What high school Algebra quizzes and NP-complete problems have in common

What I did for my summer internship at Galois

World of algebra quizzes. As a high schooler, I was using concepts from computer science long before I even knew what computer science was. I can recall taking a math quiz—calculators banned—facing a difficult task: the multiplication of large numbers. I was (and still am) very sloppy when it came to pencil-and-paper arithmetic—if I didn’t check my answers, I would invariably lose points because of “stupid mistakes.” Fortunately, I knew the following trick: if I summed together the digits of my factors (re-summing if the result was ten or more), the product of these two numbers should match the sum of the digits of the result. If not, I knew I had the wrong answer. It wasn’t until much later that I discovered that this was a very rudimentary form of the checksum.

In fact, most of the tricks I rediscovered were motivated by a simple academic need: Was my answer correct or not? Indeed, while I didn’t know it at the time, this question would become the fundamental basis for my internship at Galois this summer.

At about the time I started learning algebra, I began to notice that my tricks for checking arithmetic had become insufficient. If a teacher asked me to calculate the expanded form of the polynomial (x + 2)(x - 3)(x - 5), I had to carry out multiple arithmetic steps before I arrived at an answer. Checking each step was tedious and prone to error—I knew too well that I would probably be blind to errors in the work I had just written. I wanted a different way to check that my answer was correct.

Eventually, I realized that all I had to do was pick a value of x and substitute it into the original question and the answer x³ - 6x² - x + 30. If the values matched, I would be fairly confident in my answer. I also realized that if I picked a number like x = -2, I wouldn’t even have to calculate the value of the original problem: the answer was obviously zero! I had “invented” unit testing, and at the hand of this technique, many symbolic expressions bent to my pencil. (I independently learned about unit testing as a teething programmer, but since a PHP programmer never codes very much math, I never made the connection.)

World of practical software testing. Here, we pass from the world of algebra quizzes to the world of software testing. The expressions being tested are more complicated than x³ - 6x² - x + 30, but most people still adopt the strategy of the high school me: they hand pick a few inputs to test that will give them reasonable confidence that their new implementation is correct. How does one know that the output of the program is the correct one? For many simple programs, the functionality being tested is simple enough that the tester mentally “knows” what the correct result is, and write it down manually—akin to picking inputs like x = -2 that are particularly easy for a human to infer the answer to. For more complex programs, a tester may use a reference implementation to figure out what the expected behavior is supposed to be.

Testing like this can only show the presence of bugs, not the absence of them. But, as many software companies have discovered, this is good enough! If the programmer misses an important test case and a bug report comes in, he fixes the bug and adds a regression test to deal with that buggy input. So, as pragmatists, we have settled for this state of affairs: manual case-by-case testing (which hopefully is automated). The state of the art of conventional software testing is fundamentally the same as how a high-schooler checks his answers on an algebra quiz. Anything better lies beyond the dragons of theoretical computer science research.

Aside. As anyone who has written automated tests before can attest, automated tests are characterized by two primary chores: getting your code to be automatically testable in the first place (much easier if it’s arithmetic than if it’s a kernel driver) and coming up with interesting situations to test your code in. For the latter, it turns out that while humans can come up with decent edge-cases, they’re really bad at coming up with random test-cases. Thus, some extremely practical high-tech testing techniques involve having a computer generate random inputs. Fuzz testing and QuickCheck style testing are both characterized by this methodology, though fuzz testing prides itself in nonsensical inputs, while QuickCheck tries hard to generate sensible inputs.

World of theoretical computer science. The teacher grading your algebra quiz doesn’t do something so simple as pick a few random numbers, substitute them into your answer, and see if she gets the right answer. Instead, she compares your answer (the program itself) against the one she has in the answer key (a reference implementation), and marks you correct if she is able to judge that the answers are the same. If you phrase your answer in terms of Fermat’s last theorem, she’ll mark you off for being cheeky.

The reference implementation may be wrong (bug in the answer key), but in this case it’s our best metric for whether or not a program is “correct.” Since we’ve wandered into the land of theoretical computer science, we might ask this question to the Literal Genie: Is it possible, in general, to determine if two programs are equivalent? The Literal Genie responds, “No!” The question is undecidable: there is no algorithm that can answer this question for all inputs. If you could determine if two programs were equivalent, you could solve the halting problem (the canonical example of an unsolvable problem): just check if a program was equivalent to an infinitely looping one.

While the working theoretician may tame uncountably huge infinities on a regular basis, for a working programmer, the quantities handled on a regular basis are very much finite—the size of their machine integer, the amount of memory on their system, the amount of time a program is allowed to run. When you deal with infinity, all sorts of strange results appear. For example, Rice’s theorem states that figuring out whether or not a program has any non-trivial property (that is, there exists some program that has the property and some program that doesn’t) is undecidable! If we impose some reasonable constraints, such as “the program terminates in polynomial time for all inputs”, the answer to this question is yes! But can we do so in a way that is better than testing that the programs do the same thing on every input?

World of more practical computer science. We’ve relinquished enough theoretical purity to make our question interesting again for software engineers, but it is still very difficult for the programmer to prove to himself that the algorithm is equivalent to his reference implementation. In contrast, it's easy for a user to show that the algorithm is wrong: all they have to do is give the programmer an input for which his implementation and the reference implementation disagree.

Computer scientists have a name for this situation: problems for which you can verify their solutions (in this case, more of an anti-solution: a counter-example) in polynomial time are NP. Even if both programs run in constant time, as a combinational logic circuit might (to simulate such a circuit, we only need to propagate the inputs through as many gates as they are in the circuit: there is no dependence on the input), it still takes exponential time to brute-force an equivalence check. Every time we add another bit to the input, we double the amount of possible inputs to check.

In fact, the question of circuit non-equivalence is NP-complete. We’ve been talking about program equivalence, but we can also talk about problem equivalence, for which you can translate one problem (graph coloring) into another one (traveling salesman). In the seventies, computer scientists spent a lot of time proving that a lot of problems that required “brute force” were actually all the same problem. Stephen Cook introduced the idea that there were problems that were NP-complete: problems in NP for which we could translate all other problems in NP into. The most famous example of an NP-complete problem is SAT, in which given a logical formula with boolean variables, you ask whether or not there is a satisfying assignment of variables, variables that will cause this formula to be true.

To show that circuit non-equivalence is NP-complete, we need to show that it is in NP (which we’ve done already) and show that we can translate some other NP-complete problem into this problem. This is quite easy to do with SAT: write a program that takes the boolean variables of SAT as inputs and outputs the result of the logical formula and then see if it’s equivalent to a program that always returns false.

The other direction is only slightly less trivial, but important practically speaking: if we can reduce our problem into an instance of SAT, I can chuck it a highly optimized SAT solver. A satisfiability problem is isomorphic to a logic circuit that outputs a single bit. We can translate a circuit equivalence problem into SAT by combining the circuits into what is called a “miter”: we combine the inputs of the two original logic circuits into a single set that feeds into both circuits, and then test the corresponding output bits between the two circuits for equality (XOR), ORing the entire result together. The resulting circuit outputs 0 if the outputs were the same between the two circuits (all of the XORs returned 0), and outputs 1 if there is a mismatch.

“Great,” you may be thinking, “but I’m a programmer, not a hardware designer. Most of my programs can’t be expressed just in terms of logic gates!” That is true: to encode state, you also need latches, and input/output needs to be simulated with special input and output “ports”. However, there are many important problems that are purely combinational: the shining example of which is cryptography, which protects your money, employs a lot of complicated math and is ruthlessly optimized.

But there still is one standing complaint: even if my programs are just logic circuits, I wouldn’t want to write them in terms of ANDs, ORs and NOTs. That just seems painful!

Enter Cryptol, the project that I am working on at Galois. Cryptol bills itself as follows:

Cryptol is a language for writing specifications for cryptographic algorithms. It is also a tool set for producing high-assurance, efficient implementations in VHDL, C, and Haskell. The Cryptol tools include the ability to equivalence check the reference specification against an implementation, whether or not it was compiled from the specifications.

But what really makes it notable, in my humble intern opinion, is the fact that it can take programs written in programming languages like C, VHDL or Cryptol and convert them into logic circuits, or, as we call them, “formal models”, which you can chuck at a SAT solver which will do something more sensible than brute-force all possible inputs. At one point, I thought to myself, “It’s a wonder that Cryptol even works at all!” But it does, and remarkably well for its problem domain of cryptographic algorithms. The state of the art in conventional software testing is manually written tests that can only show the presence of bugs in an implementation; the state of the art in Cryptol is a fully automatic test that gives assurance that an implementation has no bugs. (Of course, Cryptol could be buggy, but such is the life of high assurance.)

SAT solvers are perhaps one of the most under-utilized high-tech tools that a programmer has at their fingertips. An industrial strength SAT solver can solve most NP-complete problems in time for lunch, and there are many, many problems in NP with wide-ranging practical applications. However, the usual roadblocks to using a SAT solver include:

No easy way to translate your problem into SAT and then run it on one of the highly optimized solvers, which are frequently poorly documented, library-unfriendly projects in academia,

Generating friendly error messages when your SAT solver passes or fails (depending on what is an “error”), and

My primary project was addressing issue one, in Haskell, by building a set of bindings for ABC, a System for Sequential Synthesis and Verification called abcBridge. One might observe that Haskell already has a number of SAT solving libraries: ABC is notable because it employs an alternative formulation of SAT in the form of And-Inverter Graphs (NAND gates are capable of simulating all boolean logic) as well as some novel technology for handling AIGs such as fraiging, which is a high-level strategy that looks for functionally equivalent subsets of your circuits.

The project itself has been a lot of fun: since I was building this library from scratch, I had a lot of flexibility with API decisions, but at the same time got my hands into the Cryptol codebase, which I needed to integrate my bindings with. With any luck, we’ll be releasing the code as open source at the end of my internship. But I’m going to miss a lot more than my project when my internship ends in two weeks. I hope to follow up with a non-technical post about my internship. Stay tuned!

3 Responses to “What high school Algebra quizzes and NP-complete problems have in common”

Yes, software (unit) testing is indeed like double-checking high school math answers.

However, high school math students rarely use mutation testing, a cool idea a friend introduced me to. It works by modifying the source code. If the unit tests still pass, then there must be a bug in the original code.

Andrew, mutation testing is indeed an interesting way of verifying unit tests: I’d heard about it but not applied any automation to it (I regularly manually break code in order to test that my unit tests are working.) One of the primary difficulties with automating this process is introducing interesting mutations, that is, mutations that are actually likely to show that your test suite is wrong.