You can’t automate SemVer, or: There is no way around Rice’s Theorem

Rice’s Theorem, proved in 1951, states that it is impossible to write a program that precisely performs any non-trivial analysis of the execution of other programs. More precisely, it is impossible to code an analyzer for some non-trivial property that is able to decide whether any given analyzed program has that property or not. And by “trivial property” we mean a property that either _all_ algorithms in the world have or _none_ has. So, yeah, “non-trivial property” is basically any property you can think of: “does it ever calculate 5 + 2”, “does it always use less than 10 MB of memory”, “does it ever print something to the screen”, “does it ever access the network”?

At this point you might say “wait! I can write a program that checks if programs access the network or not! I can parse the code and if there are no calls whatsoever in it to any networking code such as connect(), then it doesn’t access the network!”. Sure, you can do that: but if the code does have calls to connect(), you can’t decide for sure whether it will actually access the network when it’s executed, because the call may sit in a branch that is never reached.
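To make that asymmetry concrete, here is a minimal sketch of such a syntactic check, using Python's standard `ast` module. The function name `mentions_connect` and the two sample programs are made up for illustration, and the check deliberately ignores dynamic tricks like `eval` or `getattr`, so even its "no" answer is only as good as that assumption.

```python
import ast

def mentions_connect(source: str) -> bool:
    """Return True if the source contains a call to anything named connect."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            # Handles both connect(...) and something.connect(...)
            if isinstance(func, ast.Attribute):
                name = func.attr
            else:
                name = getattr(func, "id", None)
            if name == "connect":
                return True
    return False

# Hypothetical sample programs; they are only parsed, never executed.
NO_NETWORK = "print(5 + 2)"

MAYBE_NETWORK = """
import socket

def run(flag):
    if flag:  # will this branch ever be taken? The parser cannot tell.
        sock = socket.socket()
        sock.connect(("example.org", 80))
"""

print(mentions_connect(NO_NETWORK))     # False: no connect() call anywhere
print(mentions_connect(MAYBE_NETWORK))  # True: but it only *might* run
```

A `False` here is the useful answer. A `True` only means "I can't rule it out", which is exactly the one-sided guarantee described above.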

In 1936 Alan Turing proved that it is impossible to write a program that solves the Halting Problem, that is, to write an analyzer that checks programs and tells whether each one always terminates (“halts”) or might enter an endless loop given some specific input. Okay, that’s a classic result, but that’s one property: how can Rice’s Theorem say we can’t make an analyzer for any property at all, even the silliest ones?

The proof for this amazingly powerful theorem is surprisingly simple. Turns out that if we had an analyzer for any silly property, we could use it to make a Halting Problem analyzer (which Turing proved to be impossible). Like this:
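What follows is a rough sketch of the construction in Python, not a formal proof. The "silly property" chosen here (printing the word silly) is an arbitrary assumption, programs are passed around as source strings, and `my_silly_property_analyzer` is the hypothetical analyzer that we pretend someone handed us. It also assumes the analyzed program doesn't already have the silly property on its own.

```python
from typing import Callable

# Hypothetical: takes program source, returns True if the program
# always ends up printing "silly". Rice's Theorem says this can't exist.
Analyzer = Callable[[str], bool]

SILLY_SNIPPET = 'print("silly")'

def make_modified_program(analyzedProgram: str) -> str:
    """Build a program that first runs analyzedProgram, then does the silly thing."""
    return f"exec({analyzedProgram!r}, {{}})\n{SILLY_SNIPPET}\n"

def my_halting_problem_analyzer(
    analyzedProgram: str,
    my_silly_property_analyzer: Analyzer,
) -> bool:
    # The silly part is only reached if analyzedProgram terminates,
    # so asking about the silly property is asking about halting.
    modifiedProgram = make_modified_program(analyzedProgram)
    return my_silly_property_analyzer(modifiedProgram)
```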

If the code in analyzedProgram always terminates, then the code in modifiedProgram will always reach the part that has the silly property, so my_silly_property_analyzer will return true, and my_halting_problem_analyzer returns true as well. If there is some input that makes the analyzedProgram hang in a loop, that means there’s some input that makes the silly property fail, resulting in false. Yay, we solved the Halting Problem using the silly property analyzer! Not.

Of course, this explanation is quite simplified, so head to Wikipedia and your favorite formal languages book for the precise details. In particular: (1) I’m ignoring in the pseudo code the input of the analyzedProgram, which should be an argument of my_halting_problem_analyzer; (2) Rice’s Theorem is formally stated about partial recursive functions, but the formalism maps to Turing Machines and therefore the result applies to programming languages in general; (3) I didn’t stress enough that the “non-trivial properties” should be on the functions themselves (that is, their input-output pairs) and not their specifications (their “code”), but I believe I conveyed the general idea. The point stands that general semantic analysis of programs is impossible.

In particular, you can’t write a program that takes versions 1.0 and 1.1 of any program X and answers the question: “do they behave the same?”. In other words, it’s impossible to write an analyzer that looks at your master branch before you make a release and answers the question “should your new release tag be a major, minor or tiny release?” according to the rules of SemVer (or any other API-compatibility-bound set of rules, for that matter).

This is because API compatibility is not only based on syntactically-expressible issues (that is, type signatures for functions and data structures). Any semantic changes to the code also break compatibility. A function may change its behavior but not its type signature (it still returns a string, but it used to be lower-case and it’s now upper-case). A struct can change the way it is used while the fields remain the same (field foo returned numbers from 0 to 10 and -1 when executed on Sundays, now it returns -1 on Saturdays as well). An automated tool won’t catch all this.
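Here is a tiny illustration of the lower-case/upper-case example, assuming a made-up function called `normalize` that exists in two versions. Comparing signatures with the standard `inspect` module sees nothing wrong, while any caller that depended on the lower-case result is now broken.

```python
import inspect

def normalize_v1(name: str) -> str:
    """Version 1.0: trims and lower-cases."""
    return name.strip().lower()

def normalize_v2(name: str) -> str:
    """Version 1.1: same signature, but now upper-cases."""
    return name.strip().upper()

same_signature = inspect.signature(normalize_v1) == inspect.signature(normalize_v2)
same_behavior = normalize_v1(" Ada ") == normalize_v2(" Ada ")

print(same_signature)  # True: a signature diff reports no change
print(same_behavior)   # False: "ada" versus "ADA"
```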

So, it is only possible to write a “pessimistic” tool, one that may detect lots of situations syntactically and give the bad news: “hey, you must increment the major version here!”. But you can’t write a tool that is always able to look at code semantically and give the good news: “I assure you that no API behaviors have changed, you can safely name this a tiny version increase.”

Edit: It was pointed out to me that I missed the “always” in the previous version of the last sentence. Yes, of course you can write an analyzer that checks for API compatibility between versions for some cases: a silly example would be an analyzer that detects if the code is identical save for carefully renamed local variables, or has some no-op x = x; assignments sprinkled in it. Yep, the code is different, but it evidently behaves the same. It’s a silly example but it shows that it is possible to catch some cases. My point is that it is impossible to write a program that reaches a conclusive answer on API compatibility in all cases, so the conclusion stands that making versioning decisions fully automated is impossible. Further, writing one that would catch even a significant amount of positive cases would be impractical. There are hard limits to semantic analysis. However, lots and lots of API breakage happens in function signatures and data types, and those are not hard to detect, as they are syntactically specified: that’s why the “pessimistic” tools are much more practical.
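A toy version of such a pessimistic tool could look like the sketch below. Everything in it is an assumption made for illustration: the "versions" are `SimpleNamespace` objects standing in for modules, the public API is taken to be every callable whose name doesn't start with an underscore, and the three answers are my own naming. Note that it never answers "tiny": when it finds nothing, the honest answer is that it doesn't know.

```python
import inspect
from types import SimpleNamespace

def public_api(module) -> dict:
    """Map each public callable's name to its signature."""
    return {
        name: inspect.signature(obj)
        for name, obj in vars(module).items()
        if callable(obj) and not name.startswith("_")
    }

def pessimistic_bump(old, new) -> str:
    old_api, new_api = public_api(old), public_api(new)
    for name, sig in old_api.items():
        # Something was removed or its signature changed: bad news for sure.
        if name not in new_api or new_api[name] != sig:
            return "major"
    if set(new_api) - set(old_api):
        return "at least minor"
    # Signatures match, but behavior may still have changed.
    return "undetermined"

# Hypothetical versions of a tiny library.
v1_0 = SimpleNamespace(greet=lambda name: name.lower())
v1_1 = SimpleNamespace(greet=lambda name: name.upper())
v2_0 = SimpleNamespace(greet=lambda name, loud: name)

print(pessimistic_bump(v1_0, v2_0))  # major
print(pessimistic_bump(v1_0, v1_1))  # undetermined
```

The second call is the interesting one: the behavior of greet changed, and the tool has no way to notice.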

Yes, you can use test suites as an approximation for detecting semantic changes in API behaviors beyond type signatures and data structures. That would certainly improve your pessimistic analyzer: you’d be able to detect more situations where “you must increment major”. But even then it can only go so far, because in practice one can’t test for every possible input/output combination, so you still can’t be 100% sure. Fuzz testing has uncovered bugs and unexpected behaviors even in programs with extensive test suites. As Dijkstra famously said, “Testing shows the presence, not the absence of bugs.” Likewise, test suites can show inconsistencies with the API specification, but not adherence to it. So they can’t be taken to represent the semantics of a program entirely.

Anyway, at the end of the day, Rice’s Theorem shows us that general bullet-proof analysis of program behavior is not attainable, so no tool will ever be able to compare codebases and always tell us precisely that a new release is really “tiny-safe”. Semantic versioning just can’t be automated.

Posted by hisham on Thursday, March 24, 2016 02:05:46 in en_US, Coding