Thursday, May 26, 2016

PostgreSQL Regression Test Coverage

Yesterday evening, I ran the PostgreSQL regression tests (make check-world) on master and on each supported back-branch three times on hydra, a community test machine provided by IBM. Here are the median results:

9.1 - 3m49.942s
9.2 - 5m17.278s
9.3 - 6m36.609s
9.4 - 9m48.211s
9.5 - 8m58.544s
master, or 9.6 - 13m16.762s

There are at least three effects in play here. First, of course, many new tests have been added in recent years. Second, some existing tests may have been optimized to make them run faster, or de-optimized to make them run slower. Finally, the server itself may have become faster or slower, changing the time that it takes to run the test suite. It's not entirely easy to tease apart these effects, because a given test suite can only be expected to pass when run against the matching server version. Still, the change in runtime is probably due mostly to new test suites, especially the so-called TAP tests, which have been gradually added beginning in 9.4.
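To make the growth in runtime easier to compare across branches, the medians listed above can be converted to seconds and expressed relative to 9.1. The following is only a minimal sketch: the helper names (to_seconds, slowdown_table) are made up for illustration, and the only data used are the six medians quoted in this post.

```python
import re

# Median `make check-world` wall-clock times quoted in the post.
MEDIANS = {
    "9.1": "3m49.942s",
    "9.2": "5m17.278s",
    "9.3": "6m36.609s",
    "9.4": "9m48.211s",
    "9.5": "8m58.544s",
    "9.6": "13m16.762s",
}


def to_seconds(text):
    """Convert a shell `time`-style string such as '3m49.942s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def slowdown_table(medians, baseline="9.1"):
    """Return each branch's runtime as a multiple of the baseline branch."""
    base = to_seconds(medians[baseline])
    return {version: to_seconds(t) / base for version, t in medians.items()}


if __name__ == "__main__":
    for version, ratio in slowdown_table(MEDIANS).items():
        seconds = to_seconds(MEDIANS[version])
        print(f"{version}: {seconds:7.1f}s  ({ratio:.2f}x of 9.1)")
```

By this arithmetic, the run on master takes roughly three and a half times as long as the run on 9.1.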

In 9.6, for example, we have new tests for pg_dump, more tests for the commit_timestamp feature, and tests for recovery. In one sense, it's hard to argue that this is anything other than a good thing. Do you want your database product to test whether crash recovery actually works? You sure do! And, as a developer, whenever I have to modify existing code, particularly in sensitive areas of the system, I am always happier when I know that the code I'm changing is well-covered by tests. That way, if I break something, I'll probably find out right away.

On the other hand, how do we judge whether these new test suites have value, and how do we judge how much value they have? Certainly, running all of the tests takes longer than not running all of the tests, and more time spent waiting for tests to run means less time spent actually developing. Sometimes you can work while the tests are running, but it's kind of tricky, because if you modify anything then by the time the tests finish, the version you've got is no longer the version you've tested. You can work on a different copy of the repository or a different project altogether, but context-switching carries its own costs.

Moreover, once you have tests, you have to keep them working. Noah Misch gave a talk at PGCon entitled Not Easy Being Green in which he explored some of the buildfarm failures that he's spent a lot of time researching. Of course, some of those failures turned out to be real bugs that were just hard to hit, and we can feel good about the time spent finding those bugs. But some of them were simply cases where tests hit timeouts that are more than reasonable on normal hardware, but maybe insufficient on oddball hardware where filesystem metadata operations are excruciatingly slow, or on test machines that build with certain testing-oriented compilation options enabled that can slow down test runs by 100x or more. We can't feel as good about the time spent tracking down these failures. It's good for the buildfarm to be green, and it's good to have lots of tests, but we have to face the fact that more tests means more time spent tracking down test failures, some of which will turn out to be spurious. So, again, how do we assess the value of a test, or of an overall larger test suite?

One possible metric is code coverage. What percentage of functions, and of lines of code, do the regression tests hit? Here's a quick comparison, taken on my (well, EnterpriseDB's) MacBook Pro:

Between 9.1 and 9.6, the percentage of functions hit by the basic regression test suite (that is, make check) stayed basically unchanged, just below 70%, while the percentage of lines hit improved by about 2.6%. The absolute numbers of functions and lines covered went up significantly, but the percentage of them covered by the basic regression test suite did not change very much, which seems fine. For the full test suite (make check-world), we did manage to improve coverage. In 9.1, running make check-world instead of make check resulted in hitting an additional 213 functions, 1.7% of the total number. In 9.6, it results in hitting an additional 2290 functions, 6.9% of the total number. Overall, we've improved test coverage by about 5%. It's worth noting that only about 40% of that improvement comes from the new, relatively slow TAP tests; the majority is because check-world now covers more than it once did. We're spending about half the duration of make check-world on master running TAP tests that, in the aggregate, provide coverage of about 2% of the code.
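The figures in the preceding paragraph can be cross-checked with a few lines of arithmetic. This sketch assumes that the "about 5%" improvement is the difference between the two check-world function-coverage figures (1.7% and 6.9%), and that the "about 40%" TAP share applies to that difference; the constant and function names are invented for illustration.

```python
# Extra function coverage gained by check-world over check, in percent of
# all functions, as quoted in the post.
CHECK_WORLD_EXTRA_91 = 1.7
CHECK_WORLD_EXTRA_96 = 6.9

# Share of the improvement attributed to the TAP tests ("about 40%").
TAP_SHARE = 0.40


def coverage_gain(before, after):
    """Improvement in coverage, in percentage points."""
    return after - before


def tap_contribution(gain, share):
    """Percentage points of coverage attributable to the TAP tests."""
    return gain * share


if __name__ == "__main__":
    gain = coverage_gain(CHECK_WORLD_EXTRA_91, CHECK_WORLD_EXTRA_96)
    tap = tap_contribution(gain, TAP_SHARE)
    print(f"overall gain: {gain:.1f} points, of which TAP: {tap:.1f} points")
```

Under those assumptions the numbers are consistent: a gain of 5.2 points, of which 40% is about 2 points.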

Given all that, it seems pretty clear that neither run time nor test coverage is a good proxy for the value of a test suite. Run time isn't a good proxy for test value because we don't know whether that additional run time is actually hitting any code that we weren't testing anyway. Test coverage isn't a good proxy for test value because not all code is equally important. As a percentage of our total code base, the recovery code is very small - all the TAP tests together improve code coverage by only about 2%, and the recovery tests are only one small part of the TAP tests. Yet, there must be value in testing something that is so important to our users.

It seems to me that the only real way to judge the value of a test suite is to look at how sensitive it is to the presence of bugs. The best test would be one that was 100% certain to pass if the code is free of bugs, and 100% certain to fail the moment a bug was introduced. In reality, no test achieves this. There's always some bug which is subtle enough to allow the test to pass, and there's always some pathological condition under which the test will fail even if the code is bug-free. This is inevitable. But it is the function of regression testing to serve as a sort of canary in the coal mine - so the closer we come to tests that are sensitive only to the presence of bugs and not to anything else, the closer we come to perfect regression testing. The mere fact that the regression tests end up calling a particular function does not necessarily mean that they will fail if that function gets broken.
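The gap between "the tests call this function" and "the tests fail when this function breaks" can be shown with a toy example. Nothing below comes from PostgreSQL; clamp, clamp_broken, weak_test and sensitive_test are hypothetical names chosen purely for illustration. Both tests execute every line of the function, so a coverage report would treat them as equal, but only one of them notices the bug.

```python
def clamp(value, low, high):
    """Correct implementation: restrict value to the range [low, high]."""
    return max(low, min(value, high))


def clamp_broken(value, low, high):
    """Deliberately broken variant: forgets to enforce the lower bound."""
    return min(value, high)


def weak_test(fn):
    """Calls fn, so coverage counts it, but only checks an in-range input."""
    return fn(5, 0, 10) == 5


def sensitive_test(fn):
    """Checks both boundaries, so a missing bound changes the outcome."""
    return fn(-3, 0, 10) == 0 and fn(42, 0, 10) == 10 and fn(5, 0, 10) == 5


if __name__ == "__main__":
    for name, test in [("weak", weak_test), ("sensitive", sensitive_test)]:
        print(f"{name}: passes on correct code = {test(clamp)}, "
              f"passes on broken code = {test(clamp_broken)}")
```

The weak test passes against both versions, while the sensitive test passes only against the correct one.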

Empirically, I know that the core regression tests are pretty good. I spend a lot of time hacking on PostgreSQL, either on my own patches or on those others have written, and I run "make check" thousands of times per year. There are almost no false positives. If the regression test outputs change in a way that isn't a logical result of some improvement you made to the code, you've definitely broken something. There are more false negatives. If you're writing new code that isn't hit by the existing regression tests, then those regression tests won't tell you whether it's broken; and even if you're modifying existing code, you can sometimes do so in ways that don't necessarily affect the tests. Still, my experience is that the core regression tests are extremely sensitive.

The rest of the regression test suite, at least in my experience, is much less sensitive. So far, I've had a TAP test fail as a result of a bug in my code just once, but I've had probably a half-dozen failures that were due to portability problems in the test suite. Of course, that's partly because many of the new regression tests are testing code that I'm not likely to be modifying. Nonetheless, as we get more experience with these new test suites, I think we should be asking this question: if, on our development machines, we break recovery, or pg_dump, or some other bit of code that is tested by make check-world, do the tests fail? If they do, they're good tests.

You could take a look at "mutation testing" [https://en.wikipedia.org/wiki/Mutation_testing] to answer the question of whether a certain test or test suite has real value. It involves the following steps:

- Modify the code under test
- Run the test suite (or just parts of it)
- Check whether at least one test failed

If no test failed, you have found code that has not been tested at all. So this goes one step further than "code coverage", which just tells you that a test has executed certain code.

Additionally, the same procedure can be used to compare two tests. If they fail for the same mutations, they are equivalent. You could then reduce the test suite by removing test cases that do not add any value.
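The mutation-testing procedure described in this comment can be sketched in a few lines of Python using the standard ast module. This is a toy under stated assumptions, not a real mutation-testing tool and not anything PostgreSQL uses: the function under test (is_adult), the single mutation operator (swapping >= with >, and so on), and the two test suites are all hypothetical.

```python
import ast

# Hypothetical code under test.
SOURCE = '''
def is_adult(age):
    return age >= 18
'''

# One illustrative mutation operator: make strict comparisons non-strict
# and vice versa.
SWAPS = {ast.GtE: ast.Gt, ast.Gt: ast.GtE, ast.LtE: ast.Lt, ast.Lt: ast.LtE}


class SwapComparison(ast.NodeTransformer):
    """Step 1: modify the code under test."""

    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [SWAPS.get(type(op), type(op))() for op in node.ops]
        return node


def load(tree, name):
    """Compile a module AST and return the named function from it."""
    namespace = {}
    exec(compile(tree, "<mutant>", "exec"), namespace)
    return namespace[name]


def weak_suite(fn):
    """Covers every line of is_adult but never probes the boundary."""
    return fn(30) is True and fn(5) is False


def strong_suite(fn):
    """Also checks the values on either side of the boundary."""
    return weak_suite(fn) and fn(18) is True and fn(17) is False


def mutant_survives(suite):
    """Steps 2 and 3: run the suite on the mutant; True means no test failed."""
    tree = SwapComparison().visit(ast.parse(SOURCE))
    ast.fix_missing_locations(tree)
    return suite(load(tree, "is_adult"))


if __name__ == "__main__":
    print("mutant survives weak suite:  ", mutant_survives(weak_suite))
    print("mutant survives strong suite:", mutant_survives(strong_suite))
```

Here the mutant (age > 18) survives the weak suite but is caught by the strong one, even though both suites execute the same line of code.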