Jenkins automatically runs all tests for all build configurations. The regressiontests results are shown on the Jenkins page under "Test", and a test report looks like this.

Jenkins also automatically computes the code coverage. This is currently done only when the regressiontests are modified, not when the unit tests are modified. The code coverage of the regressiontests for the 4.5 source is here, and for the regressiontests and the unit tests together for the master branch source is here.

Policy for generating regression tests

Goal: Provide a way to detect errors introduced in old code paths by new or modified features.

Method: Compare the output of a reference input run with the reference code against the output of the same input run with the new code, and use that to signal significant differences.
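As a sketch of that comparison step (the file names, the single-scalar format, and the tolerance are all illustrative assumptions, not part of any existing script):

```shell
# Hypothetical reference and new outputs (one scalar each); in a real test
# these would come from the reference run and the current build.
echo "-12345.678" > ref_energy.txt
echo "-12345.679" > new_energy.txt

ref=$(cat ref_energy.txt)
new=$(cat new_energy.txt)

# Signal a significant difference if the relative deviation exceeds tol.
awk -v a="$ref" -v b="$new" -v tol=1e-6 'BEGIN {
    d = a - b; if (d < 0) d = -d
    m = a;     if (m < 0) m = -m
    if (d > tol * m) { print "SIGNIFICANT DIFFERENCE"; exit 1 }
    print "OK"
}'
```

The non-zero exit status is what lets an automated harness flag the test as failed without any human looking at the numbers.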

Issues to be resolved

The full conditions of the reference version of the test need to be documented. We do not know now which bugs we might later suspect were present in either the test or the reference case. We need to record as much information as we reasonably can now, so that we can attempt to replicate the reference run exactly later if trouble happens. This means we need:

code requirements

code needs a commit hash that exists in the main GROMACS repo (so not some developer repo that might get rebased, and definitely not a dirty working tree)

what branch the commit is on is of secondary importance, but has to make sense for the functionality being tested and the time span over which that functionality has been stable (or not!)
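The two code requirements above can be captured mechanically at reference-creation time. A sketch (the throwaway repo and the output file names are illustrative; in practice you would run the recording commands inside the GROMACS source checkout that produced the reference results):

```shell
# Demonstration in a throwaway repo; in practice, run the recording
# commands inside the GROMACS source checkout used for the reference.
repo=$(mktemp -d) && cd "$repo"
git init -q .
git -c user.name=ref -c user.email=ref@example.org \
    commit -q --allow-empty -m "reference"

# Refuse a dirty working tree: the reference must be exactly reproducible.
[ -z "$(git status --porcelain)" ] || { echo "dirty tree; aborting" >&2; exit 1; }

# Record the commit hash (it must exist in the main GROMACS repo) and the
# branch, which is of secondary importance but should still be noted.
git rev-parse HEAD              > reference-commit.txt
git rev-parse --abbrev-ref HEAD > reference-branch.txt
```

Checking `git status --porcelain` for empty output is a simple way to enforce "definitely not a dirty one" before anything gets recorded.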

compiler requirements

open source

code base likely to remain available and compilable on future hardware

version we trust from previous experience

we’re not going to make trouble by requiring test makers to use a particular hand-compiled compiler whose pedigree we know. Mark would rather have an extra test contributed by a volunteer developer than be certain whether a particular compiler bug was present in our compiler when we made the reference case. If and when we do need to know that about some suspected compiler bug, we will be recompiling with later versions of that compiler anyway, and those will tell us what we need to know. Yes, this is not perfect.

so gcc 4.7 sounds like a good way to do the above

build requirements

-O0 optimization level (Mark’s reflex was to be happy with -O2, but for some kinds of tests, being sure of IEEE conformance with -O0 is worth it, because we are often trying to cut numerical corners and want a chance of knowing when we’re wrong). This also means that if the compiler can’t be reproduced for some reason (e.g. the vendor won’t support it), there’s a decent chance that some other compiler will do a comparable job.

If we later add regular checking of ensemble consistency over longer time scales, we aren’t going to be doing that with -O0 code. One of the purposes of such tests is to run under “battle conditions” where we can’t control numerical reproducibility (e.g. in parallel).

use built-in versions of GROMACS dependencies. (Generally, we will be regularly testing code that uses the external dependencies we encourage people to use for performance, so we can be reasonably sure that this process will gradually eliminate bugs in our built-in code. When tracing problems, we don’t have the resources to go looking for whether bugs existed in the external libfoo used during the test. We don’t want to constrain test creators to install some particular version of a library they don’t care about. We don’t want to have to discuss updating that dependency. And we don’t want to have to detect and document dependency versions, unless, say, the test is actually about whether FFTW is conformant to our requirements.)

minimize the use of acceleration levels at configure time (no SIMD, no GPU, generally no parallelism)
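Taken together, the build requirements above might translate into something like the following CMake invocation. The exact GROMACS option names varied between releases, so GMX_CPU_ACCELERATION, GMX_GPU, GMX_MPI and GMX_BUILD_OWN_FFTW here are assumptions to be checked against the version being built:

```shell
# Hypothetical reference-build configuration; option names are assumptions.
cmake ../gromacs \
    -DCMAKE_C_COMPILER=gcc-4.7 \
    -DCMAKE_CXX_COMPILER=g++-4.7 \
    -DCMAKE_C_FLAGS="-O0" \
    -DCMAKE_CXX_FLAGS="-O0" \
    -DGMX_BUILD_OWN_FFTW=ON \
    -DGMX_CPU_ACCELERATION=None \
    -DGMX_GPU=OFF \
    -DGMX_MPI=OFF
```

Whatever the final form, the point is that one copy-pasteable command encodes the compiler, optimization level, built-in dependencies, and disabled acceleration all at once.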

test case requirements

reasonably likely to be responsive to a range of possible problems without producing false signals. The reproduced quantity might be numerical, an ensemble average (with error estimates), or other things?

Mark doubts we will ever be able to afford to write function-level test code, so the purpose of the tests is to signal, rather than diagnose, the problem
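For the ensemble-average case, one plausible form of "significant difference" is a two-sigma check on the combined error estimates. A sketch (the function name, the example values, and the factor of 2 are all illustrative assumptions):

```shell
# Compare two ensemble averages, each with an error estimate.
# Usage: compare_avg <ref_avg> <ref_err> <new_avg> <new_err>
compare_avg() {
    awk -v a="$1" -v ea="$2" -v b="$3" -v eb="$4" 'BEGIN {
        d = a - b; if (d < 0) d = -d
        # Combined standard error of the difference
        s = sqrt(ea*ea + eb*eb)
        # Flag deviations beyond ~2 sigma as significant (assumed threshold)
        if (d > 2 * s) { print "SIGNIFICANT"; exit 1 }
        print "CONSISTENT"
    }'
}
compare_avg -1204.5 0.8 -1203.9 0.7   # prints CONSISTENT
```

This matches the signalling role described above: the test says only that something moved outside statistical expectation, not what broke.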

run time requirements

the code path of the reference test should depend somewhat on the nature of the test, but generally use code that is easy to check by eye for correctness (so interaction-specific C non-bonded kernels are OK)

prefer commodity hardware (so x86 for the moment)

For this to be useful we need

to make it reasonably easy for any contributor to satisfy the above, so

the wiki and the README need a sample CMake invocation that works now and is reasonably future-proof.