I am trying to improve the tests for a bunch of renderers that generate 2D images, and I am looking for advice and ideas for testing approaches that are efficient and thorough, but also branch-friendly.

All renderers can be set up in a massive number of ways - there are at least 30 independent variables in the newest (simplest) renderers, and easily hundreds of independent variables in some of the more mature ones. All of these are stored in one giant blob of a state class which is used by all renderers. (If any further detail would help, just ask.)

A human then checks that the actual output is acceptable. If it is, the actual output is copied to the expected output directory.

Problems at the moment:

The expected output directory is stored in SVN in the trunk of our source (to allow easy branching). However, this is bloating our repository and inflating checkout times (1 hour for a clean checkout) and disk usage (1 GB and growing).

Test code is too long, and tests are unclear because of all the copy-and-pasting.

Adding new tests that combine a new feature with all prior features is never done because it's infeasible, so test coverage is poor.

And here are my ideas so far:

Compare image hashes instead of doing a pixel-by-pixel image comparison, with some sort of 'AddImageToHashDatabase' program run by the human verifying test output (see the sketch below).

Then create (finite) lists of possible inputs for each variable and iterate over them all, testing every combination.
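To make the hash idea concrete, here's roughly what I'm imagining - a minimal Python sketch where the file name approved_hashes.json and the function names are just placeholders; the human verifier runs the 'add' step once they approve an image, and the automated tests only ever compare digests:

    import hashlib
    import json
    import sys

    HASH_DB = "approved_hashes.json"  # placeholder path for the approved-hash database

    def image_hash(path):
        # Digest of the raw file bytes; MD5 is plenty for exact-match checks.
        with open(path, "rb") as f:
            return hashlib.md5(f.read()).hexdigest()

    def add_image_to_hash_database(test_name, image_path):
        # Run by the human verifier once they have approved the output.
        try:
            with open(HASH_DB) as f:
                db = json.load(f)
        except FileNotFoundError:
            db = {}
        db[test_name] = image_hash(image_path)
        with open(HASH_DB, "w") as f:
            json.dump(db, f, indent=2, sort_keys=True)

    def matches_approved(test_name, image_path):
        # Used by the automated tests: only the small hash database is needed.
        with open(HASH_DB) as f:
            db = json.load(f)
        return db.get(test_name) == image_hash(image_path)

    if __name__ == "__main__":
        add_image_to_hash_database(sys.argv[1], sys.argv[2])

The hash database is tiny, so it could live in trunk without the bloat; the obvious downside is that a bare hash can't show you what changed.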

And some other ideas from my team:

Don't try to test everything; just write a set of tests that randomly test certain things and generally hit the 'known to be used' cases (i.e. keep doing the current thing).

Move expected output to another repository

Make each test exercise a combination of totally different things, to reduce the number of images needed.

Any ideas for improving our testing, whether from prior experience or just suggestions, are greatly appreciated. Thanks!

3 Answers

In a sense, this is like the UI automation problem. A human is the best judge of quality, and while an algorithm can't always tell you the UI is right, it might be able to tell you the UI is wrong, or that something has changed since last time.

As I understand it, you have two problems: an image comparison problem and a combinatorial testing problem. It sounds as if you keep a collection of canonical images that a human has certified to be correct. Since you suggested alternatives to pixel-to-pixel comparison, I assume comparing images consumes a significant amount of time. A hash code sounds like a reasonable shortcut for comparing image files. Of course there are many hashing algorithms. You may be better off using something like MD5 rather than, say, an algorithm that just sums the pixel values.
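For instance, a whole-file digest can stand in for the stored image during comparison (a rough Python sketch; the paths are made up):

    import hashlib

    def file_md5(path):
        # Digest of the raw file bytes: identical renders hash identically,
        # and any difference, even a single pixel, changes the digest.
        md5 = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                md5.update(chunk)
        return md5.hexdigest()

    # Two renders match exactly when their digests match:
    # file_md5("actual/render_042.png") == file_md5("expected/render_042.png")

One caveat: hashing file bytes only works if the renderer writes deterministic files; if metadata or compression can vary between runs, hash the decoded pixel data instead.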

If an image differs from the canonical version, what does a human look at to judge whether the new image is nonetheless correct? Surely there will be cases when the image is obviously flawed, and other times when it's a judgement call. I assume it would be valuable to have a reliable-enough image comparison algorithm. Google seems to retrieve relevant hits when I search for "image comparison metrics".

It also seems to me that in some cases you have to judge an image's quality in terms of its similarity to other images that were generated with similar variables. So for example, if I want to test the impact of the amount of ambient light, I might want to look at a series of images rendered with exactly the same variables except for the amount of ambient light. For continuous variables, it might be feasible to eyeball one image of the series and then rely on an image comparison algorithm to judge whether the other images in the series are similar enough.

Regarding the combinatorial problem, clearly you cannot test all 2^30 combinations of variables. I recommend starting with an All-Pairs approach and then increasing your coverage from there.
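As a rough illustration of the difference in scale (Python; allpairspy is just one pairwise generator among several, picked here for brevity, and the parameter values are invented):

    from itertools import product
    from allpairspy import AllPairs  # any all-pairs/pairwise generator would do

    parameters = [
        ["none", "low", "high"],       # ambient light (example values)
        ["flat", "gouraud", "phong"],  # shading model (example values)
        [False, True],                 # anti-aliasing
        ["small", "large"],            # canvas size
    ]

    print(len(list(product(*parameters))))  # exhaustive: 36 combinations
    print(len(list(AllPairs(parameters))))  # all-pairs: roughly 9-10 combinations

The gap only widens as you add variables: the exhaustive set grows multiplicatively, while the all-pairs set grows far more slowly.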

To be honest, this sounds like a fun problem to work on: there's a lot of data, it's a real problem, there are many possible approaches, and it will require some experimentation.

Thanks for your suggestion of the all-pairs approach. Combinatorics was (/is) a new field to me, so there's plenty of interesting stuff to learn.
–
Bomadeno May 30 '11 at 16:11

(enter is 'save', whoops) At the moment the speed of image comparison is not the problem; disk storage space is. I had thought hashes would solve this, but then we have problems seeing 'what changed'. I clearly need to add some way of comparing one build's output against another's. Also, we don't use any eyeball comparison - everything is done automatically until failure. On failure, we can recognise our own artifacts, or we have analysts for 'judgement calls'.
–
Bomadeno May 30 '11 at 16:17

An idea I got from a touch-screen manufacturer: skip the human eye as much as possible.
There are algorithmic ways to compare two images - ImageMagick, for example. They might be less accurate at catching fine details, but they will catch most of the problems.
For every test run, add a short (how short depends on your resources) stage of manual comparisons.
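For example, ImageMagick's compare tool reports a difference metric that a script can act on (a sketch in Python, assuming the compare binary is on the PATH; the tolerance is something you'd tune yourself):

    import re
    import subprocess

    def rmse_difference(actual, expected):
        # `compare -metric RMSE` writes the metric to stderr; the value in
        # parentheses is normalised to the 0..1 range.
        result = subprocess.run(
            ["compare", "-metric", "RMSE", actual, expected, "null:"],
            capture_output=True, text=True,
        )
        match = re.search(r"\(([\d.eE+-]+)\)", result.stderr)
        return float(match.group(1)) if match else None

    # Only send images to the manual stage when they exceed a tolerance, e.g.
    # if rmse_difference("actual.png", "expected.png") > 0.01: flag_for_review()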

Sorry, I over-edited and lost clarity! The initial comparison (in the copy-and-pasted test code) is automated. It's done by an internal image comparison library that does pixel-based diffs. Only if the automated comparison fails do we do a human inspection to check that the changes are acceptable.
–
Bomadeno May 26 '11 at 12:15

Write your expected images in code: depending on your images, you may be able to hard-code the expected output for specific options more easily than by running the actual renderers. Put this code under source control instead of the images. Binaries are generally a bad thing to try to change under source control.
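A rough sketch of what that could look like (Python with Pillow, purely illustrative; your own drawing primitives would replace these): the expected image for a simple case is regenerated from a few lines of code at test time, so only the code lives in source control.

    from PIL import Image, ImageDraw

    def expected_solid_background(width, height, colour):
        # Expected output for the trivial "clear to a solid colour" case.
        return Image.new("RGB", (width, height), colour)

    def expected_single_rectangle(width, height, box, colour, background):
        # Expected output for "draw one axis-aligned rectangle".
        image = Image.new("RGB", (width, height), background)
        ImageDraw.Draw(image).rectangle(box, fill=colour)
        return image

    # In a test, compare the renderer's output against the generated image, e.g.
    # expected_single_rectangle(64, 64, (8, 8, 32, 32), "red", "white")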

Keep a separate QA repository, and try not to branch it. This will mean being more fastidious about what tests are in there, but it sounds like you might need to do that anyhow. Don't put just the expected output there - put all the QA material there. If you need something from the dev repository, put in place an automagic push from dev to QA (not a sync!).