Thursday, 17 November 2016

Overly Prescriptive Tests

In my recent post “Tautologies in Tests” I adapted one of Einstein’s apocryphal sayings and suggested that tests should be “as precise as possible, but not too precise”. But what did I mean by that? How can you be too precise, in fact isn’t that the point?

Mocking

One way is to be overly specific when tracking the interactions with mocks. It’s very easy when using a mocking framework to go overboard with your expectations, just because you can. My personal preference (detailed before in “Mock To Test the Outcome, Not the Implementation”) is to keep the details of any interactions loose, but be specific about the outcomes. In other words what matters most is (usually) the observable behaviour, not necessarily how it’s achieved.

For example, rather than set-up detailed instructions on a mock that cover all the expected parameters and call counts I’ll mostly use simple hand-crafted mocks [1] where the method maps to a delegate where I’ll capture only the salient details. Then in the assertions at the end I verify whatever I need to in the same style as the rest of the test. Usually though the canned response is test case specific and so rarely needs any actual logic.

In essence what I’m creating some people prefer to call stubs as they reserve the term “mocks” for more meatier test fakes that record interactions for you. I’d argue that using the more complex form of mock is largely unnecessary and will hurt in the long run. To date (anecdotally speaking) I’ve wasted too much time “fixing” broken tests that overused mocks by specifying every little detail and were never written to give the implementation room to manoeuvre, e.g. during refactoring. In fact an automated refactoring tool is mandatory on code like this because the methods are referenced in so many tests it would take forever to fix-up manually.

I often feel that some of the interactions with dependencies I’ve seen in the past have felt analogous to testing private methods. Another of my previous posts that was inspired by mocking hell is “Don’t Pass Factories, Pass Workers”. Naturally there is a fine line here and maybe I’ve just not seen enough of it done well to appreciate how this particular tool can be used effectively.

White-Box Testing

The other form of overly specific test I’ve seen comes from what I believe is relying too much on a white-box testing approach so that the tests express the output exactly.

The problem with example based tests is that they are often taken literally, which I guess is kind of the point, but as software engineers we should try and see passed the rigid examples and verify the underlying behaviour instead, which is what we’re really after.

For example, consider a pool of numbers [2] up to some predefined limit, say, 10. A naïve approach to the problem might test the pool by asserting a very specific sequence, i.e. the starting one:

for (var number in expected) Assert.That(pool.Acquire(), Is.EqualTo(number)); }

From a white-box testing approach we can look inside the NumberPool and probably see that it’s initially generating numbers using the ++ operator. The implementation might eagerly generate that sequence in the constructor, add them to the end of a queue, and then divvy out the front of the queue.

From a “programmer’s test” point of view (aka unit test) it does indeed verify that, if my expectation is that the implementation should return the exact sequence 1..10, then it will. But how useful is that for the maintainer of this code? I’d argue that we’ve over-specified the way this unit should be allowed to behave.

Verify Behaviours

And that, I think, lies at that heart of the problem. For tests to be truly effective they should not describe exactly what they do, but should describe how they need to behave. Going back to our example above the NumberPool class does not need to return the exact sequence 1..10, it needs to satisfy some looser constraints, such as not returning a duplicate value (until re-acquired), and limiting the range of numbers to between 1 and 10.

With these two tests we are free to change the implementation to generate a random sequence in the constructor instead if we wanted, and they would still pass, because it conforms to the looser, albeit still well defined, behaviour. (It may have unpredictable performance characteristics but that is a different matter.)

Once again we are beginning to enter the realm of property based testing which forces us to think harder about what behaviours our code exhibits rather than what it should do in one single scenario.

This does not mean there is no place for tests that take a specific set of inputs and validate the result against a known set of outputs. On the contrary they are an excellent starting point for thinking about what the real test should do. They are also important in scenarios where you need some smoke tests that “kick the tyres” or you are naturally handling a very specific scenario.

Indicative Inputs

Sometimes we don’t intend to make our test look specific but it just turns out that way to the future reader. For example in our NumberPool tests above what is the significance of the number “10”? Hopefully in this example it is fairly obvious that it is an arbitrary value as the test names only talk about “a limit”. But what about a test for code that handles, say, an HTTP error?

In this test we have a mock (nay stub) HTTP server that will return a non-2XX style result code. Now, what is the significance of the InternalServerError result code returned by the stub? Is it a specific result code we’re handling here, or an indicative one in the 5XX range? The test name uses the term “service unavailable” which maps to the more specific HTTP code 503, so is this in fact a bug in the code or test?

Unless the original author is around to ask (and even remembers) we don’t know. We can surmise what they probably meant by inspecting the production code and seeing how it processes the result code (e.g. a direct comparison or a range based one). From there we might choose to see how we can avoid the ambiguity by refactoring the test. In the case where InternalServerError is merely indicative we can use a suitably named constant instead, e.g.

A clue that there is a disconnect is when the language used in the test name isn’t correctly reflected in the test body itself. So if the name isn’t specific then nor should the test be, but also vice-versa, if the name is specific then expect the test to be. A corollary to this is that if your test name is vague don’t surprised when the test itself turns out equally vague.

Effective Tests

For a suite of tests to be truly effective you need them to remain quietly in the background until you change the code in a way that raises your awareness around some behaviour you didn’t anticipate. The fact that you didn’t anticipate it means that you’ll be relying heavily on the test rather than the code you just changed to make sense of the original intended behaviour.

When it comes under the spotlight (fails) a test needs to convince you that it was well thought out and worthy of your consideration. To be effective a guard dog has to learn the difference between friend and foe and when we write tests we need to learn how to leave enough room for safe manoeuvring without forgetting to bark loudly when we exceed our remit.

[1] When you keep your interfaces simple and focused this is pretty easy given how much a modern IDE can generate for you when using a statically typed language.

[2] This example comes from a real one where the numbers where identifiers used to distinguish compute engines in a grid.

1 comment:

Well said.I failed at this early on and am still fighting my way out of tests which are massively overly detailed...

BUT, I sometimes also quite like the way the overly detailed tests can catch things that I don't expect them to catch. My (hand crafted) mocks are generally a bit too noisy and simply log interactions (all interactions) and so things break when these interactions change... Mostly it's a pain and I try and move away from it or refactor so that the interaction is checked in one place rather than in each test...

BUT, sometimes these interactions catch things that I've no other way to catch. I have mock reference counted objects that have caught leaks because they're so noisy. I have mock locks which have caught potential deadlock because they're so noisy.

I'm torn between tearing out the noise and putting more in. Depending on which angle I'm coming at it from... Updating the code and finding that I need to change loads of tests just because I fiddle with a parameter object in a slightly different way but the outcome is the same; I want to simplify and quieten the mocks. Fixing a bug where a noisy mock has announced that I've just checked in a change that meant we're leaking an object in one specific scenario; more noise please...

Of course, the danger with the noise, if you don't try and remove it, is that you just assume the new noisy behaviour is correct rather than spending the time to determine if it is correct or broken; updating the tests with new noise to pass even though you haven't worked out if the new noise is good or the old noise was correct... Then the noise has no value.