To Kill a Mockingtest

Please don’t use mocks or stubs in tests. While they are seemingly ubiquitous in enterprise development, they have serious drawbacks, and typically mask easily fixable deficiencies in the underlying code.

Even their most ardent defenders concede that mocks and stubs have flaws. These include:

Dependence on fragile implementation details

Mocks and stubs require intimate knowledge of how code interacts with other modules. Even if the implementation is correctly refactored without altering public contracts, these tests will tend to break, and draw your attention away from more productive tasks.

Testing incidental properties with no bearing on correctness

What is the point of this code? This is an essential question to ask, in order to understand it. Tests have a story to tell here, and mocks invariably tell the wrong one. Is the point of makeCoffee() that we made a coffee, or that we opened the fridge to get the milk? When we payShopkeeper(), do we care that we completed a transaction, or that we rummaged through our wallet for change? When mocking tests fail, the poor maintainer is left to reconstruct the real intent from a trail of indirect clues and anecdotes.

Web of lies

It is good practice to write data structures that are correct-by-construction; any constructor or sequence of method calls is guaranteed to leave the data in a meaningful state. Stubs introduce test-only fictions that are stripped of any of the safety latches and guarantees that may have been built in; they introduce fresh sources of error that are not present in the codebase. There is no value in detecting any failure that arises in such a way.

As time goes on, lies beget more lies. It is not unusual for a stubbed input in one place to result in another here, and another there; the fiction leaks and spreads into some kind of evil facsimile of the original code, but with more bulk, complexity and defects.

Common Scenarios

Sometimes mocking and stubbing appears to be the most appropriate way to test a class, given the buttons and levers it affords. Usually though, this tells us that there is a better way to write it, with better separation of concerns, modularity and reusability. Let’s look at some common examples:

Here we have a class VillainHideout, that depends on Config, which is a sprawling data structure of 52 fields. We only care about two: the number of bees and sharks to release. Because of this, Config is stubbed; it is too hard to construct otherwise. Apart from the ugliness of the sock-puppetry, we can already see some avoidable problems:

VillainHideout is less useful and reusable than it might be, because it knows too much about things that are of no use to it.

We have introduced a new source of error: how can we know what an acceptable state for Config is? By re-implementing it piecemeal, we are undermining whatever effort went into establishing its guarantees, and arrogating construction knowledge that has no place in a foreign test.

There are several ways to address this. We can ask of Config: “Who could possibly want to know all of the information you hold?” If the answer is “nobody”, Config could be broken into several smaller structures of more specific interest: VillainConfig, HeroConfig, DamselInDistressConfig and the like.

We can also ask of VillainHideout: “What do you need to do your job? Do you really care where it comes from?” The answer to the first question is simply the number of bees and sharks; the answer to the second is probably “no”. None of the interesting functionality in the class that we might want to test would depend on the specific source of the configuration items. The top level of the application might care, but that is a matter of wiring, rather than the nefarious misdeeds in VillainHideout that we care about.
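A minimal sketch of where those two questions might lead. The names Released, defend, numBees, numSharks and wire are hypothetical, invented for illustration; the only thing taken from the text is that VillainHideout needs a number of bees and a number of sharks.

```scala
object HideoutSketch {
  // Hypothetical output type: what the hideout unleashes on intruders.
  final case class Released(bees: Int, sharks: Int)

  // The class takes only the two values it needs, not the sprawling Config.
  final class VillainHideout(bees: Int, sharks: Int) {
    def defend(): Released = Released(bees, sharks)
  }

  // Wiring lives elsewhere, and is the only place that knows about Config.
  // Field names are assumptions; the other fields are left out.
  final case class Config(numBees: Int, numSharks: Int /* ...and 50 more... */)

  def wire(config: Config): VillainHideout =
    new VillainHideout(config.numBees, config.numSharks)
}
```

A test can now build the hideout from two plain Ints, for example new VillainHideout(100, 3), and check what defend() returns, without a Config in sight.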

The code is cleaner, simpler, has fewer dependencies and is more reusable. The separate concern of application wiring has been moved elsewhere; the test is shorter, clearer, and has no mocks or stubs. It’s all win so far.

This is a typical OO domain model; we have classes that are metaphors for real-world objects, that change in-place. Obeying the “Tell, Don’t Ask” principle, there are a handful of actions that drive the behaviour, with most state hidden. A Customer holds a ShoppingBasket, can add Items to it, and can pay() for it with a CreditCard.

In our test, we (correctly) assess that we cannot locally reason about Customer while it talks to mutable collaborators, so we stub them out, providing fixed input from the basket, and detecting the charge() action on the card. In order to stub the ShoppingBasket, we can’t allow Customer to create its own, so we pass in a factory for that extra layer of indirection.
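The shape being described might look roughly like the sketch below. All signatures here are assumptions, and the hand-rolled StubBasket and RecordingCard stand in for what a mocking library would normally generate.

```scala
object MutableModelSketch {
  final case class Item(name: String, price: Int)

  // Mutable collaborators, hidden behind interfaces so that a test can replace them.
  trait ShoppingBasket {
    def add(item: Item): Unit
    def total: Int
  }
  trait CreditCard {
    def charge(amount: Int): Unit
  }
  // The extra layer of indirection that exists only so the basket can be stubbed.
  trait BasketFactory {
    def create(): ShoppingBasket
  }

  final class Customer(factory: BasketFactory, card: CreditCard) {
    private val basket = factory.create()
    def addItem(item: Item): Unit = basket.add(item)
    def pay(): Unit = card.charge(basket.total)
  }

  // Test double: ignores what is added and reports a fixed total.
  final class StubBasket(fixedTotal: Int) extends ShoppingBasket {
    def add(item: Item): Unit = ()
    def total: Int = fixedTotal
  }

  // Test double: records each charge so that the test can inspect it afterwards.
  final class RecordingCard extends CreditCard {
    var charged: List[Int] = Nil
    def charge(amount: Int): Unit = { charged = amount :: charged }
  }
}
```

Note that a test built this way passes whatever the customer puts in the basket; it only observes that charge() was called with the stubbed total.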

I’m reminded of a quote I read recently about testing:

“(Testing is) to create a tiny universe where the software exists to do one thing and do it well”.

The obvious insight is damning: why wouldn’t you write the software like this in the first place?

Mutable state is far more complex than the alternative. An immutable data structure is simply a value; like an integer, or a point. A mutable data structure, on the other hand, holds many values over time; it intrinsically represents an identity that strings together this series of facts. It is far harder to reason about; we are irretrievably entangled with the passage of time, and we cannot use equational reasoning or substitute calculations with their results. Encapsulation cannot save us; these properties are transitive, and will virally leak into anything that interacts with the mutable structure.

We should also be suspicious of appeals to familiarity, and especially appeals of similarity to the “real world”. Familiarity is no friend of simplicity. The world we experience is shackled to the arrow of passing time, and is limited by what can squeeze into three dimensions, and built with unreliable or expensive materials. Software can effortlessly cast these aside; we can do better.

The interesting behaviour in this example can easily be represented by pure functions and immutable values:
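One possible rendering, as a sketch only: the field names, and the assumption that charging a card subtracts from a balance, are illustrative guesses rather than anything fixed by the text.

```scala
object ImmutableModelSketch {
  final case class Item(name: String, price: Int)

  final case class ShoppingBasket(items: List[Item]) {
    // Returns a new basket; the original is untouched.
    def add(item: Item): ShoppingBasket = copy(items = item :: items)
    def total: Int = items.map(_.price).sum
  }

  final case class CreditCard(number: String, balance: Int) {
    // Returns a new card with the amount deducted.
    def charge(amount: Int): CreditCard = copy(balance = balance - amount)
  }

  final case class Customer(name: String, basket: ShoppingBasket, card: CreditCard) {
    def add(item: Item): Customer = copy(basket = basket.add(item))
    // Paying maps one Customer value to another; nothing changes in place.
    def pay: Customer = copy(card = card.charge(basket.total))
  }
}
```

With these definitions, Customer("Ken", ShoppingBasket(Nil), CreditCard("1234", 100)).add(Item("coffee", 4)).pay yields a customer whose card has a balance of 96, and that is all a test needs to assert.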

The code is now already composed of things that “do one thing and do it well”. There are no moving parts; everything is immutable. There are no “identities” that vary over time. The factory is gone. The methods are all pure functions, representing a straight mapping from inputs to outputs. The test, naturally, provides an input, and checks an output. This is clearly a better assessment of correctness than the earlier mock, which beats around the bush and sniffs for evidence.

“But then it’s an integration test!”

This shouldn’t concern us when we are dealing with pure functions. Compose two functions, and you still have a function. Compose ten, or a hundred, and it is still a function mapping values to values. In an important sense, the code is no more complex.

By contrast, by chaining even two mutable collaborators together, we have lost the ability to easily reason about the system; the answer to any question we might ask is “it depends”. It depends on when we ask; it depends on where they’ve been; it may even depend on the order in which we ask them.

You can see why OO test-writers are tempted to use intrusive shims to stem the bloodflow of complexity. Not only do they fail, but the problem they are trying to solve need not even exist.

A further example

Do you think we should test multiply() by checking, say, multiply(3, 9) == 27, or mocking the add() call and seeing if it gets called 3 times? Should we stub the Ints we pass in?

Mocking and stubbing is plainly ridiculous here, but not because the example is so simple. Int is a value, but essentially any well-defined immutable structure is a value too. Mocking the add() call provides no value whatsoever, not because it is trivial, but because it is testing something that has no bearing on a correct result. The code takes responsibility for its own correctness, and can pick whatever tools it pleases; the test has no business peeking further than that.
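For concreteness, here is one way multiply() might be written on top of add(). This is a sketch that assumes a non-negative second argument to keep it short.

```scala
object MultiplySketch {
  def add(a: Int, b: Int): Int = a + b

  // Repeated addition: adds `a` to an accumulator `b` times (assumes b >= 0).
  def multiply(a: Int, b: Int): Int =
    (1 to b).foldLeft(0)((acc, _) => add(acc, a))
}
```

The check multiply(3, 9) == 27 would keep passing if the body were replaced with a * b; a test counting calls to add() would break, even though nothing about the result changed.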

Here we have a service that performs some logic and sends an email. We don’t want to actually send emails in our test, so we mock the sending mechanism.

This is similar to the previous example in some ways; our code contains side effects, and we cannot treat it as a pure function. However, this time we cannot make the effect vanish in a puff of smoke; the decision to send the email has to happen one way or another. If we mock the call though, we have still lost the benefits of purity and referential transparency; mutable state might get all the attention, but other side effects are just as bad.

Let’s consider: how much of this scenario can we represent without the side effect at all? Checking unsubscribe status is pure, generating the email content is pure, and importantly, the decision to send the email is pure. Perhaps we can rewrite it as a pure function of Customer => Option[SendEmail], and let something else pull the trigger?
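A sketch of what that could look like. The Customer fields, the welcomeEmail name, the message text and the interpret helper are all assumptions made for illustration; only the Customer => Option[SendEmail] shape comes from the text.

```scala
object EmailSketch {
  final case class Customer(name: String, email: String, unsubscribed: Boolean)

  // The decision to send an email, reified as a plain value.
  final case class SendEmail(to: String, subject: String, body: String)

  // Pure: the same Customer always yields the same decision.
  def welcomeEmail(customer: Customer): Option[SendEmail] =
    if (customer.unsubscribed) None
    else Some(SendEmail(customer.email, "Welcome", s"Hello, ${customer.name}!"))

  // The interpreter is the only effectful part; `send` is whatever really delivers mail.
  def interpret(decision: Option[SendEmail], send: SendEmail => Unit): Unit =
    decision.foreach(send)
}
```

A test of welcomeEmail supplies a Customer and compares the result against an expected Option[SendEmail]; no sender is involved at all.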

We have called out the EmailSender call as a new data type, representing the decision to send; all of the interesting behaviour is now in a pure function. We didn’t need the stubs in the first place of course; the immutable Customer can be considered a plain value, rather than some kind of foreign collaborator. The mocks have all but evaporated.

We have pushed the actual effect out into an interpreter. Do we want to use mocks to test this? Maybe; but perhaps it’s not so interesting to test anymore. It’s often easier to check that last inch of inter-system I/O manually.

Lifting mocked calls into messages

Mocking a method call makes a statement that the purpose of the code under test is to call the next thing. In a sense, the method call is the logical output of the function. When this is the case, we can always represent the call as a returned message object, like we did above. This has several advantages:

The message is a true output of your function, and can be tested more easily.

The code need not know about the “next thing”; application wiring can be handled separately.

The functionality is now far more reusable and recombinable; anything can consume the message and continue the flow. In the mocked version, only a single specific type is allowed to consume the message, and how to propagate it is hard-wired.

It seems far more often the case, though, that message-sending is not the intent, and the tests would be better served by simply looking at the broader input and output.

Controlling side effects is the real battle

We have seen that mocks and stubs have to compensate for serious flaws before they can hope to provide value:

They are linked to fragile implementation details, and will constantly break under routine refactoring.

What they test is almost always beside the point of what’s actually required.

The real battle for clean tests, and code for that matter, is about controlling side effects. Mocks and stubs attempt to provide some level of testing in the face of entangled effects in output or input, respectively. However, in practice, almost every usage can be obviated by simply writing better code; the entanglement is the problem, and mocks only allow the developer to ignore it. Most common usages fall under the categories described here:

Huge dependencies that are too hard to create, which can be broken down or pushed further out.

Fragile mutable domain modelling that can be made simpler and more robust, by replacing with equivalent immutable values and functions.

Essential side effects or I/O coupled with interesting pure behaviour, that can be separated and pushed out.

10 thoughts on “To Kill a Mockingtest”

Hi Ken. It’s a great post and I agree with almost all of it. *almost*. 🙂

The bit I’m concerned with is where you say: *It’s often easier to check that last inch of inter-system I/O manually.*

Anything that has to be checked manually is bound to break. I had a recent instance of this where a thread pool wasn’t being shut down properly, leaving several JVM processes hanging. Shutting down the thread pool is a necessary side effect for the correct execution of the program. And so is sending an email. Perhaps you can do this check using an integration test instead, or using a mock; either way, I believe it needs to happen.

Additionally it will prevent regressions which are more likely to happen when new team members join and there is little to no handover.

Thanks Leo! I think if you’ve isolated the side-effect to the point where it’s just a dumb trigger-pull, the mock doesn’t really test anything more than the intent to send an email. If you’ve already pulled the intent out as some kind of decision ADT, then you’ve already got that intent covered. What does the mock add?

To test that an email actually got sent, you could rig up mail servers in your test environment, but you’re getting towards the high-cost low-reward part of the testing pyramid. YMMV 🙂

Yes. And I’ve done that – pinging a local email server to make sure an email has been sent. It’s only low reward if there are no major impacts to the business or its users. In the particular system I’m thinking of, not sending an email meant direct money loss. Not a great outcome. As you say, YMMV 🙂

There is some good stuff here on how mutable state introduces complexity and how eliminating it makes for code that is simpler and more easily maintained and tested. I need to do more of this.

Your main argument, however, seems to be that this ALSO eliminates the need for isolation in testing. Personally, I don’t understand that position.

Even a pure function with no side effects has behaviour; some sort of processing that it applies to its inputs to arrive at its outputs. If we string together a number of these functions we end up with another function that has more complex behaviour. I am somewhat mystified by your assertion that it is no more complex. If the end result is not more complex, then why bother breaking it down into smaller functions in the first place?

I submit that for the very same reason that we would break a complex behavior down into a number of simpler behaviours strung together, we probably want to be able to test those simpler behaviors independently.

An important reason is those behaviours are subject to change as we change the code in response to new requirements.

In your Credit Card example, what happens if a new requirement arises for a certain % fee to be added to all credit card transactions? Maybe we change the implementation of CreditCard, or maybe we insert a new function in there somewhere. In any case, this is none of Customer’s concern as it has no relevance to its decision to charge the card or not.

Currently though, we are going to break your CustomerTest. That test is ostensibly testing the ‘do I charge or not’ behaviour in Customer, but it is *also* implicitly relying on a particular implementation of the ‘subtract from balance’ behaviour in CreditCard.

Perhaps that is ok for this example as the end result is still pretty simple; perhaps it makes sense to consider (Customer + CreditCard) as a ‘unit’ for testing purposes.

I submit however, that for a system of any size there is a level of granularity at which we want to do ‘unit’ testing, and that we want to be able to make assertions about the behaviour of that unit independently of behaviours outside that unit.

For this we need fake implementations of those behaviours. Ideally ones that actually do not have any behaviour… i.e. stubs at least.

Obviously when you compose a function of other functions, the resulting mapping might be more intricate. I say it’s no more complex “in an important sense” because the result is still pure too, and equational reasoning still applies. Even observing an effectful collaborator instantly voids these guarantees, which is where forcible isolation with mocks becomes more desirable.

Why not isolate the component parts? Well they are already isolated; what’s left is either more pure logic that can be extracted, or wiring. I tend to think wiring is not very interesting to test in isolation; either the component can take full responsibility for its outputs, or the wiring can be reified and actually become the output, as per my email example. Otherwise if integration is the only point, then an integration test seems appropriate, no?

The credit card test does suffer from the problem you mention; thanks for pointing that out. The desired result should be that the credit card is modified in whatever-way-is-appropriate-for-the-credit-card, not simply deducting the literal fee.

So, should we stub it? Morally, the credit card is two things here; a data structure and an attached (CreditCard, Int) => CreditCard transformation that we want to use. In such a situation, there’s nothing wrong with providing a replacement function to test that it has been applied. That sounds very much like a stub; so what am I opposing?

When we test card.charge(555), no-one would call 555 a stub, even though it is not a number we use in prod code. It is simply a value of the type required; the code under test must be agnostic to production config concerns. More interesting compound immutable structures are still just values, and wouldn’t be called stubs either. So what if the required type is say, CreditCard => CreditCard? The distinction between values and stubs is starting to disappear, and I don’t object to it at all; a C=>C value is required, and to test, one is simply provided.
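As a rough sketch of what I mean (the pay signature and the mark function are hypothetical, made up for this reply):

```scala
object FunctionValueSketch {
  final case class CreditCard(number: String, balance: Int)

  // The caller supplies how a card gets charged; pay only decides whether to apply it.
  def pay(card: CreditCard, total: Int, charge: (CreditCard, Int) => CreditCard): CreditCard =
    if (total > 0) charge(card, total) else card

  // A test-side value of the required type; it tags the card instead of touching the balance.
  val mark: (CreditCard, Int) => CreditCard =
    (card, amount) => card.copy(number = s"charged $amount")
}
```

A test can check pay(CreditCard("x", 0), 555, mark) == CreditCard("charged 555", 0) without knowing anything about fees or balances. Here mark is just a value of type (CreditCard, Int) => CreditCard, in the same way that 555 is just a value of type Int.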

Where problems start to appear is where single-dispatch subtype-polymorphism is the only tool available to express these kinds of dependencies. Using an interface or class CreditCardModifier { def modify(c: CreditCard): CreditCard } immediately creates a tempting extension seam where unnecessary cruft can be added to the required C=>C. Many modules will find themselves unawares depending on more and more things that they don’t care about in the slightest.

Also, when subclasses are the only way to transport functions, we are forced to not simply provide a Thing, but create a new kind of Thing and supply that. The “web of lies” starts here; this considerably weakens the guarantees that the author may have hoped to provide around their type, as this is much harder to do in an open inheritance hierarchy. So a new source of error has appeared already. As a last resort, we can at least force ourselves to use subclasses that are morally just instances and not a new kind of thing.

Naturally, the byte-diddling proxies that all popular mock frameworks use completely abandon any pretense of maintaining guarantees, and are in no way necessary to test code well.

You can take your pick where you draw the line in that continuum and start calling things “stubs”; the nature of my objection is not that components can be agnostic to passed-in behaviour, but that novel traps and dangers are introduced that obscure the test code and obscure obvious avenues of code improvement.

In terms of the “little universe” quote, I was responding to Jay specifically about unit tests. Nonetheless, in the context of a microservice, I believe one can have all the tests following this principle. Indeed, my design for running the CI build for Atlassian’s PaaS (all microservices) has the build agent creating a little universe with real instances of all the important downstream services to run tests (one of the many benefits of containerization tools like Docker).

I agree with a lot of what you have written, and your examples are well suited to support your assertions (pardon the pun). I still find good uses for stubs and mocks. I have also found tests using mocks may become less valuable over time.

I agree very much with Leo’s thoughts on when end-to-end automated testing becomes valuable. I have gone to extraordinary lengths to test a system’s interaction with major public APIs (which often have no sandbox environment against which to test), because of the high value impact on a failure of our systems to correctly interact with those 3rd party systems.

Thanks Josh! Sorry I didn’t attribute your quote; I lifted it from Jay’s book “Working Effectively With Unit Tests”, but couldn’t find the original source or context.

Microservices are interesting; we use https://github.com/realestate-com-au/pact and https://github.com/DiUS/pact-jvm to test our MS interactions in isolation. This is effectively mocking! The difference is, between systems we are inescapably in effect-land; we can’t bypass the problem. It’s appropriate here, and I would argue not within a codebase. There’s a bunch of different concerns that come into play when we get into e2e system land, for sure.

I think this could have used more references to the actual literature of “To Kill A Mockingbird.” The similarities in meaning are quite profound, as it is about the dubious, hidden truth and the opposites/role reversal: “You never really understand a person until you consider things from his point of view… Until you climb inside of his skin and walk around in it.” “People generally see what they look for, and hear what they listen for.” “Sometimes the Bible in the hand of one man is worse than a whisky bottle in the hand of (another)… There are just some kind of men who – who’re so busy worrying about the next world they’ve never learned to live in this one, and you can look down the street and see the results.”