Even if you’re not a programmer, read on. There’s something in this for you, too, and you’ll probably want to share it with the programmers around you. I promise.

Dear programmers, your tests are probably dragging you down. Here we have another delightful irony of agile software development, this time related to test-driven development. I give you The Fundamental Irony of Test-Driven Development.

If you practise test-driven development, then depending on how you manage your short-term tasks, you run the risk of increasing your stress levels compared to not doing test-driven development at all.

Strange, isn’t it? Here’s how it works:

Practising TDD encourages you to break your work down into microtasks: in many cases, individual tests. This clarifies what you intend to do, by breaking it down into more, smaller tasks.

You’re used to keeping the salient details of your current work in your mind’s short-term memory, even when you work for a few hours at a time.

Now, you have even more details to try to keep in your head while you work. Congratulations!

It reminds me of some companies who adopt Scrum, switch from weekly status meetings to monthly sprint reviews, and the flow of information about the progress of the project actually goes down. Great, no?

This bears repeating.

Practising TDD while keeping everything in your head:

decreases focus

hurts productivity

wastes precious energy

increases stress

You didn’t see that coming, did you? Maybe you did.

It doesn’t happen right away, mind you. You probably found TDD a relief when you first started practising it. The constant sense of completion. The continuous positive feedback. The orderly progress towards working, solid code. It feels fantastic… but maybe you’ve noticed a few unexpected side effects.

Tasks drag on: often when you think you’re done you notice one more test that you’d forgotten to write.

Thinking about all those error cases and side effects up front makes every feature feel like so much work.

In the planning meeting it seems like you need half a day, but once you get going it feels like it’s never going to end.

When you get near the end of a task, you can’t shake this feeling that you’ve forgotten something important.

In your weakest moments, it might encourage you to long for the days when you didn’t think about your work quite so systematically and carefully. It’s like writing a second book: once they see how much work it takes to write one, many authors can’t stomach going through it again.

Kent Beck To the Rescue!

Naturally, Kent Beck had good advice for us way back in his seminal book Test-Driven Development: By Example. He encouraged us to start a programming session by writing a test list: literally a list of the tests that we think we’ll need to write. As we make tests pass, we cross them off the list; as we think of more tests to write, we add them to the list. When we’ve crossed the last item off the list, we’re done! Simple.

As you might expect, this advice can help anyone, and not just programmers. It just so happened that Kent aimed the advice at programmers writing tests. The more generic version of this advice can help everyone:

When you sit down to work, start by getting things out of your head!

A Few Basic Rules

If you’re a programmer, then the moment you hear about a programming problem, you probably start solving it in your head. That enthusiasm will drag you down if you don’t do something about all those ideas buzzing around your mind. Thinking in terms of tests helps organise your thoughts, but if you keep them all inside, they’ll crush you. Try this the next time you sit down to program:

Get yourself something to write with and something to write on.

Sit comfortably. (So many people overlook this one.)

Take all the ideas buzzing around your mind and write them down.

Draw some high-level design diagrams

List the tests that have leapt to mind

List any refactorings that you already know you might need to do

Once your mind feels quiet, pick a test and start test-driving.

As ideas pop into your head, quickly write them down.

Add some tests or refactorings to your list.

If unrelated work pops in your head (“I need to prepare for that damn meeting tomorrow morning…”), write that anywhere else: on another page, in a notebook, wherever.

As you make tests pass, cross them off the list.

As you perform the refactorings you had in mind, cross them off the list.

Repeat until you can’t think of anything more to do to complete your task.

OK: more than a few rules, but only because I want to leave less room for differences of understanding or interpretation.

A few things to keep in mind:

Just because you write the test or refactoring on your list, that doesn’t mean that you have to do it. If you don’t need it, then don’t do it.

When first getting things out of your head, don’t try to list all the tests that you might need. Focus on listing all the tests that immediately come to mind. First, unburden yourself, then you’ll find it easier to notice that you’ve missed something.

I write tests from the top of the card or page and refactorings from the bottom. I guess that’s just personal style.

When you finish your programming task, if you have other work left over, then put it wherever it needs to go in order for you to follow up: email, calendar, to do list, whatever system you use. [1]

Even Simpler…

Not only that, but I like the satisfaction of crossing things off and ripping up cards when I’m done with them. What can I say? Little things like that amuse me. Whatever works.

So don’t let your tests drag you down. Lighten the load by getting them out of your head. Try it now! I’ll wait.

References

Kent Beck, Test-Driven Development: By Example. When you reach a certain point in studying a field, you long to read the classics. If you’ve reached that point, then read this classic.

David Allen, Getting Things Done. When I first read Getting Things Done, with its emphasis on getting things out of your head, I immediately recalled Kent’s “Test List” trick. As much as Test Lists helped me with programming tasks, getting everything out of my head has helped with the rest of my work. You can’t imagine the difference until you try it. David’s trademark: Your mind is for having ideas, not holding them. Kinda cultish, but true.

J. B. Rainsberger, “Getting Started with Getting Things Done”. If you’re not eager to read 200 pages of productivity advice, then start with a handful. I give you enough detail to try the basics of the system for a few days and decide whether you want to learn more. How agile of me!

[1] If you don’t have a system, then you’ll need one, but let’s solve one problem at a time.

When clients ask me to help them with legacy code it generally takes less than 30 minutes for me to run into a debilitating constructor—a term I use to describe a constructor that does too much. You might think me melodramatic for calling it “debilitating”, but these constructors not only slow us down when we try to understand the code, but block our every effort to improve it. These constructors often hardwire a dependency either to some horrifying external resource (a database or a web service end point) or into a framework (even something as “simple” as threads or a lifecycle manager). These constructors also often grow to dozens, and in some extreme cases hundreds, of lines. These constructors kill code bases and, by extension, suck the life out of programmers.

Do not underestimate the power of the debilitating constructor. If you need convincing, then take ten minutes now and click here to read a more in-depth description of the damage these constructors do.

Let Me Tell You What I’m Looking For in a Fucking Constructor…

If you don’t recognise this reference, then you don’t know the comedy of Ron White, and shame on you. Go find some now and listen to it. I’ll wait.

A constructor has one job: build objects. You can see that right in the name. I want constructors to build objects.

An Anti-Pattern, While I’m In the Neighborhood

When I see this, I have the typical alpha programmer impulse to belittle the programmer who wrote this test. I take a few deep breaths to let that feeling pass. True, in Java, the language specification guarantees that every constructor either return a non-null reference or throw an exception; however, in languages that let you do whatever you want with new (like C++), you can’t make that assumption. Moreover, after you’ve seen enough debilitating constructors, you assume that every constructor you touch wants to kill you. Rightly so. This makes assertNotNull(new X()) almost look reasonable. (It still isn’t reasonable in Java, so please just stop doing this in Java. You have bigger problems.)
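The listing that this section reacts to is not reproduced here. As a reconstruction under stated assumptions, the test in question has roughly the following shape. X is the placeholder class name from the text, and the assertNotNull helper is a local stand-in for the one a test library such as JUnit provides, so that the sketch needs nothing beyond the JDK.

```java
// X is the placeholder class from the text; its body is irrelevant here.
class X {
    X() { }
}

class ConstructorSmokeTest {
    // Local stand-in for a test library's assertNotNull (an assumption,
    // so that this sketch compiles without any dependency).
    static void assertNotNull(Object actual) {
        if (actual == null) {
            throw new AssertionError("expected a non-null reference");
        }
    }

    // The anti-pattern: in Java, `new X()` either yields a non-null
    // reference or throws, so this assertion can never fail by itself.
    static void testConstructorReturnsSomething() {
        assertNotNull(new X());
    }

    public static void main(String[] args) {
        testConstructorReturnsSomething();
        System.out.println("passed, and told us almost nothing");
    }
}
```

If the constructor throws, the test fails before the assertion even runs, so the assertion adds no information.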

Constructors Can’t Reveal Intent

I limit constructors to building objects (as opposed to connecting to databases) in part because constructors in many languages can’t reveal intent. In Java, C#, C++ and its cousins, you must name the constructor after the class to which it belongs. You may not change the name. Those languages expressly forbid constructors from expressing their intent to do anything more complicated. As a defence mechanism, I limit constructors to very specific behavior, so that I never have to guess what a constructor will do.

What Should Constructors Do, Then?

I expect constructors to initialise new instances to a “rational” state.

When I encounter a constructor that doesn’t do this, I end up having to grope around in the dark looking for the magic sequence of method invocations that allows me to put the object in a suitable state so that I have even a modicum of confidence that I can safely invoke the method that I really need to invoke. Taking a cue from the Pragmatic Programmers, I too call this programming by coincidence—although sometimes I call it programming by accident.

Programmers routinely and drastically underestimate the time, effort, and energy wasted by dealing with constructors that leave new instances in some half-initialised, irrational state.

Initialising instances to a rational state means assigning values to enough fields to avoid spurious runtime errors like null pointers/references or illegal/unexpected states. If you’ve initialised an instance to a rational state, then I feel confident that I can invoke any method on the object without fear of it belching red smoke in my face. I find this expectation entirely reasonable, don’t you?
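As a minimal sketch of what a “rational” state could look like, assuming a hypothetical Player class that does not come from any code base discussed here: the constructor assigns every field, refuses input that would leave the instance broken, and does nothing else.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical class, for illustration only.
class Player {
    private final String name;
    private final List<String> answers; // never null, even before any answer arrives
    private int purse;                  // starts at a meaningful value

    Player(String name) {
        if (name == null || name.isEmpty()) {
            throw new IllegalArgumentException("a player needs a name");
        }
        this.name = name;
        this.answers = new ArrayList<>();
        this.purse = 0;
    }

    // Every method is safe to invoke immediately after construction.
    void recordAnswer(String answer) {
        answers.add(answer);
        purse++;
    }

    int purse() {
        return purse;
    }

    String describe() {
        return name + " has " + purse + " coins";
    }
}
```

A client can invoke describe() or recordAnswer() on a brand new Player in any order without hunting for a magic sequence of set-up calls.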

I also expect constructors to only initialise new instances to a rational state.

Erik Dietrich covered this case particularly well in his article, so I won’t repeat all that detail here. We can agree that a constructor always needs to initialise new instances, and we might even agree by now that a constructor always needs to initialise instances completely. Everything else a constructor might do remains subject to change. As soon as you think that you’ll never want to change it, your employer will acquire another company, turn it into a new division that supplies your project with software, and you’ll need to change it. To change a constructor, you have to change code: you can’t subclass, you can’t plug in a different implementation of an interface, you can’t override a method, you can’t even invoke a different method. You have to change existing code. Remember that we’re talking about legacy code here: profitable code that you’re afraid to change. So don’t make design decisions that force you to change code that you’re afraid to change. Really.

Fine! What Do I Do Now?

You can start by introducing named constructors, which are class-level (static) functions that instantiate objects.

Please stop calling these things factories. This causes confusion. A Factory creates objects while hiding the exact class it has instantiated. This means, for example, that it chooses an implementation to instantiate, but returns the object through a reference to the interface type. If you know exactly which class it’s returning, then please call it a creation method or a named constructor.
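To make the distinction concrete, here is a small sketch. All of the names (Shape, Circle, Square, Shapes) are hypothetical and chosen only for illustration.

```java
interface Shape {
    double area();
}

class Circle implements Shape {
    private final double radius;

    Circle(double radius) {
        this.radius = radius;
    }

    public double area() {
        return Math.PI * radius * radius;
    }

    // Named constructor: the return type says exactly which class you get.
    static Circle unitCircle() {
        return new Circle(1.0);
    }
}

class Square implements Shape {
    private final double side;

    Square(double side) {
        this.side = side;
    }

    public double area() {
        return side * side;
    }
}

class Shapes {
    // Factory: chooses an implementation and hides it behind the interface.
    static Shape withArea(double area, boolean round) {
        return round
                ? new Circle(Math.sqrt(area / Math.PI))
                : new Square(Math.sqrt(area));
    }
}
```

Circle.unitCircle() promises a Circle, so it is a named constructor. Shapes.withArea() promises only a Shape, so the caller cannot tell which class came back, and that is what earns it the name factory.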

You can introduce a named constructor quite easily. Many refactoring browsers do this automatically. (IntelliJ IDEA frustratingly calls this refactoring “Replace Constructor with Factory Method…”, while Eclipse also frustratingly calls it “Introduce Factory…”.) Even if you don’t use an automated refactoring, you’ll find it relatively easy.

It looks harmless, but give it a few months. Moreover, I didn’t want to scare you off. I follow these steps:

1. Copy all the code into a new class-level function on the same class. Temporarily make fields more visible if you need to.

2. Migrate clients to use the new named constructor.

3. Add each assigned field to the constructor as a parameter.

After Step 1

This new code duplicates the constructor, so don’t stop here. I don’t yet know what this named constructor actually does, so I’ve given it the mechanical name newGame. I don’t like this name, but I expect to rename it soon.

In order to write this code, I had to change createRockQuestion() to be class-level (static). Unexpectedly, only the Game constructor used this method, so I had no obstacles to making this method class-level.

After Step 2

Well, you can imagine. I don’t have to show you all the instances of new Game() turned into Game.newGame().

After Step 3

I did this in a few microsteps. My strategy involves moving new Game() down to the bottom of the named constructor. Most named constructors do all the interesting work, then assign the results to the new instance through its conventional constructor.

Next, I build up the first list of questions, then assign it through the constructor. I can foresee having to do the same thing four times, so I use my usual strategy: do the first one, then do the second one, then do the rest.

By now I think I understand what this named constructor is doing. I can rename it from newGame (accurate but vague) to newConsoleReportingDummyGame, because, let’s face it, those questions look like placeholders just for testing.

Hey! The questions look like placeholders just for testing, but the GameReporter seems intended for production use. We should probably just split that behavior apart.

And now we can.

And that’s the point.

By the way… we should probably introduce a Parameter Object for the four sets of questions. That would leave a simple, clear constructor Game(GameReporter, QuestionDeck). The resulting constructor might even reveal its own intent well enough that we can get rid of our named constructor.
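The original listings for these steps are not reproduced here, so the following is only a sketch of where the refactoring could end up, under several assumptions: GameReporter is given a single hypothetical report method, QuestionDeck holds four lists whose category names (other than rock) are guesses, and askNextRockQuestion exists only so that the sketch has some behaviour to observe.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical shape; the text names this type but not its methods.
interface GameReporter {
    void report(String message);
}

// Parameter Object for the four sets of questions. Only "rock" appears
// in the text; the other three category names are assumptions.
class QuestionDeck {
    final List<String> pop;
    final List<String> science;
    final List<String> sports;
    final List<String> rock;

    QuestionDeck(List<String> pop, List<String> science,
                 List<String> sports, List<String> rock) {
        this.pop = pop;
        this.science = science;
        this.sports = sports;
        this.rock = rock;
    }
}

class Game {
    private final GameReporter reporter;
    private final QuestionDeck deck;

    // Conventional constructor: assigns its parameters and does nothing else.
    Game(GameReporter reporter, QuestionDeck deck) {
        this.reporter = reporter;
        this.deck = deck;
    }

    // Named constructor: does the interesting work, then calls `new Game` last.
    static Game newConsoleReportingDummyGame() {
        QuestionDeck placeholders = new QuestionDeck(
                placeholderQuestions("Pop"),
                placeholderQuestions("Science"),
                placeholderQuestions("Sports"),
                placeholderQuestions("Rock"));
        return new Game(System.out::println, placeholders);
    }

    private static List<String> placeholderQuestions(String category) {
        List<String> questions = new ArrayList<>();
        for (int i = 0; i < 50; i++) {
            questions.add(category + " Question " + i);
        }
        return questions;
    }

    // Hypothetical behaviour, present only to give the sketch something to do.
    void askNextRockQuestion() {
        reporter.report(deck.rock.get(0));
    }
}
```

With this shape, a test can build a Game from a deck holding a single question and a reporter that records messages in a list, without touching the console or the placeholder questions.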

The Underlying Design Principle

Introducing a named constructor has highlighted a dependency problem: the constructor knew too much about where its data was coming from. This bound the Game object to a single context, and so for example, if we had to track down a bug related to asking the tenth question, then we would be forced to play a game that moved turn by turn to the tenth question in an arbitrary category. Click. Click. Click. This seems risky and invites failure.

The constructor violated the Dependency Inversion Principle, and introducing a named constructor moved details up the call stack towards the client, leaving behind parameters that we can easily change in order to more easily explore Game’s behavior. Now if we have a problem asking the tenth question, we can pass in a deck with a single question and figure out what happens.

This doesn’t solve every design problem in Game, but I can only take one step at a time.

A Quick Summary

I want constructors only to initialise new instances to a “rational” state, so I move more interesting code into named constructors.

Now that I have to give this construction behavior a name, cohesion and dependency problems become explicit and obvious.

Now that I have identified explicit, obvious design problems, I can decide how and when to improve the design.

As a pleasant side effect, testing the affected code becomes much easier. If you care about that kind of thing.

One More Thing

If you’d like one simple rule of thumb to follow:

Don’t let a constructor invoke an instance method on the instance that it’s in the process of creating.

If you found yourself wanting to do this, then you have two responsibilities in the same place. (Why? Think about it.)
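One way to see the trouble, offered as an illustration rather than as the only answer to that question: a constructor that invokes an instance method is both building the object and using it, and in Java the second job can run before the first has finished. The Deck and RockDeck classes below are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class Deck {
    Deck() {
        fill(); // instance method invoked on an object that is still being built
    }

    void fill() { }
}

class RockDeck extends Deck {
    private final List<String> questions = new ArrayList<>();

    @Override
    void fill() {
        // Runs from Deck's constructor, before the field initialiser above
        // has executed, so `questions` is still null at this point.
        questions.add("Rock Question 0");
    }
}

class HalfBuiltDemo {
    // Answers true when constructing a RockDeck blows up.
    static boolean constructionFails() {
        try {
            new RockDeck();
            return false;
        } catch (NullPointerException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("construction fails: " + constructionFails());
    }
}
```

Moving the call to fill() into a named constructor, after `new` has returned, separates building the object from using it.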

References

Andrew Hunt and Dave Thomas, The Pragmatic Programmer. Item 31 in this classic book is Programming By Coincidence, in which the authors paint a beautiful picture of just how slowly and deliberately the programmer has to work when encountering code such as debilitating constructors.

Erich Gamma and others, Design Patterns. The first book I ever bought as hypertext, and a book truly designed for it. It explains what makes a factory a factory. Please don’t just look at the class diagrams. Please read the “Drawbacks and Alternatives” sections.

For years I’ve written about contract tests, and most notably have never had any clear examples of them to share with you. I’d like to change that today.

I’m sharing this with you as a rough cut, meaning that I haven’t yet taken the time to seriously edit this, nor add much explanation. I feel tired just now, so I want to stop, but at the same time, my loyal readers deserve to see a real example.

I’ve spent the last few days messing around with the Atom text editor, writing a package for it that calculates some basic text document statistics. I started with https://github.com/cjoh/status-stats, but noticed that this had fallen rather seriously out of date, and… well, I’ll spare you the story. I’m here now.

Get To The Point!

The text statistics library that @cjoh used doesn’t count words correctly. (I’m working in CoffeeScript.)

I decided to look for a library that counts words correctly. I found one called words.js. (I could tell you a whole story here, but I won’t.) It seems to count words correctly.


require("words.js")

describe "The Words API", ->
  describe "counting words", ->
    # CONTRACT
    # Words.count always answers a valid Number, even
    # when it encounters invalid input.
    # When the input is text, then .count appears to
    # answer the correct count of the words.
    # When the input is not text, then .count returns
    # an arbitrary Number, often 0.
    countWords = (text) ->
      new Words(text).count

    it "counts words, maybe", ->
      expect(countWords("There are four words!")).toBe(4)

    it "counts the empty string!!!111!1!1!", ->
      expect(countWords("")).toBe(0)

    it "gracefully handles null", ->
      expect(countWords(null)).toBe(0)

I wrote more, but you get the idea. When I felt confident that words.js counts words correctly, I decided to jump to the key question: What contract does my package need with the thing that counts words? I decided on this:

countWords turns text into a number

countWords’ return value represents the number of words in its input text, as long as the input consists of text

if the input is not text, then countWords can return any number it likes, as long as it returns a number, and nothing else, and doesn’t throw an error

This contract implies two things:

The invoker can’t tell whether countWords has returned an accurate count of words, because countWords doesn’t signal invalid input; therefore

The invoker shouldn’t send countWords invalid input.
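The same contract can be sketched as a reusable check that runs against any implementation. This sketch uses Java rather than CoffeeScript, and every name in it (WordCounterContract, WhitespaceWordCounter) is hypothetical. In Java the type signature already covers “turns text into a number”, and null is the only non-text input a String parameter admits.

```java
import java.util.function.ToIntFunction;

class WordCounterContract {
    // Checks the contract against any candidate implementation.
    static void check(ToIntFunction<String> countWords) {
        // Text in, accurate count out.
        expectEquals(4, countWords.applyAsInt("There are four words!"));
        expectEquals(0, countWords.applyAsInt(""));
        // Non-text input: any number will do, as long as nothing is thrown.
        countWords.applyAsInt(null);
    }

    private static void expectEquals(int expected, int actual) {
        if (expected != actual) {
            throw new AssertionError("expected " + expected + " but got " + actual);
        }
    }
}

// One possible implementation that satisfies the contract.
class WhitespaceWordCounter {
    static int countWords(String text) {
        if (text == null || text.trim().isEmpty()) {
            return 0;
        }
        return text.trim().split("\\s+").length;
    }
}
```

Calling WordCounterContract.check(WhitespaceWordCounter::countWords) passes. Swapping in a different word-counting library later means running the same check against the new implementation.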

Fortunately, as long as my package uses Atom correctly, this won’t cause a problem. More importantly, my package will never cause Atom to blow up. In the worst case, it shows strange results on the status bar.

My next step involves exploring and clarifying the contract for two more little microfeatures before putting the whole thing together and shipping it.

Shout Out

I would like to thank Kevin Sawicki for treating me so nicely as I made my first few microcontributions to Atom. Kevin, you have made me feel very welcome in your community.

References

J. B. Rainsberger, “Integrated Tests are a Scam”. An hour-long talk, so save it for when you have the time. It presents how contract tests fit into my practice as a programmer.

c2.com Wiki Community, “Abstract Test Cases”. When we first discussed contract tests, we called them “Abstract Test Cases”, because we name things like programmers. I remember writing my example sometime in 2000. Of note: “This kind of test case ensures that concrete classes do not violate the contracts of their superclasses.”

After one of the mob programming sessions I did with RubySteps last year, a viewer singled out this as a valuable piece of advice:

One of the most important things that a programmer must learn how to do is to stop when they get to the end.

I remember the first time I encountered this issue: in Steve McConnell’s excellent book Rapid Development, which I remember reading in 1997. (Wow! I should read it again.) He referred to “gold-plating” as one of his Classic Development Mistakes. You might immediately think, Uh… stop it!, but until then I hadn’t realised just how pervasively programmers in particular tended to do this. They would continue working on something, even though they’d solved the underlying problem. This sounded utterly insane to me. Suddenly, I saw it everywhere. I noticed when others did it. (They loved that.) I noticed when I did it. I didn’t really understand why.

When I started practising test-first programming (and later test-driven development), this problem slowly disappeared—at least for a while. (Read on; you’ll see why.) For years, one of my IBM office mates challenged me to justify writing the tests first. He argued that as long as he wrote his tests and production code in tiny cycles, it didn’t much matter which he wrote first. For years, I never had a good answer for him. Of course, now that I do, I have no idea where to find him.

Write Until You Get To The End, Then Stop

Among the microbenefits of writing tests first: knowing when to stop. I approach test-driven development roughly like this:

Write a failing test that describes a very small bit of new behavior.

Run the tests to watch the new test fail.

Write just enough production code to make the tests pass.

Run the tests to watch all the tests pass.

Now that I have a solid pool of change detectors, try to improve the design, usually following the Simple Design Dynamo.

All these steps help me write better code, but I want to focus on just two of these steps.

First, we have the act of writing a failing test, which describes when to stop. You can’t know when to stop if you haven’t thought about when you should stop. I imagine that programmers gold-plate in part because they don’t really understand the goal behind their task. They might not even know the goal behind a microtask on the way to completing their larger task, even though they themselves have divided their task into microtasks. Amazing, isn’t it? They might have a vague idea of the steps, but writing tests encourages the programmer to articulate very clearly the goal of the next step. By doing this, one can’t help but know exactly what “there” looks like, so as to stop when arriving “there”.

In addition to knowing where to stop, we need to know when we’ve got “there”. For this purpose, the step “write just enough production code” matters most. When I write tests first, I find it much easier to judge when I’ve written just enough production code, because the new test now passes and I haven’t broken anything that used to work. When I see the green bar, I know I’ve arrived “there”. Doing this brings more benefits than only that:

I know when to stop writing code right now related to the current task/feature/behavior.

I train myself in general to stop writing code once I’ve written enough.

I train myself to think about what it would mean to have written enough code, so that I get better at spotting it when it happens.

It might seem like overkill to you, but remember: programmers famously tend to gold-plate, and so any technique that helps a programmer ingrain the habit of stopping will probably help. Test-first programming fits the bill.

Imagine that you had on your computer a red flashing light that signaled that’s enough as you wrote code. When you saw it flash, you could stop, then go on to the next microtask. Wouldn’t that help you finish your work sooner? Wouldn’t that help you question whether you need to do what you have in mind next? Wouldn’t that help you avoid unnecessary work in the first place? It sounds great to me! Writing tests first gives me exactly that, so I do it even to this day, lest I fall back to gold-plating.

Danger: Gold-Plating the Design

The Universe, however, in its desire to respect the Law of Conservation of Irony, intervenes. Although I write the tests first in order to tell me when to stop writing code, the refactoring step gives me a new opportunity to flex my gold-plating muscles. I love refactoring because I never have to stop. I can always find something more to improve. I can go around the Simple Design Dynamo again and again and again… try to stop me!

Now you see why I pointed out more benefits than simply “I know when to stop writing code for the current microtask”. Knowing that the programmer mindset tends to lean towards gold-plating in general, I continue to write tests first in order to train myself to remain aware of my own tendency towards gold-plating. As a result, I do it less often, and in a variety of situations.

I learn to refactor just enough to keep new features flowing without turning modularity into an end in itself. I don’t want to turn clean code into a fetish.

I learn to build just enough features to keep new customers flowing, rather than trying to deliver the perfect product.

I learn to edit just enough in order to convey my ideas clearly, rather than trying to turn my work into literature.

I could probably find more, but that will do for now. What do you do in order to counter your own tendency towards gold-plating? It might have nothing to do with code. I’d love to learn your secrets either at tell.jbrains.ca or in the comments.

I had intended to write a nice article showing a concrete example of learning tests in action, then something wonderful happened: all the code disappeared.

The Situation

I love Typeform, especially because it integrates with Stripe for processing payments. Sadly, Typeform does not allow my customer to pay-what-e-thinks-it-was-worth unless I anchor em by presenting a handful of price options, similar to what you see here.

Not the way I want to let my customers choose the amount they wish to pay. How many buttons should I give them?

The Plan of Attack

After reviewing my options, I settled on letting a customer enter the amount e wants to pay in a Typeform form, which will send an email to a custom application that responds with a link to a Stripe checkout form for the amount e chose. It feels a little convoluted, but it will work. I know how to create the Stripe checkout and I know how to create the Typeform form, but I have not handled incoming email with Ruby and Sinatra and Heroku before, so I need to learn that.

The Architecture Diagram

An Hour of Pointless Coding Later…

After writing code (and documenting what I did so that I could publish it for you to read), I found out that I didn’t need to do any of it. Instead, I could use existing tools to learn what I needed to know. Specifically, I needed to know the format of what Mailgun will forward to my Sinatra application’s POST handler.

With this, I have documentation of one half of the contract between Mailgun and my Sinatra application, namely the contents of the POST request representing the incoming email, but what about the contract of the response? I found this in the documentation.

For Route POSTs, Mailgun listens for the following codes from your server and reacts accordingly:

If Mailgun receives a 200 (Success) code it will determine the webhook POST is successful and not retry.

If Mailgun receives a 406 (Not Acceptable) code, Mailgun will determine the POST is rejected and not retry.

For any other code, Mailgun will retry POSTing according to the schedule below for Webhooks other than the delivery notification.

If your application is unable to process the webhook request but you do not return a 406 error code, Mailgun will retry (other than for delivery notification) during 8 hours at the following intervals before stop trying: 10 minutes, 10 minutes, 15 minutes, 30 minutes, 1 hour, 2 hour and 4 hours.

Excellent. I have enough information about the contract between Mailgun and a POST handler that I can design my Sinatra application. I can even test-drive it! (Don’t mind the diagram. I promise that I’ll return the appropriate HTTP status code when things go wrong.)
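The response half of that contract is small enough to capture in code. The sketch below uses Java for illustration, although the application described here is a Sinatra one. The Outcome names are hypothetical, and 503 is merely one choice of “any other code” that asks Mailgun to retry.

```java
class IncomingEmailResponse {
    // Hypothetical outcomes of handling one forwarded email.
    enum Outcome { PROCESSED, REJECTED, TEMPORARILY_UNAVAILABLE }

    // Chooses the status code to send back to Mailgun.
    static int statusCodeFor(Outcome outcome) {
        switch (outcome) {
            case PROCESSED:
                return 200; // treated as successful, no retry
            case REJECTED:
                return 406; // treated as rejected, no retry
            default:
                return 503; // assumption: any other code triggers the retry schedule
        }
    }

    // Restates the documented rule from the server's point of view.
    static boolean mailgunWillRetry(int statusCode) {
        return statusCode != 200 && statusCode != 406;
    }
}
```

The design choice worth noticing: an email that the application will never be able to process should produce 406, otherwise Mailgun keeps retrying it for hours.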

I loves me some Plain Ruby!

Uh… Where Are the Learning Tests?

I ran them manually. I suppose I could have automated them, but I don’t see the value. If I notice behavior that seems not to conform to the contract I’ve discovered here today, then I’ll fire up RSpec; for now, this will do.

I upgraded to OS X 10.10 Yosemite and something strange happened with my installations of IntelliJ IDEA. They just disappeared. I still don’t know what happened to them. When I tried to reinstall IDEA 13 Community Edition, it crashed on launch.

Evidently, IDEA normally gracefully handles not finding a JRE with which to launch itself; but it didn’t for me, and I still don’t know why. After fixing the situation, IDEA now indeed handles a missing JRE gracefully. My Java SE 6 installation must have found itself in an irrational state.

Fortunately, my Twitter community came to my rescue. I’m sharing my experience here, just to make it easier for those who upgrade after me. I found two solutions.

A Risky Solution

When I reinstalled Java SE 6 and made it my system Java virtual machine, IDEA 13 (and IDEA 14 beta) launched successfully. I followed these steps.

This, of course, provides a system-wide solution. It works, but it perpetuates an annoying dependency between IDEA and the state of my operating system. As with any reliance on global variables, it carries certain risk. It made me happy, then, that my Twitter community pointed me towards a more localised solution.

A Less Risky Solution

Thanks to this tweet, I have managed to solve the problem in a less risky fashion.


How do I gain confidence in code that generates HTML, such as tag libraries or view templates?

Well, it depends on what I’m trying to do.

Am I learning how an existing tag library works? If so, then I create a bare project, install the tag library, use it to generate some HTML, then use something like HTMLUnit (any HTML parser will do) to check the results. This way, I can explore all the features that the tag library has without mixing those tests up with the tests for my project’s core behavior. I can use what I learn from these Learning Tests, meaning the contract of the tag library features that matter to me, to write tests for my core behavior that make safe (well, safer) assumptions about what the tag libraries do.

Am I creating my own tag library? I typically create custom tags by extracting duplication from HTML, so whatever tests I already have for HTML indirectly test my custom tags. Once I extract enough behavior into a little group of custom tags, then I begin to feel like I have a proper, reusable library, and then I treat it exactly like I do any existing tag library, so this reduces to my answer above.

Am I testing other view code that generates HTML, meaning not a tag library? In this case, I make sure to separate that code from the rest of the system. In particular, I don’t want to have to click-click-click in order to get to the right page so that I can check the resulting HTML. If I have to click-click-click, then I’ve clearly violated the Dependency Inversion Principle, since the view depends on its invoker, the controller.

Please note that automating this click-click-click with something like Selenium doesn’t make this problem go away; it merely makes it easier to tolerate the problem… for a while.

This means finding a way to render my HTML template directly without invoking the rest of the application. How to do this varies from framework to framework, or from library to library. It’s one of the reasons that, way back in the 2000s, I preferred using an HTML template engine like Apache Velocity over using JSP. I never did figure out how to reliably render a JSP without involving the rest of the web container and its nonsense. Are there any standalone JSP engines now? I don’t know.

I know that RSpec does this well for Rails. I can simply render a view template with whatever data I desire, and I never have to invoke a controller nor run the rest of the system. Now how RSpec-Rails does this amounts to killing kittens, but that’s mostly because Rails likes coupling everything to everything else and expects you to like it, too. I try to ignore the mewling of dying kittens as I run my view specs.

The Two Key Points

To check the HTML that X generates, run X without running the things that invoke X. (Dependency Inversion Principle.) This is true for X = JSP processor; X = HTML template engine; X = whatever. Write a test like this:
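Here is a minimal sketch of such a test in plain Java, so that it runs without a web container or a test library. The class `CustomersViewSketch` and its `render` function are hypothetical stand-ins for whatever template engine you use. The only point is that the test hands data directly to the view and inspects the HTML that comes back.

```java
import java.util.List;

// Hypothetical stand-in for a view template: it turns view data into HTML.
// In a real project, render() would call into your template engine instead.
class CustomersViewSketch {
    static String render(List<String> customerNames) {
        StringBuilder html = new StringBuilder("<ul id=\"customers\">");
        for (String name : customerNames) {
            html.append("<li>").append(name).append("</li>");
        }
        return html.append("</ul>").toString();
    }

    static void check(boolean condition, String message) {
        if (!condition) throw new AssertionError(message);
    }

    // The "tests": render the view directly, with no controller involved.
    public static void main(String[] args) {
        String html = render(List.of("Alice", "Bob"));
        check(html.contains("<li>Alice</li>"), "expected Alice in the list");
        check(html.contains("<li>Bob</li>"), "expected Bob in the list");
        check(render(List.of()).equals("<ul id=\"customers\"></ul>"), "expected an empty list");
    }
}
```

A real view would also escape the names, and a real test would probably parse the HTML rather than match substrings. I left both out to keep the sketch short.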

As you do this, you describe the contract of the view. You can use this information to check that the controller puts the right data in the right view template variables without having to run the controller and view together.

For example, if you know that your view expects the scripting variable customers with a collection of Customer objects, then your controller tests can check that it puts a valid (non-null) collection of Customer objects wherever the view rendering engine will look for the scripting variable customers. In the Spring WebMVC world—and I realise I’m old—this meant the customers key in the model Map inside the ModelAndView object that the controller returns from its handleRequest() implementation.
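As a rough sketch of that controller test, assuming a hypothetical controller that returns its model as a plain `Map` (standing in for the model inside Spring's `ModelAndView`, which I deliberately don't use here so that the example has no dependencies):

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical controller: it returns the model as a plain Map,
// standing in for the model Map inside a ModelAndView.
class CustomersControllerSketch {
    static Map<String, Object> handleRequest() {
        Map<String, Object> model = new HashMap<>();
        model.put("customers", List.of("Alice", "Bob"));
        return model;
    }

    // The view's contract: a non-null collection under the key "customers".
    // instanceof evaluates to false for null, so a missing key fails the check.
    static boolean satisfiesViewContract(Map<String, Object> model) {
        return model.get("customers") instanceof Collection;
    }

    public static void main(String[] args) {
        if (!satisfiesViewContract(handleRequest())) {
            throw new AssertionError("the view expects a collection named customers");
        }
    }
}
```

The test never renders the view. It only checks that the controller holds up its side of the contract that the view tests describe.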

Don’t test a tag library by testing your application. If you want to test the tag library, then test it in isolation. This also applies to learning about the tag library by writing Learning Tests for it.

When you want to use a tag library, you think about which features of it you want to use and how you expect those features to behave. You can probably explore those more thoroughly by not limiting yourself to the exact context in which you plan to use that tag library feature right now. You’ll probably learn more this way than by simply trying to get the current thing working. This helps you better understand which part of the tag library’s contract your application will depend on. You will find this useful, I promise.

Notice that, in what I’ve just written here, you can replace “tag library” with other generic services like “database driver”.

References

J. B. Rainsberger and Scott Stirling, JUnit Recipes. In particular, chapter 12, “Testing Web Components” covers a lot of this ground. Even if you don’t use JUnit, the principles apply.

I know that I’m showing my age here, but I was there when HTMLUnit was born, so I like to mention it every now and then.↩

Tests that I write to document how a library behaves. When they pass, then I understand what the library does; when they fail, I don’t. Michael Feathers also refers to characterisation tests, which characterise what the code does, rather than specify what we want the code to do.↩

Almost everyone starts organising their tests according to the module (or class or object) that they’re testing. If they have a class called Customer, then they have a test case class called CustomerTest and put all the tests for Customer into this one bundle (module, class, describe block, whatever you call it).

Don’t stop here.

If you continue to add all your Customer tests to CustomerTest, then you’ll certainly judge it as “too big” after a while. Even if you don’t, you’ll notice some patterns in the names of the tests themselves.

If you work with libraries like Spock or RSpec that let you name tests with arbitrary text, then you might not notice these patterns as easily, or the duplication might seem “more natural” in natural language. Don’t let that fool you into thinking that you haven’t duplicated concepts in your code!

You’ve almost certainly noticed a pattern in the names of some of your tests.

testAdd_EmptyList

testAdd_NonEmptyList

testAdd_ListAtCapacity

testAdd_DuplicateItem

testContains_ItemFound

testContains_ItemNotFound

testContains_DuplicateItem

testSize_EmptyList

testSize_NonEmptyList

testIsEmpty_EmptyList

testIsEmpty_NonEmptyList

testIndexOf_ItemFound

testIndexOf_ItemNotFound

testTrimToCapacity_TrimmingNeeded

testTrimToCapacity_TrimmingNotNeeded

I don’t endorse this pattern for naming tests in general, but this reflects common practice. In a real-life version of this example, I’d be writing Learning Tests to help me understand how ArrayList works in Java. In such a situation I often write tests oriented around each method and the various special cases, because I’m trying to document the API as designed. When designing new behavior in new modules or classes, I prefer not to name my tests for any particular method, function, or even class, so as not to couple—even in my own mind—the tests unnecessarily to an initial implementation.

I can imagine finding this set of tests—and more—in a test called ArrayListTest.1 You can already see two things:

There are a lot of tests here.

There is a fair amount of duplication here.

You can also see that we can’t remove that duplication with just a single refactoring: the various tests fall into different groups, and so need us to organise them slightly differently.

Remove Duplication

I don’t seem to have any problem understanding the names of these tests—I wrote them, so I guess that shouldn’t surprise me—which means that I will turn my attention to removing duplication.

The Simple Design Dynamo™

In this case, duplication in the names of the tests will suggest different ways of reorganising the tests than would a simple refactoring of the duplicate code. Even though I haven’t written these tests out in code, I’ve seen them a number of times. Especially when a programmer writes all these tests in one test case class, e2 typically arrives at only one line of setup code shared by all the tests:

    @Before
    public void setUp() {
        theList = new ArrayList<String>();
    }

Removing this duplication helps a little, but we can do much better. For example, looking at the tests for the “non-empty list” cases, I imagine I’ll find copied-and-pasted lists of “pre-populated” items.

    @Test
    public void testSize_NonEmptyList() {
        theList.add("jbrains is awesome");
        theList.add("jbrains is awesomer");
        theList.add("jbrains is even awesomer than that");
        Assert.assertEquals(3, theList.size());
    }

    @Test
    public void testIsEmpty_NonEmptyList() {
        theList.add("jbrains is awesome");
        theList.add("jbrains is awesomer");
        theList.add("jbrains is even awesomer than that");
        Assert.assertFalse(theList.isEmpty());
    }

When I look at the “isEmpty(), non-empty case” test, I get the idea that although I might want to check the “3” case for size(), I might prefer to check the boundary case for isEmpty(), meaning a list of one single item. Quite often, however, I see programmers merrily copy and paste lists of items to new tests because, well, we find that easier.

Now that I say this, perhaps I should add a test for testIsEmpty_BarelyNonEmptyList in order to distinguish the cases. I’ll add that to the to-do list I have by my computer.4

Group Tests by Fixture

Long ago, in a book far, far away, I wrote about grouping tests by fixture. I recommended that you “test behavior, not methods” (section 1.3.2) and “move special cases to their own fixture” (recipe 3.7). I gave some examples. It was fine. I encouraged the reader to remove duplication in the setup (now called @Before) code. More than anything else, however, don’t let tests twiddle the fixture. If a handful of tests want to share a fixture, then I prefer that they all share the very same fixture, meaning the same objects in the same state. This becomes especially important when you start trying to reuse fixture objects using inheritance. (I used to do this; I tend not to do it any more. The cure always eventually hurts more than the disease.)
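To make “the same objects in the same state” concrete, here is a minimal sketch of a test class organised around a single fixture. I've written it in plain Java rather than JUnit so that it runs without dependencies. The class name `NonEmptyListTestSketch` and the helper `nonEmptyList()` are illustrative names, with the helper playing the role of the setup method.

```java
import java.util.ArrayList;
import java.util.List;

// A test class organised around one fixture: a list holding three items.
// Every test starts from that same state, and no test adjusts it before acting.
class NonEmptyListTestSketch {
    // Plays the role of the @Before method: builds a fresh fixture for each test.
    static List<String> nonEmptyList() {
        List<String> theList = new ArrayList<>();
        theList.add("jbrains is awesome");
        theList.add("jbrains is awesomer");
        theList.add("jbrains is even awesomer than that");
        return theList;
    }

    static void check(boolean condition, String message) {
        if (!condition) throw new AssertionError(message);
    }

    static void testSize() {
        check(nonEmptyList().size() == 3, "size should count all three items");
    }

    static void testIsEmpty() {
        check(!nonEmptyList().isEmpty(), "a list with items is not empty");
    }

    static void testAdd() {
        List<String> theList = nonEmptyList();
        theList.add("one more item"); // the action under test, not a fixture tweak
        check(theList.size() == 4, "adding should grow the list by one");
    }

    public static void main(String[] args) {
        testSize();
        testIsEmpty();
        testAdd();
    }
}
```

If a test needed a list with exactly one item, I would take that as a hint to start a second class with its own fixture, rather than removing two items at the top of the test.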

Junk Drawer

You probably have a junk drawer in your house. You throw junk into it. Some of that junk you need, so you root around in it to find something specific, like a pen or a paperclip. Eventually, you find that you need a paperclip unusually often—usually to press a recessed reset button on some electronic thing—and so you decide to put the paperclip somewhere to make it easier to find. If you put it in its own little compartment, then you’ll find it, but if you then start putting some other, not-so-related items in with the paperclip, then before long you find yourself with a second junk drawer.5 Then a third. Then you just have junk everywhere. It doesn’t work.

So it goes when you try to organise fixture objects into a setup function. This works great until the first time a test wants to change the fixture just a little for its own purposes. For the first test, you don’t worry so much: you put it in the same test class, twiddle the fixture—what harm can one extra line of setup do?—then go along your merry way. The very next special case wants to twiddle the fixture in precisely the same way. Then a third. Now is the time to move these three tests into their own test class with their own fixture, as I recommended in JUnit Recipes. If you don’t do this now, then before you know it, there’s graffiti everywhere. Almost every test twiddles the fixture in some unexpected way. You find some of that fixture up in superclasses, and you become lost in a maze of super() calls that you need to make at just the right time, otherwise your tests vomit NullPointerExceptions all over the place.

Ewww. You should have moved those tests to their own fixture when you had the chance.

Organise By Fixture or By Action?

When you find a group of tests inside a larger test class, you can either extract those tests by fixture or by action.6 I used to think that choosing between these options amounted to black magic, “skill”, or wisdom. Now I think I have a rule suitable for an advanced beginner (on the Dreyfus model) to use.

If you name your tests using a convention like test<action>_<special case>—for example, testIsEmpty_NonEmptyList—then examine the test names for patterns. First look for multiple groupings of the same set of special case words, then group those tests into a test class by fixture. Then look for multiple groupings of the same set of action words, then group those tests into a test class by action.

I think this works because the special case names generally correspond to similar fixtures. If you have a bunch of tests that need to operate on a “non-empty list”, then you’ll probably copy and paste the same three items into each list object in those tests. (I don’t claim to call this a good thing, but we do it.) Moreover, if you try to organise the special case groupings by action instead, then you’ll move those tests away from each other into separate test classes, even though they have similar setup code. This creates a cohesion problem7 solved by reorganising those tests by similar fixture.
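The first half of that rule is mechanical enough to sketch in code. What follows is only an illustration, assuming that every name follows the `test<action>_<special case>` convention exactly. The class `TestNameGrouperSketch` is a hypothetical helper, not a tool I use.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TestNameGrouperSketch {
    // Groups the actions by special case (the part after the underscore) and keeps
    // only the special cases that more than one test shares: the fixture candidates.
    // Assumes every name starts with "test" and contains an underscore.
    static Map<String, List<String>> candidateFixtures(List<String> testNames) {
        Map<String, List<String>> actionsBySpecialCase = new LinkedHashMap<>();
        for (String name : testNames) {
            String[] parts = name.substring("test".length()).split("_", 2);
            String action = parts[0];
            String specialCase = parts[1];
            actionsBySpecialCase.computeIfAbsent(specialCase, key -> new ArrayList<>()).add(action);
        }
        // A special case that appears only once doesn't suggest a shared fixture.
        actionsBySpecialCase.values().removeIf(actions -> actions.size() < 2);
        return actionsBySpecialCase;
    }

    public static void main(String[] args) {
        List<String> names = List.of(
                "testAdd_EmptyList", "testAdd_NonEmptyList", "testAdd_ListAtCapacity",
                "testSize_EmptyList", "testSize_NonEmptyList",
                "testIsEmpty_EmptyList", "testIsEmpty_NonEmptyList");
        System.out.println(candidateFixtures(names));
    }
}
```

With these names, “EmptyList” and “NonEmptyList” survive as candidate fixtures, while “ListAtCapacity” appears only once and stays behind with the tests that I would then group by action.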

Group Tests First By Special Cases, Then By Actions

Returning to our tests for ArrayList, we have

testAdd_EmptyList

testAdd_NonEmptyList

testAdd_ListAtCapacity

testAdd_DuplicateItem

testContains_ItemFound

testContains_ItemNotFound

testContains_DuplicateItem

testSize_EmptyList

testSize_NonEmptyList

testIsEmpty_EmptyList

testIsEmpty_NonEmptyList

testIndexOf_ItemFound

testIndexOf_ItemNotFound

testTrimToCapacity_TrimmingNeeded

testTrimToCapacity_TrimmingNotNeeded

Following my proposed rule, I would end up first with these tests grouped by fixture:

EmptyListTest
    testAdd
    testSize
    testIsEmpty

NonEmptyListTest
    testAdd
    testSize
    testIsEmpty

BarelyNonEmptyListTest
    testIsEmpty

MatchingItemsTest
    testContains
    testIndexOf

NotMatchingItemsTest
    testContains
    testIndexOf

DuplicateMatchingItemsTest
    testContains
    testIndexOf

Also these tests grouped by function—the junk drawers:

AddItemToListTest
    testListAtCapacity
    testListNotYetAtCapacity
    testItemAlreadyInList

TrimArrayListToCapacityTest
    testNeedsTrimming
    testDoesNotNeedTrimming

Of course, this doesn’t constitute an exhaustive test for ArrayList, but you get the idea. You’ll notice that I’ve renamed some of the tests and added a few. By reorganising the tests this way, a few ideas popped into my head, such as “adding an item when the list is not yet at capacity”. When I first wrote this list of tests, I thought of “not yet at capacity” as an unstated default assumption. Since Java creates an ArrayList with a capacity of 10 items by default, I could think of testAdd_EmptyList as implicitly checking the “not yet at capacity” case. This kind of implicit checking can lead to “holes in our tests”, which can lead to the dreaded “green bar, but there’s a bug” problem that brings us back to my old favorite: integrated tests are a scam. I don’t want to go there just now.

Instead, let me close by proposing that you try grouping tests first by repeated special cases (which correspond to similar fixtures), then by actions. I think you’ll like the results.

“If I Group Tests Like This…

…then I won’t be able to find anything!” Srsly, this is 2014. Don’t you use ag or ack or grep or something? Can’t you search your project for uses of the function add(), or at worst, the regular expression /\.add\(/?!

References

J. B. Rainsberger, JUnit Recipes: Practical Methods for Programmer Testing. I wrote ten years ago about the benefits of organising tests by fixture, rather than by function. I never felt truly comfortable with how easily the reader could apply that advice. This article attempts to assuage my guilt at giving such questionable advice.

J. B. Rainsberger, “Integrated Tests are a Scam”. The talk as I performed it at DevConFu in Jurmala, Latvia in December 2013. Don’t watch the Agile 2009 conference version any more.

J. B. Rainsberger and friends, Understanding Coupling and Cohesion. 57 minutes of video. I invited some of my friends to discuss the nebulous concepts of coupling and cohesion in software design. How do we think about these topics? How do we understand the terms? How do we use that in our work as programmers? How do we teach it to others? How much does any of it even matter? Our invited guests: Corey Haines, Curtis Cooley, Dale Emery, J. B. Rainsberger, Jim Weirich, Kent Beck, Nat Pryce, Ron Jeffries.

Yes, I’m assuming Java here. Don’t let that fool you: I see exactly the same patterns in Ruby/PHP/Python as I do in Java/C#/C++/C.↩

I intend eventually to replace this sentence with a link to an article that discusses in more depth how to avoid feeling distracted while programming. If you can read this after December 2014, then tell me to write this article now.↩

This phenomenon relates to the Broken Windows Theory in which once we decide not to repair the first broken window in a neighborhood, vandalism and further damage follows soon thereafter.↩

Do you remember the “Three A’s” of arrange, act, and assert? By action I mean the function that you intend to test with that test.↩

Although we don’t generally agree on how to define cohesion, I find it useful to move similar things closer together and keep dissimilar things farther apart. This leads me towards higher (better) cohesion.↩

I think that both parts matter. You probably accepted the “afraid to change” part without any need for convincing. (If not, then this article probably won’t interest you.) Moreover, if the code doesn’t generate significant value, then I don’t see much risk in changing it. If the cost of “getting it wrong” doesn’t significantly outweigh the profit we derive from “getting it right”, then who cares? Probably not I.

I treat valuable code with considerable respect. It provides food for families. I treat difficult-to-change code also with considerable respect, although this comes more from fear than admiration. If we put these two things together, then, quite simply, one false move and I might destroy an order of magnitude more profit than the yearly cost to keep me around.

This brings me to Rule Number Zero of Surviving Legacy Code:

Maximise safety.

We find ourselves in the typical chicken-and-egg problem: we want to write tests in order to refactor more safely, but then we remember that integrated tests are a scam℠1 and decide that we’d rather break things apart a little in order to write less-costly-to-maintain tests. So which do we do first?

In a situation like this, I like to go back to my guiding principles.

Integrated tests are a scam℠ in part because they don’t put enough positive pressure on my designs and thereby don’t give me enough useful design feedback. Right now, I don’t care about this. I already know that the design needs significant work. I also know that I can’t handle the torrent of feedback that microtests would give me about the design.23 If I want to use this principle to guide my behavior, then I need to find another justification.

Integrated tests remain a scam℠ in part because of the combinatoric explosion in the number of tests I need to achieve a strong level of coverage, which in this case correlates to confidence. I might have to write millions of tests to achieve high coverage. I probably only have time to write hundreds of tests, in which case I have to gamble about the level of coverage. Perchance, could I not care about coverage in this situation?

Test coverage—however one measures or defines it—links directly to safety in changing code. I want to use those tests as change detectors. I want the red light that flashes the moment I make a mistake. Microtests, especially if I write them first, give me that. They help me find mistakes immediately. They help drive down the cost of making a mistake, an essential technique for managing risk.4 If I can’t write microtests cost-effectively, then what can I do?

What if, instead of a red light that flashes the moment I make (almost) any mistake, I had a pink light that flashes when I make a really obvious mistake? I can’t have what I want, but I can afford this; will it do? It will help more than doing nothing. I will simply buy as much of this confidence as I can afford. To do this, I combine two simple ideas: Golden Master and sampling.

Golden Master

I use Golden Master to help me detect changes in the behavior of a system when I can’t justify writing the typical kind of assertion that you’ve grown used to seeing in tests. I use this trick, for example, when I find it difficult to articulate the expected result of a test. Imagine a function whose output consists of an image. It happens quite often that a binary comparison between actual and expected result yields a hyperactive assertion—one which frequently fails even when a human would judge that the test had passed. I suppose some people know tricks to make it easier to articulate “looks similar enough” for images, but I don’t know how to do that, and that leaves me to choose either a hyperactive bit-by-bit comparison or ongoing, manual inspection. Rather than revert to the Guru Checks Output antipattern5, however, I take a snapshot of the last-known acceptable output—I call that the golden master—and save it for future use. When I run the test again, I compare the output to the golden master, and if they match, then the test passes; if they don’t match, then the test fails. This doesn’t make the code wrong, but it means that I need to check the result and decide whether the code needs fixing or the golden master needs replacing.

You can use Golden Master wherever you already have some output to check, even if you find the form of that output particularly challenging. With this technique, you simply diff the output and inspect the situation only when you find differences between the current test run and the golden master. If your system already sends text to an output stream that you can capture, then you have the tools to use this technique.
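A minimal sketch of the comparison, assuming that the system's output fits in a string and that the golden master lives in a file. The class `GoldenMasterSketch` and the function `runLegacyCode()` are hypothetical stand-ins. The sketch also records the golden master automatically on the first run, which is a simplification: a human should inspect that first snapshot before trusting it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class GoldenMasterSketch {
    // Returns true when the output matches the golden master. When no golden
    // master exists yet, records the output as the golden master and returns true.
    static boolean matchesGoldenMaster(Path goldenMaster, String actualOutput) {
        try {
            if (Files.notExists(goldenMaster)) {
                Files.write(goldenMaster, actualOutput.getBytes(StandardCharsets.UTF_8));
                return true;
            }
            String expected = new String(Files.readAllBytes(goldenMaster), StandardCharsets.UTF_8);
            return expected.equals(actualOutput);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Picks a file path in a fresh temporary directory, so the demo starts clean.
    static Path temporaryGoldenMasterPath() {
        try {
            return Files.createTempDirectory("golden-master").resolve("output.txt");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical stand-in for the legacy code whose text output we capture.
    static String runLegacyCode() {
        return "Total: 42\nStatus: OK\n";
    }

    public static void main(String[] args) {
        Path goldenMaster = temporaryGoldenMasterPath();
        // First run: nothing to compare against, so the output becomes the master.
        System.out.println(matchesGoldenMaster(goldenMaster, runLegacyCode()));
        // Second run, same behavior: the output matches the master.
        System.out.println(matchesGoldenMaster(goldenMaster, runLegacyCode()));
        // Changed behavior: the mismatch tells me to go and look.
        System.out.println(matchesGoldenMaster(goldenMaster, "Total: 43\nStatus: OK\n"));
    }
}
```

A mismatch only says that something changed. Deciding whether to fix the code or replace the golden master remains a human job.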

I warn you, however, not to give in to the temptation to start scraping your output for specific information to check. Unless you have no other alternative, you will probably find it more cost-effective to carefully extract that information from the code and check it directly using good, old-fashioned assertEquals(). Don’t build a parser for an arbitrary, unplanned, probably context-sensitive grammar. That way lies madness. (Of course, if a context-free grammar happens to describe the format, then go for it. You’ve always wanted to learn lex and yacc, haven’t you?)

Sampling

I find one glaring problem with the Golden Master technique: if the output describes a long-running algorithm, process, or path through the system, then the golden master itself might describe only one of a thousand, million, or even billion potentially-interesting possible outputs. Welcome back to the combinatoric explosion problem that makes integrated tests such a scam℠. How do we proceed when we can’t possibly check the variety of paths through the system that we need to check?

Ideally, we refactor! I know that if I can break my system into many smaller, composable pieces, then I turn products into sums: instead of checking combinations of paths through multiple parts of the system at once, I can check the handful of pairwise connexions between parts of the system in relative isolation. I could turn millions of tests into hundreds. Unfortunately, in our current situation, I don’t feel comfortable refactoring, so that means that I have to sample the inputs and hope for the best.

You can find more sophisticated sampling systems out there among blogs written by experienced testers, but they all amount to sampling: if I can’t try every combination of inputs, then I try some combinations of some of the inputs and aim for the best coverage that I can.

This shouldn’t surprise you. You’ve done this before. You’ve written a function that operates on an integer, and you knew enough about the algorithm to identify boundary cases at, for example, -1, 0, and 1, as well as around 100 and 1000, so you check on the order of ten inputs and feel satisfied that the algorithm will work for the remaining few billion inputs. You were sampling.

In the case of legacy code, however, sometimes we can’t sample quite so intentionally. Sometimes even when we limit our scope to characteristic inputs, we have so many combinations of those inputs that we still can’t afford to write and run all those tests. In some cases, we don’t even know how to identify the characteristic inputs. In other cases, the algorithm itself has a random element, defeating our goal of writing nice, deterministic, repeatable tests. Random sampling to the rescue.

If you can use the random number generator to generate a stream of inputs to your system, then you can use this to generate a collection of output files, and that collection can act as your golden master. You only need to control the random number generator by seeding it with the same stream of seeds every time. I use a simple linear generating function like m + p * i where m and p represent arbitrarily-chosen numbers and i represents a loop index. Now I simply have to decide how big a sample to take. Generally speaking, a larger sample gives me more confidence in the sensitivity of the pink flashing light that signals danger.

I adjust the size of the sample depending on how long it takes to execute a test run, and how significantly that affects my flow while programming. I also adjust the size of the sample to match my fear level: the more worried I feel about getting something wrong, the larger sample I take while working, and I accept the cost of slowing down. I’d usually rather go a little too slow than a little too fast, because I know that the cost of making a mistake would likely dominate the savings from going more quickly.
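Here is a rough sketch of seeded sampling, assuming that the legacy code takes its randomness from a `java.util.Random` that the test can supply. The function `runLegacyCode()` is a hypothetical stand-in, and the values of `m` and `p` are arbitrary.

```java
import java.util.Random;

class SampledGoldenMasterSketch {
    // Hypothetical stand-in for legacy code with a random element.
    static String runLegacyCode(Random random) {
        StringBuilder output = new StringBuilder();
        for (int turn = 0; turn < 5; turn++) {
            output.append("roll ").append(random.nextInt(6) + 1).append('\n');
        }
        return output.toString();
    }

    // Runs the code once per seed, with seeds of the form m + p * i, and
    // concatenates the outputs into one text suitable for a golden master.
    static String sample(int sampleSize) {
        final long m = 7919L;   // arbitrarily chosen
        final long p = 104729L; // arbitrarily chosen
        StringBuilder allOutput = new StringBuilder();
        for (int i = 0; i < sampleSize; i++) {
            long seed = m + p * i;
            allOutput.append("seed ").append(seed).append('\n');
            allOutput.append(runLegacyCode(new Random(seed)));
        }
        return allOutput.toString();
    }

    public static void main(String[] args) {
        // The same seeds produce the same text on every run, so the two samples match.
        System.out.println(sample(100).equals(sample(100)));
    }
}
```

Changing the argument to `sample()` is how I would trade speed for confidence: a bigger sample takes longer to run, but it notices more.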

The Techniques in Action

You can see an example of this technique in action by reading this code. If you’d like to see how I added this behavior to some legacy code, then start at this commit and follow the process step by step.

Although these techniques do not, on their own, guarantee success, when I combine Golden Master and Sampling, I can usually find a way to proceed safely. When I combine these with microcommitting6, I can proceed at an even quicker pace. They help me avoid the Catch-22 problem that arises from needing to refactor dangerously unsafely in order to be able to refactor safely and sensibly. Where might you use Golden Master and Sampling to help get your arms (at least a little) around your legacy code?

texttest.org. A library to help you write text-based tests, such as I would use to provide golden masters. Do not download this tool until you have written your own golden master at least once. That is an order. After that, use TextTest, because it really helps.

When diving into legacy code, I find it more important than ever to keep stuff out of my head. During the two hours it takes to safely refactor some large function, I’m probably going to spot 14 potentially-useful refactorings. I can’t chase every bunny, no matter how cute they are. I need to write those ideas down, get them out of my head, and get back to the tricky surgery at hand.↩

I see little point in spending energy generating a backlog knowing full well that I will never get around to doing about 80% of it. Who would volunteer to do that? (Ask your project manager if value-driven product development is right for em.)↩

I claim that “the agile approach” to risk management complements the typical approach to risk management of limiting the probability of failure in order to limit exposure. “The agile way”, if you will permit me to use this shorthand, involves limiting the cost of failure instead. Eventually I will replace this sentence with a link to an article that goes into this topic in more detail.↩

Marcia, the guru, looks at the output, pauses for a moment, then says, “Yep. That’s it.” If you want to re-run the test, then you need Marcia. That doesn’t seem to scale particularly well.↩

Really frequent committing, like after changing a single line of code. No, really. Eventually I will replace this sentence with a reference to an article that explores this topic in more detail.↩

Some time ago a client asked me some questions about spies and mocks. I wanted to share what we discussed with you.

So here’s the issue my mind has been toiling over…

The project I’m on is using Jasmine for BDD. Technically though, I think most people aren’t actually executing real TDD/BDD. As in, they’re not letting the tests guide their design, but instead are sticking on unit tests at the end, after writing most of the code… this is what their tests suggest, at least.

I see, in their tests, a lot of spies and mocks. This tends to worry me,… especially the spies.

I see a lot of it as unnecessary, and even damaging. They appear to be reducing the module that they’re testing to nothing more than a series of spies and mocks. The thing they’re testing seems to bear little resemblance to the real run-time module.

From my perspective, mocking is very good and even essential in the cases of module dependencies that:

Would add too many extraneous variables to the testing environment

Add lag to the tests

Are not semantically tied to the thing we’re testing

Examples I like are database mocks, ajax mocks etc.

But spies…. I’m very unsure of the value of spies.

The tests I’m reading are creating a series of spies… in fact, every method of the module is spied.. even private methods. The tests will call some public method (for example initiatePriceFeed()), and then assert success by ensuring that certain spied methods have been called. This just seems to be testing the implementation… not the actual exposed behavior, which is what I thought BDD/TDD was all about.

So finally, I have a few questions:

What is the best way to decide whether a spy is necessary?

Is it ever acceptable to test the implementation, instead of exposed behavior? (for example spying on private methods)

How do you decide what to mock and what not to?

I am sorry for the length of this email. There seem to be so many things I’d like to say and ask about TDD.

Note! In the Javascript world, it’s common to talk about “spies” rather than “stubs”. A spy and a stub do the same thing. They only differ in intent. In what follows, you can treat “spy” and “stub” as synonyms with, I think, no risk of confusion.

That sounds common. I started doing test-first programming, rather than test-driven development. I probably spent two years focusing on tests as tests before I felt comfortable letting my tests guide my design.

I think the people writing all these spies and mocks do this because it “seems right”. People they respect do it. They need to spend some time practising the technique, so they do it at every opportunity. This corresponds to the Novice/Advanced Beginner stages of the Dreyfus Model: either they just want to practise the technique (Novice), or they feel comfortable using spies/expectations1, and treat every opportunity as an equally appropriate time to use them (Advanced Beginner). Good news: this is a natural part of learning.

Where to go next? Find one example where a module would benefit from depending on data, rather than another module. I go back to the difference between Virtual Clock (spy on the clock so that you can make it return hardcoded times) and Instantaneous Request (pass timestamps directly, rather than the clock, pushing the clock up one level in the call stack). Perhaps this will help people start to question where they could change their approach.

IMPORTANT! Instantaneous Request isn’t necessarily always better than Virtual Clock. Which you choose is less important than the discussions and thoughts that lead you to the choice. Also: starting to use Instantaneous Request over Virtual Clock means that the programmer is evolving, not the code. What matters is not “use fewer spies”, but rather “don’t let spies become a Golden Hammer”. Spies still help, I use them frequently, and I wouldn’t give them up.
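To make the two options concrete, here is a small sketch in Java that uses `java.time.Clock` as the clock. The “overdue” rule and the 30-day grace period are hypothetical, chosen only to give the functions something to compute.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;

class OverdueSketch {
    static final Duration GRACE_PERIOD = Duration.ofDays(30); // hypothetical business rule

    // Virtual Clock: the function asks a clock for the time,
    // so a test hands it a clock that returns a hardcoded time.
    static boolean isOverdueUsingClock(Instant dueDate, Clock clock) {
        return clock.instant().isAfter(dueDate.plus(GRACE_PERIOD));
    }

    // Instantaneous Request: the caller passes the timestamp itself,
    // so a test needs no clock at all. The clock moves up one level in the call stack.
    static boolean isOverdueAsOf(Instant dueDate, Instant now) {
        return now.isAfter(dueDate.plus(GRACE_PERIOD));
    }

    public static void main(String[] args) {
        Instant dueDate = Instant.parse("2014-01-01T00:00:00Z");
        Instant twoMonthsLater = Instant.parse("2014-03-01T00:00:00Z");

        Clock stoppedClock = Clock.fixed(twoMonthsLater, ZoneOffset.UTC);
        System.out.println(isOverdueUsingClock(dueDate, stoppedClock));

        System.out.println(isOverdueAsOf(dueDate, twoMonthsLater));
    }
}
```

Both versions compute the same answer. They differ in what the test has to set up, and in who owns the decision about where “now” comes from.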

Regarding the value of spies, I don’t consider spies and expectations much different from one another. A spy is merely an expectation that doesn’t verify which methods were called—instead it waits for you to do that. In some tests, it’s not important to verify what happened, but rather to provide a hardcoded answer for any method our Subject uses. One rule of thumb: spies for queries, but expectations for actions. This works because we tend to want more flexibility in our queries, but more precision in the actions we invoke. Think of the difference between findAllOverdueBalances() and findAllBalances().selectBy("overdue")—it doesn’t matter how I find all the overdue balances. Spies simply make it easier to hardcode 0, 1, a few, or a large number of overdue balances, as each test needs.

So: spies for queries, but expectations for actions.
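Here is a sketch of that rule of thumb with hand-rolled test doubles in plain Java, so that it depends on no mocking library. The interfaces `BalanceCatalog` and `NoticeSender`, and the function `sendNotices()`, are hypothetical names that I invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

class OverdueNoticesSketch {
    interface BalanceCatalog {
        List<String> findAllOverdueBalances(); // a query: we only need an answer
    }

    interface NoticeSender {
        void sendOverdueNotice(String customer); // an action: we care that it happens
    }

    // The Subject: asks the query, then invokes the action once per result.
    static void sendNotices(BalanceCatalog catalog, NoticeSender sender) {
        for (String customer : catalog.findAllOverdueBalances()) {
            sender.sendOverdueNotice(customer);
        }
    }

    // Runs the Subject with a stubbed query and returns the actions that it invoked.
    static List<String> noticesSentFor(List<String> overdueCustomers) {
        // Spy/stub for the query: a hardcoded answer, with nothing to verify.
        BalanceCatalog catalog = () -> overdueCustomers;
        // Hand-rolled expectation for the action: record each call so the test can verify it.
        List<String> sent = new ArrayList<>();
        NoticeSender sender = sent::add;
        sendNotices(catalog, sender);
        return sent;
    }

    public static void main(String[] args) {
        if (!noticesSentFor(List.of()).isEmpty()) {
            throw new AssertionError("no overdue balances, so no notices");
        }
        if (!noticesSentFor(List.of("Alice", "Bob")).equals(List.of("Alice", "Bob"))) {
            throw new AssertionError("one notice per overdue balance");
        }
    }
}
```

Notice that the test doesn't care how the catalog finds the overdue balances. It cares precisely about which notices go out.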

Spy, then Spy, then Spy…

I understand your concern about series of spies, but let me check that I understand what you mean. When you say a series of spies, do you mean spying on A.getB() to return a spy B, whose B.getC() returns a spy C so that you can spy on C.theMethodIFindReallyInteresting()?

As for ensuring that spied methods have been called, those “spies” become expectations, and it can feel like those tests only check the implementation. That’s OK. If the implementation is so simple that we can check it with a simple test, then that’s good! It’s like double-entry book-keeping in accounting. If the tests are complicated and only check implementation, then that usually points to a missing abstraction, or at least, obsession with unnecessary details (could be a missing abstraction or could just be an unnecessarily complicated API). This last point is an example of not listening to what the tests are trying to tell you.

Most programmers eventually get the feeling that expectations mean “I’m just checking the implementation”. I had the same feeling once, so I asked myself, “assuming that this actually makes sense, what am I missing?” Well, if the interactions between objects were simpler, then this “checking the implementation” issue wouldn’t cause any real problems, would it? In fact, it would only clarify what we’re trying to do. Maybe, then, when checking the implementation feels weird, we could ask about potential underlying design problems, and if those problems disappeared, then we’d feel less weird. This is one of those cases.

Go to a few tests where you feel weird in this particular way, and look for duplication between the examples. You might be surprised!

When Is A Spy “Necessary”?

You ask about “the best way” to decide whether a spy is necessary (or perhaps appropriate). I don’t know of One Best Way. I use them, then let duplication drive changes. I especially look for unnecessary details duplicated in the tests. If I have to duplicate details in a handful of tests, just to be able to check some other part of the system, then perhaps I have two things in one place, and when I separate them, the corresponding spies become much simpler, and sometimes I can replace a spy with data (from Virtual Clock to Instantaneous Request).
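
As a rough illustration of replacing a spy with data, here is a sketch in Python that contrasts the two styles on an invented example. The functions `is_overdue_with_clock` and `is_overdue_at` and the `FrozenClock` class are hypothetical. They only show the general shape of moving from Virtual Clock to Instantaneous Request and are not taken from the articles referenced here.

```python
from datetime import datetime, timedelta


# Virtual Clock style: the Subject asks a clock collaborator for the time,
# so every test has to supply a clock double.
def is_overdue_with_clock(due_date, clock):
    return clock.now() > due_date


# Instantaneous Request style: the caller passes the moment in as plain data,
# so tests need no double at all.
def is_overdue_at(due_date, now):
    return now > due_date


class FrozenClock:
    """Hand-rolled clock double that always reports the same instant."""

    def __init__(self, instant):
        self._instant = instant

    def now(self):
        return self._instant


due = datetime(2024, 1, 31)
day_after = due + timedelta(days=1)

overdue_via_clock = is_overdue_with_clock(due, FrozenClock(day_after))
overdue_via_data = is_overdue_at(due, day_after)
```

Both calls answer the same question. In the second, the detail that every clock-based test had to duplicate (building and configuring a clock double) disappears into an ordinary argument.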

Is It Ever Acceptable…?

You also ask whether it is ever acceptable to test the implementation instead of the behavior. “Is it ever acceptable…?” questions almost always have the answer “yes”, because we can always find a situation in which something becomes acceptable. On the other hand, I don’t typically spy on private methods. If I need to know that level of detail in a test, then the test is trying to tell me that A cares too much about the internals of B. First, I try to remove unnecessary details from A’s tests. Next, I look for duplication in A’s tests. Especially if I spy on the same functions in the same sequence, that duplication points to a missing abstraction C.

So When to Mock?

I have two answers to this question. First, when do I use spies/expectations compared to simply using “the real thing”? I like to program to interfaces (or protocols, depending on the language) and I like to clarify the contracts of those interfaces, something that expectations help me do effectively. To learn more about this, read the articles I list at the end related to contract tests. Especially read “When Is It Safe to Introduce Test Doubles?”.

Finally, when I’m not sure whether to use a spy or an expectation, I go back to the rule of thumb: spy on queries, but expect (mock) actions.

References

c2.com, “Virtual Clock”. An overview of the Virtual Clock testing pattern, with further links.

J. B. Rainsberger, “Beyond Mock Objects”. I use test doubles (mock objects) extensively in my designs and they help me clarify the contracts between components. Even so, using test doubles mindlessly can interfere with seeing further simplifications in our design.

I apologise again for not having collected my thoughts about collaboration and contract tests into a single work. I need to find the time and energy (simultaneously) to do that. In the meantime, I have a few articles on the topic:

In order to avoid confusion with the generic concepts of “mock objects” (better called “test doubles”), I use the term expectations to refer to what many people consider a mock: function foo() should be called with arguments 1, 2, 3.