Learn More from Tests That Stray off the Happy Path

Unit tests exercise various paths through your codebase. Some are happy paths where everything you expect goes right. These tests are boring. The interesting tests are the ones where your code goes hurtling off the happy path. The trick is to capture the diversity of a multitude of unhappy paths without needlessly duplicating unit tests. Here's how you can improve the quality of your unit testing, ship code faster, and fix bugs more effectively.

Unit tests exercise various paths through your codebase. Some are happy paths where everything you expect goes right. These tests are boring.

The interesting tests are the ones where your code goes hurtling off the happy path. As you'd expect, they are all different, but they are much more interesting.

The trick is to capture the diversity of a multitude of unhappy paths without needlessly duplicating unit tests. It seems like this should be easy, but duplication is a real risk, particularly when the system is large or has been maintained over the course of many years.

I worked with a software system years back that had more than ten thousand test cases. Someone before me had identified every possible variation and had programmatically generated test cases to exercise each one. It was a nightmare. Everyone involved in the testing program complained about all the duplicates. We all believed we could cut the number of test cases by one or two orders of magnitude, but we did not know how.

With so many test cases, it was impossible to analyze every test thoroughly. We did not know which outputs were correct and which were not. We had to fall back on much weaker criteria: Did any test case kill the system? Did any throw an unexpected exception? Someone before us had manually verified a handful of cases, but we didn't know which cases those were.

For the most part we had to settle for comparing outputs against the last version. We didn't know if differences stemmed from errors introduced by the new version, repairs to previously undetected errors, or mere random variation.

It was an unhappy situation, and the regression testing problems ultimately motivated the company to abandon the system.

All the tests had been devised after the system had been written. No thought was put into what each test case was trying to prove.

On a later project, we did things a little better. We'd all drunk the test-driven development (TDD) Kool-Aid. Each unit test was intended to prove some code did what we intended it to. When all these tests ran green, we figured we'd implemented the spec as we understood it.

As bugs surfaced later, we wrote more unit tests to reproduce them and assert correct behavior.

Over time, this proved troublesome for a couple of reasons. First, the tests exercised the code we were interested in as well as irrelevant code. For instance, my machine's graphics drivers were configured slightly differently from a colleague's. The tests that ran 100 percent green for him ran 30 percent red for me, and vice versa.

We did not realize we should mock the graphics drivers, so our tests generated outputs with meaningless differences. Our job wasn't to test graphics drivers or how they were configured, but our tests didn't know that. Exercising the drivers also made our unit tests run more slowly.

This is how I learned that every unit test should have only one reason to fail. That one reason should be well understood by whoever sees it. Our tests proliferated as we got bug reports and wrote tests to verify fixes.
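To make the idea concrete, here is a minimal sketch in Python. The `render` function and its driver are invented for illustration; the point is that `unittest.mock` stands in for the graphics driver, so the test can fail only if `render` itself is wrong:

```python
# A sketch of "one reason to fail": mock out the environment-dependent
# driver so the test exercises only the code under test.
# render() and get_scale() are hypothetical names for illustration.
from unittest import mock

def render(scene, driver):
    # Hypothetical unit under test: asks the driver for a scale factor
    # and applies it to every point in the scene.
    scale = driver.get_scale()
    return [(x * scale, y * scale) for (x, y) in scene]

def test_render_scales_points():
    # The mocked driver always answers the same way, regardless of how
    # this machine's real drivers happen to be configured.
    driver = mock.Mock()
    driver.get_scale.return_value = 2
    assert render([(1, 2), (3, 4)], driver) == [(2, 4), (6, 8)]

test_render_scales_points()
```

If this test goes red, there is exactly one place to look, and it runs fast because no real driver is involved.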

We got so many unit tests that we could no longer reason about them. They were just red and green dots. With fewer unit tests, each failure could tell us more.

After all, we learn much more from a failing unit test than from a green one. A green unit test proves the code did one thing right this time, but it can't prove that nothing is wrong. A red unit test tells us something is wrong, and it says it is wrong right here. In a perfect world, every "right here" has one and only one unit test covering it.

The problem is that unit tests may be only slight variations of one another. Maybe someone on your project wrote a test a few years ago, and the code found a sneaky way to pass that test while still being wrong. Because the test is just a green dot, we have no reason to look at it.

This calls for something like overcoverage analysis. An integrated development environment can tell you which lines of code were never executed during a test run. We might benefit from the opposite: If we could see those lines of code that are executed over and over again by several tests, we might consolidate redundant tests. If it is hard to spot duplicated unit tests in a suite of a thousand tests, it's next to impossible to spot them in that suite of ten thousand tests that I started with.
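No IDE I know of offers this view out of the box, but per-test coverage data (for example, exported from a coverage tool) would let you sketch it yourself. Here, the test names and line numbers are made up; the idea is to flag tests whose covered lines are a subset of another test's coverage:

```python
# A rough sketch of overcoverage analysis: given a map from test name
# to the set of lines it executed, flag tests whose coverage is wholly
# contained in some other test's coverage -- candidates for review.
coverage = {  # hypothetical data for illustration
    "test_happy_path":      {10, 11, 12, 13, 20},
    "test_happy_duplicate": {10, 11, 12, 13},       # subset: redundant?
    "test_error_branch":    {10, 11, 30, 31},
}

def redundant_tests(coverage):
    return sorted(
        small for small, lines in coverage.items()
        for big, other in coverage.items()
        if small != big and lines <= other   # <= is set containment
    )

print(redundant_tests(coverage))  # ['test_happy_duplicate']
```

Subset coverage doesn't prove a test is a duplicate (the assertions may differ), but it tells you which tests deserve a closer look.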

But we cannot be too hasty about this. When our code gets knocked off the happy path, there's a multitude of rabbit holes the unhappy-path code can go down. Testing each rabbit hole requires the code to retrace the happy steps again and again before stepping off someplace different or going in a different direction.

If tracing execution paths is too hard to use for identifying duplicate tests, perhaps we can look at the data our tests incorporate. How much complexity is there in the input data?

Two data sets can be different yet, in some sense, equivalent. Suppose two data sets differ only in that one has "Smith" where the other has "Jones," and suppose the original data set is used by a set of unit tests. When every test in the set passes and fails identically for either data set, we can say the data sets are equivalent in this sense.

Further suppose that other changes that simplify a data set also preserve equivalence. This defines an equivalence class of data sets. If you imagine all possible equivalent data sets, there will be at least one that is simplest.

What I have in mind is a process to replace all the input data sets with simpler equivalents. If there are unnecessary records, remove them. If there are shorter strings, shorten them. If there are smaller numbers, make them smaller. Any change that makes the data set simpler while preserving equivalence is allowed.

This is what we should have done with the ten-thousand-test nightmare I mentioned. Those tests differed only in their input data sets. Had we reduced each input to its simplest form, we could then have removed the tests whose data sets became identical.

On a third project, I had a decade's worth of bug reports that had been converted into unit tests. Customers sent in bug reports, and each bug report incorporated data to reproduce the issue. We couldn't incorporate that data directly when it held personal and confidential information. The same "equivalence" operation can replace personal and confidential information with random values, yielding data that is suitable for incorporating into a bug report.
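A minimal sketch of that scrubbing step, assuming flat records with known identifying fields (the field names and record shape are invented; a real version would re-run the suite afterward to confirm the scrubbed data still reproduces the bug):

```python
# A sketch of anonymizing reproduction data before attaching it to a
# bug report: replace personally identifying fields with random strings
# of the same length, leaving everything else untouched.
import random
import string

PII_FIELDS = {"name", "email", "address"}   # assumed field names

def anonymize(record):
    def scrub(value):
        # Same length, random lowercase letters: preserves the shape of
        # the data without leaking the original identity.
        return "".join(random.choices(string.ascii_lowercase, k=len(value)))
    return {k: scrub(v) if k in PII_FIELDS else v
            for k, v in record.items()}

original = {"name": "Smith", "email": "smith@example.com", "balance": "42"}
scrubbed = anonymize(original)
print(scrubbed["balance"])   # "42" -- non-PII fields are untouched
```

If the scrubbed record still passes and fails the same tests as the original, it is equivalent in the earlier sense and safe to share.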

You will get deeper insights into your code from failing unit tests than from successes. Our unit test suites need to be curated to retain just the tests that capture the gist of each problem with maximal efficiency. That way, you can improve the quality of your unit testing, ship code faster, and fix bugs more effectively.

About the author

Writing software well is a real joy. I am interested in how software engineering can become a serious professional discipline and in the methods for most effectively organizing the work. I successfully developed software using the waterfall model, but I got over that foolishness when I learned I'm not God. That started me doing prototyping. Now I think that "wicked" problems mandate an agile approach, but not cowboy coding. The approach is mandated by the problem: if it is completely understood, a big-design-up-front waterfall is optimal; otherwise, an emergent design via agile methods is indicated. I am interested in the concept of "technical debt" as a metaphor for communicating the trouble you borrow when you cut corners. The catastrophic behavior of avalanches is another apt metaphor for troubled codebases.