I have to admit that, historically, I am not a big fan of unit testing. It is not that I dislike testing as such: it is just that in my experience, other kinds of automated testing (such as simulation testing and replay testing) happen to be much more useful in making programs robust, so that IMNSHO at most 20% of overall testing efforts should go towards unit testing.1

1 well, this holds only for statically-typed programming languages; it is different for dynamically-typed ones, but that is a separate and quite long topic which deserves its own article

Re-Using Research Data to Come Up with a Completely Different Interpretation

Recently, I ran into an interesting piece of research [Dietrich]; in the article, the author takes 100 open-source C# projects and runs certain statistics over them. For example, there is a graph which shows “dependency of percentage-of-public-methods on the percentage-of-methods-being-unit-test-methods”. And a graph of “dependency of cyclomatic complexity on the percentage-of-methods-being-unit-test-methods”. And so on and so forth.

When looking at those graphs, I got a very clear feeling of what they IMNSHO really indicate; however, my interpretations were very different from the author’s interpretations of the very same data. So, what I am going to do is to re-use his data – while providing another (and IMNSHO much better 😉 ) interpretation.

The Data

Most of the data in [Dietrich] relate to what are often considered metrics of code quality, in particular:2

percentage of public methods (generally, the more encapsulation we have – the better)

average method length (LOC). It is often argued that methods should be short.

cyclomatic complexity (CC). For an explanation of cyclomatic complexity, see, for example, [Dietrich.Cyclomatic].

number of overloads (too many overloads tend to decrease readability; while it is arguable, let’s keep it to see if it changes anything)
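To make the CC metric above a bit more concrete, here is a rough sketch of my own (NOT the tooling used in [Dietrich]): cyclomatic complexity is, roughly, one plus the number of branch points in a method, which can be approximated by counting branching nodes in a syntax tree.

```python
# Rough approximation of cyclomatic complexity (illustration only,
# not the analyzer used in [Dietrich]): CC ~= branch points + 1.
import ast

BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)

def cyclomatic_complexity(source: str) -> int:
    """Count 1 + one for each branching construct in the parsed source."""
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, BRANCH_NODES) for node in ast.walk(tree))

snippet = """
def classify(x):
    if x < 0:          # branch 1
        return "neg"
    for ch in str(x):  # branch 2
        if ch == "0":  # branch 3
            return "has zero"
    return "pos"
"""
print(cyclomatic_complexity(snippet))  # 4: three branches + 1
```

The rule of thumb behind the metric: each extra branch point is one more independent path a reader (and a test) has to keep in mind.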

I took these four graphs from [Dietrich], normalized them (so that the integral under the graph is 1 for each data series), flipped them where necessary so that for all of them “higher is better”, and centered them around zero so that zero corresponds to the average value of each metric. Here is what I got as a result:

On this graph, just as in [Dietrich], the horizontal axis represents the percentage of unit testing methods to the overall number of methods. And on the vertical axis, there are different metrics of what-is-often-considered-beneficial-to-code-quality.
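For the record, the normalization steps above can be sketched as follows (this is my reconstruction of the procedure; the toy numbers are made up, and the exact details may differ from what I actually ran):

```python
def normalize_series(ys, higher_is_better=True):
    """Bring one metric curve to the common scale used on the graph."""
    if not higher_is_better:
        top = max(ys)
        ys = [top - y for y in ys]    # flip so that higher is better
    area = sum(ys)                    # integral (bucket width taken as 1)
    ys = [y / area for y in ys]       # unit area under the curve
    mean = sum(ys) / len(ys)
    return [y - mean for y in ys]     # zero = the metric's average value

# toy numbers (made up): average method length per unit-testing bucket;
# method length is a "lower is better" metric, hence the flip
curve = normalize_series([12.0, 8.0, 9.0, 14.0, 18.0], higher_is_better=False)
print(abs(sum(curve)) < 1e-12)  # True: centered, values sum to ~zero
```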

And as we can see, 3 out of 4 code quality graphs – as well as the total one – look pretty much the same:

Code quality tends to be Quite Bad(tm) when there is no unit testing at all, reaches its maximum when unit testing methods make up somewhere between 0 and 10% of all methods, and degrades afterwards.

2 one graph which was present in [Dietrich] and which I omitted is ‘Average parameters’: first, there is no simple “best number of method parameters” (having all methods take zero parameters means that way too much is stored in the object itself, causing all kinds of trouble), and second, the observed difference between 0.8 and 1.4 parameters per function is not significant for readability anyway

Conclusions

Amicus Plato, sed magis amica veritas.

Plato is my friend, but truth is a better friend

— attributed to Aristotle —

First, I have to say that I am well aware of all the hits I am going to take from hardcore fans of unit testing, including TDD fans. However, being hit hard for saying inconvenient truths is in my job description, so here goes my self-nomination for my cup of poison hemlock:

It is quite easy to get past the optimum amount of unit testing for your project.3 And whenever your unit testing starts to affect your code in a negative way – you should stop.4

That being said:

as with any other statistical analysis, there is no guarantee that correlation really means causation.

this whole set of results is a relatively small sample, so things might still change:

in particular, specific numbers are not to be relied on.

OTOH, the results above are consistent with my own observations in million-LOC real-world projects.

as observed above, some amount of unit testing is useful (which is also consistent with my own observations)

It is just that unit testing is only one of the ways of automated testing – and shouldn’t be seen as The Thing(tm) (or even worse – as The Only Thing(tm)) at least for statically-typed languages.

Observations above are applicable only to statically-typed programming languages. Dynamically-typed languages are very different in this regard (and tend to require much more unit testing to be maintainable and refactorable).

Comments

On the contrary, a small amount of unit testing seems to be so much _better_ than no unit testing – as shown on the graphs above. In fact, as stated above, code quality reaches its maximum at some rather small amount of unit testing.

A small amount of unit testing likely correlates with only minimal happy-path tests. The more tests you write, the more edge cases you will cover along the way. It’s an illusion of testing.

What I want to know is: how do more unit tests negatively impact code quality? Those “quality” metrics are mostly about readability, and while that may mean the code is harder to follow, quality should be measured in escaped bugs. As programmers, we can learn to read code better. Having more tests, and more of the right tests, will lead to higher-quality software. Users never see how elegant the code is anyway.

From my not insubstantial experience with serious projects, 80% of “escaped bugs” can be observed even with 100% code coverage by unit testing (most escaped bugs happen to be about “unusual sequences of otherwise valid events”). Which means that _pretty_much_any_unit_testing_ is an illusion (and a very expensive one TBH). The whole is so much larger than the sum of the units that it is not even funny. As a result – MOST of our automated testing efforts should be redirected away from unit testing (to things such as simulation testing, replay testing, etc.) – they (a) detect more bugs (they operate over that 80% of the bug base, not over the 20% which is easily detected anyway), and (b) do not decrease code quality. NB: as noted in OP, this applies only to statically-typed languages (for dynamically-typed ones, unit testing is a prerequisite to refactoring, which is very unfortunate TBH).

> Users never see how elegant the code is anyways.

Sure, but readable code means that whoever makes a change is less likely to misread the code/intention/… => hence better code maintainability => hence fewer bugs (especially subtle ones, like those 80% which are not caught by code-coverage testing) are introduced during code maintenance.

> quality should be measured in escaped bugs

Even better – in terms of MTBF (though, say, Herbert Matthews argues that it should be downtime, so MTTR is also in the picture; but from our current perspective it doesn’t matter too much). And in the real world I’ve seen a very strong correlation between code readability and MTBF (I have seen 3-5x better-than-average MTBF for more readable code), so the “common wisdom” that “code quality” does correlate with “readability” is not far from the truth.

P.S. “I am well aware of all the hits I am going to take from hardcore fans of unit testing” does stand too 🙂

100% code coverage means your tests hit every line of code. Escaped bugs usually deal with non-happy-path sequences or data, and since negative testing is not always done, that’s where we see errors even with 100% coverage… or we just didn’t account for all valid sequences to begin with. This is why I said “more of the right tests” are needed. Having more tests just to have more tests is pointless and takes away from development time.

“or data” – yes, this is the usual culprit. However, testing for _all_the_possible_combinations_of_data_ is not feasible => hence, all unit testing is an illusion. Note that simulation testing and replay-testing-with-real-world-data-from-100K-users are different; I can assure you that 100K users each pressing a button once per second can produce, in just a few hours, many more “right tests” than your whole team can devise in years. A rule of thumb from a multi-billion-dollar online business processing several billion user-interactions-changing-server-side-state per day: “if a new release doesn’t crash within the first 4 hours after deployment – it won’t crash at all” ;-(.

NB: of course, I am speaking about the usual business-level stuff, NOT about life-critical stuff, where the whole thing has to be very different – but where, once again, unit testing is unfortunately insufficient (see, for example, the Therac-25 disaster, whose bug could not be found by unit testing).

Just want to comment on the rule of thumb. We recently saw quite a few incidents that happened in the middle of a release cycle. First, a serious issue was caused by a very rare combination of events where an operator had to manually send a command, and it triggered a hidden bug from a year ago (this might have been avoided if a code coverage tool had been used for testing). The second one was due to a slow memory leak. However, I do agree that probably 80% of production issues can be detected within a few hours after release.

Well, at least slow memory leaks shouldn’t lead to a system-wide crash (monitoring should have revealed it, causing a bugfix, an early release, and a restart – and not a crash as such). As for manually sent commands causing a rare bug crashing the whole system – well, this kind of stuff does happen 🙁 , though fortunately very rarely.

Your PS is accurate. If you do not follow the principles of unit testing (i.e. talking about sequences), calling out deficiencies in “too much” unit testing is not reasonable.

Sequences can be a product-under-test code smell.

The reasons I do not appreciate this article come down to conflating testing ideas within the scope of a unit-testing discussion. Code coverage/static analysis is the primary example. Use any tool wrong, and you will hit a point of diminishing returns quickly.

The data also makes a broad generalization from a sample of the unit-testing population. Statistically, this is dangerous with respect to having a sound and valid argument.

Unit testing should be based on atomic test areas and equivalence partitioning. This post does not qualify or quantify “too much” unit testing separate from areas that are not directly in scope to this topic.

If the code being tested is poorly implemented but functionally correct, you should expect unit-testing breadth to be impacted and less effective.

I appreciate the effort of this article, but other than the sources cited, the details are anecdotal or rhetorical. I feel it can be counterproductive to helping people test effectively.

> If you do not follow the principles of unit testing (i.e. talking about sequences), calling out deficiencies in “too much” unit testing is not reasonable.

To be very clear: I am not speaking about sequences in unit testing; I am saying that unit testing is inherently extremely bad at finding bugs caused by unusual event sequences (or, more formally, by the program state being unexpected – and testing for all possible states is infeasible even for a measly 64-bit state, let alone megabytes and gigabytes of state).
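To put a back-of-the-envelope number on the “measly 64-bit state” remark (the billion-checks-per-second rate is my own optimistic assumption):

```python
# Exhaustively enumerating a 64-bit state space at an (optimistic)
# billion states per second still takes centuries.
states = 2 ** 64
per_second = 10 ** 9                 # assumed checking rate
seconds_per_year = 365 * 24 * 3600   # ignoring leap years
years = states / (per_second * seconds_per_year)
print(round(years))  # 585
```

And real program states are vastly larger than 64 bits, so the gap only gets worse from here.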

> The data also makes a broad generalization on a sample of the unit testing population. Statistically this is dangerous with respect to having a sound and valid argument.

So does _absolutely_any_ research based on statistics. Still, throwing away _all_ statistics-based research is obviously detrimental. Give me better research – _then_ we’ll be speaking.

> This post does not qualify or quantify “too much” unit testing separate from areas that are not directly in scope to this topic.

Yep; the point is that there _exists_ such a thing as too much unit testing (you know, similar to those theorems in mathematical analysis which prove the _existence_ of an extremum between two zeros under certain conditions, without attempting to say where it will be observed). IMNSHO, this statement is fairly obvious (to anybody but people selling unit testing frameworks and those to whom they succeeded in selling one), but this fairly obvious statement is exactly what I am being hit hard for :-). As for answering _when_ this “too much” happens – that is the next step, which I am not yet comfortable justifying (well, from my experience _at_most_ 20% of automated testing efforts should go into unit testing, but as of now, I cannot really provide evidence beyond several tons of real-world anecdotes 🙁 )…

Your argument is poorly scoped and all over the place. When making assertions, generalizing, and avoiding discrete data, you are just pushing your opinion.

There are logical fallacies throughout this entire document, and in the comments; using a rebuttal of “seems fairly obvious” and introducing the topic of people selling frameworks is ridiculous.

If your argument is that unit testing will not find issues where integration and component/system testing would, any testing book worth its salt already says the same. If this is the case, you should be calling out when to leverage specific types of testing.

The point is, if an issue exists where unit tests are scoped to catch the problem, the downstream testing would be impacted at an exponentially higher cost with less root-cause visibility.

Bringing static analysis/LOC of the product under test into this means what, with respect to the scope of “too much” unit testing? This is a separate problem statement.

The problem with leveraging only real-world anecdotal points is that the situation and data are subjective. Anyone could provide legitimate counter-anecdotes from their own real-world experience.

> If your argument is unit testing will not find issues where integration and component/system testing would

This is an argument in the comments, not in the OP. And the argument in the OP is that such a thing as “too much unit testing” does exist. If you feel it doesn’t – this is clearly up to you, but TBH you’re clearly not the target audience of this post, so I don’t really care much ;-).

This article is based on sloppy and heavily criticized research. The author of this research then posted two follow-up articles, and the last one shows exactly the opposite of what the first one showed. So after some fixes in the research, it now shows that the more unit tests – the better the code quality, except for LOC per method. But I think even the latest version of this research is invalid, because it measures code quality INCLUDING test code, which tends to have longer methods and be very different from production code.

Thanks; given time, I might be able to analyse this dataset too. Still, I have to note that (a) the processing in the second post still makes exactly the same mistake as the first one – it completely groundlessly _assumes_ that the dependency is linear (which is very arguable; at least it is fairly obvious that this hypothesis isn’t much better than the “there is no dependency at all” hypothesis). Also – (b) unlike the 100 codebases in the OP (which were selected by a 3rd party), this second post mentions only “a handful of ‘awesome codebase’ lists, like this one” – which is not a good starting point for reproducible research – and being reproducible is a _firm_prerequisite_ for research which can be taken seriously. Or more bluntly – as this second set is not well-defined, it can be easily brushed off as an attempt to select the data to bring the results into agreement with the pre-stated hypotheses of the author (which hypotheses BTW explicitly favoured unlimited unit testing).

EDIT: In addition, filtering out 1/3rd of those 750 allegedly “awesome” codebases because they “don’t compile or otherwise have trouble in the automated process” doesn’t really add credibility to this 2nd data set either 🙁 . And given that one of the original metrics (% of public methods) has just disappeared (why? surely not because it didn’t favour unit testing even after the “adjustments” made?) – I am really NOT comfortable with using this 2nd set for analysis 🙁 .

I think that it skews the whole analysis. For example, let’s take a look at “parameters count” – unit-test methods usually have 0 parameters, so the more unit tests you have, the smaller the average parameter count you get in total, even if you don’t change the actual production code – which says nothing about quality.

Or take the even more important metric, “LOC per method”, which is greatly skewed because the analysis also counts the method length of unit tests, and they are usually longer than production code; so again, if you just add unit tests without changing production code, you get a higher average method length.

You do have a valid point, and yes – it is going to be skewed due to these reasons, but once again – I don’t have any better data, and my gut feeling is that skewing due to these reasons won’t be bad enough to affect the whole picture.

Perhaps one of the underlying causes of the decreasing quality of code under more unit testing has more to do with team development practices being out of alignment, and without a more rigorous red-green-refactor cycle – especially missing the refactor part. This is pure conjecture, but I suspect that code under a non-trivial amount of unit testing that isn’t actually well-factored would tend to lead to worse unit tests (note that the badness metrics for unit tests are in the same bucket).

Teams that are developing with unit tests, but not test-first-and-refactor-as-soon-as-either-test-or-production-code-gets-smelly are going to be writing legacy code for both production and test code that builds more and more inertia against being refactored and improved.

Of course. As one example, there is a possibility that ~10% of unit testing merely corresponds to those projects which care about code quality at all 😉 (which would mean that even 10% is too much unit testing 😉 ). And of course, it can be pretty much anything else too.

Still, this is a fundamental restriction of any statistical analysis (let’s not start repeating the correlation-vs-causation wars here 😉 ), and there is a reasonably good chance it can be causation too…

EDIT: I added a bullet point to Conclusions section to make it clear, thanks!

It’s strange to use “percentage of methods covered by a unit test”, since what matters more is code coverage.

With this metric, yes, you will have more public methods (often for the wrong reason: you want to test a method, so you make it public!), fewer methods (which will increase LOC and CC; they are linked anyway), and lots of overloads will increase the number of methods covered by unit testing.

Seeking code coverage with fewer test cases (but parameterized to achieve coverage), and focusing unit testing on small pieces of functionality rather than on methods, will lead to more efficient and maintainable unit testing – and those metrics cannot give us any hint about it.
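For instance, a parameterized (table-driven) test in this spirit might look like the following toy sketch (the `clamp` function and the cases are made up for illustration): one test body, a case table chosen by equivalence partitioning, instead of one hand-written test method per case.

```python
def clamp(value, lo, hi):
    """Toy function under test (made up for this example)."""
    return max(lo, min(value, hi))

# One parameterized check instead of three near-identical test methods;
# the cases cover the three equivalence classes: in range, below, above.
CASES = [
    (5, 0, 10, 5),    # in range: returned unchanged
    (-3, 0, 10, 0),   # clamped to lower bound
    (42, 0, 10, 10),  # clamped to upper bound
]

def test_clamp():
    for value, lo, hi, expected in CASES:
        assert clamp(value, lo, hi) == expected, (value, lo, hi)

test_clamp()
print("all cases pass")
```

With pytest one would express the same table via `@pytest.mark.parametrize`; the point is that coverage comes from the case table, not from multiplying test methods.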

I almost agree with you: unit testing shouldn’t test methods; that quickly becomes counterproductive.

FWIW, the question raised is not about “how efficient unit testing is” (which is a completely different story), but rather “how having LOTS of unit testing tends to affect code quality” (or, for the purposes of this article, code readability). A LOT of industry people out there, including myself, are clearly NOT ready to sacrifice their code readability in the name of unit testing (having otherwise unnecessary layers of abstraction just to enable mocking – gimme a break…); from my experience with serious projects, at least for statically typed languages, the KISS principle is a significantly stronger driver for reducing the # of production bugs than _any_ kind of testing, especially unit testing…

I certainly agree that testing should be done at MUCH coarser granularity than testing single methods; in particular, if “unit of testing” ~= “module” (for example, for (Re)Actor-based systems – the whole (Re)Actor), I certainly won’t argue against this kind of testing (and also it will NOT cause trouble with code quality), but unfortunately for 99% of people “unit of testing” ~= “method” 🙁 .

For the first point, I refer to the sentence: “On this graph, just as in [Dietrich], horizontal axis represents the percentage of unit testing methods to overall number of methods”

And my entire response explains why the observed metrics are correlated with this one, in such a way that the results are not surprising.

That’s why I don’t think that having a lot of unit testing affects code quality by itself; rather, focusing on having a lot of unit tests over having good test coverage will tend to affect code quality (because you will need to use reflection techniques, or make private stuff public, or create useless tests for small and obvious methods, and nothing will prevent you from writing an awfully complicated method, since one test will enforce your goal)

Well, “percentage of unit testing methods to overall number of methods” is actually different from “percentage of methods covered by a unit test”. But in any case, the specific numbers don’t really matter – what does matter is that “0 unit testing is bad” and “too much unit testing is bad too”, and I don’t see this changing because of choosing a subtly different metric.

> I don’t think that having a lot of unit testing affect code quality by itself

First, I agree that (mocking aside) it shouldn’t – but it often does :-(.

Second, and more importantly, if you have to change your code to enable mocking – it HAS to affect your code (and it happens to affect it in the negative direction 🙁 ). BTW, for .NET another interesting metric would be “percentage of virtual methods”, which tends to skyrocket for app-level code which accommodates typical .NET mocking frameworks 🙁 (and which is yet another thing that hurts readability 🙁 – at least ‘virtual’ is a piece of code self-documentation, which loses any meaning in the case of mocking).