Tuesday, June 5, 2018

The Last Measurable Ounce of Quality Can Be Expensive

A Brief Lexicon of the Software Quality Landscape

The quality of a software product is a multi-dimensional measurement, spanning such
things as functionality, correctness, performance, documentation, ease of use,
flexibility, maintainability, and many others. Many of these qualities are
difficult to measure, are difficult to see, and hence are difficult to manage.
The result is that they are ignored by all but the most enlightened in
management.

The topic of interest for this screed is going to be correctness. This is not
going to concern correctness in the end-user sense, meaning the correct meeting
of requirements. It is going to mean "is the code doing what we think it
should be doing." Since we don't generally have provably correct programs,
it is a matter of convincing ourselves, through a lack of evidence to the
contrary, that our programs are working as we'd like. This precarious situation
is perfectly captured by Edsger Dijkstra's observation that program testing can
be used to show the presence of bugs, but never to show their absence.

So we have several levels at which we can convince ourselves of correctness.
They are, from most detailed to most abstract, unit testing, integration
testing, functional testing, and system testing. Note that there is no
industry-wide agreement on these exact terms, but the general concepts are
recognized.

In typical object-oriented designs, unit testing involves isolating a given
class, driving its state and/or behavior, and verifying that we see what we
expect. Integration testing is a layer above that, where we use multiple
classes in the tests. Functional testing is yet above that, where we try to
deploy our programs in the natural components that they would inhabit in
production, like a server or a process. Finally, system testing covers testing
in the full production-like environment.

Unit Testing

Our focus will be on the lowest level, namely unit testing. The expectation is
that unit tests are both numerous (think many hundreds or even thousands) and
extremely fast (think milliseconds). To be effective, these tests should be run
on every single compile on every developer's machine across the organization.
The goal is that the unit tests precisely capture the design intent behind the
implementation of the class code, and that any violation of that intent
results in immediate feedback to the developer making code changes.

I'd like to tell you that every developer is doggedly focused on both the
quality of the production logic and the thoroughness of the unit tests that
back that logic, but I can't. Through a combination of poor training, lack of
emphasis at the management level, and just plain laziness, developers produce
tests that range from great all the way down to downright destructive (more on
that in another blog entry). One of the easiest ways to try to externally track
this testing is through code coverage.

Code Coverage

Code coverage is a set of metrics that can give developers and other project
stakeholders a sense of how much of the production logic has been tested by the
unit tests. The simplest metric is "covered lines of code", also known as
line coverage. This is usually a percentage, and it means that if a class has 50
lines of code in it, and it has 60% line coverage, then 30 lines of that
production logic are executed as part of the running of the unit tests for that
class. There are other coverage metrics that can help you gauge the goodness of
your tests, like branch coverage, class coverage, and method coverage. But
here, we will focus on line coverage since that is the most widely used.
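
The arithmetic behind the percentage is simple enough to sketch. The helper
below is hypothetical and is not part of any coverage tool; it just reproduces
the 30-of-50-lines example from the paragraph above.

```java
// Hypothetical helper for illustration; real coverage tools compute this for you.
final class LineCoverage {
    /** Returns covered lines as a percentage of total executable lines. */
    static double percent(int coveredLines, int totalLines) {
        if (totalLines <= 0) {
            throw new IllegalArgumentException("totalLines must be positive");
        }
        if (coveredLines < 0 || coveredLines > totalLines) {
            throw new IllegalArgumentException("coveredLines out of range");
        }
        return 100.0 * coveredLines / totalLines;
    }

    public static void main(String[] args) {
        // 30 of the class's 50 lines were executed by the unit tests.
        System.out.println(percent(30, 50)); // prints 60.0
    }
}
```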

The general, common sense assumption is that "more is better", so
misguided management and deranged architects insist on 100% code coverage,
thinking that would give the maximum confidence that the quality of the code is
high. If we had an infinite amount of time and money to spend on projects, this
notion could represent the optimum. Since this luxury has never been available
in the last 4 billion years, we have to spend our money wisely. And this
changes things drastically.

The truth is that it might cost M dollars+time to achieve say 80% line
coverage, but it might take M *more* dollars+time to get that last 20%. Per
percentage point, that makes the last 20% four times as expensive as the first
80%. In some cases, getting the last few percentage points might be extremely
expensive. There are several reasons for this non-linear cost.

First, production logic should be tested through its public interface where
possible rather than through a protected or private interface. It can be
laborious to construct the conditions necessary to hit a line of code buried in
try/catches and conditional logic behind public interfaces. This cost can be
lowered by refactoring the code towards better testability, but this is a
continuous struggle as new code is produced. There is a truism among veteran
developers that increasing the testability of the production logic improves its
design.
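
As a rough sketch of what refactoring towards testability can look like,
consider a catch block that only runs when a read fails. If the read is
hard-wired to the real disk, a test has to make the disk fail to cover it. If
the read sits behind a small interface, the test can hand in a source that
fails on demand. ConfigSource and ConfigLoader are hypothetical names invented
for this illustration.

```java
import java.io.IOException;

// Hypothetical seam: anything that can produce the configuration text.
interface ConfigSource {
    String read() throws IOException;
}

final class ConfigLoader {
    private final ConfigSource source;

    ConfigLoader(ConfigSource source) {
        this.source = source;
    }

    /** Returns the trimmed configuration text, or the fallback if reading fails. */
    String loadOrDefault(String fallback) {
        try {
            return source.read().trim();
        } catch (IOException e) {
            // Hard to reach through the public interface unless the source can be swapped.
            return fallback;
        }
    }
}

final class ConfigLoaderTest {
    public static void main(String[] args) {
        ConfigLoader working = new ConfigLoader(() -> " port=8080 ");
        ConfigLoader broken = new ConfigLoader(() -> {
            throw new IOException("disk unavailable");
        });

        if (!working.loadOrDefault("none").equals("port=8080")) {
            throw new AssertionError("happy path should return trimmed text");
        }
        if (!broken.loadOrDefault("none").equals("none")) {
            throw new AssertionError("failure path should return the fallback");
        }
        System.out.println("both paths covered");
    }
}
```

The production class is still exercised only through its public method, and
the constructor parameter is arguably a design improvement in its own right.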

Second, some code has high complexity, as measured by cyclomatic complexity
(roughly, the number of independent paths through the code). Every additional
path needs its own test setup to be covered. Arguably this code should be
refactored, but projects do have a certain percentage of their code with high
cyclomatic complexity that gets carried forward from sprint to sprint.

The third reason is a bit technical. Java source code is compiled into
bytecode. Many Java code coverage tools run off of an analysis of the bytecode,
not the source code. The Java compiler will consume the source code and emit
bytecode that may have extra logic in it, meaning code with extra branches. It
might not be possible to control the conditions which would take one path or
the other through this invisible branch. Further complicating this, the
invisible logic can change from Java compiler release to release, putting a
burden on the test logic to reverse engineer the conditions needed to cover
this invisible logic.
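
One illustration of such an invisible branch, offered as a sketch rather than
a statement about any particular compiler release: a switch on a String. The
source below has one visible decision. The assumption here is that the
compiler translates the switch into a lookup on s.hashCode() followed by an
equals() check, which is the typical javac translation. That adds a path for
"the hash matched but the string did not", which never appears in the source.
The class name is hypothetical.

```java
// Hypothetical example class; the names are for illustration only.
final class HiddenBranches {
    /** One visible decision in the source: is the input "Aa" or not? */
    static boolean isAa(String s) {
        switch (s) {
            case "Aa":
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        // "Aa" and "BB" are different strings that share the hash code 2112.
        System.out.println("Aa".hashCode() == "BB".hashCode()); // prints true

        System.out.println(isAa("Aa")); // hash matches, equals matches
        System.out.println(isAa("zz")); // hash does not match
        System.out.println(isAa("BB")); // hash matches, equals does not
    }
}
```

A test author who wants to cover that third path has to know about the
hashing and go find a colliding string, which is exactly the kind of reverse
engineering described above. Some coverage tools filter out certain
compiler-generated branches, but which ones depends on the tool and version.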

Summary

Based on the above discussion, achieving 100% line coverage can be very
expensive. On teams that I have worked on over the years, a reasonable line
coverage would be 70% or more. But you should let the development team
determine this limit. If you force your teams to get to 100% line coverage, you
are spending money that might be better spent on automation tests. In addition,
I have seen cases where developers will short-circuit the unit tests by writing
tests only for the purpose of increasing the coverage. You can readily identify
these tests because they have no assertion or verification check in them - they
just make a call and never check on the result.
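
Here is a minimal sketch of the difference, using a hypothetical Discount
class. Both test methods produce exactly the same line coverage for apply(),
but only the second would ever fail if the logic were wrong.

```java
// Hypothetical production class, used only for illustration.
final class Discount {
    static int apply(int priceCents, int percentOff) {
        return priceCents - priceCents * percentOff / 100;
    }
}

final class DiscountTest {
    // Coverage-only "test": executes every line of apply() and verifies nothing.
    static void coverageOnly() {
        Discount.apply(1000, 10);
    }

    // Real test: same coverage, but the result is actually checked.
    static void tenPercentOffOneThousand() {
        int result = Discount.apply(1000, 10);
        if (result != 900) {
            throw new AssertionError("expected 900 but got " + result);
        }
    }

    public static void main(String[] args) {
        coverageOnly();
        tenPercentOffOneThousand();
        System.out.println("done");
    }
}
```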

In short, you should be careful what you ask for. Make sure you interact with the
development team in making the decision about code coverage. Spending another
50% of scarce testing dollars on that last 10% coverage is unlikely to bring a
return on investment.