Summary
Testing code is different from testing a system. Code in real-world production systems has to contend with an ever-changing, often unpredictable environment that renders unit tests an unreliable predictor of system behavior. In the real world, system robustness matters, but writing more tests can produce diminishing returns.

Do unit tests instill in us a false sense of certainty?

That's how I felt the other night. Bill Venners (Artima's chief editor) and
I were helping a group of developers at a Silicon Valley Patterns Group meeting. Our
goal that evening was to build a Jini and JavaSpaces compute grid. Before
everyone could go to work on grid code, we needed to get Jini lookup services up and
running on an impromptu wireless network assembled in the back room of Hobee's restaurant in Cupertino.

The Jini Starter Kit, which Sun will soon open-source, is as high-quality and
thoroughly tested a piece of code as code gets. Indeed, the JSK is used to run
high-volume business-critical transactions at a handful of large corporations. Starting up Jini lookup services with the JSK is typically a snap.

But it wasn't so that night. We struggled for an hour with this normally simple step,
having to adjust several aspects of users' local environment: moving files around,
deleting directories, checking for multicast capability on network interfaces, etc.
The exercise was frustrating to those who, just a few short hours prior to our
meeting, were able to run Jini lookup services on the very same laptops they brought
with them to the meeting. The rigorous testing and QA processes followed by the Jini
developers predicted nothing about how well the system would work on our impromptu
network that night.

A few days later, Bill and I were sitting just a few yards away from Hobee's, trying
to start up a new version of Artima code. Before checking in that code, I made sure
that all of the more than one hundred unit tests for that module passed. Yet, when Bill
checked that code out and started up the Web app on his laptop, a subtle
configuration issue prevented the app from working as intended. While the code itself was
tested, the system relied on configuration options that were defined partially
outside the code. The unit tests, again, were no indication of whether the code would
run at all in a real-world environment.

Were our tests, or the Jini JSK's tests, flawed? How could we account for
environmental exigencies in those tests? How deep should we aim in our test
coverage? Should we strive to cover all the permutations of code and its environment
in our test suites? Is such complete test coverage of code even attainable?

System Matters

These experiences made me appreciate the distinction between testing code and testing
a system. The real world only cares about the system - the actual
interaction of all the code in a given piece of software with its environment. Unit
tests, on the other hand, mostly test code: Unit tests are proof that a given method,
or set of methods, acts in accord with a given set of assertions.

Unit tests are also code. When running a set of unit tests, the code that's being
tested and the test code itself become part of the same environment - they are part
and parcel of the same system. But if unit tests are part of the system that's being
tested, can unit tests prove anything about the system itself?

No lesser a logician than Kurt Gödel had something to say about this. To be sure,
Gödel's concern was formal proof in arithmetic, not unit testing. But in addressing the false
sense of certainty implied in Russell and Whitehead's Principia Mathematica, Gödel demonstrated that a consistent formal system rich enough to express arithmetic contains true statements that cannot be proven from within the system itself. In every logical system, there must be axioms - truths that must be taken for granted, and that can be demonstrated true or false only by stepping outside the system.

Such axioms are present in any software system: We must assume that the CPU works as
advertised, that the file system behaves as intended, that the compiler produces
correct code. Not only can we not test for those assumptions from within the system,
we also cannot recover from situations where the axiomatic aspects of the system turn
out to be invalid. If any of a system's axioms turn out to be wrong, the system
suffers catastrophic failure - failure from which no recovery is possible
from within the system itself. In practical terms, you will just have to reboot.

A cardinal aspect of a test plan, then, is to determine a system's axioms, or aspects
(not in the AOP sense) that cannot be proven true or false from within the system.
Apart from those system axioms, all other aspects of the system can, and should, be
covered in the test plan.

The fewer the axioms, the more testable the system. Fewer axioms also result in fewer
possibilities for catastrophic failure. But in any system, there will always be
conditions that cause complete system failure - CTRL-ALT-DEL will be with us for
good. Fully autonomous, infallible systems truly belong in the realms of science
fiction and fantasy.

Degrees of Belief

If we accept that there will always be a few aspects of a system that we cannot write
tests for, aspects whose correctness we must take for granted, how do we decide on
those "axioms"?

Do you write test methods for simple property accessor methods, or do you just assume
that the JVM does what it's supposed to do? Do you write a test to ensure that a
database, indeed, inserts a record, or do you decide to take that operation for
granted? Do you just assume that a network connection can be opened to a host - is
that operation a system "axiom"? And do you just assume that a remote RMI call will
return as intended, or do you write tests for all sorts of network failures, along
with possible recovery code? Finally, do you just assume that a user types a correct
piece of data in an HTML form, or do you write tests and error handling code in that
situation?

Clearly, there is a spectrum, and we often make our decisions about our "system
axioms" based on our beliefs of certainty about correctness. Most of us are highly
uncertain that every user always enters the right answer in a form, so we always
write tests in that situation. But most of us are fairly sure that a database can
perform an insert just fine, so writing tests for that operation would seem like a
waste of time, unless we're testing the database itself.

If our decisions about what to take for granted in a system are based on such degrees
of belief, and if tests start where "axioms" end, then the degree to which testing
tells us about a system's behavior in the larger, operating context of that system,
is also dependent on those beliefs.

The Jini code, for instance, assumed that multicast is enabled on all network hosts.
The Artima code took a specific configuration for granted, and assumed that that
configuration is the one supplied at system startup. We didn't test for that; we just
assumed that it is always so. The tests passed, but the system still failed when
that condition was not satisfied in a different operating environment.

In addition to beliefs, we also have to contend with market pressures when choosing
system "axioms." You may know that a remote method call can fail a hundred different
ways on the network, but you also know that shipping a product today, as opposed to
tomorrow, can lead to a market share gain. So you decide to not test for all those
possible network failures, and to take the "correctness" of the network for granted.
You hope you get lucky.

Past Behavior

We could improve our degrees of belief about system correctness if we analyzed past
system behavior. One way to do this is to rely on experience. But another way to do
this is to follow what a better search engine, such as Google, does: The more we use
the system, the better the system gets because it learns from past data to improve
its results.

We could instrument code in such a way as to capture failure conditions (e.g., by
logging exceptions). We could then tell that, for example, one out of every N remote
calls on a proxy interface results in failure, given the typical operating environment
of that code. Note that that's real-world data, not just assumptions. Then we could
assign the complement of that failure rate - the probability that the call succeeds
- to that method call. We could then correlate that information to how often a call
is used, and produce a matrix of the results.
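
As a rough sketch of what such instrumentation might look like in Java, the class below counts calls and failures per operation and derives both the observed success probability and each operation's share of all calls. The class name CallStats, its methods, and the operation names are all hypothetical, invented for illustration; a real system would more likely feed these counters from its exception logging.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper (not part of any library): per-operation call statistics.
final class CallStats {
    private static final class Counter {
        long calls;
        long failures;
    }

    private final Map<String, Counter> counters = new TreeMap<>();

    // Record one observed call and whether it ended in failure.
    synchronized void record(String operation, boolean failed) {
        Counter c = counters.computeIfAbsent(operation, k -> new Counter());
        c.calls++;
        if (failed) {
            c.failures++;
        }
    }

    // Observed probability that a call succeeds: 1 - failures / calls.
    synchronized double successRate(String operation) {
        Counter c = counters.get(operation);
        if (c == null || c.calls == 0) {
            return Double.NaN; // no data yet, so no estimate
        }
        return 1.0 - (double) c.failures / c.calls;
    }

    // Fraction of all recorded calls that went to this operation.
    synchronized double usageShare(String operation) {
        long total = 0;
        for (Counter c : counters.values()) {
            total += c.calls;
        }
        Counter c = counters.get(operation);
        if (c == null || total == 0) {
            return 0.0;
        }
        return (double) c.calls / total;
    }

    // One row per operation: the "matrix" of reliability against usage.
    synchronized String report() {
        StringBuilder sb = new StringBuilder();
        for (String op : counters.keySet()) {
            sb.append(String.format("%-12s success=%.3f usage=%.3f%n",
                    op, successRate(op), usageShare(op)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        CallStats stats = new CallStats();
        stats.record("lookup", false);
        stats.record("lookup", true);   // e.g. a logged remote exception
        stats.record("store", false);
        System.out.print(stats.report());
    }
}
```

An operation with a low success rate and a high usage share would then be the first candidate for more tests or more recovery code.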

That probability matrix would be a more accurate indicator of the code's actual
reliability - or "quality" - than just having a set of tests that happen to all run
fine on a developer's machine. Such information would help developers pinpoint what
"system axioms" are valid, and what assumptions prove incorrect in the real world.

With that information, we would not need to strive for complete code coverage, only
coverage that leads to a desired quality level. That would, in turn, make us all more
productive.

I think it may even be possible to find emergent patterns in code with a probability
matrix of that code shared on the Web. Coverage and testing tools could tap into that
database to make better decisions about where to apply unit tests, and about how
indicative existing unit tests are about actual code behavior.

That said, I'm curious how others deal with ensuring proper configuration, and how others account for configuration options in tests. What are some of the ways to minimize configuration so as to reduce the chances of something going wrong? Then again, isn't reducing configuration also reducing flexibility and "agility"?

In general, do you agree with my conclusion that complete test coverage is not desirable, or even attainable? How do you choose what to test and what not to test? How do you pick your "system axioms"?

> ...another way to do this is to follow what a better search engine, such as Google, does: The more we use the system, the better the system gets because it learns from past data to improve its results.

In all your observations about the dangers of mistaking unit testing for system testing, the snippet above caught my eye. I'm intrigued to know how, in the absence of a feedback path, Google knows that I've found what I'm looking for and thus knows how to use that information to improve future searches.

> That said, I'm curious how others deal with ensuring proper configuration, and how others account for configuration options in tests. What are some of the ways to minimize configuration so as to reduce the chances of something going wrong? Then again, isn't reducing configuration also reducing flexibility and "agility"?

I'm not sure if total flexibility/"agility" is desirable. In most of the systems I have worked with, configuration was kept in CSVs AND XML AND property files AND DB registry tables AND interface constants AND/OR class static constants AND maybe JNDI, etc.!

Unless there is a configuration framework in place before you start a project, it is very likely that the "let's-ship-it squad" is going to create most of the above ways to store configuration data.

Rod Johnson's excellent "J2EE Development without EJB" describes the best solution I've seen so far. By using a lightweight (IoC, as he puts it) container like Spring and designing to interfaces rather than concrete classes, you can non-invasively configure your objects at runtime and still allow great flexibility during testing.

I need to stress that it is not just a case of dropping such a framework/container into the mix and getting your problems solved. Designing for testing/maintenance should also be high on your list of priorities.

One approach that seems to be getting some attention in the agile/scripting language world (Ruby, Python, etc.) is to use the language itself to specify the configuration. So they just fold the configuration file into the source tree.

Now this approach is not feasible in a widely distributed Java application that runs on users' desktops, but for a server-side application, you usually have a JDK (not just a JRE) available. That makes it possible to compile the Java config file either on startup from the application, or as a separate step in the configuration process.

For the scripting language world, this is certainly a worthwhile idea. Is it also worthwhile for Java server applications?
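
To make the idea concrete, here is a minimal sketch of configuration written as Java source rather than as a properties or XML file. Everything in it is hypothetical and chosen only for illustration: the AppConfig interface, the StagingConfig class, and the example URL do not come from any real project.

```java
// Hypothetical configuration contract: each setting is a typed method.
interface AppConfig {
    String databaseUrl();
    int maxConnections();
}

// One deployment's settings, compiled like any other source file.
final class StagingConfig implements AppConfig {
    @Override
    public String databaseUrl() {
        return "jdbc:example://staging-db/app"; // made-up URL
    }

    @Override
    public int maxConnections() {
        return 10;
    }
}

final class ConfigDemo {
    // Application code depends only on the interface, not on a file format.
    static String describe(AppConfig config) {
        return config.databaseUrl() + " (max " + config.maxConnections() + " connections)";
    }

    public static void main(String[] args) {
        System.out.println(describe(new StagingConfig()));
    }
}
```

With this approach the compiler catches a misspelled setting name or a wrong type, though it cannot tell whether a value is right for the environment the application ends up running in.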

> In all your observations about the dangers of mistaking unit testing for system testing, the snippet above caught my eye. I'm intrigued to know how, in the absence of a feedback path, Google knows that I've found what I'm looking for and thus knows how to use that information to improve future searches.

There are several ways, and PageRank is one of them. This is research that was conducted at Stanford, and the results are published at various places. A quick overview is here:

> One approach that seems to be getting some attention in the agile/scripting language world (Ruby, Python, etc.) is to use the language itself to specify the configuration. So they just fold the configuration file into the source tree.
>
> Now this approach is not feasible in a widely distributed Java application that runs on users' desktops, but for a server-side application, you usually have a JDK (not just a JRE) available. That makes it possible to compile the Java config file either on startup from the application, or as a separate step in the configuration process.
>
> For the scripting language world, this is certainly a worthwhile idea. Is it also worthwhile for Java server applications?

It is possible, and many people use Ant as a sort of configuration tool (Ant tasks, to be precise, which are little Java programs).

The problem is that I can't test the correctness of that configuration, since the configuration itself requires the environment to conform to the very requirements I'd like to test for. So, if I deploy an app without being able to test the configuration, then I can still experience the unpleasant surprise that a perfectly well-tested, working app doesn't work.

2. When you find a problem (like the one at your meeting), write a test for it: system or unit, whichever it takes. So, the next time there's a problem, anyone can run the tests in the problem environment to help troubleshoot it.

3. Minimize the axioms. For example, the axioms of most systems include the JDK and Ant, along with their configurations (e.g., /etc/ant.conf, ~/.antrc, jre/lib/ext, etc). But, I prefer to depend on just the OS and CVS, so I put the JDK and Ant into CVS and used scripts to make sure that the project is using only the configuration from CVS. External systems, like database servers, are another axiom that is good to avoid if possible. For example, you can bundle an open-source database server with the project, turning an external system into an internal one.

White-box unit tests use examples, logic, and knowledge of a unit's internals to validate that a unit works in a particular way, in as much isolation as possible.

Black-box system tests follow a goat track through the application and try to get a sensible result.

It seems what you need are some new features to add usability.

A unit that functions perfectly with correct inputs, but does not help diagnose incorrect input may be unusable. You need features to help people find out what they are doing wrong on their end.

A configuration checking feature that clearly identifies incorrect configuration and suggests the right remedy is very useful.
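
A minimal sketch of such a checking feature, assuming the configuration arrives as java.util.Properties, might look like the following. The setting names db.url and server.port, the class name ConfigChecker, and the remedy texts are hypothetical placeholders, not taken from any real application.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical startup check: reports every problem together with a suggested remedy.
final class ConfigChecker {

    // Returns human-readable problems; an empty list means the checks passed.
    static List<String> check(Properties config) {
        List<String> problems = new ArrayList<>();

        String dbUrl = config.getProperty("db.url");
        if (dbUrl == null || dbUrl.trim().isEmpty()) {
            problems.add("db.url is missing: set it to the JDBC URL of the database");
        }

        String port = config.getProperty("server.port");
        if (port == null || port.trim().isEmpty()) {
            problems.add("server.port is missing: set it to a TCP port between 1 and 65535");
        } else {
            try {
                int value = Integer.parseInt(port.trim());
                if (value < 1 || value > 65535) {
                    problems.add("server.port is " + value
                            + ": choose a TCP port between 1 and 65535");
                }
            } catch (NumberFormatException e) {
                problems.add("server.port is '" + port
                        + "': it must be a whole number such as 8080");
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        Properties config = new Properties();
        config.setProperty("server.port", "eighty");
        // Print all problems at once so the user can fix them in one pass.
        for (String problem : check(config)) {
            System.err.println("Configuration problem: " + problem);
        }
    }
}
```

Collecting all problems before reporting, rather than failing on the first one, is a design choice aimed at saving the user repeated restarts.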

All SQL that our application runs, and the results from it, are logged. That way, when a table is changed or we are not connected to the right database, this is clear from the logs.

It is not that we do or don't unit test an insert. It's that unit testing an insert does not guarantee good behavior at runtime. Inserts can fail because of disk space, permissions, DDL changes, deadlock, network issues, and 5000 other things.

Now I am just off to unit test a select. It will not guarantee that it will work at runtime, but it is a quick way to work out if I am doing something dumb in my code.

Some things are easy to unit test and some things are not. Putting source that is hard to test into XML, rather than Java, does not change this. But it could help you get closer to 100% coverage of Java code. So maybe you should switch to Spring for a false boost to your confidence.

> But, I prefer to depend on just the OS and CVS, so I put the JDK and Ant into CVS and used scripts to make sure that the project is using only the configuration from CVS. External systems, like database servers, are another axiom that is good to avoid if possible. For example, you can bundle an open-source database server with the project, turning an external system into an internal one.

This is an interesting point, and this is what I have chosen to do in the past: to bundle all required software with a distribution so as to rely on as few external dependencies as possible. The problem with that is that even that larger distribution must exist in a context.

As one example, software my company ships requires network access to a machine that is to act as the server. When someone installs the software, that network access may be available. At some later point, though, the customer may install firewall software that shuts off network access to the machine. That breaks the software, which, in turn, results in a poorer quality perception of the product in the customer's view. Not to mention, it results in technical support calls, which cost money.

So testing for the network is not worth it for us, because there is nothing we can automatically do to solve that problem. If we could programmatically communicate with all the firewall software, operating systems, etc. out there, we could possibly solve this. But that's just not the case. I think that applying more tests in this area would not produce a payoff in our case.

I really like the concept of identifying axiomatic assumptions of your system. The problem is that while specifying all axioms is possible (but not likely), implementing checks for all of them is unlikely. We have a financial application that predicts the future behavior of a portfolio of accounts, such as a collection of consumer mortgages. The models' behavior in the system is highly dependent on the quality of the data stored in the current book of business, along with the quality of the economic assumptions used to predict the future. To guarantee valid results, we would have to continually check all these assumptions at every step along the many paths where we are calculating future events. This isn't possible to do at PC processor speeds while still returning results within the product's specified calculation time limits.

So we attempt to check these assumptions at other times when it is more efficient to do so. But again, it is not possible for a human to locate and specify every relationship that relies on this asynchronous validation.

In practice, we discover and document places where components create outputs that violate these assumptions. We then attempt to deal with these defects in future releases by polishing the offending components. In effect, we use the purple wire approach described by Fred Brooks in The Mythical Man-Month to manage these flaws discovered in system testing. The most important lesson we've learned is that a probability of occurrence must be assessed or evaluated when these system defects or purple wires show up. We don't formally track these probabilities, but we certainly pick numbers and rank them when planning future enhancements.

That reminds me of a system test I added once. I was testing some code that used a library that depended on an external server. My tests started failing, and it took me some time to troubleshoot the problem: the external server had become unavailable. So, I added a test to first make sure that the external server was available before running any tests that depended on it. That saved me a lot of troubleshooting time over the next year or so.
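
A guard of that kind could be sketched in Java roughly as follows. This is not the code I actually used; the class name AvailabilityGuard and both method names are hypothetical, and the sketch assumes that the external server counts as available whenever a TCP connection to it can be opened.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.function.BooleanSupplier;

// Hypothetical precondition guard for tests that need an external server.
final class AvailabilityGuard {

    // True if a TCP connection to host:port opens within timeoutMillis.
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Refused, timed out, or unknown host: treat all as "unavailable".
            return false;
        }
    }

    // Runs the dependent tests only when the probe succeeds; returns whether they ran.
    static boolean runIfAvailable(BooleanSupplier probe, Runnable dependentTests) {
        if (!probe.getAsBoolean()) {
            System.err.println("Skipping dependent tests: external server is unavailable");
            return false;
        }
        dependentTests.run();
        return true;
    }

    public static void main(String[] args) {
        // Host and port are placeholders for whatever the library depends on.
        boolean ran = runIfAvailable(
                () -> canConnect("localhost", 8080, 500),
                () -> System.out.println("Running tests that need the server"));
        System.out.println("Dependent tests ran: " + ran);
    }
}
```

Taking the probe as a parameter keeps the guard itself testable without a network, and the skip message tells whoever runs the suite why tests did not run instead of leaving them to troubleshoot failures.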

But the problem you describe is different. If the external server becomes unavailable during operation, despite being available during installation and testing, and if this becomes a support problem, then I guess the solution is to add some functionality to alert the user or tech support when this problem occurs. This functionality could be as basic as a meaningful Exception chain propagated up to a log file that tech support can see. But if it happens often enough, then that might justify the functionality of a user-friendly alert via whatever user interface is available.

I think the real issue you're getting at is how much to assume versus how much to handle. It's the same issue when deciding what to assert in the production code. E.g., should you assert that you have gotten all the configuration settings that you're expecting? This issue is more about error handling than testing.

All of this depends on what you think unit tests are good for. Personally, I don't write them to prove correctness; I write them because they facilitate change and decouple my designs.

Unit tests are great change detectors. They are also a great way of getting feedback during development, but yes, they are poor at detecting configuration problems or problems in macro-systemic behavior, because those are really not unit-level things.

Sounds axiomatic, but tests are very good at testing what they test but they are poor at testing what they don't test. For some reason, though, unit tests are often nailed for the things that are outside their scope. We could conclude that that means unit tests aren't too important, or we could conclude that there are other forms of testing that are better for those other problems.

> Sounds axiomatic, but tests are very good at testing what they test but they are poor at testing what they don't test. For some reason, though, unit tests are often nailed for the things that are outside their scope. We could conclude that that means unit tests aren't too important, or we could conclude that there are other forms of testing that are better for those other problems.

I didn't interpret this as Frank trying to nail unit tests, but as asking what degree of test coverage is optimal, and how we can deal with what's left over that's not tested. Frank and I are both big believers in the value of unit tests. As I've mentioned in previous discussions, what I do to decide whether to write a particular test, and what I imagine most people do, is attempt to judge the ROI of the test. How does the expected return (a guess) compare against the required investment (an estimate)? Does that estimated ROI seem worth it in the current business context? If so, it makes sense to write the test. If not, then the business would be better off if I invested the time another way. So whether the test coverage target to aim for is 40% or 60% or 80% really depends on human judgement.

One of the messages I get from XP (everyone hears things through the filters of their own biases) is that I should really open my mind to the value of those unit tests--i.e., that I may tend to underestimate the return on investment. The return is expressed not just in correctness, but in the confidence to make changes later.

I think the main question raised in Frank's post for me (once again filtering through my own biases) is how do we deal with configuration problems? I can think of a few ways:

1. Minimize configuration.

Sounds good, but this would seem to go against the grain of making software configurable so it is easy to tune it at deployment time without recompiling source code. That's a big trend.

2. Simplify configuration.

This sounds better to me. This says keep configuration as simple as possible given the requirements. Question requirements that call for complex configuration. When there's no configuration requirement left to get rid of, then simplify the expression of that configuration. Maybe try and put everything in one configuration file. Make sure each configuration parameter just shows up in one and only one place. (Oops, that's the DRY (Don't Repeat Yourself) principle.)

3. Write tests that explore what the app or system does when the axiomatic assumptions are broken.

This would force me to think about and decide what should happen in such cases, and would likely lead to friendlier software. Writing such tests might help me find more ways to simplify configuration and minimize external assumptions. Of course, I have to apply the ROI estimation to such tests, and whether or not to write them just depends on the situation.

Perhaps a good exercise is just to try to write down all the external assumptions an app or system makes, what could go wrong, and what should happen if it goes wrong. For scenarios that seem very far-fetched, it may be OK to say that the resulting behavior is undefined (because it isn't worth it in the business context to actually define the behavior). For other scenarios, it may seem worth it to define the behavior, and then that becomes a new requirement, which should be implemented and probably tested.