Andrew Koenig

Dr. Dobb's Bloggers

Performance Bugs: Not Just Hard To Detect, But Hard To Define

December 05, 2013

Test-driven development is among the most important ideas about how to develop software in the past 20 years or so.

Test-driven development is among the most important ideas about how to develop software in the past 20 years or so. This idea really has two parts. The first part is automatic testing: Push a button, the system churns for a while, and eventually it tells you which test worked and which ones didn’t. The second part is that the tests define the software: As soon as the software passes all the tests, it's done. By implication, it had better be possible to define tests that catch everything that might be wrong with the software: If you can't test it, you have no other way of knowing that it works.

In order to test a piece of software, you need to figure out what it's supposed to do in the situation that you're testing, and to find a way to verify that it did what you expected. Unfortunately, when you are trying to test performance, both of these needs are somewhere between difficult and impossible to meet.

To be sure, it is usually easy to come with a rough idea of how a system should perform. For example, a performance goal might be a given number of simultaneous users, or an upper bound on execution time, or limits on resources used. However, such goals are often too imprecise to allow a simple yes/no answer to the question of whether the system meets those goals.

For example, we might imagine a system that supports sales people in a retail store. A performance requirement for such a system might include the number of sales people and an upper bound on response time. But what does that upper bound really mean? More specifically, what does that upper bound mean in terms that can be measured for testing purposes? Suppose, for example, that the specification is that with up to 30 users, response time should be no worse than one second. Here are some hard questions to answer:

Does a single response that takes longer than one second mean that the whole system has failed?

If the response time is intended to be an average, does that mean that the system can consistently fail to meet that requirement for a while as long as it does better than required the rest of the time?

Does the specification for the number of users include requirements as to what the users are doing?

A logical first try at testing for such specifications is to construct a synthetic load and measure how the system performs under that load. If it passes the performance tests under that synthetic load but behaves unacceptably in practice, and the tests define the software, how is it even possible to say that the behavior is incorrect? If you think that the behavior is incorrect, how do you say so on the bug report?

This last point may be the most important: If performance is defined in terms of a synthetic benchmark, and a system passes the benchmark but fails to behave reasonably in actual use, is it even a bug at all? If it is, it must be a bug in the benchmark — either that, or the whole notion that the tests define the software is wrong. In other words, one major problem in testing performance is that doing so requires constructing benchmarks that are useful in the circumstances for which the system is being used.

Now let's think about the unfavorable interaction between benchmarks and optimization. Part of the purpose of a benchmark is to be simpler than practical use. Defining a synthetic load and a standardized set of transactions is useful precisely because doing so is easier than collecting a bunch of sales people and asking them to pretend that they are selling merchandise to customers. However, this very simplicity runs afoul of the optimizations that pervade almost every operating system and runtime library.

For example, suppose that our synthetic load reads and writes a lot of data to and from the filesystem. If the synthetic load uses the same files over and over again, copies of those files will probably wind up in the operating system's cache. As a result, operations on those files are apt to be faster than they would be under a real load. On the other hand, if the synthetic load is careful to use completely unrelated files in order to avoid inadvertent filesystem optimizations, it might wind up hurting the overall system performance more than a real load would. Evidently, if you are going to use benchmarks to specify system performance, you must not only use the benchmarks to test the system, but you must also test the benchmarks.

Another problem with synthetic benchmarks is that if they are too simple, they tempt the developers to write code that works well with the benchmarks but ignores real-world performance. This developer behavior is a logical consequence of the notion that if the software passes the tests, it's done. As an extreme example, I once encountered an implementation of common mathematical functions (such as sine and cosine) that checked whether its arguments had values that were known to be used in widely known benchmarks. If so, it returned pre-stored results corresponding to those arguments and skipped the computation entirely. This was a strategy that caused the program to pass the tests at the expense of making the tests meaningless. One manager referred to this particular piece of software as "the bait-and-switch math library."

Even at a gross level, then, performance is difficult to measure. This difficulty remains — and may even get worse — when we start thinking about measuring individual components rather than whole systems. Suppose, for example, that we want to test the performance of an implementation of a standard C++ library algorithm such as sort. The library specification says that sorting an n-element sequence should execute O(n log n) operations on average, but if a C++ implementation uses Quicksort, as is typical, a particularly unfortunate input to sort might cause it to execute O(n2) operations. How can we go about constructing an automated test that verifies whether an implementation of sort meets this requirement?

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task.
However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

Video

This month's Dr. Dobb's Journal

This month,
Dr. Dobb's Journal is devoted to mobile programming. We introduce you to Apple's new Swift programming language, discuss the perils of being the third-most-popular mobile platform, revisit SQLite on Android
, and much more!