Empirical Software Engineering

Quality by the Numbers

A good example of the harvest that empirical studies are generating relates to one of the holy grails of software engineering: the ability to measure the quality of a program, not by running it and looking for errors, but by automated examination of the source code itself. Any technique that could read a program and predict how reliable it would be before it is delivered to customers would save vast sums of money, and probably lives as well.

One consistent discovery is that, in general, the more lines of code there are in a program, the more defects it probably has. This result may seem obvious, even trivial, but it is a starting point for pursuing deeper questions of code quality. Not all lines of code are equal: One line might add 2 + 2 while another integrates a polynomial in several variables and a third checks to see whether several conditions are true before ringing an alarm. Intuitively, programmers believe that some kinds of code are more complex than others, and that the more complex a piece of code is, the more likely it is to be buggy. Can we devise some way to measure this complexity? And if so, can the location of complexity hot spots predict where defects will be found?

One of the first attempts to answer this question was developed by Thomas J. McCabe and is known as cyclomatic complexity. McCabe realized that any program can be represented as a graph whose arcs show the possible execution paths through the code. The simplest graph is a straight chain, which represents a series of statements with no conditions or loops. Each if statement creates a parallel path through the graph; two such statements in sequence create four possible paths. Figure 2 shows a snippet of code extracted from a cross-platform download manager called Uget. The graph in part (b) shows the paths through the code; each if and loop adds one unit of complexity, giving this code an overall complexity score of 3.
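The counting rule just described (start at one, then add one for each if and each loop) can be sketched in a few lines. This is a minimal illustration in Python rather than the C of the Uget snippet, and it is a simplification: it ignores other sources of branching such as Boolean operators and exception handlers. The function name and the sample code are hypothetical, chosen only for illustration.

```python
import ast

# Node types treated as decision points in this simplified count.
BRANCH_NODES = (ast.If, ast.For, ast.While)


def cyclomatic_complexity(source: str) -> int:
    """Return 1 plus the number of if statements and loops in the source."""
    tree = ast.parse(source)
    decisions = sum(isinstance(node, BRANCH_NODES) for node in ast.walk(tree))
    return decisions + 1


# Hypothetical sample: one loop and one if.
SAMPLE = '''
def first_negative(values):
    for v in values:
        if v < 0:
            return v
    return None
'''

if __name__ == "__main__":
    print(cyclomatic_complexity(SAMPLE))  # one loop + one if + 1 = 3
```

Straight-line code with no conditions or loops scores 1 under this rule, matching the "straight chain" graph.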

Another widely used complexity measure is Maurice Halstead’s software science metric, which he first described in 1977. Instead of graph theory, it draws on information theory and is based on four easily measured features of code that depend on the number of distinct operators and operands, their total count, how easy they are to discriminate from one another and so on. Figure 2(c) shows the values for the sample piece of code in (a).
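As a rough sketch of how such counts are gathered, the Python code below tallies distinct and total operators and operands using the standard tokenize module, then derives Halstead's vocabulary, length, volume and difficulty from them. The classification rule used here (punctuation and keywords count as operators; names, numbers and strings count as operands) is an assumption made for illustration, since real tools differ in where they draw that line. The function name halstead is hypothetical.

```python
import io
import keyword
import math
import tokenize


def halstead(source: str) -> dict:
    """Compute basic Halstead counts for a piece of Python source."""
    operators, operands = [], []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        is_keyword = tok.type == tokenize.NAME and keyword.iskeyword(tok.string)
        if tok.type == tokenize.OP or is_keyword:
            operators.append(tok.string)
        elif tok.type in (tokenize.NAME, tokenize.NUMBER, tokenize.STRING):
            operands.append(tok.string)

    n1, n2 = len(set(operators)), len(set(operands))  # distinct counts
    N1, N2 = len(operators), len(operands)            # total counts
    vocabulary, length = n1 + n2, N1 + N2
    # Guard against empty input, where the logarithm or ratio is undefined.
    volume = length * math.log2(vocabulary) if vocabulary else 0.0
    difficulty = (n1 / 2) * (N2 / n2) if n2 else 0.0
    return {
        "n1": n1, "n2": n2, "N1": N1, "N2": N2,
        "vocabulary": vocabulary, "length": length,
        "volume": volume, "difficulty": difficulty,
        "effort": difficulty * volume,
    }


if __name__ == "__main__":
    for name, value in halstead("x = a + b\ny = a + 1\n").items():
        print(name, value)
```

For the two-line sample, the operators are = and + (two distinct, four in total) and the operands are x, a, b, y and 1 (five distinct, six in total).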

Hundreds of other metrics have been developed, published and analyzed over the past 30 years. In their chapter, Herraiz and Hassan use statistical techniques to explore a simple question: Are any of these metrics actually better at predicting errors than simply counting the number of lines of source code? Put another way, if a complexity metric is highly correlated with the number of lines of source code, does it actually provide any information that the simpler measure does not?

For a case study, Herraiz and Hassan chose to examine the open-source Arch Linux operating system distribution, which yielded a sample of 338,831 unique source files in the C language. They calculated the measures discussed above, and several others, for each of these files, taking special account of header files (those consisting mainly of declarations that assist in code organization). They found that for nonheader files, where programs actually do their work, all the metrics tested showed a very high degree of correlation with lines of code. When the authors checked for generalizability, they found that the effect held for all but very small files. They drew a clear lesson: “Syntactic complexity metrics cannot capture the whole picture of software complexity.” Whether based on program structure or textual properties, the metrics do not provide more information than simply “weighing” the code by counting the number of lines.
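The kind of comparison described here can be illustrated on a toy scale. The sketch below measures lines of code and a simplified if-and-loop complexity count for a handful of tiny, made-up Python snippets, then computes the Pearson correlation between the two. Everything in it is an assumption for illustration: the snippets are hypothetical, the study examined C files rather than Python, and Pearson correlation is not necessarily the statistic Herraiz and Hassan used.

```python
import ast
import math


def lines_of_code(source: str) -> int:
    """Count nonblank lines."""
    return sum(1 for line in source.splitlines() if line.strip())


def decision_count(source: str) -> int:
    """Simplified complexity: 1 plus the number of ifs and loops."""
    nodes = ast.walk(ast.parse(source))
    return 1 + sum(isinstance(n, (ast.If, ast.For, ast.While)) for n in nodes)


def pearson(xs, ys) -> float:
    """Pearson correlation; assumes neither sequence is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical snippets standing in for source files.
SNIPPETS = [
    "x = 1\n",
    "if a:\n    b = 1\n",
    "for i in r:\n    if i:\n        t = i\n",
    "while a:\n    if b:\n        c = 1\n    if d:\n        e = 1\n",
]

if __name__ == "__main__":
    loc = [lines_of_code(s) for s in SNIPPETS]
    complexity = [decision_count(s) for s in SNIPPETS]
    print(loc, complexity, round(pearson(loc, complexity), 3))
```

A correlation close to 1 means the complexity score rises almost in lockstep with file length, which is the situation in which the metric adds little beyond the line count.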

Like all negative results, this one is a bit disappointing. However, that does not mean these metrics are useless—for example, McCabe’s scheme tells testers how many different execution paths their tests need to cover. Above all, the value here is in the progress of the science itself. The next time someone puts forward a new idea for measuring complexity, a validated, empirical test of its effectiveness will be there waiting for them.