One of the most important questions in software testing is "how much is enough?" For combinatorial
testing, this question includes determining the appropriate level of
interaction that should be tested. That is, if some failure is
triggered only by an unusual combination of more than two values,
how many combinations must be tested to detect all errors? What degree of interaction occurs in real system failures?

If you have questions, or would like to contribute data to this collection, please email me at kuhn@nist.gov.

Failure data: The table below summarizes what we know from empirical studies in a
variety of application domains, showing the percentage of failures that
are induced by the interaction of one to six variables. For
example, 66% of the medical device failures were triggered by a single variable
value, and 97% were triggered by either one or two variables
interacting. Although certainly not conclusive, the
available data suggest that the number of interactions involved in
system failures is relatively low, with a maximum of 4 to 6 in the
six studies cited below. (Note: the
TCAS study used seeded errors; all others are "naturally
occurring".) These results can be summarized in what we call the Interaction Rule: most failures are induced by single
factor faults or by the joint combinatorial effect (interaction) of two
factors, with progressively fewer failures induced by interactions among
three or more factors.
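The practical consequence of the Interaction Rule is that a test suite covering all t-way combinations of parameter values, for small t, should detect nearly all failures, at far lower cost than exhaustive testing. As a rough illustration (a sketch, not part of the studies above), the snippet below enumerates the t-way value combinations that such a suite must cover, for a hypothetical system of ten boolean configuration flags:

```python
from itertools import combinations, product

def tway_combinations(params, t):
    """Enumerate every t-way combination of parameter values that a
    t-way test suite must cover (illustrative; not a covering-array
    generator, which packs many combinations into each test)."""
    combos = []
    for positions in combinations(range(len(params)), t):
        for values in product(*(params[p] for p in positions)):
            combos.append(tuple(zip(positions, values)))
    return combos

# Hypothetical system: 10 boolean configuration flags.
params = [[0, 1]] * 10
pairs = tway_combinations(params, 2)
print(len(pairs))    # 45 position pairs x 4 value pairs = 180 combinations
print(2 ** 10)       # 1024 exhaustive tests
```

Since each actual test assigns all ten parameters at once, one test covers many of these 180 pairs simultaneously, which is why pairwise test suites can be dramatically smaller than exhaustive ones.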

Static analysis data: Why do the fault detection curves look this way? That is, why
does the error rate tail off so rapidly as more variables
interact? One possibility is that there are simply few complex
interactions at branching points in software. If few branches
involve 4-way, 5-way, or 6-way interactions among variables, then this
degree of interaction could be rare for faults as well. The table
below (Table 2 and Fig. 2) gives the number and percentage of branches
in avionics code triggered by one to 19 variables. I developed
this distribution from
Chilenski's report on the use of MCDC testing in avionics software,
which covers 20,256 logic expressions in five different airborne
systems in two different airplane models. The table below
includes all 7,685 expressions from
if and while statements; expressions from assignment (:=) statements were excluded.

Table 2. Number of variables in avionics software branches

Vars    Count    Pct      Cumulative
1       5691     74.1%     74.1%
2       1509     19.6%     93.7%
3        344      4.5%     98.2%
4         91      1.2%     99.3%
5         23      0.3%     99.6%
6          8      0.1%     99.8%
7          6      0.1%     99.8%
8          8      0.1%     99.9%
9          3      0.0%    100.0%
15         1      0.0%    100.0%
19         1      0.0%    100.0%

As
shown in Fig. 2, most branching statement expressions are simple, with
over 70% containing only a single variable. Superimposing the
curve from Fig. 2 on Fig. 1, we see (Fig. 3) that most faults are
triggered by more complex interactions among variables. It is
interesting that the NASA distributed database faults, from
development-phase software bug reports, have a distribution similar to
that of expressions in branching statements. This may be accounted for by
the fact that this was development-phase software, rather than fielded software
like all the other types reported in Fig. 1. As faults are removed,
the remaining faults may be harder to find because they require the
interaction of more variables. Thus testing and use may push the
curve down and to the right.
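The tally behind Table 2 can be approximated mechanically: parse the source, find each if and while statement, and count the distinct variables in its condition. A minimal sketch of that idea, using Python's ast module on Python source rather than the Ada-style avionics code of the original study:

```python
import ast

def branch_variable_counts(source):
    """Count the distinct variable names in each if/while condition,
    roughly mirroring the per-branch tally behind Table 2
    (illustrative only; the original study analyzed avionics code)."""
    counts = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.If, ast.While)):
            names = {n.id for n in ast.walk(node.test)
                     if isinstance(n, ast.Name)}
            counts.append(len(names))
    return counts

code = """
if a and b or c:
    pass
while x > 0:
    x -= 1
"""
print(branch_variable_counts(code))  # [3, 1]
```

A histogram of these counts over a large code base would yield a distribution directly comparable to Table 2.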

Narrow software domains: We have investigated a particular class
of vulnerabilities, denial-of-service, using reports from the National Vulnerability Database
(NVD), a repository of data on all publicly reported software security
vulnerabilities. NVD can be queried for fine-granularity reports
on vulnerabilities. Data from 3,045 denial-of-service
vulnerabilities have the distribution shown below.

Vars    NVD cumulative %
1        93%
2        99%
3       100%
4       100%
5       100%
6       100%

Analyzing failure data: Our Summer Undergraduate Research Fellowship student Menal Modha has prepared a
short tutorial
on analyzing failure reports to determine the degree of variable
interaction in failures. The NVD data sets were analyzed by Evan
Hartig, Bryan Wilkinson, and Menal Modha, and are available here:

Combinatorial vs. exhaustive testing: The
studies cited below compare combinatorial methods with exhaustive (with
respect to discretized values) testing, showing that combinatorial testing
produced equivalent results with only a small fraction of the tests
required for exhaustive testing:

Giannakopoulou et al. (2011) compared
various code coverage metrics using combinatorial and exhaustive
testing. Coverage was nearly identical, and the authors reported that
only one segment of code was missed by 3-way testing, because it
required a specific combination of four variables that would have been
caught with a 4-way covering array.

Test set     No. of tests   Stmt %   Branch %   Loop %   Condition %
Exhaustive   9.9 x 10^6     94/94    90/92      46/37    85/83
3-way        6,047          93/94    89/91      46/37    83/81

Montanez
et al. (2011) compared combinatorial and exhaustive test sets for
conformance testing of the W3C Document Object Model. A 4-way
test suite found all faults discovered by exhaustive testing with less
than 5% of the number of tests.
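The size reductions in these studies come from covering arrays, which pack many t-way combinations into each test. A minimal greedy construction for the pairwise (2-way) case is sketched below; this is illustrative only, and not the algorithm used by production tools such as NIST's ACTS, which are far more efficient:

```python
from itertools import combinations, product

def greedy_pairwise(params):
    """Greedily build a 2-way (pairwise) covering array: repeatedly
    pick the candidate test that covers the most uncovered pairs."""
    uncovered = {(i, vi, j, vj)
                 for i, j in combinations(range(len(params)), 2)
                 for vi in params[i] for vj in params[j]}
    tests = []
    while uncovered:
        best = max(product(*params),
                   key=lambda t: sum((i, t[i], j, t[j]) in uncovered
                                     for i, j in combinations(range(len(t)), 2)))
        tests.append(best)
        uncovered -= {(i, best[i], j, best[j])
                      for i, j in combinations(range(len(best)), 2)}
    return tests

params = [[0, 1]] * 6            # six binary parameters: 64 exhaustive tests
suite = greedy_pairwise(params)
print(len(suite))                # far fewer than 64 (typically 6-8 tests)
```

Exhaustively scanning all candidate tests each round makes this only suitable for tiny parameter spaces, but it shows why a handful of tests can cover every pair of parameter values.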