Test-driven design, a methodology for low-defect software

CodeProject I wrote earlier about the good practices in designing APIs, which is so important when developing complex software. However one usually does not have the chance to start a product from scratch. This means that more often than ever, a software manager picks up an existing tool with an existing team. Making the tool more efficient –better QoR, faster runtime, smaller memory footprints, more stability, new features, etc— is made difficult by legacy code, awkward APIs, or plain wrong architecture. What to do then? We usually cannot afford to rewrite all or major parts of the product. Does that mean that we are stuck with an endless cycle of resource-intensive software incremental changes, often creating as many bugs that they are intended to fix?

Defect rate

First I would like to discuss the notion of software reliability and how it evolved over the past 40+ years. A defect causes an invalid behavior of a program with respect to its specification (e.g., incorrect output, performance issue, crash). One of many ways to look at software quality is to estimate its defect rate, i.e., the number of defects per line of code (loc), or more conveniently per 1,000 lines of code (kloc).

The first observation is that the larger the code, the higher its defect rate. It is estimated that the bug rate increases logarithmically with code size.

Thus the total number of defects for a specific application can be reduced by the following:

Continuous code factorization (direct loc reduction).

Use of libraries (which have a reduced bug rate, thanks to the extensive exposure they receive due to their long lifespan and high usage).

Increase the expressive power of the programming language (indirect loc reduction).

Since the introduction of FORTRAN in 1957, many languages and operating systems have been created and have grown more powerful and sophisticated. What could be typically coded in 10 klocs of FORTRAN can be coded today with less than 5 klocs of C++, and about 3-4 klocs of Java. Raising the level of abstraction of programming languages helps decreasing the total number of defects because it results in smaller programs with a lower defect rate.

Evidently, testing reduces the defect rate. A software powerhouse like Microsoft reports about 10-20 defects/klocs before QA, and claims that the rate drops to 1/kloc in released code. Looking at long lifespan and very critical code, statistic from the Jet Propulsion Laboratory shows that spacecraft software (which is typically only 20 klocs, and must run without interruption for years) reaches 6-10 defects/klocs after 2-5 years of testing. The code developed for the shuttle program is estimated to have less than 0.1 defect/klocs.

Over the past 40 years, independent researches from academia and the private sector have shown that on average an application has a defect rate of 5.5/klocs, regardless of the programming language and the operating system used for development. This looks counterintuitive, since increasing the abstraction level of the programming language reduces the bug rate and the actual size of one specific application. But that progress is neutralized by the ever-increasing size and complexity of the programs, made possible by better software development methodologies and powerful development environments. To put a defect rate of 5.5/kloc in perspective, consider your typical EDA place-and-route product, say 3Mlocs of C/C++, with a likely high turnover rate (i.e., percentage of locs that are modified in every release). You can expect about 16,000 defects…

Test-Driven Design

Now I will present a method that I successfully used for both existing and from-scratch products. It is based on the observation that independently from the quality of the team and the advancement of the tool, the software complexity and the unpredictable evolution of the product makes managing the software quality quite problematic. Think EDA, where customers ask for new capabilities every week and salespeople sell features 6 or 12 months before they are actually developed. It is difficult, if not impossible, to have an upfront, clean, and frozen specification, from which an architecture and a set of APIs can be derived. One needs to change the architecture and the APIs because of new unpredicted features and unforeseen problems, or simply because the software is written in a hurry without the adequate resources –I have no doubt that most readers will agree on that last point. This creates bloated code with a high defect rate, which result in application with a larger number of bugs.

Test-driven design flips the traditional software development scheme upside-down. In most cases, the software development flow consist of (1) specify the requirements in some language (e.g., English, ML, C++ or Java header files), and (2) iterate a code/test loop until the software reaches a point where it is deemed stable enough to go through a full QA regression release process. This often leads to slow iterations between the release team and the R&D team before the release is fully qualified. Also the essence of the original specification may be lost because there is no concrete way (read: operational semantics) to check whether the released product actually meets its intended requirements.

Traditional vs. test-driven software development flow

Contrast this with a test-driven design approach. In that methodology, the tests are written before anything else. The goal is to capture the specification with a set of small (positive and negative) unit tests. Then some code is written and run on the unit tests. Some of the tests fail, which lead to further refinement of both the unit tests and the code. This iteration write-test/code/test converges until one cannot design a new test that would break the code. The next step, QA regression release process, can then be carried on.

A few things are important to recognize in a test-driven software development methodology: (1) the spec is the set of unit tests; (2) therefore the release can be validated as meeting the spec; (3) the testing iteration handled by R&D is closed when the unit tests and the code are fully stable, which leads to fewer iterations between the release and R&D teams; and (4) this methodology does not assume anything about the intrinsic quality of the code and the strength of the development team. Indeed this approach can be used on very badly architected code and still lead to substantial improvements. Also note that the unit tests can be internal, e.g., written in C++ and providing a self-testing mechanism, or more traditional with external data that are fed to the application.

Case studies

Let me give a few concrete examples. A tool I was in charge of contained some legacy code that performed an essential task in EDA: constant propagation (it consists of propagating logic values through a logic network, following basic computation rules, e.g., NOT(0) = 1, AND(0, 1) = 0, and AND(1, 1) = 1). The computational principles are simple, but a good constant propagation system should be lazy, incremental, support undo, may explain to the user why some constant occurs in some part of the network, etc. This makes the development of the system much more challenging.

The legacy code produced crashes now and then. It was difficult to read, it contained suspicious piece of code to handle corner cases (e.g., multi-driver nets, user-set constants), and it had a poor testing coverage (<50%). I decided to go for a full rewrite with a clean API, and unit tests were developed together with the new code following a TDD methodology. This resulted in 6267 loc of C++, 40% of which being unit tests (click the screenshot of the C++ unit tests below), made of 1415 asserts. That code was release in May 2007, got 3 reported defects until November 2007, and has been without defect since then.

Another example is a C++ template’ized bitwise four-valued simulator, written to match the Verilog semantics. This was done with 8014 loc of C++, including 40% of unit tests, made of 1015 asserts (click the screenshot below: you can recognize the basic four-valued logic truth tables). The template was self-tested with three different concrete instances of logic representation (on 2-tuples of bool, on strings made of 32 or 64 characters ‘0’, ‘1’, ‘x’, and ‘z’, and finally on an actual logic netlist). No defect was ever found on the semantics.

In both these cases, I had the opportunity of rewriting or starting from scratch. What if one has to improve on an existing system too large to be rewritten?

The third example is about a complex feature (sequential clock gating) that at the time had been released 6 months before. The field complained about inconsistencies and erratic behavior, so I decided to apply a TDD methodology to rectify the code. First hurdle, we established a unit test campaign, which consists of describing the spec in terms of unit tests in plain English and sketches. This produced 49 unit tests, as shown below (click to enlarge).

Second hurdle, we proceeded to translate these informal unit test descriptions into elementary RTL descriptions. The idea was that if the code was compliant to the spec, we could predict exactly which optimized netlist it would produce. Third hurdle, a 3rd party reviewed these 49 RTL tests, and found that 9 of them were faulty because they did not capture what was specified in the document. Once we fixed these tests came the fourth hurdle: we run the code.

The results were brutal: the code crashed on 3 tests, it synthesized a functionally incorrect netlist in 5 cases, and produced 13 suboptimal results. Overall, 21 failures out of 49 tests, a 43% defect rate! We then went through a 2 weeks iteration of unit test refinement and code fixing with a team that never touched the initial code, to eventually converge on 72 unit tests –many more than we could think of initially– and a usable feature.

Conclusion

Test-driven design (TDD) aims at capturing a spec with unit tests, then have some code successfully running these tests. The unit tests are more important than the code itself –any code that passed the unit tests meets the spec–. TDD initially requires a higher investment: writing unit tests to capture an expected behavior is a complex task, and a 3rd party review is needed to validate them. But the effort pays off: eventually the set of unit tests becomes the spec, and can even be used as documentation. Running unit tests is fast, so it dramatically reduces the R&D testing time. Also once a code passes a comprehensive set of unit tests, the risk of iterating from QA back to R&D is reduced. Overall, test-driven design increases code correctness and stability dramatically, even in the presence of a deficient architecture and legacy code.

I’ve had success using TDD personally, though in spreading the practice there’s still a hurdle to convince people to do the “extra” work up front. It’s nice of you to share specific data about actual EDA software.

One question: I don’t quite get the point you were trying to express in emphasizing that the team “never touched the initial code.” I can read it two different ways with nearly opposite meanings: “TDD is so powerful that it drives improvement even with developers who don’t understand the original code” or “TDD works best with developers who have not been prejudiced by the original code.” (And neither may be what you intended!) Can you clarify?

Let me explain the team “never touched the initial code”. The project was getting moved left and right because of QoR issues, but without real improvement. That was happening in the middle of drastic changes (major outsourcing…), so I ended up working with a small team of people in Bangalore to take over that project. In that team, only one person was somewhat familiar with the code, and nobody had any training in terms of software quality. TDD was the perfect opportunity *and* the only option to use there, because (1) as you pointed out, there was no objection to go that route since the developers had not seen any route so far, and (2) because we couldn’t have relevant training sessions on the code, so it was better to have people jump in and fix issues as they were refining the unit tests, which led them to re-learn and re-appropriate the code.