Monday, December 19, 2011

Really good code usually consists of a large number of relatively small subroutines (or methods) that can be composed as building blocks in many different ways. But when I review production embedded software I often see the use of fewer, bigger, and less composable subroutines.This makes the code more bug-prone and often makes it more difficult to test as well.

The reason often given for not using small subroutines is runtime execution cost. Doing a subroutine call to perform a small function can slow down a program significantly if it is done all the time. One of my soapboxes is that you should almost always buy a bigger CPU rather than make software more complex -- but for now I'm going to assume that it is really important that you minimize execution time.

Here's a toy example to illustrate the point. Consider a saturating increment, that will add one to a value, but will make sure that the value doesn't exceed the maximum positive value for an unsigned integer:

int SaturatingIncrement(int x)

{ if (x != MAXINT)

{ x++;

}

return(x);

}

So you might have some code that looks like this:

...

x = SaturatingIncrement(x);

...

z = SaturatingIncrement(z);

You might find that if you do a lot of saturating increments your code runs slowly. Usually when this happens I see one of two solutions. Some folks just paste the actual code in like this:

...

if (x != MAXINT) { x++; }

...

if (z != MAXINT) { z++; }

A big problem with this is that if you find a bug, you get to to track down all the places where the code shows up and fix the bug. Also, code reviews are harder because at each point you have to ask whether or not it is the same as the other places or if there has been some slight change. Finally, testing can be difficult because now you have to test MAXINT for every variable to get complete coverage of all the branch paths.

A slightly better solution is to use a macro:

#define SaturatingIncrement(w) { if ((w) != MAXINT) { (w)++; } }

which lets you go back to more or less the original code. This macro works by pasting the text in place of the macro. So the source you write is:

...

SaturatingIncrement(x);

...

SaturatingIncrement(z);

but the preprocessor uses the macro to feed into the compiler this code:

...

if (x != MAXINT) { x++; }

...

if (z != MAXINT) { z++; }

thus eliminating the subroutine call overhead.

The nice things about a macro are that if there is a bug you only have to fix it one place, and it is much more clear what you are trying to do when there is code review. However, complex macros can be cumbersome and there can be arcane bugs with macros. (For example, do you know why I put "(w)" in the macro definition instead of just "w"?) Arguably you can unit test a macro by invoking it, but that test may well miss strange macro expansion bugs.

The good news is that in most newer C compilers there is a better way. Instead of using a macro, just use a subroutine definition with the "inline" keyword.

inline int SaturatingIncrement(int x)

{ if (x != MAXINT)

{ x++; }

return(x);

}

The inline keyword tells the compiler to expand the code definition in-line with the calling function as a macro would do. But instead of doing textual substitution with a preprocessor, the in-lining is done by the compiler itself. So you can write your code using as many inline subroutines as you like without paying any run-time speed penalty. Additionally, the compiler can do type checking and other analysis to help you find bugs that can't be done with macros.

There can be a few quirks to inline. Some compilers will only inline up to a certain number of lines of code (there may be a compiler switch to set this). Some compilers will only inline functions defined in the same .c file (so you may have to #include that .c file to be able to inline it). Some compilers may have a flag to force inlining rather than just making that keyword a suggestion to the compiler. To be sure inline is really working you'll need to check the assembly language output of your compiler. But, overall, you should use inline instead of macros whenever you can, which should be most of the time.

Sunday, November 27, 2011

EEPROM is invaluable for storing operating parameters, error codes, and other non-volatile data. But it can also be a source of problems if stored values get corrupted. Here are some good practices for avoiding EEPROM problems and increasing EEPROM reliability.

Pay attention to error return codes when writing values, and make sure that the data you wanted to write actually gets written. Read it back after the write and checking for a correct value.

It takes a while to write to data. If your EEPROM supports a "finished" signal, use that instead of a fixed timeout that might or might not be long enough.

Don't access the EEPROM with marginal voltage. It is common for an external EEPROM chip to need a higher minimum voltage than your microcontroller. That means the microcontroller can be running happily while the EEPROM doesn't have enough voltage to operate properly, causing corrupted writes. In other words, you might have to set your brownout protection at a higher voltage than normal to protect EEPROM operation.

Have a plan for power loss during a write cycle. Will your hold-up cap keep things afloat long enough to finish the write? Do you re-initialize the EEPROM when the CPU powers up in case the EEPROM was left half-way through a write cycle when the CPU was reset?

Don't use address zero of the EEPROM. It is common for corruption problems to hit address zero, which can be the default address pointed to in the EEPROM if that chip is reset or otherwise has a problem during a write cycle.

Watch out for wearout. EEPROM can only be written a finite number of times. Make sure you stay well within the wearout rating for your EEPROM.

Consider using error coding such as a CRC protection for critical data items or blocks of data items stored in the EEPROM. Make sure the CRC isn't re-written so often that it causes EEPROM wearout.

If you've seen other problems or strategies that you'd like to share, let me know!

Friday, November 4, 2011

This past summer I had the opportunity to work with my friend Gautam Khattak on several industry design reviews. By the time the dust had settled, we'd had a lot of time to think about patterns of common problems and the types of things that designers should be looking for in peer reviews for embedded system software.

As a result, we've written a two-page checklist for embedded system code reviews that you are free to use however you like.

The sections include: function, style, architecture, exception handling, timing, validation/test, and hardware. This is a pretty thorough starting point, but no doubt you will want to tailor these to your specific situation.

The checklists are broken into sections to make it easier to do perspective-based reviews. The idea is to divide up the sections among 3 or 4 reviewers so that each reviewer is thinking about somewhat different aspects of the code, resulting in better review coverage of potential issues.

Everyone has their own favorite thing to look for in a review. If you think we've missed one please let us know with a comment. One thing we have intentionally left out is system-level problems that are better done in a system-level review. These are primarily intended for module reviews that look at a couple hundred lines of code at a time.

Again, you can use these within your company or for any other purpose without having to ask our permission. But we'd be happy to hear from you if you've found them especially useful or if you have suggestions for things we've missed.

Friday, October 14, 2011

Every once in a while I'm asked to take a look at a scheme for ensuring that an embedded system is secure for a variety of reasons. Generally the issue isn't really secrecy so much as making sure that an embedded device is really what you think it is, or that a license fee has been paid for a service, or that some bad guy hasn't broken in to a system to issue bogus commands. These are hard problems for any system, but can be made especially difficult for embedded systems because severe cost constraints can discourage the use of bank-grade IT security solutions.

There is plenty out there on the web about security, so I'm not going to attempt a comprehensive tutorial here. (There is a chapter in my book that covers the basics as applied to embedded systems.) But one thing that can be difficult to find is a comprehensive list of high-level pitfalls for embedded security all in one compact list. So here it is. Ignore these hard-won lessons from the security community at your own peril.

Security by obscurity never works against a serious adversary. Any sentence that contains the phrase "the bad guys will never figure out how" is false.

Almost nobody is good enough to cook up their own cryptography. Use a good standard algorithm. If that costs too much, then realize you're just keeping out the clueless.

Secret algorithms never remain secret. Any security that hinges upon the secrecy of an encryption algorithm is a bad idea. (Those sorts of systems are snake oil.)

Weak algorithms will be broken, and probably publicized for bragging rights.

Even if an algorithm isn't broken, it might be subject to simple cloning or a replay attack that captures and repeats an encrypted message such as a door unlock command.

Never use a back-door key, or manufacturer secret key. The information will get out. Place your faith only in a large, unique-per-device random secret key.

Make sure you don't generate weak or non-random keys. All the above fallacies apply to the algorithm you use for generating random keys.

Don't forget the possibility of a disgruntled employee or a rubber hose attack disclosing your algorithm or master key.

Don't use encryption as your primary tool when what you really need is a secure hash, signature, or MAC for authentication. There are endless possibilities for getting it wrong if you use the wrong tool for the job.

Don't forget to have a plan for upgrading security or fixing bugs when they appear. Remember that update mechanisms can be attacked (for example, via a trojan horse update file).

Instances of poor security biting embedded system designers are ubiquitous. You just haven't heard most of them because nobody is allowed to talk about their own company's problems.

The point of the above list is the following: if you don't understand what a bullet is and why I'm saying it, you should probably dig deeper into security before you ship a product that needs to be secure. There are always exceptions, but they are rare. This is one of those cases where 80% of ordinary designers think they're the 1% exception case that is clever enough to get away with bending the rules -- and it's just not so.

If you have suggestions for additional items for high level of pitfalls I'd love to hear them.

Friday, September 30, 2011

An important principle in code optimization is speed up the parts that matter. One of the interpretations of Amdahl's Law is that speedup is limited by the fraction of total time taken by a particular piece of code. If for example a particular snippet of code is 10% of overall execution time, then making it execute infinitely fast is only going to give you a 10% speedup. 10% might be worthwhile, but if instead a snippet of code only takes 0.001% of total execution time, then it is unlikely spending time optimizing it is worth your while. In other words, there is no point wasting hours optimizing parts of the code that won't make any difference to overall execution speed.

But, how do you know what parts of the code matter? Intuition can help sometimes, but in practice it is hard to know your code well enough that you always guess the bottleneck locations correctly.

If you have a tool-rich development environment you can use a profiling tool (wikipedia has a description and a list of common tools). But sometimes you are on a platform that is too small to use these tools, or you're in too much of a hurry to use them because you are just absolutely sure you are right about where all the time is being spent.

Here is a trick that can save you a lot of time doing optimization. I use it pretty much every time I'm going to optimize code as a sanity check (even if a tool tells me where to optimize, because tools aren't perfect and I've learned to be careful before investing valuable time optimizing). The trick is: insert some NOPs in the code you want to optimize and see how much slower it goes. If you don't see a speed difference, then you are wasting your time optimizing that part of the code to make it go faster. If you do see a speed difference, that difference can help you estimate how much speedup you are going to get if you do the optimization.

Here's some example C code before optimization:

for(int i = 0; i < 1000; i++)
{ x[i] += y[i];
}

Depending on your compiler it might well be that you can optimize this by switching to pointers instead of indexed arrays. Assuming that you've decided it is worth the trouble to optimize your code instead of buying a faster processor (that's a whole other discussion!) then probably this code can be made faster. But, if this code is only called once in a while and is only a small part of your overall program it might not give enough speedup to be worth optimizing.

Let's say you think you can squeeze 5 clock cycles per loop out of this code with a rewrite. Before you optimize it, use a timer (or an oscilloscope watching an output pulse on an I/O pin, or a stopwatch) to run the code as-is, and then time the following code to see if there is a speed difference:

for(int i = 0; i < 1000; i++)
{ x[i] += y[i];

asm("nop");

asm("nop");

asm("nop");

asm("nop");

asm("nop");
}

This inserts 5 NOP do-nothing instructions that are executed every time the loop is executed. Assuming a NOP instruction takes 1 clock, it slows the loop down by 5 clock cycles.

If the whole program with the NOPs in the loop runs 10% slower, then you know that optimizing the loop to save 5 clocks will likely cause the whole program to run about 10% faster. (The slowdown and speedup aren't quite the same percent if you think about the math, but for small percents it's close enough to not worry about this.)

If on the other hand adding a bunch of NOPs makes no difference to the overall program speed, then you are wasting your time optimizing that loop. If adding instructions doesn't make things noticeably slower, then removing them won't make things faster.

The purpose of this code is to set min, max, and emergency shutdown thresholds for voltage coming from some A/D converter. The programmer has manually converted the voltage value to the number of A/D integer steps that corresponds to. (Similar code might be used to express time in terms of timer ticks as well.) Do you see the bug in the above code? Would you have noticed if I hadn't told you a bug was there?

Here is a way to save some trouble and get rid of that type of bug. First, figure out how many steps there are per unit that you care about, such as:
// A/D converter has 37 steps per Volt ( 37 = 1 Volt, 74 = 2 Volts, etc.)
#define UNITS_PER_VOLT 37

then, create a macro that does the conversion such as:
#define VOLTS(v) ((unsigned int) (((v)*UNITS_PER_VOLT)+.5))

Here is an example to explain how that works. If you say VOLTS(3.30) then "v" in the macro is 3.30. It gets multiplied by 37 to give 122.1. Adding 0.5 lets it round to nearest positive integer, and the "(int)" ensures that the compiler knows you want it to be an unsigned integer result instead of a floating point number.

Because the macros are expanded in-line by the preprocessor, most compilers are able to compile exactly the same code as if you had hand-computed the number (i.e., they compile to a constant integer value). So this macro shouldn't cost you anything at run time, but will remove the risk of for hand computation bugs or someone forgetting to update comments if the integer value is changed. Give it a try with your favorite compiler and see how it works.

Notes:

The rounding trick of adding 0.5 only works with non-negative numbers.Usually A/D converters output non-negative integers so this trick usually works.

This may not work on all compilers, but it works on all the ones I've tried on various microcontroller architectures. I saw one compiler do the float-to-int conversion at run-time if I left out the "(unsigned int)" in the macro, so make sure you put it in. Do a disassembly on your code to make sure it is working for you.

The "const" keyword available in some compilers can optimize things even further and possibly avoid the need for a macro if your compiler is smart enough, but I'll leave that up to you to play with if you use this trick in a real program.

As mentioned by one of the comments, you might also check for overflow with an ASSERT.

Monday, August 1, 2011

I recently worked with some embedded system teams who were struggling with the best way to use .c and .h files for their source code. As I was doing this, I remembered that back when I was learning how to program that it took me quite a while to figure all this out too! So, here are some guidelines on a reasonable way to use .c and .h files for organizing your source code. They are listed as a set of rules, but it helps to apply common sense too:

Use multiple .c files, not just one "main.c" file. Every .c file should have a set of variables and functions that are tightly related to each other, and only loosely related to the other .c files. "Main.c" should only have the main loop in it.

.H files define external storage and define function prototypes The .h files give other modules the information they need to work with a particular .c file, but don't actually define storage and don't actually define code. The keyword "extern" should generally be showing up in .h files only.

Every .c file should have a corresponding .h file. The .h file provides the external interface information to other modules for using the corresponding .c file.

Here is an example of how this works. Let's say you have files: main.c adc.c output.c process.c watchdog.c

main.c would have the main loop that polls the A/D converter, processes values, sends outputs, and pets the watchdog timer. This would look like a sequence of consecutive subroutine calls within an infinite loop. There would be a corresponding main.h that might have global definitions in it (you can also have "globals.h" although I prefer not to do that myself). Main.c would #include main.h, adc.h, output.h, process.h and watchdog.h because it needs to call functions from all the corresponding .c files.

adc.c would have the A/D converter code and functions to poll the A/D and store most recent A/D values in a data structure. The corresponding adc.h would have extern declarations and function prototypes for any other routine that uses A/D calls (for example, a call to look up a recent A/D value from polling). Adc.c would #include adc.h and perhaps nothing else.

output.c would take values and send them to outputs. The corresponding output.h would have information for calling output functions. Output.c would probably just #include output.h. (Whether main.c actually calls something in output.c depends on your particular code structure, but I'm assuming that outputs have two steps: process.c queues outputs, and the main.c call actually sends them.)

process.c would take A/D values, compute on them, and queue results for output. It might need to call functions in adc.c to get recent values, and functions in output.c to send results out. For that reason it would #include process.h, adc.h, output.h

watchdog.c would set up and service the watchdog timer. It would #include watchdog.h

One wrinkle is that some compilers can only optimize in a single .c file (e.g., only do "inline" within a single file). This is no reason to put everything in a single .c file! Instead, you can just #include all the other .c files from within main.c. (You might have to make sure you include each .h file once depending on what's in them, but often that isn't necessary.) You should avoid #including a .c file from within another .c file unless there is a compelling reason to do so such as getting your optimizer to actually work.

This is only intended to convey the basics. There are many hairs to split depending on your situation, but if you follow the above guidelines you're off to a good start.

Friday, July 1, 2011

The following is the extended version of a position statement I wrote for a panel session on Grand Challenges in Dependability for DSN 2011 in Hong Kong. Even if you are not a researcher in this area, it might provide some food for thought about the big picture, especially for embedded system security and safety.
You can find a printable version here.

Embedded systems permeate our everyday lives, including applications as diverse as cars, consumer electronics, thermostats, and industrial process controls. We have a surprising amount of reliance upon these systems, and we take their dependability almost for granted. Given extreme cost constraints, tremendous deployment scales, and the wide range of application domains, it is amazing that things more or less work well today. But, as application complexity increases, more applications become safety critical, and more embedded systems are attached to the Internet, we cannot expect business as usual with design approaches to maintain the level of dependability we want and need from such systems.

In my opinion the biggest challenges facing embedded systems lie in the areas of creating more suitable security techniques, finding a more unified approach to safety+security, dealing with composable emergent properties, and deploying dependability techniques to small product development teams.

Deeply embedded system security has significantly different constraints and requirements than enterprise and personal computing security. Embedded control systems often have severe resource constraints, limited development budgets, and stringent real time performance requirements. But an even more pressing security problem in many embedded systems is that the effects of a malicious fault can cause physical damage to people and the environment. It is less difficult to reverse or adequately insure against most malicious financial transactions than it is to reverse the release of toxins into the environment or “roll back” a multi-vehicle collision. Additionally, most embedded systems to date have been designed with near-zero security once an attacker has access to the internal control network. IT-based techniques for addressing that situation are unlikely to suffice due to matters of cost, real time dynamics, and lack of complete physical isolation from attackers.

Inevitably, embedded system safety and security will have to merge into a unified discipline, or at least a tightly-coordinated set of sub-disciplines. It is questionable to build safety cases for most everyday systems upon a faulty presumption of perfect security. At the same time, security techniques will need to take into account the safety implications of vulnerabilities and system outages. One element of a safety and security unification strategy might be to look at security faults as an attack on the assumptions of the safety case (e.g., an attacker negates the random independence assumption of fault arrival rates).

Due to the limitations and realities of embedded system development, workable dependability approaches will likely include some notion of cost-effective resilience in the face of inevitable faults, as well as a way to balance the tension among the often conflicting goals of safety, security, performance, and reliability. The good news will be that there are opportunities to exploit domain characteristics such as physical process inertia in ways not practical in desktop and enterprise computing.

A long-standing problem has been increasing the composability of emergent system properties. It is desirable to have building blocks that can be composed arbitrarily without surprises, and by the same token have an ability to decompose a system architecture so that predictable building blocks can be identified in a way that minimizes cross-coupled quality attributes. Much progress has been made on this in the area of real time systems, but much remains to be done in other areas such as safety and security. An additional challenge will be ensuring the composability of massively deployed distributed systems so that, for example, a city full of smart thermostats doesn’t display emergent aggregate behavior that takes down the power grid.

Finally, the serious challenges posed by creating a dependable system are made more difficult when the development teams are typically composed of a handful of domain experts who may have no formal computer training beyond an introductory programming course. The traditional way to deploy advanced knowledge is via synthesis and analysis tools, and this has been done with astonishing success in IC design. More recently, model-based design has been helping embedded system designers in some domains perform code synthesis from relatively high level system behavioral descriptions. But will be a long time before tools can provide us with push-button automation that addresses the myriad aspects of embedded system dependability.

Sunday, June 5, 2011

Have you ever written perfect software? Really? (And if you did, how exactly do you know that?) If you're imperfect like the rest of us, how do you take that into account in your software architecture?

As far as I can tell nobody knows how to write perfect software. Nearly perfect is as good as it gets, and that comes at exponentially increasing costs as you approach perfection. While imperfect may be good enough for many cases, the bigger issue is that we all seem to act as if our software is perfect, even when it's not.

I first ran into this issue when I was doing software robustness testing (the Ballista project -- many moons ago). A short version of some of the conversations we'd have went like this. Me: "If someone passes a null pointer into your routine, things will crash." Most people responded: "well they shouldn't do that" or "passing a null pointer is a bug, and should be fixed." Or "nobody makes that kind of stupid mistake" (really??). We even found one-liner programs that provoked kernel panics in commercial desktop operating systems. Some folks just didn't care. But the folks who concentrated on highly available systems said: "thanks -- we're going to fix everything you find." Because they know that problems happen all the time, and the only way to improve dependability in the presence of buggy software is to make things resilient to bugs.

Now ask yourself about your embedded system. You know there are software bugs in there somewhere. Do you pin all your hopes on debugging finding every last bug? (Good luck with that.) Or do you plan for the reality that software is imperfect and act accordingly to increase your product's resilience?

Here are some of the techniques that can help.

Watchdog timer in case your system wedges (is it turned on? is it kicked properly?)

Tuesday, May 3, 2011

Thanks to everyone who attended my talk at ESC SV! There were about 80 attendees, and I especially appreciated the many questions and insightful comments from the audience.

The organizers ran out of handouts, so I'm posting the handout Acrobat file so everyone can get a copy. This copy has been updated to include some last-minute changes I made right before the talk. It summarizes a lot of the areas covered by my book:

Monday, May 2, 2011

I've received a couple notes saying that my book is out of stock on Amazon. What's really going on is that Amazon itself doesn't sell the book, but rather the publisher sells it as Fulfilled by Amazon (FBA). So what you are looking for is the "Available from these sellers" link and order from Geos Fulfillment, which is then shipped direct from Amazon's warehouse.

Or, you can order via Paypal. That ships for free via priority mail worldwide direct from the publisher's warehouse.

Wednesday, April 27, 2011

Jack Ganssle has a nice design article that details a long history of design errors, from bridge building to safety critical software problems. Much of it is about NASA mission failures, but he also checks in on the topics of radiation therapy, pacemakers, and nuclear experiments. Good one-stop shopping for horror stories and a discussion of high level patterns behind these sorts of problems.

On discussing the Therac 25:
"The FDA found the usual four horsemen of the software apocalypse at fault: inadequate testing, poor requirements, no code inspections, and no use of a defined software process."

Quote of the day from the article:
"Globals are responsible for all of the evil in the universe, from male pattern baldness to ozone depletion."

Monday, April 18, 2011

In previous posts I've talked about how useful peer reviews can be. But some many folks say they are just too expensive. That just isn't so. Here's the math.

You can write good embedded software at the rate of 1-2 lines of code per hour. (That is all source lines of code divided by all hours for a project, including requirements, test, etc. Really, that is the number.)

Formal, rigorous, thorough peer reviews target 100-200 lines of code reviewed per hour (we're talking Fagan style inspections, which are quite formal, but still the most cost effective review method from all the data I've seen). Let's say you have 4 people in an inspection. That is 25-50 lines of code per person-hour. If you include an hour of prep time for each person for a two-hour review, that is still 17-33 lines of code per person-hour.

Think you write code faster? OK, let's say you're Agile and do 3 lines of code/hr (what I've seen in industry, and comes with increased risk we can discuss elsewhere). That's still nowhere near the review productivity rate.

We're talking maybe 5%-10% of project effort to do heavy-weight reviews. And it generally finds half the bugs. More importantly, it finds them early, so you don't have a lot of rework.

Make the operation task-focused, simple, and consistent. Bring the operation itself as close to possible as what the user wants to accomplish (the outcome), not what the device knows how to do (the mechanisms available).

Keep things simplicity and consistent.

Make the terminology familiar (and task-focused, and simple). Use the meaning that people expect to see, not jargon. Use different terms whenever there is a concept that differs in an important way. Don't use different terms if the difference doesn't matter to the user.

Make mistakes low cost (e.g., provide undo) so people aren't afraid to explore the interface.

The concept of an objects/actions matrix looks interesting. The idea is you put objects as the rows of a matrix and actions on those objects as a column. Whenever an action is permissible for an object it gets a check mark in the matrix. Good designs have square, densely filled in matrices (simple, consistent). Bad designs have large, sparsely filled in matrices (complex, inconsistent).

This article is well worth the time for anyone who designs embedded systems.

Thursday, March 24, 2011

Perhaps the single most difficult thing to get right in a bug report is assigning a priority. It is tempting to assign a priority based on how spectacular the result is. But you also need to take into account how likely the bug is to manifest in a deployed system. For example, a bug that causes a system to crash and automatically reboot may be a lot more dramatic than a confusing screen message, but if that confusing screen message results in thousands of tech support calls, it could be a disaster for your company. The best approach is one that combines severity and probability.

It is tempting to try to use fancy math to combine severity and probability. Usually that doesn't work out so well in practice. Instead, I recommend borrowing a technique from the safety critical system community. They use a Risk Table to assign a "criticality" to a particular adverse event as part of their Preliminary Hazard Analysis (PHA). You can use the same table in a different way and just say you are assigning bug "priority" instead of "risk." Below is an example Risk Table:

Probability is your best estimate as to how often the bug will be seen in use. Consequence is how big a problem it will cause. The Risk (indicated by each box in the grid) is how big the risk is to product reputation -- which ought to be the same as the bug priority. It helps to have clearly defined statements to guide assigning any particular bug to a row and to a column. Once you assign probability and consequence, the table tells you the priority of that particular bug.

You can modify this table to have 3 to 6 rows and 3 to 6 columns depending upon your needs (the table can be a rectangle rather than a square). You can also modify the asymmetry of assigning risks as has been done in this example (consequence is weighted a bit higher than probability for this table by putting extra "Very High" boxes on the top row and so on). The point is not the table itself, but rather that binning things in this way makes assigning bug priority a lot easier for people to do in practice.

Monday, February 21, 2011

One of the imponderables when using a bug tracking system such as Bugzilla is how to assign "severity" to a particular defect entry. It is instinctive to assign severity based on how dramatic the outcome is. So in that sort of system an LED that sometimes doesn't blink might be "medium" and "low" and a system crash might be "critical." BUT, that's usually not the best way to look at severity.

Instead defect severity should be based on how urgent it is to get fixed for the business environment your product lives in. To use the above example, what if the non-blinking LED causes 1000 field support visits calls because people think the device isn't working properly (perhaps it is a cable modem and "blinking" means "I'm working properly")? What if the system crash is likely to happen to only one out of every 100,000 customers, and happens in a situation in which the customer is very likely just to cycle power to clear the problem without any big deal? In that situation the first defect might be "critical" and the second might only be "medium" severity.

So when you are thinking about assigning issue severity, consider the business context and not just how the defect feels to a tester or developer at a test bench. The general idea is that defect severity should correspond to value to the business of fixing the defect, and not how embarrassing it might be to the developers. (You can read more about good practices for defect tracking in chapter 24 of my book.)

Monday, January 24, 2011

Surprisingly, there is a fairly easy way to know if your peer reviews are effective. They should be finding about 50% of your defects, and testing should be finding the other 50%. In other words the 50/50 rule for peer reviews is they should find half the defects you find before shipping the product. If peer reviews aren't finding that many defects, something is wrong.

I base this rule of thumb on some study data and my observations of a number of real projects. If you want to find out if teams are really doing peer reviews (and doing ones that are effective), just ask what fraction of defects are found in peer reviews. If you get an answer in the 40%-60% range probably they're doing well. If you get a lower answer than 40%, peer reviews are being skipped or are being done ineffectively. If they answer is "we do them but don't log them" then most of the time they are being done ineffectively, but you need to dig deeper to find out what is going on.

If you are trying to find all your defects in test (instead of letting peer review get half of them for you), you are taking some big risks. Test is usually a more expensive way to find defects. More importantly, peer review tends to find many defects or poor design choices that are difficult to find by testing with any reasonable effort.

So, why make your testing expensive and your product more bug prone? Try some peer reviews and see what they find.

Thursday, January 13, 2011

Series Intro: this is one of a series of posts summarizing the different red flag areas I've encountered in more than a decade of doing design reviews of industry embedded system software projects. You can read more about the study here. The results of this study inspired the chapters in my book.

To conclude these series of postings, here is an observation to ponder.

One of the informal observations made across the course of these reviews was that developer teams with exactly 5 primary contributors have the most spectacular project failures. Invariably these teams had previously completed a project with 3 or 4 members successfully, and increased the team size to tackle a more complex project without making any changes in their software process. But they failed with the new, 5-person team.

While this is an anecdotal result, projects that grow past 4 developers in size should seriously consider switching to a heavier weight software process (more paper, more formality, more methodical rigor). Smaller teams still seem to benefit from good process, but basically can get away with informality with less dramatic risks than larger teams (5 or more developers) working on more complex projects.

Does this mean with fewer than 5 people you can simply ignore all the risk areas I've posted? No. Most of the reviews (perhaps 80%) were conducted on teams with fewer than 5 people. What this does mean is that with only one or a few developers you can get away with a lot and only have a few red flag risk areas. If you are lucky you will survive them, and if you are unlucky they will bite you. Hard. But most of the time you will slide by well enough that you will work nights, weekends, and have no social life -- but you will still have a job.

But, if you have more than 5 people you probably have to do most or even all of the process activities listed in these risk areas. If you blow off using a rigorous process, it is pretty likely you will fail. Probably you will fail spectacularly.

At least this is what I have observed doing 95 reviews over 10 years in industry. Your mileage may vary.

About Me

I've done embedded systems for big industry, the US military, startup companies, and now Carnegie Mellon University. I'm the author of the book Better Embedded System Software, which goes into more detail on most of the topics discussed in my corresponding blog.As with any blog, these posts often contain speculative and partially formed thoughts, and should not be interpreted as a fully considered opinion unless stated otherwise.Key pages:Academic home page at CMUEmbedded Software Blog Checksum and CRC Blog