Where I work we use an offshore company to augment several internal departments. I work in the QA team, and concerns have been aired about the level of confidence we can have in the test work we have been assigning to workers at this company.

I have suggested that it may be a good idea to use error seeding to get a metric of our confidence level in code that has been tested by this company, and compare it with the same metric as used internally.

This is a technique I learned about on an ISEB testing course; however, I have not met many people who have used it in a real-world environment, so I would like some input as to what best practices may be. The following are things I have thought of, but obviously it would be better to learn from the experience of someone who has tried this before.

Have developers seed multiple (n) bugs per testable item

Have developers track these bugs in a system that QA cannot access (I sketch the kind of ledger I have in mind below, after this list)

Tell the tester assigned to said testable item how large n is (approximately? exactly?)

For each bug the tester raises, tell them whether or not the bug was seeded
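To make the bookkeeping concrete, the following is a minimal sketch of the kind of seeded-bug ledger I have in mind - purely illustrative Python with made-up names, not an existing tool:

    # Hypothetical sketch of a seeded-bug ledger (names are placeholders);
    # it would live outside the tracker that QA can see.
    from dataclasses import dataclass, field

    @dataclass
    class SeededBug:
        item_id: str         # the testable item the fault was planted in
        description: str     # what was changed, so the fault can be removed later
        found: bool = False  # set to True when a tester raises a matching report

    @dataclass
    class SeedLedger:
        bugs: list = field(default_factory=list)

        def n_for(self, item_id):
            # "How large n is" for a given testable item.
            return sum(1 for b in self.bugs if b.item_id == item_id)

        def detection_rate(self, item_id):
            # Fraction of seeded bugs found for that item (None if nothing seeded).
            planted = [b for b in self.bugs if b.item_id == item_id]
            return sum(b.found for b in planted) / len(planted) if planted else None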

My reasoning is that if we make this metric visible and transparent, it will also give testers feedback as to how rigorous their testing is.

I would welcome input as to any practical issues that may make this difficult or even unworkable. If this is implemented I shall leave an answer updating everyone on my own experience with this.

Why are there reasons not to have confidence in the testing that is being done? Do you have evidence of missed bugs and/or poor testing?
– Phil Kirkham Mar 19 '12 at 16:26

Evidence of missed bugs - yes, however there were only two bugs which were known, and both were missed. My thinking is that while this looks bad, it doesn't really mean much on its own; systematically injecting bugs into the codebase would give a more accurate view of how the land lies, and if it's anywhere near as bad as it looks, give us a chance for improvement.
– theheadofabroom Mar 19 '12 at 16:41

Have you considered piloting your error seeding process on projects being tested internally? That way, you'll get to see if your process produces the desired metrics in a less adversarial setting.
– Joe Strazzere Apr 12 '12 at 11:41

3 Answers

I agree with Joe's assertion that metrics can be badly misused and counterproductive.

That said, error seeding can be a useful way to answer the right question.

After we develop a test plan, we generally assume two things:

The test plan is unlikely to uncover every error in the system. (If the system is large and complex, we can assume some undiscovered errors will remain.)

Still, the test plan should be able to uncover many errors in the system. (Otherwise, it's not a very well-designed test plan.)

Let E = the total number of errors in the system, D = the number of errors discovered by the test plan, and U = the total number of undiscovered errors. We know that E = D + U, but we can never be sure what E or U are. However, we'd like to think D/E is close to 1.

So, how do we calculate D/E - our error detection rate? Because we'll never know E or U, we can't! But we can inject faults into the code and then run the test plan. By seeing how many of the known, seeded errors get detected, we can calculate an error detection rate in a controlled environment. Knowing whether this controlled error detection rate is high or low might prove instructive.
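To make the arithmetic concrete, here is a rough Python sketch (the numbers are invented purely for illustration):

    # Illustrative only: numbers are made up to show the calculation.
    seeded = 20          # faults deliberately injected before running the test plan
    seeded_found = 14    # how many of those the test plan actually caught

    controlled_detection_rate = seeded_found / seeded
    print(f"Controlled detection rate: {controlled_detection_rate:.0%}")  # 70%

    # If you are willing to assume real bugs behave like seeded ones (a big
    # assumption), the same ratio gives a rough, Mills-style estimate of E:
    real_found = 35      # non-seeded bugs found during the same test run
    estimated_E = real_found / controlled_detection_rate
    print(f"Estimated total real bugs (E): {estimated_E:.0f}")            # ~50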

Some notes and pitfalls:

This method is designed to collect meaningful data where pre-written test plans are run. It gives an indication of whether those plans are likely to catch errors, or are simply lulling testers into a false sense of security.

This will only work if the seeded errors are plausible and strike a balance between being well hidden and still detectable. If the errors are too easy to find (or too hard to find), the resulting data will be neither accurate nor useful. Therefore, it takes a fair amount of expertise to seed the errors. (One possibility would be to remove a divide-by-zero check. If no test case in the test plan finds the error, then perhaps that points to an area where the test plan could be improved.)
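For example (illustrative Python only, not the system under test), the divide-by-zero seeding just mentioned could be as simple as quietly dropping a guard:

    # Original, guarded version:
    def average(values):
        if not values:       # guard against an empty input
            return 0.0
        return sum(values) / len(values)

    # Seeded version: the guard is removed, so an empty input now raises
    # ZeroDivisionError. A test plan that never exercises the empty-input
    # case will miss this fault entirely.
    def average_seeded(values):
        return sum(values) / len(values)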

The people injecting the faults should be unfamiliar with the test plan. Moreover, if the test plan isn't already written, those who write the test plan should be completely unaware of the injected bugs. Otherwise, the two groups will be engaged in a "chess match" of sorts, and the data won't provide an accurate indication of the test plan's effectiveness.

This isn't meant to be an ongoing exercise, so you needn't worry about testers "stopping" after the seeded bugs are found. This technique isn't used to test the system, it's used to test the test plan.

Joe is right to point out that this technique is not worth doing unless it is done correctly. That said, it still has its place in the software testing realm, particularly in organizations that rely heavily on pre-written or scripted tests.

Welcome to SQA, J.R. Aside from D and U, it may also be interesting to know the number of false positives (the number of reports that turned out not to be bugs). As I understand it, this can be as big an issue with outsourced QA as the bug detection rate.
– user246 Mar 19 '12 at 17:48


@user246 False positives are bad, yes, but aren't many of them a symptom of under-communication or under-documentation of the expected behavior of the application? I've worked with a few overseas QA teams throughout my career and I've found that the better the documentation you can give, the fewer false positives you get (duplicates are just as much of an issue in this regard). Also, if the issue isn't under-communication or under-documentation, then isn't it time to re-evaluate your testing team in general? (This goes for local and third-party testing teams alike.)
– Dez Mar 19 '12 at 18:46

@Dez yes, I agree it is a symptom of communication problems. Sometimes communication works best when people are in proximity, which of course is not possible when the team is remote. These problems can be mitigated with the right effort. Will the organization invest that effort? Sometimes yes, sometimes no.
– user246 Mar 19 '12 at 19:18

@user246 However, in light of the topic of this question, organizations are willing to invest time and effort in seeding bugs into the application in order to test the effectiveness of the team's testing. Would the money used for that effort possibly be better spent improving communication and/or documentation?
– Dez Mar 19 '12 at 19:22

@Dez Welcome to SQA, BTW. This might be a good subject for a new question.
– user246 Mar 19 '12 at 19:38

I have, however, been part of projects where new metrics were introduced - which is essentially what you are proposing here. If my experience is any predictor, you should expect that people will get better at whatever it is you specifically measure, at the expense of everything else.

Here, you might expect the testers to get better at finding the seeded errors, and worse at other things. Those other things might include helping others, documentation, finding non-seeded bugs, collaborating with developers, and so on.

You might also find that, once the seeded bugs are found, testers stop testing.

And you might find morale issues among the testers (and perhaps developers and others as well).

With any metrics program, consider carefully the unintended consequences and side-effects on the testers, developers, and others.

"If this is implemented I shall leave an answer updating everyone on my own experience with this."

+1 What happens when you tell the testers that the bug they found was put in on purpose to 'test the testers'?
– Phil Kirkham Mar 19 '12 at 16:24

So is your point that metrics in general are bad, or that we'd need more metrics - in which case, what would best complement this technique to garner the information we need? The issue in question is that there is low confidence in the work our counterparts outside the company have been doing, but management insists that it is more effective than spending the same money on increasing the headcount internally. Without some sort of metric the situation will stay the same, whereas I hope that either confidence can increase, or we can justify switching to more expensive internal resources.
– theheadofabroom Mar 19 '12 at 16:31


My point is that many times there are unintended consequences of metrics programs. Your developers will be injecting the bugs, right? What is their motivation? If they want to demonstrate that "those stupid outsourced testers can't find bugs", I'm going to bet they could inject some doozies.
– Joe Strazzere Mar 19 '12 at 17:54

The goal of this would be to determine whether we are getting enough out of our outsourcing for it to be justified as a cheaper and more flexible option than hiring more internal resource. The developers may well have the attitude that they're trying to trip up the outsourced testers; however, the same process would be used on internally QA'd projects, and testers are often not assigned until the work is ready for test (there will be both internal and external commitment on the project).
– theheadofabroom Mar 20 '12 at 11:08

The software in question is a Flash application, where it would not be difficult to insert the odd non-fatal stack trace, a few out-of-alignment resources, or some off-by-one counts - similar to the sorts of bugs that commonly occur.
– theheadofabroom Mar 20 '12 at 11:09

Researchers (e.g. Atif Memon at the University of Maryland) use fault seeding to measure the effectiveness of testing techniques. In that context, the practice makes sense. I assume you propose to use seeding as an experiment, not as an on-going process.

You mentioned rigorous testing. In a rigorous experiment, you decide ahead of time how you plan to interpret your results. Everyone involved in this experiment -- you, your QA team, your developers, and the outsourced team -- will have their own biases that will impact how they respond to the experiment's measurements. You can reduce the effect of those biases by agreeing on how you will collect and analyze the measurements before you begin the experiment. This includes agreeing on a sample size that is large enough to give statistically meaningful results.
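As a hypothetical illustration of the sample-size point (invented numbers, and a simple normal approximation rather than anything rigorous), the same observed detection rate is far less informative with ten seeded bugs than with a hundred:

    # Rough 95% confidence interval for the detection rate, using the normal
    # approximation to the binomial (for very small samples an exact method
    # would be preferable).
    import math

    def detection_rate_ci(found, seeded, z=1.96):
        p = found / seeded
        margin = z * math.sqrt(p * (1 - p) / seeded)
        return max(0.0, p - margin), min(1.0, p + margin)

    print(detection_rate_ci(7, 10))    # ~ (0.42, 0.98): too wide to act on
    print(detection_rate_ci(70, 100))  # ~ (0.61, 0.79): much more informative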

Edit

In addition to measuring the number of found bugs and missed bugs, it may also be interesting to measure the number of false positives, i.e. the number of reports that turn out not to be bugs.