I ask because in a recent Boagworld podcast, Steve Krug talks about how he recommends testing with only three users. When asked if that's statistically significant, he says no and goes on to explain that, to paraphrase, it doesn't really matter because some issues will cause everyone to fall at certain steps.

5 Answers

There is no contradiction between being concerned with statistical significance and conducting usability tests with 3 to 5 users. Technically, “statistical significance” means the results you’re seeing cannot be plausibly attributed to chance. In scientific research, where the cost of reporting spurious results is high, “plausible” is generally defined as a 0.05 probability or higher. There are several issues when applying this to a usability test of as few as three people.

First of all, the significance level of your results depends not only on the sample size but also on the magnitude of the observed effect (i.e., how different it is from your null hypothesis). You can have significance with small sample sizes if the magnitude is big enough. So in the case of a usability test, what's the magnitude? What are you comparing your effect to?

If you run the binomial calculations, it turns out that if 3 out of 3 of your users have a serious problem with your product, then you can conclude, at a significance level of 0.05 (one-tailed test), that at least 36% of the population will also have the same serious problem with your product. I don't know about you, but 36% is an awfully big proportion of your users to frustrate, and of course, it could easily be much more. It's clearly a serious usability problem. What Krug apparently fails to realize is that if you have an issue that "will cause everyone to fall," then the results from a sample of 3 or so people will be statistically significant for a pragmatic null hypothesis.
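Here is a minimal sketch of that binomial calculation, assuming the one-tailed bound described above (the smallest proportion p that cannot be rejected when all 3 of 3 users hit the problem satisfies p^3 = 0.05):

    # One-tailed lower bound on the true proportion when 3 of 3 users hit a problem:
    # the smallest p not rejected at alpha = 0.05 satisfies p**3 = alpha.
    alpha = 0.05
    n = 3
    p_lower = alpha ** (1 / n)
    print(round(p_lower, 3))  # ~0.368, i.e. the "at least 36%" figure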

Or take the usability-testing rule of thumb of having about 5 users per test. If a problem affects 30% or more of your users, you have an over 0.83 probability of observing it in one or more users with a sample size of 5. On the other hand, if a problem affects 2% or fewer of your users, then you have less than a 0.096 probability of observing it in one or more users. So by testing 5 users and attending to anything seen in one or more users, you have an excellent chance of catching the most common problems and little chance of wasting time on problems affecting a tiny minority.
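A quick sketch of those two probabilities, assuming independent users and the usual 1 - (1 - p)^n detection model:

    # Probability that a problem affecting a fraction p of users is seen
    # at least once among n independent test users.
    def p_seen_at_least_once(p, n):
        return 1 - (1 - p) ** n

    print(round(p_seen_at_least_once(0.30, 5), 3))  # ~0.832 for a problem affecting 30% of users
    print(round(p_seen_at_least_once(0.02, 5), 3))  # ~0.096 for a problem affecting 2% of users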

So, far from ignoring statistical significance, drawing conclusions from usability tests of 3, 4, or 5 users is actually perfectly consistent with the laws of probability. This is why, empirically, it has worked so well. Additionally, statistical significance only relates to quantitative results. Usability tests also typically include qualitative results, which can boost your confidence in the conclusions. You find out not only how many have a problem but, through your observations and debriefing questions, you uncover why. If the apparent reason why is something that is reasonably likely to be relevant to a lot of your users, then you should have more confidence in your results.

That said, there's a caveat to testing with such small sample sizes that gets back to the issue of the magnitude of the effect: small-sample usability tests are only good for finding big, obvious problems – ones that affect a large proportion of users. However, sometimes you have to worry about problems affecting a small proportion. To take an extreme case, if the problem only occurs for 2% of your users but it ends up killing those 2%, then obviously you want to know about it, and obviously a sample size of 5 is not going to cut it.

Similarly, when comparing results of two designs or problems, you can’t confidently state that one is better than the other with a small sample size unless one completely blows the other out of the water. When you need to know at a 0.05 level of significance which problem is greater or which design is performing better, larger sample sizes are called for. As a quick and dirty (and conservative) estimate of the sample size you need, take the precision you want, invert it, and square it. For example, if you want to know the percent of users able to complete a task to within 5%, then you need as many as (1/0.05)^2 = 400 users!
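The rule of thumb works because, for a conservative proportion of 0.5, the 95% margin of error is roughly 1/sqrt(n). A quick check of the 400-user example, sketched with the normal approximation:

    import math

    # Quick-and-dirty sample size: invert the desired precision and square it.
    precision = 0.05
    n = (1 / precision) ** 2                 # 400 users
    # Sanity check: 95% margin of error for a proportion near 0.5 at that n
    margin = 1.96 * math.sqrt(0.5 * 0.5 / n)
    print(int(n), round(margin, 3))          # 400, ~0.049 (about 5%)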

On the other hand, who says you need significance at the 0.05 level? For the business, what are the consequences of choosing one design to build or one problem to solve over the other? In many situations wouldn't we be satisfied with a 0.10 probability of pursuing a spurious result? Or even 0.20? The cost of missing a good design or top-priority problem may be much more than the cost of erroneously pursuing something that doesn't make any difference. For any given sample size, the larger the real difference in magnitude, the smaller the chance you're wrong, so if you are wrong in choosing one thing over the other at a 0.20 level of significance, you're unlikely to be terribly wrong – you're unlikely to have been much better off going with the other option.

Take another extreme case: you test two icons for something on three users. Two users do well with Icon A while only one does well with Icon B. For a null hypothesis of equal performance of the icons, the two-tailed significance level is 1.0 – it can't get any more insignificant. But which icon do you choose? One icon doesn't cost any more to use than the other and you have to choose one. So of course you choose Icon A. Obviously, you should have low confidence in your choice. Obviously it's reasonably plausible that the icons could perform equally well in the real world. There's even a reasonable probability that B is actually better than A. But in the absence of any other data, Icon A is obviously your best bet. In the presence of other data the level of significance does matter – you want to know how much confidence to place in each piece of information you have. However, the point is you don't always have to be 95% confident about the information for it to be worth considering.
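For the curious, the 1.0 figure comes from a two-tailed sign test on the three paired observations, which can be computed directly from the binomial distribution (a sketch):

    from math import comb

    # Two-tailed sign test: 2 of 3 users favor Icon A under a null of no difference (p = 0.5).
    n, k = 3, 2
    # Sum the probabilities of every outcome at least as far from n/2 as the observed one.
    p_two_tailed = sum(comb(n, i) * 0.5 ** n
                       for i in range(n + 1)
                       if abs(i - n / 2) >= abs(k - n / 2))
    print(p_two_tailed)  # 1.0: every possible outcome is at least this "extreme"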

A very thorough analysis! It's worth emphasising your point that statistical significance only relates to quantitative results since usability tests often yield a lot of qualitative data.
– Bennett McElwee Mar 9 '10 at 21:21

Wish I could star answers as well as questions!
– Francis Norton Jul 15 '10 at 10:25

For example, if three out of three people said they want to be able to sort the search results, that doesn't necessarily mean you should add sorting functionality. It means you should consider why the people said this and what need underlies it. Maybe you will end up designing a sort. Maybe a filter. Maybe something else.

"while testing alone is not a good indicator of where a team's priorities should lie, it is most certainly part of the triangulation process. When put in context of other data, such as project goals, user goals, user feedback, and usage metrics, testing helps establish a complete picture. Without this context, however, testing can be misleading or misunderstood at best, and outright damaging at worst."

Actually, I would say that asking users if they want sort is not a usability test. Providing sort and seeing how it improves performance is a usability test. I'd call asking users' opinions on a design a "misusability test" :-). However it is true that test results, including significant results, should be considered in context with other reliable data. But the question is, if test results are not significant, should they be considered at all? Aren't they just junk?
– Michael Zuschlag Mar 8 '10 at 23:10

I agree; I would never ask users what they want in a usability test. But they often volunteer such information. "Ah, here's the table I want. But it's not sorted properly! There should be a Sort button."
– Bennett McElwee Mar 9 '10 at 20:32


Even if test results are not statistically significant, they can still be very useful. Consider a test with only one person. Probably not significant to any useful level. Yet you will still gain valuable insights from such a test.
– Bennett McElwee Mar 9 '10 at 20:50

Think of it this way. Let's say you take a trip to a distant land. You get off the boat and everyone has 12 fingers. You don't know if this is an anomaly or whether everyone in the whole country is like this. Oh, you sell gloves in this scenario. :)

It doesn't matter if you had statistical confidence or not. You saw a giant red flag as soon as you got off the boat. This is clearly something you need to explore and figure out what is going on. If you sold guitars, you might not care as much. Simple (no-confidence) usability studies help you identify big red flags. That is usually all you need to make the system better.

Actually, if we assume those 12 people are close enough to a random sample, then you can say with 96% confidence that at least 77% of the population have twelve fingers. So, in fact, you can be quite statistically confident that something highly unusual is going on.
– Michael Zuschlag Mar 9 '10 at 4:02
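The arithmetic behind that comment appears to be the same one-tailed binomial bound as in the first answer, applied to 12 of 12 observations (a reconstruction; the exact figure depends on the method used):

    # One-tailed bound: if 12 of 12 sampled people have twelve fingers, the smallest
    # population proportion p not rejected at alpha = 0.04 satisfies p**12 = alpha.
    alpha = 0.04
    n = 12
    p_lower = alpha ** (1 / n)
    print(round(p_lower, 3))  # ~0.765, i.e. roughly the "at least 77% with 96% confidence" figure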

If you are performing an experiment with a decent number of participants and you intend to do a statistical analysis, then statistical significance is of paramount importance in framing your results (i.e. what is likely down to chance, and what is likely down to interaction between your experimental factors).

If you are doing interaction research on a small number of participants, I would argue that the kind of empirical analysis you are talking about is not appropriate due to the small sample size.

IMHO significance is always relevant and important when conducting statistical analyses. The question then is, when is it appropriate to perform statistical analyses? My answer would be when you have enough participants. If not, don't perform the analyses and just ignore significance, because the output will be pointless or of little value/validity.

How do you know if you have enough participants unless you conduct the statistical analysis? Isn't the point of a significance test to answer the question, "Do I have enough participants to warrant generalizing the statistic to the population?"
– Michael Zuschlag Mar 9 '10 at 12:15

It's as you put it in your answer above, imo: to distinguish chance from phenomenon. To answer the question "Do I have enough participants to warrant generalising..." is a question of validity, not solely statistical significance (i.e. significance is a subset of validity). IMO your results will have external validity when the populations are broadly comparable and the results are significant with the appropriate statistical power, but there's no hard and fast rule for acceptance of validity (although I'm far from being a master statistician).
– Nick Fine Mar 10 '10 at 9:17

Actually, sample size is only relevant for statistical significance (and confidence intervals). It’s irrelevant for other aspects of external validity. If you have a systematic bias in your sampling, procedure, or measures, increasing your sample size is not going to help. If you have no systematic bias, then a significant result from n=3 is externally valid. Counter-intuitive, I know.
– Michael Zuschlag Mar 25 '10 at 21:19

I would echo Michael's caution about representativeness being more important than sample size.

Regarding the issue of statistical significance and the sort of discount testing detailed by Krug and Nielsen: I think Krug and others answer that statistical significance isn't relevant because it can be a complex topic, and there are always people out there ready to pounce on your statistics, telling you (often erroneously) that you're wrong. You can avoid that whole conversation by just saying you don't use statistics (which is unfortunate but common).

As it happens you can use statistics with any sample size (even 3). In the context of a typical "find and fix" low cost usability test you can still use statistics to help understand how common the problems are and the number of problems you're likely seeing.

As Michael alluded to, one area is with confidence intervals. If you see 3 out of 3 people experience the same problem, you can then estimate how many of all the users would encounter the problem by using a binomial confidence interval (a calculator is here http://www.measuringusability.com/wald.htm).

By entering 3 passed and 3 total we get a 95% confidence interval between 47% and 100%. We can say with 95% confidence that at least 47% of our users would experience this problem (a non-trivial amount). We've made a statistical claim with only 3 users.
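One way to reproduce a number close to that 47% lower bound is an adjusted Wald interval with z ≈ 1.645, matching the one-sided 95% claim (a sketch, not the linked calculator's actual code):

    import math

    # Adjusted Wald lower bound for 3 successes out of 3 (one-sided 95%, z ~ 1.645).
    x, n, z = 3, 3, 1.645
    x_adj = x + z ** 2 / 2
    n_adj = n + z ** 2
    p_adj = x_adj / n_adj
    lower = p_adj - z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    print(round(lower, 2))  # ~0.47: at least ~47% of users would hit the problem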

The next question would be: given a sample size of 3 users, how many problems have we likely seen? First, this only applies to the tasks, parts of the interface, and types of users you're testing; change any one of those and you need to recalculate.

The statistical calculation is again based on the binomial. If, after testing 3 users, you or someone else wants to know how many problems you've found or missed, you use the following strategy.

For example, if you have the goal of finding problems that will affect at least 30% of all users, then you'd need to plan on testing 8 users to have a 95% chance of seeing such problems in the usability test. NOTE: This does NOT mean you've found 95% of all problems (as is often said); you've only found 95% of the problems that affect at least 30% of all your users. In other words, with small sample sizes you're only going to see the more obvious problems. Use this calculator: http://www.measuringusability.com/problem_discovery.php
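A sketch of the discovery model behind such calculators (exact sample sizes vary a little with the precise problem frequency and rounding conventions):

    import math

    # Chance that a problem affecting a fraction p of users shows up at least once in n users.
    def p_discovered(p, n):
        return 1 - (1 - p) ** n

    # Users needed (before rounding) for a `goal` chance of discovery.
    def users_needed(p, goal=0.95):
        return math.log(1 - goal) / math.log(1 - p)

    print(round(p_discovered(0.30, 8), 3))  # ~0.942: roughly the 95% chance cited for 8 users
    print(round(users_needed(0.30), 1))     # ~8.4, i.e. about 8-9 users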

But to Krug's point, there are usually so many "obvious" problems that need fixing that you don't need to worry too much about problems that affect, say, only 1 in 10 users.