10 Tips For Benchmark Usability Tests

by Jeff Sauro | February 14, 2011

To know if design changes improved the usability of an application, you first need a baseline measure of usability from a benchmark test.

Here are 10 tips to use when planning your next benchmark test.

Recruit for representativeness over randomness: It will be difficult to select a truly random group of users for your tests. Worry less about random selection and more about how well your test participants match your entire user population. If you need to compare new versus existing users or domestic versus international users, it is more important to have these groups proportionally represented than to select them randomly. Even clinical trials have problems with random selection, and most usability tests don't involve life-and-death decisions.

Estimate Sample Size using the desired margin of error: For tests where no comparisons are made, the needed sample size is derived from how precise you want your measures to be. You can use the 20/20 rule for quick calculations. To achieve a 20% margin of error you need 20 users. To cut the margin of error in half you need to quadruple the sample size. So a 10% margin of error requires approximately 80 users. A 5% margin of error requires 320 users. At this sample size, when you report your average completion rates, times and satisfaction scores, you will have confidence intervals that are 10 percentage points wide (a margin of error of plus or minus 5%).
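The relationship between sample size and margin of error can be sketched with the standard formula for a proportion's margin of error (a minimal sketch; it assumes a 95% confidence level and the most conservative proportion, p = 0.5):

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion such as a
    completion rate, at its most conservative value p = 0.5."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (20, 80, 320):
    print(n, round(margin_of_error(n) * 100, 1))
```

With n = 20 the margin works out to roughly 22%, close to the 20/20 rule of thumb, and because the margin shrinks with the square root of n, quadrupling the sample size from 20 to 80 cuts it exactly in half.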

Counterbalance tasks: Alternate the presentation of the tasks to minimize undesirable sequence effects. Often the first one or two tasks have lower performance metrics because users are still getting acquainted with the test and application. The more time they spend completing tasks, the more their performance (time, completion rates) improves. You will want to spread the learning effects evenly across tasks by counterbalancing or randomizing the task-order.
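A simple cyclic rotation can spread each task across every serial position (a minimal sketch; a fully balanced Latin square, which also controls which task precedes which, is a stronger but more involved design):

```python
def rotated_orders(tasks):
    """Cyclic counterbalancing: generate len(tasks) orderings so that
    each task appears in each serial position exactly once."""
    n = len(tasks)
    return [[tasks[(start + i) % n] for i in range(n)] for start in range(n)]

orders = rotated_orders(["A", "B", "C", "D"])
# Assign participant 1 to orders[0], participant 2 to orders[1], and so on,
# cycling back to orders[0] after the fourth participant.
```

Cycling participants through these orders evenly distributes first-task learning effects across all tasks.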

Collect both Post-Test and Post-Task Satisfaction: Usability questionnaires like the SUS provide more stable estimates of users' overall impression of an application's usability. They are less sensitive to task performance. Post-task questions like the Single Ease Question (SEQ) are more sensitive to usability problems, errors and longer task times. Task-level satisfaction can be combined with the other task-level metrics into a single usability metric.

Combine measures into a Single Usability Metric for reporting: When you record multiple metrics to measure usability you can standardize them and combine them into a Single Usability Metric (SUM). Having a single score makes it easier to convey the usability of a task or system on dashboards and reports and you still retain all the information in the component metrics for more detailed analysis.
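One way to sketch the standardize-and-combine idea is to convert each metric to a z-score against a target or specification and average the results (an illustrative sketch only, not the published SUM formula; the metric names, targets and standard deviations below are made up for the example):

```python
import statistics

def combined_score(metrics, targets, sds, higher_is_better):
    """Standardize each metric against a hypothetical target, flip the
    sign where lower is better (e.g., task time), and average the
    z-scores into one summary number."""
    z_scores = []
    for name, value in metrics.items():
        z = (value - targets[name]) / sds[name]
        if not higher_is_better[name]:
            z = -z  # lower task time should raise the score
        z_scores.append(z)
    return statistics.mean(z_scores)

score = combined_score(
    metrics={"completion": 0.90, "time": 50, "satisfaction": 5.5},
    targets={"completion": 0.78, "time": 60, "satisfaction": 5.0},
    sds={"completion": 0.10, "time": 20, "satisfaction": 1.0},
    higher_is_better={"completion": True, "time": False, "satisfaction": True},
)
```

Because the component z-scores are kept before averaging, the detailed per-metric picture is still available for diagnosis.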

Use confidence intervals around all your metrics: Data from any sample will deviate from the entire user population by some amount. This difference is called sampling error and is quantified using the margin of error. Margins of error are found by computing confidence intervals around all your measures. They tell you the most likely range of the average for the entire user population. If you need help with the computations, the Quantitative Starter Package will do the work for you.
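For completion rates at typical usability-test sample sizes, the adjusted-Wald interval is one common choice (a minimal sketch, assuming a 95% confidence level; it adds z²/2 successes and z² trials before applying the standard Wald formula):

```python
import math

def adjusted_wald_ci(successes, n, z=1.96):
    """Adjusted-Wald 95% confidence interval for a completion rate,
    clipped to the valid [0, 1] range."""
    n_adj = n + z ** 2
    p_adj = (successes + z ** 2 / 2) / n_adj
    moe = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - moe), min(1.0, p_adj + moe)

low, high = adjusted_wald_ci(9, 10)  # 9 of 10 users completed the task
```

With 9 of 10 users succeeding, the interval stretches well below the observed 90% completion rate, which is exactly the uncertainty a point estimate alone would hide.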

Conduct a pilot test: Even having one or two people complete your usability test can reveal obvious flaws with your test design or with the application prior to the full test. A pilot test can reduce ambiguities in task scenarios, prevent embarrassing system problems and improve the quality of your analysis.

Include some cheater/speeder detection for remote usability tests: Around 10% of online usability test-takers and survey participants will rush through your study just to collect the honorarium. You will want to identify these speeders, for example with an attention-check question or a minimum-time threshold, and remove them from your analysis. Having a high percentage of speeders/cheaters (>20%) suggests your tasks are too complex or the test is too long.
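A minimum-time check can be sketched as follows (the one-third-of-median cutoff is an assumption for illustration; tune the threshold for your own study length):

```python
import statistics

def flag_speeders(durations, fraction=1 / 3):
    """Return the indexes of participants whose total test duration
    falls below a fraction of the median duration (hypothetical cutoff)."""
    cutoff = statistics.median(durations) * fraction
    return [i for i, d in enumerate(durations) if d < cutoff]

flagged = flag_speeders([600, 650, 700, 90, 620])  # seconds per participant
```

Pairing a time threshold like this with an attention-check question catches both rushed and inattentive responders.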

Don't throw away failed task-times: When you record task time, keep the times from failed attempts. You can report: average task completion time (successful attempts only), average time on task (all attempts) and average time to failure (failed attempts only). All three of these can be valuable diagnostic tools and used for comparisons in your next benchmark test or after design changes.
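The three averages can be sketched from a list of (time, completed) pairs (a minimal sketch; the grouping convention is the common one of splitting times by task success):

```python
import statistics

def time_summaries(results):
    """results: list of (seconds, completed) tuples for one task.
    Returns the three time averages; None where a group is empty."""
    success = [t for t, ok in results if ok]
    failure = [t for t, ok in results if not ok]
    everyone = [t for t, _ in results]
    mean = lambda xs: statistics.mean(xs) if xs else None
    return {
        "avg_task_completion_time": mean(success),  # successes only
        "avg_time_on_task": mean(everyone),         # all attempts
        "avg_time_to_failure": mean(failure),       # failures only
    }

summary = time_summaries([(100, True), (200, True), (300, False)])
```

Reporting all three side by side shows, for instance, whether failing users gave up quickly or struggled for a long time before abandoning the task.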