While usability tests project an air of science and procedure, in practice they typically boil down to the interaction of *two* people and a machine. While the usability specialist does his/her best to minimize their effect on the person being tested, there is always something of a “Heisenberg usability effect.” Simply put, an observed user behaves differently than an unobserved one.

I don’t know of studies that have quantified this effect, but from testing experience I know that it can shift user behavior quite radically. What’s worse is that it doesn’t shift in a predictable direction — some users become anxious and fail where they normally wouldn’t, while others stay calm longer to secure their ‘usability gift pack’.

Despite our best efforts, social relationships and power structures always exist. A 12-year-old user being tested by an adult researcher will behave differently than they would visiting the same site with their friends. A particularly attractive user might be treated differently than a curmudgeon. As professionals we try to reduce these variances, but we can’t control for a participant who behaves differently around a male usability specialist vs. a female one. Our usual answer to this has been to leave the room entirely. So the user ends up sitting alone in a small room with a wizard-of-Oz-like voice telling them to turn to the next task. While this does remove some of the visual and social distractions, it is still a strange, strange environment.

Aside from such issues, the length of these tests cannot capture long tasks and/or longitudinal behavior. In recent years the need to learn about this has been somewhat dismissed due to the popular notion that “if people don’t succeed in 30 seconds they leave forever!” Of course this is not true. If I fail to find a good flight on Orbitz in 30 seconds, it matters not. Usually it takes me 20-40 minutes to book a flight because I want to be sure of getting a good deal. Similarly, I recently struggled to reconfigure my servers to work with my spiffy new router and battled with all manner of software for about 6 hours. No tester would subject me to such torture, and thus no test would capture the sheer agony of completing this task.

Currently, little is known about the magnitude of the Heisenberg usability principle, since we simply don’t have much data from users who didn’t know they were being tested. This might change as more companies roll out automated testing approaches. For example, Amazon uses A/B testing where 5000 randomly chosen users are exposed to a particular design change and their path through the site is subsequently logged. If the design change raises the number of people who find certain features/offers by 10%, they note it. All the while, the users don’t know that this test is being run. Such data could serve as a baseline against which traditional usability lab tests of the same design changes could be compared.
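As a rough sketch of the kind of unobtrusive logging such a test relies on — note that the group sizes, find-rates, and function names below are made-up illustrations, not Amazon’s actual system or numbers:

```python
import random

def run_ab_test(n_per_group=5000, p_control=0.20, p_variant=0.22, seed=1):
    """Simulate the silent logging behind an A/B test.

    Each simulated user is bucketed into a control or variant group
    without their knowledge, and we record only whether they found the
    feature. All rates here are illustrative assumptions.
    """
    rng = random.Random(seed)  # fixed seed so the simulation is repeatable
    control = [rng.random() < p_control for _ in range(n_per_group)]
    variant = [rng.random() < p_variant for _ in range(n_per_group)]
    rate_c = sum(control) / n_per_group
    rate_v = sum(variant) / n_per_group
    lift = (rate_v - rate_c) / rate_c  # relative improvement in find-rate
    return rate_c, rate_v, lift

rate_c, rate_v, lift = run_ab_test()
print(f"control: {rate_c:.1%}, variant: {rate_v:.1%}, lift: {lift:+.1%}")
```

A real deployment would of course log actual server-side click paths rather than simulated coin flips; the point is only that none of the 10,000 users ever knows a test is running, which is exactly the baseline condition a lab test can’t reproduce.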

Regardless of the size of this effect, it is clear that usability tests can give us valuable data. Still it is critical to step back from time to time to examine our methods and understand what is really covered and what is missing.

The theory is that Think Aloud usability testing comes from the Protocol Analysis tradition of cognitive science, and that when performed correctly, for certain tasks, it does not interfere *that much* with the way the user would normally do things.

Of course, those assumptions (performed correctly and the correct task) are large ones.

Ericsson and Simon published an entire book basically explaining when Protocol Analysis is useful and valid and when it’s not: it’s called “Protocol Analysis: Verbal Reports as Data”.

While not much research has been done on the validity of observed Think Aloud usability studies, quite a bit has been done in the field of Cognitive Science on the validity of Protocol Analysis, so looking into that research could be very useful when thinking about Think Alouds.

Maybe someone should put together an overview of Protocol Analysis research for the usability analyst.

Tom Chi has written an article on the “observer effect” within usability testing, where the testing environment influences the way participants behave. To quote: “I don’t know of studies that have been run to quantify this effect, but from testing…”

“some users become anxious and fail where they normally wouldn’t, while others stay calm longer to secure their ‘usability gift pack’.”

In most usability testing situations, I’d consider it a serious breach of ethics to tie inducements to how participants behave in a test. Participants should be told they can stop participating at any time and that (usually) they will still receive full incentives. However, participants usually forget this, if they note it at all, since they tend to assume that the incentives are tied to completion of the test.

That said, usability testing is such a mishmash of different techniques with different goals conducted by people with wildly different competencies that any study of the effect of the social interactions between researchers and participants would be meaningless. First you would need to control for the techniques used, the testing goals, and tester competency.

Try testing on sales people. Really, so rewarding to hear them gibber with confidence! You can reuse their one-liners on the cover of your usability report as well…things like “Those guys who designed this should be laid off tomorrow” and they *never* think they are wrong. Makes your day!

Naturalistic observation obviously yields a much richer data set than setting pre-conceived tasks to be performed in a lab, but it’s much more labour intensive, time consuming, and expensive. Contextual inquiry and field studies help balance the scales, but can introduce their own biases.

What usability testing IS good for is finding gross design problems (i.e. naive problems that don’t take into account context of use) and (in the case of formal, summative testing) setting metrics against which further testing can confirm or refute design hypotheses.

To throw another spanner in the works of verbal protocol, within the psychology community there seems to be some contention, as there is much evidence that introspection is mostly unreliable, i.e. people can’t accurately tell you why they did something (see Nisbett, R. E. and T. D. Wilson (1977). “Telling more than we can know: Verbal reports on mental processes.”). People tend to either rationalise or confabulate why they did certain things, as they don’t have access to higher order mental processes.

How reluctant we are to part with the notion that there’s such a thing as an unbiased usability test. While having a facilitator (and even observers) in the room clearly has the potential of changing the behavior we are trying to observe, as Tom pointed out, the disembodied voice is unnatural too. And many forms of bias lurk in our test protocols - unrealistic tasks, leading wording, etc. Our responsibility is not to eradicate it, but rather to look for it in all its subtle forms and understand how it affects the data we collect. Embrace the idea that bias will always exist - it’ll free your mind to focus on more important things.

I base my methods on the premise that the purpose of usability testing is to get data to the development team that will empower them to improve the product. Often, the team has questions about specific aspects of the design. As a facilitator, it’s my job to probe these areas, sometimes even at the risk of redirecting the user’s attention. For instance, if a help topic has just been rewritten but users don’t spontaneously look at it, I might suggest that they try the help. The data on how well the help file works is still valid; we just shouldn’t draw conclusions about whether real users will get to that information.

On the other hand, some of the most valuable findings from usability tests are those that we didn’t expect, so it’s important to remain open and not over-orchestrate the show. I believe that rigid adherence to scripts in a misguided effort to be “objective” can actually inhibit the serendipitous - and very valuable - things that might otherwise happen.

I agree with Ash that people often can’t tell you what they’re really thinking, but if you rely strictly on observation you risk putting your own interpretations into the picture, which may be even farther off base than the user’s. Asking users to think aloud helps, but most people can’t do it perfectly. Plus, a user who is thinking hard tends to fall silent, and that’s exactly when you most need to figure out what’s going on.

Bottom line, I plan to continue my fairly active style of facilitating, because I think it provides the most useful data for the development team. To increase its reliability, I take note of the questions I ask and I’m alert for things I say or do that affect the results. There is bias in every test I conduct, and on occasion it undermines some of the findings. But people who believe that it is possible to be 100% objective are simply blind to their own biases.

“To throw another spanner in the works of verbal protocol, within the psychology community there seems to be some contention, as there is much evidence that introspection is mostly unreliable, i.e. people can’t accurately tell you why they did something (see Nisbett, R. E. and T. D. Wilson (1977). “Telling more than we can know: Verbal reports on mental processes.”). People tend to either rationalise or confabulate why they did certain things, as they don’t have access to higher order mental processes.”

That’s not contention. Introspection is not Protocol Analysis. It’s fairly well-known that there are certain things which are valid and easily obtainable from Protocol Analysis and there are certain things which are not. “Why?” is not a question you ask during a proper Protocol Analysis (or, for that matter, a proper Think Aloud usability study).

My point earlier was that there has been quite a bit of research on this very topic in the fields of psychology and cognitive science, and that people do have a fairly good idea as to where the lines are drawn. Those same lines apply to Think Aloud usability studies and could be useful to know.

Yes, it has been argued that “Introspection is not Protocol Analysis”, but it took an argument to establish that. When Nisbett & Wilson pointed out the lack of validity of introspection, Ericsson & Simon defended their techniques by arguing that the protocol does not seek ‘why’ from subjects but, at best, merely gets them to report the contents of their STM (which then has to be interpreted).

However, in practice (for usability folk) Nielsen (1993) proposed two flavours of ‘Thinking Aloud Protocol’, i.e. ‘Critical Response’ (closer to Protocol Analysis) and ‘Periodic Report’ (which is retrospective and introspective). Add to that the lack of rigor applied to such techniques and the common teaching of ‘probing techniques’ (the “why”) to be employed during and after the evaluation, and the result is quite a bit of introspection.

Yes, there has been quite a bit of research on Protocol Analysis, but the results haven’t been all good either. Some have challenged the foundations, and it has since been pointed out that the argument that Protocol Analysis ‘worked’ was based on the perceived success of the General Problem Solver (GPS). For decades even the AI community has acknowledged that GPS (while a significant step in the history of both Artificial Intelligence and Cognitive Psychology) was overly simplistic and completely invalid. Goguen & Linde (1993) concluded that “protocol analysis is not a reliable guide to what subjects are thinking, and is open to serious misinterpretation by analysts, who can choose a small sample of protocols [..] for an unrealistic problem (both artificially simple and artificially without social context) to impose their preconceptions on the data.”

All that said, I wholeheartedly agree with (and actively promote) Julian’s point: that we should understand the origins of our techniques and learn from them. Many have been ‘watered down’ over the years - especially with the boom in ‘discount usability engineering’. Further to that, I think we should also challenge techniques that are in common use (such as thinking aloud protocol, card sorting, and heuristic evaluation), if we are to further develop our field.

An interesting article, and comments. Apologies if I’m out of my depth here, but the slightly different situation that I find myself in means that it’s useful for me to understand (or at least think about) how your ‘Heisenberg usability principle’ affects the user. I spend a fair amount of time instructing people on the use of their computer software, sometimes over the phone, so I’m used to the radically different ways that different people behave when their computer-use is under scrutiny, but have very little understanding of how those same people behave on their own (and how they get themselves into hot water in the first place). I can see that my role as teacher is very different to that of a usability specialist, but if nothing else, I end up deeply embroiled in the usability of a piece of software.

A lot of tasks that are relatively easy for a moderately experienced user must be reduced to a list of screen descriptions and point-and-click instructions for novices. It must be a difficult task to design an interface that a genuine novice can follow.

Aside from the observer/observed Heisenberg problem discussed so far, I’ve noticed another more subtle problem with think alouds.

Even if you remove the observer from the test scenario, it seems that users who think out loud solve problems, achieve goals, and just generally figure things out a bit quicker than users who keep their thoughts inside.