User Research

08/25/2009

Usability testing has become one of those flexible words that teams have stretched to include almost any activity that purports to assess usability. There may be users involved; there may not. There may be behaviors observed; there may not. I've seen focus groups masquerading as usability testing. I've seen ethnography masquerading as usability testing. I've seen phone interviews masquerading as usability testing. And worst of all, I've seen "expert opinions" masquerading as usability testing.

While those of us in the business are gratified by how mainstream usability has become, we also get concerned when we see a wide variety of methods being used by inexperienced teams. In some cases, the wrong methods are applied in the wrong situations, and the result can be faulty conclusions and bad design decisions.

One of the most quoted phrases I hear from non-practitioners dates to a 1993 paper by Tom Landauer and Jakob Nielsen. The paper was intended as a mathematical illustration of the diminishing ROI of testing with ever more users, but most who refer to it only display the graph from Jakob Nielsen's 2000 blog post.

The problem is that most people merely look at the graph, hear that the paper was presented at some fancy-pants ACM CHI conference in Amsterdam, and walk away secure in the knowledge that they have discovered ultimate methodological truth. They never read the paper themselves - or even bother to visit the summary Jakob published on his website in 2000 - something a simple Google search could surface in seconds.
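The model behind that graph is simple enough to sketch. A minimal Python illustration follows, using the L = 31% average problem-detection rate Nielsen reported across his projects; your product's actual rate is unknown, so treat that value as an assumption, not a constant:

```python
# Expected fraction of distinct usability problems found after n users,
# per the Nielsen & Landauer model: found(n) = 1 - (1 - L)^n,
# where L is the probability that a single user surfaces a given problem.
# L = 0.31 is the average Nielsen reported; your product's rate will differ.

def fraction_found(n, L=0.31):
    """Expected share of problems uncovered by n test users."""
    return 1 - (1 - L) ** n

for n in (1, 3, 5, 10, 15):
    print(f"{n:2d} users -> {fraction_found(n):.0%} of problems found")
```

Under this model, five users already surface roughly 84% of problems, which is where the famous "five users is enough" shorthand comes from - and, as the rest of this post argues, where it breaks down once your audience is not homogeneous.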

This is a problem, because the original graph is most meaningful for a homogeneous group of users. Obviously some fundamental usability issues, such as navigation bar confusion, can be discovered by users with very distinct differences, such as men and women, or seniors and tweens.

However, in many cases, I work with clients who have user groups with very distinct differences in terms of domain knowledge, familiarity with the web, familiarity with the sponsoring organization, etc.

In those cases, I strongly encourage teams to test with 6-8 users from each distinct group. Before I'm ready to get up in front of a bunch of stakeholders who are spending hundreds of thousands of dollars on a website, genuflect, and pronounce the site "usable" or "unusable," I like to see 25-30 users.
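To see why per-group sampling matters, here is a hypothetical sketch. The assumption that each group has its own, disjoint set of problems that only its members can surface is mine, for illustration, and the group names are made up; the detection rate is the same L = 0.31 from the Nielsen/Landauer model:

```python
# Why heterogeneous audiences need per-group recruiting: a sketch.
# Assumption (mine, for illustration): each distinct user group has its
# own problems that only its members can surface, all with the same
# per-user detection probability L.

L = 0.31

def coverage(n_users):
    """Expected share of one group's problems found by n of its members."""
    return 1 - (1 - L) ** n_users

groups = 4      # hypothetical: e.g. novices, experts, seniors, tweens
per_group = 7   # middle of the 6-8 range
total = groups * per_group  # 28 sessions, inside the 25-30 range

print(f"{total} users total; {coverage(per_group):.0%} "
      f"of each group's problems covered")
# 28 users drawn from a single group would cover nearly all of that
# group's problems but none of the problems unique to the other three.
```

The point of the sketch: the diminishing-returns curve applies per group, so recruiting 28 users from one convenient population buys you almost nothing about the populations you skipped.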

Along with not testing with enough people, the other common mistake I see inexperienced teams make is that they don't test with real users.

Only one in four American adults has a college degree (source: U.S. Department of Commerce, Economics and Statistics Administration, 2007 American Community Survey, Educational Attainment in the United States).

Nearly 60% of US households defined as a family of four make less than $50k in combined household income (source: U.S. Census Bureau, Census 2000 Summary File 3, Matrices P52, P53, P54, P79, P80, P81, PCT38, PCT40, and PCT41).

A full 20% of US adults read at a 5th grade level or below, and the median reading level of US adults is 8th grade (source: 2003 National Assessment of Adult Literacy, NAAL).

Despite these facts, I frequently work with teams who recommend that they invite other employees from within the same company to come over and try out their design so they can conclude it is usable. Only slightly better are the suggestions to "test" a website or product with "friends and family."

Sociologists have demonstrated that most people's circles of friends don't deviate from their own narrow range of education, income, or ethnicity.

So if you want to say that you're conducting usability testing, be sure to use broader recruiting methods, preferably by hiring a professional recruiting agency, especially if you're trying to assess a product or website that has a broad consumer audience.

Granted, if you're working on accounting software that is only used by CPAs, then you can probably get away with working with a group of 5 CPAs provided by the site sponsor as a user group. But if you're intending to assess a site with a broader consumer audience - get out of your office, away from your neighborhood, and test with a decent group of "real" users. Try 6-8 of each distinct group. You'll be glad you did.

09/13/2009

The next critical success factor for successful usability testing is to observe real behavior.

Leading tasks, highly scripted sessions that take page or layout elements out of context, and lots of conversation with the moderator are some of the more common mistakes I see teams making when they think they're objectively evaluating the usability of an interface.

Instead, what they're doing is taking users on a guided tour of an interface, and in the worst cases even trying to convince the user their design is good. Even when they're doing a decent job of being objective, those who run highly scripted, interview-based "usability" sessions are at best gathering small-sample qualitative preference data.

There's a big difference between preference data and behavioral data. Some may have heard a good illustration from NN/g's Kara Pernice. She uses the example of a cappuccino machine. Imagine you are standing in front of a cappuccino machine. It is brand new and the box and manual were thrown away. If you wanted to know if the design of the cappuccino machine was usable -- you could assign 20-30 people who had never seen or used it before to walk up and make themselves a cappuccino. If you did this in different regions of the country, or even around the world, you would find there is little regional variation in that behavioral data. You could have confidence that despite your small-sample qualitative methods, you had identified any usability challenges.

However, if your goal really were to find out what flavor of cappuccino people liked -- this one-on-one qualitative method would be all wrong for that kind of preference data. For one thing you would quickly discover that regional variations across the country, and throughout the world would become very important. You would also discover that 20-30 people recruited via a recruiting agency database are not a statistically valid "sample" of the broader population of cappuccino drinkers. You could over-react when 15 out of your 30 people told you they liked peanut-butter flavored cappuccino, and your boss would be angry when you went to market trumpeting a new Skippy flavored brew that didn't sell well.

The tricky part of "user testing" -- is that it is often funded by the marketing group, or other product managers who aren't exclusively interested in the time-on-task, error rate, or learnability of a cappuccino interface. Empathizing with their need to be reassured about how well the cappuccino machine is going to sell will be important to communicating with them effectively. But if the team is really focused on improving the product and repairing any usability flaws -- you'll need to educate them about methods.

In my current role with an e-commerce retail focus, I regularly have clients come to me with "comps" for a new product page layout, or homepage. They tell me they want to do a "usability test" of the new page, or see if their new images work well and contribute to a purchase decision. There are several "usability" consultancies out there that are happy to take the clients' money, and using the clients' very stilted "usability" test script, bring in a mere 5-10 users, and proceed to "lead the witness."

These consultancies will call it a listening session, or a usability group, or whatever the moniker. But by plonking users in front of a single page, outside the context of a realistic behavior (in this case, making a real purchase on the wider Web), pointing to a new element, and asking users to talk about whether they like it, whether it is helpful, or to describe its usability, they're unfortunately not learning much that is valid or reliable.

The marketing or product leads will sit behind the mirror and furiously scribble down comments both positive and negative, but as with the cappuccino machine preference example, they're using a flawed methodology that has serious potential to not only fail to uncover real behavioral usability problems, but mislead researchers and teams into thinking users prefer one interface element over another.

So how do you avoid this problem?

You do it by watching real behavior in as non-leading a way as possible. There will always be test effects and distortions: we most often study users outside the normal environment of their home or office, on a computer or browser they're not familiar with, in a situation where they know they're being watched, and with a sometimes-learned motivation to speak in the animated, adjective-laden style they think will get them invited back to another focus group to make another 100 bucks.

The potential sources of variance for lab-based testing are well known and well-documented. So I try not to pretend to myself, or to my clients, that the lab doesn't impact what we see. But by following some simple rules we can limit those effects as much as possible.

For starters, I try to set an overall goal for users, and then leave the room. It's a bit difficult for users to verbally describe behavior (instead of actually doing things and trying out the design) if there is no one else in the room to talk to. Second, if I'm interested in a particular part of an interface, such as a new informational element on a pharmaceutical company's homepage, or a new, larger, interactive, zoomable image module on a retailer's product page, I'm much better off if I can observe user interactions with that element that are natural and unscripted.

As I'm currently in the retail e-commerce space, I insist that users have the broad goal of making a purchase. While I do have to limit them to one particular website (broader studies of them purchasing, say, a pair of pants without any limit to where they can go would have obvious strengths in terms of learning user behavior patterns with search engines, comparison behaviors between sites, etc.), I don't sit next to them and tell them which pages to click, or stop and point out elements of the interface and say "ooh, what do you think of that? Do you like it? How much do you like it on a scale of say 1-10?"

Whenever possible, I like to see users interact with fully clickable, functional prototypes or live sites. Again, in an e-commerce context I like them to be using their own credit card, making selections and actually purchasing such that they know this stuff is actually going to get shipped through the mail and arrive on their doorstep.

You'd be amazed at the difference in behavior between a user who is "pretending" to shop and one who knows the item they're evaluating will either have to be used, worn, or shipped back via the hassle of a return.

After only a few minutes, I find that users forget I'm even on the other side of the mirror or watching on a dual screen monitor.

As a result, when 20-30 or so users arrive on the homepage the team wanted tested, or land on a product page with the new image "zoom" functionality, I get to see 1) what other elements of the page or overall site they use to solve their problems, 2) at what point in their process they do interact with the new element, and 3) for how long.

Because we use eyetracking technology, I'm able to watch their eyegaze in real-time from behind the mirror and understand intra-page navigation.

Now, dear reader, I suspect you're going to ask -- what if they don't interact with my new interface element, the new whiz-bang thing that the person paying for the study is so desperately wanting feedback on?

Well, sometimes that does happen. And that, of course, is instructive in and of itself. If 30 folks come in with the goal of making a real purchase, and zero of them use the new zoom feature (that is supposed to help them choose between products), that should give the team pause. But despite my commitment to natural, non-scripted user behaviors, I'm of course a big fan of the good old-fashioned debrief. After we've observed a natural purchase, we then transition to having users talk us through what they've done - since we've already seen the natural behavior, we don't risk altering or influencing what they do by asking follow-up or probing questions.

And if, during the natural user-guided portion of the session, the user didn't interact with an important element, I'm happy to assign a moderator-contrived task during the debrief, or "prompt" them to notice something and try interacting with it so I can seek feedback. Although I know I'm leading the user at that point, I at least am able to place their comments and behaviors in the context of the more natural behavior I have just observed. Again, I am likely to get some preference data, but at least it's not all I'm getting out of the study.

So to sum up, watching real behavior is a critical success factor for effective usability evaluation. Instead of tightly scripted, moderator-contrived tasks (with me sitting close to the user and breathing down their neck), assigning a broad goal and letting the user "do their thing" is more likely to uncover unexpected problems and give us confidence as to whether the design really works. Leaving the room can often help users relax and start "doing," instead of merely talking about doing. As Jakob Nielsen has said, what users do, versus what users say they do, can be very different things.