Visual Perception, Data Visualization, and Science

Guide to user performance evaluation at InfoVis 2013

When reading a paper (vis or otherwise), I tend to read the title and abstract and then jump straight to the methods and results. Beyond a claim of utility for a technique or application, I want to understand how the paper supports that claim of improving users’ understanding of the data. So I put together this guide to the papers that ran experiments comparatively measuring user performance.

Less than a quarter

Only 9 out of 38 InfoVis papers (24%) this year comparatively measured user performance. While that number has improved and doesn’t need to be 100%, less than a quarter just seems low.

Possible reasons why more papers don’t evaluate user performance

Limited understanding of experiment design and statistical analysis. How many people doing vis research are familiar with different experiment designs like method of adjustment or forced-choice? How many have run a t-test or a regression?
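To make the statistics feel less daunting: the analysis for a simple two-condition comparison can be only a few lines. The sketch below uses invented task-completion times for two hypothetical conditions and computes a Welch (unequal-variance) two-sample t statistic in plain Python; it is an illustration, not data from any paper.

```python
from statistics import mean, variance

# Hypothetical task-completion times in seconds (made up for illustration).
baseline = [12.1, 14.3, 11.8, 13.5, 12.9, 15.0, 13.2, 12.4]
novel = [10.2, 11.1, 9.8, 10.9, 11.4, 10.5, 9.9, 10.7]

def welch_t(a, b):
    """Welch's two-sample t statistic (does not assume equal variances)."""
    va, vb = variance(a), variance(b)  # sample variances (n - 1 denominator)
    na, nb = len(a), len(b)
    return (mean(a) - mean(b)) / (va / na + vb / nb) ** 0.5

t = welch_t(baseline, novel)
print(round(t, 2))
```

A statistic this far from zero would correspond to a very small p-value under any reasonable degrees-of-freedom calculation. In practice one would let a library do both steps, e.g. `scipy.stats.ttest_ind(baseline, novel, equal_var=False)`, which returns the statistic and the p-value together.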

Evaluation takes time. A paper that doesn’t evaluate user performance can easily scoop a similar paper that does include a thorough evaluation.

Evaluation takes space. Can a novel technique and an evaluation be effectively presented within 10 pages? Making better use of supplemental material may solve this problem.

Risk of a null result. It’s hard – if possible at all – to truly “fail” in a technique or application submission. But experiments may reveal no statistically significant benefit.

The belief that the benefit of a vis is obvious. We generally have poor awareness of our own attentional limitations, so it’s actually not always clear what about a visualization doesn’t work. And beyond assessing our own abilities poorly, it’s important to know for which tasks a novel visualization is better than traditional methods (e.g., Excel and SQL queries) and for which tasks the traditional methods are better.

A poisoned well. If a technique or application has already been published without evaluation, reviewers would scoff at an evaluation that merely confirms what was already assumed. So an evaluation of past work would only be publishable if it contradicts the unevaluated assumptions. It’s risky to put the time into a study if positive results may not be publishable.

I’m curious to hear other people’s thoughts on the issue. Why don’t more papers have user performance evaluations? Should they?

10 thoughts on “Guide to user performance evaluation at InfoVis 2013”

I’d argue that “Evaluation Takes Time” is not a “fear of being scooped” question; it’s a financial question. Any sort of serious evaluation costs several weeks of work; at reasonable tech pay scales (not grad students), that’s something like $10,000. Is it worth the marginal value?

I’d love to hear your thoughts on an evaluation of (say) Google N-Gram. (I’m in the keynote, so it comes to mind.) What should I compare N-Gram to? A really big SQL table of (word, value, date)? An Excel table of (date, value) that’s pre-processed for any given word? Do I create a new UI for N-Gram that allows me to type in a word and get a long table of numbers to scan over?

Evaluation definitely has a monetary cost. It’s probably not worth the marginal value for an internally created and used vis. But publishing a paper promoting the utility of an approach to a broad audience is different. Doesn’t the audience deserve some proof showing that it helps some task? Otherwise, a paper promotes the use of a technique or the applicability of a technique to a type of problem, but it’s not clear whether others would benefit from it.

N-Gram is a line chart. The novelty is in the data collected rather than the vis technique. But the vis technique is heavily researched and studied (Cleveland and McGill, 45-degree banking, etc.). I doubt N-Gram would pass the bar for novelty by reviewers.

There are many more ways to evaluate a visualization (technique/tool/system) than user performance, and a user performance evaluation sometimes just does not make sense for the type of contribution a paper makes. As an example: we presented hybrid-image visualization this year, a technique that allows two visualizations to be blended for distance-dependent viewing. We did not do a user performance evaluation, but that does not mean we did not evaluate our technique. What we did was employ a qualitative image inspection technique (QRI) [1] and discuss the perception theory that backs up why our technique actually works. For the type of contribution that our paper made, this was the right match of evaluation, and no reviewer asked for a user performance study.

A user performance evaluation would have to ask very different types of questions that no longer match the focus of the paper (which was the presentation of the technique). Interesting things to study in the future are: the effects on cognition when using hybrid-image visualization in collaboration, or a specific in-situ evaluation of hybrid-image vis in a work context (but even then I’d probably opt for a qualitative study and not user performance).

There are many more ways to evaluate a visualization (technique/tool/system) than user performance

I agree! Like I said, I don’t think UP evaluation has to be 100%. And one example when it may not be necessary is when the premise of the visualization is based on an already evaluated perceptual theory (though we have to watch for overextending theories). The hybrid-image paper is a great example of a technique that explains its roots in perceptual theory. While – as the paper states – there’s more work to be done in terms of designer guidelines, the premise is built on the experiments run by Oliva & Schyns (1994-2000), Navon (1977), and Campbell & Robson (1968). We know why it should work. Many proposed techniques lack that foundation entirely.

As for the question of quality inspection, I’m skeptical of this approach for a few reasons:
1) We are often blind to our own attentional limitations (Change blindness blindness).
2) We don’t always choose the tool that most optimizes our performance (Franklin Taylor’s “optimal shoveling” study).
3) We can perceive improvements in visual clarity even though the actual display is blurred (Motion sharpening).
4) Performance takes a huge hit when users are exploring rather than looking for something they know is there (oddball vs. targeted search). The reader or presentation/demo viewer in these QRIs is rarely naïve and is frequently primed. “Figure X clearly shows that….” “Oh yeah, I DO see that.” They know what they’re looking for, and that’s what they find.

Now, many QRIs that seek to answer the question of whether something is visible at all likely don’t suffer from the above problems. But there are so many ways that a QRI can be misleading. Should readers always have to dig through literature to determine whether a paper’s presented results are misleading?

My overall concern about QRIs is: I don’t trust the brain to self-assess.

I do agree with Petra; I think performance evaluation is really only feasible and sensible for a small subset of the techniques published: it needs to be a simple technique with a clear alternative. How would you evaluate the performance of a complex tool using a combination of novel and established visualization techniques? There is no way to isolate the various aspects that can influence performance. What if there is no adequate visualization technique to compare to? In the case of our LineUp paper [1], for example, we ran a study to get some qualitative feedback from users. Should we have run a comparative study against Excel, for example? I think not. Should we have created an alternative visualization technique that we think is inferior, just for the purpose of evaluating the superiority of another approach? Which alternative should we have implemented from the whole design space? And LineUp is actually a pretty technical paper, where a performance evaluation would be much more feasible (but not necessarily meaningful) compared to a complex system as used in a design study paper.

In contrast, in our Context-Preserving Visual Links paper [2], we ran a performance study that evaluates the effectiveness of highlighting with color vs highlighting by connecting elements with edges (among others). This is a clearly controllable and atomic aspect of the visualization and thus it makes a lot of sense to compare user performance, quite similar to your 2012 InfoVis paper.

I argue that for most of the papers you list above, the evaluation was the most important or a very important part. We learn that radial layouts are better for certain tasks than other graph layouts, for example. These kinds of papers are very valuable, but they are not the only ones that are valuable.

I do think that none of the reasons you mention should influence the decision whether to do a performance evaluation – most of them are unjustifiable excuses, with the exception of “The belief that the benefit of a vis is obvious”. Here I agree with van Wijk, as he discussed in his VIS capstone: strive for tools that are obviously better. I wouldn’t trust a flimsy 12-person evaluation that shows me that complex-tool-1 is better than complex-tool-2, especially if the benefit is not obvious.

As I stated in the original post and the reply to Petra, we don’t need 100% of papers to evaluate user performance. However, we should be doing more to evaluate how techniques impact performance for a wide variety of tasks. It’s even possible that combining techniques (that were already evaluated for many tasks) would have some sort of negative interaction. As Tamara Munzner said in the evaluation panel, we could stop making new visualizations and still have ten or more years of evaluation work to do.

I don’t agree that Excel is unfair for comparison. There are many tasks for which visualization does not help. Demonstrating that having a visualization at all actually improves a user’s understanding of the information is useful. Excel is not a straw man.

I do agree that an entire alternative visualization doesn’t need to be created for comparison. But selectively simplifying features or altering visual mappings would be very informative in determining what aspect of a vis helps a particular task. You did exactly that in the contextual links paper: you implemented highlighting and straight lines for comparison. For a more complex multifaceted application, selectively removing or simplifying even one component (especially the more novel techniques or combinations thereof) allows the reader to know how much that component impacts a particular task and whether it’s worth the effort to implement.

Perhaps we just have a difference in philosophy. If a solid evaluation yields results that counter my intuition, I’d first question the study. But if the study is solid and replicated, I’d accept that I have a flawed intuition.