How do we know when a visualization is good? Perspectives from a cognitive scientist

This post is the first in a series entitled: How do we know when a visualization is good? In this series, we will hear from visualization researchers on how to evaluate visualization quality. — MV

How do you know which visualization is better? This seems like a simple question, but the answer is far from simple.

First, let’s define better. Better can mean that something is more attractive or maybe more exciting. Let’s consider pop culture for a moment, rather than visualizations. The Star Wars prequels are obviously better than the originals, right? Or Bill Murray is the best comedian of the 20th century. Job done, let’s go home.

If you think that these subjective opinions aren’t scientific enough, maybe you’d prefer another approach to defining better. How about comparing box office numbers, DVD purchases, or merchandise sales? By financial success at the US box office, Star Wars: A New Hope (1977) is the second-highest-grossing movie of all time, after controlling for ticket price inflation. Gross movie income is an example of an objective behavioral metric (people drove to the theater and paid money to see it), whereas opinions about the prequels are subjective self-report metrics.

This terminology is borrowed from cognitive science: “behavioral” approaches measure the objective actions a user takes, while subjective “self-reports” measure users’ descriptions of what they think.

A case for objective measures

A large body of prior work shows that objective behavioral measures, such as testing how fast or how accurately someone completes a task, describe how people actually reason with information more precisely than subjective self-report measures do. In other words —

What you think you think, isn’t always what you think. Think about it… Did that help? No? My point exactly ;)

What I mean to say is that our intuitions can be misleading, particularly when it comes to visual information. Remember this?

Figure showing the correspondence between participants’ ratings of three-dimensionality and scientific credibility in neuroimaging visualizations, initially published in grayscale in Keehner et al. (2011) and reprinted in color in Padilla et al. (2018).

For a visualization example, researchers find that people believe 3D and photorealistic visualizations are more scientific, easier to use, and generally superior to simpler approaches (one case is illustrated in the figure above). Our misplaced faith in high-fidelity graphics could come from rules of thumb that we’ve learned in our daily lives. For example, sometimes you get what you pay for, meaning that more $ = a fancier product = higher quality. It could be the case that we’ve developed a rule of thumb (or a heuristic) where we associate high-fidelity graphics with quality, which we also assume is indicative of better science. However, numerous studies find that more minimal visualizations can significantly improve performance over more complex ones. In other words, people don’t always know what is good for them. This could be because we are not consciously aware of the many unconscious processes that can drive our actions (such as heuristics), which makes it difficult to describe our reasoning and motivations.

So how do we use objective behavioral measures to evaluate the quality of a visualization? And which measures?

Visualization researchers commonly focus on accuracy and speed. But even the best-intentioned studies (including my own) can only make claims about the speed and accuracy of a visualization for the tasks, contexts, and data that they tested. (Note that I’m using the term task to refer to anything the user does, from insight generation to looking up values from a plot.) In other words, an empirical study can only claim that, for the tasks and contexts tested, visualization X is superior to visualization Y in terms of speed and accuracy.

Using the standard approach described above, it is hard to know whether a visualization is better for other tasks and contexts, or how well the findings generalize. To operationalize generalizability, we can evaluate the different types of validity in a research study.

Ecological validity: how closely the conditions of an experiment match real-world conditions

External validity: the ability of the findings to generalize to contexts and groups other than the ones tested

Construct validity: the ability of a study to measure what it claims to measure

One way to evaluate which visualization is better (one that has merit in terms of generalizability) is to test what exactly, in the mind, produces differences in judgments between visualizations. Our neuronal responses to light are fairly consistent across people, making some mental processes highly generalizable. Mental effort is one cognitive process that we could use as an objective and generalizable measure of visualization quality. As an added benefit, by studying mental effort we can get clues about viewers’ mental representations of the data.

Quantifying mental effort

I’ll use language as a metaphor to illustrate what we know about mental effort. It is easy to understand a conversation in your native language. It is harder but not impossible to follow the same discussion in another language. If a visualization is communicated in your native tongue (in a way that matches how you naturally think about the data), then it is intuitive and doesn’t require any translation. If the visualization is communicated in a way that doesn’t match how you think about the data, then you have to translate the information to understand it. As with language, translating visual information is error-prone, time-consuming, and effortful.

Researchers have proposed methods for quantifying how difficult this translation is, motivated by the idea that the visualizations requiring the least translation are better. Vessey (1991, updated in 2006) proposed the theory of cognitive fit, which suggests that if there is a mismatch between viewers’ mental conception (or mental schema) of the data and the visualization, then people have to complete a mental transformation to make the two align (like translating between two languages). Vessey’s theory predicts that the more mental transformations are required, the more time to complete the task and errors should increase. However, using speed and accuracy as proxies for mental effort is a very indirect way of measuring a mental process.

As an alternative approach, I’ll describe methods that cognitive scientists have developed for more directly testing mental effort, or working memory. The methods for measuring working memory, detailed in the next section, have better construct validity than Vessey’s approach, meaning that they more closely measure what they claim to measure.

Testing working memory

Without getting too far into the cognitive weeds, working memory demand serves as one indication of mental effort, and it can be closely approximated with neuroimaging methods (increased blood flow to parts of the brain signals mental effort), pupillometry (our pupils dilate when we are thinking hard), individual differences (some people have a greater ability to sustain mental effort than others), and dual-task experimental designs. I’ll detail the basic theory of a dual-task experiment, which requires only clever experimental design rather than a bazillion-dollar magnet.

Dual-task experiments. The oversimplified version of the prerequisite theory is that working memory consists of multiple components in the mind, each of which can hold a limited amount of information for a short time; said another way, working memory is capacity limited. Contemporary theories also stress that working memory is what allows us to pay attention to task-relevant things, block out distractions, and suppress automatic responses, and this ability is also capacity limited (for much, much more on working memory, see Shipstead, Harrison, & Engle, 2015).

I’ll illustrate the basic premise of a dual-task experiment in the figure below as an adult beverage. Working memory is the glass, which is capacity limited. Visualization X requires some amount of mental effort to understand, depicted below as beer. Visualization Y is less intuitive and requires more mental effort (using more of the glass’s capacity) than visualization X. We don’t know exactly how much mental effort either requires, but we hypothesize that visualization X is more intuitive than Y. To test whether X requires less working memory capacity than Y, we have users complete a secondary task (such as mental arithmetic, denoted below as a bomb shot), which also requires mental effort. With a hard enough secondary task, people using visualization Y may exceed their capacity and their performance on the primary visualization task will tank, while performance with visualization X holds up. Researchers have validated dual-task experiments against neuroscience methods and shown that dual-task findings closely approximate working memory demand.
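The logic above can be sketched in a few lines of code. This is a minimal, hypothetical illustration, not real data or an analysis from any actual study: the accuracy numbers and the two-condition design are invented purely to show how a dual-task comparison is scored.

```python
# Hypothetical dual-task analysis sketch (invented numbers, not real data).
# Primary-task accuracy (proportion correct) for visualizations X and Y,
# measured with a low-load vs. high-load secondary task.
accuracy = {
    ("X", "low"): 0.92, ("X", "high"): 0.90,   # X: small cost under load
    ("Y", "low"): 0.91, ("Y", "high"): 0.74,   # Y: large cost under load
}

def load_cost(vis):
    """Drop in primary-task accuracy when the hard secondary task is added."""
    return accuracy[(vis, "low")] - accuracy[(vis, "high")]

# Dual-task logic: a bigger cost implies the visualization left less spare
# working-memory capacity (less room in the glass) for the secondary task.
for vis in ("X", "Y"):
    print(f"Visualization {vis}: load cost = {load_cost(vis):.2f}")
```

The key quantity is the interaction: both visualizations are scored under both load conditions, and it is the *difference in accuracy drops*, not raw accuracy, that indexes working memory demand.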

Closing thoughts

The moral of the story here is that one approach for deciding which visualization is better (an approach with significant merit in terms of generalizability) is to compare how much mental effort each visualization requires to use effectively. This approach is considered “mechanistic”, meaning that it focuses on the mental mechanism, or process, that drives a user’s actions.

Research that is informed by how we conceptualize visualizations can objectively quantify the mental processes associated with good encoding techniques, and suggest new ways to make visualizations even better.

Caveats. The concepts I discussed are based on a cognitive science perspective. While I made a case for objective measures, using converging measures, that is, multiple techniques to test the same theory, is the most thorough way of hypothesis testing. Many of the best evaluations take a holistic approach to assessment and test, retest, and retest a visualization again using many different methods.

For a review of modern theories of decision making with visualizations, with many examples of different types of empirical evaluations, see: