Communicating data science: Why and (some of the) how to visualize information

This is the third post in the series on communicating data science. For an interview with a storytelling expert and a guide to presenting an analysis, check out the communicating data science tag.

As an undergraduate, I took a class on ancient Peruvian counting systems. In it we learned about how quipus, cords of varying colors and lengths tangled in knots, were used to record and transmit data. The instruments, also suggestively referred to as talking knots, demonstrate the human tendency to distill the mildly context-sensitive complexities of natural language to simpler, perceptually compelling units of communication. The size and number of knots, a tint of color, the length of a cord.

Fast forward a few centuries and we've witnessed the rapid evolution and spread of data visualization as a technology and communication tool. Instead of quipus, we have increasingly sophisticated, popular tools like ggplot2, Tableau, and d3.js. At the same time, the motivations and mechanisms for relying on visualization to transmit information from my brain to yours have remained fundamentally the same.

In this blog post, I do as so many have done before me (but with my own spin): introduce you to the why and (some of the) how of communicating information through visualization.

Why visualize data

There are a number of reasons for using perceptual (visual, tactile, or other non-verbal) means to communicate data. The goal in using visuals to represent information, I believe, is to tell a story in a way that reduces the cognitive burden that language can carry. Here's how I break it down:

First, language is hard. Maybe you've heard that it's easy to learn a first language as an infant. Although it may come naturally, it takes years of observation and practice before children are able to advance beyond babbling to mastery of all of the speech sounds in their language. Of course the payoff to learning language is huge, but don't under-appreciate the effort that goes into using language to communicate.

Second, reading is harder. It takes even longer to learn how to read and write. Writing systems are also a human invention so all that innate ability you may have had to acquire a language is out the window.

Third, perception is a common language. You don't learn the ABCs of your senses in the same way you do a language. Visual designer Vivian Peng explains in a talk she gave at the 2016 NYC R conference how and why appealing to our perceptions can "give us the feels" in a way that words cannot.

By appealing to our emotions, visualizations become not just ways to convey information but powerful tools for influence. With great power comes great responsibility, however; Christopher Ingraham of The Washington Post explains in this story how data visualization is "just as much an art as a science." Small choices in colors can have a huge impact on the story you tell. And it's definitely worth mentioning that while I say perception is a common language, there are significant individual variations, like color blindness, that are important for data visualizers to consider.

Finally, time is of the essence. How many times have you heard something along the lines of "We consume 1014812898 pieces of data every single day"? It's a cliched question because it's clear by now our brains are barraged by a constant stream of information. A visualization must be good in order to first catch our attention and to keep it before we click the next link on our Twitter feeds. But this is certainly easier to do with visuals than it is with a daunting wall of text.

So what are the elementary ingredients of a data visualization? These are the minimal cues that may communicate some contrastive meaning and can therefore be used to tell the story in your data. These small choices answer questions like, "what does each of the dots represent in a scatterplot?" and "what do red and blue each mean in this line graph?"

How to visualize data: Detail & Color

In the second half of this blog post, I spend some time covering a couple of the core areas that help us communicate meaning in our data: level of detail and color. There are of course many, many more considerations and best practices, but this will get you started thinking about data visualization and the decision-making processes involved.

Level of detail

What is level of detail? Building a visualization all starts with figuring out what you're trying to represent and the level of detail required. Knowing at what level of detail to represent your data forces you to address the question you've set out to answer in the first place. If the level of detail you use in your visual is apparent, it will be easier for your audience to follow what questions you're attempting to answer with the data.

If you're familiar with Hadley Wickham's principles of tidy data, I'd equate level of detail with knowing what the observations and variables are in your dataset. To demonstrate, I'll show you a couple of examples that use different levels of detail to answer different questions about some data.

Level of detail - Fine. The Airplane Crashes since 1908 dataset on Kaggle contains worldwide historical data originally sourced from Open Data Socrata. The figure below shows the number of fatalities for each individual recorded crash. Each point represents an event.

Points represent individual airplane crash fatality events.

The information that this simple figure provides is a sense of the number of events and a general sense of patterns in crash fatality severity over time. But how useful is this figure likely to be?

Level of detail - Medium. If we have some other question in mind, we may make the decision to represent the data at a different level of detail. For instance, what if we want to know the total number of fatalities per year? In the figure below, the height of each bar represents the number of deaths from all events in a single year.

This time the heights of the bars represent the total number of fatalities in a year.

This time, the above figure straightforwardly answers the question of how many total airplane crash deaths occurred each year. We can see when the number of total deaths peaked, that several years had no fatalities, and that the number of fatalities appears to have decreased in recent years. All (mostly) without words.

Level of detail - Coarse. The coarsest level of detail may be, for example, a single number: the total number of fatalities recorded in the dataset. I could visualize the number as an annotation to a figure or I could simply use it in text. If this were R Markdown, I could write: "This dataset records `r sum(data$Fatalities)` fatalities in total."
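To make the three levels concrete, here is a minimal Python sketch (my choice of language for illustration) that takes the same records and summarizes them at fine, medium, and coarse levels of detail. The `crashes` rows are made-up placeholder values, not rows from the actual Kaggle dataset, and the variable names are hypothetical.

```python
from collections import defaultdict

# Hypothetical (year, fatalities) records; NOT taken from the real dataset.
crashes = [
    (1972, 12),
    (1972, 30),
    (1985, 5),
    (1985, 0),
    (1999, 8),
]

# Fine: one observation per crash, i.e. one point per row in a scatterplot.
fine = crashes

# Medium: one observation per year, i.e. one bar per year in a bar chart.
per_year = defaultdict(int)
for year, fatalities in crashes:
    per_year[year] += fatalities
medium = dict(sorted(per_year.items()))

# Coarse: a single number summarizing the whole dataset.
coarse = sum(fatalities for _, fatalities in crashes)

print(len(fine), "points")
print(medium)
print(coarse, "fatalities in total")
```

Each step down in detail throws away information (which crash, then which year), so the right level depends on the question you want the figure to answer.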

Color

What felt missing from those two figures above? Color!

The temptation to add visual flair with a splash of color is hard to overcome. But I want to caution you that color is a vehicle for conveying contrastive meaning, so it should be used to highlight or reinforce differences of some kind in order to effectively tell its story. If you've ever seen a figure made using ggplot with a unique color for 20 or more different factor levels (cough, cough), you'll understand why perceptual contrast is key when it comes to your usage of color.

The nature of the variables you're depicting guides the choice of color palette. Lisa Charlotte Rost, an OpenNews fellow on the NPR Visuals Team, gives a resource-laden how-to guide for picking the best colors for the data, and story, you have at hand.

Personally, I've quickly developed an affinity for the viridis color scales which I first learned about from Kaggle Kernels like this one. As the R package authors Bob Rudis, Noam Ross and Simon Garnier succinctly note in their vignette, the four available color scales are effective because they:

make plots that are pretty,

better represent your data,

[are] easier to read by those with colorblindness, and

print well in [grayscale]

The color palettes available in the R viridis package. Read more about the package in the vignette.

As these are all important factors, this is a great place to start (arguably better than the ggplot defaults we're so used to seeing). Here are a few examples for both categorical and sequential data.

Sequential data

Color gradients can be used with data that falls on some continuum from low to high where you specify the unit of measurement (e.g., temperature in Fahrenheit). They can also be used with other types of discretized scalar data like counts, which also go from low to high (for example, number of days since you last visited the dentist).
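As a rough illustration of what a gradient does under the hood, the Python sketch below maps a number onto a color between a "low" and a "high" endpoint. The function names are hypothetical, and the default endpoints are the dark purple and bright yellow at the two ends of the viridis scale. Note that a straight blend between two endpoints is a simplification and not how the viridis package computes its colors; the real scale passes through blues and greens and is designed to be perceptually uniform.

```python
def lerp_color(low, high, t):
    """Blend two (r, g, b) colors; t=0 gives low, t=1 gives high."""
    return tuple(round(a + (b - a) * t) for a, b in zip(low, high))


def sequential_color(value, vmin, vmax, low=(68, 1, 84), high=(253, 231, 37)):
    """Map a value in [vmin, vmax] to a hex color on a low-to-high gradient."""
    if vmax <= vmin:
        raise ValueError("vmax must be greater than vmin")
    t = (value - vmin) / (vmax - vmin)
    t = min(1.0, max(0.0, t))  # clamp out-of-range values to the endpoints
    r, g, b = lerp_color(low, high, t)
    return f"#{r:02x}{g:02x}{b:02x}"


# Example: days since a last dentist visit, on a 0-365 scale.
for days in (0, 180, 365):
    print(days, sequential_color(days, 0, 365))
```

Because the mapping is monotonic, a larger value always lands further along the gradient, which is what lets the audience read "low to high" straight from the color.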

Categorical data

Another use of color is to distinguish levels of categorical (also known as factor, qualitative, or nominal) variables. Here each level of a factor represents one unique flavor of something. Like apples are to oranges, you want the audience of your visual to understand immediately and without needing to squint that these two things are completely different.
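In code, this amounts to giving each factor level its own clearly distinct color. The Python sketch below is one hedged way to do that; the `PALETTE` hex values and the `categorical_colors` helper are assumptions for illustration, not part of any particular plotting library. It refuses to recycle colors when there are more levels than colors, since two levels sharing a color defeats the purpose.

```python
# Hypothetical palette of visually distinct colors (an assumption for illustration).
PALETTE = ["#1b9e77", "#d95f02", "#7570b3", "#e7298a", "#66a61e"]


def categorical_colors(levels, palette=PALETTE):
    """Assign one distinct color per factor level, in order of first appearance."""
    unique = list(dict.fromkeys(levels))  # de-duplicate while keeping order
    if len(unique) > len(palette):
        raise ValueError(
            f"{len(unique)} levels but only {len(palette)} colors; "
            "consider grouping levels rather than recycling colors"
        )
    return dict(zip(unique, palette))


tiers = ["Novice", "Kaggler", "Master", "Novice", "Kaggler"]
print(categorical_colors(tiers))
```

The hard failure on too many levels is a design choice: it nudges you to group or filter categories before plotting instead of ending up with 20 near-identical hues.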

For example, the figure below created by Competitions Master Willie Liao effectively relies on color to communicate the shifting proportions of participation rates of Masters, Kagglers, and Novices over time under the former progression system. Check out the full kernel here.

One thing I'll note is that color here does not convey the ordered nature of the factor levels. That is, it's not true that blue > green > salmon so the onus is on the audience to know this through other means. This isn't essential here, but could be useful depending on the context.

There are a couple of interesting things to note here. First, it's amazing to me how easy it is to distinguish the ten factor levels using this color palette. Where color is the sole indicator of character class, this is crucial. Second, you can see how this color palette would be well suited to ordered factors as well, as long as the number of levels doesn't get too high. That is, there's a clear low-to-high continuum.

In any case, using the viridis palette is just a suggestion which may not be the best way to communicate information in all cases. The point is that it commonly addresses important considerations made in the process of building a data visualization:

Use color to show discernible differences in your data, i.e., convey contrastive meaning

As the subtitle says, the figure below represents the standard deviations of the fastest 200 men's and women's times at five race distances since 2000 from the average of the best times of the top 50 swimmers. Its creator uses contrasts in color and transparency to emphasize how far Katie Ledecky's times are from the average. Nothing about the data itself changes, but the careful manipulation of color helps us quickly glean the message (despite the wordy subtitle).

Using color and transparency to highlight a series.

Relatedly, Andy Kirk, a freelance data visualization expert, explains in this post exactly why you should "make grey your best friend when colouring your visualisation work." Some of his most useful points include:

Accentuate for focus

See the overall shape of all data

Placeholder for zero/null

I highly recommend his blog as an excellent reference for how to do data visualization.
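The "accentuate for focus" idea can be sketched in a few lines of Python: every series gets a muted grey with low opacity except the one you want the audience to see. The helper name, the specific grey and accent hex values, and the opacity levels are all assumptions chosen for illustration; the returned color and alpha pairs would then be passed to whatever plotting tool you use.

```python
GREY = "#bdbdbd"  # muted context color (an arbitrary choice)


def highlight_styles(series_names, focus, focus_color="#d95f02"):
    """Return {name: (color, alpha)}: the focus series vivid, the rest grey."""
    if focus not in series_names:
        raise ValueError(f"{focus!r} is not one of the series")
    return {
        name: (focus_color, 1.0) if name == focus else (GREY, 0.3)
        for name in series_names
    }


# Placeholder series names; only the focus series keeps a saturated color.
swimmers = ["Ledecky", "Swimmer B", "Swimmer C"]
for name, (color, alpha) in highlight_styles(swimmers, "Ledecky").items():
    print(name, color, alpha)
```

The underlying data is untouched; only the styling changes, which mirrors the point made about the swimming figure above.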

I've only managed to cover some of the decision-making involved in effectively visualizing data, but I hope that even if you only use these tips for exploratory data analysis, they'll lighten your cognitive load enough to allow you to perfect your cross-validation strategy. I look forward to what you create. Please share links to your data visualizations in the comments -- we'd love to see (and upvote) them!