Infographics are all over the place nowadays. How do you know which ones to trust? Follow these three easy steps to save yourself from getting duped.

Faced with an unprecedented torrent of information, data scientists have turned to the visual arts to make sense of big data. The result of this unlikely marriage–often called “data visualizations” or “infographics”–has repeatedly provided us with new and insightful perspectives on the world around us.

However, time and time again we have seen that data visualizations can easily be manipulated to lie. By misrepresenting, altering, or faking the data they visualize, data scientists can twist public opinion to their benefit and even profit at our expense.

We have a natural tendency to trust images more than text. As a result, we’re easily fooled by data visualizations. But fortunately, there are three easy steps we can follow to save ourselves from getting duped in the data deluge.

The subtlest way a data visualization can fool you is by using visual cues to make data stand out that normally wouldn’t. Be on the lookout for these visual tricks.

Color cues
Color is one popular tool for making certain data stand out above the rest. When considering the map below, Kentucky and Utah (the darkest and the lightest) will most likely stand out to us first.

Image: Visualization c/o IBM Many Eyes

If this map were showing the percentage of the population that smokes (dark = more, light = fewer), we might quickly conclude that Kentucky has a serious smoking problem. But what if we looked at the raw numbers and saw that 27% of Kentuckians and 23% of Utahans smoke? Not so big a difference after all.

Make sure to look at what the colors actually represent before drawing a conclusion from the visualization.
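
To see how a color scale can exaggerate a small gap, here is a minimal Python sketch. It assumes the map shades each state by rescaling its value between the lowest and highest values on the scale, which is a common default but not necessarily how the map above was built. The function name `shade` is made up for illustration, and the 23% and 27% figures are the hypothetical smoking rates from the example.

```python
def shade(value, low, high):
    """Return a darkness level from 0.0 (lightest) to 1.0 (darkest)."""
    return (value - low) / (high - low)

utah, kentucky = 23.0, 27.0  # hypothetical smoking rates, in percent

# Scale stretched to fit only the observed range: the two states
# land at opposite ends of the color ramp.
print(shade(utah, 23.0, 27.0), shade(kentucky, 23.0, 27.0))

# Scale anchored to the full 0-100% range: the two states get
# nearly the same shade.
print(shade(utah, 0.0, 100.0), shade(kentucky, 0.0, 100.0))
```

The data are identical in both cases; only the choice of scale endpoints changes how far apart the two states look.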

Is an increase of 15 fraudulent visualizations from last year really “skyrocketing”? Don’t let the structure of the visualization decide that for you. Always check the numbers that the visualization is representing.

Not all data are created equal. Here’s how to sort out the good data from the bad.

Make sure the data source is reliable
Data collected by an amateur tends to be more error-prone than data collected by a professional scientist. Do a quick web search to see whether the people who collected and organized the data have a good track record.

Make sure the data source isn’t biased
A drug company may be inclined to present fake data showing that their latest drug is more effective than it really is, or a political organization may manipulate data to discredit their political opponents. Think twice when considering data provided by biased groups.

Many data sets require a little bit of house cleaning before they can be visualized, but excessive curation can be a sign of misrepresented data. Every good data visualization will come with the blueprints describing how the data was manipulated from its raw form into the visualization you see. Give the blueprints a quick read and watch out for the following data alterations.

Excluded data
If data were left out of the visualization, ensure that the explanations for excluding them are reasonable. Sometimes the real “explanation” is that the data inconveniently contrasted with the story the author wanted to tell.

Transformed data
Data transformation, the process of converting data from one format to another format, can complicate the relationships between data. It’s difficult to interpret a finding such as, “The log transform of a city’s productivity is related to the log transform of the city’s population.” See how that doesn’t make any sense to us in practical terms? That’s why transforming data can potentially be misleading. Be wary if several transformations have been applied to the data.
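
One way to make a log-log finding practical is to translate it back into ordinary units. The sketch below assumes the relationship has the form log(productivity) = a + b × log(population). The function name and the slope of 1.1 are made up for illustration and are not results from any study.

```python
def productivity_multiplier(population_multiplier, slope):
    """Under log(y) = a + slope * log(x), scaling x by a factor k
    scales y by k ** slope, whatever the intercept a is."""
    return population_multiplier ** slope

slope = 1.1  # hypothetical slope, chosen only for illustration

# What happens to productivity when a city's population doubles?
factor = productivity_multiplier(2.0, slope)
print(f"Doubling population multiplies productivity by about {factor:.2f}")
```

If a visualization's notes do not let you do this kind of back-translation, that is a reason to be more cautious about the transformed result.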

Statistics
Statistics are an often-abused tool in data science. “Fatal shark attacks have risen 100% this year” sounds like an alarming statistic until you realize that only one person was fatally attacked by a shark last year. Check the raw numbers when data visualizations present only the statistics.
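
The shark example can be checked with a few lines of Python. This is only a sketch: `percent_change` is a name chosen here for illustration, and this year's count of two is inferred from the 100% rise over last year's single attack.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100

# Fatal shark attacks: one last year, so a 100% rise means two this year.
last_year, this_year = 1, 2

print(f"Relative change: {percent_change(last_year, this_year):.0f}%")
print(f"Absolute change: {this_year - last_year} more attack")
```

The same 100% figure would describe a jump from 1,000 attacks to 2,000, which is why the raw counts matter.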

Comparing statistics is even trickier. If a survey shows that 50% of Latinos and only 30% of Caucasians enjoy watching baseball, those results could easily be due purely to chance if the survey interviewed only 20 people of each ethnicity. If the visualization doesn’t indicate how much confidence the comparison deserves (a measure called statistical significance), then we shouldn’t be confident in the comparison.
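
To get a rough feel for how weak the baseball comparison is, here is a sketch of one common check, the pooled two-proportion z-test. It is an assumption that this test suits the survey. With only 20 people per group the normal approximation is crude, and an exact test would be a safer choice in practice. The function name is made up for illustration.

```python
import math

def two_proportion_p_value(hits_a, n_a, hits_b, n_b):
    """Two-sided p-value for the difference between two proportions,
    using the pooled two-proportion z-test (normal approximation)."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # erfc(|z| / sqrt(2)) is the two-sided tail area of a standard normal.
    return math.erfc(abs(z) / math.sqrt(2))

# 50% of 20 people is 10; 30% of 20 people is 6.
p_value = two_proportion_p_value(10, 20, 6, 20)
print(f"p-value: {p_value:.2f}")
```

The p-value comes out at roughly 0.2, well above the conventional 0.05 cutoff, so a gap this size could plausibly arise by chance. The same percentages from 1,000 people per group would give a far smaller p-value.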

If the details on the data alterations aren’t provided with the visualization, always keep in mind how easy it is to make data lie when it’s visualized.

About the author

Randy Olson is a Computer Science graduate research assistant at Michigan State University specializing in artificial intelligence, artificial life, and evolutionary computation. He runs a research blog where he writes about Python, scientific computing, evolution, and AI.