Sunday, April 14, 2013

Making your message visible: Trend lines in scatter plots

If you teach infographics and visualization, here's an example to use when explaining the differences and similarities between designing for your peers (or for analyzing your data) and designing to communicate with broader, nonspecialized audiences. An hour ago, while reading The New York Times*, I came across a lovely scatter plot, the first picture above this paragraph. I read its headline and deck: "Mirth and Taxes. A study of 54 nations (...) found that those with more progressive tax rates had happier citizens, on average."

Then, I took a look at the graph and felt startled and puzzled. I tweeted:

I couldn't see a solid relationship between the variables. In fact, a lot of countries are "more progressive but less happy" than the United States.

A minute later, Stuart Allen sent me the link to the research paper the NYT graph is based on. It showcases a similar scatter plot in which the line of best fit is kept. It also discusses several correlation coefficients the researchers calculated, and they don't look trivial, to say the least (see third screenshot above: r=.41). Isn't the line a critical element here, as it highlights the upward trend? Doesn't dividing the space of the graph into four quadrants make its message murkier, as the positive slope is not that visible? Or is it just me?

*In The Functional Art and also in previous posts in this blog I've mentioned that I am old-fashioned. I still get the print paper every morning.

5 comments:

Agree on all points. What follows is perhaps more technical than you would be interested in, but I will continue purely for the sake of statistical education.

If you divide a scatterplot into quadrants like this, but place the lines at the means of the x and y values, you get a clearer picture of the nature of the relationship (although still not as obvious as with the regression line). This is because points in either the lower-left or the upper-right quadrant make a positive contribution to the covariance, and points in the upper-left and lower-right quadrants make a negative contribution. So simply counting the points gives a general indication of the direction of the relationship: if (LL + UR) has more points than (UL + LR), it is probably positive, and if it has fewer, probably negative. This is IMO a useful way to visualize what correlation/covariance is, and also to understand what high-leverage and outlier observations are (useful things to know and understand if you're placing regression lines on your plots!)
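To make the counting argument concrete, here is a minimal Python sketch. The data are made up purely for illustration (they are not the values from the study), and the function name `covariance_contributions` is just a label chosen here. Each point's contribution is the product of its deviations from the two means, which is positive in the lower-left and upper-right quadrants and negative in the other two.

```python
from statistics import mean

def covariance_contributions(xs, ys):
    """Return each point's product of deviations from the two means."""
    mx, my = mean(xs), mean(ys)
    return [(x - mx) * (y - my) for x, y in zip(xs, ys)]

# Made-up illustrative data, NOT the values from the study.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [3, 1, 6, 2, 4, 8, 5, 7]

contribs = covariance_contributions(xs, ys)

# Positive products come from points in the lower-left or upper-right quadrant,
# negative products from points in the upper-left or lower-right quadrant.
concordant = sum(1 for c in contribs if c > 0)
discordant = sum(1 for c in contribs if c < 0)

# The sample covariance is the sum of the contributions divided by n - 1.
sample_cov = sum(contribs) / (len(xs) - 1)

print(concordant, discordant, sample_cov)  # prints: 6 2 4.0
```

Six of the eight points pull the covariance up and two pull it down, so the count alone already hints at a positive relationship. The count ignores how far each point sits from the means, which is why a single high-leverage point can matter more than several points near the center.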

The way the NY Times graphic actually divvies up the quadrants visually makes it appear like there is no relationship when there is. This is because the quadrants aren't drawn at the means of the x and y values, but at the value of the United States at each of those coordinates (which appears above the mean on the happiness axis). Because there aren't that many points "above and to the right" of the United States, it isn't clear from the scatterplot that the relationship between the two variables is positive, but that ends up being misleading.
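The effect of where the dividing lines sit can be sketched in a few lines of Python. Everything here is an assumption for illustration: the points are made up (not the study's data), `quadrant_counts` is a hypothetical helper, and the "reference" point simply stands in for a country that sits high on the vertical axis.

```python
def quadrant_counts(points, x_ref, y_ref):
    """Count points in each quadrant around the dividing lines x = x_ref, y = y_ref."""
    counts = {"LL": 0, "LR": 0, "UL": 0, "UR": 0}
    for x, y in points:
        if x == x_ref or y == y_ref:
            continue  # skip points that sit exactly on a dividing line
        vertical = "U" if y > y_ref else "L"
        horizontal = "R" if x > x_ref else "L"
        counts[vertical + horizontal] += 1
    return counts

# Made-up illustrative data, NOT the values from the study.
points = [(1, 3), (2, 1), (3, 6), (4, 2), (5, 4), (6, 8), (7, 5), (8, 7)]

# Dividing lines drawn at the means of x and y.
mean_x = sum(x for x, _ in points) / len(points)
mean_y = sum(y for _, y in points) / len(points)
around_mean = quadrant_counts(points, mean_x, mean_y)

# Dividing lines drawn through one high-scoring point (a hypothetical stand-in
# for a reference country).
reference = (6, 8)
around_reference = quadrant_counts(points, *reference)

print(around_mean)       # prints: {'LL': 3, 'LR': 1, 'UL': 1, 'UR': 3}
print(around_reference)  # prints: {'LL': 5, 'LR': 2, 'UL': 0, 'UR': 0}
```

The data and their correlation are identical in both cases. Only the position of the dividing lines changes, yet the upper-right quadrant goes from three points to none, which is the kind of visual impression described above.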

If the goal was merely to show that the relationship between tax burden and happiness was positive, the quadrants drawn in this graphic are actually harmful to that goal (e.g. even without the line, it is easier to see a slight positive relationship in the scatterplot from the research article than in the Times graphic, although the unusual aspect ratio, with the y axis longer than the x axis, probably does not help either). If another goal was to show which countries had both a higher tax burden and were "happier" than the U.S., there are likely alternative ways (e.g. coloring the points) that don't obfuscate other general relationships in the data.

Even with the trend line, the relationship in the second graph isn't obvious. r = 0.41 seems pretty weak, but then, I was raised in the physical sciences. But as Alberto and Andy both point out, the shaded quadrants obscure any relationship.

The quadrants are adding too much clutter for the (rather weak) trend to be visible. What's also interesting is that the aspect ratio in the original is wider, which makes the trend more apparent. The NY Times one is taller and not as wide, so the points appear to be more random.

Thank you all. I may try to do what Andy suggested when I present this in classes.

By the way, .41 is not weak, as far as I know, at least in the social sciences. This may depend on many factors, such as what you're comparing. Take my words with a grain of salt here. I am not a data analyst, but an amateur.

It may be the case that the two graphs have different purposes based on the same dataset.

Andy touched on it by suggesting the NYT may be trying to simply show which countries are more or less happy than the US as a function of their taxation system (as opposed to showing a correlation between taxation and happiness). If that were the case, isn't the purpose of the visualisation fundamentally different to that of the scientific paper?

This all suggests a good starting point is: "what was the purpose of the visualisation?". Understanding that is paramount to commentary on its effectiveness. Of course, if the purpose itself isn't apparent, that's another matter...