Tuesday, July 29, 2014

Yes-or-no survey data is phrased as if there are two possible responses, but there are three when you factor in "no answer" (a lack of answer can be for a variety of reasons). This makes it tricky to report on and visualize survey data: some stories a couple of weeks ago about Pew Research Center's survey of attitudes toward Russia in 44 countries (taken in March 2014 during the Crimea crisis but months before the shooting started and MH17 was downed) expressed the results in terms of most in favor of Russia; some expressed them in terms of least in disfavor.

These are not at all the same thing if they don't add up to 100%; for example, Pakistani respondents were 11% in favor of Russia, 29% in disfavor and a whopping 60% no answer. (I wonder how much is because people are too poor to have the luxury of an opinion about something that is not involved in feeding their family, how much is cultural parochialism, how much is an unwillingness to answer a stranger's questions, and how much is all the other possibilities.) In terms of favorable attitude to Russia, Pakistan is dead last. In terms of lack of disfavor, it's the country that's eighth-most aligned towards Russia!

I had the thought that all three responses could be portrayed with a ternary plot, which I used to use all the time in physical chemistry to chart properties of three-component mixtures. They can be a little weird for those used to thinking in 90-degree terms, but I think they're more intuitive in this context than in plots of the melting points of alloys:

There is obvious clustering in the colors of circles, which corresponds to geographical groupings. The western countries are near the "unfavorable" top left vertex, while eastern countries are more likely to be near the top right "favorable". Ukraine is split (as the current fighting attests), and it's the African, Asian and South American countries that tend to have the highest "no answer" rates.

If you're having trouble, here's a little guide:

Basically, the vertical line bisecting the triangle is where opinion is split 50/50 among those who answered; as you move down the line, there are fewer committed answers so the width of the triangle decreases.
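For the curious, here's a minimal sketch of how three percentages can be turned into a point in that triangle. It assumes the layout described above ("unfavorable" at the top left, "favorable" at the top right, "no answer" at the bottom) and a triangle with sides of length 1; the function name `ternary_to_xy` is my own for illustration and is not necessarily how the plots in this post were built.

```python
import math

def ternary_to_xy(favorable, unfavorable, no_answer):
    """Map three survey shares to (x, y) inside an inverted triangle.

    Assumed layout: unfavorable = top left (0, h), favorable = top right
    (1, h), no answer = bottom (0.5, 0), with h = sqrt(3) / 2.
    """
    total = favorable + unfavorable + no_answer
    fav = favorable / total
    na = no_answer / total
    height = math.sqrt(3) / 2
    x = fav + 0.5 * na        # weighted average of the vertex x-coordinates
    y = height * (1 - na)     # more non-answers pushes the point downward
    return x, y

# Figures quoted in this post
pakistan = ternary_to_xy(11, 29, 60)
lebanon = ternary_to_xy(45, 54, 1)
print(pakistan, lebanon)
```

Any country whose favorable and unfavorable shares are equal lands on x = 0.5, which is the vertical bisecting line, and the room to move left or right shrinks in proportion to the share of people who answered.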

The good thing about this plot is it can be adapted to a color scale (color scales are usually more suited to binary data), and hence to a map. There were few maps made when Pew released their report, and I suspect that's because it's hard to find a robust metric to show.

The intensity of the color shows how many people answered the question. The lighter the shade, the more non-answers; Pakistan's color is barely visible.
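A rough sketch of how such a scale might be computed: pick a hue from the split among those who answered, then fade it toward white as the share of non-answers grows. The two base colors and the linear blending are my assumptions for illustration, not necessarily the exact scheme used for the map.

```python
def opinion_color(favorable, unfavorable, no_answer,
                  fav_rgb=(230, 120, 0), unfav_rgb=(0, 90, 200)):
    """Return a hex color: hue from the favorable/unfavorable split,
    intensity from the share of respondents who answered.

    fav_rgb and unfav_rgb are arbitrary placeholder colors.
    """
    total = favorable + unfavorable + no_answer
    answered = favorable + unfavorable
    if answered == 0:
        return "#ffffff"  # nobody answered: fully washed out
    share = favorable / answered
    # Blend linearly between the two base colors
    hue = [u + (f - u) * share for f, u in zip(fav_rgb, unfav_rgb)]
    # Fade toward white in proportion to the non-answers
    strength = answered / total
    rgb = [round(255 + (c - 255) * strength) for c in hue]
    return "#{:02x}{:02x}{:02x}".format(*rgb)

print(opinion_color(11, 29, 60))  # Pakistan: pale
print(opinion_color(45, 54, 1))   # Lebanon: nearly full intensity
```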

We can color our circles in the ternary plot with this scale as well:

In many ways, I think this is a more satisfying (although definitely less simple) approach than the usual bar and line graphs, such as this example by Radio Free Europe / Radio Liberty:

There's absolutely nothing wrong with it; you have to make a choice about how to present your data, and this choice is valid. It's clear. But it's also misleading: look at all the orange (unfavorable, the opposite of my graph; that's the problem with the colorblind-friendly blue and orange, it's less obvious how to use them than red and green) on the rightmost bars. It's all over the place, since they've chosen to list the countries according to the leftmost bars. It's impossible to read patterns in the middle bars.

The true advantage of a two-dimensional plot is you can go full Hans Rosling and use animation to show change over time. For 36 of the 44 countries, Pew had stats for 2013 as well as 2014, so we can see how much difference a year and an international crisis make:

The next two visualizations are a bit data-geeky; consider yourself warned.

The effect of ordering in terms of most favorable or least unfavorable attitudes can be seen if we just make the two lists and join every country with a line (blue by default, red if they change positions by more than 11 places). I've also added a column for another metric, the ratio of those in favor of Russia to those not in favor (but it's no more meaningful than the other measurements; less so, because there is no way to include the number of non-committals in a ratio):
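To make the three orderings concrete, here is a toy sketch. Only the Pakistan and Lebanon figures come from this post; the "Example" rows are made-up placeholders so there is something to rank, and the threshold of 2 places is scaled down from the 11 used for the full 44-country list.

```python
# (favorable %, unfavorable %). Pakistan and Lebanon are quoted in the
# post; the "Example" rows are invented placeholders, not survey results.
survey = {
    "Pakistan": (11, 29),
    "Lebanon": (45, 54),
    "Example A": (60, 30),
    "Example B": (20, 70),
    "Example C": (35, 40),
}

def rank_by(key, reverse):
    """Return {country: place}, with place 1 the most Russia-friendly."""
    ordered = sorted(survey, key=key, reverse=reverse)
    return {country: place for place, country in enumerate(ordered, start=1)}

most_favorable = rank_by(lambda c: survey[c][0], reverse=True)
least_unfavorable = rank_by(lambda c: survey[c][1], reverse=False)
by_ratio = rank_by(lambda c: survey[c][0] / survey[c][1], reverse=True)

THRESHOLD = 2  # the post uses 11 places for 44 countries
for country in survey:
    shift = abs(most_favorable[country] - least_unfavorable[country])
    color = "red" if shift > THRESHOLD else "blue"
    print(country, most_favorable[country], least_unfavorable[country],
          by_ratio[country], color)
```

Even in this tiny list, Pakistan goes from last by favorable share to first by lack of disfavor, which is the same kind of jump described at the top of the post.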

The Spearman coefficient (it's sort of like a Pearson correlation r, but computed on rankings rather than raw values) between the first and second columns is 0.454, so there is a moderate predictive relationship between the number in favor and the number in disfavor, but it's full of outliers. When we plot the standard deviation of the three rankings against the percentage of people who did not answer, there are two clusters:

This is easily explained by the ternary nature of the data. The upper cluster shows that the more people did not answer the survey, the more volatile the nation's ranking when you change metrics between the possible responses. This makes perfect sense. And the bottom cluster shows that nations that are nearly evenly split, like Lebanon with 45% favorable, 54% unfavorable and only 1% no answer, sit near the middle of the rankings and so are more vulnerable to being shifted up or down by the more volatile non-committed nations.
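For anyone who wants to reproduce the two statistics used here, a minimal sketch follows. It uses the textbook Spearman formula for rankings without ties, and the population standard deviation for the spread of a country's three rankings; whether the plot above used the population or the sample version is an assumption on my part. The rank lists are made up purely to exercise the functions.

```python
import statistics

def spearman_rho(ranks_a, ranks_b):
    """Spearman's rho for two tie-free rankings of the same items:
    1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the rank difference."""
    n = len(ranks_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))

def rank_spread(*ranks):
    """Population standard deviation of one country's rankings."""
    return statistics.pstdev(ranks)

# Made-up rankings of five items under two metrics
print(spearman_rho([1, 2, 3, 4, 5], [2, 1, 3, 5, 4]))

# A country ranked 5th, 1st and 4th is far more volatile than one
# ranked 3rd on every metric
print(rank_spread(5, 1, 4), rank_spread(3, 3, 3))
```

A rho of 1 means two metrics produce identical orderings and -1 means they are exactly reversed, so 0.454 is agreement with plenty of exceptions.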

This is hardly any kind of profound insight, but I find it an interesting window into the nature of the data. Feel free to disagree, I thrive on adversity! (If by 'thrive', you mean buy ice cream and lock myself in the basement.)

I used the Python library Bokeh for the first time to make the interactive plots; it was quite an experience trying to use a tool in beta, asking stupid questions on StackOverflow and getting answered by the developers! While the documentation is as good as can be expected (if not better), there are still some massive holes for the unwary. Still, it beats doing it all in JavaScript, given my skills. The code is in a GitHub gist, and here's an NBViewer version.
Thanks to the viewer who pointed out that I had the "Unfavorable" and "Favorable" vertices incorrectly labelled, quite an egregious and avoidable brain fart on my part. Luckily I was able to correct it after about three hours of looking like an idiot before the Internet!