Ted Ballachine wrote me about his website Pension360 pointing me to a recent attempt at visualizing pension benefits in various retirement systems in the state of Illinois. The link to the blog post is here.

One of the things they did right is to start with an extended guide to reading the chart. This type of thing should be done more often. Here is the top part of this section.

It turns out that the reading guide is vital for this visualization! The reason is that they made some decisions that shake up our expectations.

For example, a person's service increases as you go down the vertical axis, not up.

I have recommended that they flip both axes, since there doesn't seem to be a strong reason to break these conventions.

***

This display facilitates comparing the structure of different retirement systems. For example, I have placed next to each other the images for the Illinois Teachers' Retirement System (blue) and the Chicago Teachers' Pension Fund (black).

It is immediately clear that the Chicago system is miserly. The light gray parts extend to only about half the width of the blue cells in the top chart. The fact that the annual payout grows roughly linearly as the years of service increase makes sense.

What doesn't make sense to me, in the blue chart, is the extreme variance in the annual payout for the beneficiary with "average" tenure of about 35 years. If you look at all of the charts, there are several examples of retirement systems in which employees with similar tenure have payouts that differ by an order of magnitude. Can someone explain that?

***

One consideration for those who make heatmaps using conditional formatting in Excel.

These charts encode the count of people in the color shades, with the entire table as the reference population. This is not the only way to encode the data, and it prevents us from understanding the sparsely populated regions of the heatmap.

Look at any of the pension charts. Darkness reigns at the bottom of each one, in the rows for people with 50 or 60 years of service. This is because there are few such employees (relative to the total population). An alternative is to color code each row separately. Then you have surfaced the distribution of benefits within each tenure group. (The trade-off is the revised chart no longer tells the reader how service years are distributed.)
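To make the two coloring schemes concrete, here is a minimal Python sketch using hypothetical counts standing in for the pension table (the tenure rows, payout bins, and numbers are all invented for illustration):

```python
from collections import defaultdict

# Hypothetical (tenure, payout bin) -> employee count, standing in
# for the pension heatmap data.
counts = {
    (10, 0): 500, (10, 1): 300, (10, 2): 50,
    (35, 0): 200, (35, 1): 900, (35, 2): 400,
    (55, 0): 5,   (55, 1): 8,   (55, 2): 2,
}

def shade_whole_table(counts):
    """Scale every cell against the global maximum (the whole-table scheme)."""
    peak = max(counts.values())
    return {cell: n / peak for cell, n in counts.items()}

def shade_by_row(counts):
    """Scale each cell against the maximum of its own tenure row."""
    row_peak = defaultdict(int)
    for (tenure, _), n in counts.items():
        row_peak[tenure] = max(row_peak[tenure], n)
    return {(t, b): n / row_peak[t] for (t, b), n in counts.items()}

global_shades = shade_whole_table(counts)
row_shades = shade_by_row(counts)

# Whole-table scaling leaves the sparse 55-year row uniformly pale...
print(global_shades[(55, 1)])  # 8 / 900
# ...while row-wise scaling surfaces the payout distribution in that row.
print(row_shades[(55, 1)])     # 8 / 8 = 1.0
```

The trade-off mentioned above is visible in the code: `shade_by_row` throws away `row_peak`, which is exactly the information about how service years are distributed.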

Excel's conditional formatting procedure is terrible. It does not remember how you code the colors. It is almost guaranteed that the next time you go back and look at your heatmap, you can't recall whether you did this row by row, column by column, or the entire table at once. And if you coded it cell by cell, my condolences.

Thanks to reader Charles Chris P., I was able to get the police staffing data to play around with. Recall from the previous post that the Washington Post made the following scatter plot, comparing the proportion of whites among police officers relative to the proportion of whites among all residents, by city.

In the last post, I suggested making a histogram. As you see below, the histogram was not helpful.

The histogram does point out one feature of the data. Despite the appearance of dots scattered about, the slopes (equivalently, angles at the origin) do not vary widely.

This feature causes problems with interpreting the scatter plot. The difficulty arises from the need to estimate dot density everywhere, and it is, sad to say, introduced by the designer: the data are overly granular. In this case, the proportions are recorded to one decimal place, so a city at 10% is shown separate from one at 10.1%. The effect is to jitter the dots, which muddies up the densities.
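To make the granularity point concrete, here is a minimal Python sketch that rounds both proportions into coarse bins and counts the cities per cell; the city figures are made up for illustration:

```python
# Hypothetical city-level data: (percent white officers, percent white
# residents), recorded to one decimal place as in the published data.
cities = [(100.0, 73.2), (100.0, 88.1), (10.1, 12.4), (10.0, 12.5),
          (54.3, 60.0), (55.0, 61.2), (80.2, 65.9)]

def bin_counts(cities, width=10):
    """Round each axis down to a coarse bin and count cities per cell."""
    grid = {}
    for officers, residents in cities:
        cell = (int(officers // width) * width,
                int(residents // width) * width)
        grid[cell] = grid.get(cell, 0) + 1
    return grid

grid = bin_counts(cities)
# Cities at 10.0% and 10.1% now share the (10, 10) cell instead of
# appearing as two separate, jittered dots.
print(grid[(10, 10)])  # 2
```

The bin `width` parameter is the design knob: widening it trades positional precision for readable densities, which is the move made in the heatmap versions below.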

You no longer have every city plotted, but you have a better view of the landscape. You learn that most of the action occurs in the top row, especially at the top right. It turns out there are lots of cities (22% of the dataset!) with 100% white police forces. This group of mostly small cities is obscuring the rest of the data. Notice that the yellow cells contain very little data, fewer than 10 cities each.

For the question the reporter is addressing, the subgroup of cities with 100% white police forces is trivial. Most of these places have at least 60% white residents, frequently much higher, but if every police officer is white, then the racial balance will almost surely be "off". I now remove this subgroup from the heatmap:

Immediately, you are able to see much more. In particular, you see a ridge in the expected direction. The higher the proportion of white residents, the higher the proportion of white officers.

But this view is also too granular. The yellow cells now have only one or two cities. So I collapse the cells.

More of the data lie above the bottom-left-top-right diagonal, indicating that in the U.S., the police force is skewed white on average. When comparing cities, we can take this national bias out. The following view does this.

The point marked by the circle is the average city, located at relative proportions of zero and zero. Notice that now the densest regions cluster around the 45-degree dotted diagonal.
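A minimal sketch of this centering, with hypothetical city proportions and a simple unweighted average as the reference point (the actual national figures may well be computed differently, e.g. weighted by population):

```python
# Hypothetical data: per-city (percent white officers, percent white residents).
cities = [(85.0, 70.0), (60.0, 55.0), (40.0, 50.0), (90.0, 80.0)]

# Unweighted national averages serve as the reference "average city".
avg_officers = sum(o for o, _ in cities) / len(cities)
avg_residents = sum(r for _, r in cities) / len(cities)

# Relative proportions: each city expressed as a deviation from the
# average city, which now sits at (0, 0).
relative = [(o - avg_officers, r - avg_residents) for o, r in cities]

# Cities near the 45-degree line through (0, 0) match the national bias;
# those far from it deviate from the national pattern and merit a look.
distance_from_diagonal = [abs(do - dr) / 2 ** 0.5 for do, dr in relative]
```

Subtracting the averages is what "takes the national bias out": a city on the diagonal is exactly as skewed as the country overall, no more and no less.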

To conclude, the Washington Post data appear to show these insights:

There is a national bias: whites are more likely to be in the police force.

In about one-fifth of the cities, the entire police force is reported to be white. (The following points exclude these cities.)

Most cities conform to the national bias, within an acceptable margin of error.

There are a small number of cities worth investigating further: those that are far away from the 45-degree line through the average city in the final chart shown above.

Showing all the data is not necessarily a good solution. Indeed, it is frequently a suboptimal design choice.

The following Wall Street Journal chart caught my eye the other day: (Link to article)

Looking closely, I realize that the four charts are identical, except for the call-outs. This is a kind of small multiples in which the same data reside in each panel but the labeling changes. It's planned redundancy, but I'm afraid I don't see the point.

The chart compares four different ways to save money by cutting cable. Here is an alternative that places the focus on the number of dollars saved:

In a comment to my previous post, reader Chris P. pointed me to the following set of maps, also from the New York Times crew, on the legalization of gay marriage in the U.S. (link)

(For those who did not click through, the orange colors represent two types of bans while the dark gray color indicates legalization.)

These maps are pleasing to the eye for sure. By portraying every state as a same-sized square, the presentation avoids the usual areal distortion introduced by the map.

But not so fast. Note that each presentation makes its own assumption about the relative importance of states. The typical map assigns weights according to geographical area, while this presentation assumes that every state has equal weight. Another common cartographic display uses squares of different sizes, based on the population of each state.

The locations of the states are necessarily distorted. One way to remedy this is to have hover-over state labels. In a browser, such interactivity works better than having to scroll to the top, where there is a larger map that doubles as the legend.

It would also be interesting to learn about the future. Is any legislation in the pipeline either to legalize gay marriage in the remaining orange states or to overturn the legalization laws in the gray states?

PS. [5/6/2015] Here is an alternative presentation of this data by David Mendoza.

Those who attended my dataviz talks have seen a version of the following chart that showed up yesterday on New York Times (link):

This chart shows the fluctuation in Arctic sea ice cover over time.

The dataset is a simple time series but contains a bit of complexity. There are several ways to display this data that help readers understand the complex structure. This particular chart should be read at two levels: there is a seasonal pattern, illustrated by the dotted curve, and then there are annual fluctuations around that average seasonal pattern. Each year's curve is off from the average in one way or another.

The 2015 line (black) is hugging the bottom of the envelope of curves, which means the ice cover is at a historic low.

Meanwhile the lines for 2010-2014 (blue) all trace near the bottom of the historic collection of curves.

***

There are several nice touches on this graphic, such as the ample annotation describing interesting features of the data, the smart use of foreground/background to make comparisons, and the use of countries and states (note the vertical axis labels) to bring alive the measure of ice coverage.

PS. As Mike S. pointed out to me on Twitter, the measure is "ice cover", not ice volume so I edited the wording above. The language here is tricky because we don't usually talk about the "cover" of a country or state so I am using "coverage". The term "surface area" also makes more sense for describing ice than a country.

The chart took a little time to figure out. This isn't a bad chart. Robbi wondered if there are alternative ways to plot this information.

The U.S. population is divided into percentiles across the horizontal axis, presumably based on the income distribution in some year (I'm guessing 2007, the start of the recession). For each percentile of people, the real per capita growth (decline) in disposable income is computed for two periods: the blue line shows the decline during the recession (2007-2010) and the orange shows the growth (in some cases further decline) during the recovery (2010-2013).

This chart draws attention to the two tails of the distribution, namely the bottom 10 percent and the top 5 percent. At one level, these two groups (excepting the bottom 2%) experienced the best of the recovery. But then, they also suffered the worst declines during the recession.

***

Here is one possible view of the same data, in a format with which I have been experimenting recently. You might call this a Bumps panel or a slopegraph panel.

The slopes draw attention to the relative magnitude of the declines and the subsequent recoveries. (I thinned the middle 80% substantially because there isn't much going on in that part of the dataset.) If I had more time, I'd have chosen a different color instead of grayscale for those lines.

I ignored any questions I have about the underlying data. How is disposable income defined and measured? Does it carry the same meaning across the entire spectrum of the income distribution? Etc. (Milanovic points to the Survey of Consumer Finances as the source.)

***

One reason for the reading difficulty is the absence of a reference point. It's unclear how to judge the orange line. Two answers are suggestive (but problematic). One is the zero line: which segments of the population experienced a recovery and which didn't? Another is the mirror image of the blue line: how much of what one lost during the recession did one recover by 2013 (roughly speaking)?
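The two candidate readings can be expressed as simple arithmetic. A sketch, with hypothetical percentile values (note that percent changes across periods compound rather than add, so the mirror-image reading is only approximate):

```python
# Hypothetical per-percentile changes in real disposable income:
# blue = % change 2007-2010 (recession), orange = % change 2010-2013.
blue   = {"p10": -12.0, "p50": -5.0, "p95": -15.0}
orange = {"p10":   8.0, "p50":  2.0, "p95":  18.0}

# Reference 1: the zero line -- did the group grow at all in the recovery?
recovered_at_all = {p: orange[p] > 0 for p in blue}

# Reference 2: the mirror image of the blue line -- roughly, what share
# of the recession loss was won back? (A compounding-exact version would
# compare (1 + orange/100) * (1 + blue/100) against 1.)
share_recovered = {p: orange[p] / -blue[p] for p in blue}
print(share_recovered["p95"])  # 18 / 15 = 1.2
```

Both references are easy to compute, which is precisely why they invite the equal-guilt / equal-spoils readings questioned below.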

Both of these easy interpretations worry me because they carry an assumption of equal guilt (blue line) and/or equal spoils (orange line). It is very possible that the unwarranted risk-taking or fraud was not evenly spread out amongst the percentiles, and if so, it is impossible to judge whether the distribution exhibited in the blue line was "fair". It is then also impossible to know if the distribution contained in the orange line was "fair". Indeed, if the orange line mirrored the blue line, then all segments recovered similarly what they lost--this would only make sense if all segments are equally culpable in the recession.

It's great when my friend Alberto Cairo lends a helping hand (link). Here is the original chart showing deaths in African and Middle East countries due to recent unrest:

This is Cairo's redesign:

There is no doubt the new version brings out the data more clearly. I like the cropping of the continent. I'd color-code the countries using the same legend as above.

I'm troubled by the concept of the original chart. I struggle to find any interesting correlation of deaths, whether with time, with government reaction, or with geography. Of the three, I think geography is the most correlated so a good design should bring that out. (Of course, geographical bias is expected and thus rather boring.)

If the intention of the chart is to answer the question of what factors affect deaths, then the wrong variables are being utilized.

So, as regards the Trifecta Checkup, Cairo solved the V problem while the D problem remains.

I like this New York Times graphic illustrating the (over-the-top) reaction by the New York police to the Eric Garner-inspired civic protests during the holidays. This is a case where the data told a story that mere eyes and ears couldn't. The semi-strike was clear as day from the visualization.

There are three sections to the graphic, and each displays a different form of comparisons.

The first chart is the most straightforward, comparing the number of summonses this year to that of the same time a year ago.

One could choose lines for both data series. The combination of one line and column also works. It creates a sensation that the columns should grow in height to meet last year's level. The traffic cops appear to have returned to work more quickly. That said, I don't care for the shades of brown/orange of the columns.

***

The second chart accommodates a more complex scenario, one in which the simple year-on-year comparison is regarded as misleading because the overall crime rate materially dropped from 2013 to 2014. In this scenario, a before-after comparison may be more valid.

The chart has multiple sections, and I am only showing the section concerning summonses. (The horizontal axis shows time: the first black column covers the first ten months, and the orange columns show individual months since then. The vertical axis is the percent change from a year ago.)

The chart shows that in the first ten months of 2014, before the semi-strike, the number of summonses issued was already slightly below the same period the year before. Through the dotted line, the reader is invited to compare this level of change against those in the ensuing months. How starkly the summons rate fell!

***

The final chart reveals yet another comparison. Geography is introduced here in the form of a proportional-symbol map.

Again, you can't miss the story: across every precinct, summonses have disappeared. This chart is very helpful in making the case that the observed drop is not natural.

I'd like to start 2015 on a happy note. I enjoyed reading the piece by Steven Rattner in the New York Times called "The Year in Charts". (link)

I particularly like the crisp headers, and unfussy language, placing the charts at the center. The components of the story flow nicely.

***

Here are my notes on some of the charts:

This chart is missing context, namely performance against population growth or potential. Changing the context also changes the implicit yardstick. The implied metric here is more-than-zero growth, or continued growth.

It took me a while to find the titles to know what each section depicts. I'd prefer to put the titles at the top or in the top left corner. The "information in my head" is making me look at the "wrong" places. But otherwise, this is Tufte goodness.

This innocent thing prompts a host of questions. First, how could a "median" be found to have so many values within one population? It would appear that this is an exercise in isolating each quintile (decile in the case of the top 20%) and computing the median within each segment. In other words, the data represent these income percentiles: 95th, 85th, 75th, 50th, 30th and 10th. Given that the income data have already been grouped, computing group averages makes more sense than calculating group medians. This is especially so when comparing changes over time. The robust median suppresses changes.

The bucketing of income presents another challenge. All buckets except the very top are essentially bounded: the central buckets have minimum and maximum values, and the bottom bucket is bounded below by zero. The top bucket, however, is basically unbounded, so important features of the data could be lost by summarizing it with its median.
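A tiny illustration of why the median can hide what happens in the unbounded top bucket; the incomes are invented for the example:

```python
from statistics import mean, median

# Hypothetical incomes (in $1,000s) for the top bucket in two survey years.
# The bucket has no upper bound, so gains can concentrate at the far tail.
top_2007 = [210, 250, 300, 400, 900]
top_2013 = [210, 250, 300, 400, 2500]  # only the very top grew

# The bucket's median is unchanged even though the mean jumped:
print(median(top_2007), median(top_2013))  # 300 300
print(mean(top_2007), mean(top_2013))      # 412 732
```

This is the flip side of the median's robustness: the very property that protects it from outliers also erases real movement when the movement happens only in the tail.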

A third problem surfaces if one were to inquire how the survey collects its data. According to the Federal Reserve description, the data concern "usual income" as opposed to "actual income". Respondents are told to ignore "temporary" conditions in describing their "usual incomes". It is likely the case that people think income increases are permanent while getting laid off is temporary so while usual income solves one problem (the long-term planner's problem), it creates a different problem (short-term bias). I particularly don't think it is a good metric for assessing changes around a recession/recovery.

I also wonder about the imputation of missing data. I'd assume that possibly there is a preponderance of missing values for unemployed people. If the imputation cannot predict the employment status of those people, then it would surely have inflated incomes.

I wonder if any of my readers knows details about some of these potential problems. Would love to hear how the Fed's statisticians deal with these issues.

On this chart, the author has found an excellent story, and the graphic is effective. I prefer to see the horizontal axis labelled "More Unequal" as opposed to "Less Equal" because of the convention that "more" is usually placed to the right of "less" on the horizontal axis. Here is a scatter plot version of the data:

It shows the U.S. is a bit more extreme than all others.

This is another great chart. I like the imagery of the emptying middle. I find the labels a bit too long and requiring too much interpreting. I prefer this: