Prominent marketer Seth Godin came up with some sensible rules for making "graphs that work". We pretty much agree with most of what he says here, unlike the last time he talked about charting.

One must recognize that he has a very specific type of chart in mind, the purpose of which is to facilitate business decisions. And not surprisingly, he advocates simple, predictable story-telling.

His first rule: dispense with Excel and Powerpoint. Agreed, but to our dismay, there are not many alternatives out there that sit on corporate computers. So we need a corollary: assume that Excel will unerringly pick the wrong option, whether for gridlines, axis labels, font sizes, colors, etc. Spend the time to edit every single aspect of the chart!

His second rule: never show a chart for exploration or one that says nothing. I used to call these charts that murmur but do not opine. (See here, for example.) This pretty much condemns the entire class of infographics as graphs that don't work. This statement will surely drive some mad. One of the challenges facing infographics is to bridge the gap between exploration and enlightenment, between research and insight. As I have said repeatedly, I value the immense amount of effort taken to impose structure and clarity on massive volumes of data -- but more is needed for these to jump out of the research lab.

In rules 3 and 4, Seth apparently makes a distinction between rules made to be followed and rules made to be broken. In his view, time going left to right belongs to the former while not using circles belongs to the latter. He gave a good example of why pictures of white teeth are preferred to pie charts, bravo. I hope all those marketers are listening.

As readers know, I cannot agree with "don't connect unrelated events". He's talking about using line charts only for continuous data. This rule condemns the whole class of profile plots, including interaction charts in which statisticians routinely connect average values across discrete groupings. The same rule has created the menace of grouped bar charts used almost exclusively to illustrate market research results (dozens to hundreds of pages of these for each study). I'd file this under rules made to be broken!

What menace?

What menace?

What menace?

What menace?

Alright, I made my point. If you don't work in market research, the mother lode of cross-tabs and grouped bars, consider yourself lucky. If you do, will you start making line charts please?

The NYT team just put up a fantastic visualization of the American Time Use Survey data, which purports to measure how the average American spends a day. (Apparently, thousands of people recalled what they did every minute of an average day.) The amount of data collected is massive, and this graphic allows readers to explore the data in intuitive ways.

The chart shows for each minute of the day (horizontal axis) the proportion of people doing specific activities. Not surprisingly, we spend more time sleeping than any other type of activity. The axis and data labeling as well as gridlines are very restrained.

Normally, I am not a big fan of these proportional area charts because the only relevant dimension to look at is the vertical distance from one curve to the next, but the focus on areas puts equal weight on the horizontal and vertical distances. The horizontal distance is meaningless, and thus the area is meaningless.
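To make the point about vertical distance concrete, here is a minimal sketch in Python. The activity names and proportions are numbers I made up for illustration, not figures from the survey, and the helpers `stack` and `band_heights` are hypothetical names. In a stacked area chart, each curve is a running total, so the proportion for an activity at a given minute is simply the vertical gap between its curve and the curve below it.

```python
from itertools import accumulate

def stack(proportions):
    """Turn per-activity proportions at one minute into stacked curve heights."""
    return list(accumulate(proportions))

def band_heights(stacked):
    """Recover each activity's proportion as the gap between adjacent curves."""
    lower = [0.0] + stacked[:-1]
    return [top - bottom for top, bottom in zip(stacked, lower)]

# Hypothetical proportions at a single minute of the day (invented numbers).
activities = ["sleeping", "working", "eating", "other"]
at_one_minute = [0.5, 0.25, 0.125, 0.125]

curves = stack(at_one_minute)
recovered = dict(zip(activities, band_heights(curves)))
```

Nothing in this calculation involves the horizontal axis, which is why the area between two curves carries no extra information.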

These designers found a solution to the problem, and good for them! Because of the mouse-over effect, I could not save the actual appearance -- here, I show what it looks like.

By mousing over different parts of the graph (say, moving vertically), we can compare the actual proportions. Terrific!

Andrew posted a pretty chart that caught my attention. This is the sort of sophisticated chart that rewards careful reading.

Below is a guide to reading the chart:

It is a small multiples chart with the components arranged in two dimensions (income levels, and a race-religion hybrid category). The top row is a summary of voters of all race-religion categories grouped by income. Note that there is no corresponding summary column for voters of all incomes grouped by race-religion.

Source of data: 2000 poll but applied to 2008 demographic patterns. In other words, there is an underlying assumption that opinions have stayed stable within the demographic groups.

The chart is in fact three dimensional because each map gives us the geographical (state by state) breakdown.

It is useful to figure out the smallest unit of data: in this case, this is the percentage support of federal school vouchers by voters of a given race-religion-income-and-state category.

The color scheme is such that red represents the highest support and blue the lowest support, with pink and purple in the middle.

It's almost always better to start from the aggregate (that is, the average) and then study variations along different dimensions, and this is how the chart is arranged from top to bottom.

On the top row, the higher income groups tended to favor vouchers more than lower income groups, with a break point around $75k; even here, the regional differences are significant, with the northeast and southwest hotter for vouchers at all income levels.

As we move from row to row, we realize that the aggregate data hides many disparities. For example, white Catholics (second row) are more likely to support vouchers regardless of income level while white non-evangelical Protestants (fourth row) are much less likely than average to support vouchers at all income levels.

Notice that the statistician (Andrew) has carefully defined the race-religion categories to strike a balance between collapsing subgroups that are distinct and showing so many subgroups that the patterns are clouded. That is why there are many more race-religion subgroups that are not shown. The ones shown are of special interest. Consider the white Protestants, evangelical vs. non-evangelical (third and fourth rows). If one were to fix the race, geography and income dimensions, and even fix half of the religion dimension, we still find the two subgroups to be on different ends of the spectrum relating to the voucher issue. This is why the evangelical-or-not dimension has been included.

The white space is interesting. Here, the issue faced by the statistician is sparse data when one gets down to multi-dimensional subgroups. Andrew chose to leave out the data for such sparse subgroups altogether, which is the wise thing to do. With so few samples, it is particularly easy to draw bad conclusions.

Because of the white space, we get additional information on the spatial distribution of the demographic subgroups. The black population (at least the voters) is predominantly found in the southeast while Hispanics are in the southwest. The subgroup with income higher than $150k is essentially all white. Admittedly, this is a very crude read because we only have two levels (below 2% of state population and above). Among the colored states, we cannot differentiate between densely populated and not.
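The white-space rule can be sketched in a few lines of Python. This is only my reading of the chart, not Andrew's actual code: the 2% cutoff comes from the description above, while the function name `cell_value`, its parameters, and the counts are all hypothetical.

```python
def cell_value(subgroup_count, state_count, support_pct, min_share=0.02):
    """Return the support percentage to plot, or None to leave the state white.

    A state is left blank when the subgroup makes up less than `min_share`
    of the state's population (assumed here to be the 2% cutoff).
    """
    if state_count <= 0:
        return None
    share = subgroup_count / state_count
    if share < min_share:
        return None  # too sparse: show white space rather than a noisy estimate
    return support_pct

# Invented counts: a subgroup at 1% of a state is blanked, one at 5% is drawn.
sparse = cell_value(10, 1000, 60.0)
dense = cell_value(50, 1000, 60.0)
```

Note that the output keeps only two levels, blank or colored, which is exactly why the spatial read is so crude.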

From Bernard L., another exemplary effort by the Times. This one really got me excited.

The set of line graphs shows how demographics of students in American schools have evolved in the last two decades. Here, I selected New York City schools, and the tool sensibly decided to compare those with New York State schools (gray line).

There is so much to learn from one simple chart:

The blue and gray lines are almost parallel everywhere, which tells us that in terms of the change in demographic composition, New York City pretty much resembled New York State during this entire period.

However, in terms of demographic composition, rather than the change in composition, New York City schools are very different from the rest of the state, in that the proportion of white students is lower by a third while that of minorities is much higher, especially black and Hispanic students.

State-wide (as well as city-wide), black and white students have been declining as a proportion while Hispanics and Asians have increased.

The extent of the change is immediately visible: Asians have jumped from 7% to 14%, for example.
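The difference between composition and change in composition is easy to blur, so here is a small worked example in Python. Only the 7% to 14% figure comes from the chart; the other numbers are invented, and `describe_change` is a hypothetical helper. Two lines are parallel when their change in percentage points is the same, even if their levels, and hence their relative changes, are very different.

```python
def describe_change(start_pct, end_pct):
    """Return (change in percentage points, relative change in percent)."""
    points = end_pct - start_pct
    relative = points / start_pct * 100
    return points, relative

# From the chart: Asian students went from 7% to 14%.
asian = describe_change(7, 14)

# Invented numbers: two lines that drop by the same number of points
# (parallel on the chart) but start from different levels.
line_a = describe_change(40, 35)
line_b = describe_change(20, 15)
```

A jump from 7% to 14% is 7 percentage points but a doubling in relative terms, and the line chart lets us read both at once from the labeled endpoints.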

From a graph design perspective, the execution is very clean. Data labels are limited to the first and last values. A small multiples concept is used with the ethnic groups placed side by side. A great awareness of foreground and background as well. And imagine how much data has been visualized here, and be impressed. You can look at any county in the country.

Here's one where the county change does not exactly mirror the state change (Napa in California):

Reference: "Diversity in the classroom", New York Times, March 12 2009.

Jerome C., a reader and blogger, wrote up a wonderful piece on different ways to publish charts on the web. Highly, highly recommended.

*** Rant ***

One of the points he made was that images (jpegs, gifs, etc.) are often published with poor quality. I feel the pain. Ever since Typepad switched to its new and "improved" editor, this blog has been suffering from low-quality thumbnails. I know, I know... I need to move to Wordpress. But from a 15-minute online research effort, I realized that moving a blog with lots of images is rather impossible! All of the images would have to be uploaded, and a lot of links would need to be fixed. Maybe the next time I am on holiday, I will get around to it.

*** Excel ***

Take a look at his comparisons of four ways to forklift an Excel chart onto a blog. (The image on the right showed one of the four ways.) The difference in image sharpness is marked.

Resizing Excel charts is a common source of headache. Always right-size the chart inside Excel before exporting!

*** Swivel, Google, etc. ***

I also share Jerome's point of view on these on-line graphics creators. Good idea, wishing for more. In his words:

"to make a point, you absolutely need to be able to control every aspect of your graph, even if its form remains familiar: combine series, group or highlight some datapoints, format axis, and so on."

I would like to explore the other options he cited, such as Processing.

*** Great example ***

Jerome's blog has a promising beginning. The following chart is both informative and beautifully crafted. It brings out the clear message that OECD countries have done admirably well in life expectancy, and have been particularly impressive in reducing the variance among member countries by lifting the expected age of the worse-off, relative to the better-off, with most of the gain happening during the 1980s. (Adding quartiles may also be meaningful. And I prefer to put the labels outside the plot area.) The graph does not explain what caused the shift in the 1980s, but this is a great starting point for the curious.

Gelman pointed to this Brendan Nyhan post dissecting David Sirota's chart purportedly showing a "race chasm" in the Democratic primaries. The left chart is David's original and the right is a Nyhan revision.

Please see Nyhan for the political interpretation. Here, I want to note a number of improvements Brendan made to the chart:

Sirota plotted the ranks of the percent of black population, which is misleading. Nyhan plotted the actual percentages on his horizontal axis.

Sirota connected the dots which highlighted the noise (ups and downs) in the data. Nyhan fitted a linear model (he also tried other non-linear versions).

Nyhan exposed the excluded states in a footnote. Sirota didn't. For this chart, this piece of information is very important since so many states were excluded.
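The first two improvements can be illustrated with a minimal Python sketch. The data below are invented (they are not Sirota's or Nyhan's numbers), and `ranks` and `fit_line` are hypothetical helpers; the fit is plain least squares, which I am assuming is close in spirit to Nyhan's linear model rather than a copy of it.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; ties are not handled in this sketch."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Invented data: percent black population and a candidate's margin by state.
pct_black = [1, 2, 3, 30]
margin = [10, 8, 6, -48]  # constructed to lie on margin = 12 - 2 * pct_black

x_ranks = ranks(pct_black)
intercept, slope = fit_line(pct_black, margin)
```

On the rank scale the four states sit one step apart, so the jump from 3% to 30% looks no bigger than the jump from 2% to 3%. Plotting the actual percentages and fitting a line summarizes the trend without tracing every up and down.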

Nyhan walked us through multiple charts he used to explore the data. Much of the time was spent picking and choosing states to include or exclude. We learnt that Sirota excluded states with large Hispanic populations, which Nyhan disagreed with, while Nyhan wanted to exclude Florida, which Sirota decided against, even though Sirota excluded Michigan, which Nyhan consented to, but Nyhan also wanted to exclude the caucus states, and so on...

Judging from the charts, this picking and choosing appears not to have changed the outcome in this case. In general, one should exercise great care in such decisions because one might end up seeing what one wants to see.

The following chart is missing from the post, which I think points out something more telling than the negative correlation between Obama's margin with white voters and the proportion of black population.

Reader Daniel sent us a great example of how even little things matter a lot in chart-making. The left chart is the original. The right chart (created by Daniel) shuffled the order of the legend to match the curves, and spaced them out. All of a sudden, the chart is much easier to read.
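I don't know how Daniel built his version, but the reordering idea is simple enough to sketch in Python. The series names and values are invented, and `legend_order` is a hypothetical helper: it sorts the legend labels by where each curve ends, so the legend reads top to bottom in the same order as the lines at the right edge of the chart.

```python
def legend_order(series):
    """Order legend labels by each curve's final value, highest first."""
    return sorted(series, key=lambda label: series[label][-1], reverse=True)

# Hypothetical series, keyed by legend label (invented numbers).
series = {
    "A": [1, 2, 3],
    "B": [5, 6, 9],
    "C": [4, 4, 5],
}

order = legend_order(series)
```

Sorting by the last value is one choice; if the curves cross near the right edge, sorting by the value where the legend actually sits may match the eye better.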

I love articles that expose the behind-the-scenes of creating complex graphs. This Wall Street Journal blog post tells us some dirty secrets behind these cartograms that depict the "influence" of different media outlets throughout the world.

In the second interesting item of the week, I return to the fabulous Google Finance chart, which shows the distribution of stock market returns by sector. I wrote about it twice (here and here). In the original post, I saluted the engineers for figuring out the formidable technical issues of turning a live dynamic data stream into a live dynamic graphic but didn't go into details. (Trust me.)

The other night, this chart popped up on my browser.

Oops.

If someone kept a tally, I would guess that such a mishap shows up probably 1-5% of the time.

The triple challenge of generating this graphic is the volume of data that needs to be processed, the velocity at which it changes, and the flicker of time from input to output, probably not more than a few minutes. The analysis and charting must be maintained continuously during market hours. For any such project, the thing to manage is the error rate, and one should be totally thrilled if it's in the range Google engineers have achieved.