Sophisticated ideas are difficult to get across in a chart. For instance, the NYT recently described the gender gap in the workplace by comparing the proportion of men versus women in managerial positions relative to the overall proportion. Two simultaneous comparisons are taking place, one between men and women, and the other between managerial positions and overall employment.

The published chart (below left) used eight pie charts. To my eyes, this graphic is confusing, not least because the two halves of the primary comparison, managers versus overall employment, are set far apart. The junkchart version (right) tries to fix this by showing the gender gap graphically as a horizontal line segment. Also, the 50% gray dotted line lets the reader see quickly that in the three industries where men make up a minority of overall employment, they still hold the majority of managerial positions.

The scatter plot construct allows easy understanding of correlation between two variables, the average annual income and child mortality. Moreover, the "sliding timebar" shows this correlation was quadratic/non-linear in 1975 but linear in 2000 (but note log-log scale).

Food for thought:

a) Readers of this blog won't want to hear my rant on bubble charts again (here, here and here). The most important message of this chart concerns regional differences in per capita income and yet the visual message is dominated by the overlapping bubbles (i.e. population size).

b) The size of bubbles also distracts when used in a scatter plot (see right), where the reader must identify the center of a bubble to figure out its x- and y- dimensions. This becomes difficult when bubbles of differing sizes overlap and obscure one another.

c) The log-log scale requires careful interpretation. For example, the statement cited on the right is erroneous because the child mortality rate in Africa actually dropped by about 18% in relative terms (from 22% to 18%), rather than remaining "almost the same". The confusion arises because Africa appears at the larger end of the log scale (for child mortality), where even small visual distances represent large separations in the raw data.

d) A clearer way to investigate changes over time is shown on the left. Each line traces the development path from 1975 through 1990 to 2003. Africa and Eastern Europe experienced negative per capita GDP growth. Meanwhile, Latin America and the Arab States had stagnant economies but rapidly declining child mortality. Asia, on the other hand, experienced fast growth but only a modest improvement in child mortality. Finally, the distance between the "high-income" OECD countries and the rest of the world is vast, especially when we note the use of a log scale.
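To make point c) concrete, here is a small sketch of the arithmetic. The 22% and 18% figures come from the text above; the second pair (2.2% and 1.8%) is a hypothetical low-mortality region invented purely for contrast, and the helper names are my own. On a log axis both pairs sit the same visual distance apart, even though the raw gap for Africa is ten times larger.

```python
import math

def relative_change(old, new):
    """Fractional change from old to new (negative means a drop)."""
    return (new - old) / old

def log_distance(old, new):
    """Distance between two values as drawn on a log10 axis."""
    return abs(math.log10(new) - math.log10(old))

# Figures quoted in the post: child mortality in Africa, in percent.
africa_old, africa_new = 22.0, 18.0

# Hypothetical region with the same ratio but at the low end of the scale.
other_old, other_new = 2.2, 1.8

africa_drop = relative_change(africa_old, africa_new)
print(f"Africa relative change: {africa_drop:.1%}")
print(f"Africa raw gap: {africa_old - africa_new:.1f} points")
print(f"Other raw gap:  {other_old - other_new:.1f} points")
print(f"Africa log distance: {log_distance(africa_old, africa_new):.4f}")
print(f"Other log distance:  {log_distance(other_old, other_new):.4f}")
```

The two log distances agree because a log axis encodes ratios, not differences, so a short segment near the top of the scale can hide a large change in the raw rate.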

Again, Gapminder anticipated this line of thinking and addressed this in Chapter 7. However, the insistent use of bubbles makes their development path chart more muddled than our version here.

Overall, Chapter 4 is another valuable contribution to our knowledge of this data; the result would have been even better if they had omitted the population size dimension.

PS. Thanks to Hadley for pointing out the need to indicate the directions of the paths. I have updated the chart now.

This first post concerns primarily the first three chapters which examine the distribution of income across the world.

Gapminder is a Swedish non-profit dedicated to using visualization software to enliven and disseminate social science data. The quality of these animations is impeccable, taking full advantage of innovative web-based technologies to explain graphical constructs visually. The site is invaluable for policy analysts who need to interpret the voluminous data, and for students of data visualization, although seasoned statisticians will find the pacing too sluggish.

A "sliding timebar" (see right) allows visitors to explore the change in the shape of the income distribution over time.

Food for thought:

a) The log scale for $ per day was not explained. In general, using the log scale involves a tradeoff: visual clarity is gained with better spacing between data, but distances are distorted so that an inch on the left side of the distribution is not the same (in terms of $ per day) as an inch on the right side. In fact, the log transformation artificially changes the visual shape of the distribution.

b) The important concept of PPP, which holds the key to comparability, is used but not explained. It is scary and sad to know that large portions of the world's population live on less than $10 per day (around US$3,600 per year), even after adjusting the number upwards to account for cheaper costs of living.

c) The tails of any income distribution get short shrift in this presentation even though the tails are crucial to understanding income inequality.
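The claim in a) that the log transformation changes the visual shape can be checked with a quick simulation. The sketch below uses synthetic log-normal "incomes" (not Gapminder data; the parameters and the seed are arbitrary assumptions) and a hand-rolled skewness function. It also spells out the annualization in b).

```python
import math
import random

def skewness(values):
    """Population skewness: 0 for a symmetric shape, positive for a long right tail."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / sd) ** 3 for v in values) / n

# Synthetic incomes in $ per day; log-normal is an assumption, chosen only
# because it is a common textbook stand-in for income data.
rng = random.Random(1)
incomes = [rng.lognormvariate(1.0, 1.0) for _ in range(10000)]
log_incomes = [math.log10(x) for x in incomes]

raw_skew = skewness(incomes)      # strongly right-skewed in raw dollars
log_skew = skewness(log_incomes)  # close to symmetric on the log scale

print(f"skewness on the raw scale: {raw_skew:.2f}")
print(f"skewness on the log scale: {log_skew:.2f}")

# $10 per day expressed per year, the figure rounded to "around US$3,600" above.
annual_at_10_per_day = 10 * 365
print(f"$10 per day is ${annual_at_10_per_day:,} per year")
```

The same data that looks like a tidy bell on the log axis has a long right tail in dollars, which is exactly the part of the distribution the reader cannot judge by eye.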

An immediate question comes to mind, which is where more and less developed nations fall in this distribution. Part of the amazing experience with this site is that the designers anticipated our questions and address them in later chapters.

II. Regional Income Distribution

Concepts used: stacked area chart, distribution by segment

This was not my favorite chapter, and I note these problems:

a) The stacked area chart does not do the data justice! The only distribution that is clearly visible is that of Africa (in pink) because it sits at the bottom. All other distributions are layered on top of one another, totally distorting their shapes. For example, the Latin America and Eastern Europe distributions (orange, green) look like flat pancakes on this chart.

b) The above problem is multiplied during the "sliding timebar" animation. Changes in shape over time, as visualized, are highly problematic.
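Problem a) follows from how a stacked area chart is built: each layer is drawn on top of the running total of the layers beneath it, so its outline mixes its own shape with everyone else's. A minimal sketch with made-up numbers (two hypothetical regions, not the Gapminder series) shows a perfectly flat layer picking up the hump of the layer below:

```python
def stack(layers):
    """Return the upper boundary of each layer in a stacked area chart."""
    boundaries = []
    running = [0.0] * len(layers[0])
    for layer in layers:
        # Each layer sits on the cumulative total of all layers below it.
        running = [r + v for r, v in zip(running, layer)]
        boundaries.append(running)
    return boundaries

# Hypothetical data: a peaked distribution at the bottom, a flat one on top.
bottom = [1, 4, 9, 4, 1]
top = [2, 2, 2, 2, 2]

bounds = stack([bottom, top])
top_outline = bounds[1]

# The reader sees the outline, but the data lives in the thickness.
thickness = [upper - lower for upper, lower in zip(bounds[1], bounds[0])]

print("outline of top layer:  ", top_outline)
print("thickness of top layer:", thickness)
```

Only the bottom layer has a flat baseline, which is why only Africa's shape can be read directly off the published chart.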

This chapter continues the graphical construct from the previous one but homes in on the segment living below the "poverty line".

The same problems apply. It is easy enough to note the temporal changes of those living at the poverty line but much harder to visualize these changes for people earning less than $1 per day.

In conclusion, Chapter 1 is chock-full of important concepts and clever visual explanations, but the graphical construct chosen for Chapters 2 and 3, the stacked area chart, leaves something to be desired. I'm eager to look at the other six chapters and will let you know what I think.

Today I return to the analysis of the sad tally. Are suicide locations on the Golden Gate Bridge random? More generally, how does one determine whether a sequence of numbers is random? The visual evidence, from cumulative distributions and box plots, tells us that the shape of the distribution matters. One way to compare two distributions directly is to compare their quantiles.

The following chart shows the (smoothed) cumulative distribution of some non-random data (Dataset 9) on the left, and randomly generated data on the right. It is clear that the two lines are not the same shape; is there a systematic way to compare them?

The orange line identifies the point at which the number of suicides equals 40% of the total. On the left, this means the number of suicides committed between locations 41 and 72 is 40% of the total. On the right, the same share occurred between locations 41 and 70. The pink line similarly marks the point at which the suicides equal 20% of the total. Notice that at this point on the distribution, the locations differ noticeably: 41-65 on the left versus 41-58 on the right.

Such comparisons can be made at different points on the distribution: 10%, 20%, 30%, etc. The result is a qqplot (quantile-quantile plot) as shown below. Each distribution is compared to an ideal "uniform" distribution (i.e. random), which is the straight line. Not surprisingly, the data on the right, generated randomly, looks much more like a random sequence. The left line is consistently above the straight line, which indicates a systematic difference from random.
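For readers who want to try the quantile comparison themselves, here is a minimal sketch. It does not use the bridge data: both samples are simulated, the location range of 41 to 140 is a hypothetical choice, and the function names are mine. One sample is drawn uniformly, the other is deliberately piled up toward the high end.

```python
import math
import random

def empirical_quantile(sorted_values, p):
    """Smallest observed value at or below which a fraction p of the data lies."""
    n = len(sorted_values)
    k = max(0, min(n - 1, math.ceil(p * n) - 1))
    return sorted_values[k]

def uniform_qq(values, low, high, probs):
    """Pair each quantile of Uniform(low, high) with the matching sample quantile."""
    ordered = sorted(values)
    return [(low + p * (high - low), empirical_quantile(ordered, p)) for p in probs]

probs = [i / 10 for i in range(1, 10)]  # 10%, 20%, ..., 90%
low, high = 41, 140                     # hypothetical range of locations

rng = random.Random(0)
# Simulated "random" data: every location equally likely.
random_like = [rng.uniform(low, high) for _ in range(2000)]
# Simulated non-random data: the square root pushes mass toward the high end.
clustered = [low + (high - low) * rng.random() ** 0.5 for _ in range(2000)]

qq_random = uniform_qq(random_like, low, high, probs)
qq_clustered = uniform_qq(clustered, low, high, probs)

for (ideal, observed_r), (_, observed_c) in zip(qq_random, qq_clustered):
    print(f"uniform {ideal:6.1f}   random {observed_r:6.1f}   clustered {observed_c:6.1f}")
```

Plotting the second column against the first gives points that stay near the 45-degree line, while the third column sits above it at every decile, the same pattern as the left panel described above.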

P.S. I have neglected the tricky issue of how much difference from random is required to pronounce the visual evidence conclusive. Usually, after inspecting graphs, we have to resort to mathematics by running statistical tests. But statistical tests, with the omnipresent p-values, often give a false sense of security, particularly where the theory is incomplete, as is the case in tests of randomness. Running statistical tests without visualizing the data is dangerous.