A matter of timing

Carly C., a reader from Streetsblog, created the following chart and wanted to know if there are better ways to present the data. She already disliked the double axes and had considered various options, including using a relative scale.

Generally speaking, a dual-axis chart in which each axis takes its own scale is like a football team with two "good" quarterbacks rotating under center, or a company with two "great" CEOs sharing power. We have never seen those situations work out.

When we have two quantities under comparison, we like to put them on the same scale. In this case, converting the scale from absolute numbers to relative would do the trick.

The data paint a powerful story: as bike volume increased over time, bike accidents decreased. The stitching together of two lines at year 1999 was an artifact of manipulating the scales. What Carly had in mind can be accomplished using an index set at 100 in 1999. This would lead to the chart shown at left. The substance of this chart and Carly's original is the same, but the revised one has a single axis.
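For readers who want to see the arithmetic, here is a minimal sketch of indexing in Python. The function name `index_series` and all of the numbers are made up for illustration; they are not the Streetsblog data, and this is not necessarily how the charts in this post were produced.

```python
def index_series(values, years, base_year, base=100.0):
    """Rescale values so that the value in base_year equals `base`."""
    base_value = values[years.index(base_year)]
    return [base * v / base_value for v in values]


# Hypothetical numbers for illustration only (not the real data).
years = [1998, 1999, 2000, 2001]
volume = [800, 1000, 1100, 1300]      # assumed bike counts
crashes = [5500, 5000, 4800, 4500]    # assumed crash counts

volume_idx = index_series(volume, years, base_year=1999)
crash_idx = index_series(crashes, years, base_year=1999)

# Both series now share one scale and both equal 100 in 1999.
for year, v, c in zip(years, volume_idx, crash_idx):
    print(year, round(v, 1), round(c, 1))
```

Because each series is divided by its own 1999 value, the two lines are forced to meet at 100 in that year, which is the "stitching" described above.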

Indexing time series data is a widely used technique. Each issue of the Economist, for example, contains many such charts. This type of chart, however, suffers from a critical and under-appreciated problem: the visible pattern often depends heavily on timing. Specifically, it makes a huge difference which year is selected as the baseline (index=100).

A lot of mischief is possible by picking a special baseline. For example, I created the same chart three times, using 1998, 1999 and 2000 respectively as baselines. When 1999 was 100 (middle chart), a criss-cross pattern showed up between 2001 and 2002, leading readers to conclude that the gap between growth in volume and growth in accidents developed during 2001. In the other two charts, the gap appeared around 2000. Also, the bottom chart exhibited a clear growing gap (after dumping the disagreeable data before 2000).
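The baseline effect can be reproduced with a toy example. The sketch below uses invented numbers (a series with a one-year spike against a flat series) and a hypothetical helper `gap_signs`; it reports which indexed line sits on top in each year for three different baselines.

```python
def index_series(values, base_pos, base=100.0):
    """Index a series so that the value at position base_pos equals `base`."""
    return [base * v / values[base_pos] for v in values]


def gap_signs(a, b, base_pos):
    """+1 where indexed a is above indexed b, -1 where below, 0 where equal."""
    ia = index_series(a, base_pos)
    ib = index_series(b, base_pos)
    return [(x > y) - (x < y) for x, y in zip(ia, ib)]


# Hypothetical data: series a spikes in 1999, series b is flat.
years = [1998, 1999, 2000, 2001, 2002]
a = [100, 150, 110, 120, 160]
b = [100, 100, 100, 100, 100]

for base_pos, base_year in enumerate(years[:3]):
    print(base_year, gap_signs(a, b, base_pos))
```

With the spike year 1999 as the baseline, series a stays below series b until 2002, so the lines appear to cross between 2001 and 2002. With 1998 or 2000 as the baseline, a is already above b by 2001. The underlying numbers are identical in all three cases; only the baseline changes.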

Unfortunately, this is a feature of such charts; whether or not timing distorts the information presented depends on how rugged the underlying data is. Put another way, these charts can be affected by outliers. (In this example, there were sharp changes in bike volumes in 1998-2000.)

PS. [5/12/2008] How opportune was Andrew's post on R graphics default headaches. I was too lazy to figure out the defaults and let R figure out the dimensions (poorly); with Jake's suggestions, the new set of charts looked much better.

Comments

Any time I see a chart like this where the bottom of the y-axis is not zero, I distrust it.

One needs to be able to see the difference in the final two values with respect to zero. In other words, the opening might be only 0.02% on one graph but 70% on another, yet look exactly the same due to the selection of the y-axis range.

The two-axis chart has an important advantage: tangibility. Some people find it much easier to trust a chart if they can find tangible numbers such as 6000 bike crashes, as compared to an intangible number such as a bike crash index of 120.

Grant, I think in this post all the charts that need to have y-axis beginning from zero actually have it there (the original graph).

The "zero-level" of index charts here is 100, and that's centered in the middle of the graph. Value 0 has no meaning in these charts and displaying it would just be confusing.

I agree with Michael that displaying a metric calculated from these two time series could be a good option, although that hides the interesting info that absolute values of both variables are changing, rather than just one of them.

To avoid the baselining and tangibility issues, one could use a panel chart, where the two series occupy parallel panels in the chart. Each panel has its own scale, without normalizing, so the reader can see actual values, and in separate panels there's no way the lines will cross and lead to spurious conclusions.

I rarely used charts of indexed timeseries, but tried one recently after reading this post and learned that they can be treacherous! With all the media coverage of rising fuel prices, I got hold of some data for Sydney retail petrol prices and wholesale crude oil and gasoline prices. Rather than doing the sensible thing and converting to an equivalent unit ($ per L), I thought indexed timeseries would be a shortcut. I didn't think it through. The chart showed wholesale prices increasing much more than retail, suggesting that retail prices could increase further. Of course, since retail prices are wholesale + margin, without significant increases in the margin, the retail growth rate should only be a fraction of wholesale and the proper chart in common units showed no divergence. Next time I'll be more careful!
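The margin effect described above is easy to check with a few lines of Python. The prices and the fixed margin below are invented for illustration, not the Sydney data, and `growth_index` is a hypothetical helper.

```python
def growth_index(values, base=100.0):
    """Index a series to `base` at its first observation."""
    return [base * v / values[0] for v in values]


# Hypothetical prices in $ per litre; the constant margin is an assumption.
wholesale = [0.50, 0.75, 1.00]
margin = 0.50
retail = [w + margin for w in wholesale]

wholesale_idx = growth_index(wholesale)  # wholesale doubles
retail_idx = growth_index(retail)        # retail rises by half as much
spread = [r - w for r, w in zip(retail, wholesale)]  # stays at the margin

print(wholesale_idx, retail_idx, spread)
```

In this made-up case the indexed lines diverge even though the gap in common units never moves from the assumed margin, which is the trap the indexed chart sets.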

These things take advantage of the human psychological weakness for pareidolia, or "seeing the Virgin Mary in a tortilla". Or, a related weakness, which is our story brain, that makes a narrative out of random events, or privileges poor explanations that fit a story, over better explanations that diss the story.

Ironically, a technique which may be thought of as cheating-- finding the least-squares fit between the two curves and adapting the scales to use that-- actually reproduces the "approved" way of demonstrating correlation, which is to find the least squares straight line through a scatter plot of the data. I don't know what to think about that :-)
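As a rough sketch of that idea: choosing the shift and scale that best map one series onto the other is an ordinary least-squares regression of one on the other. The function name `least_squares_rescale` and the data are hypothetical, and this is only one way to read "adapting the scales".

```python
def least_squares_rescale(x, y):
    """Return (a, b) minimising the sum of (a + b*x_i - y_i)**2."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx            # slope: the scale factor between the two axes
    a = mean_y - b * mean_x  # intercept: the shift between the two axes
    return a, b


# Hypothetical series where y is exactly 1 + 2*x.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
a, b = least_squares_rescale(x, y)
rescaled_x = [a + b * xi for xi in x]  # x redrawn on y's scale
print(a, b, rescaled_x)
```

The slope and intercept here are the same ones that define the least-squares line through a scatter plot of y against x.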

@derek: thanks for the excellent word, pareidolia, I'll have to remember that one. I must admit I am always highly suspicious of these kinds of shifted/scaled charts although they are extremely popular in finance.