First, one can think of it as a pattern-matching problem: which of the 9 graphs contains a pattern that matches the one in the map? This really isn't the point I was trying to make, but I realize now that the question could have been interpreted this way. In this line of reasoning, one needs to identify the features that distinguish the pattern in the map. The feature that stands out most, for me, is the spike at location 69/70. Only the last two graphs contain spikes near this location, and closer inspection reveals that the bottom right chart holds the real data.

Alternately, one can ignore the context (of the sad tally) and treat this as a problem of comparing probability distributions. This was my original intent. Is there an "odd man out" among the 9 distributions?

We now know that the bottom right chart contains the real data and the other 8 charts plot random data. If the real data is the "odd man out", which features of the distribution allow us to differentiate it from the other graphs? I'll discuss my findings on some features in the next post.

John Shonder, a reader, alerted me to the following unusual chart which identifies the precise locations where people jumped from the Golden Gate Bridge:

He asks: is the choice of location "random"?

This is a very rich question and different statisticians will take different approaches. In this post, I take a purely visual, non-rigorous look at the question; and if I have time (and if other readers haven't commented already), I may discuss more rigorous methods in the future.

First, I restrict my attention to light poles 43 through 112, i.e. the bridge segment that lies above the water. Also, I only consider the north-south locations: in other words, 43 and 44 are counted as one, so are 111 and 112. Otherwise, the distribution is clearly biased (towards the water and the east side).

When we say "random", we usually mean there is an equal chance that someone will jump from location 43/44, from location 111/112, or from any location in between. There are 35 locations and 755 documented suicides, averaging 21.6 suicides per location. But 21.6 is just the average, which we would not expect to observe exactly; even if the choice of location is random, we would not find exactly 21-22 suicides at each location. (Similarly, even though a fair coin has a 50/50 chance of landing heads, in any given run of 100 flips it is very unlikely that we will see exactly 50 heads.)
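The coin-flip point is easy to check by simulation; here is a quick sketch (the seed and run count are arbitrary choices of mine, not from the post):

```python
import random

random.seed(1)  # arbitrary seed, for reproducibility

# Simulate 1,000 runs of 100 fair coin flips and count how often
# a run produces exactly 50 heads.
runs = 1000
exactly_50 = 0
for _ in range(runs):
    heads = sum(random.random() < 0.5 for _ in range(100))
    if heads == 50:
        exactly_50 += 1

# The exact probability is C(100, 50) / 2^100, about 8%; most runs
# land near 50 heads without hitting it exactly.
print(exactly_50 / runs)
```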

So, at some locations we will see more than 21.6 deaths; at others, fewer. The question becomes whether the fluctuations are large enough to refute the notion that the choice of location is random.
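One standard way to quantify whether the fluctuations are too large is a chi-square goodness-of-fit test against the uniform distribution (this is my sketch of one rigorous option, not a method from the post, and the observed counts below are made up; the real counts would come from the map):

```python
import random

random.seed(0)  # arbitrary seed; the counts below are fabricated

# Hypothetical observed counts at 35 locations -- NOT the real bridge data.
observed = [random.randint(10, 35) for _ in range(35)]
expected = sum(observed) / 35  # uniform expectation under randomness

# Chi-square statistic: large values suggest the counts are too uneven
# to have arisen from a uniform random choice of location.
chi2 = sum((o - expected) ** 2 / expected for o in observed)
```

The statistic would then be compared against a chi-square distribution with 34 degrees of freedom (35 locations minus 1) to get a p-value.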

In the following set of graphs, I ran some simulations. Eight of the nine graphs represent scenarios in which I sent 755 people to the bridge and randomly assigned each one of the 35 locations to jump from (okay, this is a thought experiment only; please don't try this at home). The x-axis represents locations; the y-axis represents the number of suicides at each location -- but on a standardized scale.

The standardized scale allows us to compare across graphs. The zero line represents the mean number of suicides per location. The number of suicides at most locations is within one standard deviation away from this mean (i.e. between -1 and 1 on the y-axis). In some extreme cases, the number of suicides is more than 3 standard deviations larger than the mean (i.e. greater than 3).
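The simulation and the standardization described above can be sketched as follows (the seed is an arbitrary choice; the counts of 755 jumpers and 35 locations come from the post):

```python
import random
import statistics

random.seed(42)  # arbitrary seed, for reproducibility

N_JUMPERS = 755  # documented suicides
N_SPOTS = 35     # paired locations 43/44 through 111/112

# Assign each jumper one of the 35 locations uniformly at random.
counts = [0] * N_SPOTS
for _ in range(N_JUMPERS):
    counts[random.randrange(N_SPOTS)] += 1

# Standardize: subtract the mean count and divide by the standard
# deviation, so different simulations share a common y-axis scale.
mean = statistics.mean(counts)
sd = statistics.pstdev(counts)
z_scores = [(c - mean) / sd for c in counts]
```

Repeating this eight times and plotting the z-scores, alongside the standardized real counts, produces a panel like the one described.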

Back to randomness: well, one of the 9 graphs is the real data from the map above. If you can guess which of the 9 is real, then the real data is probably not random. If you can't, then the real data may be random!

I will publish the answer tomorrow. In the meantime, feel free to take a guess and/or comment on what other approach you'd take. One take-away from this exercise is that it's very hard to tell non-random from random unless it is very obvious.

The rise of robots elicited an uninspired, robotic graphical response from the Economist, reprinted by Mahalanobis.

A first fix, shown on the left below, puts the two data series in a scatter plot. One might read a linear relationship between 2004 installations and 2004 stock into this plot, but such a comparison is meaningless: countries differ enormously in the number of robots deployed (Japan has over 300,000 while many other countries have fewer than 1,000).

A second fix substitutes the 2004 growth rate for the absolute number of installations. It is now clear that the growth rate is not much associated with the size of the installed base, contradicting the perceived linear relationship from before. (Note that the x-axis is plotted on a log scale.) The European countries are shown in red, most of which have grown their stock of robots at a higher rate than Japan.

In order to highlight the Europe/Japan comparison, one can plot the European average rather than individual countries. The message is clearer because the graph is less busy. The following set illustrates this. What really stands out from these graphs is China (& Taiwan), not Europe. Incidentally, China was omitted from the Economist chart -- a rather mischievous omission, though an understandable one, since China's data point is hidden when the original series of installations versus stock is used (green text on the left chart).

Reference: Economist; United Nations Economic Commission for Europe

I'm writing this from a different computer while I'm travelling and I'm having trouble with the tools at my disposal. Apologies for some glitches in the charts.

By way of Quantum of Wantum, here is another example of a tag cloud, courtesy of Harvard Law School. I suppose the font size is proportional to the number of posts related to a particular country. This graphic does a great job of highlighting the most important categories; the more posts there are, the more powerful it gets.
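One simple way to map post counts to font sizes is to scale them linearly between a minimum and maximum point size; a sketch using made-up counts (both the scaling scheme and the numbers are my assumptions, not taken from the Harvard Law School graphic):

```python
# Made-up post counts per country; the real tallies would come from the blog.
counts = {"China": 120, "Iran": 80, "Egypt": 15}

min_pt, max_pt = 10, 36  # smallest and largest font sizes, in points
lo, hi = min(counts.values()), max(counts.values())

# Scale each count linearly into the [min_pt, max_pt] range.
sizes = {country: min_pt + (n - lo) / (hi - lo) * (max_pt - min_pt)
         for country, n in counts.items()}
```

The country with the most posts gets the largest font, the one with the fewest gets the smallest, and the rest fall in between.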