Sunday, June 22, 2014

So what's the deal with the Birthday Paradox and the World Cup Football?

As a data scientist I'm always happy when a newspaper spends time explaining something from the field of statistics. The Guardian is one of those newspapers that does a very good job at that. @alexbellos often contributes to the Guardian and I must say I often like the stuff he writes. Just recently he wrote a piece entitled "World Cup birthday paradox: footballers born on the same day", which was picked up by the Belgian quality newspaper De Standaard. The headline there was "Verbazend veel WK-voetballers zijn samen jarig", which roughly translates to "Surprisingly many World Cup players share birthdays". Notice already that the headline in De Standaard is less subtle than the one in The Guardian.

Alex Bellos starts by explaining what the birthday paradox is:

The birthday paradox is the surprising mathematical result that you only need 23 people in order for it to be more likely than not that two of them share the same birthday.

He then refers to the internet for explanations of why this is in fact the case (see, for instance, here). He then rightly remarks that the football World Cup offers an interesting dataset to verify the birthday paradox. Indeed, the 32 participating nations have 23 players each. We would therefore expect about half of the teams to have a shared birthday. It turns out that 19 of the teams do. So far so good.
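For readers who want to check the 23-person threshold themselves, here is a minimal Python sketch of the standard calculation. It assumes 365 equally likely birthdays and ignores leap years; the function name is just an illustrative choice.

```python
def prob_shared_birthday(n, days=365):
    """Probability that at least two of n people share a birthday,
    assuming every day of the year is equally likely."""
    p_all_distinct = 1.0
    for k in range(n):
        # The (k+1)-th person must avoid the k birthdays already taken.
        p_all_distinct *= (days - k) / days
    return 1.0 - p_all_distinct

p23 = prob_shared_birthday(23)
expected_teams = 32 * p23  # expected number of teams with a shared birthday

print(round(p23, 3))
print(round(expected_teams, 1))
```

With 23 players per team the probability comes out just above one half, which is why roughly half of the 32 teams are expected to have a shared birthday.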

The problem I have with the article is in the subsequent part. But before we come to that, let's have a look at the summary at the beginning of the article:

An analysis of the birth dates of all 736 footballers at the World Cup reveals that a surprisingly large number of teammates share the same birthday, and that seven were born on Valentines' Day

The observation about Valentine's Day is an interesting one because it plays on the same distinction between "a same day" and "the same birthday" that makes the birthday paradox surprising for some. From that perspective it would have been interesting to mention the probability that in a group of 736 we would see 7 or more people who share the same birthday. In defence of the author, I must admit that it is surprisingly hard to find references to this extension of the birthday problem (but see here, here and here). I understand a closed-form solution for triplets was published by Anirban DasGupta in the Journal of Statistical Planning and Inference in 2005. On the web I found only one solution for the general problem, and I could only get it to work for the trivial case of 2 and the more complicated case of 3. For 7 it gave very strange results. So either the formula was wrong or, more likely, my implementation of the formula was wrong. I then used the poor man's mathematics, i.e. simulation.

In a first simulation I randomly selected 736 birthdays from a uniform distribution. I then counted how many players I found that didn't share a birthday with any of the other players, and how many pairs of players shared a birthday, how many triplets, and so on. This is a barplot of the results I got:

As you can see, 7 was present as well. Granted, it was not Valentine's day, but nonetheless it is a birthday shared by 7 players. Notice, by the way, that there are far more players that share a birthday with one other player than those that don't share a birthday (2 times about 110 versus about 100).
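The tally behind such a barplot can be sketched in a few lines of Python. This is a minimal sketch, not the code used for this post: it assumes 365 equally likely days, and the function name and seed are illustrative choices.

```python
import random
from collections import Counter

def group_size_tally(n_players=736, days=365, seed=None):
    """Draw one set of uniform birthdays and count, for each group size,
    how many days are shared by exactly that many players."""
    rng = random.Random(seed)
    birthdays = [rng.randrange(days) for _ in range(n_players)]
    per_day = Counter(birthdays)        # day -> number of players born on it
    sizes = Counter(per_day.values())   # group size -> number of such days
    return dict(sorted(sizes.items()))

tally = group_size_tally(seed=1)
for size, n_days in tally.items():
    # size 1 = players with a unique birthday, size 2 = pairs, and so on
    print(size, n_days, size * n_days)
```

The last column (group size times number of days) gives the number of players in each category, which is what the comparison between unique birthdays and pairs refers to.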

I then repeated that process 10,000 times and each time verified whether there were birthdays that were present 7 or more times. This allowed me to estimate the probability that, in a selection of 736 players, at least one birthday is shared by 7 or more players at around 83%. It is therefore not remarkable at all that we found such a birthday at the World Cup in Brazil as well.
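A repeated simulation of this kind could look like the sketch below. It is an illustration under the same uniform-birthday assumption, not the original code; the example call uses 2,000 replications to keep it quick, whereas the figure quoted above is based on 10,000.

```python
import random
from collections import Counter

def prob_day_with_k_or_more(n_players=736, k=7, days=365,
                            n_reps=10_000, seed=None):
    """Estimate the probability that at least one day of the year is the
    birthday of k or more players, assuming uniform birthdays."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_reps):
        birthdays = rng.choices(range(days), k=n_players)
        # Size of the largest group of players sharing one birthday.
        if max(Counter(birthdays).values()) >= k:
            hits += 1
    return hits / n_reps

estimate = prob_day_with_k_or_more(n_reps=2_000, seed=42)
print(estimate)
```

Because this is a Monte Carlo estimate, the exact value changes a little with the seed and the number of replications.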

The second issue I have with this article is the part that asks why we observed 59.4% (19 out of 32) instead of the expected 50.7% (the theoretical probability for a group of 23). Although the author suggests the possibility that this is because of chance, he doubts it and instead offers an alternative based on the observation that football players are more likely to have their birthdays at the beginning of the year than at the end of the year. The reason for this skewed distribution has to do with the school cut-off date (very often the first of January), the height of the children in school and dominance in sports.

I don't question this theory; it's not my area of expertise. Furthermore, I believe that the skewed distribution amongst sportsmen has been observed before. What surprises me, though, is that an article in which the birthday paradox plays an important role does not use probability theory and statistics more to put these observations in perspective. In this case the natural question to ask is: if, in a team of 23 players, the probability of having a shared birthday is 0.507 and we have 32 teams, what is the probability of finding 19 or more teams with a shared birthday? This can easily be calculated with the binomial distribution and results in 0.21, again not unlikely at all. That said, Alex Bellos does not exclude that it's all by chance; he simply doubts it, which is fair.
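The binomial calculation is short enough to write out. The sketch below treats the 32 teams as independent trials with success probability 0.507; the helper name is an illustrative choice.

```python
from math import comb

def binom_tail(n, p, k_min):
    """P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(k_min, n + 1))

# 32 teams, each with probability 0.507 of a shared birthday:
# how likely are 19 or more such teams?
p_19_or_more = binom_tail(32, 0.507, 19)
print(round(p_19_or_more, 2))
```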

As said earlier, I don't question the theory of the skewed distribution for sportsmen, so I will not calculate the probability of observing the World Cup specific distribution under the hypothesis of a uniform distribution. But I do think that the author should also have looked at the probability of having players with shared birthdays under a "footballer"-specific distribution rather than the uniform distribution. I don't have such a distribution or a more general "sportsman"-specific distribution available (although I'm sure it must exist, because the skewed distribution of birthdays of sportsmen is well documented), so here I will simply use the counts that Alex mentioned in his article, i.e. January 72, February 79, March 64, April 63, May 73, June 61, July 54, August 57, September 65, October 52, November 46, and December 47. I simply transformed those to daily probabilities and then assumed they are generally valid for the population of "World Cup attending football players". The plot below shows the two distributions considered.

Furthermore, if we can't rely on the uniform distribution, the calculations for the birthday paradox become complex (at least to me), so I again resort to simulations.
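One way to set up such a simulation is sketched below in Python. How the monthly counts are turned into daily probabilities is an assumption here: each month's share is spread evenly over its days in a non-leap year. The function names are illustrative and this is not the code behind the reported figures.

```python
import random

# Monthly birthday counts as quoted from the article (January..December).
MONTH_COUNTS = [72, 79, 64, 63, 73, 61, 54, 57, 65, 52, 46, 47]
DAYS_IN_MONTH = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]  # assumed non-leap year

def daily_probabilities(month_counts, days_in_month):
    """Spread each month's share evenly over its days (an assumption)."""
    total = sum(month_counts)  # normalise by the counts' own total
    probs = []
    for count, n_days in zip(month_counts, days_in_month):
        probs.extend([count / total / n_days] * n_days)
    return probs

def shared_birthday_rate(weights, team_size=23, n_reps=10_000, seed=None):
    """Estimate the probability that a team has at least one shared
    birthday when birthdays follow the given daily weights."""
    rng = random.Random(seed)
    days = range(len(weights))
    hits = 0
    for _ in range(n_reps):
        team = rng.choices(days, weights=weights, k=team_size)
        if len(set(team)) < team_size:  # fewer distinct days than players
            hits += 1
    return hits / n_reps

skewed = daily_probabilities(MONTH_COUNTS, DAYS_IN_MONTH)
uniform = [1 / 365] * 365

rate_uniform = shared_birthday_rate(uniform, seed=1)
rate_skewed = shared_birthday_rate(skewed, seed=1)
print(rate_uniform, rate_skewed)
```

Running the uniform case through the same function is a useful check, since its result should land close to the theoretical 50.7%.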

After 10,000 replications, the result of the simulation is 0.518, which means that under the skewed footballer distribution we would expect to see shared birthdays in 51.8% of the teams of 23 players. This is only 1.1 percentage points higher than in the uniform distribution case. If you don't accept 19 out of 32 (i.e. 59.4%) because that's too far from 50.7%, it's hard to see why you would find 51.8% so much more convincing. In other words, the birthday paradox is not such a good measure for indicating whether football players really have a different (skewed) birthday pattern compared to the rest of the population. It would have been clearer if the two topics were separated:

Do football players, like other sportsmen, have a different birthday pattern than the rest of the population?

The World Cup is an excellent opportunity to illustrate the birthday paradox.

As an interesting side note, in the meantime it turns out that the data Alex used was not completely correct, and with the new data the number of teams with shared birthdays has become 16. This is exactly the number we would expect under the uniform distribution. Notice though that under the skewed distribution, and using the usual conventions of rounding, we would expect to see 17 teams with shared birthdays instead of 16. So, using their own reasoning, the headline in De Standaard should now change to: "Surprisingly few World Cup players share a birthday". Unless, of course, you follow the reasoning using the binomial distribution mentioned above and conclude that with 32 replications this is likely to be coincidental.
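The arithmetic behind the 16 versus 17 comparison, and the binomial view of it, can be sketched as follows. The probabilities 0.507 and 0.518 are the figures from this post; the helper name is illustrative.

```python
from math import comb

def binom_cdf(n, p, k_max):
    """P(X <= k_max) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(k_max + 1))

N_TEAMS = 32
for label, p in [("uniform", 0.507), ("skewed", 0.518)]:
    expected = N_TEAMS * p
    # Expected number of teams with a shared birthday, raw and rounded.
    print(label, round(expected, 1), round(expected))

# Under the skewed distribution, how likely is it to see 16 or fewer teams?
p_16_or_fewer = binom_cdf(N_TEAMS, 0.518, 16)
print(round(p_16_or_fewer, 2))
```

A result anywhere near one half for the last figure says that 16 teams is entirely ordinary under the skewed distribution too.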


About Me

Istvan Hajnal is a veteran of more than 20 years in the fields of data analysis, survey methodology and market research. First at the University of Leuven, Belgium and then
about 10 years with The Nielsen Company, the world's largest Market Research Company. Istvan is currently Insights Director, Marketing & Data Sciences for GfK, Belgium.
He received a master's degree in computer science (Leuven), a master's degree in quantitative applications in the social sciences (Brussels) and finally a PhD in social sciences from the University of Leuven.
He blogs about Data Science but occasionally also on management and leadership in general and the Market Research Industry in particular.