In 1873, Sir Francis Galton, a statistician, was working on the problem of how to predict surname extinction and submitted problem 4001 to the Educational Times:

Francis Galton. Image: public domain.

“A large nation of whom we will only concern ourselves with the adult males, $N$ in number, and who each bear separate surnames, colonise a district. Their law of population is such that, in each generation, $a_0$ percent of the adult males have no male children who reach adult life; $a_1$ have one such male child; $a_2$ have two; and so on, up to $a_5$ who have five. Find (1) what proportion of the surnames will have become extinct after $r$ generations; and (2) how many instances there will be of the same surname being held by $m$ persons.”

He noted that a general solution was preferable as ‘he finds it a laborious matter to work it out numerically’. The Reverend Henry Watson replied to him in the next issue of the Educational Times with a solution and together in 1875 they published their paper ‘On the extinction of families.’

Henry Watson. Image: public domain.

So what was Watson’s solution? Well, it formed the basis for the Galton-Watson process, a stochastic branching process that is used to model the transferal of Y chromosome genes and, of course, surnames.

The Galton-Watson Process

First, note that we are assuming a purely patriarchal society, so the surname can only be passed on through the male line (although the model is the same for matriarchal societies); as a by-product, female children are effectively ignored. Galton’s suggested set-up assumes that the generations are discrete and separate from one another, and that the maximum number of surviving male children each male may have is 5.

Let’s generalise this a bit more. Let’s say there is some number of sons, say $q$, where it is highly unlikely that any man will have more than $q$ sons who survive to adulthood. So we may say that $a_i$ is so small for every $i>q$ that it is effectively zero, and we don’t need to consider the possibility of there being more than $q$ surviving sons.

Now let’s address the question of distinct generations. Clearly the generations must overlap. It’s perfectly possible for a father to have a son and grandson in the same generation, so how do we tackle this? We may define $$t_i=\frac{a_i}{100}$$ to be the chance of any individual man having $i$ surviving sons in any generation. Because we are effectively saying a man can’t have more than $q$ sons, we must have that $$\sum_{i = 0}^{q} t_i = 1.$$

So if there are $p$ men with the surname Skywalker, consider the following polynomial: $$(t_0+t_1 x+t_2 x^2+\dots+t_q x^q)^p$$

If $p=1$ then clearly the chance of there being $n$ Skywalkers in the next generation is $t_n$. So for some general $p$, assuming the men have sons independently of one another, the chance of there being $n$ Skywalkers in the next generation will be the coefficient of $x^n$ in the polynomial above. For neatness, let $$T=t_0+t_1 x+t_2 x^2+\dots+t_q x^q.$$
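To see the coefficient rule in action, here is a minimal Python sketch that expands $T^p$ by repeated polynomial multiplication. The function names `poly_mul` and `next_generation_distribution`, and the values chosen for the $t_i$, are illustrative assumptions and not part of Galton’s problem.

```python
def poly_mul(a, b):
    """Multiply two polynomials given as coefficient lists (index = power of x)."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def next_generation_distribution(t, p):
    """Coefficients of (t_0 + t_1 x + ... + t_q x^q)^p.

    Entry n is the chance that p men leave n surviving sons between them,
    assuming each man's number of sons is independent of the others.
    """
    dist = [1.0]  # the constant polynomial 1, i.e. T^0
    for _ in range(p):
        dist = poly_mul(dist, t)
    return dist


# Illustrative values only: q = 2 with t_0 = 0.25, t_1 = 0.5, t_2 = 0.25.
t = [0.25, 0.5, 0.25]
dist = next_generation_distribution(t, p=2)
for n, chance in enumerate(dist):
    print(f"P({n} Skywalkers in the next generation) = {chance}")
```

The first entry of `dist` is the chance that the surname dies out in a single generation, which for $p$ men is simply $t_0^p$.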

But we’re not just interested in the number of individuals with a particular surname; we also want to know the likelihood of a surname dying out. Define $m_{rs}$ to be the fraction of the $N$ surnames that have $s$ representatives in the $r$th generation, so the number of surnames with $s$ representatives in generation $r$ is $m_{rs}N$. Each of these surnames produces its next generation according to $T^s$, hence the number of surnames with $n$ representatives in generation $r+1$ will be the coefficient of $x^n$ in the polynomial $$(m_{r0}+m_{r1}T+m_{r2}T^2+\dots)N,$$ where the sum runs over every value of $s$ that can occur in generation $r$.

We may define a sequence of functions $$ f_r(x)=f_{r-1}(t_0+t_1 x+t_2 x^2+\dots+t_q x^q) $$

with $$ f_1(x)=t_0+t_1 x+t_2 x^2+\dots+t_q x^q. $$

So the fraction of surnames with $s$ representatives in the $r$th generation is the coefficient of $x^s$ in $f_r(x)$. The total number can be found by multiplying this number by $N$.
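The recursion can be carried out on coefficient lists directly. The following is a minimal sketch, assuming polynomials are stored as Python lists indexed by the power of $x$; the helper names `compose` and `generation_distribution` and the example values of $t_i$ are hypothetical.

```python
def poly_mul(a, b):
    """Multiply two polynomials given as coefficient lists (index = power of x)."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def compose(outer, inner):
    """Coefficients of outer(inner(x)), built up with Horner's rule."""
    result = [outer[-1]]
    for coeff in reversed(outer[:-1]):
        result = poly_mul(result, inner)
        result[0] += coeff
    return result


def generation_distribution(t, r):
    """Coefficients of f_r, where f_1 = T and f_r(x) = f_{r-1}(T(x))."""
    f = list(t)
    for _ in range(r - 1):
        f = compose(f, t)
    return f


# Illustrative values only: q = 2 with t_0 = 0.25, t_1 = 0.5, t_2 = 0.25.
t = [0.25, 0.5, 0.25]
f2 = generation_distribution(t, r=2)
for s, fraction in enumerate(f2):
    print(f"fraction of surnames with {s} representatives in generation 2: {fraction}")
```

Entry `f2[0]` is the fraction of surnames with no representatives in the second generation, and multiplying any entry by $N$ gives the corresponding number of surnames.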

So $N f_r(0)$ surnames will have become extinct by the $r$th generation, and there will be $\frac{N}{s!}\frac{d^s f_r}{dx^s}(0)$ surnames with $s$ representatives in the $r$th generation.
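As a quick sanity check (a worked case added for illustration), take $r=1$, where $f_1(x)=t_0+t_1 x+\dots+t_q x^q$. Differentiating $s$ times and setting $x=0$ leaves only the $x^s$ term:

$$f_1(0)=t_0,\qquad \frac{1}{s!}\frac{d^s f_1}{dx^s}(0)=\frac{1}{s!}\cdot s!\,t_s=t_s.$$

So $Nt_0$ surnames are extinct after one generation and $Nt_s$ surnames have $s$ representatives, which matches the definition of $t_i$.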

Using the Galton-Watson Process

Let’s consider a couple of examples. Consider the situation where we have a population with $10$ surnames and let’s take $q=5$ as in Galton’s original problem. First, consider the case where a man is equally likely to have any number of surviving sons from $0$ to $5$, so $$t_0=t_1=t_2=t_3=t_4=t_5=\frac{1}{6}.$$

We may use the Galton-Watson process to set up an iterative procedure. Let $$f(x)=\frac{1}{6}\sum_{i=0}^{5} x^i,$$ so the first generation will have $10-10f(0)$ surnames, the second generation will have $10-10f(f(0))$, and so on. It is fairly simple to use Mathematica to find the number of surnames left after $10$ generations.
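The same iteration is easy to reproduce outside Mathematica. Here is a minimal Python sketch of the procedure; the function names `evaluate` and `surnames_remaining` are assumptions made for this example, and the output is the expected number of surnames left, not a count from a random simulation.

```python
def evaluate(t, x):
    """Evaluate f(x) = t_0 + t_1 x + ... + t_q x^q by Horner's rule."""
    total = 0.0
    for coeff in reversed(t):
        total = total * x + coeff
    return total


def surnames_remaining(t, n_surnames, generations):
    """Expected surnames left after 1, 2, ..., `generations` generations."""
    remaining = []
    extinct_fraction = 0.0
    for _ in range(generations):
        # After r passes this holds f(f(...f(0)...)) with f applied r times.
        extinct_fraction = evaluate(t, extinct_fraction)
        remaining.append(n_surnames * (1.0 - extinct_fraction))
    return remaining


# Equal chance of 0 to 5 surviving sons, as in the example above.
t_uniform = [1 / 6] * 6
remaining = surnames_remaining(t_uniform, n_surnames=10, generations=10)
for r, value in enumerate(remaining, start=1):
    print(f"generation {r}: {value:.3f} surnames remaining")
```

Swapping `t_uniform` for another list of probabilities that sums to 1 reuses the same procedure for a different law of population.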

Galton-Watson simulation with N=10, q=5 and equal likelihood of number of surviving sons.

We see a leveling off after the third generation with roughly 2 out of 10 surnames lost. But what if the likelihood varies with the number of sons? Consider the case $$t_i=\frac{1}{5+i}$$ for $i=1,\dots,5$, with $$t_0=1-t_1-t_2-t_3-t_4-t_5.$$

This time we see roughly five surnames are lost with again a plateau forming after the fifth generation.
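For reference, the arithmetic behind $t_0$ in this example works out as follows (a worked calculation from the definitions above, rounded to three decimal places):

$$t_1+t_2+t_3+t_4+t_5=\frac{1}{6}+\frac{1}{7}+\frac{1}{8}+\frac{1}{9}+\frac{1}{10}=\frac{1627}{2520}\approx 0.646,\qquad t_0=1-\frac{1627}{2520}=\frac{893}{2520}\approx 0.354.$$

So just over a third of the men in this scenario leave no surviving sons, compared with a sixth in the equal-likelihood case.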

Now, let’s look at a more realistic simulation. Don’t forget that Galton was a statistician, so we’ll try a probability-based approach. Let’s suppose that the number of surviving sons follows a Poisson distribution (chosen as it conveniently doesn’t cover negative values) with a mean of 1 (i.e. the average number of surviving sons is 1). So $$t_i=P(X=i),$$ where $X$ is the number of surviving sons. In this case we obtain the following.

Galton-Watson simulation with N=10, q=5 and the likelihood of surviving sons dependent on the Poisson Distribution.

This time we see less of a plateau occurring and about eight surnames having become extinct by the $10$th generation.
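A Poisson distribution puts a small amount of probability on more than $5$ sons, so some choice has to be made about the cut-off at $q=5$. The sketch below assumes the probabilities for $0$ to $5$ sons are rescaled to sum to 1; that rescaling, and the function names `truncated_poisson` and `extinct_fraction`, are assumptions of this example and may differ from how the figure above was produced.

```python
import math


def truncated_poisson(mean, q):
    """Poisson probabilities for 0..q sons, rescaled to sum to 1 (an assumption)."""
    weights = [math.exp(-mean) * mean**i / math.factorial(i) for i in range(q + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def extinct_fraction(t, generations):
    """Apply f to 0 `generations` times, where f(x) = t_0 + t_1 x + ... + t_q x^q."""
    x = 0.0
    for _ in range(generations):
        fx = 0.0
        for coeff in reversed(t):  # Horner evaluation of f(x)
            fx = fx * x + coeff
        x = fx
    return x


t_poisson = truncated_poisson(mean=1.0, q=5)
lost = 10 * extinct_fraction(t_poisson, generations=10)
print(f"expected surnames lost after 10 generations: {lost:.2f}")
```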

But let’s face it, so far we’ve just been guessing numbers and trying them. To use the Galton-Watson process we really need to get the statistics on how many sons a man would have in each generation, and then fit an appropriate distribution.

But Does Surname Extinction Actually Happen?

So in our simulations we saw that some surnames became extinct after $10$ generations depending on how we quantify the likelihood of surviving sons. But how do we know it actually works?