Estimated dating of Y-chromosome events

(by James Dow Allen)

Information in the following chart is taken from

A paper by Francalacci et al on Sardinian Y-chromosomes which appeared
August 2013 in Science.

A paper by Raghavan, Skoglund et al on a 24,000-year old Siberian genome
(found near Lake Baikal with Venus figurines) which appeared November 2013 in Nature.
(The supplement is available on-line. Here is the relevant chart.)

The Francalacci phylogeny tree seems to be one of the widest-coverage
Y-chromosome trees shown on-line.
Assuming that the rate of SNP mutations is almost constant and has been estimated,
rough date estimates of clade splits can be read directly from Francalacci's SNP chart.
I've done that here, with my own arbitrary estimate of mutation rate imposed.

The chart is a modified version of Figure 1 of the Francalacci paper.
I have made only the following modifications to that chart:

Added my own rough date estimates at the far left.

Incorporated information from Figure SI-5a of the Raghavan-Skoglund paper
in brown at the left of the chart. "M*" denotes the ancient Siberian boy.
The brown numbers in parentheses are, more-or-less, the SNP counts ("d"+"a")
shown in that Figure.
I think that "d"+"a"+"N" numbers might be more appropriate, but don't
fully understand the issues.

The SNP counts shown in a chart like this will vary from study to study (especially when
ancient skeletons are involved) due to genome reading difficulties.
One might expect, I think, that SNP counts from two studies be roughly
proportional along the same branches.
Here they are not: the SNP distance from R-L to R-Q is 119 from Francalacci
but 39 (about 1/3 as much) from Raghavan.
But the R-Q to R1a-R1b distances are 43+38 (=81) from Francalacci and
19+5+18 (=42, about 1/2 as much) from Raghavan.
Is this within normal statistical variation? Or is my interpretation flawed?
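
The comparison above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, using the SNP counts quoted in the text; the dictionary keys and the function name count_ratios are my own labels, not anything from either paper.

```python
# SNP counts quoted in the text for the same two path segments.
francalacci = {"R-L to R-Q": 119, "R-Q to R1a-R1b": 43 + 38}
raghavan = {"R-L to R-Q": 39, "R-Q to R1a-R1b": 19 + 5 + 18}


def count_ratios(reference, other):
    """Return other/reference for every segment present in both studies."""
    return {seg: other[seg] / reference[seg] for seg in reference if seg in other}


ratios = count_ratios(francalacci, raghavan)
for segment, ratio in ratios.items():
    print(f"{segment}: Raghavan/Francalacci = {ratio:.2f}")
```

If the counts were proportional, the two ratios would be about equal; whether the gap between them is within normal sampling variation is exactly the open question.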

The SNP distance from an R* to the Siberian boy is shown as "35 private non-N"
in the Raghavan Figure. Am I correct to compare this count with the "a"+"d" counts
shown for modern genomes in the Figure?

The date estimates I provide are not justified by any specific evidence
and should be treated as "wild guesses" just to start discussion.
However, I think they may be about right. The sudden emergence of haplogroups
C, F, D and E from BT appears just before the alleged Toba supereruption.
The rapid division of F into G,H,...,T coincides with a plausible population
expansion near West or South Asia.
Furthermore, this estimate places the Western Europe split-up of R1b-L11
near 3000 BC, about the date of Corded Ware intrusion into Western Europe
and/or the date of Bell Beaker expansion.

Francalacci calibrates the mutation rate of his genome subset as 205 years
per SNP. I have used 165 years per SNP for the years I show.

Click here to view the same chart, but with the
time axis showing 140 years per Francalacci SNP, instead of 165 years.
Now the BT breakup occurs after the Toba supereruption, which is perhaps more
plausible.
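
Because every date on the chart is just a SNP count times an assumed rate, changing the rate rescales all dates linearly. The sketch below illustrates this with the three rates mentioned here (205, 165 and 140 years per SNP). The reference age of 5000 years is an assumption of mine, standing in loosely for "near 3000 BC"; it is not a figure read from the chart.

```python
def rescale_age(age_years, old_rate, new_rate):
    """Rescale an age computed at old_rate (years per SNP) to new_rate."""
    return age_years * new_rate / old_rate


# Assumption for illustration only: an event dated to roughly 5000 years ago
# when 165 years per SNP is used.
REFERENCE_AGE = 5000
REFERENCE_RATE = 165

rescaled = {
    rate: rescale_age(REFERENCE_AGE, REFERENCE_RATE, rate)
    for rate in (140, 165, 205)
}
for rate, age in rescaled.items():
    print(f"{rate} years per SNP -> about {age:.0f} years ago")
```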

Although Francalacci considers an ancient skeleton (Ötzi the Iceman),
coverage of that DNA was inadequate to place it accurately, so he calibrates
his mutation rate by assuming the sudden fanout of I2a1a1 in Sardinia coincides
with the sudden expansion of farming on that island. I think it's just as likely
that the fanout was associated with a later "secondary founder" event.
(As his own paper may admit, the 205-year estimate represents more of an upper bound
than an expected value for the Ötzi data.)

Ignoring statistical variation, the mutation rate can be read directly from
the Raghavan chart. For example, 35 "private non-N" SNPs separate the Siberian
boy from a defined node; 94 "d"+"a" SNPs separate a modern R2 specimen from the
same node. Dividing 24000 years by the difference of 59 (94-35) gives 407 years
per Raghavan SNP.
To normalize this to Francalacci's subset, we can use the path from R-L to
modern R2 to calibrate the two SNP rates against each other: Francalacci shows
119+43+209 (=371) SNPs on that path, compared with Raghavan's 39+24+89 (=152).
Dividing 407 years by (371/152), we conclude
that the rate (normalized to Francalacci's subset) implied by Raghavan data is
167 years per SNP.
Obviously this calculation should be averaged over various paths,
is fraught with statistical uncertainty and should
be done by someone who understands Raghavan data better (perhaps using "d"+"a"+"N"
counts instead of "d"+"a").
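
The calibration just described is simple enough to write out as a short Python sketch. The SNP counts and the 24000-year age are the ones quoted above; the function and variable names are my own, and the result inherits every caveat already mentioned (a single path, "d"+"a" counts only, no error bars).

```python
ANCIENT_AGE_YEARS = 24_000  # age of the Siberian boy, as quoted in the text


def years_per_snp(age_years, modern_count, ancient_count):
    """Rate implied by the extra SNPs a modern sample has accumulated
    beyond an ancient sample, both counted from the same node."""
    return age_years / (modern_count - ancient_count)


# 94 "d"+"a" SNPs to a modern R2 sample, 35 "private non-N" SNPs to the boy.
raghavan_rate = years_per_snp(ANCIENT_AGE_YEARS, 94, 35)

# SNP counts along the path from R-L to modern R2 in each study.
francalacci_path = 119 + 43 + 209
raghavan_path = 39 + 24 + 89

# Convert years per Raghavan SNP to years per Francalacci SNP.
normalized_rate = raghavan_rate * raghavan_path / francalacci_path

print(f"{raghavan_rate:.0f} years per Raghavan SNP")
print(f"{normalized_rate:.0f} years per Francalacci SNP")
```

Averaging over several paths would amount to calling years_per_snp once per path and combining the results, but I have not attempted that here.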

The path-lengths to present are consistent enough in the Raghavan data to suggest
that its sample size is large enough to draw conclusions.
Proceeding in a different way, the distances from the F-breakup to the Siberian boy
and to the present are 98 and 155 SNPs respectively, so the F-breakup should be
(24000 * 155/(155-98)) years ago, or about 65000 BP.
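
As a check on that arithmetic, here is the same calculation as a small Python sketch. It assumes a constant mutation rate along both paths and uses only the numbers quoted above; node_age is a name I made up for the illustration.

```python
def node_age(ancient_age_years, snps_to_present, snps_to_ancient):
    """Age of a node, given SNP counts from the node to a modern sample and
    to an ancient sample of known age, assuming a constant mutation rate.

    The ancient sample stopped accumulating SNPs ancient_age_years ago, so
    the extra (snps_to_present - snps_to_ancient) SNPs correspond to that
    span of time.
    """
    rate = ancient_age_years / (snps_to_present - snps_to_ancient)
    return rate * snps_to_present


f_breakup_age = node_age(24_000, snps_to_present=155, snps_to_ancient=98)
print(f"F-breakup: about {f_breakup_age:.0f} years ago")
```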

Thus, despite caveats, I think that, interpreted with more understanding than I have,
the Raghavan data should be adequate to form a good ballpark
estimate of mutation rate, and to estimate the date of the R1-R2 split.

As far as I can tell, Y-chromosome SNP rate is still controversial, with estimates
typically 100 to 200 years per SNP, when normalized to Francalacci's subset.
(In round figures, I think Francalacci's subset is about 17% of the NRY-chromosome,
so his 205 years per SNP becomes 35 years per SNP for the entire NRY.)
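
The conversion between subset and whole-NRY rates can be sketched as below. The 17% figure is my round-number guess from the text, not a measured value, and the function name is hypothetical. The idea is that a region covering a fraction f of the NRY sees a fraction f of the mutations, so the whole NRY accumulates SNPs 1/f times as fast.

```python
SUBSET_FRACTION = 0.17  # assumption: Francalacci's subset as a share of the NRY


def whole_nry_rate(subset_years_per_snp, fraction=SUBSET_FRACTION):
    """Convert years per SNP in the subset to years per SNP over the whole NRY."""
    return subset_years_per_snp * fraction


# Rates mentioned in the text, in years per Francalacci SNP.
whole_rates = {rate: whole_nry_rate(rate) for rate in (100, 140, 165, 200, 205)}
for rate, whole in whole_rates.items():
    print(f"{rate} years per subset SNP -> about {whole:.0f} years per NRY SNP")
```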