December 30, 2010

How old is Y-chromosome Adam?

The presumed shallow time depth of the human Y-chromosome phylogeny is one of the main arguments for the recent Out-of-Africa theory. One of the major things I found while working on my Y-STR series is that point estimates from Y-STR variation come with huge confidence intervals, because of uncertainty about factors such as generation length, population history, and mutation rates, even if the mutation model behaves "perfectly" in symmetrical stepwise fashion.

Trouble is, the deeper we go in time, the more uncertain we are about the behavior of our models. That is why I have generally avoided providing any age estimates for events prior to the Neolithic.

Nonetheless, it is interesting to see the state of the art in this area, because claims about the shallow time depth of the human Y-chromosome phylogeny are always flying around, but, if you follow the citation labyrinth, you will soon realize that the whole edifice is erected on sand.

Fortunately, I was recently reminded of a thoughtful 2009 post by Tim Janzen on the GENEALOGY-DNA-L list, which is probably the best treatment available of Y-chromosome age estimation for deep clades of the phylogeny.

The most basal clade in the phylogeny is haplogroup A, which is found in Africa. By comparing A chromosomes with those of the BT clade (everyone else), we can arrive at an age estimate for Y-chromosome Adam. And, since the BT clade contains much structure itself, we can compare A chromosomes with different subclades within BT, e.g., E or J or T.

This is essentially what Tim did: he compared a group of haplogroup A chromosomes with all the major clades of the BT group. Different age estimates produced by this method are not independent, because different haplogroups share more recent common ancestors: for example A vs I and A vs J both contain a common line of patrilineal descent (from the BT founder to the IJ founder). In any case, the different age estimates should all give approximately the same figure, as they are estimating the same quantity: if they do not, this is evidence about the inability of Y-STRs to provide good age estimates.

Tim went a step further, and he did his comparisons on different sets of markers: slow-evolving ones to fast-evolving ones. Again, age estimates with fast vs. slow-evolving markers should give similar age estimates. If they do not, then this means that an age estimate is a product not only of the true age of a lineage, but also of the particular mix of fast- and slow-evolving markers that one uses.
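To make the dependence on marker choice concrete, here is a minimal sketch of one common way to date the split between two groups of haplotypes under the symmetric stepwise model: the average squared distance (ASD) in repeat counts between the groups is expected to grow as 2 x rate x generations per marker, so the age can be estimated as the summed ASD divided by twice the summed mutation rates. This is not necessarily the procedure Tim used. The function names, the toy haplotypes, the mutation rates, and the 30-year generation length are all illustrative assumptions.

```python
from itertools import product

def asd_tmrca_years(group_a, group_b, rates, years_per_generation=30):
    """Estimate the age of the split between two haplotype groups.

    Assumes a symmetric single-step mutation model, under which the expected
    squared repeat-count difference at a marker is 2 * rate * generations.
    """
    pairs = list(product(group_a, group_b))
    # Sum over markers of the average squared difference between the groups.
    total_asd = sum(
        sum((a[m] - b[m]) ** 2 for a, b in pairs) / len(pairs)
        for m in range(len(rates))
    )
    generations = total_asd / (2 * sum(rates))
    return generations * years_per_generation

def pick(group, idx):
    """Keep only the markers at positions `idx` in every haplotype."""
    return [tuple(h[i] for i in idx) for h in group]

# Hypothetical 4-marker haplotypes (repeat counts); not real data.
hap_a = [(12, 23, 14, 10), (12, 24, 15, 10)]
hap_b = [(13, 21, 14, 11), (13, 22, 16, 11)]
# Hypothetical per-generation rates: markers 0 and 3 slow, 1 and 2 fast.
rates = (0.0002, 0.004, 0.004, 0.0002)

for label, idx in (("slow", [0, 3]), ("fast", [1, 2]), ("all", [0, 1, 2, 3])):
    age = asd_tmrca_years(pick(hap_a, idx), pick(hap_b, idx),
                          [rates[i] for i in idx])
    print(label, round(age))
```

With these made-up numbers the slow markers give a much older date than the fast ones, and the "all markers" figure is dominated by the fast markers because their rates dominate the denominator.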

In short: age estimates by comparing haplogroup A with several other haplogroups and by using different sets of markers should be roughly similar. But, that is hardly what happened.

Below is Tim's table of age estimates in years. I have added an extra row and an extra column: these contain the standard deviation of each column/row divided by its average (in %), which quantifies how much the age estimates vary across different BT haplogroups and across different marker sets.
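The extra row and column can be computed along the following lines. This is a sketch on invented numbers, not Tim's table: the ages below are hypothetical, and the use of the sample (n-1) standard deviation is an assumption, since either convention could be meant.

```python
from statistics import mean, stdev

def cv_percent(values):
    """Sample standard deviation divided by the mean, in percent."""
    return 100 * stdev(values) / mean(values)

# Hypothetical ages in years: one row per BT haplogroup compared with A,
# one column per marker set (fast, medium, slow). Not real data.
ages = {
    "E": [9000, 40000, 310000],
    "J": [11000, 55000, 350000],
    "T": [8000, 48000, 420000],
}

# Variation across marker sets, one value per haplogroup (extra column).
across_marker_sets = {hg: cv_percent(row) for hg, row in ages.items()}
# Variation across haplogroups, one value per marker set (extra row).
across_haplogroups = [cv_percent(col) for col in zip(*ages.values())]

print(across_marker_sets)
print(across_haplogroups)
```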

The standard deviation of the age estimates across haplogroups is reasonably small, but large enough to render any archaeological correlations useless. The real trouble is in the variation of the age estimates across marker sets: there, the standard deviation is higher than 100% of the average!

What this means is that age estimates are largely a function of whether one uses slow- or fast- mutating markers.

Age estimates vary overall between 6,530 years and 535,755! It is obvious that fast/medium mutating markers provide unbelievably small age estimates (most of them are less than 20 thousand years). However, if we limit the analysis to slow mutating markers, most age estimates are in excess of 300,000 years!

In short, you can arrive at any age estimate you want, by choosing a particular mix of slow and fast mutating markers.

It could be argued that using all markers (the 50 markers column) would provide a better estimate, and, indeed, that estimate is on the order of 40-80ky, which is close to what is usually reported for human Y-chromosomes.

But that is equivalent to having a number of different clocks, some of which tell you that 3 seconds have transpired, and some which tell you that it's been a whole minute. The rational thing to do is not to take an average, but to throw the clocks in the garbage, or figure out what's wrong with them.

Conclusion

At present I am aware of no research that quantifies the depth of the human Y-chromosome phylogeny with anything bearing a semblance of accuracy. The 1000 Genomes Project has the potential to do this using relatively well-behaved point mutations rather than Y-STRs, but in the initial publication no actual age estimates were given, and the samples used to produce Supplementary Figure 7 lacked the most basal part of the tree (both clade A and the next most basal clade B).

UPDATE (Jan 2, 2011):

In a post in GENEALOGY-DNA-L, I show that, with the Ballantyne et al. mutation rates and the tested haplogroup A and haplogroup C 67-marker haplotypes from the respective FTDNA projects, you can arrive at age estimates anywhere between 10 and 219ky, depending on whether you use slow- or fast-evolving markers.

This has confirmed to my mind that Tim Janzen's numbers about the dependence of age estimates on marker mutation rates are basically correct, and that age estimates about Y-chromosome Adam using Y-STRs are basically useless.

Let's hope that the 1000 Genomes Project will produce the data in the coming year that will allow us to make a better estimate, in terms of the number of SNPs between A and non-A chromosomes, presented e.g. (i) as a fraction of the number of SNPs between human and chimpanzee, or (ii) divided by father-son Y-SNP mutation rates; the latter have already been estimated, but should become better fixed by looking at the father-son pairs included in the 1000 Genomes Project.
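Written out, the two options look roughly as follows. This is a sketch under a strict molecular-clock assumption, and all symbols are introduced here for illustration: d_A is the number of SNP differences between an A and a non-A chromosome over L compared bases, d_HC is the human-chimpanzee count over the same bases, T_HC is the assumed human-chimpanzee divergence time, mu is the per-base, per-generation Y-SNP mutation rate, and g is the generation length in years. The factor of 2 in (ii) reflects mutations accumulating on both lineages since their common ancestor.

```latex
% (i) scaling by the human-chimpanzee divergence
T_{\mathrm{MRCA}} \approx T_{HC} \, \frac{d_{A}}{d_{HC}}

% (ii) dividing by a directly measured father-son mutation rate
T_{\mathrm{MRCA}} \approx \frac{d_{A}}{2 \, \mu \, L} \, g
```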

The recent OoA theory is based on different lines of evidence. This does not significantly impact it, as I am not advocating either a "young" or "old" age for the Y-chromosome MRCA; I am simply pointing out that people who discuss the theory with the certainty of a recent MRCA should let go of that certainty until there is better evidence.

As I posted some time ago on the Rootsweb list, I don't believe haplogroup A is at the root. Even in this case, STRs might not be suited for "long distance" estimates. The time for SNP comparisons will come soon and may settle the point.

I would say that the majority of age estimates for individual haplogroups like R1b, R1a, J1, J2, G2... are based more on prejudice and people's ideas of the origins of certain human groups than on logical and unemotional data and thinking.

I have been balking at age estimates for the Y chromosomes ever since R1b was linked with Paleolithic Europeans and Cro Magnons. What was the proof? The high frequency of R1b in the region of Europe where Paleolithic humans were known to live, some beautiful cave paintings, and the attractiveness of the Cro Magnon skull, which, despite the disease that would have disfigured him, looked like a European is thought to look. Also, I could see that the basal point of the R haplogroup is a longer way from its ancestor haplogroup F than those of many others that are dated younger. That means that the R group is not older than was claimed, or that the R group is highly unstable and mutagenic.

The problem of the separation of haplogroup I from haplogroup IJ or haplogroup J has never been satisfactorily explained, either with regard to their current regional spread or with regard to the age estimates of each and why they differ.

I do rough age estimates based on 67 STR markers. I use an average which flattens out the fast and slow STR twitchers. The studies that use 6, 9, 15, 21 STR markers are, frankly, a waste of time, resources and paper. I totally ignore most of them except the data as to the frequency and spread of haplogroups. In any case, how can a group of 5 to 100 men adequately show the variance, frequency and types of haplogroups in populations covering millions?

The data look to me, at first glance, like a question of calibration, rather than accuracy.

If slow mutating and fast mutating markers give rank orders that fit the known phylogenetic order of appearance in all or most cases, then the mutation rates are measuring something real, but one or more of the marker calibrations is out of order. Eyeballing the data, there seem to be a few cases where the mutation rate ages are not well ordered relative to the known sequence, but most of the data subsets do seem to be relatively well ordered and even roughly proportional to each other.

The naive solution would be to toss out the markers that are not producing a reasonable approximation of the proper rank order where the phylogeny is known, and then to recalibrate each subset of markers against an estimated Out of Africa date, so as not to violate data from ancient DNA (e.g. IIRC there is a Y-DNA F aDNA sample from ca. 30,000 years ago) or other archaeological limits (for example, Q can't be younger than the arrival of modern humans in the Americas).

This won't allow any more accurate inference of the dates that were used to calibrate the mutation rates, of course, but it would add considerable confidence to the dates for which we don't have good archaeological corollaries, which in turn might help discriminate better, on the basis of genetic data, between different hypothetical scenarios in prehistory.

There was no need to rethink Adam’s age to understand that R1b1b2 might be Palaeolithic. I have been saying for many years that the calculations made by everybody except me are wrong, because they don’t take into consideration the mutations around the modal. See a post of mine on Worldfamilies, which unfortunately Dienekes didn’t let pass here: there are “three dozen” SNPs between R1b1b and R1b1b2, and they can only have accumulated over a long period of time and in isolation: the LGM better than the Younger Dryas, as I was thinking.

"I think better age estimates would come with known settlements, like Papua New Guinea, Australia, and the Americas."

You must be joking, Eurologist. One of the reasons why the pop genetics science is having all their theories falsified is because it relies on putatively "known facts." The only peopling of the Americas that we know the timing of is the European colonization of 1492. There's no archaeological evidence to support the claim that Amerindians came to the Americas with the Clovis technology. The extent of linguistic diversity supports a much earlier entry. Instead of jumping on the bandwagon of human origins genetics should painstakingly collect more and more samples from around the world (just like linguists and anthropologists have been doing for the past 200 years), measure mutation rate in known (sic!) lineages and align their patterns of variation with known (sic!) linguistic classifications and very recent (within the last 5,000 years) events.

"The data look to me, at first glance, like a question of calibration, rather than accuracy.

"If slow mutating and fast mutating markers give rank orders that fit the known phylogenetic order of appearance in all or most cases, then the mutation rates are measuring something real, but one or more of the marker calibrations is out of order. Eyeballing the data, there seem to be a few cases where the mutation rate ages are not well ordered relative to the known sequence, but most of the data subsets do seem to be relatively well ordered and even roughly proportional to each other."

Well, I calculated the correlation coefficient between the 8 different estimate columns.

If markers of different mutability measured the same quantity but differed in scale, we would expect most of these to be high and close to 1. The way I see it, correlations are weak and positive, so while using the different marker sets captures the same underlying variable (age), there is plenty of noise besides.
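The reasoning can be checked with a short sketch: columns that differ only in scale have a Pearson correlation of 1, whatever the scale factor, so a miscalibrated but otherwise sound marker set would still correlate strongly with the others. The three columns below are hypothetical values chosen for illustration, not the estimates from the table.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_matrix(columns):
    """Pairwise Pearson correlations between all columns."""
    return [[pearson(a, b) for b in columns] for a in columns]

# Hypothetical age estimates (years) for four haplogroup comparisons.
slow = [300000, 350000, 420000, 330000]
rescaled = [v / 20 for v in slow]       # same signal, wrong calibration only
noisy = [12000, 9000, 14000, 16000]     # mostly noise

for row in correlation_matrix([slow, rescaled, noisy]):
    print([round(r, 2) for r in row])
```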

"Well, I calculated the correlation coefficient between the 8 different estimate columns"

I appreciate your effort and it is very effective at making your point about the noise to signal ratio here. The V1, V3 and V4 cluster do seem to be closely aligned, and the V7 cluster seems wildly out of synch, but the V2, V5, V6 and V8 groups do seem to have a lot of noise, not only with the V1, V3 and V4 group but with each other as well.

This topic was well-discussed on Rootsweb in 2009. Then, I suggested and still believe that the fastest markers make poor clocks over longer periods of time due to mutational saturation. In other words, variance for these high mutation rate markers does not accumulate linearly with time after a certain point. This phenomenon is well-known with regard to mtDNA (e.g. Bandelt et al.) but not well known with regard to Y-STRs. You can easily see this by comparing slow vs fast markers over a range of TMRCA estimates. The relationship between the slow marker estimate and the fast marker estimate varies depending on TMRCA, being more similar for recent TMRCAs than for ancient ones.
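One way such saturation could arise is sketched below. The sketch assumes that repeat counts are confined to a bounded range, which is an illustrative assumption rather than something stated in the thread; the rates, the range of 11 alleles, and the function name are likewise hypothetical. Under an unbounded stepwise model the variance grows as rate x generations, so dividing the variance by the rate recovers the age; with a bounded range the variance of a fast marker levels off and the same division returns far too young an age.

```python
def variance_after(mu, generations, n_alleles=11):
    """Exact repeat-count variance of a lineage descended from one founder.

    Symmetric single-step model with per-generation rate `mu` on a bounded
    range of `n_alleles` repeat counts; a step past either end is ignored.
    """
    dist = [0.0] * n_alleles
    dist[n_alleles // 2] = 1.0              # founder sits mid-range
    for _ in range(generations):
        new = [p * (1 - mu) for p in dist]  # no mutation this generation
        for i, p in enumerate(dist):
            new[min(i + 1, n_alleles - 1)] += p * mu / 2   # gain a repeat
            new[max(i - 1, 0)] += p * mu / 2               # lose a repeat
        dist = new
    m = sum(i * p for i, p in enumerate(dist))
    return sum((i - m) ** 2 * p for i, p in enumerate(dist))

TRUE_AGE = 10_000  # generations
for label, mu in (("slow", 0.0001), ("fast", 0.01)):
    naive_age = variance_after(mu, TRUE_AGE) / mu
    print(label, "marker, naive age in generations:", round(naive_age))
```

In this toy setting the slow marker returns roughly the true age, while the fast marker returns an age about an order of magnitude too young, because its variance cannot exceed that of a uniform distribution over the allowed range.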

As for the fast marker sets, Ken Nordvedt suggested, and it makes sense, that pedigree studies may tend to underestimate the mutation rate of slow markers because these studies are often not large enough. Since the rates used by Janzen are derived from these studies, an underestimation of mutation rate would lead to an overestimation of TMRCA.

Both hypotheses are easily testable, so the problem with TMRCA estimation using Y-STRs is not intractable. It's just a shame that no one has the resources and the motivation to work it out and publish it.

"As for the fast marker sets, Ken Nordvedt suggested, and it makes sense, that pedigree studies may tend to underestimate the mutation rate of slow markers because these studies are often not large enough. Since the rates used by Janzen are derived from these studies, an underestimation of mutation rate would lead to an overestimation of TMRCA."

I don't see why the mutation rate would either be under- or over- estimated for slow-mutating markers. If no mutations are observed even for that many meioses, then the credible interval excludes some mutation rates (above a certain level), but the mutation rate could also be much lower than the point estimate.

For starters, Janzen in 2009 didn't have access to Ballantyne's rates (his study is, indeed, a good one) but rather relied on older, less well-designed studies.

Moreover, the slowest 10 markers used by FTDNA (and, ergo, Janzen) are estimated to average just one mutation for every 8500 meiotic transfers (the 10 slowest have an average rate of less than 0.00012). Even the Ballantyne study was probably an order of magnitude too small to accurately measure those rates.

The idea that mutation rates may be underestimated for slow markers flows from the way that researchers estimate the rate in cases where they observe no mutations in their sample. In the past, many studies have reported the maximum likelihood estimate, which is lower than the expectancy estimate that you really want.
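The difference between the two estimates is easy to see in the zero-mutation case. The sketch below reads "expectancy estimate" as the posterior mean under a uniform prior on the per-meiosis rate; that reading, the function name, and the study sizes are assumptions made for illustration. With no mutations observed in n meioses the posterior is Beta(1, n + 1), so the maximum likelihood estimate is 0, the posterior mean is 1/(n + 2), and the upper end of a one-sided credible interval has a closed form.

```python
def zero_mutation_estimates(meioses, credibility=0.95):
    """Rate estimates when no mutations are seen in `meioses` transmissions.

    Assumes a uniform prior on the per-meiosis mutation rate, so that the
    posterior distribution is Beta(1, meioses + 1).
    """
    mle = 0.0                                   # 0 mutations / meioses
    posterior_mean = 1 / (meioses + 2)
    # Solve 1 - (1 - p) ** (meioses + 1) = credibility for p.
    upper_bound = 1 - (1 - credibility) ** (1 / (meioses + 1))
    return mle, posterior_mean, upper_bound

# Hypothetical study sizes, compared with a marker mutating about once
# per 8500 meioses.
for n in (500, 2000, 8500):
    mle, mean_rate, upper = zero_mutation_estimates(n)
    print(n, mle, f"{mean_rate:.6f}", f"{upper:.6f}")
```

The maximum likelihood estimate of zero would imply an infinite age, while the posterior mean stays positive; the interval excludes rates above the upper bound but, as noted in the thread, leaves much lower rates entirely possible.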

It was Ken's argument, originally, so you should probably ask him to explain it more fully. He surely can do so much more comprehensively than I can. It's been discussed on Rootsweb before at some length, but I'm having trouble finding the archived posts.

Could you point to the place on the Rootsweb mailing list where you argue to this effect? The only statement of yours I could find challenges Sasson's Balanced Tree hypothesis and says "SNPs like M168 leave no way for any balanced tree hypothesis. The SNP tree proves that A and B haplogroups are earlier than all other groups; period. Didier."

But there's an interesting note from Sasson, which may be relevant: "M168 is a "polymorphism". Some people have C, others have T. There are two "forms". SNP does not say which of the two forms is ancestral, and which is derived. But there are few SNPs (like M168, M294, and P9.1) where science simply does not know the direction of mutation.

For most SNPs, the direction of the mutation is clear. For example: M70 is derived in Haplogroup T, ancestral everywhere else.

A particular direction of M168 mutation is traditionally assumed, weakly motivated by comparison with chimps. It reflects the European's perception of themselves as "a more evolved" race, relatively to African "bushmen" and "pygmies"."

If the M168 and M294 states found in hg A and B are derived vis-a-vis the one found on all other human haplogroups, then this suggests that Africa was colonized, and the proximity of non-Africans to archaic hominids (Neanderthals and Denisovans) becomes that of common ancestry and not admixture.


Dienekes' Anthropology blog is dedicated to human population genetics, physical anthropology, archaeology, and history.

You are free to reuse any of the materials of this blog for non-commercial purposes, as long as you attribute them to Dienekes Pontikos and provide a link to either the individual blog entry or to Dienekes Anthropology Blog.
