October 18, 2008

Why Y-STR haplotype clusters are not clades

A Y-chromosome clade is the set of Y-chromosomes descended from a single Y-chromosome (the founder). In human terms, it consists of all the patrilineal descendants of a single man.

Clades are usually defined in terms of unique event polymorphisms (UEPs). Such polymorphisms occur rarely enough to be useful for cladistic analysis and determination of the human Y-chromosome phylogeny. A clade defined on the basis of UEPs is a haplogroup.

There is a misconception among some people that haplotypes, i.e., the alleles at several Y-STR loci, can also define a clade. This is, however, impossible, for at least three reasons.

First, those who erroneously define clades based on Y-STR haplotypes do so by means of identification of a cluster of similar haplotypes.

But this isn't enough. Suppose you identify a cluster of haplotypes in which every pair has a genetic distance of at most 3. It must first be shown that the genetic distance between any haplotype in the cluster and any haplotype outside it is greater than 3. Suppose you have identified a cluster of haplotypes {a, b, c} with dist(a, b) = 3, and that there is another haplotype d with dist(a, d) = 3. You are not justified in excluding d from the proposed "clade", since it may share a common ancestor with a that is more recent than the common ancestor of a and b.

Moreover, since age estimates are associated with very wide confidence intervals, it is not guaranteed that greater genetic distance implies an older MRCA. To ensure that a group of Y-chromosomes forms a clade, you must show that all other Y-chromosomes are at a genetic distance much greater than 3, so great that it is extremely unlikely that they are closely related to any Y-chromosome in the haplotype cluster.

Needless to say, none of the folks who propose various "clades" on the basis of Y-STR haplotypes have bothered to prove that their haplotype clusters share a common ancestor that is more recent than that between cluster members and non-cluster members.
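The first concern can be made concrete with a minimal sketch. The haplotypes below are invented for illustration, and genetic distance is assumed to be the sum of absolute repeat-count differences across loci (one common convention; no particular metric is specified above). The check compares the largest distance inside a proposed cluster with the smallest distance to any haplotype outside it.

```python
from itertools import combinations

def genetic_distance(h1, h2):
    """Sum of absolute repeat-count differences across loci (assumed metric)."""
    return sum(abs(x - y) for x, y in zip(h1, h2))

# Hypothetical 6-locus haplotypes; the values are made up for illustration.
haplotypes = {
    "a": (13, 24, 14, 11, 11, 14),
    "b": (13, 25, 14, 12, 11, 15),
    "c": (13, 24, 14, 12, 11, 14),
    "d": (14, 24, 13, 11, 12, 14),  # not in the proposed cluster
}

def max_within(cluster):
    """Largest pairwise distance among members of the cluster."""
    return max(
        genetic_distance(haplotypes[x], haplotypes[y])
        for x, y in combinations(cluster, 2)
    )

def min_between(cluster, outsiders):
    """Smallest distance from any cluster member to any outsider."""
    return min(
        genetic_distance(haplotypes[x], haplotypes[y])
        for x in cluster
        for y in outsiders
    )

cluster = ["a", "b", "c"]
outsiders = ["d"]

# The cluster is only "distinct" if every outsider is farther away than
# the most distant pair inside the cluster.
is_distinct = min_between(cluster, outsiders) > max_within(cluster)
print(max_within(cluster), min_between(cluster, outsiders), is_distinct)
```

In this toy data d is exactly as close to a as b is, so the distances alone give no reason to leave d out of the proposed "clade".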

Second, suppose that you have identified a very distinctive haplotype cluster that addresses the first concern. Suppose that every pair of haplotypes within this cluster is within a short genetic distance (e.g., 3) and very far from any other haplotype (e.g., more than 15). Is this sufficient to define a clade?

It is not, since you are not certain that you have sampled the relevant Y-chromosomes, i.e., those that bridge the gap between your cluster and other Y-chromosomes, revealing them to be part of a continuum, rather than distinct members of a particular clade.

There are several cases in which supposed clades were defined, e.g., on the basis of a marker having values of 12 or 14 but no intermediate value (13), only to be invalidated later on when chromosomes with intermediate values popped up.

So, while the first concern identifies the need for clusters to be tight and distinct, the second concern identifies the problem that tight and distinct clusters may be spurious due to incomplete sampling of the genetic continuum.
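The sampling problem can be sketched in a few lines. This is only an illustration under stated assumptions: the two-locus haplotypes are invented, and clusters are formed by simple single-linkage grouping (haplotypes are linked when their distance is at most 1), which is a stand-in for whatever clustering procedure one prefers.

```python
def genetic_distance(h1, h2):
    """Sum of absolute repeat-count differences across loci (assumed metric)."""
    return sum(abs(x - y) for x, y in zip(h1, h2))

def single_linkage_clusters(haps, threshold=1):
    """Group haplotypes into connected components of the 'distance <= threshold' graph."""
    clusters = []
    for h in haps:
        # Find every existing cluster that h is linked to.
        linked = [
            c for c in clusters
            if any(genetic_distance(h, member) <= threshold for member in c)
        ]
        merged = [h]
        for c in linked:
            merged.extend(c)
            clusters.remove(c)
        clusters.append(merged)
    return clusters

# Hypothetical early sample: the first marker is 12 or 14, never 13.
early_sample = [(12, 24), (12, 25), (14, 24), (14, 25)]

# A later sample turns up an intermediate chromosome.
later_sample = early_sample + [(13, 24)]

print(len(single_linkage_clusters(early_sample)))
print(len(single_linkage_clusters(later_sample)))
```

With the early sample the haplotypes fall into two apparently separate groups; once the intermediate value is sampled, the same procedure links them into one continuum.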

Third, suppose that you have identified a tight and distinct cluster, and that moreover you have extremely large and comprehensive samples that give you a strong degree of confidence in your cluster. Have you now identified a true clade of the Y-chromosome phylogeny?

The answer is still no, and the reason is the time symmetry of the mutation model of Y-STR loci. Consider the following Y-chromosome tree.

Nodes with capital letters are at most g=4 generations away from the clade founder. It is perhaps possible to devise a test that would be able to detect all these haplotypes as related. But any test that would identify these haplotypes as descendants of the "founder" node, despite 4 generations of mutations, would also erroneously identify all the lowercase nodes, also at most 4 generations away from the "founder", as members of the clade.

A haplotype cluster centered on a presumed founder who lived g generations ago will invariably include a set of Y chromosomes that do not form a clade.

Whereas a clade includes all the descendants of a single founder, a haplotype cluster will invariably include many men who are g generations away from the founder, whether they are his descendants or not.

Why are Y-STRs qualitatively different from UEPs? While a UEP at the founder defines a watershed moment, separating the founder's descendants (who possess the UEP derived state) from his other relatives (who do not), Y-STRs do not define such a moment: node "m", a cousin of the founder, will possess a haplotype that is 4 generations removed from the "founder", just as node "Q", who is a great-great-grandson, does. By looking at haplotypes it is impossible to distinguish between the two.
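The time-symmetry argument can be checked with a small exact calculation. This is a sketch under stated assumptions, not a full mutation model: one locus, single-step mutations, a hypothetical per-generation rate of 1/1000 in each direction, and no prior information about ancestral alleles, so that a step "up" to a father is simply a step "down" with its sign reversed. The path "UUDD" for the cousin (two generations up, two down) is also an assumption, chosen so that both relatives are 4 generations removed from the founder.

```python
from collections import defaultdict
from fractions import Fraction

def offset_distribution(path, p_up=Fraction(1, 1000), p_down=Fraction(1, 1000)):
    """Distribution of (relative's allele - founder's allele) at one locus.

    path is a string of steps walked from the founder: 'D' goes down to a
    son, 'U' goes up to a father. In each meiosis the son's allele is the
    father's +1 with probability p_up, -1 with probability p_down, and
    unchanged otherwise. Walking up applies the same step with reversed sign.
    """
    stay = 1 - p_up - p_down
    dist = {0: Fraction(1)}
    for step in path:
        sign = 1 if step == "D" else -1
        nxt = defaultdict(Fraction)
        for delta, p in dist.items():
            nxt[delta] += p * stay
            nxt[delta + sign] += p * p_up
            nxt[delta - sign] += p * p_down
        dist = dict(nxt)
    return dist

descendant = offset_distribution("DDDD")  # like node "Q", a great-great-grandson
cousin = offset_distribution("UUDD")      # like node "m", a non-descendant relative

# With symmetric rates the two distributions are identical.
print(descendant == cousin)

# With asymmetric rates (hypothetical values) the direction of time would matter.
skewed_descendant = offset_distribution("DDDD", Fraction(3, 1000), Fraction(1, 1000))
skewed_cousin = offset_distribution("UUDD", Fraction(3, 1000), Fraction(1, 1000))
print(skewed_descendant == skewed_cousin)
```

Exact fractions are used instead of floats so that the comparison of the two distributions is not affected by rounding.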

There is a practical reason why the distinction between haplotype clusters and clades is important, and this has to do with ancient DNA.

Suppose that a very old archaeological sample (of age A years) is Y-STR tested and reveals an R1b-like haplotype. Can we make the inference that this was a member of the R1b clade? No, since many (non-descendant) patrilineal relatives of the R1b founder would have similar haplotypes.

Are we justified in claiming that the founder of haplogroup R1b lived more than A years ago? The answer is again no, as haplotypes similar to current R1b ones existed before R1b was founded.

How is this compatible with the known fact that haplogroups can be predicted from sufficiently long Y-STR haplotypes?

First, such predictions don't rely only on the Y-STR haplotypes, but also on a large number of haplotypes with known UEP results. Haplogroup prediction relies on UEPs and can't be made independently of UEPs.

Second, such predictions don't rely only on the Y-STR haplotypes, but also on the knowledge that they are present-day haplotypes (last row in the figure). Today, only the descendants of the clade founder survive in the haplotype cluster, but this is not necessarily true for earlier times.
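The dependence of prediction on UEP-tested references can be illustrated with a toy nearest-neighbour predictor. This is a minimal sketch, not how any particular prediction tool works: the reference haplotypes, their labels, and the function name predict_haplogroup are all hypothetical.

```python
def genetic_distance(h1, h2):
    """Sum of absolute repeat-count differences across loci (assumed metric)."""
    return sum(abs(x - y) for x, y in zip(h1, h2))

# Hypothetical reference panel of present-day haplotypes. The labels stand
# for haplogroups established by UEP testing, not by the STR values.
reference_panel = [
    ((13, 24, 14, 11), "R1b"),
    ((13, 24, 14, 10), "R1b"),
    ((13, 23, 14, 11), "R1b"),
    ((14, 22, 14, 10), "I1"),
    ((13, 22, 14, 10), "I1"),
]

def predict_haplogroup(query, panel):
    """Return the UEP-derived label of the closest reference haplotype."""
    _, label = min(panel, key=lambda item: genetic_distance(query, item[0]))
    return label

print(predict_haplogroup((13, 24, 15, 11), reference_panel))

# Without any UEP-tested references there is nothing to predict from:
# min() raises ValueError on an empty panel.
try:
    predict_haplogroup((13, 24, 15, 11), [])
except ValueError:
    print("no prediction possible without a UEP-labelled panel")
```

The predictor only transfers labels that UEP testing supplied, and it silently assumes the query is a present-day chromosome like those in the panel.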

Conclusion

Clades cannot be defined based on Y-STR haplotype clusters for several reasons, both practical and theoretical.

On the practical side, it is extremely difficult to define a clade using Y-STRs because haplotype clusters must be shown to be distinctive (clearly separated from other Y-chromosomes) and genuine (separated because of common descent, and not incomplete sampling).

But, even if a clear-cut genuine haplotype cluster is detected, it does not constitute a clade, since the time symmetry of Y-STR mutations necessitates that it will include (erroneously) non-descendant relatives of the founder.

There is nothing wrong with exploratory analysis of haplotype clusters, if one keeps in mind that such clusters are not and should not be thought of as clades of the Y-chromosome phylogeny.

33 comments:

Good discussion! This is a big problem in this field. 1. It's a new field populated by a wide range of persons with different technical and non-technical backgrounds. 2. People use many different terms to describe what they are analyzing, as you have shown. We need to create a better structure. The nomenclature to describe R, especially the many SNP's, is burdensome. Look how SNP's are named.

One point that crossed my mind is that you didn't use the term sub-clade? How do we identify all the groups down the line from M269 in the R Clade? They are something which has to be defined since they have subsequent UEP's.

In defense of some of what has been going on is the fact that the Y STR's, in general, have been more available than SNP's. In fact, as you know, that has been a big problem. Take Clan Gregor with which I am familiar. The STR uniqueness of a MacGregor is a 10 at 385a. Of course, all haplotypes that have a 10 at 385a are not MacGregors. Until s116+ came along we had no screen and even that is very coarse. It is more a job of pattern recognition: do the preponderance of STR values fit the mold? One positive is that it's only been a little over 700 years since the founder was born. This whole sorting job will be much simpler when all the SNP's are known.

"Clade" is a term which is used in the branch of Systematics called Cladistics. Cladistic systematics requires that for each character there be a clear ancestral and clear derived state. SNP markers have clear derived and clear ancestral states. On the other hand STR markers do not have clear ancestral/derived states, so therefore Cladistics cannot be used for constructing tree diagrams from STR data. Within the context of genetic genealogy, I use "clade" when referring to biological groups that are defined by UEP's (ie. any of the types of mutation that are accepted by the YCC, eg. SNP's). For biological groupings that are defined by STR's (for which clear ancestral/derived states cannot be determined) I use the term "cluster". Clusters of haplotypes sometimes do correlate to true biological groups (but not always) - so you can never be 100% certain that an STR cluster does represent a true clade. Likewise, you can never assume that the way that haplotype clusters link together correlates to the true underlying Phylogenetic relationship. In a way, clusters represent a hypothesis that the grouping is true/real - but we need SNP's (or other binomial markers) to confirm those hypotheses.

But, even if a clear-cut genuine haplotype cluster is detected, it does not constitute a clade, since the time symmetry of Y-STR mutations necessitates that it will include (erroneously) non-descendant relatives of the founder.

Can you clarify this, Dienekes? I really don't get it: if there are other relatives of the "founder" that share the same haplotype, this would probably be because your "founder" is not the real founder but a descendant of the true one, right?

Of course there could have been several parallel mutations but that would not just happen within related lines but can perfectly happen within unrelated lines as well.

Anyhow, can't this problem of parallel mutations be minimized by testing for a greater number of STR markers? The larger the array of possible mutating sites, the less likely that they are just a coincidence, right? So if, instead of testing for, say, five markers, we test for 25, we should get a much better picture of the likely genealogy, am I wrong?

After all it is always about minimizing the likelihood of parallel mutations. This is very reasonably minimized with SNPs but to assure some consistency when studying STRs (inside haplogroups, I presume - as this is a lot safer) testing for many variable markers should yield results that are much more likely to exclude parallel mutations. The problem then would be minimized, if not totally suppressed.

Maju: Dienekes is technically correct. All we have to sort people into piles is SNP's. To date there has been a relatively sparse number and people have been erroneously put in "clusters". As you say, the more STR markers looked at, the lesser the chance. I think the mutation rate of the founder's mutation is important since slow mutators are less likely to have many "random" occurrences. TMRCA analysis can also help by showing that the number of mutations the questionable entry has fits in with the core group?? In the Ian Cam, it ranges from zero in 700 years to 6 or 7 over 67. They all fit within the TMRCA timeframe of the founder. This is why I say that you can't look at one individual and make that decision; it is the combined mutations of the whole group that tell the truth. Names certainly don't work. In the Ian Cam we have Stirlings, many of whom are genetically MacGregors. The history of the clan is that the name was proscribed by the British for almost two hundred years. Clan members in this case have many different assumed names, though many reverted back after the proscription expired. To restate Dienekes' point, Surname projects and their lists are not Clades.

It is nice to see Dienekes attempt to tackle topics like this, but in this case I think he has muddled his argument by confusing and conflating several different issues.

Of course a cluster is not necessarily a clade. A clade is a particular thing, and a group of haplotypes may not be a legitimate clade even if they constitute a legitimate cluster. Even that is hard enough to discern, as Dienekes observed.

However, he overreaches when he claims it is "impossible" for Y-STRs to be used to define a clade. Of course this contention is nonsense, and most likely stems from an incomplete understanding of cladistics.

STRs can be used for either non-cladistic analysis (e.g. clustering) or cladistic analysis. Using them for cladistic analysis has many challenges, and in some cases those challenges are practically insurmountable. But not always, and certainly not in principle.

In light of what I am reading on DNA Forum about the preliminary analysis of Decodeme and 23 and me results, I would have to now agree with Vince. There is probably a UEP for each cluster being studied? I would have to guess that the founder of Clan Gregor who had an STR mutation at 385a from 11 to 10, also had a UEP mutation which also is probably carried by every male descendant in his tree??? In this sense, clusters are now trees stemming from a UEP. I think everything we have ever read about Y SNP's may need to be tossed out the window and we're starting from Scratch as to the number and frequency of SNP mutations upon transmission??

However, he overreaches when he claims it is "impossible" for Y-STRs to be used to define a clade. Of course this contention is nonsense, and most likely stems from an incomplete understanding of cladistics

It may be nonsense, but you have made no argument as to how Problem #3 can be overcome.

It is perhaps possible to devise a test that would be able to detect all these haplotypes as related. But any test that would identify these haplotypes as descendants of the "founder" node, despite 4 generations of mutations, would also erroneously identify all the lowercase nodes, also at most 4 generations away from the "founder", as members of the clade. (from the original post)

Actually no test would detect such people because they would be dead, at least most of them. Additionally, you can probably detect two (or several) different histories for each branch, provided you have tested enough markers as to get unambiguous trees.

Each individual in the tree probably has some unique STR mutations (or none at all), with enough info (enough markers tested and compared) the lineage history can be reconstructed with a great degree of certainty.

Actually no test would detect such people because they would be dead, at least most of them.

Incorrect, since Y-STRs of people who have been dead for thousands of years have already been tested.

Each individual in the tree probably has some unique STR mutations (or none at all), with enough info (enough markers tested and compared) the lineage history can be reconstructed with a great degree of certainty.

You don't get to see each individual in the tree, not to mention that you are wrong a lot of the time, when you try to estimate an ancestral Y-STR value.

It may be nonsense, but you have made no argument as to how Problem #3 can be overcome.

The so-called "Problem #3" appears to boil down to an assertion that "any test that would identify these haplotypes as descendants of the "founder" node . . . would also erroneously identify all the smallcase nodes . . . as members of the clade."

You might find that some GD-based clustering methods would, in some cases, result in the kind of classification errors you seem to fear. But I see no attempt to demonstrate that such errors would be universal, which is wise since that attempt would be futile.

And again, "clustering" has virtually nothing to do with cladistics. Attacking clustering as imperfect does not advance your argument against STR-based clades, since the two things are unrelated except in the most superficial ways.

Vincent said: STRs can be used for either non-cladistic analysis (e.g. clustering) or cladistic analysis. Using them for cladistic analysis has many challenges, and in some cases those challenges are practically insurmountable. But not always, and certainly not in principle.

Vince - Name one example of a true cladistic analysis being done on STR data.

I've done Cladistic analyses (in Botany), and when doing Phylogenetic analyses using true Cladistic methodology you explicitly define for each character which character state is ancestral and which is derived. All the methods I've used or seen used on STR data would be classed as methods in Numerical Phenetics (where you don't explicitly define which character states are ancestral or derived).

Take the character "DYS390" - what character state would you define as ancestral?

So, what -in concrete terms- is the methodology that could distinguish between smallcase and uppercase nodes in the Figure using Y-STRs.

I'm really not particularly interested in your so-called Problem 3 as a topic for a hypothetical discussion, since it really has to do with the challenges of defining clusters and offers only indirect insight into the challenges of using highly homoplastic data (i.e. STRs) to resolve cladistic relationships.

I'm fine with the title of your post, and the main thrust of it. I'll happily grant you that clusters are not the same thing as clades. I'd also happily concede, in much the same way, that strawberries are not the same as bananas.

My argument is with the idea that STRs can never be used to resolve a phylogeny.

First, any attempt to draw a distinction between a "clade" and a "cluster" should begin by defining both terms. The word "clade" has a very definite, generally accepted, meaning: a taxon and its descendants. We almost always take the word "clade" to be a monophyletic clade (a taxon and ALL its descendants) but there are other types of clades (e.g. paraphyletic clades). Dienekes used a reasonable definition of the word "clade" but never tried to define the word "cluster". "Cluster" is a much more imprecise term than "clade", but I think most people would be comfortable defining "cluster" as "a group of similar things". Any more definite definition would depend on the context.

Note that all clades are clusters under the two definitions I just gave, but not all clusters are clades. Common descent is one type of similarity you might use to make a cluster (possibly ending up with a clade), but a cluster can be defined using an almost endless array of characteristics whose similarity is evaluated.

Any attempt to draw a distinction between clusters and clades is, whether implicitly or explicitly, drawing a distinction between phenetic classification and phyletic classification. Pheneticists group things based on similarity in character states, and phyleticists group things based on commonality of descent.

Europeans, sea mammals, Christians, carnivores, and fruits are all basically examples of phenetic classification. None of these things are clades, but all could be considered clusters.

Pheneticists have an advantage over phyleticists, in that phenetic classifications are knowable (because they are made up by the pheneticist) whereas phyletic classifications - that is, clades - are unknowable (because we are generally not around to directly observe the phylogeny unfold over time). Phyleticists do what they do, which is attempt to infer the correct phylogeny, because phyletic classifications correspond to real, or natural, evolutionary entities.

Given all that, you can see that the word "cluster" might mean different things in different contexts even within the world of Y-DNA studies. But it is not hard to see that many uses of the word "cluster" in our community are definitely phenetic classifications and not phyletic ones. When people group STR haplotypes by degree of similarity based on GD or some common allele, they are generally clustering. The null DYS425 cluster in R1b and the DYS388=13 cluster in J1, for example, are phenetic classifications: they are clusters and not necessarily clades. I see nothing controversial about observing this.

So, basically, the title of Dienekes' post ("Why Y-STR haplotype clusters are not clades") boils down to the fact that phenetic classification is not the same thing as phyletic classification. Maybe some people are not aware that they are different things, but as I said before, he might as well have entitled his piece "why apples are not oranges".

The majority of what Dienekes wrote, though, has nothing to do with the difference between phenetics and phyletics. Instead, he focused on several practical difficulties in constructing STR-based clusters (i.e., his problems 1, 2, and 3). There ARE practical difficulties in constructing robust STR-based clades, and Dienekes identified some of them. But, since this has everything to do with phenetics and nothing to do with phyletics, it is irrelevant to the matter of STR-based clades.

So when Dienekes goes on to suggest that it is "impossible" to use STRs to define a clade, he does so without having made any serious argument for that position. And indeed, he is wrong to suggest it.

Go back to the definition of "clade": a taxon and all its descendants. There is nothing there that prevents IN PRINCIPLE the use of STRs by someone to correctly infer the cladistic relationship that exists. There is no legitimate theoretical argument against it, and honestly Dienekes does not even attempt one.

There are practical arguments against the use of STRs in cladistic analysis, the main one being the difficulty in resolving issues of homoplasy and homology. These issues are not always surmountable, and often are not. I've seen at least one textbook spend a whole chapter making a case that using STRs in phylogenetic analysis should be avoided. Biologists almost never use STRs in constructing species phylogenies, for example. But the reasons are practical, not philosophical. In the case of human population studies where the time depths are much smaller, the number of STRs with low mutation rates (e.g. tetra- and pentanucleotide STRs) is reasonable, and the extinction rate was very high for stretches of time, I argue that it is entirely possible - in some cases - to use STRs cladistically.

In fact, human population geneticists do it all the time. Any paper that includes a STR-based median joining network in which reticulations are not present - there have been several published this year - is engaging in an ad hoc phyletic (aka phylogenetic) analysis using STRs. FLUXUS is one piece of software that can handle STRs in their original character states, thereby allowing the construction of a tree (aka network) in which the transition from ancestral to derived state is inferred. PAUP* is another piece of software that can do this.

I've used STRs to produce true cladistic analysis in some cases (R1b1* and R1b1b1), but have tried and failed in other cases (subclades of R-M269, for example). Some factors contributing to the successes are in the scientist's control (good sampling, robust haplotypes, etc.) and others are not (lineage extinction rates being the main one).

I'll close with one final analogy. Imagine that someone had suggested that it is impossible to use house paint on an automobile. I think most people who know about paint and cars would agree that it is both difficult and usually ill-advised, but it is not impossible. And people who are both knowledgeable and creative would probably be able to think of some situations in which it is possible and to name some possible techniques for doing so with usable results (thinning, experimenting with primers, covering with clear coat, etc). A competent and conventional painter might never choose house paint over auto paint if given a choice, but might choose house paint over no paint if those were his options.

I think both of you guys are beating a dead horse. I still say if you go get a biology book, Dienekes wins; if you talk to surname project mgrs Vince wins. Who really cares???

The real problem we are faced with is how do you model the Y chromosome and its evolution? Are we now seeing that there may be an SNP mutation for every birth?? If we knew all the STR's, would it be similar? i.e. is there a paired mutation for each birth??

If you had the entire Y chromosome mapped - how would you analyze it??

On top of this we have the issue of bottlenecks? Two of Dienekes' recent posts suggest to me that Neanderthal demise may be associated with a Volcanic eruption and subsequent global cooling?

None of this has much to do with this thread directly (no different than the above discussion I'd say). But you two guys are two of the brighter guys I read and I wish you would look over this problem, prioritize what we have to understand better and identify the data we need to solve the problem. The recent Kemp effort at WSU looks real interesting and very focussed. That's what I'm talking about. In vino veritas

Dienekes used a reasonable definition of the word "clade" but never tried to define the word "cluster".

"Cluster", unlike "clade" does not have a formal definition. In the body of the post it is clear that I am speaking of attempts to define "clusters" reflective of recent common descent relative to the rest of mankind.

The rest of your entry is not worth responding to, since you once again fail to provide a concrete methodology for defining a clade using Y-STRs.

The only point of interest is the allegation that "a STR-based median joining network in which reticulations are not present" is a clade. That is wrong.

I challenge you to provide evidence that such networks or any part thereof have been named "clades" in the scientific literature, or that they can be made the basis of a clade definition, i.e. a procedure to infer whether a haplotype belongs to the clade or not.

The only point of interest is the allegation that "a STR-based median joining network in which reticulations are not present" is a clade. That is wrong.

Maybe, but that is not what I said. Of course a network is obviously not a clade, just as a phylogenetic tree is not a clade.

What I said was that MJ networks constructed using STRs are examples of cladistic analysis, and they are.

Can you really look at a phylogenetic tree, in which the root (aka MRCA) is marked - based on an inferred ancestral haplotype - and from which a TMRCA is estimated, and not recognize it as a cladistic analysis? Really?

As for the idea of "defining" a clade, the method is the same regardless of the kind of character you employ: Build a tree that best describes the history of your taxa, then pick a node and all the descendant taxa. That's your clade, or at least your best representation of it. The method is the same regardless of whether your characters are SNPs, STRs, or the number of tail feathers on a bird.

A haplogroup is a clade defined on the basis of binary markers. It includes the Y-chromosomes that possess the derived state, and does not include those that possess the ancestral state.

That is not accurate. A haplogroup is a clade, that part is true, but the rest is imprecisely stated.

I don't know how many times we should remind you that clades exist independently of the binary markers to which you so helplessly cling. They are independent of the existence of the marker, and are independent of our knowledge of the marker.

We don't define clades, we infer them (or uncover them, or discover them, .... you get the idea). We characterize them, for our own convenience, but they could not care less how we do it.

For example, a chromosome that is P25+ M269+ S127+ S128- S129+ S116+ U152+ would be recognized by everyone as being in the haplogroup we call R-U152 despite the absence of one of its "defining" binary markers (S128). This is because the weight of the evidence (that is, the rest of the haplotype) is strong enough to demonstrate this.

I don't know how many times we should remind you that clades exist independently of the binary markers to which you so helplessly cling.

The set of haplogroups (clades defined on the basis of UEPs) is a subset of the set of clades. It is possible to define a clade on the basis of UEPs. That does not, of course, mean that all clades can be defined on the basis of UEPs.

>> What is a clade defined on the basis of Y-STRs?

The same as any other clade: a taxon and its descendants.

You are not answering the question. I asked you how a clade is defined on the basis of Y-STRs. Your "definition" does not even refer to Y-STRs.

Repeating the dictionary definition of what a clade is, is not the issue here.

The set of human Y-chromosomes clades includes inter alia the following:

(a) Socrates and his patrilineal descendants
(b) Members of haplogroup E-V13

For (a), a clade is defined on the basis of a historical individual. It's a valid definition even though there is no way of testing it.

For (b), it is defined on the basis of a SNP, V13. There is a way of testing it: testing for V13, and it's a very accurate one, even though there may be (in a large population) a handful of individuals who will be V13+ even though they are not in the clade or V13- even if they are.

Incorrect, since Y-STRs of people who have been dead for thousands of years have already been tested.

Admitted, but my point stands anyhow, as most tests (and related studies) are done with living people.

You don't get to see each individual in the tree, not to mention that you are wrong a lot of the time, when you try to estimate an ancestral Y-STR value.

I am not talking of age estimates but of the genealogical tree structure. If a branch is longer or shorter (in time or generations), it does not fundamentally alter the lineage.

In a different context: Y-DNA R will always be a descendant of P, whatever the timeframe estimated for each node of the tree. Equally, if you can pinpoint a root haplotype for STR analysis, you can reconstruct a most-likely tree.

Of course, you can err when selecting a root haplotype (different choices would produce somewhat different trees) but you will still get a non-arbitrary "horizontal" relationship structure, wherever the root is. This structure is meaningful in itself (though if you use few markers it can be very ambiguous).

Maju: Let me respond to you. I would like your opinion. Based on what I've read, I think Dienekes' objection is that "defining" a clade using STR's cannot be done since they are not unique??? Again, it appears, at the 67 dys loci STR level, that the direct descendant of the Clan Gregor founder has the same haplotype as the founder?? Ergo, there are many persons over time who possess the same haplotype. It is not unique! Again ergo, it cannot define a clade. Is this Dienekes' argument?

If it is I would suggest that we truly don't know that a STR haplotype is always not unique?? Is it possible to define a Clade on the basis of a unique haplotype?

I bring this up because there are many more Dys loci than we work with. For all I know if we knew the full set of STR dys loci, we would find out that each person has a unique STR haplotype. Given this circumstance, could we then define a clade based on this unique haplotype???

The common property in this scenario is an STR mutation. Example: The Scotti are an Irish tribe who invaded Scotland and eventually took over kingship. The trademark of all the Scotti clans is a 10 at 391 and an 11 at 385a. The clan Gregor founder had a back mutation from 11 to 10 at 385a and the characteristic of a descendant of the MacGregor is this 10,10 set of dys loci values. This is the common property of what I call a generic MacGregor. The next question is: was the founder unique? If in fact all descendants have a new mutation then he is. Over 67 dys loci I don't see this difference, but that's not to say that if all the Y STR's were known, then he might have a unique haplotype such that the founder could be uniquely identified.

I only mention this because with 23 and me and decodeme results becoming available it may be such that each person has a completely unique signature defined by an SNP and an STR mutation??? Therefore we are unique in an SNP sense, but we are also unique in an STR sense.

The problem is not so much finding a unique haplotype, but rather a common property shared by the descendants of a node that isn't shared by its ancestors.

But a haplotype is a common property: those people have accumulated the same STR mutations, probably in the same ancestral event/s.

Let's see: since a number of STR mutations accumulated in some guy (as we are talking of Y-DNA), all his descendants will have such haplotype (let's call it HT1)... until some of them develop new mutations (hence: new haplotypes: HT2, HT3, etc.), right?

The only problem I can see is whether the same HT1 could be formed independently in a totally unrelated person by chance. This is a possibility, especially when the resolution (number of variable loci) is low. But otherwise...

And that's why STR haplotyping is safer (more accurate) at higher resolutions and also necessarily subservient to haplogroup determination.

In other words: I do think that STR haplotyping can give valuable info on ancestry and help to create genealogical trees within haplogroups, always provided that the typing is of sufficient quality.

There can always be doubts whether this or that haplotype is ancestor, "sibling" or descendant (IDK: is the McGregor "back-mutation" a true derived back-mutation, a preserved ancestral state or a parallel "sibling" mutation to the general Scott marker?) but there is valuable info in all that anyhow.

I fail to see how this is a difficult problem. You point out that a phenetic analysis (genetic distance) may fail to give good clades. Well, this is not a new problem, in fact this is the reason people do cladistics. As I'm sure everyone knows, cladistic relationships are defined solely by shared-derived states (synapomorphies).

The simplest cladistic method based on Y-STRs would take each STR locus as a character and each STR length as a character state. You could then simply do an unweighted Maximum Parsimony analysis (easily implemented in PAUP*). You don't need to declare or know polarity for this. Polarity is determined after the analysis by placing the root. Roots are best placed with a good outgroup.

Is this guaranteed to give you the correct answer? No, of course not. Phylogenetic signal will decay rapidly when your character mutates rapidly. The high mutation rate of STRs makes them terrible phylogenetic characters. However, there is nothing inherently different about STRs as characters that means you can't use well-established cladistic methods to infer the phylogeny.
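The parsimony idea in the comment above can be sketched with Fitch's algorithm, which counts the minimum number of state changes a given tree requires when each STR locus is a character and each repeat count an unordered state. This is only an illustration under stated assumptions: the haplotypes and the two candidate trees are invented, and the code scores trees that are handed to it, whereas a program such as PAUP* would also search over tree topologies.

```python
def fitch_score(tree, states):
    """Minimum number of state changes for one character on a rooted binary tree.

    tree is a nested tuple of leaf names, e.g. (("a", "b"), ("c", "d")).
    states maps each leaf name to its character state (an STR repeat count).
    """
    def visit(node):
        if isinstance(node, str):
            return {states[node]}, 0
        left, right = node
        left_set, left_cost = visit(left)
        right_set, right_cost = visit(right)
        common = left_set & right_set
        if common:
            return common, left_cost + right_cost
        # No shared state: one change is needed on this pair of branches.
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]

def parsimony_score(tree, haplotypes):
    """Sum of Fitch scores over all loci (unweighted parsimony)."""
    n_loci = len(next(iter(haplotypes.values())))
    return sum(
        fitch_score(tree, {name: hap[i] for name, hap in haplotypes.items()})
        for i in range(n_loci)
    )

# Hypothetical 3-locus haplotypes.
haplotypes = {
    "a": (13, 24, 14),
    "b": (13, 24, 15),
    "c": (14, 23, 14),
    "d": (14, 23, 14),
}

tree_1 = (("a", "b"), ("c", "d"))
tree_2 = (("a", "c"), ("b", "d"))

print(parsimony_score(tree_1, haplotypes), parsimony_score(tree_2, haplotypes))
```

The tree with the lower score needs fewer mutations to explain the data, but with fast-mutating loci a low score is no guarantee that the tree is the true one.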

Dienekes' Anthropology blog is dedicated to human population genetics, physical anthropology, archaeology, and history.

You are free to reuse any of the materials of this blog for non-commercial purposes, as long as you attribute them to Dienekes Pontikos and provide a link to either the individual blog entry or to Dienekes Anthropology Blog.
