I think it has more to do with comparing SNPs that are at the same "level" (same number of steps down from a common ancestor) than SNPs that are "terminal," which is even more of a moving target. I may be mistaken, but anyway that was why I've been pestering the FTDNA help desk (for more Z-SNPs to test) since last June: in part, to get M153 down to a more realistic level (about seven to nine steps lower than it's been appearing, on the ISOGG tree) -- but mainly to facilitate MRCA comparisons between nodes on the NS cluster side and nodes on the L176.2 side, of Z196. The confusion of tongues (as it were) between the SNP naming systems (Z, L, DF or whatever) doesn't help. There may be persons for whom it isn't confusing, but I'm not all alone in that category.

Wouldn't having that as a requirement make it nearly impossible to calculate TMRCA? How can you be reasonably sure you have arrived at the terminal SNP short of whole genome testing?

I think this has been our fundamental problem with TMRCA estimations. My Clan Gregor Ian Cam calculations use father/son rates, and I can estimate the founder's date pretty well. At one time I did the same for M226 and got c. 400 AD, which is near when O'Neill was born?

The problem is that the M226 entries end at the SNP for that class of entrants; there is absolutely no M226 older than that date. Therefore, it follows that to make an intelligent TMRCA calculation, the data set has to be entries with only one SNP; any younger SNPs will bias the estimate lower!

I've been mulling this over the weekend and I have convinced myself that this is the only way to make sense of it. This is why, I believe, all TMRCAs have appeared to be too young: they were a "gemischt" of subsequent, sometimes unfound/unknown SNPs, which confounded the calculation. JMHO!

I laud your objective. I'm not sure how many SNPs are in some sense parallel; there has to be some kind of pecking order, I believe, but that question may be in the noise when you are comparing two SNPs of comparable age relative to the total TMRCA?

Why do you think all the TMRCA calcs 'appear' too young?

If you would like an upper boundary for L21, calculate the interclade age for P312; you don't even have to use L21 people in the calculation.

The question of the age of SNPs is contentious; it is not clear, certainly in academia, whether Zhiv is right or straight father-son meioses calculations are correct. So my statement stands for itself.

What data do you use to calculate the interclade age for P312? If it's from subclades, it should probably be only entries from the subclades, not downstream SNPs?

Two sub groups of P312, U152 and L21 spring to mind but you could use U152 and what's predicted to be under DF27.

How do you know, or does it make any difference, whether you have U152 or U152 and all its downstream SNPs? I'm arguing you shouldn't include downstream SNPs in TMRCA calculations. I'm not sure what the impact on interclade calculations is; it depends on the implied assumptions. Carrying all the extra baggage makes the dates seem smaller. That's what I am finding out.

I don't think it would make any difference; I doubt there are any P312+ people who don't have a known (probably almost all of them now) or unknown downstream SNP anyway.

The important thing is to make sure you have two clearly defined groups.

I am suggesting that having multiple SNPs in a dataset is not a clearly defined group!

When I comix SNP groups, I muddle up the TMRCA. No M226 (founder c. 400 AD) is older than the founder. To include those entries in a calculation for an older haplotype is wrong. I am living today and am M226-. As far as I know, my line's last SNP was Z253. To compute the TMRCA of Z253, I have to use a group of 253's. If I include downstream SNPs, the calculation will be shortened. I would conclude that if you wanted to do a 106, R-L21 interclade, then you would only use entries that have 106 and L21 SNPs as their last SNP; otherwise you're "gemischting".

Interclade calculations find the approximate TMRCA of two groups; Ken's spreadsheet also outputs the intraclade ages.

I would like a good discussion on this point, because I believe it explains the problems we've been having in making time estimates using STRs.

Suppose you wanted to estimate the TMRCA to M269; the only entries that should be used are those that are M269+, people living today, who have not experienced a SNP in their family tree since M269+ was born. Including other entries subsequent to M269 would shorten the estimate. It appears that the most recent SNP a person has determines their most recent common ancestor with entries of like SNP.

Think of it from a "bottoms up" approach. Take the most recent SNP we are aware of and, using the entries that have that SNP, we can estimate that group. What we have left is the founder's haplotype, which has a previous SNP; do the same thing for that SNP. You will be working back in time as far as founders go, but you will still have entries living today who will go back to the next previous SNP, etc.

I believe this is the approach one has to use to intelligently compute TMRCAs. JMHO.

?????

Perhaps you mean haven't tested positive for any known downstream SNP, but I can't see the logic in that either.

You are correct. There are no downstream/subsequent SNPs after M269. This implies that the haplotype will be very diverse, since its TMRCA is when the M269 mutation occurred.

The logic is as I stated above: the haplotype of that entry will have had a long time to experience STR mutations and will therefore reflect the time back to its founder. It's becoming clearer to me that including younger SNPs will reduce the diversity, since the founder existed a briefer period of time.

On another thread Mike asked the importance of diversity. I think I can now answer his question: the haplotype has to have the SNP of interest and no subsequent SNP mutations to reflect the diversity in his haplotype for the age of the SNP of interest.

Certainly, you will agree that time zero for a SNP is the founder haplotype and subsequent descendants reflect diversity from that haplotype only?

I agree to an extent about not including the downstream. It seems more realistic as an estimate. However, if the samples being compared in an interclade are small and the result of a recent founder effect, for example, it may become important to include downstream SNPs so that a more accurate variance will be reflected.

I think we have a lot of work ahead of us trying to fully understand this effect, if it turns out to be true. As Razyn pointed out, you may be able to include parallel SNP data of comparable age, that certainly seems feasible.

I think one thing that has to happen is to have FTDNA project data be organized by SNP within a clade/subclade. This will permit calculations to determine which SNPs are parallel, among other things.

Rms's issue about whether we have all the SNPs is still valid and may confuse the data?

That's not what I meant.

I'm afraid I don't understand what you are saying and struggle to find any logic in your explanations.

I'll try again. SNPs are a time-ordered set, characterized by Hgs. Accompanying the formation of a Hg is a unique SNP occurrence, as I understand it. Further, within a Hg are subclades, which are also characterized by the occurrence of a SNP mutation, such as M269, L11, R-L21, Z253, M226 etc. This is a hierarchical set with M269 being the eldest.

Associated, I believe, with each subclade is a particular sequence of STRs, called the founder's haplotype. This is what we converge to when we do an STR TMRCA.

We form sets of people in FTDNA projects with different sets of SNPs, names etc. All the entries we use are from people who are alive today, or who just recently passed away. Each person has a set of SNPs on his Y chromosome, which appears to be a historical record through time.

My point is that when I want to estimate the TMRCA of R-L21, I should only use entries whose last mutation is that one, with no subsequent SNPs. Each subsequent SNP has a shorter TMRCA and will reduce the time estimate to R-L21, since their "diversity" started at a more recent point in time. It's like comparing apples and oranges: they're just not the same kind of thing and, more especially, they do not show the same kind of "diversity".
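
The idea that STR "diversity" around a founder's haplotype tracks elapsed time can be sketched in a few lines. This is only an illustration under stated assumptions: a simple stepwise mutation model, the modal haplotype standing in for the founder, and a single constant mutation rate. The function names, the toy haplotypes and the rate of 0.0025 are hypothetical and chosen for round numbers; they are not taken from any project data or from Ken's spreadsheet.

```python
from statistics import mode


def modal_haplotype(haplotypes):
    """Most common repeat count at each marker (used here as a stand-in founder)."""
    return [mode(column) for column in zip(*haplotypes)]


def mean_variance_from_founder(haplotypes, founder):
    """Squared deviation from the founder value, averaged over entries and markers."""
    total = 0
    for hap in haplotypes:
        total += sum((allele - f) ** 2 for allele, f in zip(hap, founder))
    return total / (len(haplotypes) * len(founder))


def generations_to_founder(haplotypes, mutation_rate):
    """Variance divided by the per-marker, per-generation rate (assumed constant)."""
    founder = modal_haplotype(haplotypes)
    return mean_variance_from_founder(haplotypes, founder) / mutation_rate


# Hypothetical three-marker haplotypes, invented for the example.
toy = [
    [13, 24, 14],
    [13, 24, 15],
    [13, 25, 14],
    [12, 24, 14],
]
estimate = generations_to_founder(toy, mutation_rate=0.0025)  # hypothetical rate
```

With these toy values the modal haplotype is 13-24-14, the mean variance is 0.25, and the estimate comes to about 100 generations. The disagreement in this thread is about which entries belong in `toy`, not about the division itself.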

I don't think downstream SNPs should be dismissed. The thing is that while, say, R-L21 people might have a common ancestor that lived 3400 ybp, and R-U152 people have a common ancestor that lived 5000 ybp, when you calculate the common ancestor of a set that has both L21 and U152, you are finding the common ancestor of both L21 and U152, and that TMRCA should ideally be older than both the common ancestor of L21 and the common ancestor of U152. That would in fact be a good way to test the reliability of TMRCA. I'm not sure how the sample dynamics would affect the TMRCA, i.e. if there is a set that is overpopulated by L21 folks with little U152, would the TMRCA be driven down to a number closer to the L21 TMRCA, or would it not change? I suspect there would be a significant impact on the TMRCA, just because of the way it is calculated. In fact, let's see something very quickly:

Using the data from Myres et al. (2010), let's explore the variance of each SNP individually, and then as a group.

The variance for L21 (n=126) is 0.2238, the variance of U152 (n=203) is 0.2089, and the variance of the combined sample L21+U152 (n=329) is 0.2146. So there is definitely something off here: the variance of two different SNPs should be greater when combined than that of each one separately. If variance is a direct measure of TMRCA, the earliest point at which R-L21 and R-U152 could share their MRCA would be R-P312. Their MRCA could have lived later than R-P312, but it would still have to be older than both the MRCA of L21 and the MRCA of U152.

PS: I know there is an anomaly here with the U152; this is probably caused by the Bashkirs, who have a very young U152. Nonetheless, that doesn't change the fact that the variance of any two SNPs which descend from a common ancestor should, at least in theory, be greater than their individual variances. But to make sure this anomaly isn't causing this, Mikeww, could you direct me to sets of L21 and U152 that use the 36 most linear STRs you have mentioned before, so that I can repeat this test?
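
A quick arithmetic check on the three variance figures quoted above may help frame the puzzle. The sketch below computes the size-weighted average of the L21 and U152 variances; the function name is hypothetical, and nothing here is a claim about how the combined figure was actually calculated.

```python
def pooled_within_variance(groups):
    """Size-weighted mean of per-group variances; groups is a list of (n, variance)."""
    total_n = sum(n for n, _ in groups)
    return sum(n * variance for n, variance in groups) / total_n


l21 = (126, 0.2238)   # (n, variance) as quoted in the post
u152 = (203, 0.2089)  # (n, variance) as quoted in the post

within = pooled_within_variance([l21, u152])
print(round(within, 4))  # 0.2146
```

The weighted average matches the combined figure of 0.2146 to four decimals. By the law of total variance, the variance of a pooled sample is this weighted within-group term plus the variance of the group means, so a combined value equal to the weighted average would suggest that the L21 and U152 means nearly coincide on the markers used. If that is what is happening, a pooled variance would not by itself measure the time back to the shared ancestor.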

They are not being dismissed; excluded is a better word. They reduce the variance of the R-L21 sample. All the M222 entries converge to one younger man. If I looked at a tree of R-L21, I would see a small set of entries whose origin is from R-L21 to the current time. For others I would see lines from subsequent SNPs such as M222 to a larger set of entries. To compute their variance I would have to add the variance to M222 and then the variance from M222 to R-L21? I don't believe we are including the latter variance? We are computing the variance to M222 and saying that is the same as computing the total variance back to L21? JMHO

I am just exploring ways to express my concerns. Your first calculation indicates something is not right here. I will be exploring the Z253 data set in more detail to see what I can observe from the data.
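
The question of whether a variance "to M222" has to be added to a variance "from M222 to R-L21" can be illustrated with a one-marker toy example. It relies only on an algebraic identity: the mean squared deviation from any reference value equals the variance around the group's own mean plus the squared offset between that mean and the reference. The repeat counts below are hypothetical, and whether a given TMRCA tool measures from the upstream founder or from the group's own modal value is an assumption to check, not something this sketch settles.

```python
from statistics import mean, pvariance


def mean_squared_deviation(values, reference):
    """Average squared distance of the repeat counts from a reference value."""
    return mean((v - reference) ** 2 for v in values)


# Hypothetical repeat counts at a single marker.
upstream_founder = 13                    # value in the older (upstream) founder
subclade_values = [14, 15, 15, 15, 16]   # living members of a younger subclade

within = pvariance(subclade_values)                            # spread inside the subclade
offset_sq = (mean(subclade_values) - upstream_founder) ** 2    # drift before the subclade arose
total = mean_squared_deviation(subclade_values, upstream_founder)
```

Here `within` is 0.4, `offset_sq` is 4, and `total` is 4.4, so the two pieces do add up. Measuring the subclade only around its own centre keeps the first piece and drops the second, which is the concern raised in the post above.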

Just to give you my perspective, I try to avoid TMRCA or variance estimates of groups of people in different subclades in one TMRCA. However, this is per the level of the Y DNA tree I'm on. For instance, if I do a calculation for R-L21, I include all of R-M222 along with all of the other L21 subclades. If I was doing P312, I'd also add in Z196, U152, etc.

I would not add in U106 or parts of U106 or L11* though. That would be akin to including a "partial" and perhaps arbitrary data set.

I think of SNPs as great filters. As you have noted, TMRCA estimates generally are not very precise and subject to mutation rate controversies. However, if I filter everyone out except those that are in the subclade at question (as marked by some SNP derived (+) result) then I've reduced the potential for error.

We should not think of SNPs, themselves, as the subclades. They are just markers on the subclade branches of the Y DNA tree. They could be representative of only a portion of a bigger, but very closely related, branch of people that they all sit on.
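
The two ways of building a data set argued over in this thread, SNP-as-filter (everyone at or below a branch) versus terminal-only (entries whose last known SNP is that branch), can be sketched side by side. The tree below is deliberately simplified, with intermediate branches omitted, and the kit labels and the `select` helper are hypothetical.

```python
# Simplified child -> upstream mapping; intermediate branches are omitted.
PARENT = {
    "L21": "P312",
    "U152": "P312",
    "Z196": "P312",
    "Z253": "L21",
    "M226": "Z253",
    "M222": "L21",
}


def is_within(snp, clade):
    """True if snp is the clade itself or sits anywhere downstream of it."""
    while snp is not None:
        if snp == clade:
            return True
        snp = PARENT.get(snp)
    return False


def select(entries, clade, terminal_only=False):
    """Filter entries by clade, either inclusively or by last known SNP only."""
    if terminal_only:
        return [e for e in entries if e["terminal"] == clade]
    return [e for e in entries if is_within(e["terminal"], clade)]


# Hypothetical project entries; "terminal" means the most downstream SNP tested so far.
entries = [
    {"kit": "A", "terminal": "L21"},
    {"kit": "B", "terminal": "M222"},
    {"kit": "C", "terminal": "M226"},
    {"kit": "D", "terminal": "Z253"},
    {"kit": "E", "terminal": "U152"},
]

inclusive_l21 = select(entries, "L21")                     # kits A, B, C, D
terminal_l21 = select(entries, "L21", terminal_only=True)  # kit A only
```

The terminal-only set is only as good as the testing behind it, since an entry labelled L21 may simply not have been tested for anything further downstream.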

When I can get back to my home computer, I'll update the Haplotype_Data_P312xL21 file. Do you want to compare L21 to U152? I don't have the Myres data set in my file. I just have stuff straight from FTDNA project screens. As has been discussed, there should probably be some random sampling runs based on some cross-sectional "representative" reference by geography or by subclade (of L21 or U152). I don't do that because I'm not smart enough and I think our data is limited as it is.

Based on some recent msgs from Hans Van vliet and Machiavelli, I am beginning to believe that only SNP subsets that only have the same SNP and no subsequent SNPs can be used for TMRCA calculations?

I think I'm missing the point and am just catching up on this thread, but there are plenty of SNPs out there. We just haven't discovered them yet. If literally, we wanted to find some group with no lower level SNPs, we might have to relegate ourselves to groups only as big as a father and his sons.. or maybe also the uncle, grandfather and g-grandfather, but maybe not all of the sons.

I don't have access to the Yahoo Project; you will have to tell me how to sign up using a hotmail account. It's fine if it is FTDNA; I just want to try it out at a higher number of STRs, just to make sure that the small numbers aren't tricking me. As for the random sampling, that would actually be a good thing to test. I can write a program to extract 75 random haplotypes from each set. It will be good to test the whole set of L21 and the whole set of U152, see if their total population numbers have any influence on the variance, and then repeat the same test using 75 or 100 randomly sampled haplotypes from each set.
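
The random-sampling test described above might look something like the following sketch. It assumes each haplotype is a list of integer repeat counts in a fixed marker order; the function names and the toy data are hypothetical, and population variance averaged over markers is just one possible choice of diversity measure.

```python
import random
from statistics import pvariance


def sample_haplotypes(haplotypes, k, seed=None):
    """Draw k haplotypes without replacement; a fixed seed makes the draw repeatable."""
    if k > len(haplotypes):
        raise ValueError("sample size exceeds the number of haplotypes")
    return random.Random(seed).sample(haplotypes, k)


def mean_marker_variance(haplotypes):
    """Population variance of repeat counts at each marker, averaged over markers."""
    columns = list(zip(*haplotypes))
    return sum(pvariance(column) for column in columns) / len(columns)


def resampled_variances(haplotypes, k, runs, seed=0):
    """Mean marker variance for several independent random subsamples of size k."""
    rng = random.Random(seed)
    return [mean_marker_variance(rng.sample(haplotypes, k)) for _ in range(runs)]


# Hypothetical two-marker data set of 200 entries, invented for the example.
toy_set = [[13 + (i % 3), 24 + (i % 2)] for i in range(200)]

subset = sample_haplotypes(toy_set, 75, seed=1)
spread = resampled_variances(toy_set, 75, runs=10)
```

Comparing `mean_marker_variance` on each full set with the spread of values from repeated 75-entry draws would show how much of any difference between L21 and U152 could be down to sample size alone.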