I've actually played with "microsatellite choice" in the past, because of concern about your point. I ran through the R-L21 file of long haplotypes and tried 12, 25, 37, and 67 length haplotypes, and after throwing out the multicopy and null STRs, I would run variance calculations adding or subtracting an STR or two. What I found was that the variance relationships between the subclades of L21 were fairly stable once you start using more than 15-20 STRs.

Generally, I find very little jostling of the relationships in R1b subclades when you start using 25 or so markers and get up to about 30 haplotypes.

Here is a "test" run for you on R-L21's major subclades based on different sets of markers.

Relative variance with the 49 mixed speed, non-multicopy, non-null STRs from FTDNA's 1st 67:
L21__________: Var=0.99 (N=2590)
DF21_________: Var=0.80 (N=116)
L513_________: Var=0.75 (N=157)
Z253_________: Var=0.61 (N=145)
M222_________: Var=0.49 (N=540)
Z255_________: Var=0.39 (N=102)

* Linear durations greater than 7000 years according to Marko Heinila's analysis.

See how stable the order of the above haplogroups stays? The percentage differences between the different haplogroups do change depending on the STRs used. I am not trying to say that STR variance is precise. It isn't, but the more data you have, the more you can improve precision.

Generally, what I've found is that the linear 36 STR (most of which are slower) and the 49 STR mixed speed marker calculation runs rarely change the positioning of haplogroups.

Most variance relationships between R1b haplogroups work well at 16 or 24 markers on 37 length haplotypes. M222 did flip-flop with Z255 for us on the low marker runs above. The notable exception, however, is U198, which looks quite old (high variance compared to U106 or Z381) with the 37 length haplotypes. But if you ratchet up the U198 analysis to 36 or 49 markers on 67 length haplotypes, everything seems to fit back into place (younger than Z381).

I just think it is the law of large numbers at work and the value of having more STR "experiments."

First off, I’m quite curious: how do you get variances that are so close to 1, or that are even greater than 1? Generally, what I understand as variance is average mutations per marker, or is it per haplotype? Could you show me what the mean mutation rate per marker is for the 36 or 16 best linear duration markers, and if possible the standard deviation? Like I said, the law of large numbers works when the overtly different markers behave as outliers. If one has overtly different markers in terms of mutation rate, it doesn’t matter if you have 10, 100 or 500 STRs; your estimates are going to be poor. Here is an example, just to show you what I am saying:

Say one has a series of 10 STRs with the following mutation rates (×10^-3):

10 7 8 7 8 9 50 20 3 6

The mean value would be 12.8 and the standard deviation would be 13.79. This distribution would result in a poor estimate, because any mutations occurring in the marker with the mutation rate of 50 would largely overestimate the TMRCA, while any mutation occurring in the marker with mutation rate 3 would underestimate the TMRCA somewhat.

Now let’s say we increase our testing set to 37 markers with the following mutation rates (×10^-3)

The mean value would be 32.41 and the standard deviation would be 27.1656 (granted, it improved significantly from the 10 STR distribution). With this distribution, mutations occurring in the markers with mutation rates of 96, 89, 50, and 56 would all largely overestimate the TMRCA, while a large portion of the mutations found in loci with mutation rates less than 32.41 would underestimate it.

Now say instead one chose a panel with the following set of markers, again with mutation rates (×10^-3):

5 8 9 11 7 8 4 5 8 7 9 10

That is 12 markers; they have a mean mutation rate of 7.583333 and a standard deviation of 2.1087. This is actually a really good distribution, because even if most mutations were in, say, the marker with the slowest mutation rate, the amount by which the TMRCA would be underestimated would be far smaller than the amount for mutations located in slow (i.e. with mutation rate less than 20) markers in the 37 marker sample set.
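The figures quoted for the 10 marker and 12 marker panels can be checked with a few lines of Python. This is only a sketch: it assumes the quoted standard deviations are sample standard deviations (n - 1 in the denominator), and the ratio of standard deviation to mean is added here purely as a convenient way to compare the spread of the two panels. The panel names are made up for the example.

```python
from statistics import mean, stdev

# Mutation rates (x10^-3) copied from the two hypothetical panels above.
mixed_panel = [10, 7, 8, 7, 8, 9, 50, 20, 3, 6]
uniform_panel = [5, 8, 9, 11, 7, 8, 4, 5, 8, 7, 9, 10]


def summarize(rates):
    """Return (mean, sample standard deviation, SD/mean) for a panel."""
    m = mean(rates)
    s = stdev(rates)  # sample SD, i.e. divides by n - 1
    return m, s, s / m


for name, rates in [("10 marker panel", mixed_panel),
                    ("12 marker panel", uniform_panel)]:
    m, s, ratio = summarize(rates)
    print(f"{name}: mean={m:.4f} sd={s:.4f} sd/mean={ratio:.2f}")
```

For the first panel the spread is larger than the mean itself, while for the second it is well under a third of the mean, which is the contrast the example is making.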

So as you can see, the law of large numbers can only do so much to harmonize the distribution. If one has a distribution where the numbers differ from one another by orders of magnitude, it is never going to get fixed no matter how many numbers you keep adding. The other solution would be adding a large number of fast STRs, which might drive the slow STRs to a minority position where they would act as outliers; hence, the more fast STRs one adds, the smaller the effect of the outliers. However, if one tries to measure TMRCAs that are outside the linear range of those STRs, then the margins of error are going to be huge, and the value would be greatly underestimated.

Despite the lack of harmony among hobbyist researchers, what Vincent posted on this specific topic is supportive of positions Anatole Klyosov has taken, where he uses the average rate for a set of STRs in his TMRCA calculations rather than applying each rate individually, which would be needed if you wanted to weight STRs.

Essentially, in a nutshell, this is the current thought process of some of the folks in the hobbyist community:

1-We know that there might be loss of linearity and saturation of mutations in the long run if a haplogroup is older than x kybp.

2-Well, it is safe to use it in haplogroup A, because I think haplogroup A isn’t older than x kybp.

3-I got a TMRCA y which is younger than x kybp, so it must be correct.

4-It has been crosschecked using different sets from FTDNA, and all of them yielded a TMRCA younger than x kybp for haplogroup A, so it must definitely be correct.

5-In the case of Klyosov, he disregards the observed mutation rates in father-son pairs, as he considers those to be statistically insignificant due to small sample sizes. However, this is one of those cases where instead of the theory adjusting to meet the practical results, the practical results are dismissed because they do not agree with the theory.

There are disagreements about the phylogeny when going back to those "base" SNPs in the "Out of Africa" discussion. Those are also much, much older haplogroups than R-M269.

We have much better knowledge of the SNP phylogeny in the R1b family. My example shows you SNPs within the known phylogeny of the L21 family. There is no circular reasoning in that. I'm not scaling anything to mutation rates.

In the case of the Out of Africa arguments or mutation rates in general, you might want to start another thread for those topics.

I just set this up to talk about STR diversity and variance and its usefulness or lack of usefulness.

First off, I’m quite curious, how do you get variances that are so close to 1, or that are even greater than 1. Generally what I understand as variance is average mutations/marker, or per haplotype?

I'm calculating the sum of the variance for the STR markers, which is pretty standard statistics. However, since those results are not easy to comprehend in their absolute form I divide every sum of the variance for every calculation by a standard population's sum of the variance. I'm using P312 = 1.0 as the standard although my data is a little old as far as P312 "all". All I'm doing is rescaling the results to that base of 1.0. It does not change any of the percentage differences between calculations (haplogroups in the examples.)
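The rescaling described above can be sketched roughly as follows. The haplotype values are invented for illustration, the function names are hypothetical, and the use of population variance (`pvariance`) is an assumption; the post does not say whether sample or population variance is used.

```python
from statistics import pvariance


def summed_variance(haplotypes):
    """Sum, over markers, of the variance of repeat counts across haplotypes.

    `haplotypes` is a list of equal-length lists, one repeat count per marker.
    """
    per_marker_columns = zip(*haplotypes)  # transpose: one column per STR
    return sum(pvariance(column) for column in per_marker_columns)


def relative_variance(group, reference):
    """Rescale a group's summed variance so the reference group equals 1.0."""
    return summed_variance(group) / summed_variance(reference)


# Invented toy data: 4 haplotypes x 3 markers per group.
reference = [[13, 24, 14], [14, 25, 14], [12, 23, 15], [13, 24, 13]]
subclade_a = [[13, 24, 14], [13, 25, 14], [13, 24, 14], [13, 23, 14]]
subclade_b = [[13, 24, 14], [14, 25, 14], [12, 23, 14], [13, 24, 14]]

rel_a = relative_variance(subclade_a, reference)
rel_b = relative_variance(subclade_b, reference)
print(f"A={rel_a:.2f} B={rel_b:.2f} A/B={rel_a / rel_b:.2f}")
```

Dividing every group by the same reference total only changes the scale, so the ratio between any two groups is the same before and after rescaling.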

Could you show me what the mean mutation rate per marker is for the 36 or 16 best linear duration, and if possible the standard deviation? ...That is 12 markers...So as you can see, the law of large numbers can only do so much to harmonize the distribution. If one has a distribution where the numbers differ from one another by orders of magnitude, it is never going to get fixed no matter how many numbers you keep adding. The other solution would be adding a large number of fast STRs, which might drive the slow STRs to a minority position where they would act as outliers; hence, the more fast STRs one adds, the smaller the effect of the outliers. However, if one tries to measure TMRCAs that are outside the linear range of those STRs, then the margins of error are going to be huge, and the value would be greatly underestimated.

STRs do not all have to be the same speed to be aggregated for calculations. There is nothing to fix as long as the "expected" mutation rate for each STR does not change per group (haplogroup in my example) and as long as you are using the same STRs for each group compared.

I'm not using mutation rates at all. You would only need them if you want to calculate a TMRCA. I'm just calculating the relationships between groups.
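One way to see why the rates drop out of a group-to-group comparison is the short sketch below. It rests on an idealized assumption that is itself part of what is being debated in this thread: that each marker's expected variance grows linearly as rate times generations, with no saturation. The rates and generation counts are made up for illustration.

```python
from math import isclose


def expected_summed_variance(rates, generations):
    """Expected summed variance under an idealized linear model.

    Assumes each marker contributes rate * generations, which ignores
    saturation and any loss of linearity.
    """
    return sum(rate * generations for rate in rates)


# Hypothetical per-generation rates for two different marker sets.
slow_markers = [0.0005, 0.0008, 0.0009]
mixed_markers = [0.0005, 0.0008, 0.0009, 0.005, 0.02]

older_group, younger_group = 200, 100  # generations, invented

for label, rates in [("slow only", slow_markers), ("mixed", mixed_markers)]:
    ratio = (expected_summed_variance(rates, younger_group)
             / expected_summed_variance(rates, older_group))
    print(f"{label}: younger/older variance ratio = {ratio:.2f}")
```

Under that assumption the ratio depends only on the generation counts, whichever marker set is used, provided the same markers are used for both groups. The sketch says nothing about what happens once markers leave their linear range.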

I've tried to show you that changing the STRs used, when you use a good sized (25 or more) set of STRs and a good sized group of haplotypes, doesn't change the relationships of the STR diversity between different groups of haplotypes (that are related). To me that is the law of large numbers in action. You can come up with hypothetical problems, but with real data the concepts work.

Ultimately, STR variance really does have a relationship with age. It can be argued that it is not completely linear or that it is linear for a limited duration, but I've shown you that whether you use the researched "linear" markers (at least according to Heinila) or a combination of STRs, it makes little difference for haplogroups within the age of the R-M269 family.

Companies like FTDNA, a large number of academic scientists and hobbyist-scientists use STR variance, accepting their generally linear relationship with time. It works! Now you can disagree with mutation rates, but I'm not using any, so all you can say about STR variance is you don't trust Y DNA STRs. That's okay, but the scientific community, by and large, is using them.

Ken quoted an estimated figure of 1 SNP per generation on average but I was told this would work out as 1/2 because only 1/2 the Y chromosome is readable or something, got to be honest I got lost in the conversation at that point :)

He has calculated the rate of .78 SNPs per father-son transmission, so about 3/4. I don't know how much of the Y chromosome can now be scanned by testing. I would think it's mostly just a matter of cost.

We are going in circles; it seems to me like this is a never-ending discussion. OK, let’s reiterate some points:

1-You said that you are not using mutation rates, because you are not interested in calculating the TMRCA. Well, I’m not referring to you specifically, but to the bunch of other people that do use mutation rates and mixed (mutation-wise) STR samples to calculate TMRCA. After all, you mentioned not too long ago that R1b-M269 was between 4-8 kybp following estimates by Heinila, Klyosov, etc. Moreover, you accept that there is a relationship with age, so while you say that you do not want to get into the details of TMRCA calculation, you would still take the mixed STR sets as indicative of age.

2-I was providing hypothetical examples, just to show the public how the law of large numbers works. The other thing you seem to forget is that there is a direct (nonlinear) relationship between mutation rates and the number of mutations observed in any locus. So when using a large sample of mixed (mutation-wise) STRs, there is a very real possibility of a sample displaying a relatively higher variance compared to a different sample while having that variance mostly accumulated in fast mutating STR markers. Like I said, a higher variance in the slower, more stable loci is a better indicator than the overall variance. What I’m talking about is not just calculating the variance of a haplogroup using 37, 67, or 25 STRs. I’m talking about comparing the variance of Haplogroup Y in population X to the variance of the same haplogroup in population Z.

Well, that’s quite an oversimplification there. What makes you think that I don’t trust variance in loci? Have I ever specifically said that STR variance is useless? I said I disagree with the concept of accepting the linear relationship of certain STRs with time. I said I think loci should be carefully selected in terms of purpose of the test, mutation rate, etc. There is quite some difference between saying “this car ought to be fixed” and “cars don’t work at all”.

Thanks for the link, I tried ploughing through it and only had to contend with three interruptions (one about 1 1/2 hours long) whilst trying to fathom it, just as well we have Vince at hand for the interpretation :)

Would I be right in guessing that the mutation rate they came up with needs to be multiplied out depending on what length of DNA is being investigated? It seemed very small.

Presumably, in order to make more use of this information, we also need to know what % of the Y chromosome is being covered by 1000 Genomes. I know it's a lot bigger than WTY, but presumably it's still a fraction, but how big?

I was looking the other day at the spreadsheet detailing kits in the 1000 Genomes project with singleton mutations, which I found a bit odd. There was no mention of U152 or P312 (which presumably was to do with lack of data on the part of the compiler?) but the nos. being reported for L21 were much lower than those of the other haplogroups.

I think there is a general linear relationship between the variance of non-multicopy/non-null STRs and the number of generations (which infers time) to the initial time of expansion for a related group of people (that have a common ancestor.)

I'm interested in TMRCAs as well, but I think that is a more complex topic, and there is definitely a disagreement in the academic community and to some degree in the hobbyist community about whether to use evolutionary rates or germ-line rates. The hobbyist community, at least the scientists in it, seem heavily inclined towards germ-line rates, but I don't want to try to argue that as there is a general stalemate in those arguments.

The way I look at it, I'll just calculate the variance of one haplogroup relative to another, and then you tell me what mutation rates you want to use and we'll slide the whole scale (when multiplying to get years) whichever direction you want for the discussion.

Let me say that anything is possible in a particular situation; however, my experience backs up what folks like Ken Nordtvedt say and do. They use mixed speed markers to calculate variance and TMRCAs. The explanation from Ken is that there are greater accuracy benefits from including more STRs than there are from reducing the number of STR "experiments."

You are talking about comparing the same haplogroup in two different populations. This would be similar to a geographic comparison. I think this can be done for analysis, but there will obviously be greater risk, since the haplogroup in the first geography may be sourced from two or three different geographies while the haplogroup in the second geography may be from a single common ancestor. I don't think this invalidates STR variance or diversity; it just means that the application has additional challenges complicating the interpretation of the results.

I'm all for using STRs smartly too. For instance, per Ken Nordtvedt and Vince Vizachero, I throw out multi-copy STRs. I also throw out STRs that have null values. I am also saying that I've tried using just highly linear STRs as well as mixed speed STRs, and the results are about the same as long as you get the number of STRs up into a good high range. Mixed STR groups also seem to be a little more precise than when faster markers are thrown out. Ken says he has demonstrated this in simulations. What I have observed is neutral to slightly positive on what Ken says.

My M153 data is very limited. I also included everyone with 67 markers in the 1418 North-South cluster which encompasses M153. This cluster may be marked by Z209 and is quite old.

Below is U152 and its large subclades. I think it is still the oldest subclade of P312 but DF27 may some day challenge it.

Relative variance with the 49 mixed speed, non-multicopy, non-null STRs
U152________: Var=1.07 (N=806)
L2__________: Var=1.02 (N=287)
Z56_________: Var=0.97 (N=32)
Z36_________: Var=0.92 (N=34)

Z56 and Z36 did flip-flop on me. I wouldn't interpret this as significant. They are just both about the same age.

Generally, I observe there isn't much difference in the relationships if you de-select markers that Heinila calculates don't have as high a confidence of being linear for >7k years.

umm.... I still think U152, L2, Z36, Z56, L21, DF23, Z196 (and I guess now DF27 and Z209) all must have expanded fairly rapidly. Not everyone goes for this, but I think if we were doing a family surname project about 400-500 years after P312 started expanding, we'd include them all in the same cluster.

I think there is a general linear relationship between the variance of non-multicopy/non-null STRs and the number of generations (which infers time) to the initial time of expansion for a related group of people (that have a common ancestor.)

I think our main disagreement comes in the part I highlighted. You think there is a linear relationship between TMRCA and variance regardless of the STRs used, as long as they aren’t multicopy/null STRs. I don’t think so, and I already explained my reasons. So I propose we wait and see what comes up, hopefully we’ll get some good aDNA studies soon. I think the samples from SJAPL and Longar all dating to the 4500-5000 ybp in the fringe of the Basque Country would probably be tested for Y-DNA soon; I mean, they were already tested for lactose tolerance, so it seems Y-DNA will follow.

Please be cautious in reading my thoughts which is what you did on the sentence I highlighted (emboldened.)

I went up to my original statement above and underlined the word "general" which you omitted when paraphrasing me. I'm NOT asserting that every non-multicopy non-null STR has a strict linear relationship with number of generations, which is related to time. I think that, generally speaking, the STRs that FTDNA tests for in the first 111 markers (less the multicopy/null) have a general relationship with number of generations. In aggregate, statistical use of these markers can provide improved precision to the relationship with time.

Think about it... FTDNA, and science in general, probably have tested a broader range of STRs. The only reason they would select these STRs is that they have some value in measuring the "closeness" in relationship, which is a function of generations.

Some STRs are probably better than others for different timeframes, but the problem is we don't know which are better and which are worse. Only Heinila has really attempted any kind of thorough analysis that I can find. The best way to address this problem is with large numbers and statistics. This is why I value it when folks like Ken Nordtvedt say they run simulations and the benefits of including more STRs, rather than less, outweigh the negatives.

You can wait for more ancient DNA, but I'm not. Here's why: we do NOT have adequate data (long haplotypes and SNPs) across the board to do the proper cross-sectionally representative random sampling. We don't have this even with tens of thousands of haplotypes of modern people. How long do you think it'll be before we have that much ancient DNA? I'm NOT saying ancient DNA is useless. It is very valuable, but it's just another piece of data.

Do you think FTDNA should drop their Tip TMRCA calculator? Should the academics, i.e. Busby, Balaresque, Myres, etc. quit using STR diversity to estimate time?

Like I said before, we can go on forever. A large number of markers is good for determining close relationships, i.e. if one gets a match on a 12 STR set, it could mean common ancestry anywhere from a recent ancestor to a very distant one. Now if one gets a match on the 37, 67 or 111 STR set, that is another story: if two people's haplotypes are only 2 mutations apart on the 111 STR set, one could definitely estimate the time of common ancestry for those two people. That works on an individual level; what I’m talking about is population genetics, where that doesn’t work. The more I think about it, the more I realize how important things like the choice of microsatellites in age estimates are being overlooked in population studies.

The best way to address that problem is to actually do an experiment using empirically measured mutation rates and see if microsatellite choice has an effect on age estimates. However, to see any considerable effects one must look at large time frames. I doubt the loss of linearity would have any effect in folks that share common ancestry in the last 1000 years; but when we are working to determine the age of haplogroups that could presumably be older than 5000 ybp, then it is definitely a big issue. This is something that large numbers cannot fix, as what we see is that certain STRs are only good for measuring certain time spans. So trying to measure TMRCAs that are older than 5000 ybp with STRs that lose their linearity in less than 5000 years would be like measuring a mile with a 6 inch ruler. Yes, I applaud Ken Nordtvedt for taking the time to run the simulations. Now, is there any practical example out there where the simulations could be tested? I mean, Dr. Nordtvedt might have run some simulations, and perhaps he found that using 37 STRs instead of 12 STRs was a better predictor of TMRCA for, say, a set of people that descend from a guy who lived in 1700. That doesn’t mean that one could extrapolate those results and use them over a time span of 5000+ years.

You can wait for more ancient DNA, but I'm not going to. Here's why: we do NOT have adequate data (long haplotypes and SNPs) across the board to do proper cross-sectionally representative random sampling. We don't have this even with tens of thousands of haplotypes of modern people. How long do you think it'll be before we have that much ancient DNA? I'm NOT saying ancient DNA is useless. It is very valuable, but it's just another piece of data.

Do you think FTDNA should drop their TiP TMRCA calculator? Should the academics, i.e. Busby, Balaresque, Myres, etc., quit using STR diversity to estimate time?

I agree that improvements are needed.

The only thing I would say is that we should start exploring the effects of microsatellite choice on age estimates, and that we shouldn’t neglect the effects of loss of linearity, saturation, and non-constant mutation rates. I think Busby et al. (2012) already noticed that, so I don’t think his team should quit using STR diversity. On the contrary, they should work on it even more, to explore the effects using large STR sets.

.... Anyway, the problem will be resolved by the aDNA, and everything turns on the age of these haplogroups. I remind you all that many markers have the same value in hg. Q, and the separation happened at least 20,000 years ago.

Remember:
1) mutations around the modal
2) convergence to the modal as time passes
3) sometimes a value goes for the tangent

I've never understood the relevance of your three points other than that they are your objections. The key is whether they are well-founded objections or baseless.

Do you have any statistical analysis that demonstrates the value of your objections? or are these just concerns that you feel?

I'll begin my comments with a simple presentation of the observed properties of Y STR's and datasets. Hopefully these comments will be reviewed and commented on to the point where we can all agree on what the fundamental properties of the process we are trying to model are.

First, a definition from Schaum's Outline on random processes: A random process is the mathematical model of an empirical process whose development is governed by probability laws.

Observation #1: A probability space for the STR mutational process can be created by taking a set of, say 69, Y STR's, summing the mutational rates to determine the probability of a mutation and then defining the P (no mutations) = 1 - Sum. Given that we also observe a dynamic range exceeding 100 for the mutational rates, we conclude that the Y STR is not a random process with equally likely events.

Observation #2: In data sets Y STR single step mutations are most probable. Multi-step occur some 5% to 10% of the time with step changes up to 4 or more.

Observation #3: For haplogroup R1b it is observed that Y STR frequency distributions over the first 69 FtDNA dys loci (as presented at www.freepages.genealogy.rootsweb.ancestry.com) include 95% of the entry values in only 3 values, modal and +/- 1 from modal, for some 59 of the 69 entries. This suggests that if a mutation to a non-modal value has occurred, then the most probable next mutation is back to the modal. This also suggests that as time accumulates many hidden mutations may have occurred.

Observation #4: Most data sets, especially family sets, are highly correlated. Many entries have apparent mutations that are all derived from a single ancestor. (See Kerchner's family analysis and his definition of "unique mutational events".) This effect will cause an overestimation of diversity.

Observation #5: Within a data set all descending from a common ancestor, such as the Ian Cam of Clan Gregor, there is a wide range in the number of mutations observed. In the case of Clan Gregor it runs from 0 to 7 within the data set. (Note that the apparent direct descendant of the founder appears to have not had a mutation in his family line for almost 700 years.) There appears to be some correlation between when the line separated and age, but as shown above it's not all that direct.

A parting comment re: technique. I use the equation developed by Stumpf and Goldstein, 2 March 2001, SCIENCE, "Genealogical and Evolutionary Inference with the Human Y Chromosome", p. 1740. I only count mutations; I do not use ASD/variance. In their analysis, they point out that you calculate ASD for one locus and then average over many to increase the accuracy of the observation. This implies that using Y STR's with the same or similar mutation rate will improve precision and increase accuracy.
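The per-locus averaging described here can be sketched as follows. This is only an illustration of "calculate ASD for one locus and then average over many", not the poster's mutation-counting method and not a reproduction of the Stumpf and Goldstein equation. It assumes a single-step model in which the expected squared distance from the founder value grows as rate × generations, and it uses the modal as a stand-in for the founder haplotype. The haplotypes, the rates, and both function names are invented.

```python
def asd_per_locus(haplotypes, ancestral):
    """Average squared distance from the assumed ancestral value, one number per locus."""
    n = len(haplotypes)
    return [
        sum((h[i] - ancestral[i]) ** 2 for h in haplotypes) / n
        for i in range(len(ancestral))
    ]


def generations_from_asd(haplotypes, ancestral, rates):
    """Average the per-locus ASD, then divide by the average per-locus rate."""
    per_locus = asd_per_locus(haplotypes, ancestral)
    mean_asd = sum(per_locus) / len(per_locus)
    mean_rate = sum(rates) / len(rates)
    return mean_asd / mean_rate


# Invented toy data: four haplotypes typed at three loci.
haplotypes = [(13, 24, 14), (13, 25, 14), (13, 24, 15), (13, 23, 14)]
ancestral = (13, 24, 14)        # modal used as a proxy for the founder (assumption)
rates = (0.002, 0.002, 0.002)   # placeholder per-generation rates, not measured values

print(asd_per_locus(haplotypes, ancestral))
print(generations_from_asd(haplotypes, ancestral, rates))
```

With equal rates every locus carries the same weight in the average. With very unequal rates the fastest loci dominate the sum, which is one way to read the remark about preferring markers with similar mutation rates.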

... Observation #1: A probability space for the STR mutational process can be created by taking a set of, say 69, Y STR's, summing the mutational rates to determine the probability of a mutation and then defining the P (no mutations) = 1 - Sum. Given that we also observe a dynamic range exceeding 100 for the mutational rates, we conclude that the Y STR is not a random process with equally likely events....

You are mixing your observations with your conclusions.

Are you saying that it is required for STR mutations to be perfectly random for them to be useful? What in life is perfect? I can think of only one.

Observation #2: In data sets Y STR single step mutations are most probable. Multi-step occur some 5% to 10% of the time with step changes up to 4 or more....

That seems very plausible, but I would think different STRs would have different properties. I don't know. Do you have a study that has determined the distribution of single step and multi-step mutations by STR or on average?

Observation #3: For haplogroup R1b it is observed that Y STR frequency distributions over the first 69 FtDNA dys loci (as presented at www.freepages.genealogy.rootsweb.ancestry.com) include 95% of the entry values in only 3 values, modal and +/- 1 from modal, for some 59 of the 69 entries.

Why choose R1b? Why not R1b-L21 or R1b-U106 or R1b-L226. Each has different modals, sometimes dramatically. For that matter, why not use R1?

Are you saying that Y STRs have different expected properties depending on the haplogroup? Y SNPs don't generally or necessarily have any biological connection to Y STRs. I've asked this before (on this thread even) but I have no reason to think that the expected property of one Y STR is different by haplogroup. We are all Homo sapiens sapiens and are much more alike than different.

This suggests that if a mutation to a non-modal value has occurred, then the most probable next mutation is back to the modal.

Why? We use modal haplotypes as proxies for ancestral haplotypes, but the mode is really just a statistical concept. Naturally, we might expect mutations within STRs that are primarily single-step focused to revolve around the mode. That's essentially circular reasoning, or perhaps I should say self-defining.

... Observation #4: Most data sets, especially family sets, are highly correlated. Many entries have apparent mutations that are all derived from a single ancestor. (See Kerchner's family analysis and his definition of "unique mutational events".) This effect will cause an overestimation of diversity.

I agree that our DNA project data is not representative. I think most academic studies try to guard against this but I don't know if they are doing a good job.

I think you mean this will cause an underestimation of diversity, right? Fortunately, interclade calculations effectively eliminate this concern.

Observation #5: Within a data set all descending from a common ancestor, such as the Ian Cam of Clan Gregor, there is a wide range in the number of mutations observed. In the case of Clan Gregor it runs from 0 to 7 within the data set. (Note that the apparent direct descendant of the founder appears to have not had a mutation in his family line for almost 700 years.) There appears to be some correlation between when the line separated and age, but as shown above it's not all that direct.

You are assuming that everyone in the data set you are citing has a common ancestor. Maybe so, but how do we know? How long ago did the Most Recent Common Ancestor live? It sounds like at least 700 years. That's a long time for genealogical records to hold up and for zero NPEs to occur. I don't know the situation for this group, but since you bring it up as a point of evidence, how do we know if this group has the common ancestor it is thought to have? This is not defined by a series of SNPs, is it?

... Observation #1: A probability space for the STR mutational process can be created by taking a set of, say 69, Y STR's, summing the mutational rates to determine the probability of a mutation and then defining the P (no mutations) = 1 - Sum. Given that we also observe a dynamic range exceeding 100 for the mutational rates, we conclude that the Y STR is not a random process with equally likely events....

You are mixing your observations with your conclusions.

Are you saying that it is required for STR mutations to be perfectly random for them to be useful? What in life is perfect? I can think of only one.

These observations were, hopefully, meant to be rhetorical. I was hoping that we could discuss whether these observations are generally true and then try to determine the best method for modelling the mutational process. I should have also pointed out that the mutational process appears to be a linear, independent process, which suggests that probabilities of events can be multiplied.
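A small numeric sketch of the bookkeeping in Observation #1, under the independence assumption just stated. The three per-generation rates are placeholders chosen only so that their ratio exceeds 100, and `no_mutation_probability` is a name made up for this example. If loci mutate independently, the chance that no locus mutates in a generation is the product of the per-locus "no mutation" probabilities; "1 - Sum" is the first-order approximation to that product.

```python
import math


def no_mutation_probability(rates):
    """Return (first_order, exact) probability that no locus mutates in one generation.

    first_order is the '1 - Sum' figure from Observation #1.
    exact multiplies the per-locus probabilities, assuming independence.
    """
    first_order = 1.0 - sum(rates)
    exact = math.prod(1.0 - r for r in rates)
    return first_order, exact


# Placeholder rates for a slow, a medium and a fast locus (not measured values).
rates = [0.0002, 0.002, 0.03]
first_order, exact = no_mutation_probability(rates)
dynamic_range = max(rates) / min(rates)

print("P(no mutation), 1 - Sum:", first_order)
print("P(no mutation), product:", exact)
print("dynamic range of rates:", dynamic_range)
```

For rates this small the two figures are close, and the product is never smaller than the "1 - Sum" figure. The gap grows as more loci, or faster ones, are added to the set.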

Observation #2: In data sets Y STR single step mutations are most probable. Multi-step occur some 5% to 10% of the time with step changes up to 4 or more....

That seems very plausible, but I would think different STRs would have different properties. I don't know. Do you have a study that has determined the distribution of single step and multi-step mutations by STR or on average?

The only work I am aware of is a work in progress(?) by Charles Kerchner. The only hiccup there was that he wasn't keeping any track of multiple-step mutations, just single-step ones. He was/is trying to determine average mutational rates for different sets of FtDNA dys loci. If you look at the rootsweb Y STR frequency tables, it is evident that some of the outlier mutations were multi-step. I've also heard John Chandler espouse this view, but I don't know if he's published any data.

Observation #3: For haplogroup R1b it is observed that Y STR frequency distributions over the first 69 FtDNA dys loci (as presented at www.freepages.genealogy.rootsweb.ancestry.com) include 95% of the entry values in only 3 values, modal and +/- 1 from modal, for some 59 of the 69 entries.

Why choose R1b? Why not R1b-L21 or R1b-U106 or R1b-L226. Each has different modals, sometimes dramatically. For that matter, why not use R1?

Are you saying that Y STRs have different expected properties depending on the haplogroup? Y SNPs don't generally or necessarily have any biological connection to Y STRs. I've asked this before (on this thread even) but I have no reason to think that the expected property of one Y STR is different by haplogroup. We are all Homo sapiens sapiens and are much more alike than different.

It appears that the mutation rate depends on the modal value for some STR's. In the table I referenced, which includes data for 7 Hgs (not only R1b), DYS 388 is quite different for I1 and J2. They have a higher modal value, 14 and 15 respectively.

This suggests that if a mutation to a non-modal value has occurred, then the most probable next mutation is back to the modal.

Why? We use modal haplotypes as proxies for ancestral haplotypes, but the mode is really just a statistical concept. Naturally, we might expect mutations within STRs that are primarily single-step focused to revolve around the mode. That's essentially circular reasoning, or perhaps I should say self-defining.

I do not have a lot of confidence in the modal concept personally. In the case of disasters it can lead to erroneous comments about modals for a Hg. In the data set referenced, the data shows that the modal +/- 1 are the most frequent values observed. If 95% or more of the entries have these values and if multi-steps occur at a rate of 5% or greater, then my observation stands.

This also suggests that as time accumulates many hidden mutations may have occurred.

Why? I agree there are back-mutations, but so what? John Chandler has told us that mutation rate calculations and variance account for back mutations.

That comment may be true for a DYS locus like CDYa,b, where over a few thousand years many mutations will have occurred. Most DYS loci have very few mutations relative to CDYa,b, and a mutation is a low-probability event. But the point is that there is no range of values for most dys loci, unlike CDYa,b. If a mutation from the modal has occurred, then the most probable next event is a mutation back to the modal. Again, this can't be modelled by a random walk process, which I believe assumes that the process is unbounded?

... Observation #4: Most data sets, especially family sets, are highly correlated. Many entries have apparent mutations that are all derived from a single ancestor. (See Kerchner's family analysis and his definition of "unique mutational events".) This effect will cause an overestimation of diversity.

I agree that our DNA project data is not representative. I think most academic studies try to guard against this but I don't know if they are doing a good job.

I think you mean this will cause an underestimation of diversity, right? Fortunately, interclade calculations effectively eliminate this concern.

In Kerchner's case, there are 14 apparent mutations and only 8 unique mutational events. If you consider all 14 as real mutations you overestimate diversity, as I said, i.e. you will estimate a TMRCA older than is real.
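The difference between apparent and unique mutations can be sketched with a toy family. The men, loci and step values below are invented and are not Kerchner's data. The sketch also makes a strong simplifying assumption: within one documented family, identical changes at the same locus are treated as a single inherited event, which would undercount genuine parallel mutations.

```python
def count_events(observations):
    """Return (apparent, unique) mutation counts for one documented family.

    apparent: every off-founder value seen in a tested man.
    unique:   distinct (locus, step) pairs, assuming identical changes
              were inherited from one shared ancestor (assumption).
    """
    apparent = len(observations)
    unique = len({(locus, step) for _man, locus, step in observations})
    return apparent, unique


def inflation(apparent, unique):
    """How many times over the mutations are counted if inheritance is ignored."""
    return apparent / unique


# Invented example: (tested man, locus, step change relative to the founder).
observations = [
    ("A", "DYS390", +1),
    ("B", "DYS390", +1),  # same change as A, assumed inherited from a shared ancestor
    ("C", "DYS390", +1),
    ("C", "DYS439", -1),
    ("D", "DYS19", +1),
]

apparent, unique = count_events(observations)
print(apparent, unique, inflation(apparent, unique))

# The counts quoted in the post: 14 apparent mutations, 8 unique events.
print(inflation(14, 8))
```

With the counts quoted above, treating all 14 as independent events counts mutations 1.75 times over. How that carries through to a TMRCA estimate depends on the formula used.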

I believe interclade calculations, like most of the Variance calculations, only consider coalescent time, the time to the build-up of the population. They don't penetrate the "disaster" and give the real TMRCA. So, the value of the interclade estimate will be dependent on the population history under examination. I certainly stand to be corrected if this observation is incorrect.

I have tried to reply to your questions as honestly as I can. I presented my comments in this format, because I wanted to explore the assumptions built into the calculations presented for Variance/diversity.

It is critical that the assumptions represent what we observe in the data generated to date, whether it has been published as a study or is just a dataset.

The ASD/Variance model was developed by Goldstein et al. and modified by Nordtvedt and others. At the time of development, not much data existed to verify the assumptions.

We will never have enough data, but sufficient data has been collected, in both germ-line and family data sets, to make observations about the "real" properties of the Y STR mutational process.

I do not believe your subject question can ever be rationally answered until these properties are understood and agreed upon and a model created using these properties.