World Families Forums - The age of R-M343 calculated by Dienekes

ARCHIVAL COPY


I think you forgot L132.1 beneath L1, or L127.2 beneath L128. That would make Z156 about 2500 ybp counting from the bottom. (Interestingly, using Klyosov's STR formula I calculated the age of Z156 at about 2750 ybp.) But I know the proponents of SNP counting want us to count from the top of the tree and assume no SNPs are missing between Z381 and Z156, and that Z156 should be the same age as Z301. I don't understand that reasoning.

There are only 100 Z156 tests out of the 7000 U106 men tested. When we get to 1000 tests we will have a better understanding of this group and a lot more SNPs. Do you think that L132.1 and L127.2 are private SNPs? L48 are the most tested group. Do you know which country has the highest diversity?

ISOGG lists those two SNPs as "under investigation". I doubt there are enough Continental results to calculate Z156 diversity.

... It makes sense to use multiple SNP counting and STR methods. ... Each STR, since it has a range of values, not just ancestral or derived, is its own clock. The great value of a large number of STRs is that they can be averaged. ... The point is that the STRs provide a better composite clock.

... Is Table 22 of slow STR markers (by A. Klyosov) useful for calculating the TMRCA of humans and chimpanzees? ... The experiment showed there is no connection between the number of actual mutations and the number of observed mutations. This table is also unsuitable for calculating the times of Adam. Repeated and back mutations are basically uncountable. In a 67-marker haplotype, multiple mutations (in the same marker) may be formed at the beginning. And they are uncountable!!! Therefore I doubt their usefulness, especially for times over a thousand or several thousand years.

Stan

Stan and Maliclavelli, my understanding from Bob Chandler and Ken Nordtvedt is that back mutations are accounted for in the mutation rates (EDIT). I guess Maliclavelli is suggesting they don't understand this. I guess I don't either, because to me it is logical to see how non-observed mutations force a mutation rate calibration adjustment, but they don't stop the generally observable expansion of diversity altogether.

Ultimately, there is greater STR diversity, on average, with a greater number of generations, which correlates, although not perfectly, with time.

There probably are ranges of time that different characteristics are useful for. This would be similar to the law of diminishing returns in economics. An example would be that STRs with high numbers of repeats may appear to saturate.

Any method is probably only relevant for a given time frame or situation. We need to understand what time frame. There may be some situations where one method is better than another. What works for a chimpanzee branch of the ape family may not work for a branch of Levantine farmers or the branch my 2nd cousin and I are on.

EDIT: This has been corrected. Back mutations are accounted for, but not in mutation rates. See post by VV above.

It looks to me as if you are counting SNPs in the ISOGG tree. If that's what you are doing, the method is fatally flawed.

On the Hg I forum Ken Nordtvedt responded to an inquiry I made on this topic.

Quote from: Ken Nordtvedt

Using snp counts as clock across the board for different parts of the tree requires absence of bias in the construction of the source of the snps to be counted --- even before one brings in some calibration argument

and

Quote from: Ken Nordtvedt

It is important to realize that the snp counts in each tree branch segment are the statistically independent observables, and the node times the items to be estimated from those branch segment snp counts.

Quote:

Stan and Maliclavelli, my understanding from Bob Chandler and Ken Nordtvedt is that back mutations are accounted for in the mutation rates. I guess Maliclavelli is suggesting they don't understand this. I guess I don't either because to me it is logical to see how non-observed mutations force a mutation rate calibration adjustment, but they don't stop the generally observable expansion of diversity altogether.

Mike, what is the usual timeframe over which mutation rates are calibrated? For the most part, germ-line mutation rates are computed from father-son pairs, which is essentially a single generation. On the other hand, people like Klyosov calibrate mutation rates using family trees with a known paper trail whose members also have their Y-DNA STRs available. Now, setting aside the problems with Klyosov's calibration method, do you think the behavior of back mutations, and of STRs overall, is completely captured in a time frame of one generation (observed father-son pairs), or even 1,300-1,500 years, such that it can be extrapolated to time frames 3-5 times longer (i.e. 5,000+ ybp)?

Quote:

Mike, what is the usual timeframe over which mutation rates are calibrated? For the most part, germ-line mutation rates are computed from father-son pairs, which is essentially a single generation. On the other hand, people like Klyosov calibrate mutation rates using family trees with a known paper trail whose members also have their Y-DNA STRs tested. Now, setting aside the problems with Klyosov's calibration method,

I've never calculated mutation rates. It is obvious they are highly controversial, although many of the leading hobbyists use similar rates. I noticed that Ken Nordtvedt is using some of the rates calculated by Marko Heinila, so there is some good collaboration. I know Ken also uses rates that Bob Chandler derived. They may be using rates from Leo Little as well (I'm not sure).

I clearly do not defend Klyosov on anything he does as I don't understand some of the statements he makes.

I would never say STR variance is perfectly aligned with time in a linear fashion. However, most markers generally seem to accumulate added diversity with time. Do you disagree? This is not to say that a timeframe of relevancy, or what some call linear duration, doesn't apply.

Quote:

do you think the behavior of back mutations, and of STRs overall, is completely captured in a time frame of one generation (observed father-son pairs), or even 1,300-1,500 years, such that it can be extrapolated to time frames 3-5 times longer (i.e. 5,000+ ybp)?

I understand what Bob Chandler and Ken Nordtvedt are saying when they say back-mutations are accounted for in the mutation rates, but I would never use a qualifier such as they are "completely" 100% accounted for perfectly.

The questions you are asking about the timeframes are the same ones I am asking. I don't know what methods are best for what timeframes. That is what I was trying to say in this post.

Quote:

I would never say STR variance is perfectly aligned with time in a linear fashion. However, most markers generally seem to accumulate added diversity with time. Do you disagree? This is not to say that a timeframe of relevancy, or what some call linear duration, doesn't apply.

Well, I don't disagree with the statement that STR variance accumulates with time; I just don't think we are currently measuring variance correctly. Effects such as back mutations, and changes in the modal value when the TMRCA is longer than the mutational time, are being ignored, leading to an erroneous estimate of diversity. Also, I know this question has been asked a bunch of times, but I would say that STR mutation rates do indeed depend on haplogroups, though not in the way most people think. For example, say we measured the mutation rate of DYS492 in a sample of mostly J2 men who have a modal value of, say, 17; this mutation rate is going to be faster than the actual mutation rate among R1a men who have a modal value of 13 at that same locus. So in a sense the rates do differ from haplogroup to haplogroup, but only when the modal values of the haplogroups differ for a given locus.

Quote:

I understand what Bob Chandler and Ken Nordtvedt are saying when they say back-mutations are accounted for in the mutation rates, but I would never use a qualifier such as they are "completely" 100% accounted for perfectly.

The questions you are asking about the timeframes are the same ones I am asking. I don't know what methods are best for what timeframes. That is what I was trying to say in this post.

Think about this: what are the chances of actually seeing a back mutation in a time frame of 1,500 years, and which STRs have a higher likelihood of a back mutation in such a timeframe? Now what are the chances of a back mutation occurring in a timeframe five times that? Do you think slow STRs (i.e. mu ~= 10^-4) would experience a back mutation in timeframes of 1,500 years, and what would that mean for their calibrated mutation rates?
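Rough odds can be attached to these questions if we assume mutations arrive as a Poisson process at rate mu per generation and a 25-year generation length (both are simplifying assumptions for illustration):

```python
import math

def p_at_least_one(mu, generations):
    """P(>= 1 mutation) over `generations` for a Poisson process
    with rate `mu` mutations per generation."""
    return 1 - math.exp(-mu * generations)

GEN_YEARS = 25  # assumed generation length
for years in (1500, 7500):
    gens = years // GEN_YEARS
    for mu in (1e-4, 2e-3):  # a slow STR and a more typical STR
        p = p_at_least_one(mu, gens)
        print(f"{years} yrs ({gens} gen), mu={mu}: P(>=1 mutation) = {p:.4f}")
```

For the slow marker, even a single mutation in 1,500 years is well under a 1% event; a back mutation additionally requires a prior forward mutation on the same line, so it is rarer still.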

Quote:

I would never say STR variance is perfectly aligned with time in a linear fashion. However, most markers generally seem to accumulate added diversity with time. Do you disagree? This is not to say that a timeframe of relevancy, or what some call linear duration, doesn't apply.

Quote:

Well, I don't disagree with the statement that STR variance accumulates with time; I just don't think we are currently measuring variance correctly. Effects such as back mutations, and changes in the modal value when the TMRCA is longer than the mutational time, are being ignored, leading to an erroneous estimate of diversity. Also, I know this question has been asked a bunch of times, but I would say that STR mutation rates do indeed depend on haplogroups, though not in the way most people think. For example, say we measured the mutation rate of DYS492 in a sample of mostly J2 men who have a modal value of, say, 17; this mutation rate is going to be faster than the actual mutation rate among R1a men who have a modal value of 13 at that same locus. So in a sense the rates do differ from haplogroup to haplogroup, but only when the modal values of the haplogroups differ for a given locus.

Are you saying that mutation rates are different at higher STR repeat counts? We know there is a study that shows the observable mutation rate on certain high-repeat STRs slows down at the high numbers. Is that what you are referring to?

Are you referring to the fact that different lineages would have different historical mutation rates if we were to calculate each lineage's rate independently?

Quote:

I understand what Bob Chandler and Ken Nordtvedt are saying when they say back-mutations are accounted for in the mutation rates, but I would never use a qualifier such as they are "completely" 100% accounted for perfectly.

The questions you are asking about the timeframes are the same ones I am asking. I don't know what methods are best for what timeframes. That is what I was trying to say in this post.

Quote:

Think about this: what are the chances of actually seeing a back mutation in a time frame of 1,500 years, and which STRs have a higher likelihood of a back mutation in such a timeframe? Now what are the chances of a back mutation occurring in a timeframe five times that? Do you think slow STRs (i.e. mu ~= 10^-4) would experience a back mutation in timeframes of 1,500 years, and what would that mean for their calibrated mutation rates?

What are you saying? For what timeframes do you think we should use 1) SNP counting, 2) slow STRs only, 3) a mix of slow, medium, and fast STRs? And why?

This may be why we want to segregate data about other species versus Homo sapiens sapiens. Their physical properties may be quite different, to say the least. In other words, findings may not necessarily generalize between species.

Quote:

Are you saying that mutation rates are different at higher STR repeat counts? We know there is a study that shows the observable mutation rate on certain high-repeat STRs slows down at the high numbers. Is that what you are referring to?

Nope, I'm saying that for any given STR the mutation rate is a function of the repeat number: higher repeat values have higher mutation rates. Marko H noticed this too when working on his data. Essentially the "saturation" effect occurs because when an STR reaches a certain number of repeats it becomes too unstable and basically mutates back and forth between two values. For example, dinucleotides can have a mutation rate ranging from ~10^-5.5 mutations/generation when the repeat number is 5 to ~10^-4 when the repeat number is 10. So a dinucleotide STR in a population where the modal is 10 is ~31.6 times more likely to mutate in any given generation than the same dinucleotide STR in a population where the modal is 5. So yes, essentially mutation rates are a function of repeat number. If you think about it, the more repeats a DNA segment has, the more prone it is to slippage, and hence the more likely it is to add or lose a repeat unit, thus making it less stable.
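This repeat-number dependence can be sketched with a log-linear interpolation between the two anchor points above (only the two endpoints come from the post; the log-linear form between them is an illustrative assumption):

```python
import math

def dinucleotide_rate(repeats, r1=5, mu1=10**-5.5, r2=10, mu2=10**-4.0):
    """Log-linear interpolation of mutation rate vs repeat count.
    Anchor points (r1, mu1) and (r2, mu2) are the figures quoted above."""
    slope = (math.log10(mu2) - math.log10(mu1)) / (r2 - r1)
    return 10 ** (math.log10(mu1) + slope * (repeats - r1))

ratio = dinucleotide_rate(10) / dinucleotide_rate(5)
print(f"rate at 10 repeats / rate at 5 repeats = {ratio:.2f}")  # -> 31.62
```

The ratio is just 10^(-4) / 10^(-5.5) = 10^1.5, about 31.6, which matches the figure in the post up to rounding.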

Quote:

Are you referring to the fact that different lineages would have different historical mutation rates if we were to calculate each lineage's rate independently?

Basically, if the modal values are different for two populations, the mean mutation rate obtained from one might not be accurate when applied to the other. (There is yet another issue with the assumption that modal == ancestral value, but that's a different story.)

Quote:

What are you saying? For what timeframes do you think we should use 1) SNP counting, 2) slow STRs only, 3) a mix of slow, medium, and fast STRs? And why?

That's not an easy question to answer, and honestly, I couldn't answer it without first running multiple computer simulations (which I plan to do in the future). As for SNPs, I don't know, but I would say that SNPs are good as long as the sections of DNA analyzed are unbiased (that is, they contain mostly neutral genes and little linkage disequilibrium; the LD caveat applies to autosomal SNPs, since the NRY doesn't have LD). There is also the effect of an expanding population (i.e. the expected number of offspring in a coalescence model is >1).

I would think that Dienekes would understand this. Is the data set he is using unbiased?

That's a question that can only be answered by investigating HOW the 1000G project chose the regions of the Y that they did. I haven't done that in a while, and I haven't kept up to date on new releases very well so I can't say for sure.

That said I'm sure there is some ascertainment bias at play, but how much?

Anyway, the impact of the ascertainment bias diminishes greatly as one moves up the tree. Within R-M269, the bias - if there is any - would have a strong impact on TMRCA estimates. By the time you reach node estimates for R-P297 and upstream, the impact moderates quickly.

Quote:

I understand what Bob Chandler and Ken Nordtvedt are saying when they say back-mutations are accounted for in the mutation rates, but I would never use a qualifier such as they are "completely" 100% accounted for perfectly.

There seems to be some confounding of two different problems in this thread.

One is back mutations. In other words, in a particular lineage a STR mutates from 12 to 13 and then back to 12. If you are using a variance-based approach, such mutations ARE accounted for. Not in the mutation rate, but in the method of calculating genetic distance. I can explain HOW later if someone wants.

A completely different problem is that of STR saturation. When people talk about linearity, this is what they are referring to. STRs tend to have definite ranges of alleles, and no matter how much time passes the value for a particular STR is unlikely to exceed this range.

Imagine a marker that has a starting value of 12, which happens to be (though we have no prior knowledge of this) its midpoint value. This marker has a 4% total mutation rate, but this really means there are two mutation rates, p(up) and p(down). For this marker, imagine a 2% probability of mutating up and a 2% probability of mutating down. Now say it happens to mutate up, to 13. The total mutation rate might still be 4%, but p(up) is now only 1% and p(down) is 3%. It might hit a value (e.g. 15) at which p(up) is 0% and p(down) is 4%. After a couple of generations, this marker has reached its maximum variance value and is considered "saturated". Its relationship between mutations and time is not linear.

Now most commonly used STRs seem to be essentially linear over very short periods of time. But all markers can become non-linear over a long enough period of time.

A full suite of STRs can provide relatively linear TMRCA estimates for a haplogroup that is 5,000 years old. But one that is 50,000 years old cannot be accurately estimated with fast-moving STRs.

In short, back mutations themselves are not a concern for variance-based TMRCA estimates. Non-linearity is a concern, and the concern is best addressed by marker selection.
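Both points can be seen in a small simulation: an unbounded symmetric stepwise walk keeps expected variance growing linearly (back mutations are absorbed by the variance measure), while a bounded marker's variance plateaus. This is a sketch only: the 4% rate, the 9-15 allele range, and the linear p(up)/p(down) model are illustrative assumptions, not calibrated values.

```python
import random

def simulate_variance(total_rate, n_lineages, generations, checkpoints,
                      lo=None, hi=None, start=12, seed=1):
    """Across-lineage variance of one STR through time.
    If lo/hi are given, p(up) shrinks linearly toward the ceiling and
    p(down) toward the floor (the toy model described above); otherwise
    the walk is unbounded and symmetric."""
    random.seed(seed)
    alleles = [start] * n_lineages
    out = {}
    for g in range(1, generations + 1):
        for i, a in enumerate(alleles):
            if random.random() < total_rate:
                p_up = 0.5 if lo is None else (hi - a) / (hi - lo)
                alleles[i] = a + 1 if random.random() < p_up else a - 1
        if g in checkpoints:
            m = sum(alleles) / n_lineages
            out[g] = sum((a - m) ** 2 for a in alleles) / n_lineages
    return out

cps = {50, 200, 800, 2000}
free = simulate_variance(0.04, 1000, 2000, cps)                 # unbounded
boxed = simulate_variance(0.04, 1000, 2000, cps, lo=9, hi=15)   # bounded
for g in sorted(cps):
    print(f"gen {g:>4}: unbounded variance {free[g]:6.2f}, "
          f"bounded variance {boxed[g]:.2f}")
```

The unbounded marker's variance keeps tracking roughly 0.04 per generation despite plenty of back-and-forth mutations along the way, while the bounded marker climbs at first and then levels off near a small stationary value: saturation of variance.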

If a marker has a 4% total mutation rate, all this means is that either it has a 4% mutation rate up or down, or that adding the two directional rates produces a net 4% rate in one direction (say backwards is 2% and forward is 6%, which would produce a mutational bias of 4% forward). However, simple addition doesn't quite work, because the forward and backward mutations are, in a sense, independent events. What happens when a marker mutates from 12 to 13 is that the mutability of the marker increases, so the marker is more likely to mutate from 13 to 14, or from 13 back to 12, than it was before the mutation from 12 to 13 occurred.

Markers only saturate when the mutation rate as a function of repeat value doesn't change much and the rate has reached a value high enough that the marker becomes unstable. That is, the region of DNA becomes prone to slippage or deletions far too often, which causes the STR to toggle between two values. The observable range of allele values doesn't completely describe the saturation effect: it is possible, with time, for a marker that saturates at say 24-25 repeats to actually produce a value of 27. Such events are extremely rare, but they are observable, and they do happen every now and then.

As for a time frame of 5,000 years, which is roughly 200 generations (assuming 25 years per generation), any marker with a mutation rate higher than 0.005 mutations/generation would in fact yield an erroneous modal value, and thus undermine the real TMRCA of the population being analyzed.

Quote:

As for a time frame of 5,000 years, which is roughly 200 generations (assuming 25 years per generation), any marker with a mutation rate higher than 0.005 mutations/generation would in fact yield an erroneous modal value, and thus undermine the real TMRCA of the population being analyzed.

Thanks.

I count 54 of 67 STRs that, according to Chandler/Little, would qualify as useful for 5,000 years. This is using your 0.005 mut/gen guideline. 54 is still a pretty decent number of available STRs. The 13 "too fast" markers include DYS464 (4 of them) and CDY (2 of them), so we've been whittling down to this anyway.

It appears the 36 markers that I've worked with in the past from Heinila's linear duration analysis are even slightly more conservative than your guideline. Your guideline actually fits somewhat nicely with the 50-marker set of mixed-speed markers... and it makes some sense that the relative variances between haplogroups (in R1b-L11) seem steadiest with the 50-marker set.
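Screening markers against a rate threshold like this is mechanical once per-marker rates are in hand. A sketch with made-up rates (these values are illustrative only, NOT Chandler/Little's published figures):

```python
# Hypothetical per-marker mutation rates in mutations/generation
# (illustrative values only, not Chandler/Little's actual estimates).
rates = {
    "DYS393": 0.00076, "DYS390": 0.00311, "DYS391": 0.00265,
    "DYS439": 0.00477, "DYS458": 0.00814, "CDYa": 0.03531,
    "CDYb": 0.03531, "DYS449": 0.00838,
}
THRESHOLD = 0.005  # the ~5,000-year guideline discussed above

usable = sorted(m for m, r in rates.items() if r <= THRESHOLD)
too_fast = sorted(m for m, r in rates.items() if r > THRESHOLD)
print("usable for ~5,000-year estimates:", usable)
print("too fast:", too_fast)
```

With real published rate tables substituted in, the same two-line filter reproduces the kind of 54-of-67 count described above.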

Among the WTY results, there must be some SNPs found that belong to the same subclade of the tree. We have no way of knowing how many of these occurred within the same generation. As we have seen recently with L21 getting bumped down one level below L459 and Z245, SNP estimates may have the same problems as STRs. I'm not saying one is better than the other, but certainly we have more data about STR mutation rates than SNP mutation rates, so to swear off one and then use the other does not make sense to me.

I think Dienekes is wrong to totally dismiss STR-based TMRCA estimation. On the other hand, we do have a pretty good idea of SNP mutation rates so I see no barrier to using SNPs as an accurate clock IF the method is well-designed.

I don't know if the content in the quote below is accurate, but I think it would be important if my interpretation of what he is saying is on target.

There has been an intensive discussion about the quality of TMRCA estimates. It has been looked into through theoretical considerations and reflected against historical information more or less backed up by empirical research. Carbon and isotope dating of artefacts associated with past populations gave us timelines to hold against these TMRCA calculations.

In genetic genealogy we use SNPs as indicators of speciation events, while the diversity of the STR-based haplotypes is seen as anagenetic development within monophyletic clusters.

By excluding the multi-copy and fast markers in the STR-based haplotypes, an effort is made to replace the SNP-based synapomorphic characters with sets of STR markers, and in doing so we come to a TMRCA estimate.

The mechanism of the variation in the VNTR is (mostly) DNA replication slippage while SNP is the consequence of a fixed mutation caused by external influences such as metabolic stress and/or ionising radiation. STR events are more or less internal; SNP novelties have external causes....Hans

If SNPs are externally caused, does it follow that traumatic environmental events could cause an increase in SNPs? That would mean that SNP mutation rates are subject to geography, climate and who knows what... at least more so than STR mutation rates.

I hope SNPs are reliable clocks so I'm not looking for ways to shoot the concept for fun. I just would like to confirm this is a logical way to go.

Quote:

If a marker has a 4% total mutation rate, all this means is that either it has a 4% mutation rate up or down, or that adding the two directional rates produces a net 4% rate in one direction (say backwards is 2% and forward is 6%, which would produce a mutational bias of 4% forward). However, simple addition doesn't quite work, because the forward and backward mutations are, in a sense, independent events. What happens when a marker mutates from 12 to 13 is that the mutability of the marker increases, so the marker is more likely to mutate from 13 to 14, or from 13 back to 12, than it was before the mutation from 12 to 13 occurred.

STR mutation behavior is complex, so any simple model is bound to have limitations. But the best way to think about it is probably to model three different variables: a probability of mutation, along with two dependent variables: the likelihood of that mutation being up, and the likelihood of it being down. I simplified the math for an example, but as long as your model incorporates this differential p(up) vs p(down) relationship as a function of allele value for a given marker, you are probably on sound footing.

Quote:

Markers only saturate when the mutation rate as a function of repeat value doesn't change much and the rate has reached a value high enough that the marker becomes unstable. That is, the region of DNA becomes prone to slippage or deletions far too often, which causes the STR to toggle between two values.

When we talk about "saturation" in this context, it means saturation of variance. Maybe you are repeating this but inarticulately?

Quote:

The mechanism of the variation in the VNTR is (mostly) DNA replication slippage while SNP is the consequence of a fixed mutation caused by external influences such as metabolic stress and/or ionising radiation. STR events are more or less internal; SNP novelties have external causes.

All mutations have external causes: even slippage is external to the DNA being replicated.

Quote:

As for a time frame of 5,000 years, which is roughly 200 generations (assuming 25 years per generation), any marker with a mutation rate higher than 0.005 mutations/generation would in fact yield an erroneous modal value, and thus undermine the real TMRCA of the population being analyzed.

And for the record, there is no such thing as an "erroneous modal value" unless you are simply bad at arithmetic.

The modal value is simply a calculated central value for a series of observed numbers. You take the values you can observe and call the most common one the "modal". It can never be wrong unless you, the observer, can't count.

Quote:

And for the record, there is no such thing as an "erroneous modal value" unless you are simply bad at arithmetic.

The modal value is simply a calculated central value for a series of observed numbers. You take the values you can observe and call the most common one the "modal". It can never be wrong unless you, the observer, can't count.

Erroneous modal value as in modal value =/= ancestral value of the MRCA. One assumption when calculating TMRCA is that the modal value is the one that minimizes mutations in the set; when the TMRCA is older than 1/mu, the modal value may in fact no longer represent the ancestral value, so measuring diversity from the modal value would lead to underestimating the TMRCA. That is, if an individual has an ancestral value of 8 at a locus with a mutation rate of 0.005 mut/gen, after 200 generations most of his progeny is going to have a value of 7 or 9, so when randomly analyzing a subset of the descendants of that man, the modal value could appear to be either 7 or 9, leading to a lower calculated diversity, because the modal value =/= ancestral value anymore.

DYS456: 14 (one sample actually retains the ancestral value of 15)
DYS389II: 30
DYS385: 13/15
DYS393: 14
GATA H4: 11
DYS448: 20

So there you have it: in just 2,000 years the modal values of those STR positions changed within the same haplogroup, G2a. Also, the allele distributions for some haplogroups look strongly non-normal at some STRs.

For example haplogroup I (n=5700) has the following allele distributions:

Mutation rate is a function of repeat number, so a mutation from 12==>13 is more likely to occur than one from 11==>12. In the same manner, a mutation from 13==>14 is more likely to occur than a mutation from 12==>13.

The expected time for at least one mutation to occur in all descendants of individual X for any given STR is 1/mu, where mu is the mutation rate of the STR.
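The modal drift argument above can be illustrated with a toy simulation. Everything here is an illustrative assumption: a clade that splits in two every 20 generations, and mu = 0.005 so that 1/mu = 200 generations, the timeframe discussed above. Because a mutation on an early branch is inherited by the whole sub-clade below it, the sampled modal value can end up differing from the ancestral value:

```python
import random
from collections import Counter

def simulate_clade(ancestral, mu, generations, split_every, seed):
    """One bifurcating clade: every line splits in two every `split_every`
    generations, and each line mutates +/-1 with probability `mu` per
    generation. A mutation on an early branch is inherited by the whole
    sub-clade below it, which is what lets the modal value drift."""
    random.seed(seed)
    lines = [ancestral]
    for g in range(1, generations + 1):
        if g % split_every == 0:
            lines = lines * 2  # every line leaves two sons
        lines = [a + random.choice((1, -1)) if random.random() < mu else a
                 for a in lines]
    return Counter(lines)

drifted = sum(
    simulate_clade(8, 0.005, 200, 20, seed).most_common(1)[0][0] != 8
    for seed in range(100)
)
print(f"modal != ancestral in {drifted} of 100 simulated clades")
```

In a noticeable fraction of runs the sampled mode is 7 or 9 rather than the ancestral 8, exactly the kind of underestimated diversity described above; a star-like genealogy with no shared early branches would drift far less often.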