
Concepts – Relationship Predictions

One of the ways people utilize autosomal DNA for genealogical matching is by looking for common segments of DNA that match with known, or unknown, relatives. When the relationship to the person is unknown, we attempt to utilize how much DNA we share with that person as a predictor of how, or at what level, we’re related to them – so in essence where or how far back we might look in our tree for a common ancestor.

Until recently, the best estimate we had in terms of how much DNA someone of a particular relationship (like first cousin) could be expected to share both in terms of percentages and also cMs (centiMorgans) of DNA was the table on the ISOGG wiki page. Often, these expected averages didn’t mesh well with what the community was seeing in reality.

Recently, Blaine Bettinger’s crowdsourced Shared cM Project reported the averages for each relationship level, plus the range represented from lowest to highest in a project where more than 10,000 people participated by providing match information.

Additionally, before publication, Blaine worked with a statistician to remove outliers in each category that might represent data entry errors, etc. Not only did Blaine write a nice blog article about this latest data release, he also wrote a corresponding paper that is downloadable that includes tables and histograms not in his blog.

I am constantly looking between the two sources, meaning the ISOGG table and Blaine’s paper, so as an effort in self-preservation, I combined the information I use routinely from the two tables – and did some analysis in the process. Let’s take a look.

The Combined Expected cM and Actual Shared cM Chart

On the chart below, the two yellow headed columns are the Expected Shared cMs from the ISOGG table and Blaine’s Shared cM Average – which is the average amount of DNA that was actually found. These, along with the percent of shared DNA are the columns I use most often, followed by Blaine’s minimum and maximum which are the ranges of matching DNA found for each category. As it turns out, the range is incredibly important – perhaps more important than the averages expected or reported – because the ranges are what we actually see in real life.

I’ve also included the number of respondents, because categories with a larger number of respondents are more likely to be more accurate than categories with only a few, like great-great-aunt/uncle with only 6.

It’s interesting that the greatest number of respondents fell into the aunts/uncles niece/nephew category with second cousins once removed a very close contender.

The next largest categories were, in order: first cousins, second cousins, first cousins once removed, second cousins once removed and third cousins.

Note: If you downloaded this chart on August 4, 2016, there was an error in the maximum number for first cousins thrice removed. On August 5, 2016, it was corrected to read 413.

You can see that in reality, all categories except two produced an actual average larger than the expected cM value. One category was equal and one was smaller (yes, I checked to be sure I hadn’t transcribed incorrectly). Actual averages that are higher than expected are peach colored, lower is green and equal is white.

Most averages aren’t dramatically different for close relationships, but as you move further out, the difference in the averages is significantly greater. Beginning with third cousins once removed, and every category below that in the chart, the actual average is more than twice that of the expected average. In addition, the ranges for all categories are wider than expected, especially the further out you go in terms of relationships.

We often wonder why the relationship predictions, especially beyond first or second cousins vary so widely at the testing companies and GedMatch. In the chart above, you can see that beyond first cousins, the ranges begin to overlap.

Relationships of the same degree should share the same percentages and, theoretically, the same amounts of DNA, but they don’t. You can see that the cells marked in red are all 4th degree relatives. However, half first cousins show a maximum of 580, with the two following rows showing 704 and 580 – all 4th degree relatives. There’s a pretty significant difference between 580 and 704.

Through 5th degree relatives, everyone matched at some level, meaning the minimum is above zero, but beginning with 6th degree relatives, row highlighted in yellow, some people did not match relatives at that level, meaning the minimum is zero.

In the last 4 rows on the chart, 15th, 16th and 17th degree relatives, marked with light aqua, where academically we “should” share 0% of our DNA, we see that the observed average is from 7 to 11 cM and the range is up to 29 cM.

An example of why predictions are so difficult is that if you are on the high end of the 4th cousins range, a 9th degree relative with 91 shared cMs of DNA, you are also right at the average between 6th and 7th degree relatives which fall into the half second cousin or third cousin range.

Without relationship knowledge, the vendor, based on averages, is going to call this relationship a 2nd or 3rd cousin, when in reality, it’s a 4th cousin. Most vendor relationship predictions are based on a combination of total shared cMs and longest block, but still, it’s easy to be outside the norm. In other words, not only does one size not fit all, it probably doesn’t fit most.

Graphs

For me, graphs help make information understandable because I can see the visual comparison.

These overlapping ranges are much easier to visualize using charts.

The values and ranges for 1st, 2nd and 3rd degree relatives are so much larger than more distant relatives, that you can’t effectively see the information for more distant relatives, so I’ve broken the charts apart, below.

This first chart, above, shows third degree relatives and closer. Note that the purple maximum for aunts/uncles, nieces/nephews is larger than the minimum for full siblings and greater than the red average or blue expected for half-siblings.

This second chart shows the more distant relationships, meaning 4th degree through 17th degree relatives, but the more distant relationships are still difficult to see, so let’s switch to bar charts and smaller groups.

This first bar chart includes parent/child through first cousin relationships, or 1st through 3rd degree relatives. You can see that the first cousin maximum (purple) overlaps the aunt/uncle, grandparent and half-sibling minimums (green). The half-sibling maximum and full sibling minimum are very close.

The balance of relationships are a bit small to view in one chart, but the ranges do overlap significantly. Unfortunately, Excel does us the favor of skipping some labels on the left side of the chart.

Removing the legend helps a bit, but not much. Please refer to the color legend in the same graph above. I’ve further divided the groups below.

The same chart, above, with the legend removed to allow for more viewing space. You’ll notice that at the half second cousins level, and more distant, the green minimum disappears, which means that some people have no matches, so the minimum cM shared is zero for some people with this relationship level. However, based on the average and maximum, many people do share DNA with people at that relationship level.

The chart above begins with 7th degree relatives, half second cousins, where you share less than 1% of your DNA.

In many cases, the purple maximum range for one relationship category overlaps Blaine’s average and the expected values for other categories. For example, in the chart below, you can see that the maximum purple bar for the various 5th cousin ranges is higher than the third cousin, twice removed red shared cM average, and significantly higher than the blue expected shared cM value. In fact, the 6th cousin purple max is nearly the same as the blue expected cM for third cousins once removed. Note that Excel showed only every other category on the left hand axis, so you’ll need to refer to the actual data chart from time to time.

I’ve removed the legend again so you can see the actual stacked ranges more clearly. All of the 7th degree relatives have a minimum of zero, so there is no green bar. Furthermore, at 5th cousins twice removed, the expected shared cMs drop to below 1, so the blue bar is nearly indistinguishable.

This last chart shows the smallest group, 9th through 17th degrees, or 4th through 8th cousins.

On this final chart, we clearly see that Blaine’s actual red shared cMs and the purple maximum are significantly more pronounced than the blue expected shared cMs. Some people share no DNA at this level, which is to be expected, but a non-trivial number of people share significantly more than is mathematically expected. There are no absolutes.

Summary

DNA is not always inherited in the fashion or amount expected, and that wide variance is why we see what people believe are “false positive” relationship predictions. In reality, the best the vendors can do is to work with the averages. This also explains why it’s so difficult for us to estimate or determine how a person might be connected based just on the relationship or generational prediction. It’s just that, a prediction based on averages which may or may not reflect reality.

There’s a lot we don’t know yet about inheritance – why certain segments are passed on, often intact, sometimes for many generations, and some segments are not. We don’t know how segments are “selected” for inheritance and we don’t yet know why some segments appear to be “sticky” meaning they show up more in descendants than other segments.

Close relationships are relatively easy, or easier, to predict, at least by relationship degree, but further distant ones are almost impossible to predict accurately based on either academic inheritance models or Blaine’s crowdsourced average cM information.

Here’s a clean copy of the combined chart for your use.


There are certainly valid cousins in the mix, the challenge of course, especially in endogamous populations, is to sort the wheat from the chaff, genealogically speaking. I have several people with whom I match on a common segment with several others descended from the same family from many generations back. Further than “should” match, but we do and we’ve identified a common ancestor.

About 10 generations, BUT, the problem is that unless everyone’s tree is complete, you really can’t be sure that’s the ONLY common ancestor between multiple people. And very few people, me included, have a complete tree 10 generations back. In some cases, you can rule out missing branches, like if the parents are from Italy off the boat – you can be pretty sure unless your family is Italian that their Italian line is not relevant to you – but in most cases, you can’t be sure there aren’t other common lines too. The best you can do is hope for multiple matches on that segment from the same line so that ambiguity is lessened.

Roberta, excellent summary! Reading it raised two thoughts:
1) What biases are introduced in the sample for Blaine’s data? My inclination is that we have a lot of people a) living in U.S. and b) with deep roots in U.S., introducing endogamy potentially.
2) It seems there is an opportunity for FtDNA and Ancestry to take confirmed matches (as documented in a tree) and incorporate this into their relationship predictions.
What do you think?

Thanks Roberta for a useful (as always) post
Many of us are working on the fringe of IBD/IBS segments.
Here is what a contact sent me, taken from GEDmatch chromosome browser in the tool People that match one or both kits.

Relationship – Unknown but probably more than 10 generations. First, glad the twins have the same segment, but it is longer than the one inherited from me, and mine is almost intact from my mother. Also this tool is not as strict as the one-on-one matches – where there is no match…

My comment here is that the variability seen in the distant relations could be due to such phenomena – adding slightly to the segment in the offspring, hence conserving the segments through generations.
Now we are working on finding a common ancestor – one possible family is where I have FIVE lines connecting to this immigrant. My correspondent is from France and he has two with that surname (Pelletier). We are missing 100 years and 50 km to make the link… So maybe IBS after all.

Roberta, I have been working really hard on my autosomal matches to take advantage of the new phasing tool, and have been adding to a new tree of my ancestors to upload to FTDNA, which will include my known autosomal matches. I don’t want to upload my entire tree, just my ancestors and the matches where they fit it in, and I intend to add to it each time I confirm a relationship.

But, I am finding one thing very puzzling, and a little disturbing. I understood that if a man has an “X” match, then the relationship can only be in the maternal line. My maternal uncle, and my brother, and several paternal cousins match me on their “X.” Several of my maternal cousins don’t, so I am confused. This is affecting where I look for possible common ancestors, naturally. Have you addressed this somewhere that I maybe missed?

I have pored over Blaine’s charts for hours trying to solve several mysteries! One question I have concerns the test results for my mother and her two maternal first cousins. All three match each other at very low levels (below the minimum reflected in Blaine’s data). I don’t know whether they are truly first cousins and the sample simply isn’t large enough yet to account for the low numbers, or whether they are each half first cousins to each other. That would mean that my once married great grandmother had three children by three different men (7 children all together, but descendants of only 3 tested). What seems more likely given these numbers from FTDNA?
L to M- 522cm total, longest 96cm
L to J- 494cm total, longest 43cm
M to J- 379cm total, longest 65cm

Roberta, thank you for the article. What puzzles me is why full siblings do not share more cM than a parent/child relationship. They both have a probability of 50%. I have the data from my family which agrees with the chart, but why am I finding this hard to understand?

Full siblings, unless they are identical twins, will each inherit half of their parent’s DNA, but not the same half. Some of the parents’ DNA will be the same in the siblings, but not all, so they will have a smaller cM matching count.

In the table, a parent/child relationship is expected to share 50% or 3400 cM. For full siblings, the expected % is also 50%, however the number of cM expected is only 2550. If you look at the 2nd relationship degree, the % expected to be shared is 25% and 1700 cM for each of the three relationship rows.

If you are expected to share 2550 cM, one would normally equate this to 37.5% rather than the 50% shown in the table. Can you please explain why for full siblings 2550 cM equates to 50%?

If you look at the ISOGG table, this may answer your question. When you are dealing with siblings, some of the matching regions are fully identical and some are half identical. The fully identical regions should be counted twice, but in some cases they are counted only once, as a single match, rather than once for each parent. I think that accounts for the difference you are seeing. http://isogg.org/wiki/Autosomal_DNA_statistics

Why does that logic not apply to a parent/child relationship, where the table shows that the full range of inherited cMs exceeds the full range of the inherited cMs of siblings? Sorry, but I still do not understand.

If you look at the ISOGG table, this may answer your question. When you are dealing with siblings, some of the matching regions are fully identical and some are half identical. The fully identical regions should be counted twice, but in some cases they are counted only once, as a single match, rather than once for each parent. I think that accounts for the difference you are seeing. http://isogg.org/wiki/Autosomal_DNA_statistics

Thank you … If I understand you correctly, then full siblings would be expected to have about 3400-2550 or 850 cM that match both parents (expected amount) with a portion of the 850 cM being IBS. Is this correct?

Does this mean on average one would expect a mother and father to share the same DNA segments and are related somehow in the distant past?

I think we’re having a definition issue here. None of the DNA of a person as compared to their parents or siblings is IBS. IBS is really a combination of two terms: IBC, which is identical by chance, and IBP, identical by population, which means the people all came out of the same population (the Jewish population, for example) and carry old segments that can’t be genealogically identified. You can genealogically identify all of your DNA that comes from your parents because there are only two people you can get it from. So none of your DNA is IBS when compared to siblings and parents. When you compare your results to a sibling, some of the DNA that matches the sibling, some of those addresses, actually matches both parents, not just one parent. So you show as a match, but those segments where you match both parents should be counted twice. That is not uniformly done. One of the reasons I didn’t mention this is because it is very confusing to people. Think of a street with houses on both sides with the same address. One side is your Mom’s and one side is your Dad’s. When compared to a sibling, on locations where you match one side or the other, that’s counted as a match. In some cases, a location where you match on both sides is counted as 1 match and in some cases, it’s counted as 2, one for each side. The easy way to tell if someone is a full or half sibling is that full siblings have fully identical regions (FIR) and half siblings do not, because they only match through ONE parent.

The definition problem is my understanding. Thank you for your patience again. So, in those sections that are fully identical, one matching the father and the other the mother, they are generally only counted as one parent. In these sections, does the section from the mother match the section from the father; or are they different, assuming the parents are not related in the first 10 generations or so.

No, the sections from the mother and father don’t match each other (I’m assuming no endogamy here), but the child matches both parents individually on their individual sides of the street at that address. The parents don’t match each other at that address. But the child always matches both parents at that address. When you are comparing two siblings, on part of those addresses, they will have no matching DNA, meaning at that location, they didn’t inherit the same DNA from either parent. In some locations, they will have the same DNA from one parent, and in some locations, they will have DNA from both parents, e.g. both sides of the road, that matches their sibling.
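The arithmetic behind the 2550 cM versus 3400 cM question in this thread can be sketched in a few lines of Python. This is only an illustration, not the ISOGG calculation itself. It assumes the usual expectation that full siblings are half identical on about 50% of the genome, fully identical on about 25%, and not matching on the remaining 25%, and it takes 3400 cM (the parent/child figure in the table) as the length of one copy of the genome. The function and variable names are made up for the example.

```python
# Expected full-sibling sharing, counted two different ways.
# Assumptions (not taken from the chart itself): 50% half identical (HIR),
# 25% fully identical (FIR), 25% no match, and 3400 cM per genome copy.
GENOME_CM = 3400.0   # cM in one copy of the genome (parent/child value)
HIR_FRACTION = 0.50  # matches the sibling on one side of the street
FIR_FRACTION = 0.25  # matches the sibling on both sides of the street


def sibling_cm(count_fir_twice: bool) -> float:
    """Expected shared cM for full siblings under the assumptions above."""
    hir_cm = HIR_FRACTION * GENOME_CM
    fir_cm = FIR_FRACTION * GENOME_CM
    fir_weight = 2 if count_fir_twice else 1
    return hir_cm + fir_weight * fir_cm


counted_once = sibling_cm(count_fir_twice=False)
counted_twice = sibling_cm(count_fir_twice=True)

# Percentages are taken against both copies of the genome (2 * 3400 cM).
for label, cm in (("FIR counted once", counted_once),
                  ("FIR counted twice", counted_twice)):
    print(f"{label}: {cm:.0f} cM = {100 * cm / (2 * GENOME_CM):.1f}%")
```

Under these assumptions, counting the fully identical regions once gives 2550 cM (37.5%), and counting them twice gives 3400 cM (50%), which is the gap the question was about.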

Roberta,
First, I enjoy reading your Blog even if a lot of it goes way over my head.
I want to ask a question, possibly an entry-level one.
I refer to the Blog, “Joseph Rice 1700-1766 – a dissenter…” and the DNA Chart of individuals who have been tested.
My biological brother is Subject # 20980.
Through 67 markers with only one mutation from the others, Subjects # 4897, 20980, 153550 and 163337 seem to be identical [to the untrained mind].
What does this actually mean for the relationship of these four individuals?
Thanks for your Blog – I am learning.

Y DNA proves that men do or do not descend from a common ancestor. Closer tests, meaning those with fewer mutations (genetic distance) generally have a common ancestor closer in time. You can also compare their Family Finder tests to see if they are listed as having matching autosomal DNA which would suggest a relatively close relationship – meaning probably within the past 5 or 6 generations. Notice all the “weasel words” like generally and probably. Nothing is absolute until you find the common ancestor genealogically and that link is confirmed by DNA of one sort or another.

Thank you for your article. It is very interesting. With regard to your table, I was comparing the expected shared as a % and in cM. I noticed for the 1 degree relationships that the expected shared DNA in % is the same, however the expected shared in cM differs; i.e. Parent/child equals 50% and 3400 cM while full siblings show 50% and 2550 cM. If I look at other relationship degrees, the table shows the same % expected shared and the same cM expected shared for each relationship within the same degree; i.e. half siblings, grandparents, aunts/uncles and niece/nephews each have an expected shared of 25% and 1700 cM. Can you please help me understand why the expected share in cMs differs for the 1st degree? Thank you

If you look at the ISOGG table, this may answer your question. When you are dealing with siblings, some of the matching regions are fully identical and some are half identical. The fully identical regions should be counted twice, but in some cases they are counted only once, as a single match, rather than once for each parent. I think that accounts for the difference you are seeing. http://isogg.org/wiki/Autosomal_DNA_statistics

Really meaningful work! When we remember that “zeros” here are not real zeros – those just don’t meet the matching criteria – the average cMs for distant cousins would be even higher than reported here and so the difference between expected and observed would also be bigger. If I understood the numbers right…

Shouldn’t crowd-sourced data show a strong bias toward matching? This won’t be an issue with known, close family. But starting with second cousins it likely is, and it must be dominating the results for distant cousins. Who knows their eighth cousins, much less phones them and asks them to take a DNA test? Instead, we find a few of them because they tested and happen to match us. Eighth cousins will not share an average of 9 cM from their common 7th-great grandparents (though in some populations they might share that from a background of endogamy). In my own family the matching among a few dozen 3rd-5th cousins is close to the expected averages rather than to the results from Blaine’s responders, I think because I have worked to reduce selection bias by finding testers who don’t match along with those who do.

Of course Blaine asks, but people don’t report what they don’t know. People report the 8th cousin whom they match and not 100 others (among 10,000 8th cousins) who tested and don’t match. No problem for close family, but at 3rd and beyond the crowd-report quickly becomes untenable. How do you find even your non-matching thirds? One way is Ancestor Circles at Ancestry, which display all tested descendants of an ancestor, matching you and not, who match at least 2 in the circle. At last, a simple way to find non-matchers. This removes most of the selection bias (not all). If people reported only from these Ancestor Circles, reporting for all in each circle, I think the averages would be much closer to the expected.

Conversely, there would be more matching if the thresholds were lower, because the testers have to get over the initial threshold before they can be counted as a match on an individual segment. And at Ancestry, some matching segments may well be stripped out with Timber, so their matching DNA amounts will actually be lower than the rest. I think both of these factors are at play here.

The matching threshold surely is part of the calculation as to the expected. The changes with Timber do not appear to have had much effect. These are minor issues that won’t account for a discrepancy that grows to orders of magnitude. I see now that “Blaine’s shared average” in fact is behaving just as we would expect (final graph above). The curve should drop exponentially at distant generations but instead it flatlines, because responders don’t know their exponentially increasing non-matchers and don’t report them. At 3rd cousin, when 90% match, it’s a small problem. At 4th, when ~50% match, it’s already meaningless to report an average because the base isn’t known, and we have a large error. At 5th and beyond it’s absurd, as non-matchers increase exponentially over matchers by close to 4-fold with each degree of cousinhood. Instead, people report the matches they have found and maybe a non-match or two, if they happen to know of any, at a constant ratio that probably reflects the psychology of completing a task more than anything else.

John – I think you’re approaching this from a population genetics standpoint rather than a genetic genealogy standpoint. Including reports of 0 cM in the data will provide information about the average across the entire population (which is not the number we need or care about), but it won’t provide the information we really need: what the average is for people who DO share DNA at that relationship.

Accordingly, we intentionally do not want to include reports of 0 cM when calculating the average, because that will distort the data for genetic genealogists, who are looking at cases where they DO share DNA (the likelihood of sharing DNA with a 6C or 8C is an entirely different data set). Indeed, the Shared cM Project states that for relationships greater than 2C, “the average was calculated only for those sharing DNA.” A genetic genealogist will use this chart when they find that they share DNA with person X, and want to find which relationships the shared amount of DNA most closely match, as a clue for additional research.

I’ll also note that the expected averages don’t account for 0 cM sharing, either. They are based solely on a 50% inheritance at every generation.

The matching bias here results from the testing company thresholds, but that is unavoidable using this crowd-sourced data.

Very helpful discussion here, and this has been an issue on my mind when comparing the tools. I have been of the same mind as John, considering that the sample in Blaine’s data is biased toward those with more matching DNA in the more distant relationships, as it was likely the DNA match drove the discovery of the genealogical relationship, and considering that problematic. Blaine’s post has helped me to reframe and reconsider. Would you agree on the following:

The ISOGG tool provides estimates based on POPULATION genetics. Blaine’s tool does not, by design, but instead provides utility in terms of interpreting potential relationships between those people called as “matches” by testing companies [in other words, the two tools have different denominators: (population at large) versus ((population of DNA testers) – (truly related folks that aren’t called as matches and the testers are unaware of their genealogical relationship)) ]. It is therefore EXPECTED, and not suggestive of some misunderstanding of DNA inheritance or weird biological phenomenon (cue conspiracy theorists?), that Blaine’s data (speaking mostly, say, about 3rd cousin and beyond relationships here) might exhibit a mean shared DNA above that predicted in the ISOGG tool.

I like to think through how this might be explained to a client in understandable, but generally accurate, terms…

Actually the ISOGG table (you’re referring to the table entitled “Average autosomal DNA shared by pairs of relatives, in percentages and centiMorgans” correct?) is only based on the fact that a child inherits 50% of her parent’s DNA. Period. It does not take into account any averages at all. Further, the ISOGG table completely ignores the possibility of pedigree collapse, endogamy, or any other factor influencing the standard 50% rule.

So the expected starts with 100% for self, 50% for children, 25% for grandchildren, and so on. While it is expected to be the average based on genetic inheritance, the actual average can and probably should vary widely.
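The halving rule described here is easy to sketch. The snippet below is a minimal illustration, assuming a total of 6800 cM (twice the 3400 cM parent/child figure in the table) and using the relationship degree as the number of halvings. The names `TOTAL_CM` and `expected_share` are invented for the example. Full siblings are the one row in the table that does not follow this rule, since they are listed at 2550 cM.

```python
# Expected sharing under the plain "50% per step" rule.
# Assumption: 6800 cM total, i.e. twice the 3400 cM parent/child figure.
TOTAL_CM = 6800.0


def expected_share(degree: int) -> tuple[float, float]:
    """Return (percent, cM) expected for a relative of the given degree."""
    fraction = 0.5 ** degree
    return fraction * 100.0, fraction * TOTAL_CM


# 1 = parent/child, 2 = grandparent, 3 = first cousin,
# 7 = third cousin, 9 = fourth cousin, 17 = eighth cousin
for degree in (1, 2, 3, 7, 9, 17):
    pct, cm = expected_share(degree)
    print(f"degree {degree:2d}: {pct:.4f}% ~ {cm:.2f} cM")
```

By the 17th degree the expected amount is a small fraction of a single cM, which is why the table effectively shows 0% for the most distant rows.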

So pedigree collapse, endogamy and other factors aside, I guess am operating under the assumption that recombination is truly random (which I question if can be the case? Would love to hear thoughts on that.)… over the entire human population across time, would you not expect that the mean DNA shared amongst all instances of a given relationship to be as stated in the “Average autosomal DNA shared by pairs of relatives, in percentages and centiMorgans,” but with increasing variability (range, 95% confidence interval, standard deviation- however you want to conceptualize it) around that mean as the relationship moves farther and farther away from parent-child, due to more instances of recombination? I appreciate this is moving into the realm of theoretical versus practical…

Yes, I do expect that ranges around the average will vary widely as the relationship moves farther and farther away from parent-child, due to more instances of recombination. But, of course, the ISOGG table does not take that into account in any way.

I do not, however, expect that over the entire human population across time, the mean DNA shared amongst all instances of a given relationship would be as stated in the ISOGG table. I know you said “pedigree collapse, endogamy and other factors aside,” but they are FAR too prevalent to ignore. They have an enormous impact on these averages. And, these factors were FAR more prevalent in history (when populations were tiny) than now (when populations are huge).

And recombination is not perfectly random. There are recombination hotspots, recombination suppressors, and possibly even other factors that we don’t understand yet. We are still in the infancy of understanding recombination and inheritance.

Following the discussion started by John Yates: Does that mean that the “Blaine’s Shared cM Min” (0) is not included in the “Blaine’s Shared cM Average” for relationships further out than 2nd cousins? If not, are they included in the number for “Blaine’s Shared cM # Respondents”?

The chart is great. Very helpful. And I am particularly surprised to see that me and my first cousin are the “Blaine’s Shared cM Min” representatives! Being that we are related through a very endogamous population, I had initially expected to see a larger than average match. Oh, fickle recombination!

To Blaine, and everyone-
This has been a great, and very helpful, exchange! I especially appreciate your input on the lack of randomness of recombination, Blaine… that has been on my mind, as intuitively it seemed doubtful. I need to find some good references to educate myself on the hotspots and suppressors.

Delving into some of the nuances of this underscores the importance of the shared centiMorgan work… I am curious if there are any potential animal models (with short reproductive timeframes) that could inform our understanding of the relevant influences on recombination. So, I guess I am curious and impatient… : )

Ha-ha, yes. I shouldn’t post so late at night. My apology for misunderstanding the point. I sensed a comparison being made, when the article’s intent was to contrast apples and pears. As pointed out, the theoretical average over all cousins isn’t so relevant. An apples-to-apples comparison versus Blaine’s data would use a theoretical average of those who match above 7 cM, or the threshold used by the testing companies. In this case the theoretical blue bars would approach the threshold, say 7 cM, always remaining above it. The stickiness, or lumpiness as I call it, would still be apparent, but wouldn’t appear so extreme.

Oh yes, when you divide by two every generation (by 4 for every degree of cousin-hood) you are getting the average sharing over all cousins, and this average indeed includes people who don’t share at all. It is a combinatorial probability covering all possible outcomes. It wouldn’t be an average if it excluded outcomes, surely not the zeroes, who come to dominate for distant cousins.
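To make the two kinds of average concrete, here is a rough sketch of how an average over all cousins converts into an average over matching cousins only. It assumes that non-matching cousins contribute exactly 0 cM (a simplification, since sharing below a vendor threshold is not truly zero), a total of 6800 cM, and the approximate matching rates quoted earlier in this thread (about 90% for third cousins, about 50% for fourth cousins). The function name is hypothetical.

```python
# Converting an "all cousins" average into a "matchers only" average.
# Assumptions: 6800 cM total, and non-matchers counted as exactly 0 cM.
TOTAL_CM = 6800.0


def mean_among_matchers(degree: int, fraction_matching: float) -> float:
    """Average cM among cousins who match, given the share who match at all."""
    if not 0.0 < fraction_matching <= 1.0:
        raise ValueError("fraction_matching must be in (0, 1]")
    mean_over_all = TOTAL_CM * 0.5 ** degree  # includes the zeros
    return mean_over_all / fraction_matching  # zeros removed


# Rough matching rates quoted in the comments, not measured values.
print(f"3rd cousins: {mean_among_matchers(7, 0.90):.1f} cM among matchers")
print(f"4th cousins: {mean_among_matchers(9, 0.50):.1f} cM among matchers")
```

The smaller the fraction of cousins who match, the further the matchers-only average sits above the halving-rule figure, which is the direction of the gap seen in the chart. A fuller model would also account for the minimum segment size each vendor reports.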

I just want to add my thanks – most of what I understand about genealogical uses of DNA I have learned from you. You are very generous with your knowledge and patient with your pupils. It’s very much appreciated!

Thanks for a very nice blog post!
In thinking about the “Degree 4” cells which you marked in red, there are 839 respondents for the 3rd category and 6 for the first. You should maybe compare the maximum in category 1 (580) with the value at the top 16.67% (1/6) of category 3, not the top 0.1% (1/839) of category 3. The value at the top 16% will be closer to 580. A check on this is that the observed max is increasing with the number of respondents.
Thanks again for this blog post…..

This is more on the 3 cells that Roberta marked red.
These actually agree, and I’ll try to explain the math.
One has 6 members and a maximum of 580
One has 23 members and a maximum of 704
One has 839 members and a maximum of 753, and Blaine includes the histogram.
If we take the distribution with 839 members and try to predict the maximum we would see in a sample of 23 or 6, you do it this way:
1. Calculate 1/23 * 839 = 36.5. Then you count backwards in the 839-person histogram 36.5 spots and check where you are. I think this is in the range 694-717, so it checks with 704.
2. Calculate 1/6 * 839 = 139.8. Then count backwards in the 839-person histogram 139.8 spots and check where you are. I think this is near the boundary between 588 and 545, so it checks again.
If Blaine would provide the 839 numbers, the 23 numbers, and the 6 numbers, one could see this more clearly after converting to percentiles.
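The counting-back rule of thumb in the two steps above could be sketched like this in Python. The function name is hypothetical, and since the actual 839 values are not available here, the usage line runs on placeholder numbers rather than real data.

```python
def predicted_max(values, k):
    """Rough guess at the maximum seen in a sample of size k.

    Follows the rule of thumb above: sort the larger sample from
    highest to lowest, count back about N/k places from the top,
    and read off the value found there.
    """
    ordered = sorted(values, reverse=True)
    n = len(ordered)
    rank = round(n / k)            # e.g. 839 / 23 is about 36
    rank = min(max(rank, 1), n)    # keep the rank inside the list
    return ordered[rank - 1]

# Placeholder data only: 839 evenly spaced stand-in values.
placeholder = list(range(1, 840))
print(predicted_max(placeholder, 23))
print(predicted_max(placeholder, 6))
```

With the real 839 shared-cM values in place of the placeholder list, the two printed numbers could be compared directly against the observed maximums of 704 and 580.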

This is a very very interesting study! Thank you Roberta for summarizing it and also for that extra column of degree of relationship!

Thanks so much for the clear and very useful discussion. I’ve saved the charts and no doubt will be referring to them often! Nice to know that somebody’s brain is still functioning in an orderly and relevant manner while mine goes astray 😉

It seems to me, that a significant factor in all of this is, as atDNA recombination is chunky (typically 2, sometimes 3 chunks per chromosome–not finely interleaved like the shuffling of playing cards), beginning at some point (roughly 4 generations?), the ancestral atDNA chunks probably do not just keep getting smaller, but some of the remaining sub-chunks have a greater tendency to either be passed on intact, or not at all, ever-more-so with each succeeding generation.

As more and more sub-chunks are not passed on, room is left for the sub-chunks that do make it through to be above the expected average amount, sometimes well above.

This is what makes it possible, though probably rare, to have a reported detectable atDNA match with an 8th Cousin of at least 5 cM to 7 cM, up to 140x what the projected expected averages predict would be a shared amount of ~0.05 cM.

So, when someone has a reported detectable atDNA match beyond about a 3rd Cousin, chances are the shared amount will be higher than average (but might not). This might explain why most people are so frustrated by trying to figure out how atDNA matches who are estimated to be 2nd to 4th Cousins are not, and who might actually be much more distant (possibly 5th to even 10th cousins, or further?), with no way to gauge it from only the amount of shared atDNA between the two atDNA test takers. Other evidence is hopefully available that can be drawn in to try to narrow the range of possibilities.

I constantly have to remind myself: the smaller the amount of shared atDNA, the greater the range of possible relationships.

In this endeavor of relationship prediction, as most genetic genealogists have figured out, the DNA testing companies’ estimated predictions are not all equal or done in the same manner. Even the method of measuring the shared amount varies. In my limited experience, FTDNA relationships have the greatest likelihood of being too optimistic, more so than ancestry.com or 23andme, but I could be wrong, and this could change and probably will. Others can probably offer a more useful observation here.

Segments are certainly chunky, but in a random way.
Another blogger, Jim Bartlett, has set this out at segmentology (Google it).
I think the posting that explains this best is called “..from the bottom up”.
And don’t forget the corollary of the porcupine effect.
The further back you go, the more likely you are to inherit NOTHING from some people – the Porcupine Effect.
This is a zero-sum game. That deleted DNA will be balanced by inheriting more from other ancestors. THAT is why some matches appear closer than they are. It’s the flip-side from inheriting no DNA at all from other ancestors.

Please Note: If you downloaded this chart on August 4, 2016, there was an error in the maximum number for first cousins thrice removed. On August 5, 2016, it was corrected to 413 and the graphs were updated as well.

I would like to pick up the point about endogamy here – not so much first cousins marrying each other, but what I would call the village effect, families interconnecting with each other through several generations. On Anthrogenica, we had the discussion as to who was most inbred, the village in Quebec or the one in France. What we used as a measure was the number of ancestors our respective parents had in common. I have 9 couples, but my French counterpart has almost twice as many. Also the same couple can appear four to five times. Would that have more of an effect than having more couples each appearing once?

The question is, how could this be applied to the expected inheritance? Some sort of correction factor?

Yes, the village effect is what we see in the Ashkenazi Jewish population. Not only is the same DNA being passed around, smaller segments recombine to look like larger segments. One of the struggles the companies have is that the optimal matching threshold for an endogamous person is significantly higher than for one who is not. The village factor is true endogamy, whereas cousins marrying is consanguinity.

My phone was misbehaving and I didn’t finish my comment. Jewish people have roughly three times the matches that non-endogamous people do. FTDNA does apply a corrective factor but it’s not enough. A non-relevant match for them may be a relevant match for someone else. Where cousins are involved, it’s just a matter of who they match, not of matching everyone who descends from the village. We see this same issue in Acadians, the difference being that we know who the Acadians were and those families are so intermarried we still can’t figure out whose DNA is whose.

I have small village in Quebec plus multiple Acadian ancestry. So, you can imagine how much fun I am having. I am down to ascribing a common DNA segment to the village instead of a couple. If I see Rivière-Ouelle, I give up!
I was all excited to find a common segment with five other people using GEDmatch. Comparing our trees, the first respondent identified 9 couples. With the next one, we are down to four – two Quebec, two Acadians. The third one did not have a full tree, but because she is from Louisiana, I decided Acadian more likely – we are down to the two Acadian couples as being the source. Of course pioneers, nine to ten generations back.

In a way I was glad to hear that the Cajun/Acadian DNA has not been figured out yet. I looked at it for a while with my dad’s DNA results but I gave up. Once I’d go back far enough to find a common couple, there would always be 4 or more common couples along with them. And it never was the same four, of course. All the trees go back to the same people, there are just different branches along the way. On the other hand, I was hoping to hear that there was a way to figure it all out.

To Van Landry
If you are a Landry, it’s easy, I would say it’s all done. One of the best sites is this one: http://mwlandry.ca/genealog/surnames.php with more than 40,000 Landry listed, with sources.
That site even includes some Y-DNA data to unravel the two René Landry.
If you can’t link into it, you can always send Marcel Landry an e-mail.

I have just one 1st cousin 3x removed. They’re actually younger than my grandmother. Long story…However, we share 220 cM which is double what would be expected. This just shows how hard relationship predictions are to make.

Roberta, in cases like jbower14’s, where a match shows more DNA than expected for the relationship, could the excess DNA be because there are also other common ancestors behind brick walls that neither match knows about, who contribute to the excess? Just curious.

That is an interesting thought, caith. However, in my case that would be impossible, as the other line on my mother’s side is entirely Irish. On the other line, there is no Irish at all. However, there may be a hidden Irish ancestor…

I looked through Blaine’s paper. I was wondering if you knew whether this is taking the total cM above some individual segment length (such as 7 cM) or just the total including all segments (including possibly IBS or IBC below 3 cM). Also, do you know if it makes a difference in estimating the closeness of the relationship if a match has two 12 cM segments matching on two different chromosomes as opposed to a single 25 cM segment match?

Thanks for responding. Unless I misinterpreted your response, those two rules could lead to vastly different cM levels for the same kit. For example, a match could read 36 cM total with an 8 cM longest segment at FTDNA, and only 8 cM at GEDmatch using that filter level in the one-to-one comparison tool.

Regarding the second part of my question, does your study take the number of matching segments into account (2 12cM matches vs. 1 25 cM segment) when gauging the closeness of relationships? Or were you bound by the figures people sent to you alone?

Yes, there will be company differences for the same relationships. And I would like to report that someday, but unfortunately not soon. Note, however, that all variations will be within the histogram; in other words, if I only showed results for 2C from FTDNA, it would be a subset of the overall 2C histogram. Same if I showed only the Ancestry or only the 23andMe. I have not seen any evidence that differences in company reporting of cMs shift a prediction from one relationship to another relationship except for very distant (5C and beyond), where the relationship predictions aren’t very valuable anyway.

The study accepts data about # of segments, but I have not collated or analyzed that data.

Thanks for the combined charts but it still leaves me scratching my head as much as when I just try to use Blaine’s chart because of the wide ranges and the fact that I’m about 90% Ashkenazi. FTDNA has identified someone as my 2nd to third cousin. Here is the data:
2nd Cousin – 3rd Cousin
249 total cm
42 longest cm
X-Match

We think we connect through his grandmother who had the same maiden name as my mother. But his mother was born in Moscow and he believes that both his grandmother and great grandfather were as well, while my family seems to be from the Mogilev, Belarus area. I don’t have much to confirm that though. Blaine’s chart suggests multiple paths that are compatible with the above data. Do you have any idea how I might determine the most likely one?

You just have to use genealogy now. These numbers only give you average and ranges which may or may not be relevant in your own situation. It does give you a place to begin searching, but that’s about it.

Roberta, thank you so much for this article. Your article on the wide variations of DNA shared between two relatives has partially answered a question that I have been wrestling with for a couple of months, and which neither FTDNA nor Ancestry has answered. Perhaps you would be so kind as to consider my case and tell me what you think. I am an adoptee. I found my mother’s family in 1998 (she died in 1991) and since then have been busy researching my maternal tree and with DNA testing of maternal family members. The Court refuses to reveal my natural father’s identity, and the man indicated by my aunt and others has – through DNA testing of his grandson – proven not to be my father. I have tested with all 3 major companies. A few months ago I had a big breakthrough regarding my father’s identity through the help of DNA matching at FTDNA and AncestryDNA, and through the comparison of the family trees of paternal DNA matches. I now know that my father was descended from the Jung family of Siefersheim, Rheinland-Pfalz, Germany, and the Whittaker family of Manchester, England. Both families emigrated first to New York, then to southern Michigan, where I was born. The only marriage that I have been able to find between representatives of these two families is that of Frank Casimir Young (surname was changed from Jung to Young) and Nellie Anna Whittaker. They would be my paternal grandparents. From the Jung side, the amount of cM shared by me and my three Jung matches would make my descent from Frank Casimir Young very likely. The common progenitor was Conrad Jung, Senior. Two of those DNA matches descend from his son Conrad Jung, Junior; one match descends from Casimir Jung, Frank’s grandfather. I share much more cM with the descendant of Casimir Jung than with the other two, so it is logical that I too descend from Casimir Jung rather than from Conrad Jung, Junior.
But on the Whittaker side there is a cM problem that causes me to doubt that this particular couple – Frank Young and Nellie Whittaker – are actually my grandparents, rather than some degree of cousins. I have five clear Whittaker matches. Four of them share 117, 103, 40 and 33 cM respectively (the fifth is a son of the one who shares 103 cM, and I do not count him here). 117, 103 and 40 are descendants of John B. Whittaker. The first two – 117 and 103 – are descendants of his son John David Whittaker, whereas 33 cM is a descendant of another son, James Barry Whittaker. At first I thought that I too must be descended from the son John David Whittaker, but Nellie Anna Whittaker – my likely grandmother – is a daughter of James Barry Whittaker, with whose descendant I share only 33 cM. If Nellie Anna Whittaker and Frank Young are my grandparents, then the match with 33 cM would be my 2nd Cousin Once Removed, and the other two matches – 117 and 103 – would be my 3rd Cousins Once Removed, if I’ve counted correctly! Is this possible? Is this likely? I have strained the internet for marriages involving descendants of the two families – Jung and Whittaker – and the only one I can find – and which is in the right place and time as well – is between Frank Casimir Jung and Nellie Anna Whittaker. http://www.wikitree.com/wiki/Young-18815 http://www.wikitree.com/wiki/Whittaker-1453
Their son Howard looks very much like me: http://www.wikitree.com/wiki/Young-18873
What are your thoughts on this problem? I’d appreciate your attention to this question very much. Thank you in advance,
Albertus

Ok, interesting. What’s weird is that on AncestryDNA, it says we share 10.5 cM, but then, on DNA.Land and GEDmatch, it says we share about 21.9 cM. It looks like we may have a common ancestor though, as she has a dead end on a line with a surname that I have in my tree, and they were living in the same county.

Just to be overly technical, the statistical analysis determined the outliers. I didn’t curate or edit the data subjectively, it was an objective statistical analysis. No system is perfect, but to make this as rigorous and scientific as possible, I tried to remove as much subjectivity as possible.

If your cM value was too low, I would at least investigate whether the relationship was as expected. If your cM value was too high, I would investigate whether there is sharing on multiple lines.

My mom shares 535.1 cM with her first cousin, twice removed, which is way outside the range. The match seems to only be on this line, because both lines are traced back really far and there is really no room for overlap. Is this possible?

Roberta
I have several cases of “add-on” segments due to identical by population. One thing I always wonder when people tell me that the lines go very far back and there is no overlap, is how WIDE have they gone? In genetics, the fan is of utmost importance. Most of the genocousins who contact me link through these odd paths.

Hi Roberta, a group of us in a surname project have been using the chart frequently, thanks so much. But a question has arisen that I could not answer.

Is Blaine’s chart based on FTDNA’s total shared cMs, using only their blocks of over 7? We’ve been uploading to GEDmatch and when we don’t see a match, lower the threshold to say 3. We show different total shared cMs on every threshold we set, obviously. My closest match on FTDNA is a confirmed second cousin with whom I share 175 cMs. If I do a one-to-one on GEDmatch with a distant cousin at a very low threshold, I can come up with over a thousand shared cMs! It’s clearly comparing apples to oranges. Using FTDNA’s criteria seems to eliminate any block under 7 from consideration.

Hi Mark! Great question, and one I get often. Roberta’s answer is absolutely correct. The data for each relationship includes AncestryDNA, 23andMe, Family Tree DNA, and GEDmatch data. For the GEDmatch data, the collection form specifically requires the threshold of 7 cM. I would like to eventually break down each relationship into the 4 different sources, and into endogamous versus non-endogamous, but of course that would only break down the given ranges into sub-ranges; it wouldn’t extend the range out any further.

Also, note that you must be very careful recommending that people drop the matching threshold. The best research we have so far shows that 33% of 5 cM segments are false segments, and that 67% of segments of 4 cM and smaller are false segments. So while you can lower the threshold and find matching, unless the segments are above 5 cM you can’t be reasonably sure that they’re even real segments.
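To make the threshold effect in this exchange concrete, here is a small Python sketch. The segment lengths are invented for illustration, and the function is a hypothetical simplification: it simply sums every segment at or above the chosen threshold, which is not a description of how any particular company or GEDmatch actually computes its totals.

```python
def total_shared_cm(segments, threshold=7.0):
    """Sum only the segments at or above the threshold (in cM)."""
    return sum(cm for cm in segments if cm >= threshold)

# Hypothetical match: one 8 cM segment plus several small ones.
segments = [8.0, 6.5, 6.0, 5.5, 5.0, 5.0]

for threshold in (7.0, 5.0, 3.0):
    print(threshold, total_shared_cm(segments, threshold))
```

The same pair of kits yields a different total at every threshold, so totals are only comparable with the chart when they are measured at the 7 cM threshold the collection form requires for GEDmatch data. Lowering the threshold mostly adds small segments, which by the figures quoted above are often false.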

Thanks to both of you and for the invaluable chart! Great research, Blaine.

We’ve only been using somewhat smaller segments to confirm a shared segment that first appeared at 7.8 cMs on chromosome 15. I like to see if the others in the surname project subgroup, all confirmed relations through Y-DNA testing and pedigrees, share part of the same segment to a lesser degree. It appears so.

Just a note to add. We tend to focus on cousins, but there are all these other relationships… I had been searching high and low for a first cousin of my mother – showing as anonymous. After she transitioned on 23andMe, it turned out she is a great-niece. The plus is, I guess, that my mother, 94, now knows the whereabouts of all her cousins (mostly, but not all, deceased). Since she has close to 100 great-nephews and great-nieces with whom she shares DNA, if she had decided to remain anonymous, I had no hope.