
Autosomal DNA testing has opened up a brave new world for genealogists. Along with that opportunity comes some amount of frustration and sometimes desperation to wring every possible tidbit of information out of autosomal results, sometimes resulting in pushing the envelope of what the technology and DNA can tell us.

I often have clients who want me to take a look at DNA results from people several generations removed from each other and try to determine if the ancestors are likely to be brothers, for example. While that’s fairly feasible in the first few generations, the further back in time one goes, the less reliably we can say much of anything about how DNA is transmitted. Hence, the less we can say, reliably, about relationships between people.

The best we can ever do is to talk in averages. It’s like a coin flip. Take a coin out right now and flip it 10 times. I just did, and did not get 5 heads and 5 tails, which the average would predict. But an average is the sum of a large number of outcomes divided by the number of events. That isn’t the same thing as saying that if you repeat the event 10 times you will get 5 heads and 5 tails, or the average. Each of those 10 flips is entirely independent, so you could have any of 11 different outcomes:

0 heads 10 tails

1 head 9 tails

2 heads 8 tails

3 heads 7 tails

4 heads 6 tails

5 heads 5 tails

6 heads 4 tails

7 heads 3 tails

8 heads 2 tails

9 heads 1 tail

10 heads 0 tails

What the average does say is that in the end, you are most likely to have an average of 5 heads and 5 tails – and the larger the series of events, the more likely you are to reach that average.

My 10 single event flips were 4 heads and 6 tails, clearly not the average. But if I did 10 series of coin flips, I bet my average would be close to 5 and 5 – and at 100 flips, the split is almost assured to be close to 50-50 – because the population, or number of events, has increased to the point where the proportion of heads almost certainly lands near the average.

You can see above, that while the average does indeed map to 5-5, or the 50-50 rule, the results of the individual flips are no respecter of that rule and are not connected to the final average outcome. For example, if one set of flips is entirely tails and one set of flips is entirely heads, the average is still 50/50 which is not at all reflective of the actual events.
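To put rough numbers on the coin-flip example, here is a minimal Python sketch, assuming a fair coin and independent flips. The function names are my own, chosen for illustration. It shows that an exact 5-5 split happens only about a quarter of the time in 10 flips, and that an exact 50-50 split in 100 flips is rarer still. What becomes close to certain with more flips is landing near the average, not on it.

```python
from math import comb


def prob_heads(flips: int, heads: int) -> float:
    """Probability of exactly `heads` heads in `flips` fair, independent flips."""
    return comb(flips, heads) / 2 ** flips


def prob_within(flips: int, low: int, high: int) -> float:
    """Probability that the number of heads falls between low and high, inclusive."""
    return sum(prob_heads(flips, k) for k in range(low, high + 1))


if __name__ == "__main__":
    # The 11 possible outcomes of 10 flips and how likely each one is.
    for k in range(11):
        print(f"{k:2d} heads {10 - k:2d} tails: {prob_heads(10, k):.3f}")

    # With 100 flips, hitting exactly 50 is uncommon, but landing near 50 is not.
    print(f"exactly 50 of 100: {prob_heads(100, 50):.3f}")
    print(f"40 to 60 of 100:   {prob_within(100, 40, 60):.3f}")
```

The 40 to 60 window is an arbitrary choice for "close to 50-50"; a narrower or wider window changes the number but not the pattern.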

And so it goes with inheritance too.

However, we have come to expect that the 50% rule applies most of the time. We know that it does, absolutely, with parents. We do receive 50% of our DNA from each parent, but which 50%? From there, it can vary, meaning that we don’t necessarily get 25% of each grandparent’s DNA. So while we receive 50% in total from each parent, we don’t necessarily receive every other segment or location, so it’s not like a riffle card shuffle where every other card is interspersed.

If one parent’s DNA sequence is:

TACGTACGTACG

A child cannot be presumed to receive every other allele, shown in red below.

TACGTACGTACG

The child could receive any portion of this particular segment, all of it, or none of it.

So, if you don’t receive every other allele from a parent, then how do you receive your DNA and how does that 50% division happen? The bottom line is that we don’t know, but we are learning. This article is the result of a learning experience.

Over time, genetic genealogists have come to expect that we are most likely to receive 25% of our DNA from each grandparent – which is statistically true when there are enough inheritance events. This reflects our expectation of the standard deviation, where about 2/3rds of the results will be within the closest 25% in either direction of the center. You can see expected standard deviation here.

This means that I would expect an inheritance frequency chart to look like this.

In this graph above, about half of the time, we inherit 50% of the DNA of any particular segment, and the rest of the time we inherit some different amount, with the most frequently inherited amounts being closer to the 50% mark and the outliers being increasingly rare as you approach 0% and 100% of a particular segment.

But does this predictability hold when we’re not talking about hundreds of events, when we’re not talking about population genetics, but about our own family genetics, meaning one transmission event, from parent to child? Because if that expected 50% factor doesn’t hold true, then that affects DRAMATICALLY what we can say about how related we are to someone 5 or 6 generations ago and how we can analyze individual chromosome data.

I have been uncomfortable with this situation for some time now, and the increasing incidence of anecdotal evidence has caused me to become increasingly more uncomfortable.

There are repeated anecdotal instances of significant segments that “hold” intact for many generations. Statistically, this should not happen. When this does happen, we, as genetic genealogists, consider ourselves lucky to be one of the 1% at the end of the spectrum, that genetic karma has smiled upon us. But is that true? Are we at the lucky 1% end of the spectrum?

This phenomenon is shown clearly in the Vannoy project where 5 cousins who descend from Elijah Vannoy born in 1786 share a very significant portion of chromosome 15. These people are all 5 generations or more removed from the common ancestor (approximately 4th cousins) and should share less than 1% of their DNA in total, and certainly no large, unbroken segments. As you can see, below, that’s not the case. We don’t know why or how some DNA clumps together like this and is transmitted in complete (or nearly complete) segments, but it obviously is. We often call these “sticky segments” for lack of a better term.

I downloaded this chromosome 15 information into a spreadsheet where I can sort it by chromosome. Below you can see the segments on chromosome 15 where these cousins match me.

Chromosome 15 is a total of 141 cM in length and has 17,269 SNPs. Therefore, at 5 generations removed, we would expect to see these people share a total of 4.4 cM and 540 SNPs, or less for those more distantly related. This would be under the matching threshold at either Family Tree DNA or 23andMe, so they would not be shown as matches at all. Clearly, this isn’t the case for these 5 cousins. This DNA held together and was passed intact for a total of 25 different individual inheritance events (5 cousins times 5 events, or generations, each). I wrote about this in the article titled “Why Are My Predicted Cousin Relationships Wrong?”
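The 4.4 cM and 540 SNP figures come from halving the chromosome 15 totals once per generation. A minimal Python sketch of that arithmetic, assuming a strict 50% pass-through at every generation (the very assumption this article questions), with a function name of my own choosing:

```python
def expected_share(total: float, generations: int) -> float:
    """Expected amount remaining if exactly half is passed on in each generation."""
    return total / 2 ** generations


CHR15_CM = 141       # length of chromosome 15 in cM, as quoted above
CHR15_SNPS = 17269   # SNPs on chromosome 15, as quoted above

if __name__ == "__main__":
    print(f"expected cM after 5 generations:   {expected_share(CHR15_CM, 5):.1f}")
    print(f"expected SNPs after 5 generations: {expected_share(CHR15_SNPS, 5):.0f}")
```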

Finally, I had a client who just would not accept no for an answer, wanted desperately to know the genetically projected relationship between two men who lived in the 1700s, and I felt an obligation to look into generational inheritance further.

About this same time, I had been working with my own matches at 23andMe. Two of my children have tested there as well, a son and a daughter, so all of my matches at 23andMe obviously match me, and may or may not match my children. This presented the perfect opportunity to study the amount of DNA transmitted in each inheritance event between me and both children.

Utilizing the reports at www.dnagedcom.com, I was able to download all of my matches into a spreadsheet, but then to also download all of the people on my match list that all of my matches match too.

I know, that was a tongue twister. Maybe an example will help.

I match John Doe. My match list looks like this and goes on for 353 lines.

I only match John Doe on one chromosome at one location. But finding who else on my match list of 353 people John Doe matches is important because it gives me clues as to who is related to whom and descends from the same ancestor. This is especially true if you recognize some of the people that your match matches, like your first cousin, for example. This suggests, below, that John Doe is related to me through the same ancestor as my first cousin, especially if John matches me with even more people who share that ancestor. If my cousin and I both match John Doe on the same segment, that is strongly suggestive that this segment comes from a common ancestor, like in the previous Vannoy example.

Therefore, I methodically went through and downloaded every single one of my matches’ matches (from my match list) to see who was also on their list, and built myself a large spreadsheet. That spreadsheet exercise is a topic for another article. The important thing about this process is that how much DNA each of my children match with John Doe tells me exactly how much of my DNA each of my children inherited from me, versus their father, in that segment of DNA.

In the above example, I match John Doe on Chromosome 11 from 37,000,000-63,000,000. Looking at the expected 50% inheritance, or normal distribution, both of my children should match John Doe at half of that. But look at what happened. Both of my children inherited almost exactly all of the same DNA that I had to give. Both of them inherited just slightly less in terms of genetic distance (cM) and also in terms of the number of SNPs.

It’s this type of information that has made me increasingly skeptical about the 50% bell curve standard deviation rule as applied to individual, not population, genetics. The bell curve, of course, implies that the 50th percentile is the most likely event to occur, with the 49th being next most likely, etc.

This does not seem to be holding true. In fact, in this one example alone, we have two examples of nearly 100% of the data being passed, not 50% in each inheritance event. This is the type of one-off anecdotal evidence that has been making me increasingly uncomfortable.

I wanted something more than anecdotal evidence. I copied all of the match information for myself and my children with my matches to one spreadsheet. There are two genetic measures that can be utilized, centimorgans (cM) or total SNPs. I am using cM for these examples unless I state otherwise.

In total, there were 594 inheritance events shown as matches between me and others, and those same others and my children.

Upon further analysis of those inheritance events, 6 of them were actually not inheritance events from me. In other words, those people matched me and my children on different chromosomes. This means that the matches to my children were not through me, but from their father’s side or were IBS, Identical by State.

This first chart is extremely interesting. Including all inheritance events, 55% of the time, my children received none of the DNA I had to give them. Whoa Nellie. That is not what I expected to see. They “should have” received half of my DNA, but instead, half of the time, they received none.

The balance of the time, they received some of my DNA 23% of the time and all of my DNA 21% of the time. That also is not what I expected to see.

Furthermore, there is only one inheritance event in which one of my children actually inherited exactly half of what I had to offer, so significantly less than 1% at .1%. In other words, what we expected to see actually happened the least often and was vanishingly rare when not looking at averages but at actual inheritance events.

Let’s talk about that “none” figure for a minute. In this case, none isn’t really accurate, but I can’t be more accurate. None means that 23andMe showed no match. Their threshold for matching is 7cM (genetic distance) and 700 SNPs for the first matching segment, and then 5cM and 700 SNPs for secondary matching segments. However, if you have over 1000 matches, which I do, matches begin to “fall off,” the smallest ones first, so you can’t tell what the functional match threshold is for you or for the people you match. We can only guess, based on their published thresholds.

So let’s look at this another way.

Of the 329 times that my children received none of my DNA, 105 of those transmissions would be expected to be under the 7cM threshold, based on a 50% calculation of how many cMs I matched with the individual. However, not all of those expected events were actually under the threshold, and many transmissions that were not expected to be under that threshold, were. Therefore, 224, or 68%, of those “none” events were not expected if you look at how much of my DNA the child would be expected to inherit at 50%.
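The reasoning behind the 224 figure can be sketched in a few lines of Python. This assumes the 7 cM first-segment threshold quoted earlier and a flat 50% expectation for each child; the function names are mine, and this is not how 23andMe computes matches.

```python
FIRST_SEGMENT_THRESHOLD_CM = 7.0  # published first-segment threshold quoted above


def expected_child_cm(parent_cm: float) -> float:
    """Amount the child would share if exactly half of the parent's segment passed."""
    return parent_cm * 0.5


def expected_to_match(parent_cm: float,
                      threshold_cm: float = FIRST_SEGMENT_THRESHOLD_CM) -> bool:
    """True if the 50% expectation would still clear the matching threshold."""
    return expected_child_cm(parent_cm) >= threshold_cm


# Counts taken from the text above.
none_events = 329               # events where a child showed no match
expected_under_threshold = 105  # of those, events where 50% would fall under 7 cM
unexpected_none = none_events - expected_under_threshold
unexpected_share = unexpected_none / none_events

if __name__ == "__main__":
    print(f"unexpected 'none' events: {unexpected_none} ({unexpected_share:.0%})")
```

For example, a 15.2 cM match with the parent gives a 7.6 cM expectation for the child, which would still show as a match, while a 10 cM parental match gives 5 cM, which would not.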

Another very interesting anomaly that pops right up is the number of cases where my children inherited more than I had to give them. In the example below, you can see that I match Jane Doe with 15.2cM and 2859 SNPs, but my daughter matches Jane with 16.3cM and 2960 on the same chromosome.

There are a few possibilities to explain this:

My daughter also matches this person on her father’s side at this transition point.

My daughter matches this person IBS at this point.

The 23andMe matching software is trying to compensate for misreads.

There are misreads or no calls in my file.

There of course may be a combination of several of these factors, but the most likely is the fact that she is IBS at this location and the matching software is trying to be generous to compensate for possible no-calls and misreads. I suggest this because they are almost uniformly very small amounts.

Therefore when my children match me at 100% or greater, I simply counted it as an exact match. I was surprised at how many of these instances there were. Most were just slightly over the value of 2 in the “times expected” column. To explain how this column functions, a value of 1 is the expected amount – or 50% of my DNA. A value of 2 means that the child inherited all of the DNA I had to offer in that location. Any value over 2 means that one or more of the bulleted possibilities above occurred.
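The “times expected” column can be reproduced with a short Python sketch. The function names and the none/some/all labels as code are my own illustration of the spreadsheet logic described above, not the spreadsheet itself.

```python
def times_expected(parent_cm: float, child_cm: float) -> float:
    """Ratio of what the child shares to the 50% expectation (1 = half, 2 = all)."""
    return child_cm / (parent_cm * 0.5)


def classify(parent_cm: float, child_cm: float) -> str:
    """Label an inheritance event as 'none', 'some' or 'all'."""
    if child_cm <= 0:
        return "none"
    # Values over 2 (child shares more than the parent had) are normalized to 2.
    ratio = min(times_expected(parent_cm, child_cm), 2.0)
    return "all" if ratio >= 2.0 else "some"


if __name__ == "__main__":
    # Jane Doe example from above: I match at 15.2 cM, my daughter at 16.3 cM.
    print(f"times expected: {times_expected(15.2, 16.3):.2f}")
    print(f"classification: {classify(15.2, 16.3)}")
```

A child who shows no match is entered here as 0 cM, which is a simplification, since “none” really means “below the matching threshold.”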

Between both of my children, there were a total of 75, or 60% with values greater than 2 on cMs and 96, or 80%, on SNPs, meaning that my children matched those people on more DNA at that location than I had to offer. The range was from 2 to 2.4 with the exception of one match that was at 3.7. That one could well be a valid transition (other parent) match.

There has been a lot of discussion recently about X chromosome inheritance. In this case, the X would be like any other chromosome, since I have two Xs to recombine and give to my children, so I did not remove X matches from these calculations. The X is shown as chromosome 99 here and 23 on the graphs to enable correct column sorting/graphing.

In the chart below, inheritance events are charted by chromosome. The “Total” columns are the combined events of both my son and daughter. The blue and pink columns are the inheritance events for each of them individually, which add up to the total, of course.

The “none” column reflects transmissions on that chromosome where my children received none of my DNA. The “some” column reflects transmission events where my children received some portion of my DNA between 0 (none) and 100% (all). The “all” column reflects events where my children received all of the DNA that I had to offer.

I graphed these events.

The graph shows the total inheritance events between both of my children by chromosome. Number 23 in these charts is the X chromosome.

These inheritance numbers cause me to wonder what is going on with chromosome 5 in the case of both my daughter and son, and also chromosome 6 with my son. I wonder if this would be uniform across families relative to chromosome 5, or if it is simply an anomaly within my family inheritance events. It seems odd that the same anomaly would occur with both children.

What this shows is that we are not dealing with a distribution curve where the majority of the events are at the 50% level and those that are not are progressively nearer to the 50% level than either end. In other words, the Expected Inheritance Frequency is not what was found.

The actual curve, based on the inheritance events observed here, is shown below, where every event that was over the value of 2, or 100%, was normalized to 2. This graph is dramatically different than the expected frequency, above.

Looking at this, it becomes immediately evident that we inherit either all or nothing of our parent’s DNA segments 85% of the time, and only about 15% of the time do we inherit only a portion of our parent’s DNA segments. Very, very rarely is the portion we inherit actually 50%, one tenth of one percent of the time.

Now that we understand that individual generational inheritance is not a 50-50 bell curve event, what does this mean to us as genetic genealogists?

I asked fellow genetic genealogist Dr. David Pike, a mathematician, to look this over and he offered the following commentary:

“As relationships get more distant, the number of blocks of DNA that are likely to be shared diminishes greatly. Once down to one block, then really there are three outcomes for subsequent inheritance: either the block is passed intact, no part of it is passed on, or recombination happens and a portion of it is passed on. If we ignore this recombination effect (which should rarely affect a small block) then the block is either passed on in an “all or nothing” manner. There’s essentially no middle ground with small blocks and even with lots of examples it doesn’t really make sense to expect an average of 50%. As an analogy, consider the human population: with about half of us being female and about half of us being male, the “average” person should therefore be androgynous, and yet very few people are indeed androgynous.”

In other words, even if you do have a segment that is 10 cMs in length, it’s not 10 coin flips, it’s one coin flip and it’s going to either be all, nothing or a portion thereof, and at 85% versus 15% it’s nearly 6 times more likely to be all or nothing than to be a partial inheritance.

So how do we resolve the fact that when we are looking at the 700,000 or so locations tested at Family Tree DNA and the 600,000 locations tested at 23andMe, we can in fact use the averages to predict relationships, at least in closely related individuals, but we can’t utilize that same methodology in these types of individual situations? There are many inheritance events being taken into consideration, 600,000 – 700,000, an amount that is mathematically high enough to overcome the individual inheritance issues. In other words, at this level, we can utilize averages. However, when we move past the larger population model, the population model simply doesn’t fit anymore for individual event inheritance – in other words, looking at individual segments.

Dr. Pike was kind enough to explain this in mathematical terms, but ones that the rest of us can understand:

“I think that part of what is at stake is the distinction between continuous versus discrete events. These are mathematical terms, so to illustrate with an example, the number line from 0 to 10 is continuous and includes *all* numbers between them, such as 2.55, pi, etc. A discrete model, however, would involve only a finite number of elements, such as just the eleven integers from 0 to 10 inclusive. In the discrete model there is nothing “in between” consecutive elements (such as 3 and 4), whereas in the continuous model there are infinitely many elements between them.

It’s not unlike comparing a whole spectrum against a finite handful of a few options. In some cases the distinction is easily blurred, such as if you conduct a survey and ask people to rate a politician on a discrete scale of 0 to 10… in this case it makes intuitive sense to say that the politician’s average rating was 7.32 (for example) even though 7.32 was not one of the options within the discrete scale.

In the realm of DNA, suppose that cousins Alice and Bob share 9 blocks of DNA with each other and we ask how many blocks Alice is likely to share with Bob’s unborn son. The answer is discrete, and with each block having a roughly 50/50 chance we expect that there will likely be 4 or 5 blocks shared by Alice and Bob Jr., although the randomness of it could result in anywhere from 0 to 9 of the blocks being shared. Although it doesn’t make practical sense to say that “four and a half” blocks will likely be shared [well, unless we allow recombination to split a block and thereby produce a shared “half block”], there is still some intuitive comfort in saying that 4.5 is the average of what we would expect, but in reality, either 4 or 5 blocks are shared.

But when we get to the extreme situation of there being only 1 block, for which the discrete options are only 0 or 1 block shared, yes or no, our comfortable familiarity with the continuous model fails us. There are lots of analogies here, such as what is the average of a coin toss, what is the average answer to a True/False question, what is the average gender of the population, etc.

Discrete models with lots of options can serve as good approximations of continuous situations, and vice-versa, which is probably part of what’s to blame for confusion here.

Really DNA inheritance is discrete, but with very many possible segments [such as if we divided the genome up into 10 cM segments and asked how many of Alice’s paternal segments will be inherited by one of her children], we can get away with a continuous model and essentially say that the answer is roughly 50%. Really though, if there are 3000 of these blocks, the actual answer is one of the integers: 0, 1, 2, …, 2999, 3000. The reality is discrete even though we like the continuous model for predicting it.

However, discrete situations with very few options simply cannot be modelled continuously.”
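Dr. Pike’s Alice and Bob example can be worked through with a small Python sketch. It assumes each block is passed on independently with a 50/50 chance and ignores recombination splitting a block; the function name is my own.

```python
from math import comb


def shared_block_distribution(blocks: int, p: float = 0.5) -> dict:
    """Probability of each possible count of blocks passed on, out of `blocks`."""
    return {
        k: comb(blocks, k) * p ** k * (1 - p) ** (blocks - k)
        for k in range(blocks + 1)
    }


if __name__ == "__main__":
    nine = shared_block_distribution(9)
    average = sum(k * prob for k, prob in nine.items())
    print(f"average blocks shared: {average}")          # 4.5 is never an actual outcome
    print(f"chance of 4 or 5 blocks: {nine[4] + nine[5]:.3f}")

    # With a single block there is no middle ground at all.
    print(shared_block_distribution(1))
```

With 9 blocks the average of 4.5 is at least close to the likely outcomes of 4 or 5. With 1 block the average of 0.5 describes neither of the two things that can actually happen.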

Back to our situation where we are attempting to determine a relationship of 2 men born in the 1700s whose descendants share fragments of DNA today. When we see a particularly large fragment of DNA, we can’t make any assumptions about age or how long it has been in existence by “reverse engineering” its path to a common ancestor by doubling the amount of DNA in every generation. In other words, based on the evidence we see above, it has most likely been passed entirely intact, not divided. In the case of the Vannoy DNA, it looks like the ends have been shaved a few times, but the majority of the segment was passed entirely intact. In fact, you can’t double the DNA inherited by each individual 5 times, because in at least one case, Buster, doubling his total matching cM, 100, even once would yield a number of cM greater than the size of chromosome 15 at 141 cM.

Conversely, when we see no DNA matches, for example, in people who “should be” distant cousins, we can’t draw any conclusions about that either. If the DNA didn’t get passed in the first generation – and according to the numbers we just saw – 58% doesn’t get passed at all, and 26% gets passed in its entirety, leaving only about 15% to receive some portion of one parent’s DNA, which is uniformly NOT 50% except for one instance in almost 1000 events (.1%) – then all bets for subsequent generations are off – they can’t inherit their half if their half is already gone or wasn’t half to begin with.

If I’m reading this right, a 10 cM block has a 10% chance of being split into parts during the recombination process of a single conception. Although 10% is not completely negligible, it’s small enough that we can essentially consider “all” or “nothing” as the two dominant outcomes.
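The 10% figure follows from the definition of the centimorgan: over a short stretch, 1 cM corresponds to roughly a 1% chance of a crossover in a single transmission. Here is a minimal Python sketch under that assumption. The Poisson variant assumes crossovers occur without interference, which is a standard textbook simplification and not something measured in this study. The even all/none split assumes the block sits on one of the parent’s two copies of the chromosome and that either copy is equally likely to be passed on. All names are my own.

```python
from math import exp


def split_probability(block_cm: float, poisson: bool = False) -> float:
    """Chance that at least one crossover falls inside the block in one transmission.

    The default is the linear rule of thumb (1 cM is about 1%), which is only
    reasonable for small blocks. With poisson=True, crossovers are modelled as
    a Poisson process with no interference.
    """
    morgans = block_cm / 100.0
    return 1 - exp(-morgans) if poisson else morgans


def outcome_probabilities(block_cm: float) -> dict:
    """Rough all / none / some probabilities for one small block in one transmission."""
    split = split_probability(block_cm)
    intact = (1 - split) / 2  # unsplit block: passed whole or not at all, 50/50
    return {"all": intact, "none": intact, "some": split}


if __name__ == "__main__":
    print(f"linear estimate for 10 cM:  {split_probability(10):.3f}")
    print(f"Poisson estimate for 10 cM: {split_probability(10, poisson=True):.3f}")
    print(outcome_probabilities(10))
```

Under these assumptions a 10 cM block comes out at roughly 45% all, 45% none and 10% some, which is the “all or nothing” pattern described above.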

This is the fundamental underlying reason why testing companies are hesitant to predict specific relationships – they typically predict ranges of relationships – 1st to 3rd cousin, for example, based on a combination of averages – of the percentages of DNA shared, the number of segments, the size of segments, the number of SNPs etc. The testing company, of course, can have no knowledge of how our individual DNA is or was actually passed, meaning how much ancestral DNA we do or don’t receive, so they must rely on those averages, which are very reliable as a continuous population model, and apparently, much less so as discrete individual events.

I would suggest that while we certainly have a large enough sample of inheritance events between me and my two children to be statistically relevant, it’s not a large enough study to draw any broad sweeping conclusions. It is, after all, only 3 people and we don’t know how this data might hold up compared to a much larger sample of family inheritance events. I’d like to see 100 or 1000 of these types of studies.

I would be very interested to see how this information holds up for anyone else who would be willing to do the same type of information download of their data for parent/multiple sibling inheritance. I will gladly make my spreadsheet with the calculations available as a template to anyone who wants to do the same type of study.

I wonder if we would see certain chromosomes that always have higher or lower generational inheritance factors, like the “none” spike we see on chromosome 5. I wonder if we would see a consistent pattern of male or female children inheriting more or less (all or none) from their parents. I wonder what other kinds of information would reveal itself in a larger study, and if it would enable us to “weight” match information by chromosome or chromosome/gender, further refining our ability to understand our genetic relationships and to more accurately predict relationships.

I want to thank Dr. David Pike for reviewing and assisting with this article and in particular, for being infinitely patient and making the application of the math to genetics understandable for non-mathematicians. If you would like to see an example of Dr. Pike’s professional work, here is one of his papers. You can find his personal web page here and his wonderful DNA analysis tools here.

Averages. We all know what that means, conceptually. You add a group of numbers together and divide by how many numbers you added together. For example, 9 number locations that have a value of 10 each total 90. If you divide 90 by the number of number locations, 9, you get 10 as the average. Of course, that’s a very simple example, but the concept applies no matter how many number locations or how big or small the numbers.

Often, we don’t grasp a good working knowledge of how to apply that math concept as it relates to our DNA results.

What I’m referring to here is the TIP calculator provided by Family Tree DNA, but this concept applies equally well to any TMRCA (Time to Most Recent Common Ancestor) calculation, regardless of who is calculating it. The underpinnings are, by necessity, the same.

At Family Tree DNA, the TIP calculator, the little orange button above, is available to you to compare Y-line results to matches and it will give you a rough idea of how long ago you can expect to have a common ancestor.

One of the most common questions I receive reads something like this:

“The TIP calculator says that we should be related at 99% within 12 generations, but my genealogy shows that it should be 8 generations. What is wrong?”

Or something like this: “The TIP calculator says we are related, but I have no idea how to interpret any of these numbers.”

The answer is that nothing is wrong and these are ranges of possibilities, based on average mutation rates of individual markers. Having said that, we know absolutely that mutations are random events. You can see this demonstrated in the Estes project where Abraham Estes (born 1647) who had 12 sons produced one line who has several people with no mutations as compared to Abraham, and another descendant whose line from another son has 8 mutations in the same timeframe. Now it’s obvious that both of these are on the outer bands of the spectrum, and the average is 4, which really is not reflective of either of these lines, but is dead center accurate for two of Abraham’s other sons’ lines.

Recently, I was working with the Nemaha Half-Breed Allottee, a list of names of mixed European/Native American individuals who received individual land allotments in 1860 in Nebraska from the government as a result of an 1830 treaty. When analyzing the 365 people who had European names, I realized that this is the perfect example of averages and how they do, and don’t, work. So let’s visit the Nemaha for a minute.

There are 122 different surnames represented, and the average then is that 2.99 people should carry each surname. 365 divided by 122=2.99. So let’s say 3 people, as it’s very close.

In reality, here’s how the surname distribution breaks down.

Number of People Carrying Surname | Number of Surnames
1 | 54
2 | 18
3 | 10
4 | 12
5 | 8
6 | 6
7 | 4
8 | 3
9 | 2
10 | 0
11 | 1
12 | 0
13 | 0
14 | 0
15 | 1
16 | 0
17 | 0
18 | 1

You can see that only 10 surnames actually have 3 people who carry them, for a total of 30 people, or about 12%. For the remainder, 90 people carry surnames held by fewer than 3 people, for a total of 25%, and 63% of the people carry surnames held by more than 3 people.

Stated a little differently, this average is accurate for 12% of the people, and inaccurate for 88%. It is close for many. About 23% fall directly on either side, meaning 2 people or 4 people carry that surname.
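The gap between the average and the actual distribution can be checked with a short Python sketch built from the surname counts listed above. One caveat: the counts as transcribed here sum to slightly fewer surnames and people than the 122 and 365 quoted, so shares computed from them will differ somewhat from the percentages in the text. The variable and function names are mine.

```python
# people carrying the surname -> number of surnames, as listed above (zero rows omitted)
surname_counts = {1: 54, 2: 18, 3: 10, 4: 12, 5: 8, 6: 6,
                  7: 4, 8: 3, 9: 2, 11: 1, 15: 1, 18: 1}

total_surnames = sum(surname_counts.values())
total_people = sum(size * n for size, n in surname_counts.items())
mean_per_surname = total_people / total_surnames


def people_with_surname_size(size: int) -> int:
    """Number of people whose surname is carried by exactly `size` people."""
    return size * surname_counts.get(size, 0)


if __name__ == "__main__":
    print(f"surnames: {total_surnames}, people: {total_people}")
    print(f"average people per surname: {mean_per_surname:.2f}")
    at_average = people_with_surname_size(3)
    print(f"people sitting exactly on the average: {at_average} "
          f"({at_average / total_people:.0%})")
    near = people_with_surname_size(2) + people_with_surname_size(4)
    print(f"people one step either side: {near} ({near / total_people:.0%})")
```

Either way the point stands: the average of about 3 people per surname describes only a small minority of the people on the list.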

So what is the message here? Averaging tools, TIP included, do the best with what they have, which includes results at both ends of the spectrum. In this case, it includes the 54 surnames with only one person each, and the 3 surnames that each have over 10 people, 11, 15 and 18, totaling 44 people. If these people were trying to make sense of these averages, 3 people per surname, these numbers would be totally irrelevant to them.

So the lesson here is to use these tools as a guideline, and nothing more. You could be in the middle and these tools could apply to your family exactly, or you could be in the family who has 18 people carrying one surname instead of the “average” of 3.

This reminds me very much of the “one size fits all” nightshirt that got passed around for some years at home when I was a kid. “One size fits all” really meant “fits no one” and translated into “no one was happy.” Of course, if you don’t understand the meaning of “one size fits all” and averages, you might be happy and think you have an answer that you don’t.