
Wednesday, April 15, 2015

For years, genealogists have been able to use Y-DNA to validate paternal pedigrees and sort surnames into family groups. This has been a great advantage for the world of genealogy, but it has been restricted to men and paternal lines. Autosomal DNA is more inclusive. Both women and men can take this test and it illuminates the entire family tree as opposed to just the male line. For those of us who have taken an autosomal test, there are a number of tools that help find cousin matches. When we find multiple cousins matching the same chunk of DNA, we reach out to our new cousins and attempt to find a common ancestor in our trees. Many times this is unsuccessful due to incomplete trees. This is what is called a bottom-up approach.

What if we used a top-down approach? What if we started with your 10th great-grandmother? You’d say autosomal DNA can’t go back that far. That’s 12 generations ago and the DNA would be diluted to less than 1% of the original amount. If autosomal DNA behaved mathematically, you’d be correct. Autosomal DNA behaves more like Legos. When we inherit DNA from our parents, it’s true that we get 50% from mom and 50% from dad. That’s where the fairness ends. When we look at what we inherit from our grandparents (through our parents), it is never 50/50.

Instead, what we get from our grandparents is a random split. In the case of the illustration above, this grandchild received a 54/46 split. This is not uncommon. See this Slate article.

Our chromosomes behave like building blocks. There is a tendency for genes located closely on a chromosome to be inherited together in a block. This is called gene linkage. There is no set size for these blocks; size is completely based on the genes that tend to stay together. Segments around the 2 cM (centiMorgan) size have been found consistently (American Journal of Human Genetics). The DNA we get from our grandparents comes to us in large contiguous sections of hundreds of these blocks. From generation to generation, the large sections are inherited randomly and unfairly, but the building blocks have a tendency to stay intact and not recombine. With each generation, there is a 50% chance of inheriting or not inheriting a specific block.
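As a rough illustration of that last point, the arithmetic can be sketched in Python. The function name is my own, and the sketch assumes each parent-to-child transmission is an independent 50/50 event along a single line of descent. It ignores recombination inside a block and the fact that a block can arrive through more than one line.

```python
def block_survival_probability(generations: int) -> float:
    """Chance that one specific block is passed down an unbroken line of
    `generations` parent-to-child transmissions, at 50% per transmission."""
    if generations < 0:
        raise ValueError("generations must be non-negative")
    return 0.5 ** generations


# A 10th great-grandparent is 12 generations back.
p = block_survival_probability(12)
print(f"1 in {round(1 / p):,}")  # 1 in 4,096
```

The odds for any single block along any single line are long, which is why a large pool of descendants is needed before shared blocks start to show up.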

It’s possible that these 2 cM building blocks are about 25 generations old. So, when we start with our 10th great-grandparents, they have lots of these blocks that they inherited from their parents and gave to their children. What we can expect is that their descendants will have an assortment of these blocks from them and other ancestors. When we examine the autosomal DNA for two dozen of their descendants, we find a set of genetic blocks in common. No one descendant will have all the available genetic blocks an ancestor has left in the gene pool. We may find five descendants sharing a block on chromosome 1 and seven descendants sharing a block on chromosome 12. With DNA samples from two dozen descendants, about 15 ancestral genetic blocks can be identified. All of the ancestral genetic blocks taken together uniquely identify your 10th great-grandparents as a couple. Only their descendants would have this genetic block combination. (Except in the situation where one set of siblings marries another set of siblings from a different family.)

When we take the process a step further and analyze the next generation, we start to build a genetic family tree.

The table above shows the genetic blocks identified for Stephen Hopkins and each of his children that had descendants. For simplicity, only one individual is listed for each column. Remember that each column of genetic blocks actually represents a married couple: Constance Hopkins and Nicholas Snow, Deborah Hopkins and Andrew Ring, etc. Each genetic block has a chromosome number and start and end locations. Blocks in green represent inherited blocks from Stephen to his children. As we build a genetic family tree, it now becomes possible to take a DNA sample from a living individual and match with Stephen Hopkins. Once a match with Stephen is found, matches to his children can be checked to see which child the sample descends from. Generations can be added to the genetic tree until known descendant DNA data has been exhausted. In the Hopkins family, I was able to extend Constance’s line by a generation to Mary Snow and then to Mary’s daughter, Mary Paine, before the data ran out.

Similar to Y-DNA, these sets of genetic blocks (autosomal haplotype) can be used to identify genealogical relationships and sometimes the lack of relationships. John Hopkins of Connecticut has often been connected as a son of Stephen Hopkins. When we generate the autosomal haplotype for John and compare it to Stephen, we can see that there is no relation across the board.

The red blocks indicate John’s DNA segments that have no corresponding segments with Stephen. The yellow blocks indicate a similar chromosome location, but no genetic match. Y-DNA gives us the ability to use DNA to see how brothers are potentially connected. Now autosomal DNA gives us the ability to see how brothers and sisters are potentially connected.

The autosomal haplotyping process is not a silver bullet that will solve all of our genealogy problems. It will add to our toolkit as we validate family trees, work through brick walls and attempt to solve genealogy mysteries.

Reference:

Maglio, MR (2015) Autosomal Haplotypes and the Genetic Reconstruction of Family Trees (Link)

Thursday, March 19, 2015

Autosomal DNA segment matching is a complex issue. Through testing and observation, it is obvious that some segment matches are false positives. Computer algorithms will detect any matching allele with no knowledge of whether the allele is of paternal or maternal origin.

If we said that the left columns are from the father’s side and the right from the mother’s, we would see that none of the columns match. Obviously, we can’t just draw a line down the middle and say one side is the mother’s DNA. To determine which DNA came from mom and which came from dad, the autosomal results would need to be phased. To phase the results of an autosomal sample, it must be compared to at least one parent’s result. By difference, the child’s result can be split into its paternal and maternal contributions.
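To make "by difference" concrete, here is a minimal Python sketch of phasing one SNP at a time against a single tested parent. The function name and the toy genotypes are my own illustration, not the output of any testing company's tool. Real phasing software also has to deal with no-calls and genotyping errors, which this sketch simply reports as unresolved.

```python
from typing import List, Optional, Tuple

Genotype = Tuple[str, str]  # unordered pair of alleles, e.g. ("A", "G")


def phase_with_one_parent(child: Genotype, parent: Genotype) -> Optional[Tuple[str, str]]:
    """Return (allele from the tested parent, allele from the other parent),
    or None when this SNP cannot be resolved from one parent alone."""
    a, b = child
    if a == b:
        # Homozygous child: both parents gave the same allele.
        return (a, b) if a in parent else None
    a_possible = a in parent
    b_possible = b in parent
    if a_possible and not b_possible:
        return (a, b)
    if b_possible and not a_possible:
        return (b, a)
    # Parent carries both alleles (ambiguous) or neither (inconsistent data).
    return None


# Toy data, four SNPs in the same order for child and father.
child: List[Genotype] = [("A", "A"), ("A", "G"), ("C", "T"), ("C", "T")]
father: List[Genotype] = [("A", "G"), ("G", "G"), ("C", "C"), ("C", "T")]

phased = [phase_with_one_parent(c, f) for c, f in zip(child, father)]
print(phased)
# [('A', 'A'), ('G', 'A'), ('C', 'T'), None]
```

The last SNP stays unresolved because both child and father are heterozygous for the same two alleles. A result from the second parent would settle it.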

If it were possible to phase every sample to be matched, false positives by computer algorithm would be eliminated. Unfortunately, phasing every sample is not always possible. A person’s parents may be deceased or even unknown.

Another
method of reducing or eliminating false positives is to triangulate each
matching segment. If a segment from autosomal
sample A matches the corresponding segment from sample B and sample B
matches sample C and sample C matches the original sample A, then the segment
is considered triangulated and identical by descent. How confident are we that the triangulated matches
aren’t just a circular series of false positives?
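The A-B, B-C, C-A check can be sketched in a few lines of Python. The data layout (a dictionary keyed by pairs of kit names, with segments as chromosome, start and end positions) and the function names are assumptions of mine for illustration. The sketch only tests that the three pairwise matches overlap in position. It does not compare the underlying alleles.

```python
from itertools import combinations, product
from typing import Dict, FrozenSet, List, Optional, Tuple

Segment = Tuple[int, int, int]  # (chromosome, start position, end position)
Matches = Dict[FrozenSet[str], List[Segment]]


def common_overlap(segments: List[Segment]) -> Optional[Segment]:
    """Region shared by all segments, or None if they do not all overlap."""
    chromosomes = {chrom for chrom, _, _ in segments}
    if len(chromosomes) != 1:
        return None
    start = max(s for _, s, _ in segments)
    end = min(e for _, _, e in segments)
    return (chromosomes.pop(), start, end) if start < end else None


def triangulate(kits: Tuple[str, str, str], matches: Matches) -> List[Segment]:
    """Regions where every pair among the three kits has a matching segment."""
    pairs = [frozenset(pair) for pair in combinations(kits, 2)]
    if any(pair not in matches for pair in pairs):
        return []
    regions = []
    for trio in product(*(matches[pair] for pair in pairs)):
        region = common_overlap(list(trio))
        if region is not None:
            regions.append(region)
    return regions


# Hypothetical pairwise matches on chromosome 3.
matches: Matches = {
    frozenset({"A", "B"}): [(3, 1_000_000, 3_400_000)],
    frozenset({"B", "C"}): [(3, 1_200_000, 3_600_000)],
    frozenset({"A", "C"}): [(3, 900_000, 3_300_000)],
}
print(triangulate(("A", "B", "C"), matches))
# [(3, 1200000, 3300000)]
```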

Let’s look at a segment on chromosome 3 that starts at rs6796502 and is 2.5 cM and 946 SNPs. For this exercise, any chromosome segment could be used.

Table 1. Allele frequencies of 20 loci on chromosome 3.

On that segment, there are 20 published locations with allele frequencies (NCBI). Table 1 shows how often a certain allele combination (AA, AC, AG, etc.) appears for a European population. Based on allele frequency, the most common combination of alleles in this section of chromosome 3 for a population of European descent is listed in Table 2. I have artificially selected the most common combination to simulate a large portion of the population with European descent. About 1 in 3,400, or about 300,000 people, should have this combination.

Table 2. Predicted allele combination.

Imagine for a moment that you roll six dice. The first die comes up with a one and the second is a two and so on. The probability of rolling a one on the first die is 1/6 (one side up on a six-sided die). The probability of rolling a one and then a two is 1/6 times 1/6 or 1/36. On average, it will happen once every 36 rolls. The combination illustrated on six dice would happen once in every 46,656 rolls. Now imagine that is your DNA and we are looking for a match. The other person would need one through six in the same order. To calculate that probability we multiply 46,656 by 46,656 and get 1 in 2,176,782,336. DNA matching actually has a better probability of matching.
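The dice arithmetic is easy to check with exact fractions. This short Python sketch uses only the standard library, and the variable names are mine.

```python
from fractions import Fraction

# One through six, in order, on six fair dice.
one_sequence = Fraction(1, 6) ** 6
# Two independent rolls of six dice both showing that sequence.
both_sequences = one_sequence ** 2

print(one_sequence)    # 1/46656
print(both_sequences)  # 1/2176782336
```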

Table 3 lists the most common alleles again along with potential alleles that would generate a half match and the corresponding summed frequency. The probability of the set of 20 potential combinations existing is equal to the product of the frequencies, 0.759. This probability has to be extrapolated from 20 loci to 946, giving us 2.45 x 10^-6 or 1 in 400,000. There is a 1 in 400,000 chance of a completely random match on this section of chromosome 3 for the alleles with the highest frequency. It is well within reason to expect false positives for this one-to-one match.
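One plausible way to do that extrapolation is to assume the 20 published loci are representative of all 946 SNPs and that each SNP contributes the same average factor. The Python sketch below follows that assumption. The function name is hypothetical, and treating the SNPs as independent ignores any correlation between neighboring SNPs.

```python
def extrapolate_match_probability(p_sample: float, n_sample: int, n_total: int) -> float:
    """Scale a probability measured over n_sample loci to n_total loci,
    assuming every locus contributes the same average factor."""
    per_locus = p_sample ** (1.0 / n_sample)
    return per_locus ** n_total


p_segment = extrapolate_match_probability(0.759, 20, 946)
print(f"{p_segment:.2e}  (about 1 in {1 / p_segment:,.0f})")
```

With the rounded 0.759 as input, this lands on the order of 2 x 10^-6, in line with the figure quoted above. A small change in the third decimal place of the 20-locus product moves the result noticeably, so the unrounded product will give a slightly different value.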

Table 3. Probability of a half match within a European population.

In the event of a three-way match (triangulation), we multiply by 2.45 x 10^-6 again, giving us a probability of 1 in 167 billion. Now we are outside of what is statistically reasonable.

The
most common set of European alleles doesn't produce the highest probability of
a random match. When the alleles are not
the same (AC, AG, CT etc.), there is a higher chance of an autosomal half
match. Table 4 shows an actual set of
alleles and the corresponding set of alleles to generate a half match.
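Why a mixed genotype half matches more easily can be seen with a toy calculation. The Python sketch below assumes a SNP with only two alleles and Hardy-Weinberg genotype proportions, which is a simplification of mine. The tables in this post use published genotype frequencies instead, and the allele frequency of 0.8 is made up.

```python
def half_match_probability(genotype: str, alleles: str, freq_first: float) -> float:
    """Chance that a random person shares at least one allele with `genotype`
    at a two-allele SNP, under assumed Hardy-Weinberg proportions."""
    first, second = alleles
    p, q = freq_first, 1.0 - freq_first
    population = {
        first + first: p * p,
        first + second: 2 * p * q,
        second + second: q * q,
    }
    # A half match needs at least one allele in common.
    return sum(freq for g, freq in population.items() if set(g) & set(genotype))


print(f"{half_match_probability('AA', 'AC', 0.8):.2f}")  # 0.96
print(f"{half_match_probability('AC', 'AC', 0.8):.2f}")  # 1.00
```

At a two-allele SNP, a person who is AC half matches everyone, while a person who is AA fails to match anyone who is CC.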

Table 4. Probability of a half match within a European population using actual sample.

This actual sample takes us from a false positive probability of 1 in 400,000 to 1 in 5,900 (0.000169). A probability of 1 in 5,900 indicates that we should be seeing completely random matches that have no genetic relationship on a regular basis. Considering a population of about 1.6 million autosomal tests taken, each of us would have about 270 false positive matches on a segment similar to the one shown.

Triangulated matches exist for this segment of chromosome 3. For the probability of this triangulated segment, we multiply by 0.000169 again, giving us 2.87 x 10^-8 or about 1 in 35 million. Considering the number of results available for matching (about 1.6 million), it is not realistic that we are matching randomly. In fact, most triangulated matches involve more than three test results. If four test results are triangulated, the probability goes to 1 in 205 billion. These probabilities indicate that triangulated results cannot be random and are matching due to common genetic descent.
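The figures for this segment can be reproduced with a short Python sketch. It follows the same simplification as the text, multiplying the one-to-one odds once for each additional kit, which treats the extra matches as independent. The function name is my own.

```python
def chained_random_match_odds(one_in: int, kits: int) -> int:
    """Odds, as 'one in N', that `kits` results all match by chance,
    multiplying the one-to-one odds once per additional kit."""
    if kits < 2:
        raise ValueError("need at least two kits")
    return one_in ** (kits - 1)


database_size = 1_600_000  # autosomal tests taken
one_to_one = 5_900         # 1 in 5,900, about 0.000169

print(f"{database_size / one_to_one:.0f}")                       # 271
print(f"1 in {chained_random_match_odds(one_to_one, 3):,}")      # 1 in 34,810,000
print(f"1 in {chained_random_match_odds(one_to_one, 4):,}")      # 1 in 205,379,000,000
```

Using the rounded 0.000169 instead of 1/5,900 gives 270 expected random one-to-one matches, the figure quoted above.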

I have
intentionally used two examples that have a higher probability of having false
positive matches. As soon as we look at
matches that don’t have the higher frequency European alleles, the probability
of a false positive diminishes.

Table 5. Probability of a half match within a European population with a Mediterranean sub-component.

Table 5 shows a typical set of alleles. There are two alleles at rs7630053 and rs4558783 that are not typical European and may indicate a Mediterranean ethnicity. The probability of a one-to-one match on this segment being a false positive calculates to be 1 in 7 quadrillion.

Currently,
we cannot examine the allele frequency for every SNP in every match we
attempt. When looking for autosomal
matches consider phasing or triangulation.
Phasing the data is very valuable, yet the resources are not always
available. I’ve shown that triangulation
eliminates false positives and those matches are statistically identical by
descent. Triangulated small segment
matching is very valuable in our research.

References:

Maglio, MR (2015) Autosomal DNA and the Triangulation of Small Segments: A Statistical Approach (Link)

Thursday, March 5, 2015

There has been much debate over the use of
small autosomal DNA segments. It is
important to understand where they come from and how they can be used for
genetic genealogy. Small segments are
considered noise and false matches. There
are too many small matches to make sense out of, but they are not necessarily
false matches. These segments have been
in the population for longer than we thought.
When I match someone at 2 cM it is very likely that they are a 12th
cousin, not a 5th cousin.
There is no reason for us to look for small segment matches until we
understand where these segments originated.

When we talk about autosomal DNA, we often oversimplify the process of genetic inheritance. The simple answer is that we inherit half of our DNA from dad and half from mom. The common message is that with every generation the DNA contribution from an ancestor is randomized and reduced until it is insignificant. Genetic inheritance is actually much more complex than that, and complex in a great way. There is a tremendous amount of ancestral information that we are just beginning to tap into.

We inherit DNA from our parents and their
ancestors in large sections. Take a look
at the graphic below. Each example is
the comparison of a grandchild to a set of paternal grandparents. You can see in the first example that the
grandchild inherited over two-thirds of their grandfather’s first chromosome
intact (blue bars). The remaining
section of the first chromosome is from their grandmother. In the third example, the grandchild has
inherited the entire chromosome 14 from their grandmother. It is physically possible that this
grandchild could someday give one of their children the grandmother’s complete
chromosome 14.

In an effort not to oversimplify, this is just half the story. That grandchild has an equal contribution from their maternal grandparents.

In the examples above, we can visualize what
happens when DNA recombines. The first
example shows where one section of the grandfather’s DNA swapped places with
the grandmother’s DNA before it was inherited by the grandchild. This is called crossover. In the examples, a) is a single crossover, b)
is a double crossover and c) has no crossover.
On average, each of our chromosomes experienced 2 or 3 crossovers before
we inherited them.

Where DNA
crossover takes place on a chromosome is not random. There are approximate locations where the
chromosome is more likely to split.
These locations are cleavage
sites.

These
locations exist because there are groups of genes along a chromosome that have
a tendency to stay together. These
groups are part of gene linkage. These linked genes only allow for chromosome
splits at either end of their linked section.
In my research, the minimum size for one of these gene-linked sections
is about 2.5 cM. These small segments
then travel in larger groups.

In the graphic above, the blue bar
represents about a 60 cM match. The
intersection between the black and orange ovals is about 2.5 cM and represents
a minimum segment. In this crossover
recombination, the large segment actually split to the right of the minimum
segment. In a future crossover, the
chromosome could split on the left side of the minimum segment, giving a large
segment bound by the orange oval.

Why are these minimum segments
important? My research shows that these
segments stay in the gene pool for dozens of generations. Over time, naturally occurring SNP mutations
take place. These minimum inherited
segments (MIS) can be differentiated into family groups.

In my research, I started with 28 well known
US colonial surnames and 393 autosomal kits.
For each surname, the associated kits were triangulated. If three or more kits match on the same
segment, you can deduce that it came from a common ancestor. Each of the surnames investigated had 6 to 13
distinct triangulated segments. Taken
together, these triangulated ancestral segments represent an autosomal
haplotype that can be used to identify a descendant’s genetic connection to an
ancestor. Across all of the surnames,
these distinct segments appear at recurring locations on each chromosome. I have listed 21 of these ancestral loci in
my paper.

Not all ancestral segments are the same
type. The segments can be categorized
into three groups. The first category is
Common to All. The surnames in this study are predominantly
European. One segment has been
identified on chromosome 2 that triangulates across all surnames. This segment correlates to a Western Atlantic
ethnicity and I call it the Western Atlantic Autosomal Haplotype (WAAH). The Western Atlantic Autosomal Haplotype
should not be confused with ancestry informative markers (AIMs). The WAAH is composed of about 800 SNPs and
there are only about 100 AIMs SNPs in that same stretch of chromosome 2.

The next category is Shared. Some segments can be
attributed to two or more surnames.
There was considerable intermarriage between US colonial families. That period was a bottleneck genealogically
and genetically. As two major families
married, their combined DNA segments entered the gene pool and were reinforced
as their descendants intermarried.

The third category is Unique. These shared
segments cannot be attributed to intermarriage of families. Yet the resulting familial autosomal
haplotypes are not composed of a single surname. In the case of Benjamin Franklin, the genetic
proximity to his wife, Deborah Read and his mother, Abiah Folger, may make it
impossible to distinguish between Folger, Franklin and Read DNA. Therefore, the haplotype represents the
combined inheritance.

Here is one of my case studies. Augustine Bearse was born in England in 1618 and died in Barnstable, MA before 1697. The Bearse family was chosen due to my familiarity with the genealogy and the debate surrounding Augustine’s wife. His wife Mary was supposedly the granddaughter of the Chief of the Cape Cod Native American tribes. The goal was twofold: to identify the autosomal haplotype for the Bearse family and to determine whether any of the ancestral segments had Native American ethnicity.

The Bearse study was composed of 48 autosomal
samples. These samples were collected
based on claimed genealogical connections.
The triangulated samples generated 8 ancestral loci and indicated an
additional 5 loci that had the potential to triangulate with more samples. The resulting Bearse autosomal haplotype is
found below.

Bearse Autosomal
Haplotype

The Bearse haplotype contains the Western
Atlantic Autosomal Haplotype (chromosome 2) which is common to all haplotypes
in the study. The other 12 loci are more
valuable for genealogical validation.
One of the Bearse descendants triangulates on six of the ancestral segments. It is highly unlikely that a descendant would
match on all of the segments. Although
ancestral segments survive over the generations, the randomness of their
distribution makes it difficult for any one person to have received them
all. Yet, triangulating on just one
segment unique to Bearse is enough to indicate and validate a
relationship. Lack of a match could mean
that an ancestral segment was not inherited or that a non-familial event
(adoption, infidelity, etc.) has occurred and the individual’s family tree is
incorrect.

In order to investigate the origins of
Augustine’s wife Mary, each ancestry segment from the haplotype was evaluated
for ethnicity. Only the segment on
chromosome six at location 55850885 had any Native American ethnicity. This ancestral segment had not fully
triangulated, yet a few of the samples match exactly on Native American
SNPs. With additional samples, the
segment could triangulate. Once
validated, the segment might be shared across multiple surnames or unique to
Bearse, indicating Native American genes in the Bearse descendants.

While the amount of autosomal DNA received
by each successive generation is only half from each parent, that does not mean
that given enough generations a distant ancestor’s genetic contribution will
become negligible. Through genetic
linkage, portions of DNA are inherited intact.
Naturally occurring cleavage sites allow for ancestral segments
averaging 2.5 cM to be passed from generation to generation as a minimum
inherited segment (MIS).

Ancestral segment analysis is invaluable for
the identification of distant ancestors.
All of the triangulated ancestral locations combine to become a Familial
Autosomal Haplotype (FAH) that can be used to validate family history.

Since finishing my initial research, I have
gone on to identify over 50 ancestral loci and over 700 autosomal haplotypes
for US colonial ancestors. Stay tuned
for further advances in autosomal research.

Wednesday, January 28, 2015

In 2006, Laoise T. Moore and the folks at Trinity College in Dublin published a paper famous for identifying the modal haplotype of Irish High King Niall of the Nine Hostages. In their work, they used seventeen Y-DNA STR markers. While time to most recent common ancestor (TMRCA) calculations have accuracy issues, having only 17 markers gives a common ancestor over 2,000 years ago. What the Trinity folks really accomplished was the identification of Niall’s paternal ancestor from over 400 years earlier. The media in 2006 had a field day in their interpretation that most of Ireland is descended from Niall. “Niall may be the most prolific male in Irish history.” Also at 17 markers, there is a very high probability of convergence. Through normal mutations, haplotypes can change over time to appear similar or identical to other haplotypes. The lower the number of markers, the higher the chance of convergence. At that time only high level SNPs were tested to determine haplogroup. Without terminal SNPs it would have been impossible to recognize convergence, if it existed in the samples.

In my research on the Kings of Ireland, I have used 67 markers to reduce the chance of convergence and to calculate the age of common ancestors on the descendant side of the target rather than the ancestor side. I will demonstrate traditional median-joining networks and novel “tribal” markers for the identification of four historic Kings of Ireland. Did Trinity get Niall’s haplotype correct with the limited data they had at the time?

Ghost: a manifestation of a dead person

Modal haplotype: a derived haplotype based on the DNA tests of a group of people

A modal haplotype is a ghost of a person. When we look at multiple DNA test results and calculate the mode, by definition we are just taking the values that appear most often. There is no way to determine if the modal haplotype is the actual haplotype of the historic individual we are researching (short of historic samples). While the modal is not perfect, it will be close enough at 67 markers for us to determine the genetic “ghost”.

The septs of Ireland provide us an opportunity to develop genetic genealogy techniques and processes. Irish surnames are typically patronymic. The surnames generally take the form of Mac Cárthaigh (McCarthy), meaning son of Cárthaigh or Ui Néill (O’Neill), meaning grandson / descendant of Néill. Irish septs serve as a collective of related families with shared ancestry and patronymic surnames. Multiple septs then belong to larger dynasties such as the Eóganachta and the Dál gCais.

If septs are patrilineal, then Y-DNA haplotypes should be consistent across sept surnames. Research on the Uí Néill haplotype started with a geographical selection and then a subsequent reduction by sept surnames (Moore et al 2006). For each target sept, affiliated surnames were identified. In the case of Uí Néill, the following surnames and associated Y-DNA STR records were accessed from Family Tree DNA projects: O’Neill, Gallagher, Doherty and O’Donnell. The selection includes 600 records and 5 common European haplogroups.

Median-joining networks have been in use for over a decade for the visualization of genetic relationships. The use of them at 67 STR markers has been rare, but it should be the norm. This first image has the central cluster of a median joining network based on 25 STR markers from the Uí Néill group. It is just a single cluster with no differentiation.

Figure 1 - Using only 25 STR markers, the Uí Néill network collapses to a single cluster.

When we look at the same group using 67 markers, we get four distinct clusters, each with their own SNP. The cluster at the far right is predominantly R-L159 and the cluster at the lower right has R-P311/R-L151 nodes. The cluster at the left contains all of the Uí Néill dynastic surnames, has the majority of nodes and is SNP R-M222, which is consistent with earlier studies.

Figure 2 - View of the Uí Néill network torso showing four distinct clusters. Three groups on the right are O’Neill only.

As a double check to make sure that I wasn’t seeing some other phenomenon, I analyzed three random Irish surnames: Duffy, Kelly and McCormick. The random sample produced over ten unique clusters with no surname overlap. This comparison shows that septs are patrilineal and that Y-DNA haplotypes are consistent across sept surnames.

A different technique that I’d like to illustrate involves the fact that not all STR markers are created equal. This method takes advantage of “slow” mutating STR markers. Each marker has its own mutation rate. By selecting the 15 “slowest” markers with an average mutation rate of 0.00024, a virtual tribal haplotype is created that would be stable within the last 2,000 years (90% probability of 80 generations). This is an order of magnitude lower than the average rate of 0.0029 used as a constant in typical TMRCA calculations. The “tribal” markers isolated are DYS426, DYS388, DYS392, DYS455, DYS454, DYS578, DYS590, DYS641, DYS472, DYS594, DYS436, DYS490, DYS450 and DYS640.

To make the “tribal” haplotype of 15 microsatellites faster to manipulate, the resulting values are concatenated into a string, e.g. 12121411119168108101212811. The “tribal” haplotypes are summarized per surname and plotted to illustrate majority and affinity.
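The concatenate-and-summarize step might look like the following Python sketch. The function names, the three-marker subset and the kit values are assumptions for illustration only. The real analysis uses the full set of slow markers.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Kit = Dict[str, int]  # marker name -> repeat count


def tribal_key(kit: Kit, markers: Sequence[str]) -> str:
    """Concatenate the repeat counts of the chosen markers, in order."""
    return "".join(str(kit[m]) for m in markers)


def dominant_by_surname(
    records: List[Tuple[str, Kit]], markers: Sequence[str]
) -> Dict[str, Tuple[str, int]]:
    """For each surname, the most common tribal string and how often it occurs."""
    tallies: Dict[str, Counter] = {}
    for surname, kit in records:
        tallies.setdefault(surname, Counter())[tribal_key(kit, markers)] += 1
    return {s: c.most_common(1)[0] for s, c in tallies.items()}


# Made-up records using three of the slow markers.
markers = ["DYS426", "DYS388", "DYS392"]
records = [
    ("O'Neill", {"DYS426": 12, "DYS388": 12, "DYS392": 14}),
    ("O'Neill", {"DYS426": 12, "DYS388": 12, "DYS392": 14}),
    ("O'Neill", {"DYS426": 12, "DYS388": 12, "DYS392": 13}),
    ("Doherty", {"DYS426": 12, "DYS388": 12, "DYS392": 14}),
]
print(dominant_by_surname(records, markers))
# {"O'Neill": ('121214', 2), 'Doherty': ('121214', 1)}
```

Plain concatenation is compact, but two different haplotypes can produce the same string when values differ in digit count (9 followed by 16 versus 91 followed by 6). Joining with a separator such as "-" avoids that at the cost of a longer key.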

The Uí Néill dataset resolved into 37 unique “tribal” haplotypes. Figure 5 shows that haplotype 12121411119168108101212811 is the most dominant across the Uí Néill surnames. As with the median-joining network analysis, this “tribal” haplotype is consistent with SNP R-M222.

I repeated these two techniques for the Uí Briúin sept using the following surnames and associated Y-DNA records: O’Brien, Hogan, Kennedy and McMahon. The selection includes 615 records. The Mac Cárthaigh dataset has the following surnames: McCarthy, Callaghan, Donovan and Sullivan. The selection includes 319 records. The Ua Conchobhair data has the following surnames: O’Connor, McManus, Reilly and Rourke. The selection includes 352 records.

Here are a couple of interesting insights from my research. Niall Noígíallach was High King of Ireland around 378 CE and founder of the Uí Néill dynasty. Historically, his half-brother Brión was one of the founders of the Connachta dynasty and an ancestor of the last High King of Ireland, Ruaidrí Ua Conchobair. If their genealogies are correct, the evidence is in their descendants’ DNA. The data shows that Uí Néill and Ua Conchobair share the same SNP, R-M222. The Uí Néill and Ua Conchobair modals are a 6-step match at 67 markers. There is a 99% probability of a relationship not further than 1,260 years ago. The results make a strong case for the validity of this historic genealogy.
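For readers unfamiliar with the term, a "step" count between two haplotypes can be sketched as the sum of the differences at each shared marker. This Python sketch uses that simple stepwise count with made-up values. Matching tools may score multi-copy markers and large jumps differently, so it should not be read as the exact method behind the figures above.

```python
from typing import Dict

Haplotype = Dict[str, int]  # marker name -> repeat count


def step_distance(a: Haplotype, b: Haplotype) -> int:
    """Sum of absolute repeat-count differences over markers present in both."""
    shared = a.keys() & b.keys()
    return sum(abs(a[m] - b[m]) for m in shared)


# Two made-up modal haplotypes at four markers.
modal_one = {"DYS393": 13, "DYS390": 25, "DYS19": 14, "DYS391": 11}
modal_two = {"DYS393": 13, "DYS390": 24, "DYS19": 14, "DYS391": 13}
print(step_distance(modal_one, modal_two))  # 3
```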

Brian Boru, High King of Ireland in 1002 CE, belonged to the Dál gCais dynasty and Tadhg Mac Cárthaigh, the first King of Desmond, belonged to the Eóganachta dynasty. Ancient genealogies have the Eóganachta and Dál gCais dynasties descended from Ailill Aulom, the son-in-law of legendary king Conn of the Hundred Battles. The Mac Cárthaighs and Uí Briúins do not share the same SNP (R-L226 vs. R-CTS4466), but by descent they would share a common R-DF13 ancestor. The Mac Cárthaigh and Uí Briúin modals are an 11-step match at 67 markers. There is a 99% probability of a relationship not further than 1,920 years ago. This puts a Mac Cárthaigh-Uí Briúin common ancestor as a contemporary of the legendary Conn.

New and improved genetic genealogy techniques are invaluable for the identification of historic individuals and the reconstruction of distant family trees at the macro level.