Friday, 4 August 2017

Autosomal DNA Discussions - and some statistics for my kits

There have been some interesting discussions on the mailing
lists recently*, which have caused me to look at some statistics for the kits I
manage. On the one hand, there were the seemingly straightforward questions about the best strategy for dealing
with autosomal DNA results, and how to manage the ever-increasing influx of new results. Answers to these questions tend to stress the importance of sharing multiple segments and of setting a minimum length for the segments worked with, as well as focusing on names and locations
relevant to one’s own family history.

But, on the other hand, the ongoing debate, predominantly between two people whom
I regard as genetic genealogy experts, Debbie Kennett and Tim Janzen, shows
that things can be far from “straightforward” when dealing with DNA. Alongside issues of terminology (what do we
actually mean when we say “identical by state”, or “identical by descent” etc.), and how far back shared ancestry might be for particular levels of shared DNA (even up to 10 or 20 generations), such
discussions often revolve around the problem of “triangulating groups”* (TGs) –
what causes them, how relevant they are (or aren't), and the factors that affect them (such as segment size, phasing, haplotype frequency, and the population that’s involved).

Fundamentally, the problem seems to be that
scientific modelling suggests TGs shouldn’t exist, as it’s thought to be “mathematically
impossible for so many people to share the same segment by virtue of sharing a
single ancestral couple.”* But many people's results seem to indicate that they do exist – so why?

I don’t have the answer to that question, obviously, and I've written before about the two differing theories (at http://notjusttheparrys.blogspot.co.uk/2016/11/dna-update.html). But two comments in particular struck me, as I realised that I hadn't specifically examined my kits with these issues in mind. First was Tim’s comment that half identical regions (ie matching segments) that are at least 15 cMs in length and
contain at least 2000 SNPs will almost always be "identical by descent" (IBD) and, secondly, Debbie’s comment that, in her experience with UK matches, the only segments that fall
into triangulated groups are small segments under 15 cMs, and that we would be better off focusing our attention on
matches that share over 15 cMs.

Debbie and I have discussed the numbers of TGs we have before, so I know my results show a few more than hers do, but this has prompted me to take a detailed look at my kits, to see the effect of applying such thresholds.

I began with FTDNA, where I have access to seven UK kits. The following graph shows the numbers of matches I have with
particular “longest segment” lengths, annotated for any known relatives:

These graphs show a group of four siblings and the numbers of matches they each have with particular “longest segment” lengths, annotated for any known relatives:

And finally in this section, graphs for the three other kits I have access to:

The following table summarises how many matches each of the
above kits would have to work with, if either a 15cM or a 20cM threshold was applied:

So applying such thresholds would certainly reduce the number of matches regarded as 'relevant' and make working with the results more manageable. But would we be missing useful information, as demonstrated by the 4c1r matching kit N?
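The effect of a "longest segment" threshold can be sketched in a few lines of Python. This is only an illustration: the match labels, segment lengths and relationships below are invented, and the tuple layout is an assumption rather than the format of any company's download file. It counts how many matches survive each threshold and lists any known relatives that would be dropped.

```python
# Hypothetical matches for one kit:
# (label, longest segment in cM, known relationship or None).
# All values are invented for illustration; they are not real results.
matches = [
    ("match A", 7.9, None),
    ("match B", 11.2, "4c1r"),
    ("match C", 15.0, None),
    ("match D", 18.6, None),
    ("match E", 23.4, "2c"),
]


def apply_threshold(matches, min_cm):
    """Split matches into those kept and those dropped by the threshold."""
    kept = [m for m in matches if m[1] >= min_cm]
    dropped = [m for m in matches if m[1] < min_cm]
    return kept, dropped


for min_cm in (15, 20):
    kept, dropped = apply_threshold(matches, min_cm)
    # Known relatives that fall below the threshold are the information at risk.
    lost_relatives = [label for label, _, rel in dropped if rel is not None]
    print(f"{min_cm}cM: {len(kept)} kept, known relatives dropped: {lost_relatives}")
```

Listing the dropped known relatives alongside the counts makes the trade-off visible: the threshold shortens the list, but a confirmed distant cousin on a shorter segment disappears with it.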

I had hoped to produce similar graphs for the two kits I manage at 23andMe but, as the download file doesn't include details of the "longest segment" for matches sharing multiple segments, the following graphs include all segments for those matches my mother and I are sharing with (or who are "Open Sharing"). (I have removed the parent/child segment data to avoid an 'extended tail' in the graphs.)

The "curves" of the 23andMe graphs are much more irregular than those of the FTDNA kits, which could be a feature of the differences in the nature of sharing between the two companies.

But, once again, it is clear that applying thresholds of 15cM or 20cM would dramatically reduce the number of segments left to work with.

As a slight sidetrack, in view of another question on a mailing list, concerning the numbers of matches that don't match parents, I just thought I'd add in a graph to show the numbers of N's segments that are from matches identified as also matching N's mother.

As you can see, the number of non-maternal matches is generally greater than the number of maternal, which possibly indicates that there is some level of false positives in the results. However, it could also just be a sign that N's paternal side of her family has more matches in the databases - something which is supported by the higher numbers of matches N's paternal relatives have at FTDNA in comparison to N and her mother. But the important issue is that the segment lengths for the matches identified as maternal do go all the way down to 5cM. It seems to me therefore, that it would not be easy to distinguish which segments may be false positives (ie people identified as matches who are not genuine matches), based just on maternal/paternal matching. With such short segment lengths, it is possible that the parents' results are showing false negatives (ie genuine matches not identified as matches in the parent for some reason.)

Back to the original questions. Of course, segment length isn't the only consideration - Tim's criteria included the numbers of SNPs as well. The following scattergram shows numbers of SNPs per segment length, with the shaded area being those who would meet the criteria of being "at least 15 cMs in length and containing at least 2000 SNPs".

There are 165 segments (out of 1920) that would meet Tim's criteria and so would almost always be genuine (and 32 segments, if the segment length used was 20cM).
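Counting the segments that meet both criteria is a simple two-condition filter. The sketch below assumes each segment has been read into a record with a length in cM and a SNP count; the `Segment` type, its field names and the example values are all hypothetical and are not the layout of the 23andMe download.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    match_name: str
    chromosome: str
    cm: float
    snps: int


def meets_criteria(seg, min_cm=15.0, min_snps=2000):
    """True if the segment reaches both the length and the SNP-count threshold."""
    return seg.cm >= min_cm and seg.snps >= min_snps


# Invented example segments, not real data.
segments = [
    Segment("match A", "1", 8.3, 1500),
    Segment("match B", "4", 16.1, 3200),
    Segment("match C", "15", 27.0, 1800),  # long, but too few SNPs
    Segment("match D", "X", 22.5, 2600),
]

at_15 = [s for s in segments if meets_criteria(s)]
at_20 = [s for s in segments if meets_criteria(s, min_cm=20.0)]
print(len(at_15), len(at_20))
```

Note that a long segment with a low SNP count (like "match C" here) fails the test even though it would pass a length-only threshold.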

The 23andMe graph for N shows an unexpected peak at 27cM, which the scattergram indicates is made up of some segments with fewer than 2000 SNPs. Closer analysis shows these are predominantly at the start of chromosome 15 and, using the ADSA tool*, it can be seen that all but one of the segments fully triangulate as a non-maternal TG.

Is it a genuine segment (ie descended to all the matches from a shared ancestor)? The low SNP count might imply not, but the apparent phasing and the fact that it is at the start of the chromosome (where recombination is perhaps less likely), as well as it being over 15cM, may be factors in favour of it being so.

But the honest truth is, I currently don't know - with factors both for and against it, I often think the only way to tell if a segment is genealogically relevant is if one finds a genealogical connection!

So, what about any other triangulating groups I might have? I started by using the more restrictive thresholds of 20cM and 2000 SNPs. At these levels, my FTDNA kit showed one TG:

However, three of the matches are clearly related to each other, so the TG actually only consists of three separate ancestral lines (theirs, mine and the fourth match's). When I added my close relatives in, all four of these matches show as paternal matches. Reducing the threshold to 15cM (but maintaining the SNP threshold at 2000) picks up another member of the one family, and reducing it to 10cM picks up one other match (10cM, 2700 SNPs), who triangulates with all of the others.

On two other chromosomes, at 20cM, there is a match who shows as matching my 1c1r so, whilst not creating a TG, these do give me hints as to the relevant ancestral lines there.

In addition to the above TG, reducing the threshold to 15cM produces TGs on ten other chromosomes with my FTDNA kit. These can be identified as either paternal or maternal based on matching to relatives (who aren't shown, in order to keep the diagrams easy to read):

I do think some of these TGs look "too perfect" - for example, see chromosome 8, where twelve people show identical figures.

Decreasing the thresholds below 15cM increases the numbers of matches in these TGs, as well as producing more TGs, but many look too regular, given the random nature of DNA transmission. The use of matching to close relatives to 'phase' the segments should indicate a genuine matching sequence on one chromosome out of a pair (rather than a "match" being created from criss-crossing between SNPs on the two chromosomes in a pair). But I do have a nagging suspicion that something may not be right, when all of the matches over any particular segment seem to be on just one chromosome, rather than there being overlapping maternal and paternal TGs - although, occasionally, that pattern of two overlapping TGs can be found, as in this example from chromosome 4:

Moving on to my 23andMe kit, at 20cM and 2000 SNPs that shows two TGs:

Chromosome 4 (maternal TG)

And on the X chromosome, a paternal TG:

I did think there was a third TG, on chromosome 11:

But, on checking the profiles, I discovered the two matches are identical twins, so that means there are just two ancestral lines involved (mine and theirs), and so this doesn't make a TG.

It is also a timely reminder that DNA results should always be analysed in conjunction with the genealogy!
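The "separate ancestral lines" check used above can be written down as a small sketch. It assumes the overlapping matches on a segment have already been found, and that any known close relationships between matches have been noted by hand in a `family_of` lookup; that lookup, the function names and all the match names are hypothetical. It does not test whether the matches also match each other, which triangulation requires as well.

```python
def distinct_lines(match_names, family_of):
    """Count separate ancestral lines in a group of overlapping matches.

    family_of maps a match name to a family label for matches known to be
    closely related to each other; unlabelled matches count as their own line.
    The tester's own line is added on top.
    """
    lines = {family_of.get(name, name) for name in match_names}
    return len(lines) + 1


def could_be_tg(match_names, family_of):
    """A group needs at least three separate lines, including the tester's."""
    return distinct_lines(match_names, family_of) >= 3


# Invented names. Two identical twins collapse to one line: 2 lines, no TG.
twins = {"twin 1": "twin family", "twin 2": "twin family"}
print(could_be_tg(["twin 1", "twin 2"], twins))

# Three related matches plus one unrelated match: 3 lines, a possible TG.
family = {"rel 1": "family X", "rel 2": "family X", "rel 3": "family X"}
print(could_be_tg(["rel 1", "rel 2", "rel 3", "other"], family))
```

The family labels can only come from the genealogy or the match profiles, which is why the check cannot be done from the segment data alone.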

Rerunning the 23andMe data using thresholds of 15cM and 2000 SNPs produces TGs on an additional eleven chromosomes. This time, I have included details of how my mother matches, since she's the only close relative at 23andMe, so it doesn't complicate the images too much and does make the phasing more obvious.

So, in my results, I do have some TGs above 15cM and 2000 SNPs. But I am not convinced that they are all valid, based on what they look like in comparison to what I understand about the random nature of DNA transmission. I do need to work through the groups, to see if there are any obvious explanations for the anomalies and "overly perfect" matching (as in the case of the identical twins above). There are probably some other investigations I could do with the data, for example, checking for runs of homozygosity (sequences of identical SNPs on both chromosomes), which might be affecting matching.
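As a rough idea of what a check for runs of homozygosity might look like, the sketch below scans a list of genotype calls in chromosome order and reports stretches where both alleles are the same. It assumes each call is a two-letter string such as "AA" or "AG", treats no-calls ("--") as breaking a run, and uses a tiny `min_snps` so the invented example stays readable; dedicated tools work with far longer runs and with lengths measured in cM.

```python
def homozygous_runs(genotypes, min_snps=3):
    """Return (first_index, last_index) for each run of homozygous calls.

    genotypes: two-letter calls in position order, e.g. ["AA", "AG", "--"].
    Heterozygous calls and no-calls end the current run (a simplification).
    """
    runs = []
    run_start = None
    for i, call in enumerate(genotypes):
        is_hom = len(call) == 2 and call[0] == call[1] and call[0] in "ACGT"
        if is_hom:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_snps:
                runs.append((run_start, i - 1))
            run_start = None
    # Close a run that reaches the end of the list.
    if run_start is not None and len(genotypes) - run_start >= min_snps:
        runs.append((run_start, len(genotypes) - 1))
    return runs


# Invented calls, not real data.
calls = ["AA", "AG", "CC", "TT", "GG", "AA", "CT", "GG", "--", "CC", "CC", "CC"]
print(homozygous_runs(calls))
```

Any run found this way would then need comparing against the positions of the TGs, to see whether the two coincide.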

However, I don't think there's much that I, as an individual test taker, can do to find out about how issues such as endogamy, haplotype frequencies, and population segments (which are some of the possible reasons given for why TGs may not be valid), might affect the validity of the TGs appearing in my results.

But trying to test the validity of the comments in the discussions wasn't the point of this post. My aim was purely to examine my results in the light of those comments, to see what doing so showed, and I feel that carrying out this analysis has been very useful. It has been helpful to look at ways to make the numbers of matches more manageable and to think about what information I might lose by doing so. Focusing on these aspects of my results has also caused me to notice things about the data that I had previously missed. I'm sure the results will also be helpful as I continue to work on the visual phasing and looking at how the segments shared with matches correlate with what might be predicted from that. There are clearly other aspects of my results that I also need to consider, such as matches sharing multiple segments and the company predictions about relationship levels, which I haven't taken account of here.

But, hopefully, all of this combined will enable me to work out some more effective strategies for dealing with my results - which, of course, must include one of the main things I have been reminded of during this process, which is the importance of checking out the genealogy of my matches!

* Three discussions on the ISOGG lists (which are not public, so I won't post the links - membership of ISOGG is free though, so please join - see https://isogg.org/ ). The discussions are in the threads [ISOGG] Autosomal Survey, [DNA-NEWBIE] Spreadsheets and new matches, and [DNA-NEWBIE] Re: Single Large Segments.

5 comments:

You've covered most of it. The general rule is that, when matching a single person, you need a segment of about 15 cM for it to be likely IBD. If you triangulate, Jim Bartlett is sure that you only need 7 cM (possibly even 5 cM) for it to be likely IBD.

The trouble with the shorter segments is that they're generally not genealogically relevant, and may originate 10 generations or 20 generations back or even longer. There are of course exceptions: your parent may have had a 30 cM match that was cut down to 5 cM in you by a crossover, with the other 25 cM coming from your other grandparent's chromosome instead.

The key point is that for a segment to be IBD, it must triangulate. The opposite is not true. A triangulated segment is not necessarily IBD. Chance matches can happen on shorter IBD segments. But by using triangulated segments, you are eliminating a lot of false matches that are guaranteed to not be IBD.

Yes, that makes sense - I think you've summed it up well. Some small segments can be relevant, as I showed in my post at https://notjusttheparrys.blogspot.co.uk/2017/06/ where a 6.9cM segment I have triangulates with my father's siblings (and has to be genuine for the visual phasing to work). But just because a segment triangulates, it doesn't necessarily make it genealogically relevant, even for the longer segments. There's still a lot we don't know! Kind regards, Barbara

Barbara, I'd be interested to know about the pattern of matching where you've found connections on the smaller segments. Were these single pairwise matches between you and your cousin, or did you find other people matching on the same segment? You make a good point about the number of SNPs in the segment. I was intrigued by your big pile-up on chromosome 15. I wonder if that area would still be a match if the SNP density were higher in that area on a different chip.

Thanks, Debbie. That is a good question - one of the things I noticed when I identified my first shared ancestor with a DNA match (a 4c1r) was that they were the only person matching me over that segment. (It was particularly interesting as, at that time, I was working on looking for TGs!) I'll go through all the connections and let you know what patterns there are. Regarding the pile-up on chromosome 15, yes, I agree. I'll check my data from the other companies, to see how that area compares on their tests (although I think some of them use the same chip, anyway). Do you know whether the new 23andMe test will have better coverage in that area?

Barbara, Yes that's my experience too. All the scientific papers tell us that the more frequent a haplotype is the less likely it is to be recent. It takes a long time for selection to take effect, and for a haplotype to reach a high frequency in a population. I don't know how the new GSA chip, used by both 23andMe and Living DNA, is going to work. It would be good if it could provide better coverage in some of the SNP-poor and pile-up-prone regions.