HapFABIA: Identification of very short segments of identity by descent
characterized by rare variants in large sequencing data

This site contains supporting material for the
manuscript "HapFABIA: Identification of very short segments of identity by descent characterized by rare variants in large sequencing data".

Summary

Our method HapFABIA identifies
short identity by descent (IBD) segments
that are tagged by rare variants in
large sequencing data.
Two haplotypes are identical by descent (IBD)
if they share a segment that
both inherited from a common ancestor.
Current IBD methods reliably detect long IBD segments
because many minor alleles in the segment are concordant
between the two haplotypes.
However, many cohort studies contain unrelated individuals
who share only short IBD segments.
Short IBD segments contain too few minor alleles
to distinguish IBD from random allele sharing by recurrent mutations.
New sequencing techniques improve the
situation by providing rare variants which
convey more information on IBD than common variants,
because random minor allele sharing
of rare variants is less likely than for common variants.
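The last point can be illustrated with a small calculation. The following is a minimal sketch under a simplifying assumption that is not part of the text above: two unrelated haplotypes each carry the minor allele independently with probability equal to the minor allele frequency (MAF), and SNVs are independent of each other. The function name chance_sharing and the two example frequencies are chosen for illustration only.

```python
def chance_sharing(maf, n_snvs=1):
    """Probability that two independent haplotypes both carry the minor
    allele at each of n_snvs independent SNVs with frequency maf.

    Assumption for illustration: independence between haplotypes and SNVs.
    """
    return (maf * maf) ** n_snvs


# Compare a common variant with a rare variant (example frequencies only).
for label, maf in (("common", 0.30), ("rare", 0.01)):
    print(f"{label:6s} MAF={maf:.2f}  "
          f"1 SNV: {chance_sharing(maf):.1e}  "
          f"3 SNVs: {chance_sharing(maf, 3):.1e}")
```

Under these assumptions, a handful of concordant rare minor alleles is already very unlikely to arise by chance, whereas the same number of concordant common alleles is not.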

IBD segment (yellow) that descended from a founder to different individuals.

Short IBD segments are of interest because
(i) they resolve the genetic structure on a fine scale
and (ii) they can be assumed to be old.
In order to detect short IBD segments, both the information supplied by rare
variants and information from more than two individuals should be
utilized.
These two characteristics
are the basis for detecting short IBD segments by HapFABIA.
We propose biclustering
to detect very short IBD segments that are shared
among multiple individuals.
Biclustering simultaneously clusters rows and columns of a matrix.
In particular, it clusters row elements that
are similar to each other on a subset of column elements.
A genotype matrix has individuals (unphased) or chromosomes (phased)
as row elements and SNVs as column elements.
Entries in the genotype matrix usually count how often
the minor allele of a particular SNV is present in
a particular individual.
Alternatively, minor allele likelihoods or dosages may be used.
Individuals that share an IBD segment are similar to
each other at minor alleles of SNVs (tagSNVs) which tag
the IBD segment (see Figure below).
Therefore, an IBD segment that is shared among individuals
corresponds to a bicluster because these individuals are similar to
one another at this segment.
Identifying a bicluster means identifying tagSNVs
(column bicluster elements) that tag an
IBD segment and, simultaneously, identifying
individuals (row bicluster elements) that possess the IBD segment.
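The correspondence between an IBD segment and a bicluster can be made concrete on a toy matrix. The sketch below is not the HapFABIA model; it swaps in a simple pairwise counting heuristic to show what "rows that are similar on a subset of columns" means. The matrix, the function name find_bicluster, and the thresholds max_count and min_shared are all hypothetical and chosen for illustration.

```python
from itertools import combinations


def find_bicluster(X, max_count=3, min_shared=3):
    """Toy heuristic (not FABIA): return (row members, tagSNV columns).

    X is a list of rows (chromosomes); each entry is 1 if the minor
    allele is present at that SNV and 0 otherwise.
    """
    n_rows, n_cols = len(X), len(X[0])
    counts = [sum(row[j] for row in X) for j in range(n_cols)]
    # Keep SNVs whose minor allele is shared (>= 2 carriers) but still rare.
    rare = [j for j in range(n_cols) if 2 <= counts[j] <= max_count]
    members = set()
    for a, b in combinations(range(n_rows), 2):
        shared = sum(1 for j in rare if X[a][j] and X[b][j])
        if shared >= min_shared:  # enough concordant rare minor alleles
            members.update((a, b))
    members = sorted(members)
    # tagSNVs: rare SNVs carried by at least two bicluster members.
    tag_snvs = [j for j in rare if sum(X[i][j] for i in members) >= 2]
    return members, tag_snvs


# Hypothetical phased data: 12 chromosomes (rows) x 12 SNVs (columns).
n_chrom, n_snvs = 12, 12
X = [[0] * n_snvs for _ in range(n_chrom)]
for i in (1, 2, 4, 6, 8):   # a common variant, ignored by the rare filter
    X[i][0] = 1
X[1][2] = 1                 # private variants, carried by one chromosome
X[9][10] = 1
X[7][9] = 1                 # chance sharing at a single rare SNV
X[10][9] = 1
for i in (0, 3, 5):         # implanted IBD segment tagged by SNVs 4..7
    for j in range(4, 8):
        X[i][j] = 1

members, tag_snvs = find_bicluster(X)
print("chromosomes in bicluster:", members)
print("tagSNVs:", tag_snvs)

# Unphased view: sum the two chromosomes of each individual (assuming
# consecutive rows belong to one individual) to obtain allele counts.
G = [[a + b for a, b in zip(X[k], X[k + 1])] for k in range(0, n_chrom, 2)]
```

On this toy matrix the heuristic recovers the three implanted chromosomes and their four tagSNVs, while the common variant, the private variants, and the single chance-shared rare SNV are left out. Real data need a model that tolerates noise and missing tagSNVs, which is what a biclustering model provides.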

Biclustering of a genotyping matrix.
Left: original genotyping data matrix with individuals as
row elements and SNVs as column elements. Minor alleles
are indicated by violet bars and major alleles by yellow bars for each
individual-SNV pair.
Right: after sorting the rows, the detected bicluster
can be seen in the top three individuals.
They contain the same IBD segment which is marked in gold.
Biclustering simultaneously clusters rows and columns of a matrix so
that row elements (here individuals) are similar to each other
on a subset of column elements (here the tagSNVs).

Analysis box for doing your own analysis of IBD segment
sharing between human, Denisovan, and Neandertal:
All the data have been prepared for the analysis of short IBD sharing between
human, Denisovan, and Neandertal. All R scripts used to
generate the results and the plots of the manuscript and the
supplementary information are provided. It is
very simple to do your own analysis. GO AHEAD!

Examples of Short IBD Segments in Chromosome 1 of the 1000 Genomes Project

Figures 1-6: Examples of IBD segments that were
extracted from
chromosome 1 of the 1000 Genomes Project. In these phased
genotype data,
phasing errors can be seen (yellow lines from the left-hand
side).

Fig. 1:
IBD segment found exclusively in Africans. The third
and fourth lines very likely show a phasing error, as both
chromosomes belong to the same individual. The same applies to the
fourth-to-last and fifth-to-last lines.

Fig. 2:
IBD segment observed in all populations including
one African. However, this might also
be a region of sequencing errors because the tagSNV pattern is
not very clear.

Fig. 3: IBD segment observed in all populations.

Fig. 4:
IBD segment shared by Africans and one admixed
American. Again, phasing errors are visible in the last two
lines (NA20299) and in lines 11 and 12 (NA19248).
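The visual check used in the captions above (both chromosomes of one individual appearing in the same segment) can be automated. The following is a minimal sketch, assuming each line of a segment plot is available as a pair of individual ID and chromosome number; this input format and the function name both_chromosomes_present are hypothetical, and the example rows do not reproduce the actual content of Fig. 4. The result is a list of candidates only, since an individual may in principle carry the segment on both chromosomes.

```python
from collections import defaultdict


def both_chromosomes_present(segment_rows):
    """Return individuals whose two chromosomes both appear in a segment.

    segment_rows: iterable of (individual_id, chromosome) pairs, where
    chromosome is 1 or 2 (hypothetical input format).
    """
    seen = defaultdict(set)
    for individual, chromosome in segment_rows:
        seen[individual].add(chromosome)
    # Flag individuals listed with two distinct chromosomes.
    return sorted(ind for ind, chroms in seen.items() if len(chroms) >= 2)


# Hypothetical rows of one segment; SAMPLE_A and SAMPLE_B are made-up IDs.
rows = [
    ("SAMPLE_A", 1),
    ("NA19248", 1), ("NA19248", 2),
    ("SAMPLE_B", 2),
    ("NA20299", 1), ("NA20299", 2),
]
print(both_chromosomes_present(rows))
```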

Short IBD Segments Found in Data from the Korean Personal Genome Project (KPGP)

The Korean Personal Genome Project (KPGP)
is part of the international Personal Genome Project (PGP) established by the Genome Research Foundation (GRF).
39 human genomes were sequenced on an Illumina HiSeq 2000 platform with 30x to 40x coverage.
The genotypes of these 38 Koreans and one Caucasian female are combined
with the genotype data of the 1000 Genomes Project to extract short
IBD segments by HapFABIA.

The KPGP data contains two twin pairs (KPGP88/KPGP89 and
KPGP90/KPGP91) and a family (KPGP1-KPGP12). KPGP10 is a Caucasian
female from the US. The relations are
given in the following pedigree charts:

Pedigree
charts for the KPGP individuals.

Figures K1-K7: Examples of short IBD segments from
chromosome 1 of the KPGP combined with the 1000 Genomes Project.

Fig. K1:
IBD segment caused by systematic sequencing errors.
Note that
this segment is observed in all KPGP individuals and only
in those, even though KPGP10 is a Caucasian female.

Fig. K2:
IBD segment with sequencing errors
for KPGP individuals at the right-hand side.
Some Koreans are classified as having this segment
only because they agree with other Koreans at the
sequencing errors.

Fig. K3:
IBD segment that
matches the Denisova genome and
is shared among Asians, in particular Koreans.

Fig. K4:
Another IBD segment that matches the Denisova genome and
is shared by Asians, in particular observed in Koreans.

Fig. K5:
IBD segment exclusively shared by Koreans.

Fig. K6:
IBD segment that is shared by both Korean twin pairs.
Sequencing errors are visible because twins should have the
same IBD segments.
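The twin argument can be turned into a simple consistency check. The sketch below assumes that the twins in each pair are expected to carry identical alleles at the tagSNVs of a segment, so that any difference points to a sequencing or calling error. The function name twin_discordance and the allele vectors are hypothetical; they are not read from Fig. K6. Only the twin pair IDs are taken from the text.

```python
def twin_discordance(tag_alleles, twin_pairs):
    """For each twin pair, list the tagSNV indices where the twins differ.

    tag_alleles: dict mapping individual ID to a list of 0/1 minor allele
    indicators at the tagSNVs of one IBD segment (hypothetical format).
    """
    result = {}
    for a, b in twin_pairs:
        result[(a, b)] = [
            k for k, (x, y) in enumerate(zip(tag_alleles[a], tag_alleles[b]))
            if x != y
        ]
    return result


twin_pairs = [("KPGP88", "KPGP89"), ("KPGP90", "KPGP91")]

# Made-up allele indicators at five tagSNVs, for illustration only.
tag_alleles = {
    "KPGP88": [1, 1, 1, 1, 1],
    "KPGP89": [1, 1, 0, 1, 1],   # differs from the twin at tagSNV index 2
    "KPGP90": [1, 1, 1, 1, 1],
    "KPGP91": [1, 1, 1, 1, 1],
}
print(twin_discordance(tag_alleles, twin_pairs))
```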

Fig. K7:
IBD segment that is shared by
many members of the Korean
family. The IBD segment was passed down from KPGP1 to
all her children (KPGP3, KPGP5, KPGP9) and some of her
grandchildren (KPGP7, KPGP11, KPGP12).

Correlation between population proportions and ancient genomes
based on short IBD segments

Pearson's correlation between the
Denisova genome and different populations.

Fisher
test for dependencies between the Denisova genome and
different populations.

Pearson's correlation between the
Neandertal genome and different populations.

Fisher
test for dependencies between the Neandertal genome and
different populations.
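The two kinds of statistics named in these captions can be sketched as follows. The setup is an assumption made for illustration, not a description of how the plots were produced: each IBD segment is summarized by the proportion of its carriers that belong to one population and by an indicator of whether the segment matches the ancient genome. Pearson's correlation is computed between these two quantities, and a one-sided Fisher's exact test is computed on a 2x2 table of segment counts. The function names and all numbers are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns the hypergeometric probability of observing a or more
    counts in the top-left cell with all margins fixed.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = math.comb(row1 + row2, col1)
    return sum(
        math.comb(row1, k) * math.comb(row2, col1 - k)
        for k in range(a, min(row1, col1) + 1)
    ) / total


# Made-up per-segment summaries: population proportion among carriers and
# whether the segment matches the ancient genome (1) or not (0).
proportion = [0.9, 0.8, 0.1, 0.2, 0.7, 0.0]
matches = [1, 1, 0, 0, 1, 0]
print("Pearson r:", round(pearson(proportion, matches), 3))

# Made-up 2x2 table: rows = segment matches / does not match the ancient
# genome, columns = segment observed / not observed in the population.
print("Fisher p (one-sided):", round(fisher_exact_greater(3, 1, 1, 3), 4))
```

A library routine would normally be preferred over these hand-written functions; they are spelled out here only so that the sketch has no dependencies.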