
Wednesday, September 30, 2015

The 1000 Genomes paper

To be honest, I'm really looking forward to some papers based on the new Simons Genome Diversity Project dataset. Unlike the 1000 Genomes, it includes samples from a wide range of West Eurasian populations sequenced to at least 30x coverage (see here). But for now, open access at Nature:

The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies.

5 comments:

That Simons project has a great range of samples, but the sample sizes seem small: most European populations, for instance, have just 1-3 individuals. That's possibly the price to be paid for high coverage. I wonder if the coverage will make a big enough difference to make up for it.

Are the Simons genomes experimentally phased at all? Are there parent/child pairs in the data?

I think this is very important information, especially as we start getting into quite rare variants for which imputation will not work.

Millions of small, rare haplotype blocks will be able to shed light on admixture between otherwise very closely related groups. They could even be used to trace back direct lineages like mt and Y haplogroups do.

I think that a few thousand worldwide single-sperm genome sequences would be very informative. You can only get so much data from the comparison of diploid genotypes.

I do find this 1000 Genomes paper to be informative. Because the results are based on complete sequences and not pre-determined SNPs (like Illumina chip data), it doesn't suffer from the ascertainment biases that plague the chip data. Hence, from the K=12 admixture analysis in extended figure 5, one can readily see the separation of Tuscans and Iberians from the CEU and Great Britain samples, and how New World Colombians and Mexicans share ancestry with the Southern Europeans. Also, Yoruba separates from far West Africans, and you can see how African-Americans (ASW) have a somewhat larger ancestry component from Yoruba than from, say, Senegal. This differentiation is difficult to see with, say, 23andMe data.

@Davidski Are you able to run the ancient DNA samples with the 1000 Genomes SNPs that contributed to these new admixtures?

My 1000 Genomes dataset is based on Illumina SNPs that are also found in the 23andMe genotype files. I'd have to download everything again and pull out different SNPs that overlap better with the ancient genomes published so far. I can't do that at the moment.

But the Human Origins dataset that I have shouldn't suffer from ascertainment bias. Human Origins SNPs are specifically picked for population genetics.
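For anyone curious what "pulling out SNPs that overlap" between two datasets amounts to, here is a minimal sketch in Python. It only intersects two lists of SNP identifiers. The rsIDs and the function name are made up for illustration, and a real workflow would probably also need to match on chromosome, position and alleles rather than on IDs alone.

```python
def shared_snps(dataset_a, dataset_b):
    """Return the SNP IDs present in both collections, sorted for stable output."""
    return sorted(set(dataset_a) & set(dataset_b))


# Hypothetical rsIDs, for illustration only (not taken from any real dataset).
thousand_genomes_ids = ["rs0001", "rs0002", "rs0003", "rs0004"]
ancient_genome_ids = ["rs0002", "rs0004", "rs0005"]

overlap = shared_snps(thousand_genomes_ids, ancient_genome_ids)
print(f"{len(overlap)} shared SNPs: {overlap}")
```

The fewer markers two datasets share, the less there is to work with in a joint analysis, which is why the choice of SNP panel matters when ancient genomes are involved.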