Sunday, October 04, 2015

Genetic variation in human populations

The Human Genome Project produced a high-quality reference genome that serves as a standard to measure genetic variation. Every new human genome that's sequenced can be compared with the reference genome to detect differences due to mutation. It's possible to build large databases of genetic variation by sequencing genomes from different populations. Genetic variation can be used to infer evolutionary history and to test theories of population genetics. Detailed maps of genetic variation can also be used to infer selection (selective sweeps) and distinguish it from random genetic drift.

In addition to this basic science, the analysis of multiple human genomes can be used to map genetic disease loci through association of various haplotypes with disease. The technique is called a genome-wide association study (GWAS). The same technology can be used to map other phenotypes to identify the genes responsible.

The 1000 Genomes Project Consortium has just published its latest results in a recent issue of Nature (Oct. 1, 2015) (The 1000 Genomes Project Consortium, 2015; Sudmant et al., 2015). They looked at the genomes of 2,504 individuals from 26 different populations in Africa, East Asia, South Asia, Europe, and the Americas.

The idea is to identify variants that are segregating in humans. Single nucleotide polymorphisms (SNPs) are difficult to identify because the error rate of sequencing is significant. When comparing a new genome sequence to the reference genome, you don't know whether a single base change is due to sequencing error or a genuine variant unless you have a high-quality sequence. Most of the 2,504 genome sequences are not of sufficiently high quality to be certain that the false positive rate is low, but by sequencing multiple genomes it becomes feasible to identify variants that are shared by more than one individual within a population.
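Here's a rough sketch of why shared variants are more believable than singletons. It assumes that sequencing errors are independent between genomes and that a miscalled base is equally likely to be any of the three other bases. The error rate below is a hypothetical value chosen for illustration, not a number from the paper, and real variant callers are considerably more sophisticated than this.

```python
# Rough sketch: the same false base call is unlikely to recur in several genomes.
# ASSUMPTIONS (not from the paper): errors are independent between genomes, and
# a miscalled base is equally likely to be any of the three other bases.

HYPOTHETICAL_ERROR_RATE = 0.003  # illustrative per-base error rate only


def prob_same_false_call(error_rate: float, n_genomes: int) -> float:
    """Probability that the same wrong base is called at one site in all n genomes."""
    per_genome = error_rate / 3  # chance of one specific wrong base in one genome
    return per_genome ** n_genomes


for n in (1, 2, 3):
    print(n, prob_same_false_call(HYPOTHETICAL_ERROR_RATE, n))
```

Under these assumptions, each additional genome showing the same change multiplies the chance of a shared error by the per-genome probability, so it drops off very quickly.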

Recall that every human genome has about 100 new mutations so that even brothers and sisters will differ at 200 sites. The 1000 Genomes Consortium looks at the frequency of alleles in a population to determine whether the genetic variation is significant. They use a preliminary cutoff of 0.5%, which means that a variant (mutation) has to be present in 5 out of 1000 genomes in order to count as a variant that's segregating within the population. They estimate that 95% of SNPs meeting this threshold are true variants. For small insertions and deletions the accuracy is about 80%.
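The cutoff arithmetic can be sketched in a few lines. This follows the simplification used above (counting genomes that carry a variant, not individual chromosomes), and the function names are made up for illustration; this is not the Consortium's pipeline.

```python
import math
from fractions import Fraction

# The 0.5% preliminary cutoff described in the post, kept as an exact fraction.
CUTOFF = Fraction(5, 1000)


def min_carriers(n_genomes: int, cutoff: Fraction = CUTOFF) -> int:
    """Smallest number of carrier genomes needed to reach the cutoff."""
    return math.ceil(cutoff * n_genomes)


def is_segregating(carriers: int, n_genomes: int, cutoff: Fraction = CUTOFF) -> bool:
    """True if the variant's frequency among the genomes meets the cutoff."""
    return Fraction(carriers, n_genomes) >= cutoff


print(min_carriers(1000))  # 5
print(min_carriers(2504))  # 13
print(is_segregating(4, 1000))  # False
```

With the full set of 2,504 genomes, the same 0.5% cutoff works out to 13 carriers under this simplified counting.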

For variants at lower frequency, additional sequencing to a depth of >30X coverage was done and the putative variant was compared against other databases of genetic variation. The predicted accuracy of variants at 0.1% frequency is about 75%.
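To get a feel for why deeper coverage helps, here's a minimal sketch. It assumes read depth at a site follows a Poisson distribution, that a heterozygous site gets half its reads from each allele, and that a caller wants at least two reads supporting the variant. All three are simplifying assumptions of mine, and the low-depth value of 4X is purely illustrative.

```python
import math


def prob_at_least_k_alt_reads(mean_depth: float, k: int = 2) -> float:
    """Chance of seeing >= k variant-supporting reads at a heterozygous site.

    ASSUMPTIONS: Poisson-distributed depth; half the reads come from the
    variant allele; sequencing errors are ignored.
    """
    lam = mean_depth / 2  # expected reads from the variant allele
    p_fewer = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return 1 - p_fewer


for depth in (4, 30):
    print(depth, round(prob_at_least_k_alt_reads(depth), 4))
```

Under these assumptions a heterozygous variant is missed fairly often at low depth but almost never at 30X.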

Given those limitations, the results of the studies are very informative. Looking at single base pair changes and small indels (insertions and deletions), the typical human genome (yours and mine) differs from the standard reference genome at about 4.5 million sites. That's about 0.14% of our genomes. Humans and chimpanzees differ by about 1.4% or ten times more.
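The percentages are easy to check. The only number below that isn't in the post is the genome size; I'm assuming roughly 3.2 billion base pairs for the haploid human genome.

```python
# Back-of-the-envelope check of the percentages in the post.
GENOME_SIZE_BP = 3.2e9       # ASSUMPTION: approximate haploid human genome size
VARIANT_SITES = 4.5e6        # sites differing from the reference (from the post)
HUMAN_CHIMP_FRACTION = 0.014 # 1.4% human-chimp difference (from the post)

human_fraction = VARIANT_SITES / GENOME_SIZE_BP
percent = human_fraction * 100
ratio = HUMAN_CHIMP_FRACTION / human_fraction

print(f"typical genome vs. reference: {percent:.2f}%")
print(f"human-chimp difference is about {ratio:.0f} times larger")
```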

SNPs and small indels account for 99.9% of variants. The others are "structural variants" consisting of large deletions, copy number variants, Alu insertions, LINE L1 insertions, other transposon insertions, mitochondrial DNA insertions (NUMTs), and inversions. The typical human genome has about 2,300 of these structural variants, of which about 1,000 are large deletions.

Most of these variants are in junk DNA regions, but the typical human genome carries about 10,000-12,000 variants that affect the sequence of a protein. Many of these will be neutral, and some of the ones that have detrimental effects will be heterozygous and recessive. The average person has 24-30 variants that are associated with genetic disease. (These are known detrimental alleles. If you get your genome sequenced, you will learn that you carry about 30 harmful alleles that you can pass on to your children.)

The Consortium reports that the typical genome has variants at about 500,000 sites mapping to untranslated regions of mRNA (UTRs), insulators, enhancers, and transcription factor binding sites. I assume they are using the ENCODE data here, so we need to take it with a large grain of salt. Most of these sites are not biologically relevant.

As expected, common variants are distributed in populations all over the world. These are the result of mutations that arose several hundred thousand years ago and reached significant frequencies before the present-day populations separated. However, 86% of all variants are restricted to a single continental group. These are the result of mutations that occurred after the present-day populations split.

The African populations contain more genetic variation than the Asian and European populations. Again, this is expected since the European and Asian groups split from within the African group after Africans had been evolving on that continent for thousands of years. The differences are not great: Africans differ at about 4.3 million SNPs while typical Europeans and Asians differ at only 3.5 million SNPs.

Only a small number of loci show evidence of selective sweeps, or recent selection (adaptation). This indicates that most of the differences between local ethnic groups are not associated with adaptation. The exceptions are SLC24A5 (skin pigmentation), HERC2 (eye color), LCT (lactose tolerance), and FADS (fat metabolism).

50 comments:

With 10-12000 variations in the protein coding region from standard (if I have this right) being on average 10% of expected value based on the predicted 1 to 2% of the genome being protein coding regions. (6000 av variations in CR divided by 4.5M in NCR) Do we think this is because of error detection and correction in protein coding regions?

Hi Aceofspades: Attached is a paper that involves an experiment in which the genomes of different grasses (corn, wheat, rice, etc.) are exposed to transposable elements. The percentage that land in the protein coding region is consistent with the percentages above. This experiment is without selection.

I think protein coding regions are more susceptible to receiving TEs because it's easier for TEs to get into euchromatin. I don't think (though could be wrong) that mutation itself is distributed asymmetrically wrt coding/non-coding. Hard to see a mechanism, at least, since exon boundaries are only apparent during transcription/translation, not during replication/recombination where most mutation takes place.

Hi Allan: The data in this experiment were quite clear that the introns were receiving a disproportionate number of MITEs (TEs). Also, the data showed that expression was increased when MITEs were inserted in introns. I am thinking possibly error correction, especially if you think protein coding regions are more susceptible, but I'm not sure.

The experiment shows insertions that are already in the genome and presumably have been there for many generations. Consider an insertion even 1000 generations ago. If it's been selected against, it probably didn't survive to be observed today. There's your selection: not in the present, but in the past. And that's what makes transposons more common in introns than in exons, not some unknown error correction mechanism.

Whoa. I looked at the abstract. That isn't even an experiment. It's a comparison of the genomes of four different grass species. The abstract isn't about shared TEs but about common features of TE distribution, and it never mentions exons at all. At any rate, the TEs in those four species would almost all have been exposed to selection (if there were any) for a very long time, so the distribution reflects selection as well as mutation.

Hi John, the attached paper has data that is closer to what caught my eye. I will contact Susan to see if I can get more detail. The charts I found interesting are just after the abstract. Naito K, Zhang F, Tsukiyama T, Saito H, Hancock CN, Richardson AO, Okumoto Y, Tanisaka T, Wessler SR. (2009). Unexpected consequences of a sudden and massive transposon amplification on rice gene expression. Nature. 461(7267):1130-4.

Hi Allan and John: If you look at the 2009 paper, the data that caught my eye was the insertion bias in the 5' and 3' regions (as Allan mentions) as well as introns and exons. I have written Susan and asked more about her methods to understand what she means by expected value. Allan: I am trying to figure out why the old insertions (blue, pink) and new insertions show the same tendency.

Hi Allan, John: Here is Susan's response. I am looking through the experiments to see if this is consistent with the data. The conclusion is from multiple papers on her website.

With regard to the low % of exon insertions for the mPing element - this is due to the sequence preference of mPing - that is, it prefers to insert into AT rich DNA and rice exons are GC rich. As mentioned in the talk - mPing is a "successful" TE - meaning that it can attain high copy numbers because it causes little harm. One of the reasons it causes little harm is that it avoids insertion into rice exons. In contrast, mPing transposes more frequently into exons in transgenic soybean. The reason for this is that soybean exons are less GC rich than rice exons. So... if mPing were a TE in the soybean genome it is unlikely to be as successful as mPing is in rice. Does that make sense?

It does make sense. And it refutes your claim about repair mechanisms. So, thanks for bringing up an interesting phenomenon, but it shows natural selection acting on TE insertion preferences, as a parasite that doesn't kill its host becomes more successful. Was there some other point you wanted to make about this?

Hi John...this quote from the 2009 paper is a loose end that may be solved in later papers...if I cannot find an answer I will contact Susan again.

<55%) [17]. However, mPing does not avoid the (G+C)-rich 5′ untranslated region and is enriched just upstream of the transcription start site. An understanding of the mechanisms underlying these preferences is beyond the scope of this study as they may be influenced by other factors such as chromatin structure [18,19], which, so far, has not been thoroughly characterized in rice.

John, Allan: this caught my eye from the Jan 2011 Wessler paper... Taxonomic Distribution of Superfamilies. The mapping of the superfamily presence or absence along the eukaryotic tree of life (32, 33) revealed that 15 of the 17 superfamilies exist in at least two of the five eukaryotic supergroups surveyed here (Fig. 3 and Table S2). Because there is little evidence for the horizontal transfer of TEs between eukaryotic supergroups, this distribution strongly supports the view that the origin of most, if not all, superfamilies predates the divergence of eukaryotic supergroups (34)

Just sharing interesting stuff. I agree error correction is unlikely at this point because the TEs appear to migrate to the proper location. Here is the link: http://wesslerlab.ucr.edu/pdf/yuan_pnas.pdf

Identical twins would have the same mutations, but each individual child would have its own set of mutations because the eggs and sperm are each the product of many independent replications. They might share a few mutations that occurred before segregation of the germ line cells, and in some cases the two egg cells or the two sperm cells might have a very recent ancestor in the germ line.

However, as a general rule, each brother and sister will have about 100 new mutations and they would differ at about 200 sites.

But Larry, you forget to count the genetic variation due to meiosis (that is, a random sampling of maternal and paternal alleles), which it seems to me is going to completely swamp those new mutations. The brother and sister will differ at 200 sites from new mutations, but lots more from simple inheritance.

the typical human genome (yours and mine) differs from the standard reference genome at about 4.5 million sites. That's about 0.14% of our genomes. Humans and chimpanzees differ by about 1.4% or ten times more.

So the differences between humans and chimps are 10x larger than the differences between the average person and the reference genome. If Young Earth creationism were true, and all variations within humans evolved since Adam and Eve 6,000 years ago, then it would take at most 10x longer for chimps and humans to diverge from a common ancestor. So if YEC were true, it would take just 60,000 years for human and chimp to diverge from a common ancestor.

Ah but even though we humans differ among ourselves by 10% of the human-chimp difference, we are all still humans, aren't we? This PROVES that genes don't matter, as comrade Wells has been saying all along, therefore GOD!

Are they talking about the same part of the genome? That is, the 1.4% refers to 'protein coding' regions, and the 0.14% refers to the whole thing? I read somewhere that the ape genome is 10% larger than the human, and most comparisons just look at the subset of DNA that is 'protein coding'. I could be wrong, just wondering.

As a YEC I can address one point. If humans change because of different mechanism(s), then the genetic fingerprints would show this also. So it's not the only option that African genes are most different because of time. It would also be that way if, upon migration to Africa from somewhere else, they changed the most. They had the most attributes change relative to an original common single human tribe. Simply that. The rest of mankind changed, but less so, due to the area's influence. All that is seen is a genetic score. The reason for it is not seen in the score. There are creationist options to explain these things.

Every creationist of every flavor throughout the history of humans has conjured up a version of impossible fairy tale "creationist options" or believed in a version that someone else conjured up. Your version of "creationist options" is just one of billions, and there's no evidence that supports yours or any other.

The Whole Truth: I wish it was billions, but not my crowd. The only evidence or data anyone works with is a genetic score, a summary of the genes' place right now. Then it's speculated/investigated/concluded that this is how and why there are genetic differences in different degrees of comparison. That's all. So some say man came from Africa, and time is the origin of more gene differences in Africa in comparison to the rest of mankind. Yet the obvious thing is how different Africans look. All men look different, yet there is, I say, more variety of attributes in looks amongst Africans compared to mankind. Then consider the idea of a common single human group, with the Africans actually migrating into Africa and being influenced by the area. This influence, by mechanism(s), produced more genetic change quickly and so more gene difference. Another point would be that it was not a single group that moved to Africa but many groups of people with already different languages. This also would make more gene diversity. The rest of us also were influenced into genetic change, but it was less, as the areas we migrated to didn't have such needs for adaptation. I think this makes more sense, debunks this gene uniformity concept, and stays within Genesis boundaries. Anyway, it's all a genetic result we look at. Then figuring it out. Other option(s).

I've been trying to read your blog and keep my mouth shut but it's hard. Professor Moran, what is the difference between the Discovery Institute and the Sandwalk blog? Neither of you do any science you can claim as your own. You both take the work of others and twist it to fit your take on the discussion. You have about 50 consistent followers, as do they, all the while the real scientific community doesn't care. It's a miracle from God himself that glass houses could withstand this punishment. You all are smart men, much smarter than me. Why are you wasting your gifts? This is crazy, you talk about God's people more than door to door Mormons. So weird.

One of the huge differences is that the IDiots have only one thing in their mind: God-did-it! God-did-it! They're delusional buffoons (and/or make a living out of lying about science).

Then, Larry is a professor who makes a living out of understanding science. Larry writes a textbook on biochemistry for actual students in science. So Larry carries a huge responsibility to present the science as it is, while the IDiots' responsibility is to present twisted representations of science (and other kinds of religious propaganda) that have to please their religious beliefs and those of their followers and donors.

Robert said: "The rest of mankind changed, but less so, due to the area's influence. All that is seen is a genetic score. The reason for it is not seen in the score. There are creationist options to explain these things."

Let's see you explain those things. Although that will only be your personal "lines of reasoning" - as always far removed from reality.

Did they find any parts of the DNA that fixated differently in one group? That is, I have seen ancestry deduced mainly by non-degenerate correlations, not just because it's convenient, but because there is no precedent for a subset of DNA that would prove you are a pygmy, or Khoisan, or Australian aborigine (all mere examples), with 100% certainty? I would think there might be some subset (though, they aren't big among 23andMe users).

Hi Larry, I see you highlight the error rate of the sequencing methods and how the error rate can affect the validity of the results. I have a couple of questions; first, where did you find that “They estimate that 95% of SNPs meeting this threshold are true variants. For small insertions and deletions the accuracy is about 80%”? I have been looking through the paper but can’t find those numbers …

Since the sequencing was done at an average 7.4x depth (meaning they read 7.4 times every single nucleotide on average for every single person) and 65.7× depth for the exome (meaning they read on average 65.7 times the same nucleotide that encodes for a protein of every single person), how important do you think the error rates are? Do you think their algorithms call for variants even when the data cannot confidently call it a variant?

Also, do you agree with the definition that a SNP is just a single nucleotide variant (SNV) with a frequency higher than 1%? Other people use the 0.5% threshold for calling a SNV a SNP, and others don’t even take into account the frequency of that variant to call it a SNP. Thanks for your interesting post! Fernando.

Do you think one of the take-home messages of the articles is that there should not be only one reference genome? I think many of the figures (especially this one http://www.nature.com/nature/journal/v526/n7571/fig_tab/nature15394_SF5.html ) suggest that more info about your genome can be inferred if we used a "regional" or "local" reference genome ...

Along the lines of error rates, how many of the called "single cell mutations" do you think were due to error in this paper? http://www.sciencemagazinedigital.org/sciencemagazine/2_october_2015?sub_id=A6HNdXNVpcYV&folio=94&pg=110#pg110

Well, I think we are talking about different things. To start with, it is not true that copying fidelity has to be "nearly perfect" for survival. In fact, many viruses such as HIV or HCV base their survival on a high mutation rate that makes the immune system fail because of new viral quasispecies. In any case, I was talking about the error rate of the sequencing method (which was addressed in this post as a way of saying the results should be taken with care). At least that was my understanding of this post, and the reason why I asked. In the Science paper I linked, the error rate has to be way higher because they had to amplify single cell genomes.

Laurence A. Moran

Larry Moran is a Professor Emeritus in the Department of Biochemistry at the University of Toronto. You can contact him by looking up his email address on the University of Toronto website.

