[National Library of Medicine. Biotechnology Seminar Series. An Update]
[The Human Genome Project. What Are We Hoping to Learn? Dr. Eric Lander, March 28, 1988]
In a very precise sense, human genetics has come of age in the past couple of years. Now, that humans actually have genetics has, of course, long been known. Indeed, as early as 1900, with the rediscovery of Mendel, it was clear that humans follow precisely the same laws of heredity as all other organisms.
In fact, it was the study of human families, of diseases in human families and their segregation patterns, their inheritance patterns that provided some of the first early convincing proofs of Mendelism. Unfortunately, human genetics has lagged far behind ever since.
The reasons for it, from the point of view of a geneticist, are rather clear. For one thing, there was a lack of an abundant supply of polymorphic genetic markers that you could use to study inheritance in humans the way you do in fruit flies, Drosophila. The second reason, from the point of view of a geneticist: the complete inability to arrange crosses at will.
Even if it weren't unethical to do so, it would be completely impractical: even raising an F2 generation for the experiment would be out of the question. So clearly, this would not be a practical approach. For these reasons, people turned to fruit flies, to maize, to other organisms. And of course, with the development of molecular biology, human genetics had the opportunity to lag even further behind.
[?] the first slide. How do I do the first slide? Do I do that? This platform, it's..forward? Ah, good, fine. Human genetics could lag a bit further, but--you can leave them with some light actually. That's okay.
'Cause of course compared to small genomes like a bacteriophage or E. coli, the human genome is a huge genome and has three billion base pairs of DNA, and it's considerably less convenient to work with than prokaryotic genomes because of the amount of repetitive sequence and other nasty things about it. So, for these reasons, humans, in a sense, have never really been an organism of choice to work on.
But recently, with a number of important technological advances, these limitations have essentially been removed in principle. The sorts of advances I'm thinking about include restriction fragment length polymorphism mapping, pulsed-field gels, rapid fingerprinting strategies for physical maps, and rapid sequencing.
And so, in recent years, it's become apparent that it's now possible to map out the human genome in some real detail. First, at the level of, say, Lewis and Clark, but proceeding all the way straight down to the level of local zoning boards. And that's, in some sense, what the Human Genome Project is about.
Basically, the Human Genome Project, as you heard discussed, is to construct three things: a genetic map, a physical map, and a sequence map, but I'm only going to talk very, very briefly about what these are so I can come back and refer to them.
I know that Charles Cantor in his talk later is going to describe in more detail how one does some of these things. A genetic map would consist of polymorphisms, DNA variations, spaced out along the human chromosomes, say one every one percent of recombination; so this would just be genetic landmarks spaced out along the chromosome.
A physical map would actually be rounding up all the DNA in the genome in the form of overlapping clones, for example. And a sequence map would of course be the ultimate high-resolution map of all the base pairs. Just very briefly, to orient you, since this is a very diverse audience: the genetic map would be composed of DNA sequence variations, polymorphisms. Here you see, for example, a simple DNA sequence variation.
I'm not sure how to control this but I'll try. There we go. Got a little spot here in the joystick to play with. You'll see that there are different patterns of hybridization in different individuals in this family. I'm not so good at playing with this. And you'll notice that they are inherited in a Mendelian fashion here: grandpa passes the pattern on to his daughter, and some but not others of the children inherit that pattern there.
And by working out the inheritance pattern in a simple Mendelian fashion of these sorts of markers, one can assemble maps of human chromosome. Here's a map, for example, of human chromosome 1. I'll come back to later in the talk and tell you about how maps like these are actually constructed for human chromosome. But not just humans, this has been done now in the past couple of years for a number of organisms.
Here, for example, using the same idea of DNA differences, is a map based on these restriction fragment length polymorphisms of Arabidopsis thaliana, a small plant that's becoming increasingly important as a genetic organism. And indeed, within the last year, RFLP maps of this sort, genetic maps of this sort, have become available for humans, for Arabidopsis, for maize, for lettuce, for green pepper, rice, and tomato.
And I'm sure I have missed some already. And in fact, later in the talk, I'll talk to you about tomatoes to some extent 'cause I think tomatoes have much to say about the human genome. There is also the question of physical maps. And here, again, I'll just say a few words about physical maps. Physical maps largely can be made in a whole bunch of ways and Charles Cantor will say a lot about them.
For the moment, suffice it to say that if one knows a lot about the physical structure of various pieces of DNA, for example, if you knew detailed restriction maps of those pieces, you could assemble, like a jigsaw puzzle, a physical map of a genome.
This has not yet been done for any part of the human genome in detail, but it's been done in its entirety for E. coli, mostly done for yeast, and in very large part for Caenorhabditis elegans, the nematode worm; projects are underway, and Charles will tell you a lot about making physical maps in humans and other organisms. And then finally, the most detailed map of all, a sequence map.
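The jigsaw-puzzle idea can be sketched in a few lines. Assuming, purely for illustration, that each clone is summarized by a "fingerprint" of restriction fragment lengths, overlapping clones can be recognized because they share fragments; the clone data and the overlap threshold here are invented, not taken from any real project.

```python
# Sketch of fingerprint-based contig assembly (illustrative only).
# Each clone is summarized by the lengths of its restriction fragments;
# clones sharing many fragment lengths probably overlap on the chromosome.

def shared_fragments(fp_a, fp_b):
    """Count fragment lengths common to two fingerprints (multiset overlap)."""
    remaining = list(fp_b)
    count = 0
    for frag in fp_a:
        if frag in remaining:
            remaining.remove(frag)
            count += 1
    return count

def overlaps(fp_a, fp_b, threshold=3):
    """Declare overlap when enough fragments coincide (threshold is arbitrary)."""
    return shared_fragments(fp_a, fp_b) >= threshold

# Hypothetical clones; fragment lengths in kilobases.
clone1 = [2.1, 4.7, 5.3, 9.0, 1.2]
clone2 = [5.3, 9.0, 1.2, 6.6, 3.3]   # shares three fragments with clone1
clone3 = [7.7, 8.8, 0.9, 2.5, 6.1]   # shares none

assert overlaps(clone1, clone2)       # build into the same contig
assert not overlaps(clone1, clone3)   # keep separate
```

Real fingerprinting schemes must also correct for fragment lengths that match by chance, which is why the threshold matters.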
Here is random sequence from a sequencing gel of the human genome. Enough gels like this one will contain the three billion base pairs of DNA. Now, of course, to do this, there's been increasing interest in building machines and automation and technology to do this sort of sequencing. Here, for example, is a picture of a sequenator. This one comes from DuPont. Applied Biosystems made one.
Other companies are making sequenators. And in some sense, the Human Genome Project has the quality of a factory to it, in a sense, you might think, turning out maps with these genetic markers and then fingerprinting lots of clones to make the jigsaw puzzle of the physical map. And then, running through zillions of sequenators and getting all the ATCGs of the human genome.
And that's, in some sense, what the Human Genome Project is about, characterized by the sequenator there. But that's not what I want to talk about this morning. What I'd like to talk about instead is: why do we want to do this? What in the world are we hoping to learn by running this sequenator overnight for zillions of nights?
Well, in short, what we're really hoping to learn through a project like this is the location of all the genes underlying the physiological and developmental traits of interest in humans and other higher organisms. And once we learn where the genes are, for each one of them we want to understand exactly the biochemical mechanisms by which the gene products function.
That's a tall order. And unfortunately, to be perfectly honest, the Human Genome Project provides none of the answers to that question. In fact, if you had the entire sequence map of the human genome available today, you could not simply, by looking at it, know where the gene for cystic fibrosis was.
And if I told you where the gene for cystic fibrosis was, you could not by looking at that gene, in all likelihood, tell me what its function was, and how in the world various mutations produce various different deleterious consequences in individuals. What's really needed is to recognize that the Human Genome Project itself and what it's going to produce is simply a tool.
And there's another machine that still has to get built in addition to the sequenator. That machine is a little bit less well defined. But here's my version of it. It's the functionator. That's what we really gotta build, is the functionator.
It's shown here in a--in just a slightly schematic form with the little chromosomes coming off the conveyor belt going to the functionator, bells and whistles turn, and out pops function from the other end. What I'm going to try to do is give you a sense of what the components of the functionator are.
But that's really, I think, the challenging project that a conference such as the one organized here today is facing is how to assign function to various parts of the genome. Now, I'm going to describe two ways, two components to building the functionator. One concerns genetic mapping.
Using the tools of the human genome--the genetic map, the physical map, and the sequence map--to figure out what parts of genomes are interesting, what parts of genomes contain genes that affect various traits of interest. And since my own lab works primarily on genetic mapping, I'll concentrate for the majority of this talk on the subject of genetic mapping.
But the other subject I'll touch on in terms of an important component to the functionator is sequence mapping as well, recognizing patterns in DNA and protein sequence. And in the latter part of the talk, I'll describe at least what I think the challenges are for building the functionator as it pertains to sequence.
Well, okay, let's just dive right in, and I'll tell you about genetic mapping and how one may be able to use some of the tools we're hoping to build with the Human Genome Project to map interesting traits and find out where things really are for traits of interest. Now, to do this, I've got to back up and give you a little more background on these restriction fragment length polymorphisms, these RFLPs.
It was proposed in 1980, in a very important theoretical paper, that what we could use for genetic markers in humans would be a little different from what's used in, say, fruit flies or other organisms, where one uses visible mutations, curly wings or yellow body in Drosophila, for example.
It was proposed that in humans, probably the best genetic markers would be DNA sequence itself. In fact, while our DNA sequences are extraordinarily similar, we do differ at about one base in a thousand. And those differences are easy to recognize when they land in restriction sites and destroy those restriction sites.
I've shown here an example of a chromosome, a pair of chromosomes, one maternally inherited, one paternally inherited. I'm trying to find my little red dot. Oh, there's my little red dot.
I may give up on the little red dots here. You can see that on one chromosome there's a single base pair change, compared to the other chromosome, that destroys the site for a restriction enzyme; so the restriction enzyme would cut the top chromosome into two pieces and the bottom chromosome into one piece. And if somebody had two copies of the light gray chromosome, you would see two little bands.
Two copies of the dark gray chromosome, you'd see one big band. And one copy of each, a heterozygote, you'd see one big band and the two little bands there: the various different alleles produced by the RFLP. These are perfectly normal Mendelian genetic traits that'll be passed on to the children. On average, about half the children will get the light gray and half the dark gray pattern there.
They're just simply more boring to score than your traditional genetic markers, but they work just fine. In fact, here again is an example of just that. The previous slide showed a Macintosh version. This is a real version of such markers segregating in a family with four grandparents, two parents and a bunch of children.
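The scoring just described can be written down as a toy sketch. The fragment sizes and allele names here are invented for illustration; the point is simply that the three genotypes produce three distinguishable band patterns on a gel.

```python
# Illustrative sketch: band patterns produced by an RFLP.
# Allele "cut" carries the restriction site (enzyme cuts: two small fragments);
# allele "uncut" has lost the site (one large fragment). Sizes in kb, invented.

BANDS = {"cut": [3.0, 2.0],   # site present: two little bands
         "uncut": [5.0]}      # site destroyed: one big band

def band_pattern(allele1, allele2):
    """Bands seen on the gel for a genotype (union of both alleles' fragments)."""
    return sorted(set(BANDS[allele1] + BANDS[allele2]), reverse=True)

assert band_pattern("cut", "cut") == [3.0, 2.0]          # two little bands
assert band_pattern("uncut", "uncut") == [5.0]           # one big band
assert band_pattern("cut", "uncut") == [5.0, 3.0, 2.0]   # heterozygote: all three
```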
Well, in fact, why are these markers useful to use in the functionator? Well, if we had a family that had a genetic disease, for example, a dominant genetic disorder here. Dad is affected with a dominant genetic disorder up there. You'll notice that four of his seven children have inherited this genetic disorder from him.
And you'll also note that, conveniently, dad is heterozygous for that restriction fragment length polymorphism, that genetic variation I showed you. He's got one copy of the top allele and one copy of the bottom pair of bands, which is the other allele on the other chromosome. Mom, conveniently, is a homozygote.
We're not interested in her in this particular example. And you'll note that all of the children who got the genetic disease also got the bottom pair of bands, that bottom allele, from dad. None of the children who didn't get the disease picked it up; they all picked up the other allele.
And so, out of seven opportunities for the RFLP and the disease to separate from each other, to segregate away from each other--I've just lost my image here, good--to separate away from each other, there were no crossovers observed. And so in fact, one suspects that this restriction fragment length polymorphism, this genetic marker is probably nearby the disease gene.
So that's the first basic, simple use of RFLPs in the functionator: by being very patient and trying RFLPs one at a time, you'll eventually stumble upon one that maps nearby a disease gene, and you'll recognize it by the correlation of inheritance patterns here. Well, in fact, I'll mention, because this is a conference partly of computer scientists, that there's actually a mathematical formalism for properly doing this.
Let's go back to the previous slide. This goes in reverse--yeah, it does. You look: seven opportunities for a crossover, no crossovers observed, perfect segregation. Is that enough to be sure that the disease gene maps here? Well, in fact, the formalism for doing this really wouldn't be necessary in an organism like fruit flies, where you can just count crossovers.
And that would be a very easy way to follow inheritance. But in human genetics, because it's rather complicated to follow inheritance in families, a formalism was developed by R.A. Fisher, J.B.S. Haldane, and various mathematical geneticists over the course of this century, and it looks something like this.
It's based on the method of maximum likelihood. You say, suppose the genetic marker that we observed in the previous slide is actually linked to the disease. Well, if it is linked to the disease, you work out the probability that the data you saw would've arisen if the markers are linked. Then you also say, well suppose they're not linked, let's work out the probability, I would've seen this data if the markers are not linked.
And you say, maybe there's a 1 in 100 chance I would've seen this exact data if the markers are linked. Maybe 1 in 10,000 chance I would've seen this exact data if the markers are unlinked.
And you just compute the odds ratio: the ratio of how much more likely it is that this data would've arisen from linked markers than from unlinked markers. In that case, it'd be about 100 to 1. Now, human geneticists essentially keep collecting information until the odds ratio is overwhelming for the hypothesis that these markers are in fact linked.
In order to confuse people, human geneticists report not the odds ratio but the log base 10 of the odds ratio. And in biology, that's usually sufficient to confuse most people. So in fact, the only real question is how high does this odds ratio or this log called the LOD score have to get before one is impressed that one's found linkage.
And here, there is a hard and fast standard: in order to publish in Nature, you need a LOD score of 3, corresponding to 1000-to-1 odds; that's the convention everyone's agreed upon. And it's worth noting, as would perhaps be obvious to this audience, that people constantly write in their papers when they get a LOD score of 3, "Linkage is therefore a thousand times more likely than non-linkage, because we got a LOD score of 3."
That, of course, is not the case. What it really means is that the data is 1000 times more likely to have arisen from a pair of linked markers than from unlinked markers. But you should always remember that unlinked markers are about 50-fold more common than linked markers. If I pick two markers at random in the human genome, they're likely to be on separate chromosomes; i.e., I'm 50-fold enriched for false positives.
And so, in fact, if I have a 1000-to-1 odds ratio but I'm 50-fold enriched for false positives, then after I get a LOD score of 3, what I really have is about 20-to-1 odds in favor of linkage. So, to put it in straightforward terms, about 1 in 20 of those papers published in Nature would need to be retracted. It's all a probabilistic business in essence.
But if you get to a LOD score of, say, 4, it means you've got a 10,000-to-1 odds ratio, or about 200-to-1 odds truly in favor of linkage. So that's essentially the mathematics underlying it, a very simple version of it. But in fact, that's what one really needs to evaluate the significance.
The previous family we looked at is not quite significant, but three such families would be more than enough to declare a linkage to this dominant disease.
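The arithmetic behind those statements can be checked directly. This is a minimal sketch of the two-point LOD calculation at zero recombination, plus the prior-odds correction just described; it is not how full linkage-analysis programs work, just the back-of-the-envelope version of the lecture's numbers.

```python
import math

def lod_no_recombinants(n_meioses):
    """LOD score for n informative meioses with zero recombinants.
    Likelihood under tight linkage (theta = 0) is 1; under no linkage,
    each meiosis co-segregates by chance with probability 1/2."""
    linked = 1.0
    unlinked = 0.5 ** n_meioses
    return math.log10(linked / unlinked)

one_family = lod_no_recombinants(7)    # the seven-child family shown
assert 2.0 < one_family < 2.2          # ~2.11: short of the LOD-3 standard

three_families = 3 * one_family        # independent families: LOD scores add
assert three_families > 3              # ~6.3: comfortably significant

# Correcting the odds ratio for the ~50-fold excess of unlinked marker pairs:
posterior_odds = (10 ** 3) / 50        # LOD 3 -> 1000:1 ratio -> ~20:1 odds
assert posterior_odds == 20
```

This is why one family with seven informative meioses is "not quite significant," while three such families are more than enough.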
And by straightforwardly doing it, people have been able to map, in this fashion, going one at a time with RFLPs until they were lucky, Duchenne muscular dystrophy, Huntington's disease, cystic fibrosis, retinoblastoma, adult polycystic kidney disease, familial Alzheimer's disease, manic depression in a single large Amish pedigree, familial colon cancer, von Recklinghausen neurofibromatosis, bilateral acoustic neurofibromatosis, and multiple endocrine neoplasia type 2A.
And in fact, this list is already out of date: von Hippel-Lindau syndrome was mapped last week, and a couple of other things have been mapped since I made this slide only three weeks ago. This has, in fact, been a tremendously productive period. Once one has genetic markers linked to a disease, what can one do? Well, it makes possible prenatal diagnosis of these diseases, or presymptomatic diagnosis of these diseases.
In addition, it provides you with pieces of DNA nearby the disease so that you can actually go and begin to try to clone the disease gene itself. And here, I'm sure Charles will talk about the use of physical maps in moving from a linked marker down the chromosome to a disease gene itself.
And even before you can do any of those things, you'll learn something just from where the position of the disease gene is in the human genome. What can we learn from position? Well, we can learn, for example, that Alzheimer's disease is not caused by a defect in the beta amyloid protein. In fact, it was suspected about a year ago that this might be the case.
Because biochemists had isolated the gene--oh wow, yes--biochemists had isolated the gene encoding the beta amyloid protein, which accumulates in large amounts in the brains of individuals with Alzheimer's, and they found it mapped on chromosome 21; the gene was there.
And in fact, various evidence, including the fact that individuals with Down syndrome show a condition that looks like Alzheimer's when they get to about age 40, made people think that the beta amyloid gene probably must be the gene that causes Alzheimer's. But in fact, it's not.
Careful genetic mapping showed that while they're near each other, the gene for familial Alzheimer's is perhaps 20 to 30 percent recombination away; that's why it's drawn to the north on the chromosome there. And so we know immediately that no matter how good an argument the biochemists can marshal from the fact that this gene is on chromosome 21, like you'd hope, and that its product is there in high amounts in Alzheimer's brains, it cannot be the gene for Alzheimer's.
That's one thing we learned from position. Another thing we learned from position is that manic depression, bipolar affective disorder, is not a single gene disease. It is not a homogeneous disorder with a single cause. In fact, genetic markers, these RFLPs, were shown to be linked to manic depression on human chromosome 11 in a large Amish pedigree.
In a very large family of Amish people with bipolar affective disorder, markers on chromosome 11p were shown to be linked to the disease in that family. But in fact, in three other large pedigrees, linkage of these markers to bipolar affective disorder was ruled out. So we know that another gene must be causing the disease in those pedigrees.
And in a collection of Jerusalem pedigrees, linkage to some markers on the X chromosome was found. And so, without having any idea what the cause of bipolar affective disorder or manic depression is, we know automatically there are at least two genes, probably three genes that have forms that can produce manic depression. And this is, of course, not to say there aren't non-genetic forms as well.
So we learn all this from position right away. Well, given those sorts of successes, being able to map the fundamental locations of a lot of single gene disorders, being able to discriminate various hypotheses about diseases, like, for example, noting the heterogeneity in bipolar affective disorder, it seems appropriate to ask: how far can we hope to go in unraveling human heredity, in understanding human heredity, based on these genetic markers?
How powerful a functionator can we hope to build? Well, this is a page taken from Victor McKusick's catalog, "Mendelian Inheritance in Man." It's a marvelous book. It's not 4,000 pages, but it's about this thick, and it has 4,000 genetic diseases in it: traits or diseases known or suspected to show Mendelian inheritance.
Now, this is a fabulous catalog. It includes cystic fibrosis and some of the diseases I put up there. But you will note if you flip through this catalog that the vast majority of the diseases in the McKusick catalog don't show the simple genetic inheritance of cystic fibrosis or Huntington's disease.
In fact, they show genetic complexities in their inheritance. Some of them are highly heterogeneous. Others have incomplete penetrance or involve interactions between genes at different loci. There are some which are, in essence, predispositions or susceptibilities. Some of them refer to clear and simple traits.
Some of them to more complex traits, like psychiatric disorders or physiognomy. Some of these traits are of tremendous medical importance. Some have no medical importance whatsoever. My favorite trait in the entire McKusick catalog: the genetic inability to smell cyanide. It's a fascinating trait. And you should think about how one picks up mutants for the inability to smell cyanide.
In fact--I mean, you laugh. It's a serious thing. Humans are probably the best system in which to study the molecular biology of smell. Very little was known about the molecular biology of smell. But humans are in fact quite polymorphic for their ability to smell things and you don't have to train them to tell you as you do with say a rat or a monkey or a fruit fly whether they smell something, they'll simply tell you.
So in fact, if one wanted to work out the molecular biology of smell, humans would be the right organism to do it in. But I digress. In order to map genetic complexities in most other organisms, one would set up elaborate crosses; geneticists have ways of doing it, but they do involve setting up crosses, and we can't do that in humans.
And so, one question we've been very interested in is building a stronger functionator, a stronger method for actually being able to map complex traits. Not only would you have to set up crosses, but of course, you often need large samples, large supplies of families.
And another problem we've noticed by looking through the McKusick catalog is that a lot of the most interesting diseases here are quite rare. There are a couple of cases. Some isolated instances here, a small family there, and it's going to be hard to round up very large pedigrees of the sort I showed you before even where you have seven children descended from an individual with a genetic disease there.
So, some of the questions we've been thinking about are: how is it possible to map traits when you can't set up crosses? How can we exploit the power of a genetic map to do that? Well, the key idea underlying the things I'll mention is that instead of using these RFLPs, these genetic markers, one at a time, and looking for a simple correlation between a single marker and a disease, like, for example,
that dominant disease slide I showed you before, a more powerful approach would be to use an entire linkage map of the human genome simultaneously. That is, genetic markers spaced out evenly throughout the human genome, such as the Human Genome Project is aiming to construct; and there are ways to squeeze more information out of such a linkage map than by using markers one at a time.
Let me tell you what I'm thinking about in terms of ways to exploit the full power of linkage maps. Let's see. What are those lights doing up there? Are those the slides? [?] Oh, okay. Maybe we could bring up the house lights a little bit 'cause they're glaring and I can't read anything.
Who's controlling back there? Maybe we can bring up the lights a bit? Well, maybe not. We'll see. Okay. The sorts of complexities I'm thinking about, the sorts of challenges I'm thinking about for mapping, using these sorts of maps, include complex diseases in humans. And let me just immediately turn to tell you about that.
The complexities I think about are genetic heterogeneity, incomplete penetrance, gene interactions, predispositions. And just to give you an example, let me talk to you about the problem of genetic heterogeneity.
We saw an example of it with manic depression already. When I say genetic heterogeneity, what I mean is mutations at many different loci, where a mutation at any one of these loci could cause the disease we're thinking about.
So for example, in a biochemical pathway where product A goes to product B goes to product C goes to product D, mutations in any one of the gene products underlying any of those steps would disrupt the pathway and produce the same disease to a first approximation.
It's a very common thing in lower organisms. In yeast, the inability to grow without histidine, for example, is a heterogeneous genetic disorder; that, at least, is what a human doctor would say about it.
A yeast geneticist, of course, would be able to know what the problem was, because he'd be able to perform a cross between the two yeasts that can't grow without histidine and say whether or not it was the same gene, by virtue of whether or not they could complement each other.
But the problem in mapping this in humans, as opposed to any other organism, is that when I have two families that come in with the same disorder, I don't know if it's the same gene causing it, and I can't perform a cross to find out. That means that the evidence, this sort of LOD score evidence, this likelihood evidence I build up for linkage in some families, will be offset by evidence against linkage in other families.
And so, I really won't make very much progress in mapping my disease. Well, here's an example of where one can use some mathematical tricks to help increase the power of the functionator.
Let me give you a simple example of it. Here's some phony data from a phony genome. Suppose I had already constructed a genetic map of the genome, with intervals A, B, C, D, all the way up to I, and I can follow the genetic markers flanking them, so that I can see their inheritance. When the flanking markers are inherited together, without a crossover or recombination between them, the only way a disease gene lying in the middle couldn't be inherited with them would be if a double crossover occurred to take it out, and that's going to be a relatively rare occurrence.
So that means that if I see 20 opportunities in which I can study the inheritance of a disease gene, and my markers are fairly close together, the chance of a double crossover, in which the disease gene won't co-segregate, is very small. So I would expect to see no crossovers. But here's the data: I see several crossovers. It's pretty lousy, actually. The best is interval F there: I see four crossovers, and that's an extraordinarily high number.
I'd expect to see zero, possibly one, but not even likely to be one. What could be going on here? Probably, genetic heterogeneity. See, if in fact the disease gene mapped in interval B, then I would expect no crossovers with interval B, and 10 on average out of 20 with all the other intervals.
If it mapped in interval F, I would expect no crossovers with interval F, and on average about 10 with the other intervals. But what if half the families I was looking at segregate for a disease gene in interval B and the other half segregate for a disease gene in interval F?
Well, then I'd expect, the bottom here, about five crossovers with interval B, about 5 with interval F, and I look at the bottom line and I look at the top line and I say, "Eh, similar, " and I'd have a hard time convincing anyone except my mother that this was in fact the right explanation for the data.
But that's of course because one's asking the wrong question. One's asking the traditional question: how tightly linked does the disease stay with interval B, or how tightly does it stay with interval F? In fact, a better question to ask is: how often does the disease segregate away from both intervals B and F?
In other words, simply make a joint hypothesis, not about just one interval of the genome, but about two intervals of the genome. And there we see that while it gets away from B sometimes and it gets away from F sometimes, it never gets away from both. So one more powerful use of a genetic map is that it begins to let you make simultaneous hypotheses, joint hypotheses, about multiple regions of a genome.
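This joint-hypothesis scoring can be sketched in a few lines. The data below are invented to mimic the phony genome just described: half the meioses come from families whose disease gene lies in interval B, half from families linked to interval F, with the unlinked interval recombining about half the time.

```python
# Sketch of "simultaneous search": score each meiosis for whether the disease
# separated from interval B and from interval F. Data are invented: half the
# families are linked to B (never separate from B), half to F.

# (separated_from_B, separated_from_F) for 20 meioses.
meioses = (
    [(False, True)] * 5 + [(False, False)] * 5    # families linked to B
    + [(True, False)] * 5 + [(False, False)] * 5  # families linked to F
)

crossovers_B = sum(b for b, f in meioses)
crossovers_F = sum(f for b, f in meioses)
crossovers_both = sum(b and f for b, f in meioses)

# Asked one interval at a time, neither interval looks convincingly linked...
assert crossovers_B == 5 and crossovers_F == 5
# ...but the joint hypothesis "B or F" is never violated.
assert crossovers_both == 0
```

The single-interval counts (five apparent crossovers each) look like noise, while the joint count of zero is exactly the signature of heterogeneity between two loci.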
And that's, in fact, a very powerful thing. One can actually roll up one's sleeves and do a calculation about how much that helps. I've simply got a graph here of how many families, each with three affected individuals, one would need to collect for a dominant disease in order to map a heterogeneous genetic disorder, one which, say, was caused by any of four different genes.
You'll see there on the graph--I don't think I can get my joystick to--oh, maybe I can. There we go. You'll see that if it was in fact caused by four genes, each equally often, you'd need about 20 families if you could form joint hypotheses about multiple regions of the genome.
But if you're having to follow markers one at a time through the genome, you'd need somewhat over 80 families in order to map a heterogeneous genetic disorder. Well, okay. That's one use of genetic maps to find function in a more powerful way, to make simultaneous hypotheses; I'll call it simultaneous search of a genome.
Another challenge that's important for mapping, we'll come back to simultaneous search in a second, is that even with simple inheritance, not complex inheritance where one would have to wield the genetic map in fancy ways, even with simple inheritance, we frequently have very few cases. We have isolated instances of an individual here or an individual there, but not the sort of large families we need to do mapping.
How can we make a more powerful functionator? How can we more powerfully exploit the information in a human genetic map? Well, population genetics, long the province of the non-experimental biologist, is in fact a very, very powerful way to do such gene mapping once maps of the human genome are available. Let me tell you what I'm thinking about as an example of how a genetic map can be used to unlock information in human populations.
Here's a whole bunch of pedigrees of individuals with Bloom syndrome. Bloom syndrome is a rare recessive disorder characterized by a great deal of chromosome breakage and reunion. It's common among Eastern-European Jews...not common, it's only found among Eastern-European Jews and Gypsies.
And in fact, it's a rare disease there, but it's the only population that has it. And you'll note that all the pedigrees I've shown you show consanguineous marriages. Marriages between two relatives producing children affected with Bloom syndrome. Well, in fact, this is a very common observation that rare recessive diseases frequently involve inbreeding.
In fact, in 1902, the physician Archibald Garrod noted--oops--noted that a very large fraction of his patients with a disease, alkaptonuria, were the products of marriages between relatives. And Garrod said, "Hmm, I have no explanation for this," but it was 1902 and one could write to the Lancet in those days, and say, "Dear Lancet, a very large fraction of my patients with alkaptonuria are the products of marriages between relatives."
And it was a good thing he did, because of course 1902 was the rediscovery of Mendel in Britain, and Bateson then reads the letter in the Lancet and replies to Garrod a few months later, and he says, "Garrod, that's right. That's exactly what you'd expect under Mendelism, because of course marriages between relatives are excellent ways to become homozygous for recessive disease genes."
And in fact, that observation has held up very well with the corollary that the rarer the disease gene, the more common this occurrence will be because of course the chance the two disease alleles will meet each other in the general population is Q times Q, Q squared where Q is the allele frequency. But the chance that they'll meet each other again in the pedigree, once they're in a pedigree that's inbred, is linear in Q.
So of course, for rare diseases, the chance that they'll find each other being quadratic, the chance--in the general population versus linear in a given inbred pedigree means that inbred pedigrees will be greatly enriched for these sorts of rare recessive diseases, and indeed 40 percent of all albinos are the products of consanguineous marriages.
Eighty percent of all individuals with the disease, Brazilian acropathy, are the products of consanguineous marriages. Twenty five percent of all individuals with Tangier disease are the products of consanguineous marriages.
For a common disease like cystic fibrosis, only about three percent are products of consanguineous marriages 'cause it's not hard for two cystic fibrosis alleles to find each other in the general population.
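The quadratic-versus-linear point can be made concrete with Wright's standard formula for genotype frequency under inbreeding; this is a generic textbook sketch, not a calculation from the lecture:

```python
def p_homozygous(q, F=0.0):
    """Chance an individual is homozygous for a recessive allele of
    frequency q when that individual's inbreeding coefficient is F:
    q^2 from two alleles meeting at random, plus F*q*(1-q) from one
    allele coming down both sides of the pedigree (Wright)."""
    return q**2 + F * q * (1 - q)

q = 1e-3                                  # a rare disease allele
risk_random = p_homozygous(q)             # quadratic in q
risk_cousins = p_homozygous(q, F=1/16)    # roughly linear in q
print(risk_cousins / risk_random)         # large enrichment for rare q
```

The rarer the allele, the larger the ratio, which is exactly why rare recessives are so enriched among children of consanguineous marriages.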
Well, let's think about what goes on in the case of an RFL--in the case of a marriage between relatives here producing an affected child. We've got a situation here showing a recessive disease allele, it's a capital D but it's a recessive allele, being inherited two different ways down a pedigree and becoming homozygous by descent in this affected individual here, the product of a marriage between first cousins.
Well, note, that disease gene is homozygous by descent. Not only is it homozygous by descent, it's the identical thing that's come down two different ways, but in fact, some whole chunk of the chromosome around it is homozygous by descent. Well, what if we had x-ray vision and we could look at that individual, that affected individual and say,
"Aha! I see a region that's homozygous by descent." How convinced would you be that that must be the disease region, the region containing the disease gene? Well, we can do this likelihood business I told you about before.
Let's compute the chance we would see this data if in fact this is the right region. If this is in fact the region of the disease gene, what's the chance it will be homozygous by descent in an affected individual? Well, almost one. The only way it couldn't be is if the disease gene came in collaterally. But if this is a rare disease gene, the chance is very low that it would have come in elsewhere.
It almost certainly must be homozygosity by descent. So the chance we'd see homozygosity by descent if this is the right region is one, effectively. But now, what if it's the wrong region? What's the chance we would have seen this data here if in fact, this data being homozygosity by descent, if it's the wrong region?
All the kids are inbred after all, so there's a chance that some region of the genome will be homozygous by descent. And indeed, that's the coefficient of inbreeding. That's what that measures. The coefficient of inbreeding for a first-cousin marriage is 1/16. That is to say, there's a 1/16 chance that any old portion of this individual's genome will be homozygous by descent.
So let's see, the odds ratio is going to be chance that this would happen if the hypothesis is right, that's one, over the chance this would happen if hypothesis is wrong, that's 1/16. So the odds ratio is 16 to 1.
That's not enough to publish in Nature yet, right? We need 1,000 to 1. But suppose, we see two such children and we see homozygosity by descent, 256 to 1. Three such children, 4000 and change to 1. That's already the moral equivalent of a LOD score of three. In other words, three unrelated inbred individuals contain the full information traditionally used to map a genetic disease.
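A quick sketch of that arithmetic, just powers of the 16-to-1 odds ratio per affected inbred child:

```python
import math

F = 1 / 16                 # inbreeding coefficient, first-cousin marriage
odds_per_child = 1 / F     # 16:1 for each affected inbred child observed

for n_children in (1, 2, 3):
    odds = odds_per_child ** n_children
    # three children give 4096:1, a log10 of about 3.6 --
    # past the traditional LOD-3 threshold
    print(n_children, odds, round(math.log10(odds), 2))
```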
The only problem with this scheme is we don't have x-ray vision. I can't just look at the kid and say, "Ah, here's the region of homozygosity by descent." But, we do have genetic maps. And if we have a sufficiently good genetic linkage map, we could look for a collection of markers, six or seven in a row, all homozygous, and say, "Well, this probably means homozygosity by descent."
Of course, the only question is: how good a measure is finding five or six or seven markers in a row all homozygous? How good an indicator is that of homozygosity by descent? Of course, that's going to depend on how variable these markers are, how polymorphic they are. If they were all like HLA, the highly polymorphic histocompatibility locus, there'd be no argument: it couldn't be homozygous without being homozygous by descent.
If on the other hand, they were lousy two-allele polymorphisms, it would be harder to say whether or not it was homozygous by descent. It will depend on how polymorphic the markers are and how close together they are. But based on the quality of the map, the degree of polymorphism of the markers, how variable they are, and their spacing....
One can roll up one's sleeves and compute how many such individuals you'd need. And for example, here, you'll see that if we had markers spaced every five centimorgans in the human genome--let's see if I can find my little dot. There's my dot, great. Five centimorgans in the human genome, and they were three-allele systems, so in other words, there's about a 30 percent chance they'd be homozygous just at random.
One would need something like six or seven such inbred children to get the moral equivalent of a LOD score of three to be able to map a disease gene. In other words, six or seven inbred unrelated individuals would be all it would take to map a disease gene there if one could exploit the full power of a linkage map.
If in fact, one builds linkage maps as the Human Genome Project is aiming at, of higher quality, say, markers every 1 centimorgan, even if they are only 50-50 two-allele systems, or if they're slightly better, three-allele systems, one gets down to the neighborhood of four inbred children or so, four or five.
Not quite the magic number of only three that we said before, but still quite good. It means that a vast number of interesting recessive diseases can be mapped. And in fact, this may be a method of choice for exploiting genetic mapping, because I'll note, for example, that in Bloom syndrome, which I showed you before, the number of families in the world today with two or more living affected children, for doing traditional linkage mapping, is 11.
On the other hand, the number of individuals who are the products of consanguineous marriages alive today with Bloom syndrome, first or second cousin marriages is 24.
The 11 is not enough to do traditional linkage mapping, but 24 inbreds is more than enough to do linkage mapping even with a rudimentary map. And indeed, for perhaps, I estimate, about 500 of the diseases in the McKusick catalog exploiting the genetic map in this fashion would make it possible to do genetic mapping of disease genes.
Well, that's another example then. So we've got two examples of how we can build a stronger functionator exploiting the information of the human genome: complex inheritance, and unlocking information in human populations. Another example I want to talk about is rather different. It has to do with quantitative inheritance.
And here, I mean things even more complex than genetic heterogeneity where there's just a couple of genes causing it. I mean, for example, this. There is tremendous natural variation in natural populations. For example, in human populations, there are variations for all sorts of fascinating traits.
Most interesting here, go--at least, not the most interesting, but the best slide I can find showing it is this is one here courtesy of American Express. He's a card member since 1967. He's a card member since 1976. And you'll note, there's tremendous variation here in height. Now, we're going to find by mapping standard disease genes, maybe the occasional locus for gigantism and maybe the occasional locus for dwarfism.
But that's not really all we want to know as biologists. What we'd really like to know is all of the genes that could mutate to cause differences in height. Now, of course, height is a terrible example, really, because height is probably highly polygenic; almost any change in environment or genes would produce some effect on height. So I'm using it only 'cause I have a slide here that's so pretty.
Really, what I'm thinking of is, say, hypertension. What we'd really like to know is all the genes that can affect the blood pressure system. And that's going to be hard to do looking at human populations, because we'll find the occasional major disease gene, but we'd like to know the subtle changes as well.
And so, in fact, I'm going to argue that as part of the Human Genome Project--of course, the Human Genome Project is a misnomer. It's not just the human genome that's being talked about. We're talking about animal genomes as well, all complex genomes, because the tools are the same. And indeed, as I'll argue in just a moment, looking at animal genomes may be a very powerful way of learning a great deal about human genomes.
So for example, instead of looking for the traditional single gene mutants, let me remind you that during the course of the century, there's been a tremendous amount of breeding for all sorts of traits, selective breeding in the laboratory [?] for example, for traits like hypertension. Here, I show an example of an outbred strain of mouse that's been bred for several generations.
And in each generation, the high blood pressure animals are mated together and the low blood pressure animals are mated together. And after about 12 generations, you have a high blood pressure strain and a low blood pressure strain consistently--that consistently has blood pressure about these levels here.
Well, this has been done for almost any trait imaginable, for hypertension, atherosclerosis, diabetes, predispositions to cancer, alcohol sensitivities, drug sensitivities, levels of plasma components, resistance to infection, ability to run mazes, everything you can think of, people have bred.
And you can do it. The problem is that when people get these strains that differ in these interesting traits, they cross them together and they observe that--I can't find my little dot again. That--there we go. Here's one population, there, that's the high blood pressure strain. That over here is the low blood pressure strain.
They cross them together and they get the F1 population and that's got an intermediate level of blood pressure and they're very happy about that. And then they mate together the F1 population, brothers and sisters, and they get the F2 population here.
And note, it's got the same mean and just somewhat more spread out and they say, "Oops, not a single gene for blood pressure," 'cause if this was a single gene for blood pressure, we would expect classic Mendelian 1 to 2 to 1 segregation, and that's not what we're seeing. It's two--it's schmeared closer to the middle.
In fact, this is the characteristic signature of polygenic inheritance, of quantitative inheritance, of two or three gene segregants. So it doesn't go a quarter, a half, a quarter, might go a sixteenth, a blah, blah, blah, blah, blah, a sixteenth, a sixty fourth, blah, blah, blah, sixty fourth and it schmears out more in the middle.
So in fact, traditionally, this is the end of the paper in the selective breeding because of course there's very little more one can say. By standard transmission genetics, it's very hard to map the individual polygenic factors. Although all through the century, population geneticists have been well aware of these sorts of polygenic factors underlying many traits.
It's been very hard to get our hands on these factors and do anything with them, despite the fact that it's very easy to breed strains that differ in them. Well, here's some real numbers so we can talk turkey about this. In those two strains there, the blood pressure is 190 and 140, respectively. The standard deviation in measuring blood pressure is about 12-1/2.
And so the strains differ by about four standard deviations for blood pressure. How hard would it be to actually map using a good linkage map, a genetic linkage map, all of these polygenic--all the polygenic components of blood pressure? Well, it depends how many there are.
If there were only two, it might not be so hard. If there were 20, it might be very difficult. So God knows how many different genes there are underlying this difference.
Ah, God and Sewall Wright. Sewall Wright pointed out a long time ago that in fact, the key is how much more schmeared out this F2 population is compared to the F1? If it was a single gene, it would be very schmeared out, 1 to 2 to 1. If it was two genes, somewhat less schmeared out. And in fact, by looking at the difference in the variances between the F1 generation and the F2 generation--oh, yeah go back over the page. Oh yes, good.
By looking at the differences in the F1 and F2 generation variances, one can compute approximately the number of genes that must underlie this trait here. So in fact, looking up there, one can see that probably--well, it's hard to say how much segregation--well, let's actually look.
In this particular example, the variance of the F1 is 155, the variance of the F2 is 240, so the difference between the variances is 85. Plugging into Sewall Wright's formula, Sewall Wright would tell you there are about 3.7 genes underlying this difference. 0.7 of a gene--I don't know what 0.7 of a gene is.
It's an approximate formula, we've got to understand. It makes the assumption that all the genes have roughly equal effects, interact additively, and are genetically unlinked. And to the extent those approximations are violated, this will be an underestimate of the number of genes.
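As a sketch, Wright's estimator written out and fed the numbers above (strain means 190 and 140, Var(F1) = 155, Var(F2) = 240):

```python
def wright_effective_loci(parental_diff, var_f1, var_f2):
    """Sewall Wright's estimate of the effective number of equal,
    additive, unlinked loci: n = D^2 / (8 * (Var_F2 - Var_F1)),
    where subtracting the F1 variance isolates the variance due to
    segregation rather than environment."""
    return parental_diff**2 / (8 * (var_f2 - var_f1))

n = wright_effective_loci(190 - 140, 155, 240)
print(round(n, 1))   # -> 3.7
```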
But if Sewall Wright estimates 3.7 genes, probably it's four or five genes underlying hypertension between these two strains here. Well, would it be possible to map those four or five genes underlying hypertension? Well, what we'd do is we'd set up a cross between a high strain and a low strain and we would get an F1 generation.
We'd then make an F2 generation by mating together brothers and sisters, and we'd get blood pressures on all these animals. And then we would make DNA from all the animals. We'd take off--the way you measure blood pressure is you'd wrap a blood pressure cuff around the tail and you'd pump it up and that's how you actually get blood pressure, and you cut off some tail, you make DNA from it.
And suppose we had an RFLP map of the mouse genome, we could do blots on all these tail DNAs and we could look for a single RFLP correlated with blood pressure and say, "Aha! Here's a marker linked to a gene for high blood pressure."
Well, you can actually compute how many animals you'd need to look at. And it turns out that the numbers are okay; they're not great. If the strains differ by four standard deviations in blood pressure and there are four different loci underlying the trait, one finds that one needs something like 100 animals or so--there we go, about a hundred animals--in order to detect a locus, under the assumptions of perfect additivity, perfectly equal strengths, and genetically unlinked loci.
In fact, those assumptions are probably not going to be true, so I'd probably rather multiply that all by about a factor of three to make up for the fact that the assumptions would be violated. So that's okay. It's perhaps doable. But you see, if the difference between the strains gets smaller, like it's only two standard deviations with four loci, then it really begins to take off, so it would be nice to do better than this.
Can we exploit a map more powerfully than just looking for a single RFLP marker linked to blood pressure? Well, I mentioned in the case of humans, some of the tricks we would use would be, for example, to follow genes with flanking markers on either side of them, so they'll look at intervals, to do simultaneous search. And so, let's think about exploiting experimental genetics in this way.
If we follow genes with flanking markers, if we make simultaneous hypotheses about multiple regions of the genome, so we say, "Ah, it's not just this one region for hypertension, but this one, this one, this one, and this one--we already have reason to suspect it's four genes. Let's try all quadruples of intervals, see how well they fit." And in fact, there's another trick that one can use, which is to genotype not all of the mice you look at, but only the mice with the highest and the lowest blood pressure, because they contain the most information.
So from a mathematical point of view, one could compute the expected information contained in a mouse conditional on its blood pressure, and one would find--so I'm going to go back a second--one would find that the animals with intermediate blood pressure contain very little information. So you'd get everybody's blood pressure,
but you'd only make DNA from the guys in the extremes, the high and low blood pressure. And if one therefore squeezed the mathematical information as much as possible, one would find that instead of this dark curve at the top which we had before indicating how many animals it would take for four loci, it comes all the way down to this dark curve down there.
So in fact, where one before needed perhaps 100 animals or so, now, one can get away with closer to 25 animals, one can get all the way down to about one diff--one standard deviation difference in blood pressure.
And in fact, the whole thing becomes quite practical to do in the lab. Probably even with the assumptions being violated--not perfect additivity and not equal strengths--one still gets away with fewer than 100 animals if you really exploit all the power, all the information, in a genetic linkage map of, say, the mouse genome.
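A toy simulation of the selective-genotyping idea, with made-up parameters (four equal additive loci, unit environmental noise, 1,000 F2 mice), showing that the phenotypic tails carry most of the genotype-phenotype signal:

```python
import numpy as np

rng = np.random.default_rng(0)
n_mice, n_loci = 1000, 4

# F2 genotype at each trait locus: 0, 1, or 2 copies of the "high" allele
geno = rng.binomial(2, 0.5, size=(n_mice, n_loci))
# additive phenotype plus environmental noise (arbitrary units)
pheno = geno.sum(axis=1) + rng.normal(0.0, 1.0, n_mice)

order = np.argsort(pheno)
low_tail, high_tail = order[:100], order[-100:]   # genotype only these

# mean genotype at locus 0 separates cleanly between the two tails...
diff_tails = geno[high_tail, 0].mean() - geno[low_tail, 0].mean()
# ...while two random halves of a random 200-mouse sample do not separate
sample = rng.choice(n_mice, 200, replace=False)
diff_random = geno[sample[:100], 0].mean() - geno[sample[100:], 0].mean()
print(diff_tails, diff_random)
```

Genotyping the same 200 animals chosen from the extremes rather than at random yields a much larger expected genotype contrast at a true trait locus, which is the information gain being described.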
Well, in fact, let me suggest to you that once we have an RFLP linkage map, a genetic linkage map of, say, mouse in place, that sort of selective breeding for traits may become a substitute for the traditional mutagenesis that a geneticist does. Ah, this is a heresy of course, because one of the things one's taught to believe in genetics is the supremacy of single-gene mutagenesis.
But in fact, there's tremendous variability present in natural populations. And if one properly wields the power of an RFLP linkage map, it may be possible to map the polygenic factors and, frankly, it's easier to find differences in these polygenic factors than it is to do mutagenesis in an animal like mouse where, in fact, it's a low efficiency process.
And where, when you do knock out a gene for hypertension, for example, likely as not when you make it homozygous the animal will be dead, because it will be an important gene in some physiological process.
So I'll argue, at least as the heresy for the morning, that selective breeding genetics may indeed undergo a tremendous revival as a serious contributor to molecular biology, because one may indeed, with the availability of maps, be able to clone these polygenic factors. Now, this would all be theoretical, and I could, of course, end it as theory.
But in fact, let me show a little bit of data on some preliminary work we've done in this direction. These are tomatoes, okay. So I would love to talk about mouse and hypertension, but in fact, I'm going to talk tomatoes instead, because together with Steve Tanksley at Cornell, we've been looking at trying to apply these methods of polygenic mapping to traits of interest to tomato growers, figuring this would be a good model system to start with.
You'll recognize these guys as your basic garden variety tomato, great for making tomato paste which is of course what most of the people grow them for, that is commercially grow them for. This is called Lycopersicon esculentum. And the facts about Lycopersicon esculentum, they're 65 grams, so they're good-sized tomatoes.
They taste not bad, but mostly what growers are interested in them for is their soluble solids which are used to make tomato paste and they're about five percent soluble solids. The next slide shows you--let's pull it back--Lycopersicon chmielewskii.
Now, if you look closely, you'll see it. It's right over here. That is a tomato plant, believe it or not, all right. There are some little red dots, apart from my little laser joystick here--those are the tomatoes.
And these little guys are, honestly, interfertile with Lycopersicon esculentum. You can cross these guys with garden variety tomatoes and you get tomatoes out the other end. But these guys are little. They're about five grams. But tomato growers are nuts about this sort of plant because, even though it's tiny, it's 15 percent soluble solids.
And so, the dream is to take the genes for soluble solids in this tomato, this chmielewskii, and breed them into esculentum without getting a scrawny little plant. So, keep the genes for being 65 grams but get the soluble-solids genes in there. So in fact, Steve Tanksley's group at Cornell arranged a cross between esculentum and chmielewskii to make an F1, or a backcross to esculentum, measured soluble solids and weight in these various guys, and produced an RFLP linkage map of tomato.
Here's a schematic RFLP linkage map of tomato there, okay. I haven't indicated the distances or names 'cause it's irrelevant. And then we did the procedures I was telling you about before of trying to do polygenic mapping for soluble solids, and there we go. So Wright in this case would have predicted something like four to six loci.
And indeed, we can find four chunks of the genome, four intervals of the genome there, for soluble solids in tomato, that account for a 9.5 percent increase in soluble solids and explain about 52 percent of the variance. And those--I mean, this audience will be savvy to the fact that the amount of variance you explain depends on how much environmental noise there is as well.
And in this case, these are tomatoes grown commercially in commercial fields, no hot house, no loving, tender, care, and it still explains about 52 percent of the variance and about 9.5 percent increase in soluble solids. Remember, where the positions of those guys are because in fact, you can map the genes for fruit weight separating chmielewskii and esculentum, and there they are.
There are about five loci here, accounting for about a 43-gram increase in weight between chmielewskii and esculentum, and about 47 percent of the variance. And you'll note, going back and forth between those, that only in one case are those genes in the same interval. So in fact, one can map these polygenic factors using RFLP linkage maps, here of tomato.
But in principle, you could do this with mouse as well for a whole variety of traits. And that's another way to increase the power of the functionator. Now, of course, mapping roughly to a region--saying, "This part of the tomato genome has genes for fruit weight"--well, that's a good start. You'd really like to be able to clone the genes themselves.
And for that, one would use physical maps to get close and sequence to recognize genes, and I'll come to that. But even before you've done any of that, simply again knowing position, knowing roughly where the genes for, say, hypertension might be in a particular cross, would be of great interest because of another important challenge for mapping: defining the synteny map between mouse and man.
By synteny, what has been recognized is that the mouse genome and the human genome are very similar. You can, by breakage and reunion--in essence, by taking a meat cleaver, cutting up the human genome into about 100 to 150 parts, and gluing them back together--produce the structure of the mouse genome. That is to say, chunks of the mouse genome are contiguous in the same way as they are in the human genome.
And you'll note here, for example, this chunk of human chromosome 9 and a chunk of chromosome 4 in mouse that have the same genes in roughly the same order there.
So in other words, if one defined in detail the structural homologies there--what's called the synteny map, the physical adjacency map comparing mouse to man--and if one mapped regionally in mouse and knew that there was an interesting region of the genome, even before you cloned it, you could transfer that information via the synteny map to man and know what regions of the human genome one was thinking of.
So that's really the fourth challenge for mapping that I want to talk about. So there's a set of tools--fundamentally, they're sort of mathematical, analytical tools, and so I think appropriate for a conference like this--that let you try to squeeze a lot more information out of the genetic map, the product of the sort of Human Genome Project that we're talking about here, or a mouse genome project or a tomato genome project.
Congress, I think, would be less thrilled to hear about the tomato genome project, but it's all of a piece, and it's all the same sort of methods that need to be developed to make a stronger functionator. Well, let me take just a minute and fill you in about where we are toward having genetic maps of the human genome before I go on to discuss sequence as another tool in the functionator.
The Human Genome Project, as I said, is aimed at producing a really high-resolution map of the human genome from a genetic point of view: markers every one percent recombination, spaced out along the human chromosomes, and we're nowhere near that today. That would be a total of some 3,000 to 5,000 genetic markers spaced out evenly along the human chromosomes. But rudimentary maps of the human genome are becoming available.
There's a large international collaboration involving many groups around the world called the CEPH, the Centre d'Etude du Polymorphisme Humain, or the Center for the Study of Human Polymorphism in Paris, that's collected DNAs from 40 large human families. And these DNAs are available to investigators everywhere in the world to map their DNA markers in so everyone can integrate their data by computer.
And indeed, that's what's going on. I'll describe a project that I was involved in recently, my group was involved in recently, to construct such a rudimentary map of the human genome, but it's just one of many such projects that are going on.
This is a project in which Helen Donis-Keller's group at a small company in Boston called Collaborative Research studied the inheritance of 403 genetic markers in 21 of the CEPH families.
267 of these markers were anonymous RFLPs about which we knew nothing. They're just random pieces of DNA. 208 of them in fact were randomly isolated from human libraries, that is, whole genomic libraries of human DNA. Fifty-nine of them came from chromosome-specific libraries, actually chromosome 7 and chromosome 16, 'cause there are disease genes of interest there.
Fifty-four of the markers that we looked at in the study here are what we'll call anchor loci. Their positions are known; typically, they're cloned genes, and their positions in the human genome are known by virtue of, say, in situ hybridization to chromosomes, or by using cell hybrids, mouse-human hybrids that contain only a single human chromosome.
And showing the power of such an international collaboration, 46 of the markers in the study were in fact markers whose inheritance patterns were already contributed to the CEPH database. And this is a database that I believe will be increasingly useful as more and more markers are poured into it.
For each of the markers, Southern blots were done here showing the inheritance of RFLPs, just as I showed you a bit earlier. And the data here involves 5,000 Southern blots, of which my group did none, and so we can take absolutely no credit for any of the really hard and heroic work here of doing 5,000 Southern blots.
We only got involved at the level of analyzing the data that comes out of these 5,000 Southern blots, about these 403 markers. So let me roughly run you through how one builds linkage maps of the human genome. Well, you score all these markers in three-generation families.
You then have to make linkage groups out of these markers, and the way you do that is by estimating maximum likelihood distances between markers, two-point LOD scores; you require a LOD score of three, denoting significant linkage. And you assemble things into two-point linkage groups, clusters, on that basis. You've then got to pin down each of these linkage groups to chromosomes, figuring out which chromosome they're on.
Well, that's what those anchor loci are for, because if a linkage group contains an anchor that's already been put on chromosome 4, then you know that the linkage group is on chromosome 4. For the linkage groups that haven't been pinned down to chromosomes, you've got to take at least one, preferably two, of the markers and yourself pin them down to chromosomes, typically by hybrid panels.
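The two-point LOD score used here for declaring linkage can be sketched for the simplest textbook case of fully informative meioses; the real analysis on family data is more involved, but the idea is this likelihood ratio:

```python
import math

def lod(recombinants, total):
    """Two-point LOD score for fully informative meioses: log10 of the
    likelihood at the maximum-likelihood recombination fraction
    theta = r/n, versus free recombination (theta = 0.5)."""
    r, n = recombinants, total
    theta = min(max(r / n, 1e-6), 0.5)   # clamp into (0, 0.5]
    linked = theta**r * (1 - theta)**(n - r)
    unlinked = 0.5**n
    return math.log10(linked / unlinked)

print(round(lod(0, 10), 2))   # 10 non-recombinant meioses clear LOD 3
print(round(lod(5, 10), 2))   # half recombinant: no evidence of linkage
```

Requiring LOD 3 corresponds to roughly 1,000-to-1 odds in favor of linkage, the threshold mentioned earlier in the talk.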
Finally, you now have all your linkage groups nailed down to chromosomes and you've got to actually construct the map on the chromosome. That is, the correct genetic order for all of these markers and the correct distances between the markers. Now, that's an interesting problem, one that the computer scientists may appreciate in a sense, because--see, you could do it on the basis of the two-point linkage distances that are inferred between pairs of markers,
but the problem is there's not a huge amount of data. The two-point distances are relatively inaccurate. In fact, here are 10 markers taken from human chromosome 7, and you'll note we sort of try to rough in the distances between markers, and there is some linear structure, I guess, down here. There's one--it turns out, that's one end of the chromosome there. And the chromosome goes up like that and goes around there.
And this marker up there is the--oops, hold still, hold still, there we go--is the other end of the chromosome. And you can kind of see there are some structure to it. But based on two-point distances, it's kind of hard to get the correct order and you'll appreciate the problem even more if I tell you these are only 10 markers on human chromosome 7.
Actually, we have 63 markers on human chromosome 7. The other 53 are all in between. So the prospects of getting the order right based on just the inferred two-point distances are not very great. Well, geneticists don't like two-point crosses anyway.
Better would be three-point analysis: look at all triples and try to figure out the right order and the right distances based on the triples. Well, that would be good except, again, we've got a limited number of families. And if we insisted on only doing three-point analysis, then for any given triple of markers only a small fraction of the families will be informative, that is, segregating for each of the three markers, so you'd be throwing away a lot of data there too.
So really the best thing to do if one wanted to construct a map would be to do full N-point analysis, full linkage analysis on the pattern of segregation of all N markers. Now, what do I mean by full N-point analysis?
Well, it's really a three-step process. Suppose God told you the correct genetic order for the markers and the correct map distances, and I just wanted to convince you of it, that I knew the right answer. I'd convince you by this method of maximum likelihood that I was talking about before. I'd show you that the data was so much more likely to have been produced by this map than by any other map.
That's how I'd convince you that I actually knew the right answer. Now, that means that even if you were privy to the correct genetic order and the correct map distances, you must still be able to compute the likelihood that the data would have arisen.
[End Reel 1]
[Reel 2]
[The Human Genome Project. What Are We Hoping to Learn? Dr. Eric Lander, March 28, 1988]
Now that means that even if you were privy to the correct genetic order and the correct map distances, you must still be able to compute the likelihood that the data would have arisen. Well, it turns out that the likelihood function is a bit of a complicated function.
And the traditional algorithm that had been available for computing the likelihood function ran in exponential time in the number of markers you were looking at. With 63 markers on human chromosome 7, you can imagine that that begins to be hard. Indeed, five markers began to be hard.
It would take about a day on a VAX. Six markers would take something in the neighborhood of a week on a VAX. And 63 markers would take far longer than any grant currently offered. So that's a relatively difficult problem: the exponential nature of the standard algorithm for computing likelihoods.
But that's only the beginning of the problem because of course you don't really know the correct order and you don't really know the correct map distances.
See, if you knew the correct order, you'd still have to compute the right distances between the markers. And of course, there the likelihood function is a function of all the distances between the N markers, and you've got to search around N-dimensional space to find the position of the point of maximum likelihood.
And so you've got to search N-dimensional space, and there are some problems in that the traditional search methods for doing that take a very long time to stumble upon the right answer. But the problem is actually a little worse than that, because you don't even know the correct genetic order. What you really have to do, in principle, is try all possible N factorial genetic orders.
And for each one of them, you have to make the best map and compute its likelihood. So, there are computational problems that were interesting and were involved here. And I'll tell you a little bit about how we and other people over the past couple of years have been coming up with ways around the computational problems.
And it's simply meant to be illustrative of the fact that I think there are a number of instances where improvements in computational methods really do extend the power of our ability to do analysis. Here, for example, let's go back to those three points. I said we have to compute the likelihood function.
The traditional method for computing the likelihood function runs in exponential time. Well, it turns out that it's possible to use a little bit of the special nature of the problem to convert it from a problem that runs in exponential time the number of markers, to one that runs in linear time in the number of markers.
The trick here is that the usual algorithm runs in exponential time in the number of markers and linear time in the number of people. One can trade that off and make it a problem that runs in exponential time in the number of people and linear time in the number of markers. All the pedigrees are of bounded size, and in fact the constant that comes from exponentiating the number of people turns out to be workable.
So in fact one can do a linear-time calculation. The trick here is, in essence, to use a hidden Markov chain algorithm, which might be fun to talk about with some of the folks here later. Mostly, it has to do with the fact that if we really could look at the human chromosome in detail, there is in essence a Markov process going on along the chromosome.
But in fact, we only get partial information about it. So one wants to estimate the parameters of a partially observed Markov process and there are some ways one can do that. In particular, one can compute its likelihood in linear time. Then in order to find the best parameters there, in order to figure out--so we can now compute likelihoods using Markov chains.
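As a toy illustration of that linear-time computation, here is the standard forward algorithm for a hidden Markov chain: the likelihood of a sequence of observations is accumulated one marker at a time, so the cost is linear in the number of markers for a fixed number of hidden states. The two states, the transition probability playing the role of a recombination fraction, and the numbers below are all illustrative assumptions, not the actual pedigree-likelihood code.

```python
# Forward algorithm: likelihood of a partially observed Markov chain,
# computed in time linear in the number of steps (markers).
# Hypothetical setup: state = which grandparental chromosome is transmitted;
# a transition with probability 0.1 stands in for a crossover per interval.

def forward_likelihood(init, trans, emit, observations):
    """init[s]: P(state s at step 0); trans[s][t]: P(next state t | state s);
    emit[s][o]: P(observing o | state s)."""
    n_states = len(init)
    alpha = [init[s] * emit[s][observations[0]] for s in range(n_states)]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[s] * trans[s][t] for s in range(n_states)) * emit[t][obs]
            for t in range(n_states)
        ]
    return sum(alpha)
```

With perfectly informative markers (identity emission matrix), the likelihood of seeing the same grandparental allele at two adjacent markers is 0.5 × 0.9 = 0.45, and the likelihoods over all possible observation sequences sum to one.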
In order to find the best distances to put there, the maximum likelihood distances, one can exploit what's called the EM algorithm in statistics. It was invented independently by a number of people but really integrated and synthesized by Dempster, Laird and Rubin at Harvard in 1977. It's a really beautiful algorithm for doing these sorts of things.
And algorithm is a misnomer here 'cause it's a sort of general procedure and in any given instance, you need an algorithm to carry it out. It's still called the EM algorithm anyway.
It works like this. Again, human genome data is a problem because there's lots of missing data. We don't know whether markers are on the same chromosome or opposite chromosomes, that is, in cis or in trans; we frequently have homozygotes where we don't want them, et cetera.
There's missing data. If we didn't have that missing data, it would be just like fruit flies, and we could count recombinants. It would be very easy. So if there were no missing data problem, we'd just be able to count recombinants and get a map, and get the correct recombination fractions, or distances, between the markers.
It also turns out though that if we have the right recombination fraction somehow, if God told us the recombination fractions, we would be able to mathematically fill in the expected value of all the missing data. Now, we don't know the recombination fractions. But if we knew them, we could fill in the expected value of the missing data.
If we knew the missing data, we could fill in the expected value of the recombination fractions. We don't know either. Yes? Yes, there are recombination fractions. In fact, yes, they're all equal.
So you start with a guess at the recombination fractions. Using them, fill in the missing data you wish had been there on the blots--these markers are in cis with that marker, they're on the same chromosome, this one isn't--on a probabilistic basis, to its expected value. Using that filled-in data, count recombinants and re-estimate the recombination fractions; using the new recombination fractions, fill in the missing data again. A theorem says this converges, and it converges to a local maximum.
And in fact, it converges along a very nice path: every step necessarily increases the likelihood, and it converges to a local maximum. And in most of the problems we've looked at, there is but a single maximum, and it converges to it in about 10 iterations.
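The alternation just described can be made concrete with a toy case of missing data: a doubly heterozygous parent whose linkage phase is unknown, so that n scored meioses look like either r or n − r recombinants depending on which phase is true. The E-step fills in the expected recombinant count given the current theta; the M-step re-estimates theta from that count. This is a sketch of the idea under those stated assumptions, not the pedigree-scale version.

```python
# Toy EM for a recombination fraction when the linkage phase is missing data.

def em_theta(r_if_phase1, n, theta=0.25, iterations=20):
    for _ in range(iterations):
        r1, r2 = r_if_phase1, n - r_if_phase1
        like1 = theta ** r1 * (1 - theta) ** (n - r1)  # likelihood under phase 1
        like2 = theta ** r2 * (1 - theta) ** (n - r2)  # likelihood under phase 2
        w1 = like1 / (like1 + like2)          # posterior weight of phase 1
        expected_r = w1 * r1 + (1 - w1) * r2  # E-step: fill in the missing data
        theta = expected_r / n                # M-step: count recombinants
    return theta
```

Each pass raises the likelihood, and for data like 2-of-10 apparent recombinants the estimate settles near 0.2 within a handful of iterations, just as described.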
So in fact it's very easy to explore N-dimensional space even for very high N; the EM algorithm does a nice job of it. And at each step, you're using the Markov chain linear-time algorithm to do your work for you. And then finally, of course, I said you've got to look at N factorial different orders. Well, in principle you do have to look at N factorial different orders, but this is not a highly frustrated problem.
This isn't, say, like the traveling salesman problem. This is a rather cooperative system. It turns out, once you get 7 or 8 or 9 markers down, the other markers tend to cooperate very well. And geneticists, when they map some new marker in the fruit fly, don't go back and consider all N factorial possible orders.
They get fairly confident evidence that their initial order is correct, and they map relative to it. Indeed, we're able to do the same thing here. It turns out, once about six or seven markers are down, which we can do exhaustively, six factorial and seven factorial being reasonable numbers, one can then map relative to that.
And to do that, we built a large computer program called MapMaker, with the help of some wonderful MIT undergraduates, to construct linkage maps of human chromosomes. Here, for example, with six markers, we've asked the computer to look at all possible orders and report back the likelihood that the best map for each order would have produced the data.
There's one map that is 100-fold more likely than all the other maps to have produced the data. We'll tentatively accept it as correct--it starts right up there--and we'll go on and map relative to it. We can map the other four markers that I showed you before on human chromosome 7 relative to it. And you get maps of human chromosomes out the other end.
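The exhaustive step over all orders of a handful of markers can be sketched as follows. As a deliberately crude stand-in for MapMaker's multipoint likelihood, each order is scored here by the sum of its adjacent two-point distances (shorter maps win); the marker names and positions are made up, and real map distances would come from the EM-fitted recombination fractions rather than a lookup table.

```python
# Exhaustive search over marker orders, with a proxy score: rank each order
# by total adjacent two-point distance. Illustrative only -- the real score
# is the full multipoint likelihood, and 6! or 7! orders is about the limit.
from itertools import permutations

def best_order(markers, dist):
    """dist[frozenset((a, b))]: symmetric two-point distance between markers."""
    def length(order):
        return sum(dist[frozenset(pair)] for pair in zip(order, order[1:]))
    return min(permutations(markers), key=length)
```

On consistent (collinear) distances, the shortest order recovers the true arrangement up to an overall flip, since a map and its mirror image are equally good.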
Well, that's sort of a brief thumbnail sketch of a bunch of computer science and programming that had to underlie making maps of chromosomes. And at the end, what you get are maps of chromosomes. We've tried zillions of orders here. And in fact, this is a map of human chromosome 1. This map here is the map that's obtained when one--let's see if you can see the top of the slide there.
Oh you can, very good. When one insists that male and female recombination rates be equal, okay. So you hypothesize a single recombination fraction equal for males and females in any given interval. But you could of course say, let me fit twice as many parameters and I will hypothesize that males and females could have different rates of recombination,
and I'll search 2N-dimensional space and find separate recombination fractions for males and for females. And in fact, you'll notice that male recombination is less than female recombination.
If we allow them to find their own maximum likelihood values, you'll note that the female genetic length is much longer, more recombination in females. I'll just show you some equivalent pictures of maps. This is human chromosome 2.
Again, the sex average map where we assume everything is equal is on this end here. And the male and female maps, again, male genetic length is shorter than female genetic length. There's chromosome 3. And some of the chromosomes have a tremendous number of markers. Here's chromosome 7.
It has 63 genetic markers on it 'cause it's one of tremendous interest. It's the chromosome that has cystic fibrosis on it. And I can't see on my monitor, but if I can steer this thing right, I can show you--well--oh, I'm not so good at these computer games.
Yeah, well, right around south, south, south. Well, right in there. Oh, I got it. Cystic fibrosis lives right there. That's the gene for cystic fibrosis. Some of the chromosomes have lots of markers like human chromosome 7 there.
Some of them, because this was a completely random search on the genome, have relatively few. Chromosome 17 here has 7 markers on it. The worst case, chromosome 19, has amusingly 5 markers that landed on it. It's a small chromosome.
But nevertheless, the map only covers a small bit of chromosome 19. And so far there are about 400--403 markers--put into this map. If you wanted to make a rough estimate of how complete the map is, well, you could use those 208 randomly-chosen markers I told you about, the ones isolated from the whole genomic library.
They're kind of like random shots of the human genome. And you could ask how many of them are linked up so far? Well, it turns out about 97 percent of those markers are linked up so far. And therefore, one comes to the guess that maybe 95 percent or more of the genome is probably linked up since 97 percent of a random sampling of markers was linked up to the map so far. So that's a rough idea of about how complete that map is.
That's not anywhere near as dense as we're going to need to pull off some of the tricks I'm talking about--really being able to dissect complex inheritance carefully, or exploit the full information in human populations by, say, looking at inbred children or other such population genetic situations, or to do quantitative trait inheritance in any rigorous way--but it's a start. It's about 400 markers. [?] in Utah has another 400-and-change markers.
Several other groups have several hundred markers. And it seems likely that by sometime next year, a map of at least 1,000 markers will be in place, and a map of, say, 3 to 5,000 markers is well within prospect in the next two, three, four, or five years, so that one can indeed begin to think already about preparing and building this functionator to have it in place.
Even before one does, there are some amusing things you do learn about human biology. As I've mentioned already, there's less recombination in males than in females; that's amusing. There's not actually a good reason for it. J.B.S. Haldane observed generally that in a large number of species the heterogametic sex--in humans, the X-Y male--has less recombination than the homogametic sex--in humans, the X-X female.
And he actually gave a reason for it in 1922. Well, his rule has held up wonderfully well. It's correct for Drosophila and for the moth, and it's true for humans and mouse. Unfortunately, the reason he gave in 1922 is completely specious.
And so, in fact, we have a rule that's working marvelously well--less recombination in the heterogametic sex, typically males, than in females--and no reason for it at all, but it seems to be true. But what's amusing is that it's not as if it were simply a stretching out of the female genetic map, because there are some instances you'll see in the map with no recombination observed in females and a larger amount of recombination observed in males.
Now, of course, one thing you should say to yourself immediately is, "Could that be a statistical fluke?" After all, you looked at 403 markers, could something like that have happened just by chance?
Well, simple statistics suggests no. But the real proof of it, of course, would be to look in another set of 20 families and show that indeed this is a region with greater recombination in males than in females, which indicates for the biologist that it's not just some increase in trans-acting recombinational machinery souping up recombination in females,
but there are actually specific sites on the chromosome that must mediate the sex differences in recombination. Other observations that are interesting is that the genetic and physical maps are not perfectly linear with one other.
There are regions particularly near the tips of chromosomes, the distal parts of chromosomes, where the genetic maps are stretched out relative to the physical maps. For example here, 50 percent of the predicted genetic length of chromosome 10 seems to lie in the distal 20 percent of the cytological length of that chromosome.
And the reasons for that are not clear at all, but they have important implications for people who do [?] genomes. And there are other amusing observations one gets from these sorts of maps. One completely unexpected one was that the most polymorphic, the most variable of the markers tend to lie near the tips of chromosomes, completely unexpected.
We had no reason to suspect why the most variable markers should lie near the tips of the chromosomes. But here on chromosome 1, the only two markers with heterozygosity--a chance of being heterozygous--over 70 percent are in the distal tips of the chromosome. And that's an observation that's found at a statistically significant frequency throughout the genome.
And we're just at the point of making guesses about what this could mean. It could be some mechanistic fact about chromosomes that it's trying to hint at. Remember, I said there was increased recombination out there near the tips of chromosomes?
Well, the increased recombination might cause increased variability, either by virtue of the fact that recombination has been implicated in generating mutation, or by the fact that recombination might often involve elements that are in tandem arrays, where you might have a great deal of unequal crossing over. Those are possibilities for it.
That would be a mechanistic idea. It could be telling us an evolutionary effect about the human genome. Namely, that--not that more polymorphism is generated out there on the tips of chromosomes, but less polymorphism is lost out on the tips of chromosomes. So from an evolutionary point of view, someone might point out that when you have selection operating on a species, you frequently will drive one form at a locus to fixation.
So evolution will drive a locus to fixation; one allele will take over. But when it does, by hitchhiking, some region of the chromosome around it is driven to fixation, because whatever was nearby gets driven to be fixed and be the predominant allele, in fact the only allele. Well, regions of higher recombination, of course, will fix less chromosome, because there will be more of a chance to randomize nearby.
Regions of lesser recombination will fix greater chunks of chromosome. So in other words, the regions with the most recombination will lose the least polymorphism. So maybe it's an evolutionary effect; we just don't know. But all sorts of functional ideas, at least, are suggested by looking at these sorts of maps of genomes. Well, I've mostly talked--I think there's another carousel to put in.
I've mostly talked about genetic mapping 'cause indeed, that's--it's very close to my heart and it's mostly what we do. And I hope I've given you some sort of a sense of how one can build stronger and stronger tools, partly biological, say, using inbred children or isolated populations, or using crosses between selectively bred lines.
Partly mathematical, making simultaneous hypotheses about regions of the genome, either for heterogeneous genetic diseases in people or for quantitative inheritance in animals, to give you some sense that one of the challenges is to build a strong functionator for genetic mapping.
I want to close, take the remaining five or ten minutes, and close by talking--oops, did I lose that. Let's go back--and close by talking about--there we go--about other things besides genetic mapping because genetic maps are only one part of the triumvirate that gets described as the Human Genome Project, genetic maps, physical maps, and sequence maps.
Charles Cantor is going to talk a good deal about physical maps and I'll say nothing about them other than that they're extremely, extremely useful, really invaluable for going from linkage to locus. All I've told you about so far involves finding markers nearby and just knowing that you have markers nearby a gene for hypertension or for manic depression.
You'd actually like to get the gene itself. And the way to do that is, of course, to move along the chromosome. And for that, one needs physical maps. You have to traverse megabase distances, and then, real tricky, you have to recognize the gene itself. I'm not going to talk so much about how you recognize the gene because, well, it's just not clear yet.
There have been only a couple of instances where people have done it in the human genome; each has involved special tricks, and it's not that easy to do. So there aren't general rules yet. But once one gets to the gene, one will sequence it, or look it up on the sequence map if that map is done already, and one is going to have the problem of saying,
"Well, there is the sequence, what does the gene do?" And there, too, we need a considerably stronger functionator. Equally important with a strong genetic functionator, we need a sequence functionator: not just the sequenator, a machine that generates sequence, but stronger machines or devices, in a broad sense, that tell us what in the world the sequence means about a gene.
What can we learn from sequence? So I want to close by asking, just as we asked what we can learn from position and what the challenges are for genetic mapping: what can we learn from DNA sequence, and what are the challenges for sequence analysis? Well, I'll give you a couple of examples.
DNA sequence analysis, and particularly protein sequence analysis inferred from DNA sequence, has been tremendously important and has grown in importance over the past decade. For example, a structure called the zinc finger was recognized in a transcription factor from Xenopus, the frog.
There is TF3A, transcription factor 3A in Xenopus involved in transcription of 5S RNA. And Aaron Klug and colleagues noticed a particular structure there and figured that it would produce a kind of finger that would make a tetrahedral coordination complex with zinc in the middle.
And these fingers would intercalate into the DNA and read the DNA sequence, and that's how the transcriptional activator would work. Well, this zinc finger was also found in other transcription factors: SP1 in animal cells, the ubiquitous one that was recently cloned and sequenced by the [?] lab, and ADR1 in yeast.
And so, transcription factors, many of them seem to have the zinc finger motif. Well, that's great. We knew those things were transcription factors by function. However, other genes that have been sequenced and were of great interest for other reasons turn out to also have these zinc fingers.
For example, two very important genes involved in Drosophila pattern formation in the early embryo, Kruppel and hunchback are their names, both have zinc fingers. And by implication--I don't feel one would have to go far out on a limb, and neither did the people who sequenced them, to conclude this--one of their functions is almost certainly to act as a transcription factor.
And, a very exciting development recently: David Page, a colleague of mine at the Whitehead, recently cloned a gene in the region of the Y chromosome that determines sex, a gene that's almost certainly the male-determining factor, the testis-determining factor, in humans.
And it has 13 zinc fingers in it, at least. And so it, too, almost certainly has activity as a transcription factor. So sequence comparison suggests all these things have that biochemical function. And without that sort of sequence comparison, one would have had no guess as to how this male-determining gene would act, or how these pattern-formation genes in Drosophila would act.
Another example that I think really points out the tremendous power of sequence analysis comes from Drosophila again. A gene called snake, that's its name, which is required for specifying the dorsal-ventral axis in flies, turns out to encode a serine protease, a gene homologous--no, no, sorry, similar, we can't use homologous anymore--similar to thrombin. Now, that's an extremely interesting point because thrombin is involved in blood clotting.
And in blood clotting, one has a cascade effect of serine protease upon protease that amplifies a signal. Well, one of the things that mathematicians studying development had always been looking for was a signal amplifier, an activator-inhibitor system.
And indeed, at a conference like this, I should point out, in case people don't know, the person who wrote the first paper on activator-inhibitor models toward development was Alan Turing. And Turing pointed out that if you had an activator and an inhibitor, one would be able to build all sorts of patterns. And people have developed those ideas since and they seem to be quite lively and active ideas for Drosophila.
And lo and behold, someone sequenced snake, and it's a serine protease. And now a good betting line, at least, is that there may be a cascade of serine proteases involved in development, something people would never have suspected without that sequence comparison. What else can you learn from sequence? Well, frankly, almost all oncogene functions have been inferred by comparison of amino acid sequence.
The first, to my knowledge, the tyrosine kinase function of Src was inferred by biochemistry. Src is a phosphorylated protein. Ray Erikson figured out, "Well, maybe it auto-phosphorylates," and did this all by biochemistry. But since then, virtually every oncogene that has been sequenced has had a cellular function ascribed to it by virtue of amino acid similarity.
And in fact, if we look at a list of oncogenes here, you'll see that you've got your cytoplasmic tyrosine kinases, your cytoskeletal tyrosine kinases, growth factor receptors, hormone receptors, growth factors themselves, DNA binding proteins, phospholipases, all assigned by virtue of amino acid sequence homology by computer, by strong similarity noticed by computer.
I note I left one off; as soon as I made the slide, I realized, of course, that the very famous Ras gene, a G protein, also got its cellular function assigned by virtue of amino acid similarity. So in some sense the oncogene field is utterly dependent upon the sequence comparers and that sort of functionator to assign function, because it would be very hard to guess the function of these things otherwise, without that sort of similarity.
Well, what are the challenges then for sequence analysis? I'll briefly indicate a couple of the ones I see, but I know that a lot of people here have done much work on these various questions, and I hope that some of the conference will be devoted to the question of building the stronger and stronger functionator for sequence analysis.
One, and I'd put it first although it's really the ultimate goal, is to compile a whole thesaurus of genetic elements. By that I'm thinking of, for example, the exon shuffling hypothesis put forward by Walter Gilbert some years ago now, which says that almost all proteins are built up out of a finite set--perhaps 1,000 or 10,000, pick your number--of fundamental exon subunits that have been shuffled around to produce proteins.
They were the primordial bits of protein, and they've been shuffled and reshuffled and reused. And so, what we would like to do, if exon shuffling really does account for most proteins, is get a full deck of exons and therefore be able to recognize them all and see how they shuffle around. Similarly, you'd like a full deck of regulatory regions, and to see how they're shuffled around too.
I'll show you just one example of how that thesaurus seems to be shaping up. A slide here from Joe Goldstein: the LDL receptor, the low density lipoprotein receptor, you'll note has its various exons; you'll see chunks of it there. The pink guys, you see, show an exon with very strong similarity to complement. The EGF precursor shows, again, similar regions, those yellow regions, found in factor IX, factor X, and protein C.
And this is, without going into too much detail, a very good example of how a small set of exons is shuffled and reused in a variety of genes that on their face seem unrelated, but in fact must share similar parts.
That business of recognizing the whole thesaurus is, of course, a big job, and one that I think comes only when you have a whole genome sequenced.
But another example of a challenge for sequence analysis is the recognition of functional subunits in DNA and protein at a somewhat less detailed level than a complete thesaurus: simply such things as recognizing coding regions--being able to run through newly sequenced DNA from an organism and say with very high certainty where the coding regions or the splice junctions are--or secondary structure aspects of proteins, or the combinatorial interaction of regulatory elements; simply to be able to recognize these elements in DNA.
And in fact, while some progress has been made toward recognizing coding regions, with people getting better and better at predicting them, at secondary structure we're actually still doing, to my mind, pretty lousy. One doesn't do all that much better than random prediction at the moment in predicting secondary structure.
To point that out, I'll simply show you a picture from a current issue of Science. That's the Ras oncogene that I left off my list before. It's a G protein, a GTPase. We have no idea how to reliably predict the structure of this molecule.
And in particular, we have no idea how to tell in advance that a single amino acid change there converts this thing from a normal G protein into an oncogene by abolishing its ability to hydrolyze GTP, therefore leaving its signal turned in the on position at all times. So in fact, there's very much to be learned about predicting secondary structure there.
And while many methods are in use, they really don't do better than about 60 percent, and at random one can get up to about 50 percent, so there's very much to be learned. Similarly, there's the detection of subtle similarities to infer function. See, all the examples I showed you before--snake and how it's a serine protease, the zinc fingers, the oncogenes--came from extraordinarily strong similarities.
There are weak similarities we'd like to know about as well. And these sorts of weak matches that one would like to find are, I think, a tremendous challenge for people mathematically: to define what exactly it is we mean by similarity, and to understand the statistical significance underlying these sorts of matches. And many people in the audience have worked on these questions of the significance of these sorts of weaker matches.
And then there are, of course, algorithmic questions about how to find them in some efficient way. It amounts, in a sense, to varying the stringency of hybridization that a molecular biologist would use to detect matches of different sorts. And I think there's a lot that needs to be done.
In some sense, almost all molecular biologists rely on a handful of simple matrices and simple programs that have served them extraordinarily well, I mean, they're tremendously important programs, but more work needs to be done on exploring different ways of detecting similarities.
And I'll close in the remaining minute or two by simply sketching some things we've been up to there; it's very much at the early stages, and, similar to work going on in many places, much needs to be done. But let me toss out an idea for people to bat about. What we'd really like to draw eventually is a graph, a relatedness graph, on all proteins.
That is, put down a point for each protein on the page, and draw an edge between any two proteins that are highly related, very similar. So I've marked here this edge, say, between v-sis and PDGF, one an oncogene and the other platelet-derived growth factor. And the edge is highly significant, so each edge in this graph really matters.
And you could write a paper in Nature about each edge. Well, it would also be interesting to be a little less discriminating, and, for example, to draw in all the edges that are, say, among the top two percent in similarity. Or the top ten percent, or five percent; pick your level.
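The thresholding step can be sketched very simply: given all-vs-all similarity scores, keep only the top few percent as edges. The protein names and scores below are invented for illustration; any pairwise scoring method could supply them.

```python
# Sketch: build a "relatedness graph" by keeping only the pairs whose
# similarity score falls in the top fraction of all scored pairs.
def relatedness_graph(scores, top_fraction=0.05):
    """scores: dict mapping (protein_a, protein_b) -> similarity score.
    Returns the set of pairs whose score is in the top `top_fraction`."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    n_keep = max(1, int(len(ranked) * top_fraction))
    return {pair for pair, _ in ranked[:n_keep]}

# Invented scores: one strong, famous-style match and one weak one.
scores = {("v-sis", "PDGF"): 95.0, ("apoE", "secretin"): 42.0,
          ("v-sis", "apoE"): 5.0, ("PDGF", "secretin"): 3.0}
edges = relatedness_graph(scores, top_fraction=0.5)
print(edges)  # the two highest-scoring pairs survive
```

Lowering `top_fraction` reproduces the "each edge matters" graph; raising it gives the less discriminating graph with many weak edges.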
Now, in this graph, no one edge is so significant that one would write about it in Nature. They're all weak matches, and one doesn't have a strong way of knowing whether or not they're important. But notice that in this graph here, one gets certain clusters of edges.
And so one can ask the question: in a random graph of this sort, and it's not really a random graph, but let's leave that aside, how big a cluster do you have to find before it's significant? And so, in fact, one can work out the mathematics of the significance of finding clusters like this in the relatedness graph of proteins.
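A back-of-the-envelope version of that calculation, under the admittedly wrong assumption that the graph is a uniform random (Erdős–Rényi) graph: the expected number of fully connected clusters of k proteins among n, with edge probability p, is C(n,k)·p^C(k,2). When that expectation is far below 1, observing such a cluster is surprising. The numbers below are illustrative, not from the actual analysis.

```python
# Expected number of k-cliques in a random graph G(n, p):
#   C(n, k) * p ** C(k, 2)
# A small expectation means an observed k-clique is significant.
from math import comb

def expected_k_cliques(n, p, k):
    return comb(n, k) * p ** comb(k, 2)

# 2,000 proteins, edges drawn at the top 2% of similarity scores:
print(expected_k_cliques(2000, 0.02, 3))  # many 3-clusters expected by chance
print(expected_k_cliques(2000, 0.02, 5))  # far below 1: a 5-cluster is surprising
```

The real relatedness graph has correlated edges, which is exactly why the more careful mathematics is worth working out.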
Richard Arratia at USC and I have done some of that work on the significance of these sorts of clusters. And then one sets oneself the goal of computing that relatedness graph: compare all proteins with all proteins, and start looking at those comparisons with various different tools.
And so, in fact, you can do that. The standard method for doing that is dynamic programming, as many people at this conference will, I think, talk about. The only thing I'll call your attention to is that dynamic programming mostly involves looking back at some previous cells, and it's something that's highly parallelizable. All the colored-in squares along an anti-diagonal can be done in parallel, because they're all looking back in a non-interacting way.
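A minimal global-alignment score in the Needleman-Wunsch style makes that dependency structure concrete: cell (i, j) looks back only at (i-1, j-1), (i-1, j), and (i, j-1), so every cell on one anti-diagonal (constant i + j) is independent of the others. The scoring values here are illustrative, not the ones used on the Connection Machine.

```python
# Needleman-Wunsch score, swept by anti-diagonals to expose parallelism.
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    m, n = len(a), len(b)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = i * gap          # aligning a prefix against nothing
    for j in range(1, n + 1):
        F[0][j] = j * gap
    # Cells with the same i + j could be filled simultaneously:
    for d in range(2, m + n + 1):
        for i in range(max(1, d - n), min(m, d - 1) + 1):
            j = d - i
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # substitute/match
                          F[i - 1][j] + gap,     # gap in b
                          F[i][j - 1] + gap)     # gap in a
    return F[m][n]

print(nw_score("GATTACA", "GCATGCU"))
```

On a serial machine the sweep order is irrelevant, but on a machine like the Connection Machine each anti-diagonal maps naturally onto many processors at once.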
And so that led us to get involved with some folks at a company in Boston called Thinking Machines, and to use what they have there, called the Connection Machine, a highly parallel processor with 65,000 processors, to do sequence comparison. And we put protein sequence comparison using the NBRF database through the Connection Machine.
And for at least that small set of 2,000 proteins, a small drop in the bucket compared to the whole human genome, of course, we built that relatedness graph, and we're now at the point of sort of mucking around in the relatedness graph and seeing what we can see.
It would be a graph I'd be delighted to share with people, 'cause I think there'll be many people with various ideas on how to screw around in that graph and find clusters. I'll note, right off the bat, that finding clusters in such a graph is NP-complete. Needless to say, that shouldn't daunt anyone.
It simply means one has to use heuristic algorithms, but that's fine. You can, for example, test it on things you know about. For example, the ferredoxins.
There are two types of ferredoxins, plant ferredoxins and bacterial ferredoxins. If you run a plant ferredoxin through the database with the usual programs, they don't recognize the bacterial ferredoxins tremendously well using the standard scoring methods.
And in fact, I did a nasty trick. I took all the ferredoxins out of the database and put them back in with other names and I asked some students to find homology to this new protein I'd found and they only found the plant ferredoxins. Nobody came back and told me they'd found bacterial ferredoxins that were similar.
Well, in fact, you could ask, "If you build a relatedness graph and look for clusters, what do you find?" Well, you find a tight cluster of plant ferredoxins, but you can also easily pick out the next cluster up that contains it, which is the one that also includes the bacterial ferredoxins. So in fact, those weak links do manage to connect things up very nicely.
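One simple heuristic that captures this "next cluster up" behavior is taking connected components of the relatedness graph at a chosen score threshold: at a high threshold you get the tight within-family clusters, and lowering it lets the weak cross-family links merge them. The names and scores below are invented to mimic the ferredoxin example.

```python
# Connected components of a scored graph at a given threshold, via a
# small union-find. Lowering the threshold merges tight clusters the way
# a weak plant-to-bacterial ferredoxin link merges the two families.
def components(nodes, scored_edges, threshold):
    parent = {v: v for v in nodes}
    def find(v):                      # path-halving find
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v
    for (a, b), score in scored_edges.items():
        if score >= threshold:        # keep only strong-enough edges
            parent[find(a)] = find(b)
    groups = {}
    for v in nodes:
        groups.setdefault(find(v), set()).add(v)
    return sorted(groups.values(), key=len, reverse=True)

nodes = ["plant_fdx1", "plant_fdx2", "bact_fdx1", "bact_fdx2"]
edges = {("plant_fdx1", "plant_fdx2"): 90,   # strong within-family match
         ("bact_fdx1", "bact_fdx2"): 85,
         ("plant_fdx2", "bact_fdx1"): 30}    # weak cross-family link
print(components(nodes, edges, 80))  # two tight clusters
print(components(nodes, edges, 25))  # one merged cluster of all four
```

Finding the densest clusters, rather than mere connectivity, is where the NP-completeness bites and heuristics take over.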
There are other things you can do. One example, pointed out to me by Gary Otto at the Whitehead: if you put your finger down on platelet-derived growth factor, PDGF, you find a cluster pops out that involves angiotensinogen, apolipoprotein E, and secretin. I've only shown the matches that are identical; any pair there has a number of additional matches that are significant.
And these are all small circulating molecules in the bloodstream, not previously known to be related to each other, still not known to be related to each other, but they are close to each other in the cluster graph. And what's particularly amusing about these small circulating molecules in the bloodstream is that in the case of apolipoprotein E, it's known that this part of the molecule is its binding site to its receptor.
So in fact, a guess that might emerge from studying this relatedness graph is that maybe these are the binding sites of all these guys to their receptors. The reason the match is weak is that they all bind to different receptors. What's common is the scaffolding that makes a certain type of receptor binding site; what's weak is that they're each binding a different receptor.
These are simply ideas I'm tossing out, 'cause I think many of these sorts of things will be talked about. But that's my sketch of the challenges for sequence analysis: compilation of an eventual thesaurus of genetic elements, recognition of functional subunits in DNA and protein, and detection of subtler similarities to infer function, by mathematical methods, by the use of new algorithms, and by exploiting new architectures.
So we're left with the functionator. The functionator is still a hypothetical entity. It will never be built as a machine, per se, but it's a device no less important than the sequenator or any of the other devices people are thinking about building for the genome project.
And it's one that to build right will involve mathematicians, statisticians, and computer scientists, as well as geneticists and biochemists, because the Human Genome Project and what it produces is but a tool. It answers none of the questions. So the Human Genome Project, what are we hoping to learn?
We're hoping to learn the position of all interesting genes underlying physiological and developmental traits. We're hoping to learn how the gene products function biochemically.
But the Human Genome Project, per se, will tell us none of that. But it will be the tool which we can use in conjunction with a powerful functionator as we together try to build one over the next decade or so to answer most of those questions. I think that's a very exciting prospect. Let me stop there.
[Applause]