The NHGRI Genome Sequencing Program (GSP)

Program History

The NHGRI Genome Sequencing Program (GSP) was initially created to sequence the human genome as part of the Human Genome Project (HGP), an international collaboration involving more than a dozen centers worldwide. (See: All About the Human Genome Project for an overview of the HGP.) In the process of building the technical and intellectual infrastructure to sequence the 3 Gb human genome, the genomes of a small number of widely used biomedical model systems were sequenced, including those of Escherichia coli, baker's yeast (Saccharomyces cerevisiae), the roundworm (Caenorhabditis elegans), and the fruit fly (Drosophila melanogaster).

The HGP firmly established the advantage of the rapid and unfettered release of sequence data to the community. With the completion of the HGP in 2003, two things were apparent in addition to the basic insight gained from having these genome sequences. First, the development and application of large-scale genome sequencing had resulted in significant gains in efficiency, with approximately 2-fold decreases in cost (per amount of effort) attained every ~20 months. Second, it was apparent that there would be an increasing demand for genome sequence over time to attain goals of major significance to the research community.
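As a rough illustration of the halving rate quoted above, the short Python sketch below computes the fraction of the starting cost that would remain after a given number of months. It assumes a smooth exponential decline with a constant 20-month halving period, which is a simplification of the approximate figure given here and not a model used by the program. The function name and parameters are hypothetical.

```python
def relative_cost(months_elapsed, halving_months=20.0):
    """Fraction of the starting cost remaining after months_elapsed.

    Assumes (for illustration only) that cost halves every
    halving_months months at a constant exponential rate.
    """
    if halving_months <= 0:
        raise ValueError("halving_months must be positive")
    return 0.5 ** (months_elapsed / halving_months)


if __name__ == "__main__":
    # Show the remaining cost fraction at a few multiples of the halving period.
    for months in (0, 20, 40, 60):
        print(f"after {months:2d} months: {relative_cost(months):.3f} of the starting cost")
```

Under this assumption, three halving periods (60 months) would leave one eighth of the starting cost per unit of effort.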

Widely used biomedical model systems and other organisms were sequenced for a variety of reasons. For example:

Sequences were obtained from additional mammals and other vertebrates to use sequence comparisons to delineate regions of the genome that have been conserved by evolution. Comparing sequence among mammals (and other vertebrates) is still one of the most effective ways to distinguish regions that are likely to have important function (about 5% of the genome) from the other 95%.

Organisms more distantly related to humans were sequenced to understand the origins of genes and gene families.

Organisms were sequenced to provide added value to all major research model systems, such as mouse, rat, dog, and others.

Sequence information aids basic research into human disease pathogens and their vectors.

Sequence from clusters of organisms related to major experimental model systems (e.g. Drosophila) or pathogens and vectors of human disease helps annotate those model systems to maximize the benefit of their use to the scientific community, to help provide basic biological insight, and to provide simpler systems to test ideas about using comparative sequence to annotate more complex genomes, such as our own.

Sequence from clusters of vectors/pathogens and related nonpathogenic or non-vectoring strains allows researchers to discern genes responsible for pathogenic or vectoring properties.

More recently, sequencing costs have dropped and capacity has risen far enough to allow researchers to undertake projects in Human Variation (See: Survey of Human Structural Variation) and Medical Sequencing (See: Medical Sequencing Program and Current Initiatives), in which large numbers of human genomes are partially or (eventually) fully sequenced in order to find the genetic variants that underlie human disease.

Fully meeting this challenge will require even more sequencing capacity, at significantly improved efficiencies of the kind that currently can be realized only by maintaining very high-throughput, large-scale sequencing capacity. We anticipate that the program will drive, and will be driven by, the advent of new sequencing technologies. This will enable us to approach very significant questions in ways or at scales that were not previously possible, such as:

What are the sequences and sequence variants that contribute to human health and disease?

What is the range of human genetic variation?

What are the population frequencies of gene alleles that contribute to common diseases such as heart disease? What are the relative contributions of those alleles to disease?

How do somatic mutations correspond to the etiology and behavior of tumors? What are the new somatic mutations that occur during tumorigenesis?

Medical sequence information will lead to the identification of variants responsible for human disease and will facilitate disease stratification, diagnosis, prognosis, prediction of treatment response (pharmacogenomics), and the identification of critical molecular pathways in health and disease and of new drug targets. Ultimately, continued development of medical sequencing, along with improvements to sequencing technology (See: Genome Technology Program), will lead to the ability to use DNA sequencing as part of the routine standard of care in the diagnosis, prognosis, and treatment of many human diseases.

Each of the large-scale genome sequencing programs described here has, as its motivation, one or more of the rationales described above, and others that are described in the specific initiative descriptions. It is likely that as capacity grows, and as scientists begin to draw conclusions from the current data, new opportunities for the use of large-scale genome sequence data will be proposed. In order to take advantage of newly arising opportunities, the overall NHGRI Large-Scale Genome Sequencing Program is continually evaluated by NHGRI program staff and the National Advisory Council for Human Genome Research with regard to new opportunities and overall effectiveness.