Will Enhance Utility of Human Genome "Working Draft"

July 2000

This collaborative effort takes advantage of the recently announced "working draft" sequence, representing the vast majority of the sequence of the human genome (the full set of genetic instructions, encoded in long strands of DNA, that are contained in the 24 chromosomes). By comparing newly generated sequence data to the "working draft," it will be possible to accelerate the construction of a higher-density SNP map; this map will in turn facilitate identification of genetic variations associated with common diseases from Alzheimer's to heart disease and diabetes. At the same time, the data generated will help improve the "working draft" itself.

Three academic genome research centers - the Whitehead Institute for Biomedical Research in Cambridge, Mass., Washington University School of Medicine in St. Louis, and the Sanger Centre in Hinxton, U.K. - will participate in this collaboration.

The centers will isolate two and one-half million DNA fragments (each about 6,000 base pairs long) from the human genome and determine the sequence of approximately 500 base pairs at both ends of each fragment, resulting in paired sequences of known distance from each other. The sequences will then be compared to human genome DNA sequences already in GenBank, a publicly accessible repository of genome sequence data, to identify SNPs. In addition, the paired-end information will help span some gaps in the human genome "working draft," enhancing the value of the draft. This paired-end approach has been used to advantage in the sequencing of the genomes of lower organisms, such as bacteria and the fruit fly Drosophila melanogaster.
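The scale of the paired-end plan can be worked out from the figures above. The sketch below is illustrative only: the fragment count, fragment length, and read length come from this release, while the genome size of roughly 3 billion base pairs is an outside assumption, and the function and variable names are hypothetical.

```python
# Figures stated in the release.
FRAGMENTS = 2_500_000        # DNA fragments to be isolated
FRAGMENT_LENGTH_BP = 6_000   # approximate length of each fragment
READ_LENGTH_BP = 500         # approximate bases read from each end

# Assumption (not from the release): a genome of about 3 billion base pairs.
GENOME_SIZE_BP = 3_000_000_000


def paired_end_totals(fragments, fragment_len, read_len, genome_size):
    """Return rough totals for a paired-end sequencing plan."""
    reads = fragments * 2                 # one read from each end
    sequenced_bp = reads * read_len       # bases actually read
    spanned_bp = fragments * fragment_len # bases physically spanned by fragments
    return {
        "reads": reads,
        "sequenced_bp": sequenced_bp,
        # Average number of times each base is read.
        "sequence_coverage": sequenced_bp / genome_size,
        # Average number of fragments spanning each base.
        "fragment_coverage": spanned_bp / genome_size,
        # Unread stretch between the two end reads of one fragment.
        "unread_middle_bp": fragment_len - 2 * read_len,
    }


totals = paired_end_totals(
    FRAGMENTS, FRAGMENT_LENGTH_BP, READ_LENGTH_BP, GENOME_SIZE_BP
)
for name, value in totals.items():
    print(f"{name}: {value:,.2f}")
```

Under these assumptions the plan reads about 2.5 billion bases, less than one pass over the genome, while the fragments themselves span it several times over. That difference is what lets paired ends bridge gaps: each pair links two short reads across an unread stretch of known approximate length.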

"As a physician as well as a laboratory scientist, I am excited about the potential of this collaboration to expedite the discovery of genetic information that will lead to improved diagnosis and treatment of disease," says Francis Collins, M.D., Ph.D., director of the National Human Genome Research Institute (NHGRI) of the National Institutes of Health. "This collaboration will yield a bumper crop of genetic variations. As a bonus, it will also improve the assembly of the human genome sequence so that it is closer to the highly polished 'finished' form that is our goal."

The DNA to be sequenced will come from 24 anonymous, unrelated donors with diverse geographic origins, making the new sequences a rich source of SNPs. As SNPs are identified, they will be validated, mapped, and deposited in the publicly accessible database, dbSNP.

A high-density map of SNPs - the single base pair variations that occur on average once every 1,000 base pairs throughout human DNA - is expected to be a valuable research tool that will help scientists pinpoint genetic differences that predispose some people, but not others, to disease, and that underlie variability in individual response to treatment. In turn, novel diagnostics and drugs can be developed that are tailored to patients' genetic profiles. The SNP Consortium will file provisional patent applications on newly identified and mapped SNPs solely to establish the dates of discovery, but no patents will be allowed to issue, keeping the data freely available for the unrestricted use of researchers worldwide.

"The collaboration between the Human Genome Project and The SNP Consortium shows that public-private cooperation can be an efficient means for developing basic research tools essential for the application of genetic information to the understanding and treatment of disease," says Arthur Holden, chairman and chief executive officer of the consortium, formed in April 1999. "Through this collaboration, The SNP Consortium will be able to contribute up to 50 percent more SNPs to the public domain than otherwise would have been possible under our original scientific plan."

The SNP Consortium's initial two-year plan had been to identify 300,000 SNPs and map at least 150,000 SNPs, evenly distributed throughout the genome. An exponential increase in the amount of human genetic sequence data that has become available from the Human Genome Project (HGP) over the past 15 months has enabled the consortium to proceed at a much faster pace than originally envisioned. To date, the consortium has identified over 140,000 SNPs and mapped 102,719 SNPs. With the HGP collaboration, the total number of validated and useful SNPs mapped may exceed 750,000 by December 2000.

The HGP is an international research effort to characterize the genomes of humans and selected model organisms through complete mapping and sequencing of their DNA, to develop technologies for genomic analysis, to examine the ethical, legal and social implications of human genetics research, and to train scientists who will be able to utilize the tools and resources developed through the HGP to pursue biological studies that will improve human health.

The International Human Genome Sequencing Consortium, which has been organized to meet the HGP goal of determining the sequence of the euchromatic portion of the human genome, announced on June 26 that it had assembled a "working draft" of the human genome. On that same day, Celera Genomics, a private sector effort using a different but complementary strategy, announced its "first assembly" of the human genome. The international consortium is on track to produce the "finished," highly polished reference version by 2003.

The HGP consortium includes scientists at 16 institutions in France, Germany, Japan, China, Great Britain and the United States. Participants in the international consortium have all adhered to the project's quality standards and to the daily data release policy. The HGP is funded by grants from government agencies and public charities in the participating countries.

The SNP Consortium is organized as a non-profit entity whose goal is to create and make publicly available a high-quality SNP map of the human genome. The consortium's members include the medical research charity The Wellcome Trust; 10 pharmaceutical companies, including AstraZeneca PLC, Aventis Pharma, Bayer AG, Bristol-Myers Squibb Company, F. Hoffmann-La Roche, Glaxo Wellcome PLC, Novartis, Pfizer Inc, Searle (now part of Pharmacia), and SmithKline Beecham PLC; Motorola, Inc.; IBM; and Amersham Pharmacia Biotech. Academic centers, including the Whitehead Institute for Biomedical Research, Washington University School of Medicine in St. Louis, the Wellcome Trust's Sanger Centre, Stanford Human Genome Center, and Cold Spring Harbor Laboratory, are involved in SNP identification and analysis.

Definitions:

Pilot sequencing projects: A set of projects that were initiated in 1996 by the HGP to test the feasibility of deciphering human DNA rapidly, efficiently, and on a large scale. These projects lasted three years, and their success demonstrated that sequencing the human genome was feasible.

Large-scale sequencing: Cost-effective DNA sequencing conducted on an industrial scale at a rate that is sufficient to generate the sequence of a genome as large as that of the human in a short time. Large-scale sequencing also is characterized by "high throughput." (see "depth of coverage")

"Working draft" sequence: An intermediate stage in the generation of a high-quality, "finished" sequence. "Working draft" sequence is defined as an average of 4X coverage (see "depth of coverage").

Sequencing rate of HGP: 1,000 bases of raw sequence per second, or 12,000 bases of "working draft" per minute. Twenty years ago, deciphering that many bases would have required one year or more. Three years ago, when pilot sequencing projects to evaluate feasibility of human DNA sequencing were initiated, deciphering 12,000 bases required 20 minutes.
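The two rates quoted above can be related with simple arithmetic. The release does not say how one figure was derived from the other; the sketch below assumes the difference reflects redundant sequencing, in which each base of "working draft" is read several times. The variable names are illustrative.

```python
# Figures stated in the release.
RAW_BASES_PER_SECOND = 1_000
DRAFT_BASES_PER_MINUTE = 12_000
PILOT_MINUTES_PER_12K_BASES = 20   # time needed three years earlier

# Convert the raw rate to bases per minute.
raw_bases_per_minute = RAW_BASES_PER_SECOND * 60

# Assumption: raw output divided by draft output approximates the
# average number of times each draft base is sequenced.
implied_depth = raw_bases_per_minute / DRAFT_BASES_PER_MINUTE

# 12,000 bases now take one minute instead of twenty.
speedup_since_pilot = PILOT_MINUTES_PER_12K_BASES / 1

print(f"raw bases per minute: {raw_bases_per_minute:,}")
print(f"implied depth: {implied_depth:.1f}X")
print(f"speedup since pilot projects: {speedup_since_pilot:.0f}x")
```

On this reading, the implied redundancy of about 5X lines up with the 4 to 5X range given for "working draft" sequence.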

"Depth of coverage": The number of times the DNA in a chromosome region is sequenced. A depth of 1 (1X) means that, on average, a particular base pair has been sequenced once; a depth of 4 (4X) means that, on average, a particular base pair has been sequenced four times over. Sequencing the same region many times decreases the possibility of errors in the DNA sequence. Current sequencing instruments can decipher about 500 to 800 bases at a time in a single sequencing "run." The results from these individual "runs" have to be assembled into contiguous stretches of sequence to reconstruct the sequence of a chromosomal region. To build up an accurate assembly from the 500 to 800 base pair stretches of DNA sequence that emerge from the machines, HGP scientists repeatedly sequence random fragments from each chromosome. (See BAC-based sequencing and assembly.) Repeated sequencing allows assembly of much larger regions of DNA because the random individual "runs" overlap with each other, creating areas of commonality that allow the scientists to align the short chunks of DNA sequence into long contiguous sequences. The average "depth of coverage" of the HGP's sequence across the human genome in GenBank is 6 to 7X. This includes "finished" sequence (9 to 10X), deep shotgun (8 to 10X), and "working draft" (4 to 5X). In addition, the average "depth of coverage" of clones across the human genome is estimated to be 32X.
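A toy example may make "depth of coverage" and overlap-based assembly more concrete. The sketch below is a simplification under stated assumptions: each "run" is given as a known start position and length on a short region, whereas real assembly software must discover overlaps by comparing the sequences themselves. The function names and the toy data are hypothetical.

```python
def depth_per_base(region_length, reads):
    """Count how many reads cover each position in a region.

    reads is a list of (start, length) pairs with 0-based starts.
    """
    depth = [0] * region_length
    for start, length in reads:
        for pos in range(start, min(start + length, region_length)):
            depth[pos] += 1
    return depth


def merge_into_contigs(reads):
    """Merge overlapping reads into contiguous (start, end) intervals.

    The end position is exclusive. Reads that merely touch without
    overlapping are kept apart, since there is no shared sequence
    to align them on.
    """
    contigs = []
    for start, length in sorted(reads):
        end = start + length
        if contigs and start < contigs[-1][1]:
            # Overlaps the previous contig: extend it.
            contigs[-1] = (contigs[-1][0], max(contigs[-1][1], end))
        else:
            contigs.append((start, end))
    return contigs


# Toy data: a 20-base region covered by four 8-base "runs".
toy_reads = [(0, 8), (4, 8), (6, 8), (12, 8)]

toy_depth = depth_per_base(20, toy_reads)
mean_depth = sum(toy_depth) / len(toy_depth)

print("depth at each base:", toy_depth)
print(f"average depth: {mean_depth}X")
print("contigs:", merge_into_contigs(toy_reads))

# Two runs with a gap between them cannot be joined.
print("with a gap:", merge_into_contigs([(0, 8), (10, 8)]))
```

In the toy data the four overlapping runs merge into a single contiguous stretch covering the whole region, while the two runs separated by a gap remain separate. This is the situation in which paired-end information or additional sequencing is needed to close the gap.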