2001: First Draft of the Human Genome Sequence Released

The Human Genome Project international consortium published a first draft and initial analysis of the human genome sequence. The draft sequence covered more than 90 percent of the human genome. One surprise was that the estimated number of genes was lower than expected, just 30,000-35,000. (The final genome sequence produced in 2003 further lowered this estimate to the 20,000-25,000 range.) The sequence data were immediately and freely released to the world. Researchers can access the data through public databases on the Internet and can use the information without restriction. At the same time, another version of the human genome sequence was published by J. Craig Venter and colleagues working at Celera Genomics Corporation.

The effort to sequence the human genome has been referred to as biology's moonshot. The International Human Genome Sequencing Consortium included hundreds of researchers working at 20 centers located in China, France, Germany, Great Britain, Japan, and the United States. The centers producing the most sequence data were Baylor College of Medicine, Houston, TX; Washington University School of Medicine, St. Louis, MO; Whitehead Institute/MIT Center for Genome Research, Cambridge, MA; the Department of Energy's Joint Genome Institute, Walnut Creek, CA; and The Wellcome Trust Sanger Institute near Cambridge, England. In the United States, the effort was led by the National Human Genome Research Institute and the Department of Energy.

More Information

Summary Data Taken from the Draft Sequence of the Human Genome:

The draft sequence of the human genome contains some small gaps that remain to be filled. Nevertheless, scientists have already begun the process of analyzing the data. Some of the important observations were:

The estimated number of genes is about 30,000 (later revised to about 20,000-25,000). This is only about one-fourth as many as previously thought, only a few thousand more than in the tiny roundworm C. elegans, and fewer than in the plant Arabidopsis thaliana.

The haploid human genome sequenced contains 2.85 billion bases.

The average gene consists of about 40,000 bases, but gene sizes vary greatly. The largest known human gene, dystrophin (which is associated with Duchenne muscular dystrophy), spans approximately 2.4 million bases.

The DNA sequence in any two individuals is 99.9 percent identical.

The functions of more than half of the discovered genes are unknown.

Less than 2 percent of the genome codes for proteins.

The proteome (the complete collection of proteins in the cell) is larger than the genome. The average human gene produces three different proteins.

Repeated sequences that do not code for proteins make up at least half of the human genome.

The gene-rich regions of the genome are predominantly composed of the bases guanine and cytosine, while in gene-poor regions, the bases adenine and thymine dominate.

Genes appear to be concentrated in random areas along the genome, with vast expanses of noncoding DNA in between.

Chromosome 1 has the most genes (about 3,000) and the Y chromosome has the fewest (about 230).

Over 3 million single nucleotide polymorphisms (SNPs) have been found in the human genome. SNPs are common single-base variations in the genome. They are being used to identify regions of the genome associated with disease. The human genome is estimated to contain approximately 10 million SNPs.

The number of germline (sperm or egg cell) mutations in males is approximately twice that seen in females.

In humans, genes are unevenly spread throughout the genome, while in prokaryotes, genes are evenly spaced throughout the genome.

The human genome has a much greater proportion (50 percent) of repeat sequences than the mustard weed Arabidopsis thaliana (11 percent), the nematode worm (7 percent), and the fruit fly (3 percent).
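
To give a sense of scale, several of the percentages listed above can be turned into rough absolute numbers. The short Python sketch below does that arithmetic using only the figures quoted in this summary. It is a back-of-the-envelope illustration, not part of the consortium's analysis: the variable and function names are hypothetical, and it assumes that each rounded figure (99.9 percent, 2 percent, 50 percent, three proteins per gene) can be treated as exact and applied to the 2.85 billion bases of the haploid sequence.

```python
# Figures quoted in the draft-sequence summary. Treating these rounded
# values as exact is an assumption made only for illustration.
GENOME_BASES = 2_850_000_000   # haploid genome sequenced
GENE_COUNT = 30_000            # draft estimate of the number of genes
PROTEINS_PER_GENE = 3          # average proteins produced per gene


def derived_figures():
    """Convert the quoted percentages into approximate absolute counts."""
    return {
        # 99.9 percent identical means about 1 base in 1,000 differs.
        "differing_bases": GENOME_BASES // 1000,
        # "Less than 2 percent" codes for proteins: an upper bound.
        "max_coding_bases": GENOME_BASES * 2 // 100,
        # "At least half" is repeated sequence: a lower bound.
        "min_repeat_bases": GENOME_BASES // 2,
        # About 30,000 genes, each producing three proteins on average.
        "approx_proteins": GENE_COUNT * PROTEINS_PER_GENE,
    }


if __name__ == "__main__":
    for name, value in derived_figures().items():
        print(f"{name}: {value:,}")
```

Under these assumptions, two individuals would differ at roughly 2.85 million positions, fewer than 57 million bases would code for proteins, at least 1.425 billion bases would be repeated sequence, and the draft gene count would correspond to about 90,000 proteins.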