Analysis of expressed sequence tags indicates there are only 35,000 human genes

METHODS

EST data and assembly. We obtained chromatograms for 1,043,599
human ESTs (essentially all those available prior to Aug. 5, 1999),
from 168 cDNA libraries, which are available from the Washington University
Genome Sequencing Center's
Merck
and CGAP archives.
Chromatograms lacking a corresponding dbEST1 entry
were discarded.
Phred2,3
was used to derive
base calls and quality values (log-transformed error probabilities).

Cloning vector sequences were obtained from ref. 4;
from the
IMAGE,
dbEST,
NCI/CGAP WWW sites; and
by mini-assembly of the first 125 bases of a sample of 500 reads from each
of the EST libraries. The portions of each EST which derive from the cloning
vector were then identified by sequence comparison using
cross_match (see below), and masked.

The high-quality segment of each EST was determined as follows:
The i-th base of the EST is assigned a score
of si = .05 -
ri,
where ri is the phred error
probability of the base.
The score of the read segment spanning bases m to n is
then defined as the sum of si from
i = m to n;
the segment having the highest score is easily
determined via a simple dynamic programming algorithm. Roughly
speaking, this corresponds to the portion of the read for which the
estimated base-calling accuracy is 95% or higher. Reads having
high-quality segments shorter than 30 bases were eliminated from
further analysis. The average length of the high-quality segments in
the remaining reads is 350 bases.

Following elimination of low-quality ESTs and those not in dbEST
there remained 992,353 ESTs. Due to their large numbers and uneven
distribution among genes we carried out the assembly in a stepwise
fashion. A random subset of 300,000 of the ESTs was first selected,
and assembled using phrap. (For assembly, the full
vector-masked sequence of each read was used, and information regarding
EST orientation was ignored). The entire set of 992,353
EST reads was then compared to the contigs using cross_match
(see below). Contigs with 20 or more matching reads were identified,
and all reads not matching one of these contigs were assembled, in a
second pass.

The entire set of reads was then compared to the combined set of
contigs from the two assembly passes. To be accepted, matches were
required to have no more than five discrepancies in the high-quality
portion of a read (not counting discrepancies unconfirmed by any other
read), and to have no more than ten bases at each end of the read's
high-quality segment which failed to align. The set of all reads not
matching any contig by these criteria was then assembled, in a third
pass. The entire set of reads was then compared to the combined set of
contigs from all three passes, and acceptable matches identified.

We then extracted the portion of each contig sequence that was
"confirmed" in the sense that it matched (by the above criteria) the
high-quality regions of reads from two or more distinct clones.
Because of small but non-trivial rates of lane-tracking errors which
occasionally assign the wrong clone identifier to a read
5,6, and
well-to-well cross-contamination within and between microtiter plates,
we considered two reads to be from different clones only if they came
from distinct microtiter plates (indicated by the first four
characters of the read name) and had names differing by at least two
characters.

Approximately 5 to 10% of EST reads are chimeric, due to errors in
cloning or lane-tracking
5,6.
Inclusion of these in the assembly
may create chimeric contigs which spuriously join segments from two
different genes. After each assembly pass we identified contigs which
were potentially chimeric by virtue of their having internal regions
which matched only one read (or several reads all derived from the
same microtiter plate), flanked by regions matching multiple reads.
Such contigs were split into their confirmed pieces, and reads
spanning the internal region were flagged as likely chimeras and
discarded from further consideration. Additional likely chimeric
reads were identified by virtue of being the unique read matching two
different contigs. In all, approximately 35,000 reads were identified
as probable chimeras and eliminated.

Following assembly, 27 contigs found to match E. coli or
lambda, sequence, and 144 contigs consisting primarily
(>= 80%) of reads
from 103 plates known to include contaminating mouse ESTs, were removed
as likely contaminants.

We further eliminated contigs whose confirmed length was less than 150
bases. There remained 62,064 contig sequences, which were used in the
comparisons below. Approximately 795,000 of the ESTs have alignments
to these contigs meeting the criteria above; the remaining 197,000
ESTs consist primarily of unconfirmed singletons (with no matches to
other reads), chimeras, contaminants, unconfirmed splice variants, or
other artifacts.

43,278 of the contigs were 3' contigs (80% or more of the reads
supported a consistent orientation to the contig, and at least 25% of
the reads, and a minimum of two, were derived from the putative 3' end
of the cDNA clones).

mRNA clustering
Human entries of locus type mRNA, RNA, and DNA, and containing
annotated coding sequence in a single contiguous base range were
extracted from Genbank Release 115.0 (Dec. 15, 1999). Entries with
the `pseudo' qualifier or with definitions containing the words
`mutant', `mutation', `variant', and `variation', or for which start
or stop codon placement was inconsistent with annotation, were
excluded. A total of 21,031 sequences meeting these criteria were
identified. These were compared to each other using cross_match
(see below), and then clustered first into "reference sequence
groups" and then genes.

To be assigned to the same reference sequence group sequences were
required to be 98% identical over at least 90% of their length.
There were 13,212 reference sequence groups.
To be assigned to the same gene, allowing for different alternatively
spliced forms, sequences were required to be at least 98% identical
over a 150 base or longer region that overlapped the coding sequence
of the gene and included at least 20% of the untranslated region of
the gene.
Manual inspection of a number of assignments suggest that these
criteria have non-zero, but small (< 5%) false positive and false
negative rates.

There were 9,465 gene clusters produced, many of which however
contained only partial sequence. For the sequence comparisons we only
considered genes which included at least one mRNA with an apparent
full-length coding sequence at least 100 bases long, and at least 20 bases
of 3' untranslated sequence. There were 7,662 genes meeting these
criteria.

Chromosome 22 sequence. The chromosome 22 sequence was taken
from the file Chr_22_analysis_version_22-10-1999.fa, downloaded
from the Sanger Centre.
The estimated number of genes on chromosome 22 was based on numbers
given in ref.
7,
as follows: 545 genes were identified through intensive annotation
efforts. (We do not include pseudogenes, which are irrelevant for the
present purpose). An additional 325 putative genes were predicted by
the program Genscan8,
of which 100 are estimated to be real and the
remainder false positives. The false negative rate is reported to be
< 5%. We therefore take 545 + 100 + .05 × 700 = 680 to be
the estimated number of genes on this chromosome. The authors suggest
that the number of CpG islands they detect may point to a higher gene
number, however a higher number would conflict with the reported false
positive and false negative rates of Genscan. We consider it
more likely that the excess CpG island count is attributable to the
fact that such islands are often associated with repetitive elements,
pseudogenes, and RNA genes
9
and/or may reflect the relatively high G+C content of this chromosome
as compared to the genome average.

Sequence comparisons.
Comparisons between sequences were performed using our program
cross_match, which implements a banded Smith-Waterman
algorithm10.
Cross_match finds perfectly matching
sequence segments of a specified minimum length (15 for the
comparisons to repbase, and 30 for the comparisons between the main
sequence sets) between pairs of sequences, and then searches a "band"
of width 15 centered on the diagonal of the Smith-Waterman matrix
defined by the word match to find the highest scoring gapped
alignment. We used relatively sensitive alignment scoring parameters
of +1 for a match, -2 for a substitution mismatch, -4 for gap opening,
and -3 for gap extension, but imposed stringent additional
requirements (given below) regarding % identity and the proportion of
the query sequence represented in the alignment to filter out
alignments not likely to be true matches.

All sequences were initially compared to the database
repbase
of mammalian repeats (Smit, A., Jurka J., Kapitonov, V., Niak, A.),
using relatively
sensitive parameter settings. Repeats were tagged, rather than
masked. Maximal sensitivity was not required, since the stringent
criteria applied below would eliminate matches between all but the
most evolutionarily recent repeats. In all subsequent comparisons of
sequences, matches lying essentially entirely within a tagged repeat
(allowing at most 20 bases past either end of the tagged region) were
discarded.

For comparisons of the contig sequences to chromosome 22 and to the
mRNAs, we discarded all matches having a greater than 2% discrepancy
rate (i.e. the aligned regions were required to be at least 98%
identical). For the chromosome 22 comparison, we additionally
required matches to include at least 90% of the contig length. For
the mRNA comparison this was relaxed to a minimum of 100 aligned bases
since some of the mRNAs are incomplete, and/or represent different
alternatively spliced forms for the same gene.

There are likely to be some false positives and false negatives in
these comparisons. False positive matches would occur with closely
paralogous genes, less than 2% different at the DNA level, created by
evolutionarily recent duplications (within the last 10 million
years). While there are certainly examples of these
7,
they are not
currently believed to involve a significant percentage of genes. A
likely more substantial source of error is false negatives (missed
matches) arising from incomplete or inaccurate sequences which cause
the alignment to fail to meet our stringency criteria. The effect of
these would be to inflate the gene estimate above its true value.

Technical comments on previous gene number estimates. Using a
method similar to ours in which EST clusters were compared to mRNAs,
Fields et al.11
arrived at an
estimate of 60,00070,000
genes. A major difference with our calculation was that they included
clusters consisting of a single unconfirmed EST. We believe this may
spuriously inflate the estimate by a significant amount due to the
presence of contaminant, artifactual, or inaccurate ESTs. The cDNA
library normalization process tends to enrich for contaminants and
aberrant clones, and even if they still represent only a tiny
percentage of the total EST set they can constitute a large fraction
of the clusters. For that reason we used conservative criteria for
deriving contig sequences. Although many of the unconfirmed ESTs are
undoubtedly real, eliminating them does not affect the validity of our
calculation, because the set of EST contigs is not assumed or required
to be a random sample of all genes and part of it can therefore be
removed without biassing the estimate.

We believe that even some of what we consider "confirmed" EST
contigs are cases in which a single contaminant or artifactual clone
spuriously confirms itself because it has been duplicated due to
library amplification or well-to-well or plate-to-plate
cross-contamination. Such cases may account in part for recent
unpublished estimates of a very high gene number based on counting EST
contigs assembled from large private databases
12. A low rate of
such artifacts still produces large absolute numbers when the data set
is very large.

Antequera and Bird13
estimated that there are 45,000
unmethylated CpG islands in the human genome, which, under the
assumptions that 56% of genes have an island, and that all islands
are associated with genes, extrapolates to an estimate of 80,000
genes. Several aspects of their calculation are open to question and
may have inflated this estimate: (i) In converting from tritiated
counts to genome percentage (Table 1 of ref.
13), it was assumed that
DNA fractions 1 and 2 have the same overall G+C content as
the CpG island fragments in these fractions. However these fractions
were known to contain a significant fraction (28.6%) of non-island
DNA, which may reasonably be assumed to have a G+C content equal to
the genome average (40%) rather than that of island
fragments (67.1%). Making this correction reduces the estimated number of
islands from 45,000 to 34,200. (ii) Our own analyses (unpublished) of
CpG islands suggest that the estimated length of an island is somewhat
sensitive to the precise definition that is applied. Using a
definition more sensitive to CpG dinucleotide frequency (which should
be the best indicator of methylation status) than to C+G content,
we obtain an average island length of about 1.3 kb rather than 1
kb. Use of this value would reduce the island count further, to
26,300, and the estimated number of genes to about 47,000. (iii)
There may be islands not associated with genes. For example, many
transposable elements and pseudogenes are known to have CpG islands
9, some of which
may be unmethylated; and as some
tissue-specific genes have one or more islands associated with cryptic
internal promoters that do not produce translatable transcripts
9, it seems possible
that other islands are not associated with any gene.(iv) A further
possibility is that the size of the gene-bearing, euchromatic portion
of the genome, which must be assumed in the CpG island calculation but
is irrelevant in ours, is substantially smaller than has been assumed,
as in fact was the case with chromosome 22
( ref. 7).