Frederick R. Blattner et al. (1), when describing the complete sequence of the Escherichia coli chromosome, correlated an overall DNA property, “GC skew” [the quantity (G − C)/(G + C) averaged over a sliding window of arbitrary length 10 kb] with the direction of DNA replication. GC skew for replichore 1 (rightwards from the origin on the presented strand) oscillates considerably, yet remains almost entirely positive along its whole length, while replichore 2 shows the opposite behavior. Kunst et al. (2) did not present such an analysis for the sequence of the Bacillus subtilis chromosome, but did note that the GC skew changes sign at the origin, an observation made earlier by Lobry (3), who documented it for the replication origins of E. coli, Haemophilus influenzae, B. subtilis, and Mycoplasma genitalium and for the terminus of H. influenzae.
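The windowed GC-skew quantity described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in (1): the function name `gc_skew`, the `step` parameter, the default of non-overlapping windows, and the toy sequence are all assumptions made for the example.

```python
def gc_skew(seq, window=10_000, step=None):
    """Return (window_start, skew) pairs, where skew = (G - C) / (G + C).

    `step` defaults to `window` (non-overlapping windows); this default is
    an assumption of the sketch. Use step=1 for a fully sliding window.
    """
    step = step or window
    seq = seq.upper()
    result = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        # Windows without any G or C are reported as zero skew (assumption).
        skew = (g - c) / (g + c) if (g + c) else 0.0
        result.append((start, skew))
    return result


# Toy sequence: a G-rich half followed by a C-rich half.
toy = "GGGCA" * 4 + "CCCGT" * 4
skews = gc_skew(toy, window=20)
print(skews)  # [(0, 0.5), (20, -0.5)]
```

The sign change between the two windows of the toy sequence mimics, on a tiny scale, the sign change reported at replication origins.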

In contrast to GC skew, which is a derivative function of the base composition along a DNA sequence, we have computed three integral functions of the sequences of nine complete prokaryotic genomes (Table 1). Composite graphs for three of these genomes are presented (Fig. 1), and the remainder are available on a linked website. We define “purine excess” as the sum of all purines minus the sum of all pyrimidines encountered in a walk along the sequence up to the point plotted (4). “Keto excess” is the same function calculated for the keto bases (GT) minus the amino bases (AC), and “coding-strand excess” is the sum of all nucleotides encountered along the sequence that are in coding sequences, minus those that have complements (on the opposite strand) that are in coding sequences; bases in non-coding regions add zero to this sum. Graphs of these functions reveal nonrandom patterns, the most striking of which is the clear correlation between purine excess and the origins and termini of DNA replication (Fig. 1). In every case where independent information is available, the minimum in the purine-excess curve corresponds to the origin (Table 1). We suggest that this regularity may hold for most prokaryotic genomes. Conversely, the maxima of the purine-excess curves (Fig. 1) correlate strongly with known or suspected replication termini (5). Keto-excess curves reflect the same correlation, although for most genomes the minima and maxima (thus, predicted origins and termini) are not as sharply defined as for the purine-excess functions. Haemophilus influenzae represents a notable exception to this rule (compare the keto-excess curve in Fig. 1B).
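As a minimal sketch of the purine-excess and keto-excess definitions given above, the running sums can be computed as cumulative totals of +1/−1 steps. The names `PURINE_STEP`, `KETO_STEP`, and `excess` are hypothetical, and treating ambiguous symbols (such as N) as contributing zero is an assumption of the example, not something stated in the text.

```python
from itertools import accumulate

# +1 for purines (A, G), -1 for pyrimidines (C, T).
PURINE_STEP = {"A": 1, "G": 1, "C": -1, "T": -1}
# +1 for keto bases (G, T), -1 for amino bases (A, C).
KETO_STEP = {"G": 1, "T": 1, "A": -1, "C": -1}


def excess(seq, step):
    """Cumulative excess along `seq`; unknown symbols add zero (assumption)."""
    return list(accumulate(step.get(base, 0) for base in seq.upper()))


seq = "GAATTTC"
purine = excess(seq, PURINE_STEP)
keto = excess(seq, KETO_STEP)
print(purine)  # [1, 2, 3, 2, 1, 0, -1]
print(keto)    # [1, 0, -1, 0, 1, 2, 1]
```

Each list entry is the value of the function after the corresponding base, so plotting the list against position gives a curve of the kind described in the text.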

Table 1

Completely-sequenced bacterial genomes analyzed for base and coding asymmetries and their origins and termini of replication.

Fig. 1. Purine excess (blue curves), keto excess (red curves), and coding-strand excess (green curves) for the complete genomes of (A) E. coli (1), (B) H. influenzae (16), and (C) M. jannaschii (17). Known origins and termini of replication are marked. Abscissa represents the genomic sequence position from the beginning to the end of the genome; left ordinate represents the count of purine and keto excesses; right ordinate represents the Watson coding-strand excess count at a given position. Green histograms across the bottom of each graph display the correlation coefficients between purine excess and coding-strand excess for 25-kb windows. Graphs of six additional genomes (Table 1) can be viewed on the Web at http://bmerc-www.bu.edu/genomeplot/

Other genome features stand out in these graphs. The relatively smooth, featureless curve for E. coli contrasts with the much rougher patterns displayed by H. influenzae and Synechocystis PCC6803 (see linked website for data). This likely reflects a greater tendency of the latter organisms to take up foreign DNA and integrate it into the chromosome (6, 7), a point supported by the correlation of the density of DNA-uptake sequences in H. influenzae (6) with many of the inflection points of the purine-excess curve (8). Likewise, the sites of μ prophage integration in H. influenzae cluster most densely around the pronounced minimum in the purine-excess curve adjacent to the terminus (Fig. 1B). The larger megaplasmid (pNGR234a) of Rhizobium sp. NGR234 also displays similar behavior (8), in keeping with its recognized characteristics as a “transposon trap” (9).

Examination of the relationship between base-composition and coding asymmetries at the whole-genome level shows close parallels between coding-strand and purine excess for seven out of nine genomes. E. coli shows typical behavior (Fig. 1A). Haemophilus influenzae and Synechocystis display much weaker correlations on this scale. At a finer level of detail, there are substantial correlations between these functions for all the genomes we studied, but the results for the two archaebacteria, M. jannaschii and M. thermoautotrophicum, are particularly striking (Fig. 1C), showing strong correspondence between coding-strand and purine excess.
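The coding-strand excess compared with purine excess here can be sketched from a list of gene coordinates. The annotation format (0-based, half-open `(start, end, strand)` tuples), the function name `coding_strand_excess`, and the toy gene list are assumptions made for illustration only; positions covered by coding sequences on both strands are taken to cancel, which follows from the definition but is not spelled out in the text.

```python
from itertools import accumulate


def coding_strand_excess(length, genes):
    """Cumulative coding-strand excess for a sequence of `length` bases.

    `genes` is an iterable of (start, end, strand) with 0-based, half-open
    coordinates and strand '+' (presented strand) or '-' (complement).
    This annotation format is a hypothetical one chosen for the sketch.
    """
    on_presented = [False] * length
    on_complement = [False] * length
    for start, end, strand in genes:
        target = on_presented if strand == "+" else on_complement
        for i in range(start, end):
            target[i] = True
    # +1 if coding on the presented strand, -1 if coding on the complement,
    # 0 for non-coding positions (or positions coding on both strands).
    steps = (int(p) - int(c) for p, c in zip(on_presented, on_complement))
    return list(accumulate(steps))


toy_genes = [(0, 4, "+"), (6, 9, "-")]
curve = coding_strand_excess(10, toy_genes)
print(curve)  # [1, 2, 3, 4, 4, 4, 3, 2, 1, 1]
```

A curve of this kind can then be compared, window by window, with the purine-excess curve for the same sequence.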

What forces might give rise to the long-range patterns of strand asymmetry in bacterial genomes? There is a prominent correlation between purine excess and replication direction, which suggests asymmetrical errors in DNA synthesis as an explanation. In the absence of transpositions and insertions, a bias favors accumulation of purines in the leading strand. However, this contradicts expectations that lagging strand synthesis should be more error-prone (10), and thus that most purine substitutions (the principal cause of transversions) should occur there. Francino and Ochman (11) have argued, on the other hand, that transcriptional effects can account for DNA strand asymmetry because transcription-coupled repair will remove the most frequent types of DNA damage (deaminated cytosines and pyrimidine-dimers), thereby reducing harmful mutations. This only occurs on the transcribed (that is, template) strand, which therefore will become pyrimidine-rich. In addition, the template strand is significantly protected against DNA damage during transcription, whereas the coding strand is exposed. Under this model, evolutionary selection should increase the less mutationally vulnerable purine content of the coding strand.

Mycoplasma genitalium conforms to the predictions of the transcription-coupled repair model particularly well: in replichore 1, 85% of the open reading frames (ORFs) correspond to the presented (purine-rich) strand up to the putative terminus (maximum in the purine-excess curve). For the other replichore, 77% of the ORFs occur in the complementary strand. In E. coli, strand preference is less pronounced: only 55% of the genes are aligned with the replication direction (1). However, Francino has analyzed the codon adaptation index (CAI), a measure strongly associated with the extent of gene expression in E. coli, and finds that 74% of the genes with CAI ≥ 0.5 and 84% of those with CAI ≥ 0.6 are situated on the leading strand (11), that is, with the direction of transcription the same as replication (12). In addition to favoring transcriptional repair, a major advantage to this arrangement is that head-on collisions between replication and transcription complexes will be reduced (13).

Functions like those described here promise to be revealing tools for whole-genome analysis (4). For example, in the absence of any other information, the global minimum of the purine excess locates the probable origin of replication, and its maximum is the likely terminus for prokaryotic genomes. Similar regularities may emerge from the impending deluge of eukaryotic DNA sequences. We have already shown that the patterns of purine-excess plots correlate well with phylogenetic position for mitochondrial DNAs (14), and graphs of coding-strand excess in the Saccharomyces cerevisiae genome tend to match the purine-excess curves (15).
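The origin and terminus prediction described above amounts to finding the global minimum and maximum of the purine-excess curve. The following is a rough sketch under stated assumptions: the function name `predict_origin_terminus` is hypothetical, positions are reported 1-based, ties are resolved in favor of the first occurrence, and the toy sequence is invented. The result is a heuristic estimate only; the notes to this text observe that known termini fall slightly beyond the maximum.

```python
from itertools import accumulate

# +1 for purines, -1 for pyrimidines; other symbols add zero (assumption).
STEP = {"A": 1, "G": 1, "C": -1, "T": -1}


def predict_origin_terminus(seq):
    """Return 1-based positions of the global minimum (putative origin)
    and global maximum (putative terminus) of the purine-excess curve."""
    curve = list(accumulate(STEP.get(base, 0) for base in seq.upper()))
    positions = range(len(curve))
    origin = min(positions, key=curve.__getitem__) + 1
    terminus = max(positions, key=curve.__getitem__) + 1
    return origin, terminus


# Toy "genome": pyrimidine-rich, then purine-rich, then pyrimidine-rich.
toy = "TC" * 5 + "AG" * 10 + "CT" * 5
origin, terminus = predict_origin_terminus(toy)
print(origin, terminus)  # 10 30
```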

Purine excess: $\chi(l) = \sum_{i=1}^{l} \left[\delta_{A,S_i} + \delta_{G,S_i} - \delta_{T,S_i} - \delta_{C,S_i}\right]$, where $S_i$ is the base present at sequence position $i$, the sum is performed over the range 1 to $l$, and $\delta_{X,Y} = 1$ if $X = Y$ and 0 if $X \neq Y$. Interchanging the A and T subscripts in this equation defines the keto excess. A DNA sequence can be uniquely described as a walk through a three-dimensional vector space, defined by two orthogonal axes for the two types of base pair and a third perpendicular axis that represents the sequence position (3) (scheme). An A in the sequence corresponds to movement in the positive x direction and a T to the opposite. G and C are mapped by analogous steps along the y axis and sequence position increases along z.

For example, starting at the origin of such a coordinate system, if the first base encountered is G, then the vector trace generates the point (0, +1, +1), where the indices are the usual Cartesian coordinates. If the second base is A, the trace extends to (+1, +1, +2), and so forth. The trace corresponding to GAATTTC continues on through (+2, +1, +3), (+1, +1, +4), (0, +1, +5), and (−1, +1, +6) to (−1, 0, +7). Negative values of sequence position can also be used, which allows the origin to correspond to any convenient point in the sequence. As indicated by the scheme, the purine-excess and keto-excess functions that we have graphed for the nine prokaryotic genomes consist of steps along one or the other of two diagonal axes in this sequence space. Alternatively, the functions can be visualized as projections of the vector sequence trace onto one or the other of two vertical planes that cut the base-composition plane along the designated axes.
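The worked trace for GAATTTC can be reproduced with a short sketch. The names `MOVES` and `dna_walk` are hypothetical, and ignoring symbols other than A, C, G, and T (while still advancing the position) is an assumption. Because x counts A minus T and y counts G minus C, the two diagonal projections x + y and y − x recover the purine excess and keto excess, respectively.

```python
# Steps in the base-composition plane: A/T along x, G/C along y.
MOVES = {"A": (1, 0), "T": (-1, 0), "G": (0, 1), "C": (0, -1)}


def dna_walk(seq):
    """Return the (x, y, z) points visited by the vector trace of `seq`."""
    x = y = 0
    trace = []
    for z, base in enumerate(seq.upper(), start=1):
        dx, dy = MOVES.get(base, (0, 0))  # unknown symbols: no x/y movement
        x += dx
        y += dy
        trace.append((x, y, z))
    return trace


trace = dna_walk("GAATTTC")
print(trace)
# [(0, 1, 1), (1, 1, 2), (2, 1, 3), (1, 1, 4), (0, 1, 5), (-1, 1, 6), (-1, 0, 7)]

# Diagonal projections of the trace.
purine_excess = [x + y for x, y, _ in trace]
keto_excess = [y - x for x, y, _ in trace]
print(purine_excess)  # [1, 2, 3, 2, 1, 0, -1]
print(keto_excess)    # [1, 0, -1, 0, 1, 2, 1]
```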

The precise locations of the three known termini (Table 1) actually fall slightly beyond the maximum of the purine-excess curve, and they coincide in every known case with the end of a segment that has a sharply negative slope in the coding-strand excess curve.

In the case of E. coli phage λ, the purine-excess plot has a minimum at the replication origin and a major dip just before it. From the origin, there is a rise that has a continuous run of ORFs coded on the presented strand (61% of total ORFs in the genome) that are thus transcribed along the phage's one-way replication direction. The dip region is coded exclusively on the complementary strand (31% of total ORFs). The other 8% alternate between strands at the start of the dip.

We thank B. Rogers for helpful discussions regarding visualizations, the Boston University Office of Information Technology and the Scientific Computing and Visualization Group for supercomputing resources, and an anonymous reviewer for helpful suggestions. S.C.M. is partially supported by a training grant from the U.S. National Human Genome Research Institute (T32 HG00041-03). Grant DE-FG02-98ER62558 from the U.S. Department of Energy supported this research.