Since mRNA-synonymous strands follow Szybalski's transcription
direction rule in being G-rich, and the average ORF is about 1 kb, then,
on this basis alone, Chi sequences are seen to reside in 1 kb G-rich
"islands". However, RecA preferentially binds GT-rich
sequences, suggesting that genomic context might potentiate Chi action. Consistent with this,
we report for E. coli that 1 kb sequence windows with Chi near their centres are a
distinct subset of total 1 kb windows, the mRNA-synonymous strands being preferentially
enriched in both G and T.

Chi function might be particularly important for bacteria which
survive high temperature and radiation. These often exist in habitats where recombination
with E. coli DNA would be unlikely, so canonical Chi sequences might not confer a
selective disadvantage in this respect. In general, Chi sequences are not more frequent in
thermophilic bacteria and D. radiodurans than in E. coli and other
mesophilic bacteria. Only two of five thermophilic bacteria examined showed preferential
location of Chi sequences to mRNA-synonymous strands. In the thermophile M. jannaschii,
windows containing the canonical Chi sequence do not form a distinct subset.

We suggest that in thermophilic bacteria and D. radiodurans
the Chi function may be achieved by sequences which differ from the canonical Chi
sequence, or the number of these sequences is sufficient, or the Chi function is
unnecessary.

Cross-over hot-spot instigator
(Chi) sequences (5'-GCTGGTGG-3') are abundant, strand-specific
sequences that locally increase recombination in Escherichia coli. Noting
regularities in surrounding sequences, Tracy et al. (1997b) suggested that canonical
8-base Chi sequences in E. coli usually exist within approximately one kb GT-rich
"recombination islands". A correlation of Chi
orientation with transcription direction (Burland et al., 1993) suggested that the islands
might also be "transcription islands", implying a
relationship between transcription and recombination (Bell et al., 1998; Bell and
Forsdyke, 1999a,b). Since in E. coli open reading frames (ORFs) form the main
component of transcripts, the islands could also be considered "translational
islands" (Biaudet et al., 1998; Colbert et al., 1998).

Mutations creating or inactivating Chi activity occur only in Chi, not in surrounding DNA
(Smith, 1997), and insertion of the Chi octomer into DNA enhances recombination in all
cases reported (Dixon and Kowalczykowski, 1991; Dabert et al., 1992; Kuzminov et al.,
1994). These observations imply that the octomer alone is sufficient for activity.
However, recombination hotspots also occur in eukaryotes, where they may be influenced by
chromosomal context (Ponticelli and Smith, 1992). Noting that RecA, a key enzyme
influenced by Chi, binds preferentially to GT-rich DNA, it was suggested
that the genomic context of Chi might potentiate its effectiveness (Tracy et al.,
1997a,b). If this were so, then it is possible that one kb
sequence "windows" with Chi sequences at their centres might form a distinct
subset of the corresponding total set of one kb windows.

This characteristic, together with a preference for
mRNA-synonymous regions of DNA, could also indicate whether Chi or Chi-like sequences in
other genomes are likely to be functional. Absence of the appropriate Chi function "password" targets an exogenous DNA for exonucleolytic
degradation by the host cell RecBCD homolog (Kuzminov et al., 1994), thus militating
against interspecies recombination, while permitting intraspecies recombination.
Disruption of such interspecies barriers facilitates recombination (Rayssiguier et al.,
1989). Just as a general (C+G)% similarity between the DNAs of members of
a species may be necessary for the initiation of homologous recombination (Forsdyke,
1996), so exogenous DNA must have sufficient Chi sequence similarity with the DNA of host
bacteria. However, the selection pressure promoting differentiation of sequences mediating
the Chi function would be unlikely to be sustained between species occupying different
ecological niches. This might explain why E. coli and H. influenzae have
similar Chi sequences (Sourice et al., 1998).

To examine these issues, we here present a study of

(1) the frequency of Chi sequences in bacterial genomes,

(2) their orientation with respect to transcription, and

(3) their correspondence with a distinct subset of sequence windows of characteristic
base composition.

Since Chi-dependent recombination could be involved in the repair of DNA strand breaks,
the study included bacteria which both occupy ecological niches where recombination with
E. coli would be unlikely (hence no selection pressure against the canonical Chi sequence), and
survive under conditions where DNA strand breakage is increased (high temperature and
radiation).

Sequence information refers to the "top"
strand as designated in the GenBank record. Unedited segments of the Deinococcus
radiodurans genome were obtained in 1997 from The Institute for Genomic Research,
Rockville, MD. Five of the longest segments, which had been designated gdr_5 (77466 nt),
gdr_19 (68159 nt), gdr_23 (50515 nt), gdr_31 (60954 nt), and gdr_34 (61120 nt), were
presumed to originate from disparate parts of the genome, and hence were considered likely
to be representative of the genome. Putative ORFs were determined using the Glimmer
program.

Chi sequences were located using the Findpatterns program of the
Genetics Computer Group, Madison, WI. Since it is established experimentally that H.
influenzae uses both the canonical and degenerate "Chi-like" sequences
(Sourice et al., 1998), we sought such sequences (GNTGGTGG) and their top
strand complement (CCACCANC) in this organism. In all other cases, we
sought the canonical Chi sequence in the top strand, with Chis in the bottom strand being
detected as the complement (CCACCAGC) in the top strand. Chi sequences
were classified according to their location in leftward or rightward-transcribed ORFs. For
E. coli and H. influenzae locations in non-protein-encoding genes (e.g.
ribosomal RNA) were also sought in the examined segments, but were found only in H.
influenzae.
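The search described above can be sketched in a few lines of Python. This is an illustrative stand-in for the Findpatterns program, not a reproduction of it: the function names `reverse_complement` and `find_sites` are hypothetical, and the sketch assumes that the "complement" quoted in the text is the reverse complement read 5' to 3' in the top strand (which is what CCACCAGC is for GCTGGTGG).

```python
import re

# N matches any base, as in the degenerate pattern GNTGGTGG.
_IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "[ACGT]"}
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(pattern: str) -> str:
    """Return the reverse complement of a pattern, read 5' to 3'."""
    return pattern.translate(_COMPLEMENT)[::-1]


def find_sites(top_strand: str, pattern: str = "GCTGGTGG") -> dict:
    """Return 1-based start positions of `pattern` for each strand.

    Bottom-strand sites are detected as the reverse complement of the
    pattern in the top strand.
    """
    seq = top_strand.upper()
    hits = {}
    for strand, pat in (("top", pattern), ("bottom", reverse_complement(pattern))):
        regex = "".join(_IUPAC[base] for base in pat)
        # A zero-width lookahead reports overlapping occurrences as well.
        hits[strand] = [m.start() + 1 for m in re.finditer(f"(?={regex})", seq)]
    return hits
```

Calling `find_sites(sequence, "GNTGGTGG")` would apply the degenerate search used for H. influenzae; classifying each hit by the ORF it overlaps requires the annotation and is not shown.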

For each organism, the frequency of Chi sequences (Chi sequences/kb)
expected if base order were random was calculated from the base composition of the
combined genomic segments under study; this was the product of each base's
statistical chance of occurrence in each Chi octomer, multiplied by the number of possible
independent octomers in 1 kb (993). Thus, if P(A)=P(T)=0.2
and P(C)=P(G)=0.3, then the expected
frequency for GCTGGTGG and CCACCAGC in a sequence is
$[(0.3)^6(0.2)^2 + (0.3)^6(0.2)^2] \times 993 = 0.058$.
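The worked example can be checked with a short calculation. The sketch below is a minimal illustration of the stated rule (product of base probabilities, summed over both octomers, times 993 positions per kb); the function names are hypothetical and the base frequencies are those of the example, not of any real genome.

```python
from math import prod


def octomer_probability(octomer: str, base_freq: dict) -> float:
    """Chance of the octomer at one position if base order were random."""
    return prod(base_freq[base] for base in octomer)


def expected_per_kb(octomers, base_freq, window: int = 1000) -> float:
    """Expected occurrences per window, summed over the given octomers."""
    positions = window - len(octomers[0]) + 1  # 993 for an 8-mer in 1 kb
    return positions * sum(octomer_probability(o, base_freq) for o in octomers)


# Base frequencies from the example in the text.
freq = {"A": 0.2, "T": 0.2, "C": 0.3, "G": 0.3}
print(round(expected_per_kb(["GCTGGTGG", "CCACCAGC"], freq), 3))  # 0.058
```

Dividing an observed frequency per kb by this value gives the observed/expected ratio used in the Results.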

Chargaff's first parity rule for duplex DNA (%A
= %T; %C = %G) applies, to a close
approximation, to single stranded DNA (Chargaff's second parity rule). Deviations from
parity are referred to as "Chargaff differences",
which often depend on the "purine loading" of mRNA
synonymous strands (providing an explanation for Szybalski's transcription direction rule;
Bell and Forsdyke, 1999b). The base-composition of successive 1 kb windows, moved in steps
of 0.1 kb, was assessed as described by Dang et al. (1998). Chargaff differences were
calculated as (A-T)/W and (C-G)/S
and expressed as percentages. Here, A, T, C
and G refer to the frequency of the corresponding base in a window. The
direction of subtraction (A-T or T-A) is determined
alphabetically. W is the sum of the W base frequencies (A+T)
and S is the sum of the S base frequencies (C+G).
A window of 1 kb was chosen as the size which optimizes the difference between the
Chargaff differences of a natural and the corresponding shuffled sequence (Bell and
Forsdyke, 1999a). A 1 kb window selected by virtue of containing a Chi octomer near its
centre would receive a contribution to the Chargaff difference from the octomer by virtue
of the excess of Gs over Cs (4) and of Ts
over As (2). Thus, for an organism of 50% (C+G), the G
excess would contribute [4/500] × 100 = 0.8% to the Chargaff difference percentage value.
Differences of this order would only marginally have affected results due to compositional
differences in the entire 1 kb "island" surrounding
a Chi sequence.
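As a minimal sketch of the window calculation, assuming the definitions given above ((A-T)/W and (C-G)/S as percentages, 1 kb windows moved in 0.1 kb steps), the following Python computes Chargaff differences along a sequence. The function names are hypothetical, and this is not the program of Dang et al. (1998).

```python
def chargaff_differences(window: str):
    """Return ((A-T)/W, (C-G)/S) for one window, as percentages."""
    w = window.upper()
    a, t, c, g = (w.count(base) for base in "ATCG")
    at_diff = 100.0 * (a - t) / (a + t) if a + t else 0.0
    cg_diff = 100.0 * (c - g) / (c + g) if c + g else 0.0
    return at_diff, cg_diff


def sliding_windows(seq: str, size: int = 1000, step: int = 100):
    """Yield (1-based start, Chargaff differences) for successive windows."""
    for start in range(0, len(seq) - size + 1, step):
        yield start + 1, chargaff_differences(seq[start:start + size])


# A 1 kb window at 50% (C+G) in which G exceeds C by 4, as contributed by
# one Chi octomer: the S-base difference is -0.8%.
example = "A" * 250 + "T" * 250 + "C" * 248 + "G" * 252
print(chargaff_differences(example))
```

With the alphabetical direction of subtraction, an excess of G gives a negative (C-G)/S value, which matches the sign convention of the tabulated values.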

Due to base composition alone, the frequency of
the G-rich Chi sequence would be expected to increase as genome (C+G)%
increases. However, whatever the (C+G)%, in most bacteria examined the
observed frequency of the canonical Chi sequence is greater than expected (Fig. 1;
exceptions A. aeolicus, B. subtilis, T. pallidum). The points corresponding
to observed and expected Chi frequencies fit simple exponential curves, with the prominent
exception of E. coli. However, the observed/expected frequency ratio for E. coli
(6.51) is of the same order as that of most other bacteria, and is slightly less than that
of the thermophile M. jannaschii (8.20). For this organism together with four other
thermophiles the mean ratio is 3.20 ± 1.28; the mean ratio of
the eleven mesophilic bacteria studied is 3.32 ± 0.52.

Fig. 1. Higher than expected frequency of canonical Chi sequences in many bacterial genomes. Points for
observed frequency (red circles) and for the frequency expected if base order were random
(green circles) were reiteratively fitted to corresponding least-squares curves to arrive
at the formulae:

Organisms corresponding to each point are identified, with ratios of observed to
expected frequencies in parentheses. Segments examined are as in Table 1. The standard
error of estimation for the curve fitted to the observed frequencies was 4.17, which gives
a standardized residual for the E. coli outlier of >3.0 (indicating that its
outlier status is statistically highly significant). For H. influenzae, only data
for the occurrence of canonical Chi sequences are shown.

In the case of H. influenzae, Chi and Chi-like
sequences are likely to be functional (Sourice et al., 1998). This was supported by the
demonstration that, as in E. coli, the Chi and Chi-like sequences of H.
influenzae tend to locate to ORFs, where they are found in mRNA-synonymous strands
rather than in template-strands (Bell et al., 1998). Accordingly, we examined whether this
preference for mRNA-synonymous strands also applied to the Chi sequences of other bacteria
(Table 1).

Table 1. Distribution of top and bottom strand Chi sequences among leftward- and rightward-transcribed ORFs

Chi sequences were located either in the top strand or in the bottom strand, the former being so designated by the GenBank record.

c Chi sequences were scored as associated with either leftward-transcribed ORFs or rightward-transcribed ORFs if their centres overlapped the ORFs. Otherwise they were scored as being in non-ORF DNA. In the case of H. influenzae some Chi-like sequences were also included.

d The ratio of occurrence in the mRNA-synonymous strand of DNA relative to occurrence in the mRNA-template strand. The significance of this ratio for each organism was obtained through χ² analyses (2 × 2 contingency table) with the probability of each Chi being randomly located in either the mRNA-synonymous or mRNA-template strand being 0.5.

e To the values for 735 kb recorded here may be added the values for the E. coli segment corresponding to nucleotides 1-200000 from Bell et al. (1998). This provides a synonymous strand/template strand distribution ratio of 3.34 (157:47).

In all but one instance (A. aeolicus), Chi sequences in
mRNA-synonymous strands (the sum of occurrences in bottom strands when transcription is
leftward, plus occurrences in top strands when transcription is rightward) occur more
frequently than in mRNA template strands (the sum of occurrences in top strands when
transcription is leftward, plus occurrences in bottom strands when transcription is
rightward). Thus for M. jannaschii there are 14 + 18 = 32 Chi sequences in the
mRNA-synonymous strand, and 7 + 5 = 12 Chi sequences in the mRNA-template strand. The
ratio of these (32:12) is 2.67. The significance of this asymmetrical distribution (χ² analysis with a 2 × 2
contingency table) is high (P = 0.003).
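The M. jannaschii figures can be reproduced with the sketch below. It assumes a χ² test with one degree of freedom against an expected 1:1 split between the two strands; this assumption yields the quoted P value, but the exact procedure used for the table is not spelled out in the text, so the code is illustrative only. The function names are hypothetical, and the assignment of the four counts to strand and direction follows the definitions in the paragraph above.

```python
from math import erfc, sqrt


def strand_role(chi_strand: str, orf_direction: str) -> str:
    """Classify a Chi as lying in the mRNA-synonymous or template strand."""
    synonymous = {("top", "right"), ("bottom", "left")}
    return "synonymous" if (chi_strand, orf_direction) in synonymous else "template"


def chi_square_1df(observed, expected):
    """Return (chi-square, P) for two categories, one degree of freedom."""
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With one degree of freedom the upper tail is erfc(sqrt(stat / 2)).
    return stat, erfc(sqrt(stat / 2))


# M. jannaschii counts as quoted in the text: (Chi strand, ORF direction).
counts = {("bottom", "left"): 14, ("top", "right"): 18,
          ("top", "left"): 7, ("bottom", "right"): 5}

totals = {"synonymous": 0, "template": 0}
for (strand, direction), n in counts.items():
    totals[strand_role(strand, direction)] += n

half = sum(totals.values()) / 2  # expected count per strand at probability 0.5
stat, p = chi_square_1df([totals["synonymous"], totals["template"]], [half, half])
print(totals, round(stat, 2), round(p, 3))
```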

However, only the two thermophiles of lowest (C+G)%
show significant bias (M. jannaschii and P. horikoshii). On the other hand,
whatever the (C+G)%, most mesophiles show significant bias (exceptions C.
trachomatis and M. genitalium). There are relatively few canonical Chi
sequences in B. subtilis (an unrelated sequence has Chi-like activity; Chedin et
al., 1998), but the asymmetrical distribution of those that occur is of marginal
significance (P = 0.059).

Most organisms obey Szybalski's transcription direction rule (Bell
and Forsdyke, 1999b). This is manifest as an asymmetry of base compositions between the
mRNA-synonymous and template strands of DNA. Thus, the asymmetry in distribution of Chi
sequences between the two strands might merely reflect this base-compositional asymmetry.
However, Tracy et al. (1997b) proposed that Chi sequences occupy distinct GT-rich
islands of the size of an average ORF (about 1 kb).

This was examined by seeking differences in base composition,
assessed as "Chargaff differences". In "quadrant plots" Chargaff differences for the W
bases (A and T) are plotted against Chargaff differences
for the S bases (C and G). Following
Szybalski's transcription direction rule, one kb sequence windows in the "top" strand whose centres overlap leftward-transcribed ORFs
are usually seen to be preferentially enriched in one or more pyrimidines (C,
T), whereas one kb sequence windows in the "top"
strand whose centres overlap rightward-transcribed ORFs are usually seen to be
preferentially enriched in one or more purines (A, G).
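A quadrant assignment of this kind reduces to reading the signs of the two Chargaff differences. The following sketch assumes the alphabetical direction of subtraction defined in the Methods, so that a positive W value means A>T and a positive S value means C>G; the function name is hypothetical.

```python
def enriched_bases(at_diff: float, cg_diff: float) -> str:
    """Name the two bases in which a window is enriched, e.g. 'GT' or 'AC'.

    at_diff is (A-T)/W and cg_diff is (C-G)/S, both as percentages.
    Windows lying exactly on an axis are reported with '-' for that axis.
    """
    w_base = "A" if at_diff > 0 else "T" if at_diff < 0 else "-"
    s_base = "C" if cg_diff > 0 else "G" if cg_diff < 0 else "-"
    return "".join(sorted(w_base + s_base))


# Purine-loaded windows (A and G) fall in the 'AG' quadrant,
# pyrimidine-loaded windows (C and T) in the 'CT' quadrant.
print(enriched_bases(2.0, -3.5))   # AG
print(enriched_bases(-2.0, 3.5))   # CT
```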

Fig. 2 shows quadrant plots for three 245 kb segments from different
parts of the circular E. coli genome. This genome has base compositional
asymmetries which relate to the origin of replication. The first segment (nt
310941-555940) is from the first part of the genome (located to the right of the origin of
replication) where the top-strand tends to be both G-rich and Chi-rich,
and rightward-transcribed ORFs predominate. The second segment (nt 1555401-1800400) begins
in the region of the terminus of replication, where top-strand G-richness
switches to C-richness, leftward-transcribed ORFs become more evident,
and Chi sequences begin to predominate in the bottom strand. The third segment (nt
3422021-3667020) is at the other end of this C-rich region, and lies to
the left of the origin of replication.

Fig. 2. Quadrant analysis of base compositions (expressed as Chargaff differences) in 1 kb windows from the top strand of three segments from the E. coli genome, either combined (a, b) or independent (c-h).

The centres of windows overlap either leftward-transcribed ORFs (a, c, e, g), or
rightward-transcribed ORFs (b, d, f, h). Each quadrant corresponds to windows enriched for
two particular bases (indicated at the corners). Each point corresponds to a window.

Listed in each figure are the slopes (Sl) of the least-squares
regression lines (shown as a solid diagonal line), the adjusted squares of the
correlation coefficients (r²), and the probabilities (P) that slopes are
not significantly different from zero. Only (c) and (f) had significant slopes (P =
0.023 and 0.002, respectively) when the regression analysis used every tenth window to
avoid overlapping windows.

For all three segments combined, a clear preference of leftward ORFs for C
and of rightward ORFs for G is apparent (Figs. 2a,b). Although there is a
wide scatter of points, linear regression plots have a significant slope, indicating
correlations between A-enrichment and C-enrichment, and
between G-enrichment and T-enrichment. These
correlations are evident in the first and second segments (Figs. 2c-f) but not in the
third segment (Figs. 2g,h). Likewise, the G-richness of windows
corresponding to rightward-transcribed ORFs is most evident in the first segment (lying to
the right of the origin of replication), whereas the C-richness of
windows corresponding to leftward-transcribed ORFs is most evident in the third segment
(lying to the left of the origin of replication).
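Because 1 kb windows moved in 0.1 kb steps overlap, successive points in a quadrant plot are not independent; the Fig. 2 legend notes that regressions were also run on every tenth window. A minimal sketch of that subsampling and of an ordinary least-squares slope follows. The function names are hypothetical, and which Chargaff difference is taken as the x variable is an assumption left to the caller.

```python
def non_overlapping(points, window: int = 1000, step: int = 100):
    """Keep every (window // step)-th point so retained windows do not overlap."""
    return points[::window // step]


def least_squares_slope(xs, ys) -> float:
    """Ordinary least-squares slope of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy data lying exactly on y = 2x + 1, standing in for (W, S) differences.
points = [(x, 2 * x + 1) for x in range(50)]
subset = non_overlapping(points)
xs, ys = zip(*subset)
print(len(subset), least_squares_slope(xs, ys))
```

The significance test on the slope is omitted here; it would need the residual variance and a t distribution.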

Among these 1 kb windows are some with Chi
sequences near their centres ("Chi windows"). Fig. 3 shows that the majority
of Chi windows in the top strand (52) correspond with rightward-transcribed ORFs and tend
to be GT-rich, whereas the majority of Chi windows in the bottom strand
(61) correspond with leftward-transcribed ORFs and tend to be AC-rich.
The minorities of Chi windows in the
top strand corresponding to leftward-transcribed ORFs (22), and of Chi windows in the
bottom strand corresponding to rightward-transcribed ORFs (17), show no obvious
compositional bias. The tendency for GT-richness when corresponding with
rightward-transcribed ORFs is particularly apparent in the first segment (Fig. 4b),
whereas the tendency for AC-richness when corresponding with
leftward-transcribed ORFs is particularly apparent in the third segment (Fig. 4e).

That Chi windows corresponding to a particular strand and transcriptional orientation
are a distinct subset of the total set of windows in the same strand and with the same
transcriptional orientation is shown in Table 2. For example, for the three combined E. coli
segments, the 52 top-strand Chi windows corresponding to rightward transcription
are enriched in T (T>A; Chargaff difference =
-5.18 ± 1.64, which is significantly different from zero; P < 0.01). The corresponding total set of 3186 windows shows much less T
enrichment (Chargaff difference = -0.74 ± 0.19, which is also
significantly different from zero; P < 0.001). The two sets are significantly
different from each other (P = 0.0028; t-test). Similarly, the 61 bottom-strand Chi
windows corresponding to leftward transcription are enriched in A (A>T;
Chargaff difference = 3.80 ± 1.25, which is significantly
different from zero; P < 0.01). The corresponding total set of 3319 windows does
not have a significant A enrichment (Chargaff difference = 0.14 ± 0.15). Again, the two sets are significantly different from each
other (P = 0.001).
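The comparison of the two sets can be approximated from the published means, standard errors and window counts alone. The sketch below assumes a pooled-variance (Student) t-test and, because the degrees of freedom run into the thousands, uses the normal approximation for the two-tailed P value. The text says only "t-test", so this is an assumption; with the rounded inputs it gives P of about 0.003 and 0.001, close to the reported 0.0028 and 0.001. The function name is hypothetical.

```python
from math import erfc, sqrt


def pooled_t_from_sem(mean1, sem1, n1, mean2, sem2, n2):
    """Return (t, approximate two-tailed P) from means, SEMs and counts."""
    var1 = sem1 ** 2 * n1  # sample variance recovered from the SEM
    var2 = sem2 ** 2 * n2
    pooled = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    t = (mean1 - mean2) / sqrt(pooled * (1 / n1 + 1 / n2))
    # Normal approximation; adequate only when n1 + n2 - 2 is large.
    return t, erfc(abs(t) / sqrt(2))


# Top-strand Chi windows (rightward ORFs) versus all rightward windows.
t_top, p_top = pooled_t_from_sem(-5.18, 1.64, 52, -0.74, 0.19, 3186)
# Bottom-strand Chi windows (leftward ORFs) versus all leftward windows.
t_bot, p_bot = pooled_t_from_sem(3.80, 1.25, 61, 0.14, 0.15, 3319)
print(round(p_top, 4), round(p_bot, 4))
```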

Table 2. Comparison of Chargaff differences (a) of Chi-containing windows with those of all windows in three E. coli segments, combined and separate

Segment         Chi strand  Chargaff  Chi-containing windows                                          All windows
                            diff. (%) Transcribed to left             Transcribed to right            Transcribed to left       Transcribed to right
                                      Excess  Mean±SEM    N   P       Excess  Mean±SEM    N   P       Excess  Mean±SEM    N     Excess  Mean±SEM    N

1+2+3 combined  Top         (A-T)/W   T>A     -3.77±1.51  22  0.040   T>A     -5.18±1.64  52  0.0028  A>T     0.14±0.15   3319  T>A     -0.74±0.19  3186
                            (C-G)/S   C>G     1.72±1.32   22  0.001   G>C     -6.64±0.82  52  0.0002  C>G     5.64±0.10   3319  G>C     -3.49±0.11  3186
                Bottom      (A-T)/W   A>T     3.80±1.25   61  0.001   T>A     -3.61±3.03  17  0.266
                            (C-G)/S   C>G     7.29±0.66   61  0.020   C>G     0.02±1.36   17  0.017

1               Top         (A-T)/W   T>A     -4.94±3.21  9   0.072   T>A     -6.9±1.83   35  0.0004  A>T     1.02±0.34   849   T>A     -0.56±0.29  1343
                            (C-G)/S   C>G     1.71±1.47   9   0.606   G>C     -8.53±0.70  35  0.083   C>G     2.60±0.18   849   G>C     -6.97±0.14  1343
                Bottom      (A-T)/W   T>A     -1.15±2.53  8   0.536   T>A     -2.57±10.1  4   0.704
                            (C-G)/S   C>G     4.64±2.44   8   0.268   G>C     -5.60±1.48  4   0.605

2               Top         (A-T)/W   T>A     -3.87±1.52  10  0.155   T>A     -7.30±3.36  7   0.238   T>A     -0.47±0.21  1234  T>A     -2.26±0.37  909
                            (C-G)/S   C>G     1.58±2.47   10  0.010   G>C     -1.77±0.89  7   0.802   C>G     6.27±0.16   1234  G>C     -1.27±0.18  909
                Bottom (d)  (A-T)/W   A>T     2.57±1.42   18  0.089   T>A     -7.22±4.49  7   0.246
                            (C-G)/S   C>G     6.92±1.36   18  0.629   C>G     3.56±1.13   7   0.017

3               Top         (A-T)/W   A>T     0.03±2.78   3   0.983   A>T     2.34±4.67   10  0.559   A>T     0.15±0.27   1236  A>T     0.50±0.32   934
                            (C-G)/S   C>G     2.24±3.97   3   0.074   G>C     -3.39±2.90  10  0.101   C>G     7.10±0.13   1236  G>C     -0.63±0.17  934
                Bottom      (A-T)/W   A>T     5.57±1.92   35  0.001   T>A     -0.08±2.91  6   0.886
                            (C-G)/S   C>G     8.08±0.71   35  0.222   G>C     -0.35±2.62  6   0.896

Legend to Table 2

(a) Chargaff differences (%) are presented together with the standard error of the mean. All data refer to the base composition of the top strand.

(b) N = number of windows.

(c) Probabilities (P) that mean base compositions (assessed as Chargaff differences) of 1 kb windows with a Chi sequence at their centres do not differ significantly from those of the corresponding total set of windows (t-test). For further details please see text.

(d) One Chi sequence omitted from the analysis was in an intergenic region between two leftward-transcribed ORFs.

These trends are also evident in the individual
segments. For the first segment, the T-excess corresponding to top-strand
rightward-transcribed ORFs is most significant (P = 0.0004). For the second
segment, the C-excess corresponding to top-strand leftward-transcribed
ORFs is most significant (P = 0.010). For the third segment, the A
excess corresponding to bottom-strand leftward-transcribed ORFs is most significant (P
= 0.001).

Thus Chi sequences in E. coli were observed to occupy a
distinct subset of 1 kb windows, usually with GT-rich top-strands in the
case of top-strand Chis, and with AC-rich top-strands in the case of
bottom-strand Chis. This criterion was used to further characterize Chi or Chi-like
sequences in other organisms. Whereas E. coli obeys Szybalski's transcription
direction rule mainly with respect to the S bases, thermophilic bacteria
obey with respect to both the S bases (C and G)
and the W bases (Lao and Forsdyke, 2000). In the case of the
thermophile M. jannaschii, AC-rich windows are relatively rare in
leftward-transcribed regions (see top right quadrant of Fig. 5a), and GT-rich
windows are relatively rare in rightward-transcribed regions (see bottom left quadrant of
Fig. 5b). These windows are not selectively occupied by Chi sequences (Figs. 5c, d). The
most significant difference (P = 0.13) between Chi windows (corresponding to a
particular strand and transcription direction) and the total set of windows (corresponding
to the same strand and transcription direction), was a slight decrease in the frequency of
A bases in the top strand for rightward-transcribed ORFs (Chargaff
differences 9.92 ± 0.09 for all 8202 rightward windows and 7.01 ± 2.41 for the corresponding 18 Chi windows). Thus, in this organism
there is no clear evidence that Chi windows form a distinct subset.

Fig. 5. Quadrant analysis of the entire M. jannaschii
genome.

(a) All 7494 windows overlapping leftward-transcribed ORFs.

(b) All 8202 windows overlapping rightward-transcribed ORFs.

(c) The 21 windows corresponding to bottom-strand Chi sequences,
with those overlapping leftward-transcribed ORFs (14) shown as large open
circles, and those overlapping rightward-transcribed ORFs shown as small open circles.

(d) The 27 windows corresponding to top-strand Chi sequences,
with those overlapping rightward-transcribed ORFs (18) shown as large
open circles, and those overlapping leftward-transcribed ORFs shown as small open circles.

Regression lines are for points corresponding to leftward-transcribed ORFs (a,
c), and to rightward-transcribed ORFs (b, d). Y0 is the value
of the intercept on the Y axis. Other details are as in previous figures.
In (a) and (b) slopes remained significantly different from zero when every tenth point
was employed for the regression analysis to avoid overlapping windows.

In initial studies of short E. coli segments, Burland
et al. (1993) noted that the distribution of Chi sequences correlated independently with
the directions both of replication and transcription. However, in H. influenzae
Sourice et al. (1998) found only weak evidence for a correlation with replication
direction. From studies with single short segments of the genomes of E. coli (nt
1-200000) and H. influenzae (nt 1-350000), the distribution of Chi or Chi-like
sequences appeared to correlate better with the direction of transcription than with
direction of replication (Bell et al., 1998). Table 3 shows an extension of the latter
work to three segments from each organism, the first and third of which are located on
either side of the origin of replication, and the second of which is located in the
regions either of the origin of replication (H. influenzae) or of the terminus of
replication (E. coli). In view of uncertainty regarding locations of origins and
terminations of replication, data for Chi distribution with respect to replication
direction are omitted in the case of the second segments. For the first and third
segments it is apparent for both organisms that correlation with direction of
transcription is at least as good as the correlation with direction of replication.
Indeed, for the third E. coli segment, which lies to the left of the origin of
replication, the ratio of occurrences in mRNA-synonymous strands relative to template
strands (5.0) exceeds the ratio of occurrences in leading strands of replicating DNA
relative to lagging strands (3.15).

Uncertainties in the precise locations of the origins or
termini of replication make it difficult to assign Chi sequences in the second
segments as corresponding to the leading or lagging strands of replicating DNA. However,
assuming the origin of replication to be at nt 603000 in the circular H. influenzae genome, segment 2 includes the origin so that
in its left part the bottom strand is identical in sequence to the leading strand, and in
the right part the top strand is identical in sequence to the leading strand. On this
basis, the leading/lagging strand ratio of the distribution of Chi sequences in segment 2
is 1.43 (33:23). In E. coli there are seven
experimentally characterized "Ter" (termination)
sites spread over 25% of the genome. The "Dif" site
(1588774-1588801) is where recA-independent recombination may occur to resolve the two
daughter molecules after replication.
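The leading/lagging assignment described for H. influenzae segment 2 can be sketched as follows. It assumes, as in the text, an origin at nt 603000 lying inside the segment, and it applies only to a segment that spans the origin and contains no terminus. The function names and the three example sites are hypothetical.

```python
def leading_strand(position: int, origin: int) -> str:
    """Strand identical in sequence to the leading strand at `position`.

    Left of the origin this is the bottom strand; right of it, the top strand.
    """
    return "bottom" if position < origin else "top"


def leading_lagging_counts(sites, origin: int):
    """Count Chi sites in leading versus lagging strands.

    `sites` is an iterable of (position, strand) pairs, strand being
    'top' or 'bottom'.
    """
    sites = list(sites)
    leading = sum(1 for pos, strand in sites if strand == leading_strand(pos, origin))
    return leading, len(sites) - leading


# Hypothetical sites near an origin assumed to lie at nt 603000.
example_sites = [(602000, "bottom"), (602500, "top"), (604000, "top")]
print(leading_lagging_counts(example_sites, origin=603000))  # (2, 1)
```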

It should be noted that Chi distribution among leftward or
rightward ORFs is not merely explained by the relative availabilities of these two classes
of ORFs (Table 3). For example, in the first segment of H. influenzae there are 117
leftward ORFs and 103 rightward ORFs, yet Chi sequences in the top strand predominate in
rightward ORFs (2:9; P = 0.02), and the predominance in leftward ORFs of Chi
sequences in the bottom strand is about double that expected from ORF distribution (19:9
as opposed to 117:103; P = 0.12).
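The two P values quoted here are consistent with a one-degree-of-freedom χ² test in which the expected counts are proportional to the numbers of leftward and rightward ORFs (117:103). That reading is an inference from the numbers, not a procedure stated in the text, and the function name below is hypothetical.

```python
from math import erfc, sqrt


def chi_square_vs_availability(observed, availability):
    """Return (chi-square, P) with expectations proportional to availability.

    `observed` and `availability` are (leftward, rightward) pairs.
    """
    scale = sum(observed) / sum(availability)
    expected = [a * scale for a in availability]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With one degree of freedom the upper tail is erfc(sqrt(stat / 2)).
    return stat, erfc(sqrt(stat / 2))


orfs = (117, 103)  # leftward and rightward ORFs in H. influenzae segment 1
_, p_top_strand = chi_square_vs_availability((2, 9), orfs)
_, p_bottom_strand = chi_square_vs_availability((19, 9), orfs)
print(round(p_top_strand, 2), round(p_bottom_strand, 2))  # 0.02 0.12
```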

A new view, arising from studies of homologous recombination in
eukaryotes, is that recombination follows an initial homology search, which may involve
exploratory "kissing" interactions (Eguchi et al.,
1991) between the tips of single-strand DNA loops extruded from intact DNA duplexes
(Kleckner and Weiner, 1993; Forsdyke, 1996). Only when such homology has been "tested" through formation of paranemic joints would there be a
commitment to recombination (strand-breakage). The initial homology search might involve
special regions of DNA, recognized as "hot-spots"
or "islands of preferred DNA pairing" (Tracy et
al., 1997b). Consistent with this, single-strand loops are gaining recognition as
important intermediates in the initiation phase of recombination in bacteria (Anderson and
Kowalczykowski, 1998). Bacterial RecA can connect two intact DNA duplexes, which should
facilitate homology recognition (Leger et al., 1998; Zaitsev and Kowalczykowski, 1999),
and a human RecA homolog has been shown to form homologous joints in the absence of net
strand exchange (Gupta et al., 1999).

In homologous recombination, an "island
of preferred pairing" might suffice to guide certain enzymes of recombination
to a region where subsequent strand
breakage would occur. However, enzymes such as RecA are also involved in the repair of
double-strand breaks. In this case, the initial break might occur at some distance from an
"island", so that enzymes recognizing the break
(e.g. the RecBCD complex) would have to translocate along the DNA towards an "island", where RecA and other enzymes would then be locally
recruited.

The "recombination island"
proposal emerged at a time when Chargaff difference analysis was revealing genomes as
consisting of distinct gene-sized compositional domains characterized by base compositions
often in accord with Szybalski's transcription direction rule. However, in one organism (Drosophila
melanogaster) which possesses a genome sectored into regions where genes are compactly
arranged prokaryote-style and regions where genes are interrupted by introns
mammalian-style, the compositional domains were found in both
sectors (Dang et al., 1998). Indeed, introns and intergenic DNA of mammals show
compositional domains (patterns of Chargaff differences) similar to those in
protein-coding regions. This suggested that these domains might be related to some
function other than the genic function, perhaps recombination (Bell and Forsdyke,
1999a,b).

Both the bacterial RecA protein and its eukaryotic homologs prefer GT-rich
sequences (Tracy et al., 1997a). Furthermore, GT-rich loci are
recombinogenic in various organisms (e.g. Jeffreys et al., 1985). Thus, Tracy et al.
(1997b) proposed that Chi sequences initially evolved in GT-rich domains
which already possessed a low level of intrinsic hot-spot activity. In support of
this, they found sequence regularities extending 400 bases on either side of Chi
sequences, suggesting a fundamental GT-rich "recombination
island" size of about 1 kb, which is about the size both of the average ORF,
and of sequence windows at which deviations from Chargaff's second parity rule are maximum
when comparing natural with the corresponding shuffled sequences (Bell and Forsdyke,
1999a).

Accordingly, it was proposed that early genomes evolved as a
mosaic of regions of distinct base composition to facilitate, through loop-loop kissing
interactions, the interstrand and intrastrand "accounting"
necessary for recombination repair (Bell and Forsdyke, 1999b). From studies of "Chargaff differences" it was suggested that Chi was critically
orientated with respect to transcription direction (Bell et al., 1998), and that the
"recombination islands" of Tracy et al. (1997b) had
become "recombination/transcription" islands in
regions where transcriptional activity had evolved.

The adaptive value of recombination was likely to have become
evident in the early "RNA world" prior to the
development of protein-synthesis (Bell and Forsdyke, 1999b). Thus, early genomes may have
contained recombinational domains of varying base compositions, and the transcriptional
function may later have been imposed upon
this recombinational mosaic, prior to
the evolution of protein-encoding potential. While we do not know at what stage Chi
functions evolved, it seems unlikely that the small Chi octomer was itself sufficient for
early recombinational events. More likely, Chi-encoding potential and protein-encoding
potential had to adapt independently to the preexisting
compositional bias of recombination domains (as well as to the species-specific (C+G)%
bias; Forsdyke, 1996).

In this light we should reexamine the work of two groups who
recently criticised the "recombinational island"
suggestion of Tracy et al. (1997b), on the grounds that the frequency and distribution of
Chi can be "accounted for" (Colbert et al., 1998), or "explained" (Biaudet et al., 1998) by a preexisting codon usage
bias. The "islands of preferred DNA pairing" are
held to be merely "oceans of ORFs" (Colbert et al.,
1998). However, on the basis of an initial study of a 136 kb segment of the E. coli
genome, Burland et al. (1993) argued against an influence of codon usage. We would further
argue that codon usage no more explains Chi frequency and distribution than the
disappearance of storks from the rooftops in Holland in the war years explained the
concomitant decrease in the human birth rate. Just as these events were both likely to be
due to the disturbance of warfare, so a more fundamental cause, the compositional bias of
preexisting recombinational domains, is likely to have created a genomic environment to
which both Chi-encoding potential and
protein-encoding potential had to adapt.

Colbert et al. (1998) further argue that "Except
for the sequence of Chi and its orientation-dependence, no deterministic features of Chi
sites are apparent". However, in addition to strand orientation (top or bottom
strand), Chi sequences selectively appear in leftward-transcribed ORFs in the bottom
strand and in rightward-transcribed ORFs in the top strand (Burland et al., 1993; Bell et
al., 1998). This provides a novel deterministic feature of Chi sites, which is the basis
of the present analysis. While we agree that in bacteria "recombination
islands are equivalent to ORFs" (Colbert et al., 1998), Chargaff difference
analysis using 1 kb window sizes provides evidence, at least for E. coli, that Chi
windows may correspond to a distinct subset of ORFs (Table 2). To this extent, the "island" proposal is supported. We also note that four Chi-like
sequences (GTTGGTGG) in the H. influenzae segments which we
examined, occur in 16S RNA genes, where there is no question of an involvement of codons.
Three of these four Chi-like sequences occur in the top strand where, as expected,
transcription is to the right, and the other occurs in the bottom strand where, as
expected, transcription is to the left.
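
The window analysis referred to above can be illustrated with a minimal Python sketch. It assumes that a Chargaff difference is expressed as (G-C)/(G+C) and (T-A)/(T+A) for one strand; this normalisation, the function names and the handling of window edges are illustrative assumptions and are not taken from the original analysis. Only top-strand Chi sequences are considered here; bottom-strand occurrences could be treated by applying the same functions to the reverse complement.

```python
CHI = "GCTGGTGG"  # canonical E. coli Chi sequence, 5' to 3'

def chargaff_differences(window):
    """Return ((G-C)/(G+C), (T-A)/(T+A)) for one strand of a window.

    The normalisation is an assumption made for illustration.
    """
    w = window.upper()
    g, c, a, t = (w.count(b) for b in "GCAT")
    gc_diff = (g - c) / (g + c) if g + c else 0.0
    ta_diff = (t - a) / (t + a) if t + a else 0.0
    return gc_diff, ta_diff

def chi_centred_windows(top_strand, size=1000):
    """Yield (chi_start, window) for each top-strand Chi that has a
    complete window of `size` bases around its centre."""
    seq = top_strand.upper()
    half = size // 2
    pos = seq.find(CHI)
    while pos != -1:
        centre = pos + len(CHI) // 2
        start = centre - half
        if start >= 0 and start + size <= len(seq):
            yield pos, seq[start:start + size]
        pos = seq.find(CHI, pos + 1)  # step by one to allow overlaps
```

Positive values of both differences in a Chi-centred window would correspond to a strand enriched in both G and T. Comparing their distribution with that of all 1 kb windows is one way to ask whether Chi windows form a distinct subset.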

In this initial study we have limited ourselves to the
canonical Chi sequence, except in the case of H. influenzae Chi-like sequences,
which are likely to be functional (Sourice et al., 1998). Sequences with Chi-like
function, but with little similarity to the canonical Chi sequence, have been identified
in some bacteria (Smith, 1997; Chedin et al., 1998). In these cases, perhaps because their
ecological niches overlap with that of E. coli, there might be a chance of
recombination with E. coli DNA (not-self), so there would have been a selection
pressure militating against a similarity of sequences with Chi function (so assisting the
preservation of species integrity; Forsdyke, 1996). Thus, it is pertinent to ask whether
canonical Chi sequences are more frequent in bacteria other than E. coli,
particularly if they are likely to occupy different niches. If so, one can examine whether
these Chi sequences have deterministic features of Chi other than sequence similarity.
Indeed, in many such cases, Chi sequences do locate preferentially to mRNA-synonymous
strands, as in E. coli (Table 1).
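
A count of Chi sequences by strand relative to ORF orientation, of the kind summarised in Table 1, might be sketched as follows. The coordinate convention (0-based, half-open), the ORF tuples and all function names are hypothetical choices for illustration; the ORF list would come from an annotation or a gene-finding program. A Chi sequence is scored as mRNA-synonymous when it lies in the top strand of a rightward-transcribed ORF or in the bottom strand of a leftward-transcribed ORF.

```python
CHI = "GCTGGTGG"
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def find_chi(top_strand):
    """Return sorted (start, strand) pairs; '+' is the top strand and
    '-' the bottom strand, with start given in top-strand coordinates."""
    seq = top_strand.upper()
    hits = []
    for motif, strand in ((CHI, "+"), (reverse_complement(CHI), "-")):
        pos = seq.find(motif)
        while pos != -1:
            hits.append((pos, strand))
            pos = seq.find(motif, pos + 1)
    return sorted(hits)

def classify_chi(hits, orfs):
    """Count hits by location relative to ORFs.

    orfs: list of (start, end, strand), 0-based half-open, where '+'
    means transcribed to the right and '-' transcribed to the left.
    A hit inside overlapping ORFs of both orientations is scored as
    synonymous (an arbitrary choice made for this sketch).
    """
    counts = {"synonymous": 0, "template": 0, "intergenic": 0}
    for pos, strand in hits:
        enclosing = [o for o in orfs
                     if o[0] <= pos and pos + len(CHI) <= o[1]]
        if not enclosing:
            counts["intergenic"] += 1
        elif any(o[2] == strand for o in enclosing):
            counts["synonymous"] += 1
        else:
            counts["template"] += 1
    return counts
```

A marked excess of "synonymous" over "template" counts would indicate the preferential location discussed in the text.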

The fact that Chi abundance in E. coli appears as a
distinct outlier when compared with that of other bacteria (Fig. 1) suggests that the
canonical Chi is particular to E. coli. However, when expressed as the
observed/expected ratio, Chi frequency in E. coli is of the same order as in most
other organisms, and less than that of the AT-rich thermophile M.
jannaschii. Some bacteria (A. aeolicus, B. subtilis, T. pallidum) had no more
canonical Chi sequences than expected from base composition, and in at least one of these
(B. subtilis) the Chi function could be mediated by an independent sequence (Chedin
et al., 1998).
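
An observed/expected ratio of this kind can be approximated as sketched below. The sketch assumes that the expected count follows from single-base frequencies alone (a zero-order model) and that both strands are scanned; the exact procedure behind Fig. 1 is not restated in this section, so the model and the function names here are assumptions.

```python
import math

CHI = "GCTGGTGG"
CHI_RC = "CCACCAGC"  # reverse complement of CHI (bottom-strand Chi)

def count_overlapping(seq, motif):
    """Count occurrences of motif in seq, allowing overlaps."""
    count, pos = 0, seq.find(motif)
    while pos != -1:
        count += 1
        pos = seq.find(motif, pos + 1)
    return count

def observed_chi(top_strand):
    """Number of Chi sequences found in either strand."""
    seq = top_strand.upper()
    return count_overlapping(seq, CHI) + count_overlapping(seq, CHI_RC)

def expected_chi(top_strand):
    """Expected number of Chi sequences in both strands, assuming
    bases are independent and drawn at their observed frequencies."""
    seq = top_strand.upper()
    total = sum(seq.count(b) for b in "ACGT")
    if total == 0 or len(seq) < len(CHI):
        raise ValueError("sequence too short or contains no A, C, G or T")
    freq = {b: seq.count(b) / total for b in "ACGT"}
    positions = len(seq) - len(CHI) + 1
    p_top = math.prod(freq[b] for b in CHI)
    p_bottom = math.prod(freq[b] for b in CHI_RC)
    return positions * (p_top + p_bottom)

def observed_expected_ratio(top_strand):
    """Observed/expected ratio; NaN when the expectation is zero."""
    expected = expected_chi(top_strand)
    return observed_chi(top_strand) / expected if expected > 0 else float("nan")
```

Under this model a genome of low (C+G)% has a small expected count for the G-rich Chi sequence, so a modest number of observed occurrences can still yield a high ratio.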

Since recombination repair may be important for correction of
DNA damage, and Chi promotes recombination, it was considered possible that the Chi
function might be generally increased in thermophilic and radiation-resistant bacteria.
These generally occupy ecological niches different from that of E. coli, so that
recombination with E. coli DNA would be unlikely. Thus, they might have even more
canonical Chi sequences than E. coli. However, when base composition was taken into
account, D. radiodurans did not have more canonical Chi sequences than E. coli,
although many preferentially located to the mRNA-synonymous strands (Fig. 1; Table 1).
Furthermore, only two of the five thermophilic bacteria examined showed preferential
location of Chi sequences to mRNA-synonymous strands. Thus, in thermophilic bacteria and D.
radiodurans, the Chi function may be achieved by sequences which differ from the
canonical Chi sequence, or the number of these sequences is sufficient, or the Chi
function is unnecessary.

Acknowledgements

We thank J. Gerlach for assistance with computer
configuration, and R. Y. P. Chen, G. A. Hill, E. Y. Joo, and R. G. Rasile for assistance
in analysis of unedited sequences of D. radiodurans obtained from The Institute for
Genomic Research. The Glimmer program was kindly donated by S. Salzberg. The National
Research Council of Canada, Academic Press and Elsevier Science gave permission for the
inclusion of full-text versions of relevant preceding papers at our internet site
(http://post.queensu.ca/~forsdyke/bioinfor.htm).