Summary of the flexible method:
The basic observation is that the distance between the Shine-Dalgarno (SD)
sequence and the Initiation Region (IR: the start codon and the region
around it) is variable. One can, therefore, make a probability distribution
of that distance, as shown above. One can compute the Shannon uncertainty
of any distribution. This uncertainty remains after binding, so it is to be
subtracted from the sum of the other components. Furthermore, the ideas
about individual information apply too, so one can build flexible sequence
walker models. These work very well. The
interesting thing is that one does not need to do any training to get
these models. One starts from proven binding sites and gets the model
directly. In contrast, training methods require that one provide
examples of sequences that do not contain the site. However, such
negative examples are very difficult to obtain in general, so the training
is probably contaminated with weak but functional sites. The information
theory method avoids the problem.
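
As a minimal sketch of the uncertainty term described above, the following
Python computes the Shannon uncertainty of a gap distribution and subtracts
it from the sum of the other components. The gap counts, the component
values, and the function names are hypothetical illustrations, not values or
code from the rbseg12 model, and no small-sample correction is applied.

```python
import math

def shannon_uncertainty(counts):
    """Shannon uncertainty H = -sum(p * log2(p)), in bits, of a distribution
    given as a mapping from outcome to observed count."""
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total)
                for n in counts.values() if n > 0)

def flexible_total(r_sd, r_ir, h_gap):
    """Sum of the SD and IR components minus the gap uncertainty (bits)."""
    return r_sd + r_ir - h_gap

# Hypothetical counts of SD-to-start distances (distance -> number of sites).
gap_counts = {8: 2, 9: 4, 10: 2}
h_gap = shannon_uncertainty(gap_counts)

# Hypothetical component values in bits, chosen only for illustration.
total = flexible_total(r_sd=4.0, r_ir=7.0, h_gap=h_gap)
print(f"gap uncertainty = {h_gap:.2f} bits, total = {total:.2f} bits")
```

A narrow gap distribution gives a small uncertainty and so costs little,
while a broad one removes more bits from the total.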

We provide a table of data for the rbseg12 model for the U00096
E. coli K-12 MG1655 sequence that contains the following information:

30 nucleotides upstream of gene start
Location of the gene start (Start)
Location of the Shine-Dalgarno (SD)
Orientation of the gene (Orient)
Strength of the SD (Ri(SD))
Distance between the SD and the ATG (Gap)
Total strength of RBS including ATG (Ri(total))

The first codon of every gene is the last three bases of the
sequence. The SD coordinate corresponds to the central "G" in the
SD (refer back to the ribosome paper), and the spacing is the
difference between this base and the first base of the start codon
(usually an "A" in "ATG").
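
The coordinate conventions above can be sketched as follows. This is an
illustration only: the function names, the +1/-1 encoding of the orientation,
and the example values are assumptions, not the actual layout of the table.

```python
def start_codon(sequence):
    """The first codon of the gene is the last three bases of the sequence."""
    return sequence[-3:]

def gap(start, sd, orient):
    """Distance from the central G of the SD to the first base of the start
    codon. Assumes orient is +1 for the forward strand and -1 for the
    complementary strand, so that an upstream SD gives a positive gap."""
    return (start - sd) * orient

# Hypothetical sequence and coordinates, made up for illustration.
seq = "TTTACGCTAAGGAGGTTTAACATATG"
print(start_codon(seq))                       # last three bases of seq
print(gap(start=1000, sd=991, orient=+1))     # forward-strand gene
print(gap(start=2000, sd=2009, orient=-1))    # complementary-strand gene
```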

Computation of Rsequence and Rfrequency, as of 2005 Aug 23:
The original computation of Rsequence and Rfrequency for E. coli ribosome
binding sites in Schneider1986 gave Rsequence = 11.0 bits and
Rfrequency = 10.6 bits. The computation is changed in two ways now. First,
the original data set contained all known ribosome binding sites, including
those in bacteriophage. However, we now know that bacteriophage ribosome
binding sites have higher information content than chromosomal ones, so
Rsequence should be somewhat lower. Indeed, the estimates are now:

as given in this paper.
The other change is that the entire genome has been sequenced:
Escherichia coli K12 (NC_000913) is 4639675 bp. It contains about
4242 genes, so Rfrequency = 10.10 bits. This is remarkably close to the
values for Rsequence! (EcoGene 12 contained 4122 genes, giving
Rfrequency = 10.14 bits, which makes no difference to these results.)
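
The figures quoted here are consistent with Rfrequency = log2(G / gamma),
where G is the genome length in base pairs and gamma is the number of sites
(one per gene). The sketch below assumes that formula and that G is the
single-strand length; the function name is illustrative.

```python
import math

def rfrequency(genome_bp, n_sites):
    """Bits needed to pick n_sites positions out of genome_bp positions."""
    return math.log2(genome_bp / n_sites)

GENOME_BP = 4639675  # E. coli K12 (NC_000913), from the text above

for label, n_genes in [("about 4242 genes", 4242),
                       ("EcoGene 12, 4122 genes", 4122)]:
    print(f"{label}: Rfrequency = {rfrequency(GENOME_BP, n_genes):.2f} bits")
```

Because the gene count enters through a logarithm, a difference of 120 genes
shifts the result by only about 0.04 bits.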

CORRECTION:
Under Materials and Methods, page 225, third paragraph, the reference to
Blattner points to reference 37 instead of 33.
For some reason we wrote that reference as 'Blattner et al. (1997)' and it
got typeset incorrectly. Such errors never occur in LaTeX, which we use all
the time; they occur frequently when people get involved, as apparently
happened in this case. Unfortunately we missed the alteration at the proof
stage. I, for one, am so used to the perfect referencing mechanism of LaTeX
that I don't even think about checking such things anymore. But with humans
involved, nothing is safe.