Complete genomes provide a useful framework for organizing and analyzing partial
sequences from related genomes. A sample of 2X to 3X genome equivalents
covers over 90% of the genome, and more than 99% of all ORFs
over 500 bases in length should be represented by a fragment of at least 100
bases. Information on the presence of shared ORFs and partial identities of
unique ORFs can be obtained at a fraction of the cost of complete sequencing.
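
The relationship between sampling depth and coverage can be explored with a
simple Poisson (Lander-Waterman style) model. The sketch below is illustrative
only: it assumes reads start uniformly at random, and the read length of 500
bases is a hypothetical value that the text does not state. Its idealized
figures need not match the estimates quoted above, which may rest on different
assumptions.

```python
import math


def expected_coverage(depth: float) -> float:
    """Expected fraction of bases covered at least once at the given depth,
    assuming read starts follow a Poisson process (idealized model)."""
    return 1.0 - math.exp(-depth)


def prob_orf_sampled(depth: float, orf_len: int, min_overlap: int = 100,
                     read_len: int = 500) -> float:
    """Probability that at least one read overlaps an ORF by >= min_overlap bases.

    read_len is an assumed value, not taken from the text.
    """
    if min_overlap > min(orf_len, read_len):
        return 0.0
    # Number of read start positions that give an overlap of at least min_overlap.
    window = orf_len + read_len - 2 * min_overlap + 1
    # Read starts per base at this depth.
    start_rate = depth / read_len
    return 1.0 - math.exp(-start_rate * window)


if __name__ == "__main__":
    for depth in (2.0, 3.0):
        print(depth, expected_coverage(depth), prob_orf_sampled(depth, orf_len=500))
```

Longer ORFs offer a wider window of qualifying read positions, so under this
model the probability of sampling them rises with ORF length.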

To determine the utility of sample sequences, we have collected data from two
Enterobacteria: Salmonella paratyphi A (SPA) and a clinical isolate of Klebsiella
pneumoniae (KPN). These strains are of interest as human pathogens and for understanding
enterobacterial evolution. SPA is very closely related to the completed Salmonella
genomes, whereas KPN belongs to a sister clade of Salmonella and Escherichia. Over
10 million bases of raw sequence, representing between 2X and 3X genome equivalents,
were collected from each of SPA and KPN, which assembled to 4,384 kb and 5,084 kb,
respectively.

Among the Enterobacteriaceae, the E. coli K12 genome (ECO) has been completely sequenced
[U. Wisconsin], and the genomes of Yersinia pestis (YPE), Salmonella typhi (STY)
[Sanger Center] and S. typhimurium LT2 (STM) [Wash. U., http://genome.wustl.edu/gsc/bacterial/salmonella.shtml]
are soon to be completed. The ECO sequence has been aligned to the available
sequence from each of STM, STY, SPA, KPN, YPE, and Vibrio cholerae. These alignments
can be viewed as a "percent identity plot" (PIP), in which the percent identity
of each ungapped match is plotted on the Y-axis for each pairwise comparison. Deletions
in the sampled genomes and the sites of rearrangements and of significant insertions
are highlighted in color. The alignments can be queried with any named ECO gene,
and the corresponding region is then displayed in multiple genomes simultaneously.
Matching sequences in each aligned genome, associated with the reference gene
and flanking regions, are automatically made available.
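
As an illustration of the quantity plotted in a PIP, the percent identity of an
ungapped match can be computed as below. This is a minimal sketch, not the
software used to produce the alignments; the function names and the tuple
layout for a match are hypothetical.

```python
from typing import List, Tuple


def percent_identity(ref: str, other: str) -> float:
    """Percent of identical positions in an ungapped match of equal-length strings."""
    if len(ref) != len(other) or not ref:
        raise ValueError("ungapped match requires two non-empty, equal-length sequences")
    same = sum(1 for a, b in zip(ref.upper(), other.upper()) if a == b)
    return 100.0 * same / len(ref)


def pip_segments(matches: List[Tuple[int, str, str]]) -> List[Tuple[int, int, float]]:
    """Turn (reference start, reference segment, aligned segment) matches into
    (start, end, percent identity) triples, one horizontal bar per match."""
    return [(start, start + len(ref), percent_identity(ref, other))
            for start, ref, other in matches]


if __name__ == "__main__":
    demo = [(100, "ACGTACGT", "ACGTACGA"), (250, "GGCC", "GGCC")]
    for start, end, pid in pip_segments(demo):
        print(start, end, pid)
```

Each triple gives the reference interval of one ungapped match and the height
at which it would be drawn on the Y-axis.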

Unique portions of the complete and sampled genomes were identified with the
FASTX and TFASTX programs. To search for unique regions and potential rearrangements
in the sampled genomes, each sampled sequence was compared to the E. coli proteome
using FASTX, and the complete E. coli proteome was compared against the partially
sequenced genome databases using TFASTX. We present lists of (a)
all ORFs found in ECO for which orthologues are apparently absent from the STM,
SPA, or KPN samples, and (b) sequences over 400 bp in length that are found in one
or more of STM, SPA, or KPN but are absent from ECO. The best homologues of these
"unique" regions were determined from other sequence databases, including incomplete
genomes deposited at NCBI.
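
The two lists can be thought of as set differences over the search results. The
sketch below assumes the FASTX and TFASTX hits have already been parsed into
simple records; the `Hit` record, the function names, and the E-value cutoff of
1e-5 are hypothetical choices for illustration, since the text does not state
the criteria used to call a sequence absent.

```python
from typing import Dict, Iterable, List, NamedTuple, Set


class Hit(NamedTuple):
    query: str      # identifier of the query sequence
    subject: str    # identifier of the matched database sequence
    evalue: float   # expectation value reported for the match


def queries_with_hits(hits: Iterable[Hit], max_evalue: float) -> Set[str]:
    """Identifiers of queries with at least one hit at or below the cutoff."""
    return {h.query for h in hits if h.evalue <= max_evalue}


def absent_orfs(eco_orfs: Set[str], tfastx_hits: Iterable[Hit],
                max_evalue: float = 1e-5) -> Set[str]:
    """List (a): ECO ORFs with no significant match in a sampled genome."""
    return eco_orfs - queries_with_hits(tfastx_hits, max_evalue)


def unique_sample_seqs(sample_lengths: Dict[str, int], fastx_hits: Iterable[Hit],
                       max_evalue: float = 1e-5, min_len: int = 400) -> List[str]:
    """List (b): sampled sequences over min_len bp with no significant ECO match."""
    matched = queries_with_hits(fastx_hits, max_evalue)
    return sorted(name for name, length in sample_lengths.items()
                  if length > min_len and name not in matched)


if __name__ == "__main__":
    eco = {"thrA", "lacZ", "fimA"}
    tfastx = [Hit("thrA", "SPA_c1", 1e-50), Hit("lacZ", "SPA_c9", 0.5)]
    print(sorted(absent_orfs(eco, tfastx)))
    lengths = {"SPA_c1": 1200, "SPA_c2": 900, "SPA_c3": 300}
    fastx = [Hit("SPA_c1", "thrA", 1e-40)]
    print(unique_sample_seqs(lengths, fastx))
```

Because the samples are incomplete, an ORF flagged by `absent_orfs` is only
apparently absent; it may simply lie in a region that was not sampled.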