Est_db

The est_db package is a software suite and database system designed to support expressed sequence tag (EST) sequencing
projects, and to provide comprehensive bioinformatic analysis of sequenced EST libraries, for gene discovery and other
purposes.

The database can hold and efficiently process hundreds of thousands of EST sequences, track the cDNA libraries and
clones to which they belong, and store the results of their analysis. Should they be available, large compute farms can
be used for the analysis.

New to Est_db?

The manual explains the software and what most parts of the program do.

What does Est_db do?

Extensive bioinformatic analysis can be carried out on the sequenced EST libraries, including similarity (BLAST)
searches, protein sequence prediction, and the import of EST clustering and assembly data from external sources.
Results are searchable via a web page with graphical output of the various analyses. Users can retrieve information
about a particular cDNA clone or EST read, view EST clustering results, and view graphical representations of BLAST
results for the searched EST sequences.

The est_db package is likely to appeal not only to groups directly engaged in EST sequencing, but also to groups
interested in performing bespoke analysis of ESTs that may already be publicly available, in order to support their
ongoing research aims. The package is easily extensible via an API designed specifically to handle ESTs and their
analysis. It is open source and available free of charge and, where possible, similarly open-licensed components have
been used in its development.

Application at the Sanger Institute

The est_db software package has been developed and used at the Sanger Institute to support the Xenopus tropicalis EST
Project, a collaboration between the Sanger Institute and the Wellcome/Cancer Research UK Gurdon Institute in
Cambridge. To date, est_db has been used to process the nearly 400,000 ESTs sequenced as part of the project,
approximately 305,000 of which passed its quality control (QC) checks and have been submitted to public databanks.

The extensive facilities offered by est_db to analyse large numbers of ESTs have been used for the bioinformatic
analysis of these sequenced X. tropicalis EST libraries, facilitating use of the data by the scientific community. This
analysis can be viewed and searched live via the X. tropicalis est_db web interface.

Description

The est_db software package consists of three principal components: a relational database back-end (MySQL), a Perl API
(EST_DB.pm), and a CGI web script. The MySQL database holds all the information stored in the est_db system, including
the EST data itself and the details of the cDNA clones and libraries from which the DNA sequences were produced. Also
stored are the results obtained from the various bioinformatic programs incorporated into the est_db analysis pipeline
(currently WuBLAST, RepeatMasker and ESTScan). EST clustering and sequence assembly results are stored in the database,
together with the information required to control the analysis pipeline and the tracking information necessary for
submitting ESTs to public databanks.

All the stored information can be accessed and manipulated in a high-level manner using the object-oriented Perl API.
This makes it straightforward to implement sophisticated analyses of both the raw EST data and the derived analysis
results.
Classes are provided to handle the EST sequence data, EST clustering results, and subsequent BLAST and other analysis
of both ESTs and consensus sequences generated from EST clustering. The schema is neutral to the method or package used
to cluster and assemble the ESTs, but a database adaptor is provided which can directly extract results from a
StackPACK2.1.1 MySQL analysis database.

Web functionality is implemented with a Perl script, using the CGI.pm and GD.pm modules. A set of easily extensible
classes (EST_DB::ESTView) is provided as a high-level means to generate and place features on the graphic
representations of sequences, allowing the graphic web views to be extended or customised as additional analysis
results are added to the pipeline.

The est_db pipeline has features designed to handle job creation and management within the est_db system, with the LSF
scheduler being used to execute the underlying tasks. This allows lengthy analysis processes, such as some BLAST
searches, which might take days or weeks on a single CPU, to be completed in a few hours. The pipeline splits the
whole analysis into a number of smaller jobs, each of which can be executed on a separate CPU or machine, so that
execution is parallelised. The pipeline has been tested to handle more than 300 machines reading and writing
concurrently to the MySQL database as analyses are performed. The user can specify various parameters to control the
pipeline (job granularity, etc.), allowing the software installation to be customised to the available hardware
resources.
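
To illustrate what job granularity means here, the following is a minimal Python sketch of splitting a set of EST
identifiers into fixed-size batches, one batch per job. It is not taken from the est_db pipeline, which is written in
Perl and submits its jobs through LSF. The function name split_into_jobs, the parameter ests_per_job, and the
identifiers are all hypothetical and chosen only for this example.

```python
from typing import Iterator, List, Sequence


def split_into_jobs(est_ids: Sequence[str], ests_per_job: int) -> Iterator[List[str]]:
    """Yield consecutive batches of EST identifiers, one batch per job.

    ests_per_job is a hypothetical granularity setting: smaller values give
    more, shorter jobs; larger values give fewer, longer jobs.
    """
    if ests_per_job < 1:
        raise ValueError("ests_per_job must be at least 1")
    for start in range(0, len(est_ids), ests_per_job):
        yield list(est_ids[start:start + ests_per_job])


# Hypothetical identifiers, for illustration only.
est_ids = [f"EST{n:06d}" for n in range(1, 11)]
jobs = list(split_into_jobs(est_ids, ests_per_job=4))

for number, batch in enumerate(jobs, start=1):
    # In a real installation each batch would be handed to the scheduler.
    print(f"job {number}: {len(batch)} ESTs")
```

With ten identifiers and four ESTs per job, this yields three batches, the last one partly filled. A smaller batch
size spreads the work over more machines at the cost of more scheduling and database overhead per job.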

Familiarity with the Ensembl API will aid use of the est_db API, as the latter shares many design features with the
Ensembl genome annotation system and web browser (www.ensembl.org). The majority of programs and modules are
documented with embedded Perl documentation (POD). Additionally, examples of running the pipeline and summaries of the
methods available in the principal database adaptor (EST_DB::DB_Adaptor::Sanger) and the ESTView classes are provided
in the /doc and /sanger/doc directories (see below).

Licensing conditions

Open source, available free of charge under the terms of the Perl Artistic License.

Brief install instructions

1. Download the compressed software package:
   ftp://ftp.sanger.ac.uk/pub/EST_data/Xenopus/est_db_software/est_db_06_11_03.tar.gz
   Size: 222321 bytes
   MD5: b37d57863ef8ab69448b2d28196e1393
2. Download one of the current X. tropicalis EST_DB dumps.
   Individual libraries clustered separately:
   ftp://ftp.sanger.ac.uk/pub/EST_data/Xenopus/est_db_dump/X_tropicalis_06_11_03_by_library.tar.gz
   Size: 177244356 bytes
   MD5: ff137b86ed2e5d1686845967a737c7e7
   Global clustering of all libraries:
   ftp://ftp.sanger.ac.uk/pub/EST_data/Xenopus/est_db_dump/X_tropicalis_06_11_03_global.tar.gz
   Size: 144148585 bytes
   MD5: 04bda23cd3c837d86d537d38d8a9bf8e
3. Extract all the files from the archives.
4. Set the PERL5LIB environment variable so that the EST_DB modules, as well as the others mentioned, can be found.
5. Run scripts/create_est_tables to create a blank EST_DB on your MySQL server.
   (You need to edit the script to set the MySQL username.)
6. Reload the data with scripts/reload_text_MySQL_EST_DB_dump.
   (You need to edit the .conf file in the /conf directory. The script needs to be run local to the server.)
7. Install the CGI script on your web server.
   (Edit the web_config file to set the MySQL host and user and the temporary file location.)
8. Run perldoc on the files to generate a set of script and API documentation.
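
The size and MD5 values listed with each download can be used to check an archive before extracting it. The following
is a minimal Python sketch of such a check using only the standard library. It is not part of the est_db
distribution, the helper names md5_of_file and verify_download are hypothetical, and it assumes the archive has
already been downloaded into the current directory.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: Path, expected_size: int, expected_md5: str) -> bool:
    """Check a downloaded file against a published size and MD5 value."""
    if path.stat().st_size != expected_size:
        return False
    return md5_of_file(path) == expected_md5.lower()


# Values copied from the install instructions for the software package.
archive = Path("est_db_06_11_03.tar.gz")
if archive.exists():
    ok = verify_download(archive, 222321, "b37d57863ef8ab69448b2d28196e1393")
    print("archive OK" if ok else "archive does not match the published size or MD5")
```

The same call works for the database dumps by substituting their file names, sizes, and MD5 values. Reading in chunks
keeps memory use low for the larger dump archives.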

Support

While we hope the software is reused as widely as possible, the amount of time we have to support off-site use and
installation is rather limited. Should demand for the software be high, we may be able to increase the amount of
documentation currently available. Successfully installing the package is likely to require significant Perl and
relational database experience.