Hi All,
I've completed some early benchmarking on the latest iteration of the
Bio::DB::GFF3 module. What distinguishes this module from the original
Bio::DB::GFF, in addition to its ability to correctly handle the multiple
levels of containment in GFF3, is how features are stored. There are still
relational tables for the feature location, name, attributes and type that
are used for querying, but the feature itself and all its subparts are
instantiated as one Bio::SeqFeatureI object at load time, then serialized
(using Storable or Data::Dumper) and stored in a relational table as a BLOB.
Another change is
that the "binning" scheme now uses integers rather than floats; this will
avoid the precision problems that have plagued users of different MySQL
versions.
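To illustrate why integer bins sidestep the precision issue, here is a minimal sketch in Python. It is not the module's code: the tier sizes (MIN_BIN, MAX_BIN), the factor of 10 between tiers, and both function names are assumptions made for the example. The idea shown is only that a bin is identified by a pair of integers, so comparisons in SQL are exact.

```python
# Hypothetical tier sizes; the post does not state the real values.
MIN_BIN = 1_000
MAX_BIN = 1_000_000_000


def feature_bin(start, end, min_bin=MIN_BIN, max_bin=MAX_BIN):
    """Return (tier_size, bin_index) for the smallest bin containing [start, end].

    Only integer division is used, so the result is exact on any platform.
    """
    size = min_bin
    while size < max_bin:
        if start // size == end // size:
            break
        size *= 10
    return size, start // size


def candidate_bins(start, end, min_bin=MIN_BIN, max_bin=MAX_BIN):
    """List (tier_size, first_bin, last_bin) ranges to search on every tier
    for features that could overlap [start, end]."""
    ranges = []
    size = min_bin
    while size <= max_bin:
        ranges.append((size, start // size, end // size))
        size *= 10
    return ranges


if __name__ == "__main__":
    print(feature_bin(1500, 1800))
    print(feature_bin(900, 1100))
    for tier in candidate_bins(900, 1100):
        print(tier)
```

A feature is stored with one (tier, bin) pair, and a range query checks one integer bin range per tier, which an ordinary integer index can serve without any float comparison.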
Storing pre-built objects means that it takes longer to load the database,
but less time to retrieve objects, because all the Bio::SeqFeature object
creation is done up front. It also means that there are fewer objects in the
database because a
gene, its transcripts, and all its exons are all stored as a single object
rather than as multiple objects that need to be aggregated together at fetch
time.
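The storage layout described above can be sketched in a few lines of Python. This is an analogy only: pickle stands in for Storable, SQLite stands in for MySQL, and the table names, column names and function names are all hypothetical, not the module's schema. The point is that the index table is used for the query while the whole gene comes back from a single BLOB.

```python
import pickle
import sqlite3


def open_store():
    """Create an in-memory database with a BLOB table and a location index."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE feature  (id INTEGER PRIMARY KEY, object BLOB NOT NULL);
        CREATE TABLE location (id INTEGER, seqid TEXT,
                               fstart INTEGER, fstop INTEGER);
        CREATE INDEX location_idx ON location (seqid, fstart, fstop);
    """)
    return db


def store_feature(db, feature):
    """Serialize the whole feature tree once, at load time."""
    cur = db.execute("INSERT INTO feature (object) VALUES (?)",
                     (pickle.dumps(feature),))
    db.execute("INSERT INTO location VALUES (?, ?, ?, ?)",
               (cur.lastrowid, feature["seqid"],
                feature["start"], feature["stop"]))
    return cur.lastrowid


def fetch_overlapping(db, seqid, start, stop):
    """Query the index table, then deserialize each matching BLOB."""
    rows = db.execute(
        "SELECT f.object FROM feature f JOIN location l ON l.id = f.id "
        "WHERE l.seqid = ? AND l.fstart <= ? AND l.fstop >= ? "
        "ORDER BY l.fstart",
        (seqid, stop, start))
    return [pickle.loads(blob) for (blob,) in rows]


if __name__ == "__main__":
    db = open_store()
    gene = {"type": "gene", "seqid": "I", "start": 1000, "stop": 5000,
            "transcripts": [{"type": "mRNA",
                             "exons": [(1000, 1500), (4000, 5000)]}]}
    store_feature(db, gene)
    for hit in fetch_overlapping(db, "I", 1200, 1300):
        print(hit["type"], len(hit["transcripts"][0]["exons"]))
```

One row in the feature table carries the gene, its transcript and both exons, so nothing has to be aggregated at fetch time.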
Here are the benchmarking results:
DATA SET: 2,849 genes (along with associated data) from
          C. elegans chromosome I
LOAD TESTS:
  Bio::DB::GFF  (bp_bulk_load_gff.pl):  54.58s, 13M database
  Bio::DB::GFF3 (perl DBI loading):    245.06s, 11M database
RETRIEVE TESTS (fetch 1000 random genes):
  Bio::DB::GFF:   16.81s
  Bio::DB::GFF3:   1.99s
So there's about an 8x speedup in retrieval, but roughly a 4.5x slowdown in
loading, which is pretty much what I expected. More surprisingly, the storage
size for the data is actually smaller for the Bio::DB::GFF3 database than for
Bio::DB::GFF.
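For anyone who wants to check the ratios, the arithmetic from the timings above can be reproduced as follows. The break-even figure is an extra back-of-the-envelope estimate that assumes fetch time scales linearly with the number of genes fetched and that these single-run timings are representative; neither assumption was tested here.

```python
# Timings copied from the benchmark results above (seconds).
LOAD_SECONDS = {"GFF": 54.58, "GFF3": 245.06}
FETCH_SECONDS_PER_1000 = {"GFF": 16.81, "GFF3": 1.99}

# How much slower loading is, and how much faster retrieval is.
load_slowdown = LOAD_SECONDS["GFF3"] / LOAD_SECONDS["GFF"]
fetch_speedup = FETCH_SECONDS_PER_1000["GFF"] / FETCH_SECONDS_PER_1000["GFF3"]

# Extra load time paid once, versus time saved on every gene fetched
# (linear scaling is an assumption of this sketch).
extra_load = LOAD_SECONDS["GFF3"] - LOAD_SECONDS["GFF"]
saved_per_gene = (FETCH_SECONDS_PER_1000["GFF"]
                  - FETCH_SECONDS_PER_1000["GFF3"]) / 1000
break_even_genes = extra_load / saved_per_gene

if __name__ == "__main__":
    print(f"load slowdown:  {load_slowdown:.2f}x")
    print(f"fetch speedup:  {fetch_speedup:.2f}x")
    print(f"break-even after about {break_even_genes:.0f} gene fetches")
```

Under those assumptions the slower load pays for itself after roughly 13,000 gene fetches.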
This looks pretty good to me. My plan now is to experiment with a variation of
the scheme in which each subfeature is stored as a separate BioPerl object
and then loaded in a lazy fashion as needed. This will mean that there will
be as many as three database fetches to get a full gene, but it also allows
one to ignore genes and just do queries for exons, UTRs, etc. Things that
have split locations -- such as alignments -- will continue to be stored as a
single object, however.
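The lazy variation might look something like the following Python sketch. It is an illustration of the idea only, with an in-memory dict standing in for the database; FeatureStore, LazyFeature and fetch_full are hypothetical names, and the gene -> transcript -> exon layering is assumed. Children are loaded on first access, and a level-by-level loader shows where the figure of three fetches for a full gene comes from.

```python
import pickle


class FeatureStore:
    """Stand-in for the database: one pickled record per feature."""

    def __init__(self):
        self._rows = {}
        self.fetches = 0  # counts simulated database round trips

    def put(self, fid, ftype, child_ids=()):
        record = {"id": fid, "type": ftype, "child_ids": list(child_ids)}
        self._rows[fid] = pickle.dumps(record)

    def fetch_many(self, ids):
        """One simulated round trip returning a LazyFeature per id."""
        self.fetches += 1
        return [LazyFeature(self, pickle.loads(self._rows[i])) for i in ids]


class LazyFeature:
    """A feature whose subfeatures are fetched only when first requested."""

    def __init__(self, store, record):
        self.id = record["id"]
        self.type = record["type"]
        self._child_ids = record["child_ids"]
        self._store = store
        self._children = None

    @property
    def children(self):
        if self._children is None:
            ids = self._child_ids
            self._children = self._store.fetch_many(ids) if ids else []
        return self._children


def fetch_full(store, root_id):
    """Load a feature and all descendants with one fetch per level."""
    (root,) = store.fetch_many([root_id])
    level = [root]
    while True:
        ids = [cid for f in level for cid in f._child_ids]
        if not ids:
            break
        fetched = {f.id: f for f in store.fetch_many(ids)}
        for f in level:
            f._children = [fetched[cid] for cid in f._child_ids]
        level = list(fetched.values())
    return root


if __name__ == "__main__":
    store = FeatureStore()
    for exon in ("e1", "e2", "e3"):
        store.put(exon, "exon")
    store.put("t1", "mRNA", ["e1", "e2"])
    store.put("t2", "mRNA", ["e2", "e3"])
    store.put("g1", "gene", ["t1", "t2"])
    gene = fetch_full(store, "g1")
    print(store.fetches, [t.id for t in gene.children])
```

An exon-only query in this scheme is a single fetch_many call on exon ids and never touches the gene or transcript records.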
Right now I'm still adjusting the names of the various modules, so they
remain in my private CVS for the moment. I'll move everything to bioperl-live
as soon as the names stabilize.
Lincoln
--
Lincoln Stein
lstein at cshl.edu
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
(516) 367-8380 (voice)
(516) 367-8389 (fax)