Convert BioPerl-DB to DBIx::Class

Rationale

BioPerl-DB (the BioPerl bindings to BioSQL) in essence constitutes a home-grown ORM, written at a time when DBIx::Class did not yet exist. As such, it has some advantages (if you are willing to count overly clever features in this category), but arguably many more disadvantages. Chief among them are the unsustainably small (you could also say non-existent) developer community supporting it, and the fact that DBIx::Class has now existed for years and is fairly mature. Rewriting BioPerl-DB on top of DBIx::Class (or another well-supported generic ORM) would therefore stand to make a considerable impact on our ability to further develop BioPerl's relational storage capabilities, as well as BioSQL itself.

Approach

Under the supervision of their mentor(s), the GSoC student will:

start converting BioPerl-DB classes to use DBIx::Class

write additional tests and improve documentation as needed

Challenges

BioPerl-DB is self-contained; the conversion may require examining the BioSQL schema to determine which areas need the most focus.

Difficulty and needed skills

Easy to hard, depending on the student's familiarity with the tools to be used. The student will need:

excellent Perl programming skills, including familiarity with:

DBIx::Class

Mentors

Hilmar Lapp, others?

Major BioPerl Reorganization, part 2

Save the monolith!

Rationale

The initial run at this project had some success, but more work needs to be done. The final goal of this project is to find and break out as many well-defined subsections of BioPerl as possible, releasing them to CPAN along the way.

Approach

Under the supervision of their mentor(s), the GSoC student will:

break current thousand-module monolithic distributions into smaller, more manageable pieces

improve characterization of dependencies

improve build and testing systems for new distributions

write additional tests and improve documentation as needed for the reorganization

Perl Run Wrappers for External Programs in a Flash

Rationale

BioPerl has a long tradition of providing wrapper objects for running external programs and parsing their output, mainly through the distribution called bioperl-run. Wrappers make it relatively easy to process data in highly customizable pipelines with the benefits of BioPerl objects and I/O. They also help to standardize the interfaces to typically idiosyncratic open-source utilities, reducing the burden on the developer. With new bioinformatics tools being released almost daily, however, it can be difficult for the BioPerl regulars to maintain a stable of run wrappers for the latest and greatest tools. Even harder is making the wrapper interfaces themselves conform to a standard API that users can count on.

Possible approaches

Integrate Galaxy's tool configuration file format in a pluggable way for developing a generic wrapper application.

Are there any shortcomings to current schemes, such as Galaxy's or EMBOSS's acd format, that could be addressed with a newer schema?

See HOWTO:Wrappers and the above module documentation for more details.
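As a rough illustration of the generic-wrapper idea, the sketch below builds a command line from a declarative tool description. Everything in it is an assumption made for illustration: the `%tool` layout, the parameter keys (`flag`, `type`, `required`, `default`), the program name `mytool`, and the `build_command` function are hypothetical and do not reproduce Galaxy's or EMBOSS's actual formats or any existing BioPerl API. A real implementation would presumably load the description from a tool configuration file rather than a hard-coded hash.

```perl
use strict;
use warnings;

# Hypothetical tool description. The keys and layout are illustrative only;
# they are not Galaxy's or EMBOSS's actual format.
my %tool = (
    program => 'mytool',
    params  => [
        { name => 'input',   flag => '-i',        type => 'value', required => 1 },
        { name => 'threads', flag => '--threads', type => 'value', default  => 1 },
        { name => 'verbose', flag => '-v',        type => 'switch' },
    ],
);

# Turn user-supplied arguments into an argument list suitable for system().
# Parameters are emitted in the order in which the description lists them.
sub build_command {
    my ($spec, %args) = @_;
    my @cmd = ($spec->{program});
    for my $p (@{ $spec->{params} }) {
        # Use the caller's value if given, otherwise fall back to the default.
        my $value = exists $args{ $p->{name} } ? $args{ $p->{name} } : $p->{default};
        if (!defined $value) {
            die "Missing required parameter '$p->{name}'\n" if $p->{required};
            next;
        }
        if ($p->{type} eq 'switch') {
            # Switches contribute only their flag, and only when true.
            push @cmd, $p->{flag} if $value;
        }
        else {
            push @cmd, $p->{flag}, $value;
        }
    }
    return @cmd;
}

my @cmd = build_command(\%tool, input => 'reads.fa', verbose => 1);
print join(' ', @cmd), "\n";   # prints "mytool -i reads.fa --threads 1 -v"
```

Returning a list rather than a single string lets the caller pass it straight to the list form of `system`, which avoids shell quoting problems. Supporting a new tool would then mean writing a new description rather than a new wrapper class.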

Difficulty and needed skills

Medium. The student should understand, or be willing to work hard at understanding, BioPerl's object-oriented style. Some familiarity with XML and XML Schema will help in getting up to speed. An interest in playing with new open-source bioinformatics tools, especially those for managing next-generation sequence assembly, would also be valuable.

Lightweight/Lazy BioPerl Classes

Rationale

Many current BioPerl classes are implemented in a greedy or heavy way, where all information is pulled into memory as objects. For instance, the current Bio::Seq implementation is the primary bottleneck for sequence parsing speed and can consume a large amount of memory, particularly with whole-genome and next-generation sequencing data. Storing the data in memory in a simple data structure and generating the objects lazily could help with speed. Alternatively, storing the data in a persistent store would help with memory use, at an obvious cost in speed but with the side benefit of a consistent, and possibly persistent, way of handling data.

Approach

Implement a Bio::Seq/Bio::PrimarySeq class (or other commonly-used BioPerl classes) that can deal with very large datasets in a memory-efficient manner. Implement at least one corresponding parser that can either parse records lazily (akin to an XML pull parser) or create lightweight objects. These could be considered two projects but they are interrelated (lightweight objects could have many different backends, including lazy parsing), so development should proceed with this in mind.
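To make the lazy idea concrete, here is a minimal sketch using only core Perl. It scans a FASTA file once, records the byte range of each record's sequence lines, and reads the sequence from disk only when `seq` is first called. The `LazyFastaRecord` package and the `index_fasta` function are hypothetical names invented for this example. They are not part of BioPerl and do not implement the Bio::PrimarySeqI interface, which a real implementation would need to do.

```perl
use strict;
use warnings;

# Hypothetical illustration only: LazyFastaRecord is not a BioPerl class.
package LazyFastaRecord;

sub new {
    my ($class, %args) = @_;
    # Keep only the identifier and the byte range of the sequence lines.
    return bless {
        id     => $args{id},
        file   => $args{file},
        offset => $args{offset},   # byte offset of the first sequence line
        length => $args{length},   # bytes up to the next header or end of file
    }, $class;
}

sub id { $_[0]{id} }

# The sequence string is read from disk on first access and cached afterwards.
sub seq {
    my ($self) = @_;
    unless (defined $self->{seq}) {
        open my $fh, '<', $self->{file} or die "Cannot open $self->{file}: $!";
        seek($fh, $self->{offset}, 0);
        read($fh, my $chunk, $self->{length});
        close $fh;
        $chunk = '' unless defined $chunk;
        $chunk =~ s/\s+//g;        # drop line breaks inside the record
        $self->{seq} = $chunk;
    }
    return $self->{seq};
}

package main;

use File::Temp qw(tempfile);

# Scan the file once, recording where each record's sequence starts and ends.
sub index_fasta {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my @records;
    my $current;
    while (1) {
        my $pos  = tell($fh);      # start of the line about to be read
        my $line = <$fh>;
        last unless defined $line;
        if ($line =~ /^>(\S+)/) {
            $current->{length} = $pos - $current->{offset} if $current;
            $current = { id => $1, file => $file, offset => tell($fh) };
            push @records, $current;
        }
    }
    $current->{length} = tell($fh) - $current->{offset} if $current;
    close $fh;
    return map { LazyFastaRecord->new(%{$_}) } @records;
}

# Small demonstration on a temporary two-record file.
my ($out, $path) = tempfile(UNLINK => 1);
print {$out} ">seq1 first\nACGT\nACGT\n>seq2\nGGCC\n";
close $out;

my @recs = index_fasta($path);
print $recs[1]->id, "\t", $recs[1]->seq, "\n";   # only seq2's bytes are read here
```

The same object interface could sit on top of a different backend, for example a database row instead of a file offset, which is why the lightweight-object and lazy-parser pieces are best designed together.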

Difficulty and needed skills

Medium to hard. The student should have an excellent command of Perl and data structures, experience with persistent storage mechanisms (such as a SQL-based RDBMS, CouchDB, etc.), and some familiarity with parsing methodologies.

Prior art

Jason Stajich started a SQLite-based lightweight Bio::Tree::Tree implementation on a GitHub branch during the GMOD Evolutionary Biology Hackathon at NESCent in fall 2010.

Mentors

Chris Fields, Jason Stajich

BioPerl 2.0 (and beyond)

Rationale

Design or reimplement BioPerl classes without API constraint, using Modern Perl tools or Perl 6.

Approach

Most BioPerl code is over 6 years old and doesn't take advantage of Modern Perl tools, such as the new features available in Perl 5.10 and 5.12, Moose/MooseX, DBIx::Class, Catalyst, and more. Furthermore, a viable Perl 6 implementation, Rakudo, is currently available. This gives us an enormous opportunity to redesign fundamental aspects of BioPerl without development being hindered by a requirement for backwards compatibility.

Two projects, Biome (Moose-based BioPerl) and BioPerl6 (Perl 6 BioPerl), have already started but are at a very early stage. One could participate in:

IO implementations for object iteration, or Perl 6 grammars for common formats

Redesign of common BioPerl classes

etc.

This is an area ripe for new student project ideas. The more focused the better! Discussion is a must, either via IRC or email.

Difficulty

Project-dependent

Mentors

Chris Fields, Rob Buels

Bio::Assembly

Continued refinement of AssemblyIO: SAM or ACE files, once imported, should expose similar handles and/or methods.

Semantic Web Support

Rationale

There are great development opportunities in information discovery for bioinformatics using the Semantic Web, especially in implementing SPARQL queries for a "discoverable bio-cloud".

Parsers and converters from and to RDF, including IO modules for GenBank, EMBL, several XML specifications, et cetera.

Storage and retrieval of information using SPARQL.

Difficulty and needed skills

Medium. The student should be familiar with the SeqIO modules and with Perl itself, as well as with the RDF format and the concept of RDF triples for the Semantic Web.

Mentors

To be determined. Kjetil Kjernsmo can help mentor students wishing to explore the RDF::Trine direction.

(your idea here)

Please feel very free to propose your own idea. As long as it is relevant to one of our projects, we will give it serious consideration. Creativity and self-motivation are great traits for open source programmers.

Do not hesitate to propose your own project idea: some of the best applications we see come from students who go this route.

Past Projects

2011

Major BioPerl reorganization

Save the monolith!

Rationale

BioPerl is currently suffering from an overly-monolithic structure, which is becoming unwieldy and contributing to paralysis of the project.

Approach

Under the supervision of their mentor(s), the GSoC student will:

break current thousand-module monolithic distributions into smaller, more manageable pieces

improve characterization of dependencies

improve build and testing systems for new distributions

write additional tests and improve documentation as needed for the reorganization

2010

Alignment Subsystem Refactoring

Rationale

BioPerl's Bio::Align::AlignI subsystem is quite old and in need of significant refactoring. Furthermore, the Bio::Align::AlignI and Bio::Assembly subsystems need further integration. This is an area ripe for reimplementation to make a more consistent set of modules.