Synopsis

We propose a hackathon to fill critical gaps in the capabilities of the Generic Model Organism Database (GMOD) toolbox that currently limit its utility for evolutionary research. Specifically, we aim to focus on tools for 1) viewing comparative genomics data; 2) visualizing phylogenomic data; and 3) supporting population diversity data and phenotype annotation.

The event would be hosted at NESCent and bring together a group of about 20 software developers, end-user representatives, and documentation experts who would otherwise not meet. The participants would include key developers of GMOD components that currently lack features critical for emerging evolutionary biology research, developers of informatics tools in evolutionary research that lack GMOD integration, and informatics-savvy biologists who can represent end-user requirements.

The event would hence provide a unique opportunity to infuse the community of GMOD developers with a heightened awareness of unmet needs in evolutionary biology that GMOD components have the potential to fill, and for tool developers in evolutionary biology to better understand how best to extend or integrate with already existing GMOD components.

Background

The GMOD project is a confederation of intercompatible open-source projects developing software tools for storing, managing, curating, and publishing biological data. Although the GMOD project originated from the goal of developing a generic tool set for common needs among model organism databases, GMOD tools are now used by many biological database projects, large and small, collaborative and single-investigator, to disseminate the results of experimental research and curated knowledge.

GMOD's software tools provide a powerful and feature-rich basis for working with biological data, in particular genomic and other molecular data. However, due to GMOD's historical emphasis on single-genome projects, many GMOD tools still lack features that are critical to effectively support the comparative, phylogenetic, and natural diversity-oriented questions frequently asked in evolutionary research.

Recent developments have opened a window of opportunity for forging collaborations to fill this gap. In particular, the cost of collecting comparative molecular data on a large or even genomic scale has dropped dramatically, primarily thanks to next-generation high-throughput sequencing technologies. This has enabled evolutionary researchers to bring genome-scale molecular data to bear on key evolutionary questions. It has also allowed molecular biology labs focused on a single organism, which represent GMOD's traditional user base, to broaden out to multi-organism comparative approaches. Bringing together these two communities, which have increasingly shared interests and complementary scientific and technical expertise, offers an opportunity to start filling GMOD's gaps in these areas while building on its existing strengths. In addition, such direct interaction will heighten awareness of the needs of evolutionary researchers among GMOD developers, who have so far mostly supported its traditional user base, and can in the long term increase the ranks of GMOD contributors from a field it was not originally designed to serve.

The hackathon meeting format is ideally suited to realize this opportunity. Its strengths lie in facilitating face-to-face interaction among people with complementary expertise, and collaborative work on tangible products that can form the basis of continued partnerships long beyond the end of the meeting. This meeting format and the overall goals of the event are closely aligned with NESCent's objectives in promoting collaborative work, data sharing, and interoperability. NESCent's past experience in organizing successful hackathons and its position as a neutral intellectual hub within evolutionary biology make it an ideal location for holding the event.

Specific objectives

Organizers have identified the following broad themes for focusing work at the event. These are based on the organizers' experience, interactions with others in the GMOD and evolution communities, and insights gained by the recent Tools for Emerging Model Systems working group (EMS WG) at NESCent.

Before and at the hackathon, the participants will refine and distill these and other options into concrete implementation targets. The participants will develop criteria for prioritization, such as maturity of a target for implementation, availability of test data, and potential for completing or making significant progress on the target during the hackathon. Further ideas and discussion topics can be found on the Supplemental Information page.

Viewing tools for comparative genomics data

GBrowse_syn is a popular GMOD component for viewing comparative genomics data, particularly for viewing synteny between genomes. It does not currently support the next-generation sequencing (NGS) data increasingly available for comparative genomics and emerging model systems. Support for NGS data was identified by the EMS working group as a high priority.

In particular, GBrowse_syn lacks support for the Sequence Alignment/Map (SAM) format; its mechanism for storing genome comparisons does not scale beyond a few organisms; and the means for tracking the necessary alignment metadata in Chado are insufficient.
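
For orientation, the sketch below shows what reading a single SAM alignment record involves at the simplest level. It is an illustration only and not GBrowse_syn code: the names SamRecord, parse_sam_record, and is_unmapped, as well as the sample record, are hypothetical. The eleven mandatory tab-separated fields and the 0x4 "unmapped" flag bit follow the SAM specification.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    """The eleven mandatory fields of a SAM alignment line."""
    qname: str   # query (read) name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string describing the alignment
    rnext: str   # reference name of the mate/next read
    pnext: int   # position of the mate/next read
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities (ASCII-encoded)


def parse_sam_record(line: str) -> SamRecord:
    """Parse one alignment line; header lines (starting with '@') are rejected."""
    if line.startswith("@"):
        raise ValueError("header line, not an alignment record")
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError(f"expected at least 11 fields, got {len(fields)}")
    # Optional TAG:TYPE:VALUE fields after the eleventh column are ignored here.
    qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual = fields[:11]
    return SamRecord(qname, int(flag), rname, int(pos), int(mapq),
                     cigar, rnext, int(pnext), int(tlen), seq, qual)


def is_unmapped(record: SamRecord) -> bool:
    """Bit 0x4 of the flag marks a read that has no alignment."""
    return bool(record.flag & 0x4)


# Hypothetical sample record, invented for illustration.
sample = "read1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII"
record = parse_sam_record(sample)
print(record.rname, record.pos, record.cigar, is_unmapped(record))
```

A browser component would additionally need indexed access to the binary BAM form and a way to link alignments to their metadata in Chado, which this sketch does not attempt.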

In addition to filling those gaps, GBrowse_syn would also particularly stand to benefit from the event by gaining a more sustainable developer base.

Visualization of phylogenetic data and trees

The GMOD toolkit at present does not include web-based alignment viewers, nor can the increasingly popular JBrowse genome browser (the designated successor of GBrowse) display multiple sequence alignments. GMOD also lacks a phylogenetic tree widget.

Implementing these from scratch would be far beyond a suitable hackathon target. However, SGN has a relatively mature web-based multiple alignment and tree browser that could be extracted from SGN's codebase and transformed into a GMOD component, an add-on for JBrowse. Current Java-based tree viewers (such as Archaeopteryx or PhyloWidget) could be used as the basis for a JavaScript-based tree viewer (or an applet that can be controlled through JavaScript) that integrates with JBrowse.

Population diversity and phenotype support

GMOD's support for managing phenotype and natural diversity data is scattered across partially redundant and outdated modules, does not cover modern ontology-based entity-quality data, and lacks a web interface. The sophisticated phenotype annotation tools that do exist cannot interface with Chado, GMOD's central relational data model. Yet phenotypic and genetic diversity data are central to many evolutionary research questions.

A Natural Diversity Module initiative to address at least the deficiencies within Chado formed earlier this year. Several key developers (one of the original developers of the module, the developer of GDPDM, the basis of its design, and the developer of Phenex, a phenotype curation tool) are already local to NESCent, and so the hackathon provides a unique opportunity to review and refine the natural diversity data model face-to-face, and to integrate it with an updated and reconciled phenotype module. A recently reported prototype of a Chado data adapter for Phenote, GMOD's phenotype annotation tool, could be generalized to become the data persistence interface for such data.

Aside from the data model deficiencies, the ANISEED project has started efforts to generalize its sophisticated atlas/image-based web interface for phenotype data, and to make it operate on top of Chado. The hackathon could harness this synergy to help this effort leap forward, which could ultimately provide GMOD with the currently missing web interface for such data.

Hackathon Logistics

The event will tentatively be held at NESCent in Durham, North Carolina, on November 8-12, 2010.

Participation will be arranged by invitation and by self-nomination followed by review. If you are interested in participating, please contact one of the organizers. We expect to support about 20 participants, about half of whom will be invited and half self-nominated.

The organizing committee will select participants from the applicant pool to create a group with balanced, complementary, and diverse sets of expertise, background, and interests, using a number of criteria:

Experience in bioinformatics programming in general and GMOD in particular;

Experience with and understanding of evolutionary data types;

Potential to uniquely benefit from the event;

Complementarity of expertise and background;

Achieving critical mass for each of the themes; and

Availability during the event.

A hackathon is a working meeting and concentrates on writing code. All code and documentation will be made available immediately and freely to the community under an OSI-approved open source license.

Hackathon Organization

Participants will split into subgroups at the event. The composition and tasks of the subgroups will be guided by the overall objectives, but will otherwise emerge and be self-determined by the participants both prior to and at the event.

After the hackathon, organizers and GMOD staff will follow up with participants to help see unfinished tasks through to completion, as has been done in GMOD following the GMOD Meetings.

Organizers

Nicole is currently the Lead Data Manager for the Data Collection Center of the modENCODE project at the Lawrence Berkeley National Lab. She also has experience curating phenotype data with ontologies and was one of the developers of Phenote, an ontology-based phenotype annotation tool.

Hilmar is the Assistant Director of Informatics at NESCent, where he is responsible for implementing NESCent's goal of enabling data interoperability in evolutionary biology. He is also a veteran of many hackathons at NESCent and elsewhere.