Big Picture

GMOD's Role

Don Gilbert pointed out that cheap short-read sequencers are now available. Lots of people have inexpensive sequences, but there is still no way to do cheap annotation.

Current GMOD clients are species or family centered. We want to make it easy to integrate multiple species. ApiDB is at the point where it can open new species databases and web sites with relatively little effort.

Comparative genomics came up over and over again, both across species and within species.

As data grows and is consolidated, issues of who owns the data and who's responsible for the annotation become more problematic.

How does GMOD want to deal with integration issues?

How close to the sequencer does GMOD want to get? We don't want to pull the data off the sequencer.

Should we position GMOD as something that can feed data into places like Ensembl? Ensembl does not have the curation expertise of the MODs. Even if NCBI is wonderful at consolidation, they won't have quality curation. GMOD sits right there, supporting curation. So, we doubt that Ensembl or NCBI will swallow us whole.

Releases and Bundles

We need to figure out what components we want and what we are pushing. If we focus on a core set of packages then life gets easier for the project.

There was discussion of better release management for components, and of the VMware Community Annotation Server package. Are GMOD bundles the way of the future? The belief was that binary packages are generally not going to work for GMOD unless someone is willing to put a lot of time into maintaining them.

Comparative Genomics

Comparative genomics came up over and over again, both across species and within species. The GBrowse_syn talk in particular spawned a discussion on this.

First, can Chado represent relationships that have more than two members? Yes. The featureloc table has a rank column. Do we want collections in Chado?
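A minimal sketch of the rank idea, assuming a heavily simplified featureloc table in an in-memory SQLite database (real Chado runs on Postgres and has more columns). The feature names and coordinates are made up for illustration. One match feature is located on three source sequences, one row per member, distinguished by rank.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    uniquename TEXT NOT NULL
);
-- Simplified featureloc: the real table also has strand, phase, and more.
CREATE TABLE featureloc (
    featureloc_id INTEGER PRIMARY KEY,
    feature_id    INTEGER NOT NULL REFERENCES feature(feature_id),
    srcfeature_id INTEGER NOT NULL REFERENCES feature(feature_id),
    fmin          INTEGER,
    fmax          INTEGER,
    locgroup      INTEGER NOT NULL DEFAULT 0,
    rank          INTEGER NOT NULL DEFAULT 0,
    UNIQUE (feature_id, locgroup, rank)
);
""")

# Three hypothetical chromosomes and one syntenic block that spans all three.
conn.executemany("INSERT INTO feature VALUES (?, ?)", [
    (1, "speciesA_chr1"),
    (2, "speciesB_chr4"),
    (3, "speciesC_chr2"),
    (10, "syntenic_block_1"),
])

# One featureloc row per member of the relationship; rank tells them apart.
conn.executemany(
    "INSERT INTO featureloc (feature_id, srcfeature_id, fmin, fmax, rank) "
    "VALUES (?, ?, ?, ?, ?)",
    [(10, 1, 1000, 5000, 0), (10, 2, 200, 4100, 1), (10, 3, 7000, 11200, 2)],
)


def members(conn, feature_id):
    """Return (source name, fmin, fmax) for every location of a feature, by rank."""
    return conn.execute(
        "SELECT src.uniquename, fl.fmin, fl.fmax "
        "FROM featureloc fl JOIN feature src ON src.feature_id = fl.srcfeature_id "
        "WHERE fl.feature_id = ? ORDER BY fl.rank",
        (feature_id,),
    ).fetchall()


block_members = members(conn, 10)
```

Nothing here limits the number of members to two, which is the point raised in the discussion. Whether this is enough for comparative genomics, or whether explicit collections are needed, is the open question.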

Jason suggested a working group on how to do this. Dave from UMD volunteered to manage a wiki page on this, with the end goal of establishing a document that defines how to store comparative genomes.

Talks on synteny are spread throughout this document.

GMOD Components / Functions

Apollo

New Development

Work has resumed on developing Apollo. Ed Lee, formerly of TIGR/JCVI, started working for Suzi Lewis at Berkeley this fall and is working on it. Work is being done on:

Speeding up Apollo when it uses Chado as a backend (or, just speeding up Chado).

Communicating with more than one Chado instance.

Undo/Redo support.

ID Generation and JDBC Drivers

Apollo can talk directly to a database or it can use XML files instead. FlyBase, VectorBase, BeeBase, and BovineBase are all believed to take the XML approach.

Apollo currently has two choices for database adaptors:

One that uses Postgres database triggers to set IDs.

One that does not.

The trigger version is used in the Community Annotation Server and on the Dolan-Rice project. We could not think of anywhere else it was used. The triggerless version is used everywhere else that we knew of.

The trigger version is Postgres-specific. The triggerless version stores multiple copies of shared exons.

Notes from Tuesday: Decided to actively discourage use of the trigger version. Best thing may be to go through trigger code and externalize the logic.

Notes from Wednesday: Apollo - Chado - No short term decision. Long term probably move to Crabtree.

As you may have noticed, those notes disagree.

BioPerl, GFF

There was a discussion of BioPerl and how it relates to GMOD.

Jason Stajich created a slimmed-down feature Perl package based on arrays instead of hashes: Bio::SeqFeature::Slim. This is 70% faster for reading a GFF file. Bio::FeatureIO only supports GFF3. It is slow, uses heavy objects, and is strongly typed. Jason wants to spend more time on middleware speed. He also wants a converter into a common object model, and code to get data back out to any supported format.

Six to eight people are currently contributing to BioPerl.

GFF3 has an ID field. ID is not clear in earlier versions. GFF2 supports arbitrary feature types. GFF3 requires SO types (but you can always ignore that). Keep detailed alignment data in a separate database, not in GFF3, and indicate in GFF3 that the data is stored elsewhere. CIGAR strings could also be stored in GFF3, and the spec supports that.
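A minimal sketch of the GFF3 points above, assuming a hand-written example line. The ID, Target, Gap, and Dbxref attribute names come from the GFF3 spec; the sequence names, the "AlignDB" database name, and both function names are hypothetical. The Gap attribute holds the CIGAR-like string, and Dbxref is one way to indicate that detailed data lives elsewhere.

```python
from urllib.parse import unquote


def parse_gff3_line(line):
    """Split one GFF3 feature line into its columns and an attribute map."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated columns")
    seqid, source, ftype, start, end, score, strand, phase, attr_text = cols
    attributes = {}
    if attr_text != ".":
        for pair in attr_text.split(";"):
            if not pair:
                continue  # tolerate a trailing semicolon
            key, _, value = pair.partition("=")
            # Multiple values are comma separated; reserved characters are URL-escaped.
            attributes[unquote(key)] = [unquote(v) for v in value.split(",")]
    return {
        "seqid": seqid,
        "type": ftype,
        "start": int(start),
        "end": int(end),
        "attributes": attributes,
    }


def reference_span(gap):
    """Reference bases covered by a Gap string (M and D advance the reference).

    Frameshift operations (F, R) are ignored in this sketch.
    """
    ops = [(op[0], int(op[1:])) for op in gap.split()]
    return sum(length for code, length in ops if code in "MD")


line = (
    "chr1\texample\tEST_match\t1000\t1022\t.\t+\t.\t"
    "ID=match001;Target=EST23 1 21;Gap=M8 D3 M6 I1 M6;Dbxref=AlignDB:12345"
)
feature = parse_gff3_line(line)
gap = feature["attributes"]["Gap"][0]
```

In this example the Gap string covers 23 reference bases, which matches the 1000 to 1022 span of the feature line.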

Chado

There was a request to make Chado more database neutral, rather than Postgres-specific.

The slowness of Chado databases came up in several contexts. David from UMD Medical Center started a Postgres performance page on the wiki.

Scott described a potential way to implement materialized views in Chado that gets us most of the benefits of DBMS-supported materialized views. For each view, store:

the SQL to create it in a table,

a run time schedule for when the table should be rebuilt,

an enabled/disabled flag that is disabled by default.

The question was raised whether genome metadata fits into the current Chado schema. The belief was that it does not.

Jason Stajich wants a better idea of who is responsible for what in terms of Chado modules. Dave C will take this on.

Chado Documentation

The table-level and column-level documentation for Chado is in a good state. Better basic, big-picture documentation was requested. Josh Goodman is thinking of providing a mapping from Chado DB columns to FlyBase report columns. Mike Caudy pointed out that we should have multiple examples of implementation, not just FlyBase.

Chado Validator

We discussed whether a Chado database validator would be worthwhile. A validator would check a Chado database to see if it conforms to the canonical model for a Chado database. There was no consensus on the value or practicality of this. There was consensus that no one was willing to volunteer to write it.

Ben suggested that if and when we do this, we use the GFF3 to Chado validator as a starting point.

DBMS Choice

There was a request to make Chado more database neutral, rather than Postgres-specific. Someone also asked if there was an SQLite adapter for GBrowse.

Postgres Performance

Slow performance of Chado Postgres implementations came up repeatedly.

Some bits:

Specify the locale. ASCII-US runs fast. UTF-8 is slow, and it is the default. The locale is specified for the whole server, at server start.

A lot of time has been spent on making the queries go fast.

RTree indexes are in the core.

Allen's FRange functions are in the DB, but aren't used by default queries.

Community Annotation

Community Annotation at ParameciumDB

Linda Sperling discussed ParameciumDB. Paramecium is a small community with few resources and no dedicated curators.

Paramecium curators are a small set of people who must do their annotation from fixed IP addresses. Curator annotations are kept in addition to existing Genoscope predictions. These annotations are not validated when they are submitted. Annotators cannot change annotations made by other people. There are two databases: one backing the web site, and one where annotation goes. Once a month the new annotation is pushed to the web site. Validation happens prior to release.

They are also using ParameciumDB to teach annotation at two colleges, and some annotation comes from that. The bulk of annotations come from two curators, with the other curators each making a small number of annotations.

Community Annotation at JGI

Don Gilbert briefly described community annotation at JGI. They have a web interface for simple annotations and use Apollo for complex annotations. Anyone can promote any gene model, but they can't delete other models. They use the Wikipedia model: whoever annotates last is correct.

Community Annotation at SGN

SGN has data for tomato, potato, eggplant, and many other species. SGN is locus centric. Each locus has (or can have) a single person who is the editor/owner of that locus. The locus editor can change anything about that locus that they want. The name of the locus editor is displayed on the locus page. Every locus has a "request editor privileges" link, whether or not that locus has been assigned.

All edits are logged, and nothing is ever truly deleted. 'Deleted' items are retained but flagged as obsolete and are no longer shown.
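The log-everything, delete-nothing approach can be sketched as below. This is not SGN's schema: the locus and edit_log tables, their columns, the function names, and the example locus and editor names are all hypothetical, and SQLite stands in for whatever database SGN uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE locus (
    locus_id INTEGER PRIMARY KEY,
    name     TEXT NOT NULL,
    obsolete INTEGER NOT NULL DEFAULT 0     -- 1 means 'deleted' but retained
);
CREATE TABLE edit_log (
    edit_id   INTEGER PRIMARY KEY,
    locus_id  INTEGER NOT NULL REFERENCES locus(locus_id),
    editor    TEXT NOT NULL,
    action    TEXT NOT NULL,
    logged_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
""")


def log_edit(conn, locus_id, editor, action):
    """Record who did what to which locus."""
    conn.execute(
        "INSERT INTO edit_log (locus_id, editor, action) VALUES (?, ?, ?)",
        (locus_id, editor, action),
    )


def add_locus(conn, name, editor):
    """Insert a locus and log the creation; return the new locus_id."""
    cur = conn.execute("INSERT INTO locus (name) VALUES (?)", (name,))
    log_edit(conn, cur.lastrowid, editor, "create")
    return cur.lastrowid


def delete_locus(conn, locus_id, editor):
    """'Delete' a locus by flagging it obsolete; the row itself is kept."""
    conn.execute("UPDATE locus SET obsolete = 1 WHERE locus_id = ?", (locus_id,))
    log_edit(conn, locus_id, editor, "obsolete")


def visible_loci(conn):
    """Names of loci shown to users: everything not flagged obsolete."""
    rows = conn.execute("SELECT name FROM locus WHERE obsolete = 0 ORDER BY name")
    return [row[0] for row in rows]


first_id = add_locus(conn, "example_locus_1", "alice")
second_id = add_locus(conn, "example_locus_2", "bob")
delete_locus(conn, second_id, "bob")

shown = visible_loci(conn)
stored = conn.execute("SELECT COUNT(*) FROM locus").fetchone()[0]
logged = conn.execute("SELECT COUNT(*) FROM edit_log").fetchone()[0]
```

The obsolete locus disappears from what users see, but both rows remain in the table and all three edits remain in the log.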

SGN supports tagging of loci. Tags are free text that are rationalized after they are created. The tagging metaphor for curation also came up in several contexts during the Genome Informatics meeting.

Community Annotation Server (CAS)

Scott Cain spoke about this. It is almost ready to go. The Community Annotation Server (CAS) is meant to be "GMOD in a box". Currently it consists of:

A VMware image, containing

Ubuntu Linux, version 6.06 LTS.

Ubuntu LTS was picked over CentOS because LTS stands for long term support, so it will be supported for a while.

Postgres

A Chado database with DictyBase data in it.

An empty Chado database

Modware

Apollo - Uses the JDBC adaptor with triggers. This is a Java WebStart version.

GBrowse

MediaWiki - includes Cite, ProcessCite and TableEdit extensions.

Cite extensions make it easy to provide literature annotations. Provide a PubMed ID and it finds and grabs the extract from PubMed.

Note that it does not include Turnkey and/or GMODWeb. Lincoln would like to add GMODWeb, Textpresso and BioMart to that list.

This can run on any Intel machine, including Apple. Virtualization causes very little performance hit.

An online trial version of the Community Annotation Server was requested and was already on the way.

Version 3.0 is a fork of the code and version 2 and 3 are expected to co-exist 'forever'. Some shops won't have the horsepower to power version 3, and Lincoln wants to keep it as an easy to install tool.

Performance

Chado is usually too slow to run GBrowse on top of. Consider using Bio::DB::GFF instead. (GBrowse can't run on top of BioMart. No adapter exists because of BioMart's flexible schema.)

Jason S argues that GBrowse slows down when it does BioPerl object creation. These are relatively heavyweight objects. He has just written a Slim version that is up to 70% faster.

Browser speed was also the number one issue (with all browsers) at the Genome Browsers Birds-of-a-Feather meeting at Genome Informatics.

Genome Grid

Genome Grid is middleware to enable easy use of TeraGrid for genome analysis tasks. Don is looking for genomes that need compute-intensive analysis. He is also interested in applying BioMart and Ergatis to these problems.

Help Desk

Dave Clements introduced himself and the goals of the GMOD Help Desk position.

Dave will make the help desk more visible on the web site, and add a GMOD News column to the home page.

SynView

Steve Fischer of ApiDB (see below) spoke about SynView. SynView is a synteny browser based on GBrowse. It is described in a Bioinformatics paper.

His talk raised a number of issues that have come up with recent extensions to SynView.

TableEdit

This is a MediaWiki extension by Jim Hu. It does two things. First, it makes it easier to update tables in MediaWiki, by presenting a nicer interface for altering wiki tables. Secondly, it supports synchronizing MediaWiki tables from database tables and vice versa.

Turnkey, GMODweb, DrupalFly

These are all web interface layers that sit on top of Chado databases.

GMODWeb is currently not working. We think this is because SQL::Translator has not been upgraded to deal with recent versions of Postgres. Ben Faga agreed to actively work on this.

Michael Caudy argued that even if GMODWeb did work right now, it is not extensible enough to support complex queries and presentation. Mike presented Drupal, Drupal Views, and PHPTemplate as an alternative web framework for providing a web interface to Chado databases. Mike demonstrated a prototype called DrupalFly that presents FlyBase data in an alternative organization.

Lincoln has an opening in Toronto for a full-time programmer. Lincoln will talk with Brian about GMODWeb's future. We will put something on the web site asking for volunteers to take on GMODWeb.

GMOD Participating Organizations

A number of organizations talked about their recent work.

ApiDB

Steve Fischer talked about ApiDB. ApiDB uses GUS as its schema. They do multispecies comparative analysis. They have a database adapter linking GBrowse to GUS. It is based on the Chado adapter. They use materialized views in Oracle 10g and it is still relatively slow.

Synteny at ApiDB

Syntenic maps at ApiDB are produced with Mercator. The maps are based on gene orthology. Gene orthologs are generated using OrthoMCL. All alignments are pairwise, rather than multiple. Orthology is represented outside the standard GUS schema. In the synteny schema, everything is defined relative to the reference sequence. A table is also needed to define anchors.

Steve Fischer showed an 11-track page, which has about 5000 popups in it.

ApiDB has a release cycle. They discard and recalculate synteny with every new release.

GeneDB, Sanger

Chinmay Patel spoke about a week-long annotation project at Sanger involving 40 people all annotating the same genome.

They used the Artemis annotation editor (instead of Apollo). Artemis was talking to a Chado database using an Artemis-Chado adapter that is based on iBATIS (instead of Hibernate). The adapter is not yet released.

Imperial College London

Using GMOD to support a fungal sequencing project. Using:

Chado

GBrowse

Apollo

JCVI (nee TIGR)

Using Chado as database schema.

MaizeGDB

Taner Sen from MaizeGDB was at the meeting. Maize has multiple groups generating different gene models. It would be nice to display each group's gene models in a separate track. MaizeGDB is evaluating genome browsers and is considering using GBrowse.

WormBase / CSHL

GBrowse_Syn

Sheldon McKay talked about GBrowse_syn, a prototype extension to GBrowse for viewing synteny. The goal is to have a sequence alignment viewer that can look at more than two species at a time. GBrowse_syn is based purely on sequence alignments. It does not know about genes or orthologs per se.

PECAN was used for the alignments. Maps are precomputed in a very CPU-intensive step.