Multiple Genome Versions, Varieties / Subspecies / Strains

We were not able to come to an agreement, but everyone still felt there might be a core data model that can allow for single and multiple genomes that will be useful for all InterMines.

The fundamental question is do we want organism to be per genome version, or allow organism to have several genome versions. In the latter case, we’d also need a helper class, e.g. “Strain”, that would contain information about the genome.

This topic is sufficiently complex that we’ve agreed to try a more formal process here, listing our different options, their potential impact etc. More information on this process soon!

We decided against making a SyntenyBlock a bio-entity, even though it would benefit from inheriting some references.

We also decided against the SyntenicRegion1 / SyntenicRegion1 format and instead they will be in a collection of regions.

GO Evidence Codes

Currently the GO evidence codes are only a controlled vocabulary and are limited to the code abreviation, e.g IEA. However UniProt and other data sources have started to use ECO ontology terms to represent the GO evidence codes instead.

We decided against changing the GO Evidence Code to be an ECO ontology term.

The ECO ontology is not comprehensive

Some mines have a specific data model for evidence terms

Instead we are going to add attributes to the GO Evidence Code:

Add a link to more information on the GO evidence codes

Add the full name of the evidence code.

Change GOEvidenceCode to be OntologyAnnotationEvidenceCode

We decided against loading a full description of the evidence code. The description on the GO site is a full page. We tried shortening but then it didn’t really add much information. Also there is no text file with the description available.

Ontology Annotations – Subject

Currently you can only reference BioEntities, e.g. Proteins and Genes, in an annotation. This is unsuitable as any object in InterMine can be annotated, e.g. Protein Domains. To solve this problem, we will add a new data type, Annotatable.

This will add complexity to the data model but this would be hidden from casual users with templates.

Also publications will be moved from BioEntity to Annotatable. This change will allow us to have both publication and annotations on things that are not BioEntities. primaryIdentifier will also be moved from BioEntity to `​Annotatable`.

Protein molecular weight

Protein.molecularWeight is going to be changed from an integer to a float.

Organism taxon ID

Organism.taxonId is going to be changed from an integer to a string.

Sequence Ontology update

The sequence ontology underpins the core InterMine data model. We will update to the latest version available.

If you would like to be involved in these discussions, please do join our community calls or add your comments to the GitHub tickets. We want to hear from you!

It was suggested that we take a vote to choose the name. Please note that you can overwrite attribute names locally. But it would be better if we could all (mostly) agree!

User Interface

Both the above changes will require updates to the core InterMine code where it is assumed that Organism.taxonID is the unique field. This assumption will be replaced so that the new fields in Organism, where present, are used for the primary key.

For user friendliness, it will be necessary to assign unique organism names. Users will then be able to easily identify distinct versions in template queries and widgets.

We decided against making a SyntenyBlock a bio-entity, even though it would benefit from inheriting some references.

We also decided against the SyntenicRegion1 / SyntenicRegion1 format and instead they will be in a collection of regions.

GO Evidence Codes

Currently the GO evidence codes are only a controlled vocabulary and are limited to the code abreviation, e.g IEA. However UniProt and other data sources have started to use ECO ontology terms to represent the GO evidence codes instead.

We decided against changing the GO Evidence Code to be an ECO ontology term.

Ontology Annotations – Subject

Currently you can only reference BioEntities, e.g. Proteins and Genes, in an annotation. This is unsuitable as any object in InterMine can be annotated, e.g. Protein Domains. To solve this problem, we will add a new data type, Annotatable.

Both the above changes will require updates to the core InterMine code where it is assumed that Organism.taxonID is the unique field. This assumption will be replaced so that the new fields in Organism, where present, are used for the primary key.

For user friendliness, it will be necessary to assign unique organism names. Users will then be able to easily identify distinct versions in template queries and widgets.

GO Evidence Codes

Currently the GO evidence codes are only a controlled vocabulary and are limited to the code abreviation, e.g IEA. However UniProt and other data sources have started to use ECO ontology terms to represent the GO evidence codes instead.