Ensembl Compara Schema Documentation

Introduction

This document describes the tables that make up the Ensembl Compara schema. Tables are listed grouped in different categories, and the purpose of each table is explained. Several examples are also given. They are intended to allow people to familiarise themselves with the schema.

This table contains all taxa used in this database, which mirror the data and tree structure from NCBI Taxonomy database (for more details see ensembl-compara/script/taxonomy/README-taxonomy which explain our import process)

Column

Type

Default value

Description

Index

taxon_id

int(10)

-

The NCBI Taxonomy ID

primary key

parent_id

int(10)

-

The parent taxonomy ID for this node (refers to ncbi_taxa_node.taxon_id)

key: parent_id

rank

char(32)

''

E.g. kingdom, family, genus, etc.

key: rank

genbank_hidden_flag

tinyint(1)

0

Boolean value which defines whether this rank is used or not in the abbreviated lineage

left_index

int(10)

0

Sub-set left index. All sub-nodes have left_index and right_index values larger than this left_index

key: left_index

right_index

int(10)

0

Sub-set right index. All sub-nodes have left_index and right_index values smaller than this right_index

key: right_index

root_id

int(10)

1

The root taxonomy ID for this node (refers to ncbi_taxa_node.taxon_id)

Example:

This examples shows how to get the lineage for Homo sapiens:

SELECT * FROM ncbi_taxa_node WHERE left_index <= 339687 AND right_index >= 339690 ORDER BY left_index;

This table contains descriptive tags for the species_set_ids in the species_set table. It is used to store options on clades and group of species. It has been initially developed for the gene tree view.

This table specifies which kind of link can exist between entities in compara (dna/dna alignment, synteny regions, homologous gene pairs, etc...)NOTE: We use method_link_ids between 1 and 100 for DNA-DNA alignments, between 101 and 200 for genomic syntenies, between 201 and 300 for protein homologies, between 301 and 400 for protein families and between 401 and 500 for protein and ncRNA trees. Each category corresponds to data stored in different tables.

Column

Type

Default value

Description

Index

method_link_id

int(10)

-

Internal unique ID

primary key

type

varchar(50)

''

The common name of the linking method between species

unique key: type

class

varchar(50)

''

Description of type of data associated with the \"type\" field and the main table to find these data

This table contains information about the comparisons stored in the database. A given method_link_species_set_id exist for each comparison made and relates a method_link_id in method_link with a set of species (species_set_id) in the species_set table.

This table defines the genomic sequences used in the comparative genomics analyisis. It is used by the genomic_align_block table to define aligned sequences. It is also used by the dnafrag_region table to define syntenic regions.NOTE: Index has genome_db_id in the first place because unless fetching all dnafrags or fetching by dnafrag_id, genome_db_id appears always in the WHERE clause. Unique key is used to ensure that Bio::EnsEMBL::Compara::DBSQL::DnaFragAdaptor->fetch_by_GenomeDB_and_name will always fetch a single row. This can be used in the EnsEMBL Compara DB because we store top-level dnafrags only.

This table is the key table for the genomic alignments. The software used to align the genomic blocks is refered as an external key to the method_link table. Nevertheless, actual aligned sequences are defined in the genomic_align table.Tree alignments (EPO alignments) are best accessed through the genomic_align_tree table although the alignments are also indexed in this table. This allows the user to also access the tree alignments as normal multiple alignments.NOTE: All queries in the API uses the primary key as rows are always fetched using the genomic_align_block_id. The key 'method_link_species_set_id' is used by MART when fetching all the genomic_align_blocks corresponding to a given method_link_species_set_id

This table is used to index tree alignments, e.g. EPO alignments. These alignments include inferred ancestral sequences. The tree required to index these sequences is stored in this table. This table stores the structure of the tree. Each node links to an entry in the genomic_align_group table, which links to one or several entries in the genomic_align table.NOTE: Left_index and right_index are used to speed up fetching trees from the database. Any given node has its left_index larger than the left_index of its parent node and its right index smaller than the right_index of its parent node. In other words, all descendent nodes of a given node can be obtained by fetching all the node with a left_index (or right_index or both) between the left_index and the right_index of that node.

Column

Type

Default value

Description

Index

node_id

bigint(20)

-

Internal unique ID

primary key

parent_id

bigint(20)

0

Link to the parent node

key: parent_id

root_id

bigint(20)

0

Link to root node

key: root_idkey: left_index

left_index

int(10)

0

Internal index. See above

key: left_index

right_index

int(10)

0

Internal index. See above

left_node_id

bigint(10)

0

Link to the node on the left side of this node

right_node_id

bigint(10)

0

Link to the node on the right side of this node

distance_to_parent

double

1

Phylogenetic distance between this node and its parent

Example 1:

The following query corresponds to the root of a tree, because parent_id = 0 and root_id = node_id

This table contains the coordinates and all the information needed to rebuild genomic alignments. Every entry corresponds to one of the aligned sequences. It also contains an external key to the method_link_species_set which refers to the software and set of species used for getting the corresponding alignment. The aligned sequence is defined by an external reference to the dnafrag table, the starting and ending position within this dnafrag, the strand and a cigar_line. The original aligned sequence is not stored but it can be retrieved using the cigar_line field and the original sequence. The cigar line defines the sequence of matches/mismatches and deletions (or gaps). For example, this cigar line 2MD3M2D2M will mean that the alignment contains 2 matches/mismatches, 1 deletion (number 1 is omitted in order to save some space), 3 matches/mismatches, 2 deletions and 2 matches/mismatches. If the original sequence is:

External reference to method_link_species_set_id in the method_link_species_set table. This information is redundant because it also appears in the genomic_align_block table but it is used to speed up the queries

This table contains conservation scores calculated from the whole-genome multiple alignments stored in the genomic_align_block table. Several scores are stored per row. expected_score and diff_score are binary columns and you need to use the Perl API to access these data.

This table stores cross-references for gene members derived from the core databases. It is used by Bio::EnsEMBL::Compara::DBSQL::XrefMemberAdaptor and provides the data used in highlighting gene trees by GO and InterPro annotation"

Column

Type

Default value

Description

Index

gene_member_id

int(10)

-

External reference to gene_member_id in the gene_member table. Indicates the gene to which the xref applies.

primary key

dbprimary_acc

varchar(10)

-

Accession of xref (e.g. GO term, InterPro accession)

primary key

external_db_id

int(10)

-

External reference to external_db_id in the external_db table. Indicates to which external database the xref belongs.

: This table stores the raw local alignment results of peptide to peptide alignments returned by a BLAST run. The hits are actually stored in species-specific tables rather than in a single table. For example, human has the genome_db_id 90, and all the hits that have a human gene as a query are stored in peptide_align_feature_90

Column

Type

Default value

Description

Index

peptide_align_feature_id

bigint

-

Internal unique ID

primary key

qmember_id

int(10)

-

External reference to seq_member_id in the seq_member table for the query peptide

hmember_id

int(10)

-

External reference to seq_member_id in the seq_member table for the hit peptide

qgenome_db_id

int(10)

-

External reference to genome_db_id in the genome_db table for the query peptide (for query optimization)

hgenome_db_id

int(10)

-

External reference to genome_db_id in the genome_db table for the hit peptide (for query optimization)

This table holds the gene tree data structure, such as root, relation between parent and child, leaves, etc... In our data structure, all the trees of a given clusterset are arbitrarily connected to the same root. This eases to store and query in the same database the data from independant tree building analysis. Hence the "biological roots" of the trees are the children nodes of the main clusterset root. See the examples below.

Column

Type

Default value

Description

Index

node_id

int(10)

-

Internal unique ID

primary key

parent_id

int(10)

NULL

Link to the parent node

key: parent_id

root_id

int(10)

NULL

Link to the root node

key: root_idkey: root_id_left_index

left_index

int(10)

0

Internal index. See above

key: root_id_left_index

right_index

int(10)

0

Internal index. See above

distance_to_parent

double

1.0

Phylogenetic distance between this node and its parent

seq_member_id

int(10)

NULL

External reference to seq_member_id in the seq_member table to allow linkage from trees to peptides/transcripts.

key: seq_member_id

Example:

The following query returns the root nodes of the independant protein trees stored in the database

Header table for gene_trees. The database is able to contain several sets of trees computed on the same genes. We call these analysis "clustersets" and they can be distinguished with the clusterset_id field. Traditionally, the compara databases have contained only one clusterset (clusterset_id=1), but currently (starting on release 66) we have at least 2 (one for protein trees and one for ncRNA trees). See the examples below.

Column

Type

Default value

Description

Index

root_id

INT(10)

-

Internal unique ID

primary key

member_type

ENUM('protein', 'ncrna')

-

The type of members used in the tree

tree_type

ENUM('clusterset', 'supertree', 'tree')

-

The type of the tree

key: tree_type

clusterset_id

VARCHAR(20)

'default'

Name for the set of clusters/trees

method_link_species_set_id

INT(10)

-

External reference to method_link_species_set_id in the method_link_species_set table

This table contains all the genomic homologies. There are two homology_member entries for each homology entry for now, but both the schema and the API can handle more than just pairwise relationships. dN, dS, N, S and lnL are statistical values given by the codeml program of the Phylogenetic Analysis by Maximum Likelihood (PAML) package.

SELECT homology_id, description, GROUP_CONCAT(genome_db.name) AS species FROM homology JOIN method_link_species_set USING (method_link_species_set_id) JOIN species_set USING (species_set_id) JOIN genome_db USING(genome_db_id) WHERE method_link_id=201 AND homology_id<5000000 GROUP BY homology_id LIMIT 4;

This table contains the sequences corresponding to every genomic homology relationship found. There are two homology_member entries for each pairwise homology entry. As written in the homology table section, both schema and API can deal with more than pairwise relationships. The original alignment is not stored but it can be retrieved using the cigar_line field and the original sequences. The cigar line defines the sequence of matches or mismatches and deletions in the alignment.

This table contains one entry per stable_id mapping session (either for Families or for Protein Trees), which contains the type, the date of the mapping, and which releases were linked together. A single mapping_session is the event when mapping between two given releases for a particular class type ('family' or 'tree') is loaded. The whole event is thought to happen momentarily at 'when_mapped' (used for sorting in historical order).

Column

Type

Default value

Description

Index

mapping_session_id

INT

-

Internal unique ID

primary key

type

ENUM('family', 'tree')

NULL

Type of stable_ids that were mapped during this session

unique: key

when_mapped

TIMESTAMP

CURRENT_TIMESTAMP

Normally, we use the date of creation of the mapping file being loaded. This prevents the date from chaging even if we accidentally remove the entry and have to re-load it.

rel_from

INT

NULL

rel.number from which the stable_ids were mapped during this session. rel_from < rel_to

unique: key

rel_to

INT

NULL

rel.number to which the stable_ids were mapped during this session. rel_from < rel_to

This table keeps the history of stable_id changes from one release to another. The primary key 'object' describes a set of members migrating from stable_id_from to stable_id_to. Their volume (related to the 'shared_size' of the new class) is reflected by the fractional 'contribution' field. Since both stable_ids are listed in the primary key, they are not allowed to be NULLs. We shall treat empty strings as NULLs. If stable_id_from is empty, it means these members are newcomers into the new release. If stable_id_to is empty, it means these previously known members are disappearing in the new release. If both neither stable_id_from nor stable_id_to is empty, these members are truly migrating.