Installing the API

Connecting to an EnsEMBL Compara database

Connection parameters

Since release 48, EnsEMBL has run two public MySQL servers on the host ensembldb.ensembl.org, each on a different port. The server on port 3306 hosts all databases prior to release 48, and the server on port 5306 hosts all databases from release 48 onwards.

The API offers two ways to connect to the EnsEMBL Compara database:

In most cases you will prefer the implicit way - using Bio::EnsEMBL::Registry module,
which can read either a global or a specific configuration file or auto-configure itself.

However, there are cases where you might want the extra flexibility provided by the
explicit creation of a Bio::EnsEMBL::Compara::DBSQL::DBAdaptor.

To use the auto-configuration feature, you first need to supply the connection parameters to the
Registry loader. For instance, if you want to connect to the public EnsEMBL databases, you can
use the following command in your scripts:
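A minimal sketch of such a call is given here. The host, port and anonymous user are the public-server values quoted above; adjust them if you use a local mirror.

```perl
use strict;
use warnings;
use Bio::EnsEMBL::Registry;

# Auto-configure the Registry from the public server
# (port 5306 hosts release 48 and newer databases).
Bio::EnsEMBL::Registry->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous',
    -port => 5306,
);
```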

This will initialize the Registry, from which you will be able to create object-specific adaptors later.
Alternatively, you can use a shorter version based on a URL:

use Bio::EnsEMBL::Registry;
Bio::EnsEMBL::Registry->load_registry_from_url('mysql://anonymous@ensembldb.ensembl.org:5306/');

Implicitly, using the Bio::EnsEMBL::Registry configuration file

You will need to have a registry configuration file set up.
By default, the Registry reads the file named by the ENSEMBL_REGISTRY environment variable or,
if that is not found, the file named .ensembl_init in your home directory.
Alternatively, you can point it at a specific file
(see perldoc Bio::EnsEMBL::Registry or later in this document for some examples on how to use a different file).
Please refer to the EnsEMBL registry documentation for details about this option.
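As a rough sketch, a specific configuration file can be handed to the Registry's load_all method. The file path below is purely hypothetical and only stands in for your own file.

```perl
use strict;
use warnings;
use Bio::EnsEMBL::Registry;

# Hypothetical path: replace it with the location of your own registry file.
my $registry_file = '/path/to/my_registry.conf';

# When called without an argument, load_all falls back to the file named by
# ENSEMBL_REGISTRY or to .ensembl_init in your home directory.
Bio::EnsEMBL::Registry->load_all($registry_file);
```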

Explicitly, using the Bio::EnsEMBL::Compara::DBSQL::DBAdaptor

EnsEMBL Compara data, like core data, is stored in a MySQL relational database.
If you want to access a Compara database, you will need to connect to it.
This is done in exactly the same way as connecting to an EnsEMBL core database,
but using a Compara-specific DBAdaptor. The one parameter you have to supply
in addition to those needed by the Registry is -dbname, which by convention contains the release number:
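The explicit connection might look like the following sketch. The release number 48 is only an illustrative assumption; substitute the release you actually need.

```perl
use strict;
use warnings;
use Bio::EnsEMBL::Compara::DBSQL::DBAdaptor;

# Example release only: by convention the database name ends with the
# release number.
my $release = 48;

my $compara_dba = Bio::EnsEMBL::Compara::DBSQL::DBAdaptor->new(
    -host   => 'ensembldb.ensembl.org',
    -port   => 5306,
    -user   => 'anonymous',
    -dbname => "ensembl_compara_$release",
);
```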

Only some of these adaptors are used for illustration in this tutorial, through commented Perl scripts.

You can get the adaptors from the Registry with the get_adaptor command. You need to specify three arguments: the
species name, the type of database and the type of object. Therefore, in order to get the GenomeDBAdaptor for the
Compara database, you will need the following command:
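A short sketch of that call, assuming the Registry is loaded from the public server as described earlier:

```perl
use strict;
use warnings;
use Bio::EnsEMBL::Registry;

Bio::EnsEMBL::Registry->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous',
    -port => 5306,
);

# Arguments: species name, type of database, type of object.
my $genome_db_adaptor =
    Bio::EnsEMBL::Registry->get_adaptor('Multi', 'compara', 'GenomeDB');
```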

NB: As the EnsEMBL Compara DB is a multi-species database, the standard species name is 'Multi'. The type of the
database is 'compara'.

Code Conventions

Refer to the EnsEMBL core tutorial for a good description of the coding conventions normally used in EnsEMBL.

We can divide the fetching methods of the ObjectAdaptors into two categories: fetch_by and fetch_all_by methods. The former return a single object, while the latter return a reference to an array of objects.

Whole Genome Alignments

The Compara database contains a number of different types of whole genome alignments.
A list of these types can be found in the method_link section of the ensembl-compara/docs/schema_doc.html document.

GenomicAlignBlock objects

GenomicAlignBlocks are the preferred way to store and fetch genomic alignments.
A GenomicAlignBlock contains several GenomicAlign objects.
Every GenomicAlign object corresponds to a piece of genomic sequence aligned with the other GenomicAlign objects in the same GenomicAlignBlock.
A GenomicAlign object is always related to other GenomicAlign objects,
and this relation is defined through the GenomicAlignBlock object.
Therefore the usual way to fetch genomic alignments is by fetching GenomicAlignBlock objects.
We have to start by getting the corresponding adaptor:
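A minimal sketch of getting that adaptor, again assuming the Registry is loaded from the public server:

```perl
use strict;
use warnings;
use Bio::EnsEMBL::Registry;

Bio::EnsEMBL::Registry->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous',
    -port => 5306,
);

# The adaptor that fetches GenomicAlignBlock objects from the Compara database.
my $genomic_align_block_adaptor =
    Bio::EnsEMBL::Registry->get_adaptor('Multi', 'compara', 'GenomicAlignBlock');
```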

In order to fetch the right alignments, we need to specify two things:
the type of alignment and the piece of genomic sequence in which we are looking for alignments.
The type of alignment is now a bit more tricky:
you need to specify both the alignment method and the set of genomes.
To simplify this task, you can use the new Bio::EnsEMBL::Compara::MethodLinkSpeciesSet object.
The best way to use them is by fetching them from the database:
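The sketch below fetches one MethodLinkSpeciesSet. The method type 'BLASTZ_NET' and the human/mouse pair are example choices only, and which combinations exist depends on the release you connect to. The method fetch_by_method_link_type_registry_aliases is used as we understand it from the MethodLinkSpeciesSetAdaptor documentation; check perldoc for your API version.

```perl
use strict;
use warnings;
use Bio::EnsEMBL::Registry;

Bio::EnsEMBL::Registry->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous',
    -port => 5306,
);

my $mlss_adaptor =
    Bio::EnsEMBL::Registry->get_adaptor('Multi', 'compara', 'MethodLinkSpeciesSet');

# Example values: an alignment method plus the set of genomes it was run on.
my $human_mouse_mlss = $mlss_adaptor->fetch_by_method_link_type_registry_aliases(
    'BLASTZ_NET', ['human', 'mouse']);

die "No such MethodLinkSpeciesSet in this release\n"
    unless defined $human_mouse_mlss;

print $human_mouse_mlss->name, "\n";
```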

There are two ways to fetch GenomicAlignBlocks.
One uses Bio::EnsEMBL::Slice objects while the second one is based on
Bio::EnsEMBL::Compara::DnaFrag objects for specifying the piece of genomic
sequence in which we are looking for alignments.
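Both routes are sketched below under stated assumptions: the species aliases, the 'BLASTZ_NET' method type, and the chromosome 14 coordinates are illustrative values, not taken from this tutorial. The method names are given as we recall them from the adaptors' perldoc, so verify them against your API version.

```perl
use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $reg = 'Bio::EnsEMBL::Registry';
$reg->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous',
    -port => 5306,
);

my $mlss_adaptor = $reg->get_adaptor('Multi', 'compara', 'MethodLinkSpeciesSet');
my $gab_adaptor  = $reg->get_adaptor('Multi', 'compara', 'GenomicAlignBlock');
my $mlss = $mlss_adaptor->fetch_by_method_link_type_registry_aliases(
    'BLASTZ_NET', ['human', 'mouse']);

# Example region (assumed values).
my ($chr, $start, $end) = ('14', 50_000_000, 50_250_000);

# 1. Using a Bio::EnsEMBL::Slice from the human core database.
my $slice_adaptor = $reg->get_adaptor('human', 'core', 'Slice');
my $slice = $slice_adaptor->fetch_by_region('chromosome', $chr, $start, $end);
my $blocks_by_slice =
    $gab_adaptor->fetch_all_by_MethodLinkSpeciesSet_Slice($mlss, $slice);

# 2. Using a Bio::EnsEMBL::Compara::DnaFrag plus explicit coordinates.
my $genome_db_adaptor = $reg->get_adaptor('Multi', 'compara', 'GenomeDB');
my $dnafrag_adaptor   = $reg->get_adaptor('Multi', 'compara', 'DnaFrag');
my $human_genome_db   = $genome_db_adaptor->fetch_by_registry_name('human');
my $dnafrag =
    $dnafrag_adaptor->fetch_by_GenomeDB_and_name($human_genome_db, $chr);
my $blocks_by_dnafrag =
    $gab_adaptor->fetch_all_by_MethodLinkSpeciesSet_DnaFrag(
        $mlss, $dnafrag, $start, $end);

# Both calls return a reference to an array of GenomicAlignBlock objects.
print scalar(@{$blocks_by_slice}),   " blocks via Slice\n";
print scalar(@{$blocks_by_dnafrag}), " blocks via DnaFrag\n";
```

The Slice route is convenient when you already work with core objects. The DnaFrag route stays entirely within the Compara database.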

Homologies and Protein clusters

All the homologies and families refer to Members. Homology objects store orthologous and paralogous relationships between Members, and Family objects are clusters of Members.

Member objects

A Member represents either a gene or a protein. Most of them
are defined in the corresponding EnsEMBL core database. For
instance, the sequence for the human gene ENSG00000004059
is stored in the human core database.

The fetch_by_source_stable_id method of the MemberAdaptor takes two arguments. The first one is the
source of the Member and can be:

ENSEMBLPEP, derived from an EnsEMBL translation

ENSEMBLGENE, derived from an EnsEMBL gene

Uniprot/SWISSPROT, derived from a Uniprot/Swissprot entry

Uniprot/SPTREMBL, derived from a Uniprot/SP-TrEMBL entry

The second argument is the identifier for the Member. Here is a simple example:
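A minimal sketch of such a call follows. It uses the gene ENSG00000004059 mentioned above and assumes the 'Member' adaptor of the API version this tutorial describes. The accessors printed here are the ones we understand Member objects to provide.

```perl
use strict;
use warnings;
use Bio::EnsEMBL::Registry;

Bio::EnsEMBL::Registry->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous',
    -port => 5306,
);

my $member_adaptor =
    Bio::EnsEMBL::Registry->get_adaptor('Multi', 'compara', 'Member');

# First argument: source of the Member; second argument: its identifier.
my $member = $member_adaptor->fetch_by_source_stable_id(
    'ENSEMBLGENE', 'ENSG00000004059');

die "Member not found\n" unless defined $member;

# Print some information about the Member.
print $member->chr_name, " ( ", $member->chr_start, " - ", $member->chr_end,
    " ): ", $member->description, "\n";
```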

Homology Objects

A Homology object represents either an orthologous or a paralogous relationship between two
or more Members.

Typically, you want to get homologies for a given gene. The HomologyAdaptor has a
fetching method called fetch_all_by_Member(). You will need the Member object for your
query gene, so you fetch the Member first, as in this example:

# First you have to get a Member object. For homologies it must be a gene;
# for families it can be a gene or a protein.
# (This assumes the Registry has already been loaded as described earlier.)
my $member_adaptor = Bio::EnsEMBL::Registry->get_adaptor('Multi', 'compara', 'Member');
my $member = $member_adaptor->fetch_by_source_stable_id('ENSEMBLGENE', 'ENSG00000004059');

# Then you get the homologies in which the member is involved
my $homology_adaptor = Bio::EnsEMBL::Registry->get_adaptor('Multi', 'compara', 'Homology');
my $homologies = $homology_adaptor->fetch_all_by_Member($member);

# That returns a reference to an array with all homologies (orthologues in
# other species and paralogues in the same one).
# Then, for each homology, you can get all the Members involved
foreach my $homology (@{$homologies}) {
    # You will find different kinds of description:
    # UBRH, MBRH, RHS, YoungParalogues
    # see ensembl-compara/docs/schema_doc.html for more details
    print $homology->description, " ", $homology->subtype, "\n";

    # And the dN and dS related values (these may be undefined for some homologies)
    print " dn ", $homology->dn, "\n";
    print " ds ", $homology->ds, "\n";
    print " dnds_ratio ", $homology->dnds_ratio, "\n";
}

Each homology relation has two or more members, among which you should find the initial member used as a query.
The get_all_Member_Attribute method returns a reference to an array of pairs of Member and Attribute objects. The Member
corresponds to the gene or protein, and the Attribute object contains information about how this
Member has been aligned.
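The following sketch walks through those pairs. It assumes the method is spelled get_all_Member_Attribute and that the Attribute object exposes perc_id, perc_pos and perc_cov, as we recall from the API version this tutorial describes. Treat these names as assumptions and check perldoc for your release.

```perl
use strict;
use warnings;
use Bio::EnsEMBL::Registry;

Bio::EnsEMBL::Registry->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous',
    -port => 5306,
);

my $member_adaptor =
    Bio::EnsEMBL::Registry->get_adaptor('Multi', 'compara', 'Member');
my $homology_adaptor =
    Bio::EnsEMBL::Registry->get_adaptor('Multi', 'compara', 'Homology');

my $member = $member_adaptor->fetch_by_source_stable_id(
    'ENSEMBLGENE', 'ENSG00000004059');
my $homologies = $homology_adaptor->fetch_all_by_Member($member);

# Show a value, or 'NA' when it is undefined.
sub show { my ($value) = @_; return defined $value ? $value : 'NA'; }

foreach my $homology (@{$homologies}) {
    print $homology->description, "\n";

    # Each element is a pair: [ Member, Attribute ].
    foreach my $member_attribute (@{ $homology->get_all_Member_Attribute }) {
        my ($homology_member, $attribute) = @{$member_attribute};
        print "  ", $homology_member->stable_id,
            " taxon_id=", show($homology_member->taxon_id),
            " perc_id=",  show($attribute->perc_id),
            " perc_pos=", show($attribute->perc_pos),
            " perc_cov=", show($attribute->perc_cov), "\n";
    }
}
```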