The analysis and usage of biological data is hindered by the spread of information across multiple
repositories and the difficulties posed by different nomenclature systems and storage formats. In particular,
the study and use of protein-protein interactions is one area where there is an important need for data
integration. Without good integration strategies, it is difficult to assess how much interaction data is available
and its properties.

Results:

We present a data integration approach for protein-protein interactions. This integrative approach
has been implemented into PIANA, a protein-protein interaction software framework under the GNU Public
License (http://sbi.imim.es/piana). We find that the integrated network of interactions shows properties very
similar to those observed in previously reported protein interaction networks. We also find that interaction
prediction methods find interactions for many proteins for which experimental methods have not produced any
information.

Conclusions:

PIANA's approach to protein interaction data integration solves many of the nomenclature
issues common to systems dealing with biological data. The concept presented here can be extended to other
types of biological data. The integration of all available protein interaction data is fundamental to obtaining a
comprehensive picture of the interactions taking place in the cell.

However, interaction data is spread across multiple repositories
and codified using various nomenclature systems
(Mathivanan et al., 2006). In consequence, experimental
biologists face difficulties when trying to find all known interactions
for their proteins of interest, and the computational
analysis and usage of protein interaction data is usually
constrained to using a partial subset of all available
knowledge. For example, any comprehensive search of interactions for a particular protein must include at least seven
databases of protein-protein interactions: the Database of
Interacting Proteins (DIP)
(Salwinski et al., 2004), the MIPS
database of interactions (Pagel et al., 2005), the Molecular
INTeraction database (MINT) (Chatr-aryamontri et al.,
2007), IntAct (Kerrien et al., 2007), the Biomolecular Interactions
Database (BIND)
(Alfarano et al., 2005), the BioGrid
(Stark et al., 2006) and Human Protein Reference Database
(HPRD) (Peri et al., 2003).

Besides, each database uses different strategies for identifying
proteins, and translations between synonym identifiers
(i.e. identifiers linked to the same protein sequence) are
required before any manual search or automatic processing.
Moreover, there are methods for predicting protein interactions
that can be used when no experimental interactions
have been detected for a protein, but results from these
methods are usually spread across multiple websites, each
one in its own format.

There are efforts to standardize and harmonize protein
interaction data. HUPO-PSI (Hermjakob, 2006) has developed
a schema that enables the description of interactions
between a wide range of molecular types, thus facilitating
the access and data exchange between different research
groups. The IMEx consortium (Orchard et al., 2007) is a
group of major public interaction data providers sharing
curation effort and exchanging completed records on molecular
data following the HUPO standard exchange format.
In consequence, the rate of data curation and data
sharing between different repositories has been improved,
but integration is still not completed. For example, HUPO
PSI-MI 2.5 format allows the identification of interactors
by unique identifiers from different databases, but the guidelines
implemented do not include a strategy for naming proteins,
which leaves unresolved many of the integration issues.

The issue of protein nomenclature has been addressed
by internationally recognized scientific organizations like
HGNC
(Wain et al., 2002) and SGD (Christie et al., 2004),
but they do not cover all species and do not map all database
identifiers. IPI (Kersey et al., 2004) offers a non-complete
redundant data set with cross-references with external
identifiers.

While these tools have been shown to be useful for creating
and analyzing protein-protein interaction networks,
there is still the need for an integration engine that truly
unifies all available data into a single network and allows
automatic analyses on a global scale. Most current integration
tools are designed to work with interactions coming
from one single type of data format, and others have problems
when dealing with interactions codified using different
types of protein identifiers.

They concluded that the overlap between repositories is
small but significant, and showed that the different interaction
maps suffer from sampling and detection biases. The
integration strategy of both works consisted in mapping all
binary interactions to pairs of Entrez Gene identifiers.
Marcotte and coworkers (Hart et al., 2006) analyzed yeast
and human interaction data sets, and estimated that their
protein interaction networks should contain 37,800-75,500
and 154,000-369,000 interactions respectively.

Interaction Networks Based on ProteinIDs and
other External Identifiers
Interaction networks are built using proteinIDs as nodes
(see sections ‘PIANA and protein identifiers’ and ‘Protein-protein
interactions integration’). When translating the nodes of the network to external protein identifiers (a process referred
to as ‘unifying the network’), there are two possibilities:
1) one proteinID corresponds to a single external identifier
and 2) different proteinIDs correspond to the same
identifier, and thus, nodes and interactions are merged.
Therefore, the same PIANA proteinID network will correspond
to different unified networks, depending on the external
identifier. Statistics in this article have been obtained
after unifying the networks by NCBI geneID. Although
geneIDs only cover 42% of proteinIDs, the cardinality
proteinID:externalIdentifier is the highest (Table 1), and therefore
geneID is the best suited identifier type for obtaining
an unbiased view of the integrated protein interaction network.
Protein sequences of unknown geneID were unified
using UniProt accessions.
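
The unification step described above can be illustrated with a minimal sketch. It assumes each proteinID maps to at most one external identifier and skips unmapped proteinIDs, which simplifies the procedure used here (sequences without a geneID were unified by UniProt accession). The function name `unify_network` and the identifiers `geneX` and `geneY` are hypothetical and are not part of PIANA.

```python
def unify_network(interactions, protein_to_external):
    """Collapse a proteinID network onto one external identifier type.

    interactions: iterable of (proteinID, proteinID) pairs.
    protein_to_external: dict proteinID -> external identifier (assumed
    layout; proteinIDs without a mapping are simply skipped here).
    """
    unified = set()
    for a, b in interactions:
        ext_a = protein_to_external.get(a)
        ext_b = protein_to_external.get(b)
        if ext_a is None or ext_b is None:
            continue
        # A frozenset is unordered, so A-B and B-A collapse to one edge.
        unified.add(frozenset((ext_a, ext_b)))
    return unified


# proteinIDs 1 and 2 share one identifier, so their edges to 3 are merged.
mapping = {1: "geneX", 2: "geneX", 3: "geneY"}
edges = [(1, 3), (2, 3), (3, 1)]
unified_edges = unify_network(edges, mapping)
print(len(unified_edges))  # 1
```

Three proteinID-level edges collapse into a single edge between the two external identifiers, which is why the same proteinID network yields different unified networks for different identifier types.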

Table 1: Protein identifiers statistics.

Summary of the most relevant protein identifier types, calculated from a total of 6,476,028 distinct
sequenceIDs in the database. Columns are: identifier type, number of distinct identifiers, the proportion
of proteinIDs with respect to external identifier correspondences, the proportion of external
identifiers with respect to proteinIDs, and the percentage of proteinIDs covered by the external
identifier. Primary gene symbols are those gene symbols that have been established as the official
gene name by nomenclature authorities such as HUGO (Wain et al., 2002) or FlyBase (Crosby et
al., 2007).

Methods for the Prediction of Protein Interactions

We used predictions of protein-protein interactions obtained
by four different methods: (i) Gene fusion, in which two
proteins are predicted to interact if their corresponding genes
appear fused in another genome (Enright et al., 1999); (ii)
Phylogenetic profiles, in which similarity of phylogenetic
profiles is interpreted as indicating that two proteins
need to be simultaneously present to perform a given function
together (Pellegrini et al., 1999); (iii) Distant conservation
of sequence patterns and structure relationships, in
which structural similarities among domains of known interacting
proteins and conservation of pairs of sequence
patches involved in protein–protein interfaces are used to
predict putative protein interaction pairs
(Espadaler et al.,
2005b); and (iv) Structural interologs, in which interactions
are transferred between proteins with the same structural
domains
(Aragues et al., 2006). Interactions for the two
first methods were retrieved from STRING
(von Mering et
al., 2007) by querying the database for interactions with a
score higher than 0.7 for that particular methodology. Interactions
for (iii) were obtained from the work of Espadaler
et al. (2005b). Interactions for (iv) were
predicted by transferring experimental interactions in PIANA
between proteins with a domain within the same SCOP family.
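
As an illustration of method (iv), the sketch below shows one plausible reading of the transfer step: an experimental interaction between a and b is transferred to every pair (x, y) in which x shares a SCOP family with a and y shares a SCOP family with b. The function name, the data layout and the family labels are assumptions made for this example; the actual PIANA procedure may apply further criteria.

```python
from itertools import product


def transfer_by_family(experimental, families):
    """Predict interactions by family-based transfer (illustrative only).

    experimental: set of (protein, protein) pairs.
    families: dict protein -> set of family labels (hypothetical layout).
    """
    # Index proteins by family so relatives can be looked up quickly.
    by_family = {}
    for protein, fams in families.items():
        for fam in fams:
            by_family.setdefault(fam, set()).add(protein)

    def relatives(protein):
        related = set()
        for fam in families.get(protein, ()):
            related |= by_family[fam]
        return related

    known = {frozenset(pair) for pair in experimental}
    predicted = set()
    for a, b in experimental:
        for x, y in product(relatives(a), relatives(b)):
            candidate = frozenset((x, y))
            # Keep only new, non-self pairs.
            if x != y and candidate not in known:
                predicted.add(candidate)
    return predicted


families = {"A": {"fam1"}, "B": {"fam2"}, "C": {"fam1"}, "D": {"fam2"}}
predicted = transfer_by_family({("A", "B")}, families)
```

With the toy input, the experimental pair A-B yields the predicted pairs A-D, C-B and C-D.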

Results

Overview
PIANA (Protein Interactions And Network Analysis)
(Aragues et al., 2006) is a software framework capable of
(i) integrating multiple sources of information into a single
relational database (see database design on additional file 1); (ii) creating and analyzing protein interaction networks;
and (iii) mapping multiple types of biological data onto protein
interaction networks. PIANA code and documentation
are freely available under an open source license for local
installation and modification (http://sbi.imim.es/piana). The
data warehousing approach and software architecture of
PIANA are shown in Figure 1 (see additional file X for
details). The PIANA database is accessed by the Graph
library through a database interface, which is also used by
the PIANA library to create, manage and analyze protein-protein
interaction networks. The whole process can be
controlled from a user interface module.

Figure 1: PIANA architecture

A set of parsers inserts information from external repositories into the PIANA database.
This database is accessed by the Graph library through a database interface, which is
also used by the PIANA library to create, manage and analyze protein-protein interaction
networks. The whole process can be controlled from a user interface module.

Mapping Protein Identifiers

PIANA handles an extensive set of protein identifier
types: UniProt entries and accessions; gene symbols; NCBI
gi, geneID, Unigene and accession numbers; ENSEMBL;
RefSeq; PDB; and FastA formatted sequences. PIANA
internally identifies proteins with proteinIDs (integers). Each
proteinID is linked to a pair [aminoacid sequence, taxonomy
id], so there is a unique identifier for each protein sequence
for a given organism. This allows PIANA to use the <sequence,
species> pair of the protein as an interlingua between
the external identifiers provided by the main repositories of
genes and proteins. Therefore, one external protein identifier
(e.g. UniProt entry THRB_HUMAN) can be associated
to one or more proteinIDs (e.g. 11483), which are in
turn linked to other external identifiers that are also used to
represent that protein (e.g., gene symbol ‘f2’ and Unigene ‘Hs.410092’). Consequently, throughout the input and output
processing performed by PIANA, external identifiers
are ‘translated’ to proteinIDs, the desired operations are
performed, and finally, if needed, proteinIDs are translated back
into the type of external identifier expected by the user
(Figure 2).
This strategy reduces ambiguity and processing problems
to a minimum: there is no need for continuously translating
between distinct types of protein identifiers, since all
information has been previously stored by assigning it to
specific proteinIDs. Furthermore, codifying interactions in
terms of proteinIDs allows PIANA to capture a larger number
of interactions than platforms based on third party protein
identifiers.
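
The round trip from external identifier to proteinID and back can be sketched as follows. This is a toy in-memory model, not the PIANA relational schema: the class `IdentifierStore`, its method names and the identifier type labels are hypothetical, proteinIDs are assigned sequentially for illustration, and `SEQ_PLACEHOLDER` stands in for a real amino acid sequence (9606 is the NCBI taxonomy id for human).

```python
from collections import defaultdict


class IdentifierStore:
    """Toy interlingua: every external identifier hangs off a proteinID."""

    def __init__(self):
        self._protein_ids = {}                # (sequence, taxonomy) -> proteinID
        self._to_protein = defaultdict(set)   # (type, value) -> proteinIDs
        self._to_external = defaultdict(set)  # proteinID -> (type, value)

    def add(self, sequence, taxonomy_id, id_type, value):
        key = (sequence, taxonomy_id)
        if key not in self._protein_ids:
            self._protein_ids[key] = len(self._protein_ids) + 1
        protein_id = self._protein_ids[key]
        self._to_protein[(id_type, value)].add(protein_id)
        self._to_external[protein_id].add((id_type, value))
        return protein_id

    def protein_ids(self, id_type, value):
        return set(self._to_protein.get((id_type, value), ()))

    def translate(self, id_type, value, target_type):
        """External identifier -> proteinIDs -> identifiers of target_type."""
        result = set()
        for protein_id in self.protein_ids(id_type, value):
            for ext_type, ext_value in self._to_external[protein_id]:
                if ext_type == target_type:
                    result.add(ext_value)
        return result


store = IdentifierStore()
store.add("SEQ_PLACEHOLDER", 9606, "uniprot_entry", "THRB_HUMAN")
store.add("SEQ_PLACEHOLDER", 9606, "gene_symbol", "f2")
store.add("SEQ_PLACEHOLDER", 9606, "unigene", "Hs.410092")
symbols = store.translate("uniprot_entry", "THRB_HUMAN", "gene_symbol")
print(symbols)  # {'f2'}
```

Because every identifier is stored against a proteinID, translating between two external identifier types never requires a direct cross-reference between them.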

Figure 2: PIANA use of proteinIDs as an interlingua between external identifiers.

PIANA keeps all information in terms of proteinIDs (an integer that uniquely identifies
a protein sequence of a given taxonomy). User inputs are immediately translated to
proteinIDs. Once this translation has been performed, all operations are performed at
the sequence level, reducing ambiguities and synonym conversions to a minimum.

Moreover, PIANA uses a number of techniques to assure
the quality and completeness of the identifiers used as
input/output: 1) inferring correspondences between identifiers
and sequences even in the case that no external database
explicitly contained the cross-reference: if one database
identifies sequence A with identifier id1 and another
database uses identifier id2 for sequence A, PIANA infers
that id1 is equivalent to id2; 2) uniqueness of output protein
identifiers: if two proteinIDs are linked to the same external
identifier, those proteins are considered to be the same, and
hence, merged into a single network node; 3) avoiding gene
name ambiguities: thanks to integrating the species of the
protein into the internal identifier, gene names are not confounded
even if the same symbol is used for several species;
and 4) using representative protein identifiers: (i)
PIANA will use the identifier labeled as ‘preferred’ by the
source database (e.g. official gene symbol) unless the user
specifies otherwise; and (ii) any input identifiers given by the
user are prioritized over other identifiers in the PIANA database.
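
Techniques 1 and 3 above can be illustrated with a short sketch. It assumes that each source record carries a sequence and a taxonomy id; the record layout, the function name and the identifiers (`id1`, `id2`, `symbolX`) are hypothetical, and the sequences are placeholders (9606 and 10090 are the NCBI taxonomy ids for human and mouse).

```python
from collections import defaultdict


def infer_equivalences(records):
    """Group identifiers that point at the same (sequence, taxonomy) pair.

    records: iterable of (database, identifier, sequence, taxonomy_id).
    Returns only the groups holding more than one identifier.
    """
    by_protein = defaultdict(set)
    for _database, identifier, sequence, taxonomy_id in records:
        by_protein[(sequence, taxonomy_id)].add(identifier)
    return {key: ids for key, ids in by_protein.items() if len(ids) > 1}


records = [
    ("database1", "id1", "SEQ_A", 9606),
    ("database2", "id2", "SEQ_A", 9606),
    # The same symbol in two species stays attached to two distinct proteins.
    ("database2", "symbolX", "SEQ_B", 9606),
    ("database2", "symbolX", "SEQ_C", 10090),
]
equivalent = infer_equivalences(records)
print(equivalent)  # {('SEQ_A', 9606): {'id1', 'id2'}} (set order may vary)
```

No database states that id1 and id2 are synonyms, yet the equivalence follows from the shared sequence; the symbol used in two species is never merged because the species is part of the key.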

Since PIANA works internally with identifiers linked to
the sequence of proteins (i.e. proteinID), the output identifier
that is used for proteins depends not only on the type of
identifier chosen by the user (e.g. UniProt) but also on the
specific results that are being outputted. The reason is that
one proteinID can be associated to several external identifiers (e.g. one sequence associated to three gene names)
and consequently, one of those external identifiers has to be
chosen above the others. The algorithm used to choose among
external identifiers depends on the input identifiers given by
the user (they are prioritized over other identifiers) and the
number of external databases that linked that sequence to
the identifiers. Therefore, one proteinID will not always be
represented in the output by the same external identifier.
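
A minimal sketch of such a selection rule is given below. It encodes only the two criteria named in the text (user-supplied identifiers first, then the number of supporting databases); the alphabetical tie-break, the function name and the identifier names are assumptions added for the example.

```python
def choose_output_identifier(support, user_inputs=()):
    """Pick one external identifier for a proteinID.

    support: dict identifier -> number of source databases linking it
    to the sequence. user_inputs: identifiers typed by the user.
    """
    user_inputs = set(user_inputs)
    return min(
        support,
        # False sorts before True, so user-supplied identifiers win;
        # then higher database support; then name, to be deterministic.
        key=lambda ident: (ident not in user_inputs, -support[ident], ident),
    )


support = {"nameA": 3, "nameB": 1, "nameC": 2}
default_choice = choose_output_identifier(support)
user_choice = choose_output_identifier(support, ["nameB"])
print(default_choice, user_choice)  # nameA nameB
```

The same proteinID is reported as nameA in one run and as nameB in another, depending only on what the user supplied as input.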

Our internal protein identifiers do not distinguish between
identical paralogs. We believe this distinction is not needed,
since most repositories of interactions do not reach that level
of specificity. Finally, proteinIDs are not intended to be new
external protein identifiers; their only purpose is to be used
for integration. Therefore, the way the integration is performed
remains transparent to the user, whose only concern
is to decide on the type of identifiers for input and
output.

Protein-Protein Interactions Integration
Each interaction described in a third-party database is ‘translated’ to one or more interactions between proteinIDs.
For example, if the external database contains an interaction
between proteins A and B, with A corresponding to two
proteinIDs (e.g. 1 and 2) and B to one proteinID (e.g. 3),
two interactions (1-3 and 2-3) will be inserted into the
PIANA database. Both interactions will be described in the
PIANA database as coming from that specific external
database and labeled with the method used to detect the
interaction between A and B. For example, HPRD describes
an interaction between Entrez Gene 217 (mitochondrial
ALDH) and Entrez Gene 3336 (heat shock protein). According
to the correspondences in the PIANA database,
Entrez Gene 217 corresponds to 13 different proteinIDs,
and Entrez Gene 3336 corresponds to 12 proteinIDs. Therefore,
PIANA will internally store the interaction between
those two proteins as 156 different interactions. This methodology
allows PIANA to give full control to the user: 1)
interactions can be retrieved from any type of identifier; 2)
a network can be created for a given external database
(e.g. use only interactions from IntAct) and/or a specific
method (e.g. do not use interactions detected in two-hybrid
assays) and/or a species (e.g. only interested in human interactions);
3) PIANA outputs can be set to use any type of
protein identifier and therefore, interactions between
proteinIDs are transformed to non-redundant interactions
between protein identifiers (Methods). Consequently, describing
interactions in terms of protein sequences instead
of external identifiers provides a true integration of all known
interactions into a single network, while keeping record of
the source databases and detection methods associated with
the interaction. Currently, PIANA can integrate interactions
from DIP (Salwinski et al., 2004), MIPS (Pagel et al., 2005),
MINT (Chatr-aryamontri et al., 2007), IntAct (Kerrien et
al., 2007), BIND (Alfarano et al., 2005), BioGrid (Stark et
al., 2006), HPRD (Peri et al., 2003), STRING (von Mering
et al., 2007), interactions predicted by distant conservation
of sequence patterns and structure relationships (Espadaler
et al., 2005b), interactions transferred between proteins
based on orthology
(Yu et al., 2004) and, in general, any interaction data that is in tabulated or PSI-MI (Hermjakob,
2006) formats. See additional file 2 for the detailed description
of interaction repositories that have been used in this
work. Furthermore, data does not have to be integrated
indiscriminately without differentiating high-throughput versus
small-scale experiments and literature annotation. Therefore,
PIANA allows users to define subsets of interactions
based on the source repository and detection methods employed.
For example, a subset of reliable interactions can
be extracted by requiring them to be in at least two different
repositories.
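
The expansion of one external interaction into proteinID pairs, and the selection of interactions supported by at least two repositories, can be sketched as follows. The function names, the record layout and the repository labels (`db1`, `db2`) are hypothetical and do not reproduce the PIANA database design.

```python
from collections import defaultdict
from itertools import product


def expand_interaction(ids_a, ids_b):
    """All proteinID pairs implied by one interaction between A and B."""
    return {frozenset(pair) for pair in product(ids_a, ids_b)}


def integrate(records, external_to_protein_ids):
    """records: iterable of (database, method, external_a, external_b)."""
    network = defaultdict(lambda: {"sources": set(), "methods": set()})
    for database, method, ext_a, ext_b in records:
        pairs = expand_interaction(
            external_to_protein_ids[ext_a], external_to_protein_ids[ext_b]
        )
        for pair in pairs:
            # Every proteinID pair remembers where it came from.
            network[pair]["sources"].add(database)
            network[pair]["methods"].add(method)
    return network


def supported_by(network, min_sources=2):
    """Pairs reported by at least min_sources repositories."""
    return {p for p, info in network.items() if len(info["sources"]) >= min_sources}


# 13 proteinIDs against 12 proteinIDs give 13 * 12 = 156 stored interactions.
print(len(expand_interaction(range(13), range(100, 112))))  # 156

mapping = {"A": {1, 2}, "B": {3}, "C": {4}}
records = [
    ("db1", "yeast two-hybrid", "A", "B"),
    ("db2", "affinity", "A", "B"),
    ("db1", "yeast two-hybrid", "B", "C"),
]
network = integrate(records, mapping)
reliable = supported_by(network)
```

Here the A-B interaction becomes the pairs 1-3 and 2-3, both supported by two repositories, whereas 3-4 is reported only once and is excluded from the reliable subset.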

Experimental Interactions
The integrated set of experimental interactions consisted
of 4,055,698 interactions between 113,785 different
proteinIDs. When grouping proteinIDs by their associated
NCBI geneID (Methods), there were 405,808 interactions
for 53,143 proteins, an average of 7.63 interactions per protein.

Interactions Distribution
The experimental interactions in the PIANA integrated
database have been obtained from 7 different repositories,
belong to 736 different species, and were detected using
106 different experimental methods. As shown in Table 2,
the species with the largest number of experimental interactions
are yeast (111,535 interactions) and human (110,457
interactions). Most interactions were found in just one database
and were detected by just one method (Figure 3).
The high correlation between the number of methods and
databases is explained by the fact that most interactions
appear in just one external repository, and these repositories
usually label interactions with a single detection method.
We calculated the overlap between 7 repositories with experimental
information in terms of interactions (Table 3A) and proteins (Table 3B). BioGrid (Stark et al., 2006) is the
repository with the highest number of interactions (216,370)
and with the highest number of unique interactions (163,700).
The two repositories that show the greatest overlap are
MINT and IntAct (61% of interactions and 82% of proteins
in MINT are also in IntAct) while the lowest overlap
was between HPRD and DIP (only 4% of interactions and
9% of proteins in HPRD are also in DIP). Most low overlaps
in terms of interactions are explained by the low overlap
in terms of proteins. Therefore, data integration is required
in order to obtain an interaction network that covers
most proteins and interactions.
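
The overlap figures can be computed as in the sketch below, which follows the normalization stated in the captions of the overlap tables: the shared count is expressed as a percentage of the smaller member of the pair. The function names, the repository labels and the protein names are hypothetical.

```python
def pairwise_overlap(items_a, items_b):
    """Shared items and their percentage over the smaller of the two sets."""
    shared = len(items_a & items_b)
    smaller = min(len(items_a), len(items_b))
    percentage = 100.0 * shared / smaller if smaller else 0.0
    return shared, percentage


def unique_items(name, repositories):
    """Items that appear in the named repository and in no other."""
    others = set().union(
        *(items for other, items in repositories.items() if other != name)
    )
    return repositories[name] - others


repositories = {
    "repoA": {
        frozenset(p)
        for p in [("p1", "p2"), ("p2", "p3"), ("p3", "p4"), ("p4", "p5")]
    },
    # ("p2", "p1") is the same undirected interaction as ("p1", "p2").
    "repoB": {frozenset(p) for p in [("p2", "p1"), ("p5", "p6")]},
}
overlap = pairwise_overlap(repositories["repoA"], repositories["repoB"])
print(overlap)  # (1, 50.0)
```

The same two functions apply to sets of proteins instead of sets of interactions, which gives the protein-level overlaps.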

We were interested in analyzing the distribution of interactions
in terms of the detection method employed. We examined the overlaps between different detection methods
in terms of interactions (Table 4A) and proteins (Table 4B).
We observed that high-throughput methods account for most
of the known interactions (126,136 for affinity methods and
103,334 for yeast two hybrid assays). The overlap between
the interactions detected by the different methods is low,
even in cases where the overlap at the protein level is high.
For example, while 51% of proteins with interactions from
affinity methods also had interactions detected by yeast two-hybrid
methods, only 9% of interactions from yeast two-hybrid
were also detected by affinity methods. Therefore,
in order to maximize the number of known interactions for
a protein, multiple experimental detection methods should
be employed.

Table 2: Number of protein interactions per species.

The number of interactions and proteins with at least one known interaction
are shown for species with more than 2000 interactions.

Figure 3: Distribution of interactions in PIANA across different source
databases and detection methods.

Most interactions were found in just one database and were detected just by
one method. Unspecific detection method names were not taken into account
(e.g., experimental, in-vitro, in-vivo).

For each repository, cells show the overlap with other repositories in terms of (A) interactions and
(B) proteins. In parentheses, the percentage that the overlap represents over the repository from
the pair with fewer interactions or proteins is shown. Unique interactions and proteins are those only
appearing in that repository. This table reflects the overlaps in the interaction network unified by
NCBI geneID identifiers.

For each detection method, cells show the overlap with other methods in terms of (A) interactions and
(B) proteins. In parentheses, the percentage that the overlap represents over the method from the pair
with fewer interactions or proteins is shown. This table reflects the overlaps in the interaction network
unified by NCBI geneID identifiers.

Properties of the Experimental Integrated Protein
Interaction Network
Well-documented observations about protein interaction
networks are confirmed when analyzing the integrated experimental
interaction networks of different species. Moreover,
the integrated network shows the modular functional
organization of the proteome reported by previous works
(Gavin et al., 2006). In particular, proteins tend to interact
with proteins of the same Gene Ontology (GO) (Harris et
al., 2004) biological process (Table 5). Furthermore, 95%
of the interacting proteins in the integrated network have
the same cellular component according to GO. In addition,
the following properties were observed for the yeast protein
interaction network (Table 6): (i) yeast hubs (proteins
with 5 or more interactions) are more likely to be essential
(Giaever et al., 2002) than non-hubs (22% of hubs are essential
versus only 5% of non-hubs), although this might be
a reflection of hubs usually having multiple interfaces (Kim
et al., 2006); (ii) approximately 59% of the interactions involve proteins with
the same cell localization according to Lee et al. (2002);
(iii) approximately 60% of the reported interactions involve proteins that are
coexpressed during the yeast cell cycle according to Cho
et al. (1998).

This table shows the fraction of experimentally detected interacting proteins with
the following properties: a) co-localized according to GO cellular component
terms; b) same biological process according to GO biological process terms; and
c) same molecular function according to GO molecular function terms. An interaction
was considered to respect the GO restriction if both interacting proteins
shared a GO term when retrieving GO parents up to level 3 (Harris et al., 2004).
In parentheses, the percentage of interactions where both interacting proteins share
a GO term is shown. Interactions were used for the study only if both proteins
had at least one GO term assigned. Interactions where a protein interacts with
itself were discarded for this study.
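
The GO criterion can be sketched as follows. The sketch assumes one reading of "retrieving GO parents up to level 3", namely that each annotated term is expanded with its ancestors up to three steps above it; the function names, the toy ontology and the term labels are hypothetical, and the real ontology would be loaded from GO.

```python
def expand_terms(terms, parents, levels=3):
    """Add ancestors reachable within `levels` steps (assumed reading)."""
    seen = set(terms)
    frontier = set(terms)
    for _ in range(levels):
        frontier = {p for t in frontier for p in parents.get(t, ())} - seen
        seen |= frontier
    return seen


def share_go_term(terms_a, terms_b, parents, levels=3):
    expanded_a = expand_terms(terms_a, parents, levels)
    expanded_b = expand_terms(terms_b, parents, levels)
    return bool(expanded_a & expanded_b)


def fraction_sharing(interactions, annotations, parents, levels=3):
    used = shared = 0
    for a, b in interactions:
        if a == b:
            continue  # self-interactions are discarded
        terms_a, terms_b = annotations.get(a), annotations.get(b)
        if not terms_a or not terms_b:
            continue  # both proteins need at least one GO term
        used += 1
        shared += share_go_term(terms_a, terms_b, parents, levels)
    return shared / used if used else 0.0


parents = {"t1": ["p"], "t2": ["p"], "t3": ["q"]}
annotations = {"A": {"t1"}, "B": {"t2"}, "C": {"t3"}, "D": set()}
interactions = [("A", "B"), ("A", "C"), ("A", "A"), ("A", "D")]
fraction = fraction_sharing(interactions, annotations, parents)
print(fraction)  # 0.5
```

Only A-B and A-C enter the calculation; A-B shares the parent term p, so one of the two usable interactions respects the restriction.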

Yeast co-localization data was obtained from the work of Lee and coworkers
(Lee et al., 2002). Yeast co-expression data was obtained from
the work of Cho et al. (1998). Yeast essentiality data was
obtained from the work of Giaever et al. (2002). A yeast
protein was considered a hub if it had 5 or more interaction partners. The
interactions and proteins were included in the study for those cases in
which information was available. Interactions where a protein interacts
with itself were discarded for this study.
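
The hub analysis can be expressed as a short sketch using the definition given above (a hub has 5 or more interaction partners). The function name and the toy network are hypothetical, and the sketch assumes essentiality information is available for every protein, which is a simplification of the actual data.

```python
from collections import defaultdict


def hub_essentiality(interactions, essential, hub_threshold=5):
    """Fraction of essential proteins among hubs and among non-hubs."""
    partners = defaultdict(set)
    for a, b in interactions:
        if a == b:
            continue  # self-interactions are discarded
        partners[a].add(b)
        partners[b].add(a)
    hubs = {p for p, n in partners.items() if len(n) >= hub_threshold}
    non_hubs = set(partners) - hubs

    def essential_fraction(group):
        return len(group & essential) / len(group) if group else 0.0

    return essential_fraction(hubs), essential_fraction(non_hubs)


# One protein H with five partners; H and one of its partners are essential.
star = [("H", leaf) for leaf in ("a", "b", "c", "d", "e")]
hub_fraction, non_hub_fraction = hub_essentiality(star, {"H", "a"})
print(hub_fraction, non_hub_fraction)  # 1.0 0.2
```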

Protein Function Prediction from the Experimental
Integrated Network
Recently, it has been shown that the number of common
interaction partners between two proteins can be used to
annotate proteins (Brun et al., 2003; Samanta et al., 2003).
We have studied the use of this heuristic to predict molecular
functions and biological processes as defined by GO
(Harris et al., 2004), by calculating the percentage of shared
GO terms between proteins with common interaction partners
(Figure 4). As expected, we observe that the interactions
of a protein in the integrated network can be used to
predict its function and the biological processes in which it
intervenes. For example, proteins with 10-20 interaction
partners in common share 90% of their GO biological process
terms. Moreover, the accuracy of the predictions based
on the integrated network is similar to that obtained when
solely using the subset of interactions from DIP
(Salwinski
et al., 2004), while the number of annotated proteins is
much higher (additional file 4).
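
The heuristic can be sketched as follows. The text does not define how the percentage of shared GO terms is computed, so the sketch assumes the shared terms are divided by the union of the two term sets; this choice, the function names and the toy data are assumptions made for the example. The all-pairs loop is quadratic and only suitable for small networks.

```python
from collections import defaultdict
from itertools import combinations


def common_partner_counts(interactions):
    """Number of shared interaction partners for every protein pair."""
    partners = defaultdict(set)
    for a, b in interactions:
        if a != b:
            partners[a].add(b)
            partners[b].add(a)
    counts = {}
    for p, q in combinations(sorted(partners), 2):
        n = len(partners[p] & partners[q])
        if n:
            counts[(p, q)] = n
    return counts


def shared_term_percentage(terms_p, terms_q):
    """Shared terms as a percentage of the union (assumed definition)."""
    union = terms_p | terms_q
    return 100.0 * len(terms_p & terms_q) / len(union) if union else 0.0


interactions = [("P", "X"), ("Q", "X"), ("P", "Y"), ("Q", "Y"), ("R", "X")]
counts = common_partner_counts(interactions)
print(counts[("P", "Q")])  # 2
print(shared_term_percentage({"t1", "t2"}, {"t1", "t2", "t3", "t4"}))  # 50.0
```

Binning protein pairs by their number of common partners and averaging the shared-term percentage per bin gives a curve of the kind plotted in Figure 4.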

Predicted Interaction Networks
We were interested in assessing protein interaction predictions
and evaluating the similarities between the predicted
interaction network and the experimental interaction network.
In particular, we studied 4 different types of predictions
(Methods): (i) Gene fusion events (Enright et al., 1999)
as predicted by STRING (von Mering et al., 2007); (ii) Phylogenetic
profiles
(Pellegrini et al., 1999) as predicted by
STRING (von Mering et al., 2007); (iii) Distant conservation
of sequence patterns and structure relationships as described by Espadaler et al. (2005b); and
(iv) Structural interologs predicted by PIANA (Aragues et
al., 2006). We calculated the overlap between the different
experimental and prediction methods in terms of interactions
(Table 7A) and proteins (Table 7B), observing a high
overlap between prediction methods based on genome
analyses (i.e. gene fusion events and phylogenetic profiles)
and a very low overlap between all other prediction methods.
This minimal overlap between interaction predictions
is explained by the different types of input data used by
each method and the type of proteins for which the methods
are capable of predicting interactions. For example, the
method based on structural interologs predicts interactions
for proteins with known 3D structure, while STRING predictions
from gene fusion events were mainly applied to
prokaryotes. Most proteins with known 3D structure are
eukaryotes (Berman et al., 2000), and therefore, the two
methods rarely predict similar interactions. Moreover, there
is low overlap between predicted interactions and those
obtained by experimental high throughput methods, both in
terms of interactions and proteins. These results indicate
that different methods identify interactions for different proteins.
For example, there are many species for which no
yeast two-hybrid experiments have been carried out, while
many predictions can be ‘transferred’ to those species on
the basis of genome analysis, resulting in a low overlap at
the interaction and protein level between the two methods.

Figure 4: Function prediction based on common interaction partners in the integrated experimental
network.

The percentage of shared GO terms is plotted as a function of the number of common interaction
partners.

Table 7: Pairwise overlaps of protein interactions and proteins for four interaction prediction methods,
two types of high-throughput methods (yeast two-hybrid assays and affinity purification methods), and
curated data (in vitro and in vivo).

For each method, cells show the overlap with other methods in terms of (A) interactions and (B) proteins.
In parentheses, the percentage that the overlap represents over the method from the pair with fewer interactions
or proteins is shown. This table reflects the overlaps in the interaction network unified by NCBI
geneID identifiers.

We evaluated whether interacting proteins according to
different prediction methods tended to share biological process,
molecular function and cellular component according
to GO (Table 8). We observed that the method that better
captures functional relationships between proteins is the one
based on gene fusion events (Methods): 85% of the predicted
interacting pairs belong to the same biological process.
Moreover, all prediction methods detected a substantial
number of colocalized proteins. For example, 87% of interacting
proteins according to the prediction method based on
structural interologs had the same cellular location.

This table shows the fraction of predicted interacting proteins with the following properties: colocalized
according to GO cellular component terms; same biological process according to GO
biological process terms; and same molecular function according to GO molecular function terms.
An interaction was considered to respect the GO restriction if both proteins shared a GO term
when retrieving GO parents up to level 3. Interactions were used for the study only if both
proteins had at least one GO term assigned. Interactions where a protein interacts with itself
were discarded from this study.

Discussion

We presented the data integration approach of PIANA,
a software framework designed for creating, managing and
analyzing protein-protein interaction networks. PIANA was
created to address nomenclature and integration issues common
in protein interaction repositories and network visualization
tools. Moreover, the modular approach of PIANA
makes it a useful resource for bioinformaticians wishing to avoid the low-level details related to working with protein interaction networks.

Many areas of biological research are hampered by the
difficulties found in accessing all biological information available. In particular, protein-protein interactions analysis is
usually biased by the input sources of data. PIANA is one
of the very few protein interaction platforms where all interactions from all external databases can be found for a protein of interest, regardless of the type of identifier used
as input or the name given to the protein by the researcher
that submitted the interactions. We presented a detailed
analysis of the protein-protein interactions in the integrated
network, in terms of their distribution across different databases
and detection methods. We showed that most interactions
appear in just one database and the overlap in terms
of interactions is below 50% between most repositories,
reinforcing the need for tools that unify all known interactions
into a single network. Moreover, this integrated network
has been shown to agree with properties previously
reported about protein-protein interaction networks retrieved
from just one database/detection method, such as its capability
of predicting the function of proteins. Besides, the overlap
between different experimental and prediction methods
for protein-protein interaction identification was low, both in
terms of interactions and proteins for which at least one
interaction has been described. Despite this low overlap,
interaction prediction approaches such as those based on
gene fusion events and structural interologs were successful
at identifying pairs of proteins within the same GO biological process. However, more in-depth studies are undertaken to evaluate the ability of annotating proteins based on interaction predictions (Espadaler et al., 2008).

Our analysis of protein interaction data in the public domain
is similar to the studies of Herzel et al. (Futschik et al.,
2007) and Pandey and coworkers (Mathivanan et al., 2006).
However, our study includes protein interactions for all species,
as well as predicted interactions from diverse methods.
Moreover, we have analyzed the overlap between diverse
experimental and prediction methods. The main conclusions
from the studies in (Futschik et al., 2007) and
(Mathivanan et al., 2006) are confirmed for interactions for
organisms other than human. However, we found a higher
overlap between the different interaction repositories, probably
due to recent efforts in data exchange. Moreover, the
total number of interactions in the experimental human integrated
network is 110,457, compared to the 154,000-369,000
interactions estimated by Marcotte and coworkers
(Hart et
al., 2006).

PIANA's approach to data integration is a good equilibrium
between reliability and flexibility, while giving a good
coverage of the information available. Two potential improvements
to the current integration approach are: (i) the
implementation of more sophisticated gene name disambiguation
(Schijvenaars et al., 2005; Xu et al., 2007); and (ii) the
capability of detecting highly similar protein sequences (e.g.
via sequence alignments) and thus, transferring interactions
and identifiers between similar proteins. The data integration
techniques described here could also be of help for areas
other than protein-protein interactions, such as gene
expression studies or regulatory networks.

Conclusions
Our approach to data integration is based on using the
sequence of proteins as an interlingua between the different
identifiers. This strategy allows PIANA, our protein-protein
interaction software platform, to integrate data from
multiple sources into a single interaction network, while allowing
the user to control which interactions are used in the
analyses. The low overlap found between the different repositories
of interaction data reinforces the need for integration
tools. Moreover, we found that the integrated network
of interactions shows properties similar to those previously
reported for partial interaction networks. Finally, we
observed that interaction predictions are not as accurate as
experimentally detected interactions in tasks such as protein
annotation. However, prediction methods can help experimental
methods to cover a larger portion of the
interactome space.

Authors’ Contributions
RA designed PIANA and wrote the manuscript. JGG
and RA implemented the code and performed analyses. BO
conceived of the PIANA project and provided scientific
guidance. JGG and BO helped draft the manuscript. All
authors read and approved the final manuscript.

Acknowledgements
We thank members of the UPF-IMIM SBI lab and P.
Boixeda for their helpful comments. R.A. is supported by a
grant from the Spanish Ministerio de Ciencia y Tecnología
(MCyT, BIO2002-03609). J.G.G. is supported by an FI grant
from the Catalonian Agència de Gestió d’Ajuts Universitaris
i de Recerca del Departament d’Innovació, Empresa I
Universitats de la Generalitat de Catalunya. The work has
been supported by grants from the Spanish Ministerio de
Educación y Ciencia (MEC, BIO02005-00533, PROFIT
PSE-010000-2007-1 and FIT-350300-2006-40/41/42).

Additional file 2: Repositories used for generating the PIANA database used in this
work.

For each external repository used to populate the PIANA database, the version and the
file used are shown.

Additional file 3: Overlaps between repositories of protein sequences.

The overlap between UniProt Swiss-Prot, UniProt TrEMBL, and NCBI genpept is shown. The
percentage over the total number of sequences in PIANA is shown in parenthesis. Two sequences
must be identical in order to be considered a positive overlap. A table with overlaps between the
NCBI non-redundant database (nr) and the three other databases is also provided.

Additional file 4: Function prediction based on common interaction partners in the
interaction network from the Database of Interacting Proteins.

The percentage of shared GO terms is shown for each number of common interaction
partners range (Figure 4A). We observed that the accuracy when using a partial subset of
interactions is similar to that obtained by using the integrated network of interactions. However,
the coverage provided by the partial set of interactions is much lower (Figure 4B).