This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com

Abstract

Motivation: The rapid growth of diverse biological data allows us to consider interactions between a variety of objects, such as genes, chemicals, molecular signatures, diseases, pathways and environmental exposures. Often, any pair of objects—such as a gene and a disease—can be related in different ways, for example, directly via gene–disease associations or indirectly via functional annotations, chemicals and pathways. Different ways of relating these objects carry different semantic meanings. However, traditional methods disregard these semantics and thus cannot fully exploit their value in data modeling.

Results: We present Medusa, an approach to detect size-k modules of objects that, taken together, appear most significant to another set of objects. Medusa operates on large-scale collections of heterogeneous datasets and explicitly distinguishes between diverse data semantics. It advances research along two dimensions: it builds on collective matrix factorization to derive different semantics, and it formulates the growing of the modules as a submodular optimization program. Medusa is flexible in choosing or combining semantic meanings and provides theoretical guarantees about detection quality. In a systematic study on 310 complex diseases, we show the effectiveness of Medusa in associating genes with diseases and detecting disease modules. We demonstrate that in predicting gene–disease associations Medusa compares favorably to methods that ignore diverse semantic meanings. We find that the utility of different semantics depends on disease categories and that, overall, Medusa recovers disease modules more accurately when combining different semantics.

1 Introduction

In recent years, there is increasing evidence that gene–disease association prediction and disease module detection can benefit from integrative data analysis (Moreau and Tranchevent, 2012; Ritchie et al., 2015). Large-scale molecular biology data systems analyzed with integrative approaches are typically heterogeneous and contain objects of different types, such as genes, pathways, chemicals, disease symptoms and exposure measurements. These objects interconnect through multiple, most often pairwise, relations encoded in the data. Consider an example of such a data system from Figure 1 that contains 16 datasets (solid edges) and objects of 13 different types (nodes). For example, dataset R2,7 encodes the clinical manifestations of diseases, whereas dataset R5,3 describes associations of chemicals with biological processes and molecular functions. The ubiquity of complex data systems of this kind presents many unique opportunities and challenges for uncovering genotype–phenotype interactions, or, in general, interactions between any kind of objects.

Data fusion graph showing relations between the datasets used in this study. Each node represents a distinct type of objects, such as chemicals, pathways or exposure events, and each edge represents a dataset. For example, R2,6 is a matrix of curated...

Challenges in the joint consideration of systems of datasets, such as that in Figure 1, include inferring accurate models to predict disease traits and outcomes, elucidating important disease genes and generating insight into the genetic underpinnings of complex diseases (Barabási et al., 2011; Han et al., 2013; Ruffalo et al., 2015; Taşan et al., 2015). We would like these models to collectively consider the breadth of available data, from whole-genome sequencing to transcriptomic, methylomic and metabolic data (Navlakha and Kingsford, 2010; Greene et al., 2015; Zitnik et al., 2015). A major barrier preventing existing methods from fully exploiting entire data collections is that individual datasets usually cannot be directly related to each other. For example, datasets R5,3 (annotation of chemicals with Gene Ontology (GO) terms) and R2,7 (symptoms of diseases) in Figure 1 reside in completely different feature space. As we will learn in this manuscript, we can relate distant objects by chaining through the fusion graph, for example, we can relate molecular signatures with disease symptoms through genes, GO terms, chemicals and diseases. But such chaining actually exacerbates the problem, as distant objects can be linked in many different ways. For example, another way to relate molecular signatures with disease symptoms is via genes, pathways, chemicals, GO terms (Ashburner et al., 2000), exposure events and diseases. A priori, it is not obvious which of the two ways, or which of any existing ways of connecting the signatures with the symptoms, performs better and should thus be preferred when mining disease data.

Different ways of relating objects often carry different semantic meanings and can potentially generate different results. Intuitively, different semantics imply different similarities. However, traditional methods that we review in the next section disregard the subtlety of different types of objects and links. These methods mix, discard or ignore different semantics, which might impede their performance and explanatory capabilities. In this work, we aim to fill this gap by developing an approach for disease module detection that can consider diverse semantics in a principled manner.

We here introduce a novel approach, called Medusa, for automatic detection of size-k significant modules from heterogeneous systems of biological data. Unlike previous works in integrative data analysis, Medusa explicitly takes different semantics into consideration during module detection by allowing a user to either choose a particular semantic or combine them. Our goal is to answer association queries on possibly complex data systems, such as the one in Figure 1. For example, given a small number of diseases, infer the most significant group (module) of genes of size k. Or, given a list of genes, propose a group of k other genes that, taken together, will give the highest significance under a particular null hypothesis. Or, given a selection of molecular pathways of interest, find which k chemicals have the largest collective impact on these pathways. To achieve this level of versatility, Medusa builds upon a recent collective matrix factorization algorithm (Zitnik and Zupan, 2015). In addition, Medusa formulates a submodular optimization program, which provides theoretical guarantees about the significance of the detected modules (Fujishige, 2005).

In a case study with datasets shown in Figure 1, we applied Medusa to find gene–disease associations and infer disease modules. We demonstrate that Medusa-inferred associations are more accurate than those of alternative approaches, which conflate distinct semantics that exist in the data system. Importantly, we find that different semantics vary in their ability to make accurate predictions. We also show that the performance of different semantics depends on the disease category. Finally, we observe that the overall best performance is achieved when Medusa infers associations by combining distinct semantics.

2 Related work

The question of distinguishing different semantics that exist within biomedical data systems remains largely unexplored. Two notable exceptions include a meta-path-based approach for gene–disease link prediction in heterogeneous networks (Himmelstein and Baranzini, 2015) and a latent-chain-based approach for gene prioritization (Zitnik et al., 2015). These approaches, however, are algorithmically different. The approach of Himmelstein and Baranzini (2015) is a network-based technique that relies on meta-paths (Sun et al., 2011a, 2011b, 2012; Wan et al., 2015). Meta-paths represent the number of path instances between two objects that follow a particular sequence of object types in a heterogeneous network. In contrast to meta-paths, Zitnik et al. (2015) use collective matrix factorization (Zitnik and Zupan, 2015) to estimate a latent data representation of a data system and then derive new connections by appropriately multiplying the latent matrices. In Section 5.3, we empirically compare our Medusa, which formulates module detection on top of latent representation of the system, to alternative meta-path-based approaches.

There are several lines of research on how to consider many datasets to derive good groupwise disease associations.Vanunu et al. (2010); Ghiassian et al. (2015) considered network propagation and random walk analysis for prioritizing disease genes and inferring protein complex associations. Han et al. (2013) sought groupwise disease associations for sets of single nucleotide polymorphisms mapping to a given functional category. On a related note, guilty-by-association methods (Wang et al., 2012) have used cofunction networks to assign functions to uncharacterized genes in various organisms and to functionally characterize whole sets of genes. Functionally coherent subnetworks were also used to augment curated functional annotations by connecting genes that share, or are likely to share, functions (Greene et al., 2015; Lee et al., 2004; Taşan et al., 2015), for example, by sharing protein domain annotations or tissue-specific interactions.

However, while these gene–disease association and disease module detection methods use the information from different data sources, none, including our previous work on this topic (Zitnik et al., 2013; Zitnik and Zupan, 2016), explicitly considers that different purposes of disease-related analysis might benefit from considering different, potentially distant, semantics between genes and diseases, nor do they ask users to select/combine different ways of connecting genes with diseases. Medusa, the algorithm presented here, differs from the above methods in that it utilizes semantically distinct chains consisting of possibly many relations to derive relationships between objects, such as genes and diseases. Medusa can establish connections between objects for which direct relationships are not available in present data. It then uses these connections to find size-k modules of objects that together exhibit near-highest significance to the preselected pivot objects. The flexibility of Medusa is further shown in that the object type of the pivots can, but not necessarily, coincide with the object type of the candidates and that chains carrying different semantic meanings can be combined in a principled manner.

3 Methods

Medusa is an approach for the detection of size-k modules that are maximally significant for a predefined set of pivot objects. On the input, Medusa accepts (i) a set of candidate objects that are potential module members, (ii) a set of pivot objects that are not necessarily of the same type as the candidates and (iii) a possibly large and heterogeneous collection of datasets represented in the form of a data fusion graph such as that shown in Figure 1.

Medusa uses collective matrix factorization to jointly estimate a latent data model from all datasets included in a fusion graph. It exploits the latent data model to establish semantically distinct connections between candidate objects and domains of other object types in the fusion graph (Section 3.2). The module detection algorithm in Medusa (Sections 3.3–3.5) is a submodular optimization program which yields an efficient algorithm and provides theoretical guarantees about the significance of the detected modules.

3.1. Preliminaries and notation

3.1.1. Data fusion graph

A data fusion graph G=(V,R,T) is a relational map of datasets (Zitnik and Zupan, 2015). Nodes of the graph V represent different types of objects, such as ontological terms, genes, diseases, pathways and chemicals. The edges of the graph correspond to datasets, which are given in matrices annotated next to the edges. An exemplar data fusion graph is shown in Figure 1. Matrix R2,7 therein is a n2×n7 real-valued matrix whose rows correspond to diseases indexed by a respective disease-based controlled vocabulary and whose columns indicate disease symptoms indexed by a symptom-based vocabulary. Elements of matrix R2,7 represent a dataset, such as disease–symptom associations.

Technically, the edges of the fusion graph are given by a set of relation matricesR={RI,J;I,J∈V,I≠J,RI,J∈ℝnI×nJ} that represent dyadic datasets and a set of constraint matricesT={ΘI;I∈V,ΘI∈ℝnI×nI} that represent unary datasets. It is possible to have multiple relation matrices that relate object types I and J (i.e. more than one edge between I and J in the fusion graph) or multiple constraint matrices for object type I (i.e. more than one loop for I in the fusion graph). Here, this possibility is suppressed for notational brevity.

3.1.2 Collective matrix factorization

Collective matrix factorization (Zitnik and Zupan, 2015) is an algorithm that considers a fusion graph G=(V,R,T) and infers its latent model by compressing the datasets with co-factorization of matrices in R. Matrices T are used for regularization of the latent model. The method simultaneously co-factorizes all the relation matrices into the products of much smaller latent matrices through a procedure which (i) ensures the transfer of information between related matrices and (ii) promotes good generalization via a high-quality data compression. To achieve the first point, collective factorization reuses the latent matrices when decomposing distinct but related relation matrices. The second feature is possible due to the low-dimensional nature of matrix factorization.

The collective matrix factorization algorithm aims to estimate low-dimensional latent matrices GI,I∈V, and SI,J,I,J∈V, which minimize the following objective:

∑RI,J∈R||RI,J−GISI,J(GJ)T||Fro2+∑ΘI∈T tr((GI)TΘIGI).

(1)

Here, the inferred latent matrices tri-factorize each relation matrix as R^I,J=GISI,J(GI)T. Matrix Gi is a nI×kI (kI≪nI) non-negative latent matrix containing latent profiles of objects of type I in rows, GJ is a nJ×kJ (kJ≪nJ) non-negative latent matrix with profiles of objects of type J in rows and SI,J is a kI×kJ latent matrix that models interactions between latent components in the (I, J)-th dataset. The latent profile of an object of type I is given by its corresponding row vector in GI. Semantically, the profile encodes membership of the object to the kI latent components.

The parameters of the algorithm are factorization ranks, kI, for every object type I in the data fusion system, which are selected as in Zitnik et al. (2015). We refer the reader to Zitnik and Zupan (2015) for a detailed description and theoretical analysis of the factorization algorithm.

3.1.3. Chaining of latent data matrices

A factorized system of latent data matrices returned by the collective matrix factorization can be used to establish connections between distant object types, i.e. non-neighboring nodes in the fusion graph (Zitnik et al., 2015).

Definition 1 —

Chain: A chain CS,T is a sequence of relations defined on a fusion graph G=(V,R,T) that connects object type S∈V with possibly distant object types T∈V. The chain CS,T is denoted in the form of:

CS,T=RS,I1○RI1,I2○…○RIl−2,Il−1○RIl−1,T,

(2)

which defines a composite relation between object types S and T, where○denotes the composition operator on relations. Here, forI1,I2,…,Ilthe object types Ij andIj+1must be adjacent in the fusion graphG.

The length l of chain CS,T is measured by the number of its constituent relations. We now formulate how to materialize a given chain and derive the profiles of objects of one type in the space of objects of another type.

Definition 2 —

Materialized chain: Given a fusion graph G=(V,R,T) and a latent data system estimated by collective matrix factorization, a materialized chain CS,T∈ℝnS×nT for chain CS,T specified in Equation (2) is defined as:

CS,T=R^S,I1R^I1,I2…R^Il−2,Il−1R^Il−1,T,

(3)

whereR^I,Jis the relation matrix reconstructed from the latent data system asR^I,J=GISI,J(GJ)T.((·)Tis matrix transposition.)

3.1.4. Submodular functions and optimization

Submodular functions (Edmonds, 1970; Fujishige, 2005) have recently attracted much interest, e.g. see Krause and Guestrin (2011). Let us assume we are given a finite set of n objects V and a valuation function f:2V→ℝ+ that returns a non-negative real value for any subset X⊆V. The function f is said to be submodular if it satisfies the property of diminishing returns. That is, for any set X⊆Y and i∉Y, we must have: f(X∪{i})−f(X)≥f(Y∪{i})−f(Y). This means that the incremental gain of element i decreases when the background in which i is considered grows from X to Y⊇X. We define the ‘gain’ as f(i,X)=f(X∪{i})−f(X), which implies that f is submodular if f(i,X)≥f(i,Y).

In this article, we deal with functions that are not only submodular but also non-negative (i.e. f(X)≥0 for all X⊆V) and monotone non-decreasing (i.e. f(X)≥f(Y) for all X⊆Y). Such functions are trivial to uselessly maximize, since f(V) is the largest possible valuation. However, we would typically like to identify a valuable subset of bounded and small cost. Here, we are interested in subsets whose costs are measured by their size. This leads to the optimization problem X*∈arg maxX⊆V,|X|≤kf(X), where k is the desired subset size. Solving this problem exactly is NP-complete (Feige, 1998). However, when f is submodular, then the greedy algorithm has a worst case guarantee of f(X˜*)≥(1−1/e)f(Xopt)≈0.63f(Xopt), where Xopt is the optimal and X˜* is the greedy solution (Nemhauser et al., 1978).

3.2. Problem definition

In this section, we introduce a framework for module detection on data fusion graphs, a novel approach to find size-k maximally significant modules, Medusa, and propose a Medusa-based top-k module detection problem that takes into consideration diverse semantics in heterogeneous data systems.

We start by defining the concepts needed to guide the module detection procedure and to assess the significance of the modules.

Definition 3 —

Candidate objects: Given a fusion graph G=(V,R,T), candidate objects are given by a set I={i1,i2,…,inI} of all the entities that belong to type I∈V. Candidates constitute a pool of objects from which a module is identified.

Definition 4 —

Pivot objects: Given a fusion graph G=(V,R,T), pivot objects are given by a subset S0={s1,s2,…,su},u<nJ, of the entities of type J∈V. Pivots are the objects against which the significance of the current Medusa module is assessed.

Depending on whether the pivots and the candidates belong to the same or different type of objects in G, we distinguish two prediction settings. This distinction is important because it will lead to different optimization objectives when detecting modules in Section 3.4.

Task 3.1 —

We aim to find a size-k module Mk={m1,m2,…,mk} of the candidates I that display the maximal significance with respect to the given pivots S0.

Let J be the object type of the pivots. (i) In the candidate-pivot-equivalence (CPE) regime, candidates and pivots are of the same data type, I=J. (ii) In the candidate-pivot-inequivalence (CPI) regime, candidates and pivots are of different types, I≠J.

We illustrate the CPE and the CPI regimes with concrete examples in Figures 2 and ​and3,3, respectively. We proceed by formulating a measure, which we optimize for when detecting Medusa modules.

The CPE regime in Medusa. In the CPE environment, Medusa detects a relevant module of candidate objects based on a set of pivot objects, which belong to the same object type as the candidates. For example, given three disease genes (pivots, black circles),...

The CPI regime in Medusa. In the CPI regime, Medusa detects a relevant module of candidate objects based on a set of pivot objects, which belong to a different object type than the candidates. For example, given three diseases (pivots, black squares),...

3.3. Submodularity for detection of Medusa modules

Submodularity is a natural model for detection of size-k maximally significant modules in multiplex data. In this case, each i∈Vsource is a distinct candidate and Vsource corresponds to a set of all candidate objects. An important characteristic of a good model for this problem is that we wish to decrease the ‘value’ of a candidate i∈Vsource based on how much that candidate has in common with candidates Sr that have been chosen in the first r rounds.

The value p(i,Sr) of a given candidate i in a background of previously chosen objects Sr⊆Vsource further diminishes as the background grows Vsource⊇St,t>r. When, for example, both candidate and pivot objects are genes and a candidate’s value is represented as the statistical significance of its concentration, it is natural for the significance to be discounted based on how much representation of that candidate already exists in a previously chosen subset. When the module grows, it naturally becomes more diverse, and hence, its characterization is less distinctive, which results in the overall reduction of statistical power. This means that the candidate is pulled into the module when its significance towards the pivot objects is the highest. If the candidate were added to the module later, its significance could only be smaller. That is, if we were to observe that candidate after being included into the module, its significance would fade into insignificance.

This paradigm corresponds to submodularity, which we express mathematically by functions in Equations (5) and (7) below.

3.4. Detection of size-k maximally significant modules

Next, we describe the Medusa module detection algorithm. Recall that Medusa is able to operate in two prediction regimes defined in Section 3.2. We start by describing the algorithm for the CPE regime and proceed with the algorithm for the CPI regime.

Recall that a particular semantic aspect connecting object types S and T is realized as matrix CS,T (Definition 2). For notational convenience, we here denote a given aspect simply as a matrix C. Prior to the analysis, matrix C is row-wise normalized by the sum of the matrix rows, and then multiplied by the number of matrix columns.

3.4.1. Medusa in the CPE regime

We capture the distinct connectivity patterns of the candidates by evaluating the significance of their connections in matrix C. For a randomly picked candidate we evaluate the probability that a certain fraction of the candidate’s strongest connections to the objects in the columns of matrix C match exactly with the strongest connections of the pivots.

In the simplest case, we would simply count the connections, and this notion would correspond to the hypergeometric distribution. However, since matrix C is a real-valued object inferred by a latent model, we take into account the estimated strength of connectivity rather than its mere existence (cf. experiments in Section 5.3). For this, we need to extend the binomial coefficient to the real line using the gamma function (Fowler, 1996). Technically, this is implemented in the following definition.

Definition 5 —

Candidate concentration: Given a semantic C, pivots S0, and a candidate i, we define the concentration of candidate i as:

where c is the observed strength of connectivity of candidate i, Bin(n,k)=Γ(n+1)/(Γ(k+1)Γ(n−k+1)),andΓ(x)is the gamma function evaluated at x. Here, δ(x)is the indicator function, δ(x)=1if x=1 andδ(x)=0otherwise. Also, XS0returns the rows ofXindexed by setS0.Here, Qi is a set containing column indices of i’s strongest connections. The weights in vectorware defined aswi=1ifi∈S0, elsewi=(1−α)rifi∈Sr,where r is the iteration when i was included into the module, andwi=0otherwise. The value forα,which promotes modules that are tight around the initial set of pivots S0, and the size of Qi, are user-defined parameters.

We evaluate whether candidate i has greater correspondence with the pivots than expected under this null hypothesis by calculating the cumulative probability for the observed or any weaker concentration of the connectivity:

pCPE(i,S0;Qi)≜∫|Qi|c˜ihCPE(c,S0,Qi)dc

(5)

where c˜i=CiQi1. A better candidate will have a lower pCPE value.

It can be shown that the use of this cumulative probability to select a candidate object, which is to be included into the current module, leads to a submodular program (the proof is omitted here for brevity). This appealing property allows us to use the greedy algorithm to find modules of size k that approximate the optimal modules within a constant factor (Section 3.1.4). Recall that the optimal modules can only be found using a prohibitively expensive exhaustive enumeration of all size-k modules. Building on these observations, we propose the algorithm to identify a maximally significant module of size k, which solves Task 3.1 in the CPE regime:

Add the candidate i with the highest rank (i.e. lowest pCPE value) to the current Medusa module, Mr→Mr+1=Mr∪{i}, and to the set of pivot objects, Sr→Sr+1=Sr∪{i}.

Repeat steps 2–4 k-times with the expanded set of pivot objects.

The order in which the candidates are pulled into the module reflects their relevance to the pivot objects according to semantic C.Figure 4 is an example for finding a size-3 gene module based on a semantic that relates genes to disease symptoms via GO terms, exposure events and diseases.

Inferring a three-maximally significant distant module with Medusa in the CPE regime. Shown is a toy example of a chain that relates seven genes in rows to four disease symptoms in columns via GO terms, exposure data and diseases (see the highlighted...

3.4.2. Medusa in the CPI regime

So far, we have seen that in the CPE regime, candidates and pivots are given by the rows of matrix C. However, in the CPI regime this is not true anymore. Here, pivots correspond to columns of matrix C, whereas candidates are still given by rows of matrix C. To adjust for this change, we assess a candidate’s significance by its visibility, which we define next.

Definition 6 —

Candidate visibility: Given a semanticC,pivots S0 and a candidate i, we define the visibility of candidate i as:

Intuitively, the visibility of a candidate is the strength of its connections with the pivots. We evaluate whether candidate i has stronger connections to the pivots than expected under this null hypothesis by calculating the cumulative probability for observed or any stronger visibility:

pCPI(i,S0)≜∫c˜imhCPI(c,S0)dc

(7)

where c˜i=CiQi1 and m is the number of columns in C. A better candidate will have a lower pCPI value.

Similarly as for the CPE regime, the use of cumulative probability pCPI leads to a submodular optimization program, which has the same appealing properties as the cumulative probability pCPE in Section 3.4.1. Building on the theory of submodular functions, we tackle Task 3.1 in the CPI regime by proposing the following greedy algorithm to identify a maximally significant module of size k:

Start with an empty module M0={}.

Compute visibility significance (Equation (7)) for all candidates. Visibility of candidate i naturally decreases in iteration r according to c˜i=(1−∑t=1rβtexp⁡(−KL(Ci,Cxt−1)))CiS01, where KL denotes the Kullback–Leibler divergence, Cxt−1 is matrix row of the candidate selected in iteration t − 1 and β is a user-defined parameter promoting diverse modules.

Rank the candidates according to their respective pCPI values.

Add the candidate i with the highest rank (i.e. lowest pCPI value) to the current Medusa module, Mr→Mr+1=Mr∪{i}.

Repeat steps 2–4 k-times, bringing in one candidate object at a time into the growing module.

Figure 5 shows a toy example for finding a size-3 gene module based on a semantic that relates genes with diseases via chemicals, pathways, genes, GO terms and exposure events.

Inferring a three-maximally significant distant module with Medusa in the CPI regime. Shown is an example of a chain that relates six genes to five diseases via chemicals, pathways, genes, GO terms and exposure data (see the highlighted chain in Fig....

3.5. Combining chains carrying different semantics

So far, we have described the Medusa algorithm that operates on one particular semantic given by a single matrix C. To be able to combine d different semantics, given by a set of chained matrices C(1),C(2),…,C(d), we employ the following technique.

In the r-th iteration of Medusa, we independently score the candidates according to Equation (5) (in the case of the CPE regime) or Equation (7) (in the case of the CPI regime) for all matrices C(1),C(2),…,C(d). We then combine candidate scores from different semantics into one score per candidate by an affine combination of semantics’ weights.

In the CPE regime, this is done such that the semantics which rank the pivots higher are assigned larger weights than those in which the pivots are ranked lower. Intuitively, this means that semantics that are more informative for the given set of pivots contribute more towards the final candidate score.

In the CPI regime, the combination of semantics is done such that the semantics in which the pivots have more similar profiles measured by the KL divergence are assigned larger weights than those in which the pivots have more heterogeneous data profiles. Detailed steps are provided within the online implementation of Medusa.

Once the aggregated candidate scores are calculated, Medusa uses algorithms from Sections 3.4.1 and 3.4.2 to detect the modules.

4 Experimental setup

4.1. Datasets in the data fusion graph

In our experiments, we consider a collection of datasets shown in Figure 1. Table 1 lists public sources from which data were obtained to build 16 data matrices.

Each dataset was represented with a real-valued data matrix as indicated in Table 1. Prior to the analysis, all matrices were independently column–row normalized according to the second vector norm.

4.2. Disease modules and gene–disease associations

The corpus of 310 diseases was downloaded from the CTD (Davis et al., 2015) with the criteria that every disease should have at least 10 and at most 100 curated gene associations for which direct evidence is available in the CTD database. These diseases and their associated genes constituted our ground-truth information against which we evaluated disease module detection and gene–disease association prediction. On average, each disease had 28 associated genes.

To ensure there was no leakage of information from training to test set in our integrative analysis, we used the following protocol. From our data system we excluded datasets that directly or indirectly rely on disease–gene associations (dotted lines in Fig. 1). For example, we skipped the pathway-to-disease dataset (a hypothetical data matrix R4,2 in Fig. 1) because a particular pathway would only be linked to a particular disease if there was a disease-associated gene in this pathway. Another example of an excluded dataset is the GO term-to-disease dataset (a hypothetical data matrix R5,2 in Fig. 1). Here, a given term and a given disease would be linked only if there existed a gene that was both annotated with the term in the GO and associated with the disease.

This protocol enabled us to construct a data system for performance evaluation of various methods that was not contaminated by existing gene–disease associations. For example, dataset R3,2 in Figure 1 contained curated chemical–disease associations that were extracted from the published literature by the CTD curators (Davis et al., 2015). However, this dataset omitted inferred chemical–disease associations from our training data, which associated chemicals and diseases through shared gene interactions.

4.3. Performance evaluation

We evaluated prediction accuracy using a disease-centric cross-validation procedure. (i) Prediction of gene–disease associations was evaluated as follows. For a particular disease in our corpus of 310 diseases, we conducted a leave-one-gene-out cross-validation to obtain an estimated score for the left-out gene. The remaining (training) genes were considered positive instances and were used to select and fit the model parameters. (ii) Detection of disease modules was evaluated using the procedure from Ghiassian et al. (2015). For each disease module, we randomly removed a certain fraction (25%, 50% and 75%) of the disease genes and used the remaining genes as pivots.

For Medusa, we need to specify the parameters required by collective matrix factorization, the value for α, the size of Q (in the case of the CPE regime) and the value for β (in the case of the CPI regime). We tuned α, β and the size of Q in an internal cross-validation procedure on the training genes. Factorization ranks for collective matrix factorization were selected using a procedure similar to the one described in Zitnik et al. (2015); 13 values were required, each representing latent dimension of one object type in our fusion graph. We selected these dimensions through a single parameter p, which specified latent dimension for an object type as a fraction of the number of objects of that type: kI=pnI. The value p=0.05 was used in the experiments because it maximized the mean areas under precision-recall curves (AUPRC) achieved on a set of 10 diseases from the CTD, which were later not considered for performance comparison. The selection of parameters for other approaches was made based on internal cross-validation.

We measured accuracy using the AUPRC and the area under the receiver operating characteristic curve (AUROC).

5 Results and discussion

5.1. Capturing biological semantics with Medusa

First, our goal was to investigate the effects of different biological semantic aspects on the prediction of gene–disease associations. We used Medusa to estimate disease genes for 310 diseases included in the CTD database (Section 4.1). We considered eight distinct semantic aspects denoted as C1–C8 (Fig. 6, left). For example, chain C6 corresponds to a matrix that relates genes to diseases via GO terms and exposure events (see Definition 2).

Gene–disease association prediction with Medusa. Nine different biological semantics (C1–C8, CA) were considered in the analysis. Each semantic is shown as a sequence of object types contained in the fusion graph (Fig. 1). For example,...

Medusa detects significant modules of a specified size k. To apply it to the prediction of gene–disease associations, we search for size-1 modules and use probability estimates returned by the algorithm (Equation (5)) to make predictions.

Figure 6 shows the performance of Medusa in terms of AUROC and AUPRC. In addition to the eight distinct semantics, we also analyzed the performance of Medusa in its mode, which combines different semantic aspects, denoted as CA in the figure (Section 3.5). We further studied how prediction accuracy varies across classes of diseases, which were identified using Disease Ontology (Kibbe et al., 2014). A cutoff at level 2 of the Disease Ontology graph revealed 25 disease classes (Fig. 6), such as ‘cognitive disorders’ and ‘gastrointestinal diseases’.

We observed substantial variation of AUROC and AUPRC values across different semantic aspects (i.e. chains). In terms of the AUROC, the best single semantic appeared to be C1, which was followed closely by C3 and C5. In addition to genes, chemicals were the common object type considered for construction of these three chains. These results are important because they demonstrate that different ways of establishing connections between genes and diseases can result in more or less accurate predictions. The results also suggest that objects of different types and links carry different semantic meanings, and it might not make sense to mix them without distinguishing their semantics when associating genes with diseases. This experiment also alludes to the explanatory value of Medusa. Medusa is able to provide insights into the utility of different semantics, a capability, which most present models for co-factorization of multiple matrices do not have.

While a user might explicitly specify a semantic that he would like to consider in a concrete application, Medusa can also make predictions that are consistent with multiple chains. In particular, combining semantics C1–C8 in Figure 6 yielded the most accurate predictions overall. However, prediction accuracy in Figure 6 varies greatly by disease and we explore this issue next.

5.2. Detecting disease modules with Medusa

We analyzed the extent to which Medusa could recover the full disease module if we removed a certain fraction of disease associated genes. Recall that a disease module is given by the set of genes associated with that disease in the CTD (Davis et al., 2015). For a given disease, Medusa used 50% of the disease genes as pivots and detected a size-k module, where k was set to the size of the full disease module. Notice that Medusa pulls in one gene at a time into the growing module and given the inferred latent model, the running time of the module detection increases linearly with the desired module size k. Figure 7 shows the fraction of held-out disease genes (recall) that were found in Medusa modules. Higher values indicate better performance.

Disease module detection with Medusa. We considered nine different semantic meanings (C1–C8, CA; rows in the heat map) listed in Figure 6. Reported is the recovery rate when 50% of genes from known disease modules were left out. The recall values...

We found that the highest rate of true positives was achieved in the early iterations of the Medusa algorithm, i.e. when the number of executed iterations was less than the size of the full module. This is an important observation because it indicates that the highest ranked genes are most likely to be part of the disease module.

The results in Figure 7 further show that the estimated recovery rates varied across different semantic aspects as can be seen by comparing rows in the heat map. Typically, the best performance was observed when Medusa was used to detect modules based on joint analysis of all semantics (see CA in Fig. 7).

It is interesting to examine which classes of diseases display higher recovery rates than others and how the rates compare to each other. The dendrogram in Figure 7 shows that ‘monogenic diseases’ exhibited a distinct recovery pattern. For example, modules corresponding to monogenic disorders according to Disease Ontology (Kibbe et al., 2014) were best recovered using the chain C7, whereas other disease classes (with the exception of cognitive disorders) were best detected when Medusa was used in the CA setting. We also observed that related disease classes displayed similar patterns of recovery rates across different semantics. For example, immune system diseases and viral infectious diseases are placed closely together in the dendrogram, as well as acquired metabolic diseases and diseases of the gastrointestinal system. It is known that diseases from similar disease classes are more likely to be associated with sets of genes that overlap (Barabási et al., 2011). The similar recall patterns from related disease classes thus suggest that the Medusa outcome is robust with respect to variations in the set of pivot genes.

5.3. Comparing Medusa with existing methods

So far we have studied the utility of Medusa to take into consideration distinct semantics that exist in heterogeneous biological data when predicting gene–disease associations and detecting disease modules. We proceed by examining how Medusa performs relative to several other approaches for mining gene–disease associations.

First, we compare Medusa with meta-path based approaches (Section 2). These approaches have just recently been tested on prediction problems in biology for the first time (Himmelstein and Baranzini, 2015) and have shown promising performance for prioritizing genetic associations from genome-wide association studies. Meta-paths (Sun et al., 2011b) are sequences of object types. They are used to represent complex relationships between objects beyond what links in a homogeneous network capture. For example, given a meta-path that corresponds to the chain C6: Genes → GO terms → Exposure events → Diseases (Fig. 6, left), a meta-path-based approach in its simplest form relates a particular candidate gene with a particular pivot disease by counting the number of paths in a heterogeneous network between a candidate and pivot node. These counts then serve to derive features. Each feature represents one meta-path originating in a given candidate gene and terminating in a given pivot disease and quantifies the prevalence of that meta-path between any gene–disease pair. To describe different aspects of connectivity, we computed eight features based on chains C1–C8 (Fig. 6, left) and then used a rank-correlation metric or a sophisticated PathSim meta-path-based metric (Sun et al., 2011b; Wan et al., 2015) to score gene–disease associations. The results in Table 2 show that Medusa compares favorably to both meta-path models in terms of AUROC and AUPRC values. It is important to understand the subtlety: Medusa relates candidate genes to diseases by deriving new connections between them based on matrices estimated by a collective latent factor model. This highlights Medusa’s advantage of taking into consideration projections of data into the latent space, which potentially give more informative connections than the rather crude meta-path count metrics. Furthermore, Medusa’s technique to estimate associations considers the significance of derived connections under a particular null hypothesis, whereas alternative methods rely on similarity scoring.

Cross-validated performance for predicting gene–disease associations using a heterogeneous data system shown in Figure 1.

Second, we applied a random walk algorithm (Li and Patra, 2010) to the heterogeneous network whose schema is shown in Figure 1. It is known that random walk approaches are often the best performing methods for associating genes with diseases (Navlakha and Kingsford, 2010). We found that the random walk approach has performance comparable to the meta-path-based approach that used the PathSim metric (Table 2). However, in the majority of the diseases, Medusa achieved higher cross-validated accuracy.

Finally, we also considered two simplified variants of Medusa (Table 2, third block). To measure the effect of Medusa’s submodular optimization program, we ran Medusa against variations, which associated genes to diseases based on: (i) the rank-correlation between candidate gene profile and disease gene profiles in a materialized chained matrix or (ii) the gene–disease score returned by the random walk approach. Medusa offered an overall improvement of 37% over the correlation-based variant and a 10% improvement over the random walk approach as measured by the AUROC. The results suggest that both key ingredients of Medusa, the collective latent model and the submodular program, are important for its good performance.

6 Conclusion

We here presented a novel and practical approach to infer connections between objects that are either close to each other or far away from each other in heterogeneous biological data domains. We introduced Medusa, a module detection algorithm that, given a set of pivot objects, finds a size-k module of candidate objects that are jointly relevant to the pivots. Importantly, this module achieves significance that is provably close to the maximum significance that could be achieved by any size-k set of the candidates. Our experiments reveal the versatility of Medusa to accurately detect disease modules and predict gene–disease associations by either flexibly choosing or combining different semantic meanings. The distinct property of Medusa to distinguish diverse semantics enabled Medusa to compare favorably against several alternative methods. These findings put Medusa on the path towards a biomedical data fusion search engine.

Funding

This work was supported by the ARRS (P2-0209, J2-5480) and the NIH (P01-HD39691).