Assessing the informativeness of features

Preliminary feature computation

Background

Our method for hetnet edge prediction [1] works by quantifying the connectivity between a source and target node. For this project, source nodes are compounds and target nodes are diseases. To extract a feature from the network, we quantify the prevalence of a specific type of path (metapath) for each compound–disease pair. We use a metric called the degree-weighted path count (DWPC) to quantify the extent to which paths of the specified type connect a compound and disease. The DWPC downweights paths through high-degree nodes, which are less specific and therefore likely less informative. Each metapath thus yields one feature. We evaluate the predictiveness of a feature by whether it discriminates indicated from non-indicated compound–disease pairs.
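As a rough illustration of the downweighting, the sketch below computes a DWPC for one compound–disease pair. It assumes the path-degree-product formulation, in which every edge on a path contributes the metaedge-specific degrees of both of its endpoints, each raised to the power -w. The function names and the toy degrees are hypothetical, and this is not the code used to compute the features.

```python
from math import prod

def path_degree_product(degrees, w):
    """Product of damped degrees along one path.

    `degrees` lists, for every edge on the path, the metaedge-specific
    degree of both endpoints; each degree is raised to the power -w.
    """
    return prod(d ** -w for d in degrees)

def dwpc(paths, w):
    """Sum the path degree products over all paths of one metapath."""
    return sum(path_degree_product(degrees, w) for degrees in paths)

# Hypothetical example: two paths of the same length-2 metapath that
# connect one compound-disease pair. Each path has 2 edges, hence
# 4 degrees per path (two endpoints per edge).
paths = [
    [2, 1, 1, 3],      # path through a low-degree intermediate node
    [2, 40, 50, 3],    # path through a high-degree hub
]
specific, hub = (path_degree_product(p, w=0.4) for p in paths)
print(f"specific path: {specific:.4f}, hub path: {hub:.4f}")
print(f"DWPC: {dwpc(paths, w=0.4):.4f}")
```

With w = 0 every path contributes 1 and the DWPC reduces to a plain path count; larger values of w penalize hub paths more strongly.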

Methods

We computed features for the 261 metapaths with length ≤ 3 for all 1,386 indications and 4,227 non-indications. The 4,227 non-indications were randomly selected from all non-indications. We computed features for 2%, rather than 100%, of non-indications to decrease computation time. This compromise allows us to quickly assess feature-specific performance via AUROC, but does not allow us to make comprehensive predictions or to reliably estimate measures that depend on the balance between positives and negatives, such as AUPRC and properly scaled predicted probabilities.
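To see why subsampling negatives is harmless for AUROC but not for precision-based measures, consider a fixed ROC operating point. The true and false positive rates do not depend on class balance, whereas the precision they imply does. In the sketch below, the operating point is hypothetical and the full negative count is only what the approximate 2% figure implies.

```python
def precision(tpr, fpr, n_pos, n_neg):
    """Precision implied by a fixed ROC operating point and class counts."""
    true_pos = tpr * n_pos
    false_pos = fpr * n_neg
    return true_pos / (true_pos + false_pos)

n_pos = 1386                        # indications (from the text)
n_neg_sampled = 4227                # sampled non-indications (from the text)
n_neg_full = n_neg_sampled / 0.02   # implied by the 2% figure; approximate

# Hypothetical operating point, chosen only for illustration.
tpr, fpr = 0.5, 0.05

sampled = precision(tpr, fpr, n_pos, n_neg_sampled)
full = precision(tpr, fpr, n_pos, n_neg_full)
print(f"precision with sampled negatives: {sampled:.3f}")
print(f"precision with all negatives:     {full:.3f}")
```

The same point on the ROC curve corresponds to a much lower precision once all negatives are included, which is why AUPRC estimated on the subsample would be optimistic.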

We separately assessed the performance of each of the 261 features (notebook). We used the DWPC with a damping exponent of $$w = 0.4$$, which controls how strongly paths through high-degree nodes are downweighted. We chose $$w = 0.4$$ because it was optimal in our previous study [1] and performance was stable for surrounding parameter choices.

Results

We created a table of feature performance. Scroll to the bottom of this notebook for the abbreviation system used in the metapath column. nonzero indicates the proportion of compound–disease pairs that had at least one path for that metapath. auroc represents the chance that a random indication received a higher DWPC than a random non-indication. Stay tuned to this discussion for further analysis.
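The two columns can be reproduced on toy data as follows. This is a minimal sketch with hypothetical DWPC values and hypothetical function names; it uses the pairwise definition of AUROC directly and counts ties as half a win, which is an assumption that matters here because many DWPCs are exactly zero.

```python
def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win, which matters because many DWPCs are zero.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def nonzero_fraction(scores):
    """Proportion of pairs with at least one path (DWPC > 0)."""
    return sum(s > 0 for s in scores) / len(scores)

# Hypothetical DWPC values for a handful of compound-disease pairs.
indications = [0.0, 0.12, 0.30, 0.05]
non_indications = [0.0, 0.0, 0.08, 0.0, 0.02]

print("auroc:", auroc(indications, non_indications))
print("nonzero:", nonzero_fraction(indications + non_indications))
```

The pairwise loop is quadratic in the number of pairs; a rank-based computation gives the same value and scales better.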

Limitations

There are still a few steps remaining before we can draw conclusions on the mechanisms of efficacy:

Our expert-curated indication catalog is not yet ready. Therefore, an estimated 42% of the 1,386 indications are symptomatic or non-indications.

We haven't yet created permuted networks to compute feature performance on. Permuted networks preserve degree but destroy edge specificity. Much of the current performance is likely attributable to node degree rather than edge specificity. For example, compounds that are indicated for many other diseases are more likely to be indicated for the target disease. Many of our 261 features will capture this effect.
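One common way to build such a permuted network is to repeatedly swap the endpoints of randomly chosen edge pairs. The sketch below shows the idea for a single bipartite edge type. It is an illustration under that assumption, with hypothetical node names and function names, and not necessarily the procedure we will use.

```python
import random
from collections import Counter

def degrees(edges):
    """Count how many edges touch each node."""
    return Counter(node for edge in edges for node in edge)

def permute_edges(edges, n_swaps, seed=0):
    """Shuffle a bipartite edge list by swapping endpoints of edge pairs.

    Swapping (a, b), (c, d) -> (a, d), (c, b) keeps every node's degree
    unchanged. Swaps that would create a duplicate edge are skipped.
    """
    rng = random.Random(seed)
    edges = list(edges)
    edge_set = set(edges)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        new_1, new_2 = (a, d), (c, b)
        if new_1 in edge_set or new_2 in edge_set:
            continue  # would duplicate an existing edge
        edge_set -= {edges[i], edges[j]}
        edge_set |= {new_1, new_2}
        edges[i], edges[j] = new_1, new_2
    return edges

# Hypothetical compound-disease edges.
edges = [("c1", "d1"), ("c1", "d2"), ("c2", "d2"), ("c3", "d3"), ("c3", "d1")]
permuted = permute_edges(edges, n_swaps=20)
print(permuted)
print(degrees(permuted) == degrees(edges))
```

Because degrees are preserved, a feature that still performs well on the permuted network is drawing on node degree rather than on which specific nodes are connected.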

General assessment

All features yielded AUROCs ≥ 0.5. In other words, no features were negatively associated with indication status: greater path prevalence between a compound and disease never resulted in a lower therapeutic likelihood. The lack of negatively associated features is unsurprising given that our network is primarily composed of general relationships. For example, we have a compound–gene edge for targeting but not for agonism or antagonism.

The majority of features had AUROC ≤ 0.53. In other words, most features performed only slightly better than random. However, a quarter of the features had AUROC ≥ 0.60, and five features had AUROC ≥ 0.80. The strong performance of a subset of features is encouraging.

We did not observe major differences in the distributions of AUROCs for features with length 2 versus length 3 metapaths. However, since there are many more metapaths with length 3 than 2, the top performing features were mostly of length 3.

Performance strongly correlated with the fraction of nonzero values per feature. Metapaths traversing sparsely connected areas of the hetnet performed poorly because they yielded $$DWPC = 0$$ for almost all compound–disease pairs.

AUROC and AUPRC were positively correlated. However, features with AUROCs near 0.5 (the random expectation) often had AUPRCs considerably above 0.25 (the random expectation). These features were often > 99% zero. Therefore, we suspect the low-AUROC features produced decent top predictions but poor comprehensive predictions due to sparsity, leading to discordance between AUROCs and AUPRCs [1, 2].
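The discordance can be reproduced with a toy sparse feature. In this sketch the scores are hypothetical, the prevalence is set to 0.25 to match the random expectation quoted above, and AUPRC is computed as a step-wise average precision with tied scores sharing one threshold, which is an assumption about the estimator.

```python
def auroc(pos, neg):
    """AUROC via pairwise comparison, counting ties as half."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(pos, neg):
    """Step-wise area under the precision-recall curve.

    Tied scores are treated as one threshold, so the large block of
    zeros enters the curve as a single point.
    """
    area, prev_recall = 0.0, 0.0
    for threshold in sorted(set(pos) | set(neg), reverse=True):
        tp = sum(p >= threshold for p in pos)
        fp = sum(n >= threshold for n in neg)
        recall = tp / len(pos)
        area += (recall - prev_recall) * tp / (tp + fp)
        prev_recall = recall
    return area

# Hypothetical sparse feature: 1% of pairs are nonzero.
# 250 positives and 750 negatives give a prevalence of 0.25.
pos = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2] + [0.0] * 242
neg = [0.1, 0.05] + [0.0] * 748

print(f"AUROC: {auroc(pos, neg):.3f}")
print(f"AUPRC: {average_precision(pos, neg):.3f}")
```

The few nonzero pairs rank at the top and are all positives, which lifts the AUPRC above the prevalence, while the tied zeros keep the AUROC close to 0.5. How large the AUPRC gain appears depends on how the curve is interpolated across the tied zeros.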

One reason we primarily rely on AUROC rather than AUPRC is to enable comparisons across different prevalences. We may consider using the AUCROC (area under the concentrated ROC [3]) to emphasize top predictions while remaining balance-agnostic.

Feature improvements

We've made several improvements that affect our features. Compared to the previous post, the following updates were made:

We completed the first production version of our hetnet named Hetionet v1.0. See this post for the updated metagraph and type abbreviations.

We created PharmacotherapyDB, which refined our indication catalog to differentiate between disease-modifying and symptomatic indications. Now disease-modifying indications (called treatments or DM) are positives and all other compound–disease pairs (observations) are negatives.

We switched from our pure Python toolset (hetio.pathtools) to Neo4j for computing network-based features. While we maintained the duplicate node exclusion for path traversal, there are some minor differences between the Cypher and hetio.pathtools DWPC algorithms.

We invented a DWPC derivative called residual DWPC (R-DWPC), which is the difference between DWPC and average permuted DWPC for an observation.
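For a single compound–disease pair and metapath, the R-DWPC described above amounts to the following. The values and the function name are hypothetical, and the sketch assumes one permuted DWPC per permuted network.

```python
from statistics import mean

def residual_dwpc(dwpc, permuted_dwpcs):
    """DWPC minus the mean DWPC of the same pair across permuted networks."""
    return dwpc - mean(permuted_dwpcs)

# Hypothetical values for one compound-disease pair and one metapath.
observed = 0.031
permuted = [0.012, 0.018, 0.009, 0.015, 0.011]  # one value per permuted network
print(round(residual_dwpc(observed, permuted), 4))
```

A positive residual means the pair is better connected in the real network than its node degrees alone would suggest.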

AUROC-based metrics for assessing DWPC features

The performance of each of the 1,206 DWPC features is now available (dataset). The dataset consists of the following AUROC-based performance metrics:

dwpc_auroc: the AUROC of a DWPC

pdwpc_auroc: the average AUROC of a DWPC across permuted networks but computed on the positives and negatives corresponding to the unpermuted network

rdwpc_auroc: the AUROC of a residual DWPC

pdwpc_primary_auroc: the average AUROC of a DWPC across permuted networks. This AUROC uses reassigned positives and negatives based on the specific treatment edges in each permuted network

delta_auroc: the dwpc_auroc minus the pdwpc_primary_auroc for a metapath

pval_delta_auroc: the p-value from a t-test of the difference between the dwpc_auroc and the five corresponding primary permuted AUROCs.
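The last two metrics can be sketched as follows. This assumes the t-test is a one-sample, two-sided test of the permuted AUROCs against the unpermuted dwpc_auroc; the text does not specify the variant. The AUROC values are hypothetical, and the p-value is obtained by integrating the t density numerically so that the example needs only the standard library (SciPy's scipy.stats.ttest_1samp would be the usual shortcut).

```python
from math import gamma, pi, sqrt
from statistics import mean, stdev

def t_pdf(x, df):
    """Density of Student's t distribution with `df` degrees of freedom."""
    norm = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return norm * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, steps=2000):
    """Two-sided p-value, integrating the density with Simpson's rule."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0, df) + t_pdf(upper, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    central_mass = 2 * total * h / 3  # probability that |T| <= |t|
    return max(0.0, 1 - central_mass)

def delta_auroc_test(dwpc_auroc, permuted_aurocs):
    """Return (delta_auroc, p-value) for one metapath.

    Assumes the permuted AUROCs are not all identical, since the
    standard error would otherwise be zero.
    """
    delta = dwpc_auroc - mean(permuted_aurocs)
    n = len(permuted_aurocs)
    standard_error = stdev(permuted_aurocs) / sqrt(n)
    return delta, two_sided_p(delta / standard_error, df=n - 1)

# Hypothetical AUROCs for one metapath and five permuted networks.
delta, p_value = delta_auroc_test(0.72, [0.61, 0.63, 0.60, 0.62, 0.64])
print(f"delta_auroc = {delta:.3f}, pval_delta_auroc = {p_value:.2g}")
```

With only five permuted networks the test has four degrees of freedom, so the p-values are coarse and are best read as a rough screen.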

We're adopting the Δ AUROC (delta_auroc) as our main assessment of feature informativeness. The Δ AUROC is preferable to the plain DWPC AUROC because it assesses whether the specific edges of a metapath are predictive. Otherwise predictiveness may be dominated by node degree effects — an often overlooked shortcoming of many network analyses [1, 2]. The R-DWPC AUROC (rdwpc_auroc) is a promising metric for the future but currently suffers from an edge-dropout contamination issue.

Visualizing metaedge informativeness

The following figure shows the Δ AUROC for each of the 1,206 metapaths (notebook). Metapaths are assigned to their composing metaedges, so the CbGaD metapath will show up under Compound–binds–Gene and Disease–associates–Gene. Metaedges are ordered by their max Δ AUROC metapath, under the assumption that a metapath's performance is limited by its least informative metaedge.
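The grouping and ordering behind the figure can be sketched as below. Only the CbGaD decomposition comes from the text; the other metapaths, their decompositions, and all Δ AUROC values are hypothetical placeholders.

```python
from collections import defaultdict

CBG = "Compound–binds–Gene"
DAG = "Disease–associates–Gene"
CTD = "Compound–treats–Disease"       # hypothetical entry for illustration
CRC = "Compound–resembles–Compound"   # hypothetical entry for illustration

# metapath -> (delta_auroc, composing metaedges); values are made up.
metapaths = {
    "CbGaD": (0.12, [CBG, DAG]),
    "CbGbCtD": (0.20, [CBG, CBG, CTD]),
    "CrCtD": (0.30, [CRC, CTD]),
}

def rank_metaedges(metapaths):
    """Order metaedges by the best Δ AUROC among metapaths that use them."""
    by_metaedge = defaultdict(list)
    for delta, metaedges in metapaths.values():
        # dict.fromkeys drops repeated metaedges while keeping their order
        for metaedge in dict.fromkeys(metaedges):
            by_metaedge[metaedge].append(delta)
    return sorted(by_metaedge, key=lambda m: max(by_metaedge[m]), reverse=True)

for metaedge in rank_metaedges(metapaths):
    print(metaedge)
```

A metapath that repeats a metaedge is counted once for that metaedge, so a single metapath cannot inflate a group by appearing in it twice.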

While informativeness varies by metaedge, it appears that almost all data types we integrated are informative of whether a compound treats a disease.