Network Edge Prediction: Estimating the prior

We discussed in a different thread why we believe that including the prior is a correct way of mitigating the problems posed by self-testing. In one of the comments, I derived a formula for the prior probability $$p_{ij}$$ of a certain couple $$C_i\text{-}D_j$$ to have a treatment edge, only based on the source/target degrees $$n_{CtD}(C_i)$$ and $$n_{DtC}(D_j)$$. Unfortunately, this formula is wrong.

Here, we want to discuss why the formula is wrong, find a more accurate analytical formula, and explore ways to estimate programmatically this prior probability.

The previous formula

The reasoning was flawed at the point where we combined the two correct "asymmetrical" probabilities into our final prior probability $$p_{ij}$$. Indeed, the "asymmetrical" probability of a source node $$C_i$$ connecting a target node drawn at random can be accurately expressed as the ratio of the degree $$n_{CtD}(C_i)$$ of this node over the total number of target nodes $$N_D$$. Nevertheless, multiplying the two probabilities $$p_t(C_i\text{-}D) = \frac{n_{CtD}(C_i)}{N_D}$$ and $$p_t(D_j\text{-}C) = \frac{n_{DtC}(D_j)}{N_D}$$ is not equal to the prior probability $$p_{ij}$$, but merely the joint probability of $$C_i$$ to treat a disease drawn at random, and $$D_j$$ to be treated by a Compound drawn at random.

This formula appeared obviously wrong since the sum of the prior probabilities over all potential is not equal to anything meaningful, and certainly not the expected number of treatments $$N_T$$:

Prior probability estimation is a hard problem

Deriving the prior probability $$p_{ij}$$ from the individual source/target degrees is harder than we first thought. To grasp the complexity of the problem, we can see that the prior probability for one particular couple $$C_i\text-D_j$$ is not only dependant on the degrees of the two nodes $$C_i$$ and $$D_j$$, but on the degrees of all the Compound and Disease nodes. As a result, it is complicated (maybe impossible) to derive an exact analytical solution, and it might be desirable to strive for a 'mean-field approximation' solution.

In the meantime, we can brute-force the estimation of the prior probability, based on many random source/target subnetworks that have the same degrees than ours.

Estimating the prior probability of treatment via permutation

Problem: We have a bipartite graph of compounds and diseases connected by treatment edges. Using only node degrees, what is the probability that a given compound treats a given disease.

Since compounds with the same degree are equivalent and diseases with the same degree are equivalent, we can focus on estimating the probability of a degree pair ($$n_{CtD}(C_i)-n_{DtC}(D_i)$$) rather than a specific compound–disease pair.

Since we struggled to solve this problem analytically, we applied a brute-force permutation approach. Specifically, we took our empiric bipartite treatment network (PharmacotherapyDB) with 755 treatments and derived 744,975 permuted networks.

We applied the XSwap method of permutation [1, 2, 3] and found 2,265 (755 × 3) attempted swaps provided sufficient randomization [4] (notebook). After every permutation, we recorded the percent of possible edges that were present for each degree pair. The final result is the prior probability that each degree pair exists (table).

Challenge: Can @alizee or others derive a general formula from these probabilities? For context here are the compounds and diseases — note that nodes with zero treatments (zero-prior nodes) can be ignored. The pseudo-linear relationship we see between our permuted prior probabilities and our faulty theoretic prior probabilities (scaled $$n_{CtD}(C_i) \cdot n_{DtC}(D_j)$$) after logit transformation suggests an analytic solution may exist:

Another interesting visualization shows the log-transformed probability of an edge existing for all compound–disease degree pairs (notebook). The top facet shows the prevalence in the unpermuted treatment network:

Now here is the same plot without transforming the probabilities. Notice the high probabilities for the existence of treatment between high degree compounds and diseases.

The incredible disparity in prior probabilities based on treatment degree reinforces the need to explicitly model the phenomenon.

Antoine Lizee: We knew there would be a great deal of disparity, at least in the number of different values (hence the high auroc) - the surprise to me is the high value of the highest probability. Could we get the updated AUROC from this empiric prior? Also,

The pseudo-linear relationship we see ... suggests an analytic solution may exist

I agree only approximately :-) The intricate relation that we see suggest on the contrary that there is no easy exact solution. We should focus on getting a good approximation.