Discovering and connecting data to build new algorithms, AI and machine learning approaches

Leveraging the UK’s renowned leadership in drug discovery – a modern approach to R&D

Virtual R&D is the delivery of medicines R&D using multiple best-in-class partners, ensuring the critical value-based experiments are done by the right people, data is captured, and IP-estate is secured.

Brokering easier access to consented patient data and samples

The UK has millions of samples and billions of data points collected from UK patients who have agreed that their samples and data can be used for research. But small UK medical research companies struggle to access them.

Sparse data typically arise when compounds are being assayed against multiple biological targets, but not all compounds are assayed against all targets. Or, it can result from aggregating different literature studies with only partly overlapping compound sets and biological targets. A small sparse array of pIC50s might look like this:

Figure 1

The question then arises, can we make inferences about the missing data using the known data?

Can we predict, with any confidence, the missing values in this matrix based on ones we know? This would, for example, enable a project to reduce screening costs, or develop criteria to decide when to stop further screening or enable integration of multiple literature studies. This may appear to be a something-for-nothing aim but, the underlying latent structure in the data means that we can make surprisingly good estimates. But first, a detour to Netflix.

Look again at a data bioactivity array (Figure 2, below). Now replace the rows (proteins) with Netflix subscribers and replace the columns (compounds) with film titles, to generate Figure 3. Let each value in the array be a preference from 1 to 10, where 1 is hating the film and 10 is loving it.

Figures 2 and 3

This is the kind of data Netflix gets from user reviews – everyone has seen some films and reviewed them, but most individuals have not seen all films, and it would be very useful to be able to predict the missing preferences.

The potential for drug discovery

While the entertainment industry is a very different domain from drug discovery, the underlying problem here is the same – filling in the missing values – and the potential for borrowing methods used by them is there. Netflix were very keen to impute the missing values so that they could predict which film a subscriber would enjoy, recommend it to them and get another sale. It is the basis of the “if you liked that, you’ll like this” system.

We are just as keen to predict biological activities: “if your kinase liked that compound, it will love this one”

The success of this approach relies on underlying correlations between preferences of different users. Jim and Christine, from figure 3, both have similar film preferences. Therefore, it is likely Jim will like films that he has not seen, but that Christine has liked; in this case, Love Actually. Netflix use a technique based on matrix factorisation to extract this latent structure and make predictions.

Success with bioactivity data

Similar approaches have been used for bioactivity data with considerable success. If kinases A and B show similar sensitivity profiles, to compounds they have both been tested on, it is more likely that they will each be sensitive to compounds only the other has been tested against.

This factorisation method can only capture linear relationships between preferences, or data in general. Many problems, such as activity of a compound against a panel of biological targets, do not vary smoothly and generally display non-linear relationships.

Deep Learning is well-suited to dealing with non-linear relationships, let’s see what happened when Intellegens applied it to this bioactivity imputation problem.

Initially, they set the missing values to the mean of the known values; then iteratively improved the estimates using a deep neural network until convergence. They tested the method on a kinase data set: with 54% of the data used for model training, 14% for optimising hyperparameters and the remaining 32% as the cross validation set to test the model predictions in an unbiased manner.

Their approach outperformed matrix factorisation and traditional machine learning approaches, such as Random Forest, by a large margin. Performing similarly to a method developed at Novartis when predicting all the missing values.

A particularly attractive feature of the Intellegens approach is that, not only is the prediction made but, unlike the Novartis method, the confidence associated with each prediction can be generated. As a result, they could identify which predicted values were most likely to be correct. They were able to show that the R2, with the experimental values, rose from 0.445, when predicting all the values; through 0.7, when using the most confident 50% of predictions; to 0.9, when using only the most confident 1%. This allows screening resources to be focused on the most interesting or most uncertain values.

Seeing the applicability of their approach, more generally, whilst building a picture of how it behaves will be intriguing. Correlation Coefficients are a global estimate of model quality and the local behaviours around e.g. activity cliffs will be of interest.

About the author

Dr Andrew Pannifer is Lead Scientist in Cheminformatics at Medicines Discovery Catapult.

After a PhD in Molecular Biophysics at Oxford University, mapping the reaction mechanism of protein tyrosine phosphatases, he entered the pharmaceutical industry in 2002. Firstly at AstraZeneca and then at Pfizer, he performed structure-based drug design and crystallography, and in 2010 joined the CRUK Beatson Institute Drug Discovery Programme to start up Structural Biology and Computational Chemistry.

Sign up for our newsletter

First Name

Last Name

Email

Complete this form and we'll send you our monthly newsletter, and occasional alerts when we have major announcements and collaboration opportunities. Every email we send will include a link to unsubscribe, and after 12 months we'll ask you if you want to stay opted-in. We won't share your data with any third-party. For more details, read our Privacy Policy

To comply with EU directives we now provide detailed information about the cookies we use. To find out more about cookies on this site, what they do and how to remove them, see our information about cookies. Click OK to continue using this site.OkRead more