Ideas and Insight supporting all stages of Drug Discovery & Development

Predictive Analytics for Drug Discovery

The drive to predict molecular properties, and thereby reduce lab testing and focus research projects, began before the invention of the computer; a notable early advance was the Hammett equation in the 1930s.[1] The next great step forward came in 1962 from Corwin Hansch,[2] who was one of the first to use computers to perform the calculations. Since then, both computing and analytic technologies have made enormous leaps. However, one element has remained the same since before 1930: the need for data to create predictive models.

An explosion of measurements has occurred in the past 20 years, fueled by technologies that make bioassays easier, particularly high-throughput screening. However, the data reported by the scientific community has been locked in thousands of tables in journal articles and patents. The Elsevier Reaxys Medicinal Chemistry (RMC) product extracts the numeric data, along with the target, the assay type, and other information, making it a powerful resource for building predictive models of protein-ligand binding.

The RMC data includes millions of data points for thousands of targets, encompassing hundreds of assays. To build predictive models from this data, we used the open-source KNIME toolset and the well-tested R statistics system as a framework to gather the information from RMC, normalize it, and apply sophisticated predictive modeling techniques. The process includes model validation: data withheld from model building (the test set) is used to test each model’s predictive ability and to measure its expected error of prediction. Figure 1 shows an example of predicted vs. actual data for compounds binding to the protein EGFR (P00533).

Figure 1 Predictive Model for EGFR.
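
The normalize, model, and validate steps described above can be illustrated with a minimal Python sketch. It is not the KNIME/R workflow used for the RMC models: the data is synthetic, the single numeric descriptor and the straight-line model are stand-ins for real descriptors and more sophisticated techniques, and the conversion of IC50 values to pIC50 is an assumed form of normalization. All function names are hypothetical.

```python
import math
import random


def to_pic50(ic50_nm):
    """Convert an IC50 in nanomolar to pIC50 (-log10 of the molar IC50)."""
    return 9.0 - math.log10(ic50_nm)


def split_train_test(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of the records and hold out a fraction as the test set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(train):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(model, data):
    """Root-mean-square error of the model's predictions on the given data."""
    slope, intercept = model
    squared = [(slope * x + intercept - y) ** 2 for x, y in data]
    return math.sqrt(sum(squared) / len(squared))


# Synthetic stand-in for extracted measurements: (descriptor, IC50 in nM).
# These values are generated, not taken from RMC.
rng = random.Random(42)
raw = []
for i in range(100):
    x = i / 10
    true_pic50 = 0.4 * x + 5.0 + rng.gauss(0, 0.3)
    raw.append((x, 10 ** (9.0 - true_pic50)))

# Normalize, split, fit on the training set only, then score on the test set.
records = [(x, to_pic50(ic50)) for x, ic50 in raw]
train, test = split_train_test(records)
model = fit_line(train)
test_rmse = rmse(model, test)
print(f"train={len(train)} test={len(test)} test RMSE={test_rmse:.2f} log units")
```

Because the test compounds play no part in fitting, the RMSE on them estimates the error to expect on new compounds, which is the role the test set plays in the validation step described above.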

Extending this concept further, we can build a large number of predictive models to form an entire simulated screening panel, as we did for a set of diverse kinases, shown in Figure 2. This allows prediction not only of a compound’s activity, but also of its selectivity for a particular kinase or set of kinases.
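
One way to turn a panel of per-kinase predictions into a selectivity estimate is sketched below in Python. The article does not state how selectivity is calculated, so the measure used here (the difference in predicted pIC50 between the intended target and each other kinase) is an assumption. The function names are hypothetical and the predicted values are invented for illustration.

```python
def selectivity_profile(predicted_pic50, target):
    """Difference in log units between the target and each other kinase.

    Positive values mean the compound is predicted to be more potent
    against the target than against that off-target kinase.
    """
    target_value = predicted_pic50[target]
    return {
        kinase: target_value - value
        for kinase, value in predicted_pic50.items()
        if kinase != target
    }


def selectivity_window(predicted_pic50, target):
    """Smallest margin over any off-target kinase (the worst case)."""
    return min(selectivity_profile(predicted_pic50, target).values())


# Invented predictions for one compound across a small panel; in practice
# each value would come from a separate per-kinase model.
panel = {"EGFR": 8.2, "ERBB2": 6.9, "SRC": 5.4, "ABL1": 5.0}

for kinase, margin in sorted(selectivity_profile(panel, "EGFR").items()):
    print(f"EGFR vs {kinase}: {margin:+.1f} log units")
print(f"worst-case window: {selectivity_window(panel, 'EGFR'):.1f} log units")
```

A margin of one log unit corresponds to a tenfold difference in predicted potency. Since every prediction carries its own expected error, a margin is only meaningful when it is large relative to the test-set errors of the models involved.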

Among the next steps the R&D Life Science Solutions team is investigating is the use of deep-learning systems to analyze the complete set of bioactivities, structures, and known toxicities of the compounds, in order to relate specific activities to toxicities observed in animals and humans. This will allow identification of simple in-vitro screens that may be used as markers to help predict in-vivo toxicities.