Overview

You are invited to participate in a data mining contest associated with the prestigious 2011 SIAM International Conference on Data Mining (Society for Industrial and Applied Mathematics) in Arizona, USA, and organized by Simulations Plus, Inc. The task is to predict an important biological property of molecules from their chemical structures, a routine task in the domain of cheminformatics. The modeling dataset contains two classes of chemical molecules: one with and one without this property. Using this dataset, contestants will build a binary classification model with the goal of classifying new molecules into the two classes. A prize of 1000 USD will be awarded to the winning group. After the contest, top solutions will be presented at the SIAM ICDM’11 Contest Workshop.

Introduction

The pharmaceutical industry exists for, and profits from, the discovery and development of new medicines. The current cost of bringing a new drug to market is huge, reaching 1.7 billion US dollars and 10-15 years of continuous research. Until the late 1990s, the modus operandi of the pharmaceutical industry involved mainly slow and costly experimental work in laboratories and clinics. This stood in sharp contrast to the automotive and aerospace industries, where the final products, automobiles and airplanes, are designed primarily on the computer and experimental work is done at the end to test prototypes that are ready to drive or fly. A lecturer at a conference in 2001 jokingly commented on this situation: "If pharmaceutical industry were to design airplanes, they would build a sequence of prototypes, at a cost of 10 million USD per month, and then they would push these prototypes off a cliff to see whether or not they could fly."

Fortunately, in the last decade, a significant paradigm shift has taken place wherein computer-based simulations and data mining approaches are gradually replacing costly experiments. Consider orally dosed drugs as an example. Patients much prefer swallowing a pill or capsule each morning to driving to a medical clinic for an intravenous injection. Hence, oral bioavailability is a chief requirement of a successful pharmaceutical product. Since all living organisms on Earth, including humans, are based on water as a solvent system, oral drugs must first dissolve in the intestinal fluids before they can be absorbed and distributed to the site of pharmacological action. Solubility in water is thus a critical parameter for all chemical compounds that are to become drug candidates.

In the past, pharmaceutical researchers would chemically synthesize thousands of such compounds and then experimentally measure their water solubilities, along with other critical properties, a process that was both slow and costly. A desire to eliminate many of these unnecessary costs has accelerated the emergence and acceptance of the science of cheminformatics, roughly defined as data mining in chemical space. Based on the concept that "similar chemical compounds have similar properties", one would take existing solubility data and build statistical correlative models that map structures of chemical molecules (which are points in chemical space) to observed water solubility. Solubilities of novel chemical compounds would then not have to be measured. Instead, one would simply draw the structure of a completely new molecule on the computer screen and submit it to the correlative model to predict its solubility.

Correlative computer modeling of properties of chemical molecules is widely known as Quantitative Structure-Property Relationships, abbreviated QSPR. A special subclass of QSPR involving biological activities of chemicals is named, appropriately, Quantitative Structure-Activity Relationships (QSAR). Another common name is "in silico methods" (i.e., in the computer), coined by loose analogy to the Latin terms in vivo (experiments performed in a living organism) and in vitro (laboratory experiments conducted in non-living systems). [1,2]

With a sufficient number of QSAR/QSPR models, a pharmaceutical researcher would not even have to synthesize new compounds whose properties are predicted to be unpromising. Of course, this perfect situation never happens. All QSAR/QSPR models are far from ideal, carry significant prediction error, and have a limited domain of applicability. Still, QSAR/QSPR can be quite useful in reducing the cost of pharmaceutical research. For example, a researcher might draw 1000 chemical structures on a computer, process them through an appropriate water solubility model, sort them by solubility estimate in descending order, and throw away, e.g., the bottom 25%. Thus, the 250 compounds with projected low solubility would not have to be synthesized and measured. The reality is far more complex than this simplified example, but the basic principle remains: use QSAR/QSPR models to screen out potential drug candidates with unwanted properties. Many large pharmaceutical companies have entire departments dedicated to QSAR/QSPR model building and processing.
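The screening example above can be sketched in a few lines of Python. This is only an illustration: the function name `screen_by_solubility`, the compound identifiers, and the predicted values are all made up, and the predictions stand in for the output of some solubility model that is not shown.

```python
def screen_by_solubility(predicted, drop_fraction=0.25):
    """Rank compounds by predicted solubility and drop the lowest fraction.

    `predicted` maps a compound identifier to a predicted solubility value.
    Returns (kept, dropped) as lists of identifiers, most soluble first.
    """
    ranked = sorted(predicted, key=predicted.get, reverse=True)
    n_drop = int(len(ranked) * drop_fraction)
    cut = len(ranked) - n_drop
    return ranked[:cut], ranked[cut:]


# Hypothetical predictions for 1000 drawn structures (made-up values).
predictions = {f"cmpd_{i:04d}": -0.01 * i for i in range(1000)}
kept, dropped = screen_by_solubility(predictions)
print(len(kept), len(dropped))  # prints: 750 250
```

Only the 750 retained compounds would then move on to synthesis and measurement.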

Figure 1. Translation of a chemical structure into an exemplary vector of molecular descriptors for computer processing.

With human creativity as the only limit, many thousands of descriptors have been identified. [3] Choosing the “right” descriptors for modeling a particular property is the first critical step towards a successful QSPR/QSAR model. The chosen descriptors must be diverse enough to encompass all the important mechanistic aspects related to the modeled property. For example, molecular size and hydrogen bonding are among the crucial details required for predicting water solubility, and neglecting such descriptors will surely result in a poor model. Since raw values of different molecular descriptors are calculated on different scales, normalization to a common scale (usually 0 to 1) is required prior to modeling.
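As a minimal sketch of the normalization step, assuming simple min-max scaling is what is meant by the 0 to 1 scale, each descriptor column can be rescaled independently. The function name `min_max_scale` and the descriptor values below are hypothetical.

```python
def min_max_scale(column):
    """Rescale a list of raw descriptor values to the 0-1 range."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant descriptor carries no information; map it to 0.0.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


# Made-up raw values for two descriptors on very different scales.
molecular_weight = [180.2, 151.2, 206.3, 46.1]
h_bond_donors = [1, 2, 1, 1]

print(min_max_scale(molecular_weight))
print(min_max_scale(h_bond_donors))  # prints: [0.0, 1.0, 0.0, 0.0]
```

In practice the minimum and maximum would be taken from the training molecules and reused when scaling new molecules, so that both are expressed on the same scale.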

Next, enough experimental data must be collected for n distinct, diverse molecules to form the training set. The n measured values of the property of interest, Y, are then collected into an n-dimensional vector y. Once all m descriptors are calculated for each training molecule, an n × m molecular description matrix X is formed, where the n rows represent individual molecules in the training dataset and the m columns are normalized descriptors.

In the QSAR/QSPR world, the molecular description matrix is almost always poorly conditioned. Pairwise and multiple descriptor correlations are very common, underrepresented descriptors occur quite often, and the intrinsic dimensionality of the experimental data is usually smaller than the number of calculated descriptors. Therefore, the second critical step is the selection of a proper subset of m' molecular descriptors, giving an n × m' submatrix X' (where m' ≤ m), such that the model is predictive and satisfies the principle of parsimony. Model training, which proceeds simultaneously with descriptor selection, is already familiar to data mining practitioners: find a mathematical function f such that the following mapping:

y' = f(X')

produces a vector y' of predicted values that closely reproduces the original measured values y, subject to optimal predictivity constraints. For example, regression models would minimize the norm ||y' - y||, while classification model training would result in an f that optimizes one of the binary classification statistics.
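The descriptor selection step described above can be illustrated with a very simple filter that discards constant descriptors and descriptors strongly correlated with one already kept. This is a sketch of one basic approach, not the method used by the contest organizers; the names `pearson` and `select_descriptors`, the 0.95 threshold, and the data are assumptions. Such a filter addresses only redundancy and says nothing about how predictive the retained descriptors are.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def select_descriptors(columns, max_abs_corr=0.95):
    """Greedy filter: keep a column unless it is constant or strongly
    correlated with a column that was already kept.

    `columns` maps descriptor name -> list of n normalized values.
    Returns the names of the m' retained descriptors, in input order.
    """
    kept = []
    for name, values in columns.items():
        if max(values) == min(values):
            continue  # constant descriptor, no information
        if any(abs(pearson(values, columns[k])) > max_abs_corr for k in kept):
            continue  # redundant with an already-kept descriptor
        kept.append(name)
    return kept


# Made-up normalized descriptors for n = 5 molecules.
descriptors = {
    "size": [0.0, 0.25, 0.5, 0.75, 1.0],
    "size_copy": [0.0, 0.26, 0.5, 0.74, 1.0],  # nearly duplicates "size"
    "h_bond": [1.0, 0.0, 0.5, 0.0, 1.0],
    "flag": [0.0, 0.0, 0.0, 0.0, 0.0],  # constant
}
print(select_descriptors(descriptors))  # prints: ['size', 'h_bond']
```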

We must clearly state that the goal of QSAR/QSPR modeling is NOT a perfect reproduction of the training data y. Instead, y is used to guide the development of a model with the highest predictivity. Predictivity is defined as the model’s performance in predicting property Y for new molecules, and it determines the model’s usefulness to the end user. It can be evaluated by measuring Y for these new molecules and comparing the measurements against the predictions. Models that are driven to reproduce training data as closely as possible are usually overtrained and tend not to be predictive. In general, statistics calculated on the training set are useless in judging a model’s predictivity. A much better strategy is to take out a part of the training set and set it aside solely for the purpose of evaluating predictivity (a simulated validation set). A proper data division, in which the validation set is representative of the whole training set, is a challenge in itself. Cross-validation schemes, especially the “leave-one-out” flavor, are usually not advised, as they have been shown to be unreliable in QSAR/QSPR. [4] Similarly, many other strategies used successfully in other areas of data mining may prove insufficient. Any prospective QSAR/QSPR modeler is strongly encouraged to consult the cited literature for principles and tips on proper model building. [1,2]
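A minimal sketch of the hold-out strategy described above follows. It assumes a plain random split and uses balanced accuracy as one possible binary classification statistic; the text does not say which statistic the contest uses, and a random split does not by itself guarantee a representative validation set. The function names, the 20% fraction, and the labels are hypothetical.

```python
import random


def holdout_split(n_samples, validation_fraction=0.2, seed=0):
    """Randomly split row indices into (training, validation) lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = int(n_samples * validation_fraction)
    return sorted(indices[n_val:]), sorted(indices[:n_val])


def balanced_accuracy(y_pred, y_true):
    """Mean of sensitivity and specificity for labels coded 1 and 0.

    Assumes both classes occur in y_true.
    """
    pairs = list(zip(y_pred, y_true))
    tp = sum(1 for p, t in pairs if p == 1 and t == 1)
    tn = sum(1 for p, t in pairs if p == 0 and t == 0)
    fp = sum(1 for p, t in pairs if p == 1 and t == 0)
    fn = sum(1 for p, t in pairs if p == 0 and t == 1)
    return (tp / (tp + fn) + tn / (tn + fp)) / 2


train_idx, val_idx = holdout_split(100)
print(len(train_idx), len(val_idx))  # prints: 80 20

# Made-up true labels and model predictions for six validation molecules.
y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1]
print(balanced_accuracy(y_pred, y_true))  # prints: 0.625
```

The model would be fitted on the rows in `train_idx` only, and the statistic computed on the rows in `val_idx` would serve as the estimate of predictivity.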