Molecular neural network models with RDKit and Keras in Python

Neural networks are interesting models underlying many of the newest AI applications and algorithms. Recent advances in training algorithms and GPU-enabled code, together with publicly available and highly efficient libraries such as Google’s TensorFlow or Theano, make them highly interesting for modelling molecular data. Here I explore Keras, the high-level neural network library for Python, which can run on top of either Theano or TensorFlow for the underlying number crunching. Having a high-level API in Python for quickly defining the network architecture and training it makes it fast to prototype models of molecular data. For this first model I use a solubility dataset I have used before: http://www.wildcardconsulting.dk/useful-information/wash-that-gold-modelling-solubility-with-molecular-fingerprints/

We start with some imports. I’ll need RDKit for molecular conversion and descriptor calculation, Pandas for loading and data management, scikit-learn for standardisation and data set splitting, and Keras for building and training the neural network.

The next couple of lines read the dataset into a Pandas dataframe object. The last 27 lines of the file contain comments, so they are skipped. I want to use the E-state indices together with the molecular weight, so I define a function that calculates these. Afterwards it’s easy to add two new columns to the dataframe by applying functions to the existing columns. First a column of RDKit mol objects is created from the column containing the SMILES strings, and then the descriptor arrays are calculated from the mols.
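As a rough sketch of these steps, assuming a tab-separated file with hypothetical column names `SMILES` and `solubility`: the snippet below reads a tiny in-memory stand-in for the data file (placeholder values, with two trailing comment lines instead of 27) and builds the descriptor arrays. The function name `calc_descriptors` and the choice of the E-state sums rather than the counts from RDKit's `Fingerprinter.FingerprintMol` are assumptions made for illustration.

```python
import io

import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors
from rdkit.Chem.EState import Fingerprinter

# Hypothetical stand-in for the data file; the solubility values are
# placeholders, not measurements. The last two lines play the role of
# the trailing comment lines.
raw = "\n".join([
    "SMILES\tsolubility",
    "CCO\t1.10",
    "c1ccccc1\t-1.64",
    "CC(=O)O\t1.22",
    "# comment line 1",
    "# comment line 2",
])

# skipfooter drops the trailing lines; it requires the Python parser engine.
data = pd.read_csv(io.StringIO(raw), sep="\t", skipfooter=2, engine="python")


def calc_descriptors(mol):
    """Return the E-state sums followed by the molecular weight as a 1-D array."""
    counts, sums = Fingerprinter.FingerprintMol(mol)
    return np.append(sums, Descriptors.MolWt(mol))


# One new column of RDKit mol objects, one of descriptor arrays.
data["mol"] = data["SMILES"].apply(Chem.MolFromSmiles)
data["descriptors"] = data["mol"].apply(calc_descriptors)

# Stack the per-molecule arrays into a (n_molecules, n_descriptors) matrix.
X = np.stack(data["descriptors"].tolist())
y = data["solubility"].to_numpy().reshape(-1, 1)
print(X.shape, y.shape)
```

Reshaping `y` to a column here keeps it the same shape as the predictions of a single-output network later on.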

Now the fun begins. Keras uses a base model object and adds the layers and activations to it as further objects, in sequential order. The first layer must be given an input dimension matching the data, whereas the following layers can deduce their input size from the previous layer. It’s a very simple model with a small hidden layer of 5 neurons with sigmoid activation. Larger networks tend to overfit unless some form of regularization is put in place (such as early stopping or dropout). As the output values are continuous rather than class labels, the output layer is a single neuron with a linear activation.
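A minimal sketch of such a model is shown here, written against the current Keras API rather than the version available when this post was written, so argument names may differ from the original code. The input size of 80 is an assumption (79 E-state values plus the molecular weight).

```python
import keras

n_features = 80  # assumption: 79 E-state values plus molecular weight

model = keras.Sequential()
# Only the input needs an explicit size; later layers infer theirs.
model.add(keras.Input(shape=(n_features,)))
model.add(keras.layers.Dense(5))                 # small hidden layer
model.add(keras.layers.Activation("sigmoid"))
model.add(keras.layers.Dense(1))                 # single output neuron
model.add(keras.layers.Activation("linear"))     # regression output

model.summary()
```

With these sizes the network has 80 × 5 + 5 weights and biases in the hidden layer and 5 + 1 in the output layer, so there is not much capacity to overfit with.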

Before the network can be trained using the Theano backend, the optimizer algorithm and the loss function need to be defined. The loss function is set to the usual mean squared error, and the optimizer is stochastic gradient descent with a learning rate of 0.01 and a momentum of 0.9, using the Nesterov modification of the momentum update.
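The compile and fit steps could look roughly like the following sketch. It uses the current Keras API and synthetic random data in place of the real descriptors, and the split ratio, number of epochs and batch size are assumptions chosen only to make the example run quickly.

```python
import numpy as np
import keras
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 200 samples with 80 descriptors.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 80))
y = X[:, :3].sum(axis=1, keepdims=True) + 0.1 * rng.normal(size=(200, 1))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the scaler on the training set only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

model = keras.Sequential([
    keras.Input(shape=(X_train.shape[1],)),
    keras.layers.Dense(5, activation="sigmoid"),
    keras.layers.Dense(1, activation="linear"),
])

# SGD with learning rate 0.01, momentum 0.9 and the Nesterov variant.
sgd = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
model.compile(loss="mean_squared_error", optimizer=sgd)

history = model.fit(X_train, y_train, epochs=20, batch_size=32, verbose=0)
y_pred = model.predict(X_test, verbose=0)
print(history.history["loss"][0], history.history["loss"][-1])
```

Older Keras releases spelled the learning rate argument `lr`, so the optimizer line may need adjusting for the version used at the time.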

I was pleasantly surprised by how quickly the network fitted, even though I could only use the CPU and not a GPU on my laptop.

After fitting, it was easy to predict the values of the test set and make a quick plot. The RMS error doesn’t look that good compared with previous models I’ve built, but this was meant as a fast test of Keras and Theano on an easy dataset.

Comment

Thanks to Brian Barnes for drawing my attention to an unintended broadcast during the RMS calculation, which made the RMS look much worse than it really was. The code above has been corrected with a .reshape(-1,1) operation.
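To illustrate the kind of broadcast involved, here is a small NumPy-only sketch with made-up numbers. It assumes the true values were a flat array of shape (n,) while the network predictions had shape (n, 1); which of the two arrays was reshaped in the corrected code is not shown here.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])        # shape (3,), a flat array
y_pred = np.array([[1.0], [2.0], [3.0]])  # shape (3, 1), one column per output

# Unintended broadcast: (3,) minus (3, 1) gives a (3, 3) matrix holding
# every pairwise difference, not the three per-sample errors.
wrong = y_true - y_pred
rms_wrong = np.sqrt(np.mean(wrong ** 2))

# After reshaping, both arrays are (3, 1) and subtract element by element.
right = y_true.reshape(-1, 1) - y_pred
rms_right = np.sqrt(np.mean(right ** 2))

print(wrong.shape, rms_wrong)
print(right.shape, rms_right)
```

Even though the predictions here are perfect, the broadcast version reports a clearly non-zero RMS, which is why the error looked inflated.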