A deep Tox21 neural network with RDKit and Keras

I found some interesting toxicology datasets from the Tox21 challenge and wanted to see if it was possible to build a toxicology predictor using a deep neural network. I don’t know how many layers a neural network actually has to have to be called “deep”, but it’s a buzzword, so I’ll use it ;-). The idea and hope of deep learning is to let the network learn a hierarchy of more and more abstract concepts by passing the input sequentially through layers of neurons, where one layer’s output is used as input to the next. Such networks can be trained in a supervised fashion via the backpropagation algorithm, which updates the weights based on the mistakes the network makes when it tries to predict labeled samples. At the end of the blog post I’ll compare the performance of the deep neural network with a simpler logistic regression model with regularization.

Deep networks are prone to overfitting, as the number of parameters in the network quickly adds up. In this example network I use a Morgan circular fingerprint with 8192 bits and three fully connected dense layers as well as a single output neuron. The first layer will have 8192*80 connections to it, the next two 80*80 each, and the output neuron 80, so the total number of weights that need fitting (not counting biases) is 668240. This is way past the number of available samples in the Tox21 datasets, so overfitting is likely.
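The weight count quoted above can be checked with a few lines of Python. This is only a small sketch: it counts connection weights and ignores bias terms, the layer sizes are taken from the text, and the variable names are purely illustrative.

```python
# Layer sizes: 8192 fingerprint bits, three hidden layers of 80 neurons,
# and a single output neuron.
layer_sizes = [8192, 80, 80, 80, 1]

# Each fully connected layer has (inputs * outputs) connection weights.
weights_per_layer = [n_in * n_out
                     for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
total_weights = sum(weights_per_layer)

print(weights_per_layer)
print(total_weights)
```

Adding one bias per neuron would increase the number slightly, but it does not change the conclusion that there are far more parameters than samples.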

Luckily there are ways to balance the bias/variance trade-off by means of regularization. By adding a penalty for high weights, the network can be forced to use smaller weights and many small contributions, instead of balancing large weights with opposite signs to get the prediction of the train set 100% correct. I have covered the use of L2 regularization in a previous blog post. Another regularization technique for neural networks is dropout. Here a percentage of the activations are randomly dropped between layers during the training phase of the network. This makes it more difficult for the network to let the output of one neuron depend too much on the output of the others, and so breaks the dependence between the neurons. This should lead to single neurons doing more generalized work. The Tox21 datasets are in the range of a few thousand samples, so both techniques are used in the example below.
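To make the two ideas a bit more concrete, here is a minimal NumPy sketch. It is not how Keras implements them internally: the helper names dropout and l2_penalty, the penalty strength and the toy weight vectors are all assumptions chosen for illustration. The dropout shown is the common "inverted" variant, which rescales the surviving activations so their expected value stays the same.

```python
import numpy as np

rng = np.random.default_rng(42)


def dropout(activations, rate, rng):
    """Inverted dropout: zero a random fraction `rate` and rescale the rest."""
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)


def l2_penalty(weights, strength):
    """L2 cost: strength times the sum of squared weights."""
    return strength * np.sum(weights ** 2)


# Two toy weight vectors that give the same output for the input x = [1, 1].
x = np.ones(2)
balanced = np.array([0.5, 0.5])      # small contributions
opposing = np.array([10.5, -9.5])    # large weights with opposite signs

print(balanced @ x, opposing @ x)
print(l2_penalty(balanced, 0.01), l2_penalty(opposing, 0.01))

# Dropping 50% of a layer's activations during training.
hidden = np.ones((1000, 80))
dropped = dropout(hidden, rate=0.5, rng=rng)
print(dropped.mean())
```

Both weight vectors produce the same prediction, but the L2 penalty is far larger for the pair of large opposing weights, which is why the penalty pushes the network towards many small contributions.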

Now for the more interesting part: building the neural net with Keras and training it. The first layer is a dropout layer, so 20% of the incoming features are randomly dropped. Then follow three dense layers, each with both 50% dropout and weight regularization. The last layer is a single neuron with a sigmoid activation function.
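A network with this layout could be sketched in Keras roughly as follows. The layer sizes and dropout rates come from the text; the function name build_model, the ReLU activation in the hidden layers and the L2 strength of 1e-3 are assumptions made for this sketch and not necessarily the settings used for the results in this post.

```python
import keras
from keras import layers, regularizers


def build_model(n_bits=8192, n_hidden=80, l2_strength=1e-3):
    """Input dropout, three regularized dense layers, one sigmoid output."""
    stack = [
        keras.Input(shape=(n_bits,)),
        layers.Dropout(0.2),  # randomly drop 20% of the incoming features
    ]
    for _ in range(3):
        stack.append(layers.Dense(
            n_hidden,
            activation="relu",  # assumed activation
            kernel_regularizer=regularizers.l2(l2_strength),  # assumed strength
        ))
        stack.append(layers.Dropout(0.5))  # 50% dropout after each dense layer
    stack.append(layers.Dense(1, activation="sigmoid"))
    return keras.Sequential(stack)


model = build_model()
model.summary()
```

Note that the parameter count reported by Keras includes one bias term per neuron, so it is slightly higher than a count of the connection weights alone.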

During training the learning rate is reduced when no drop in the loss function is observed for 50 epochs. This is conveniently done via the Keras callback ReduceLROnPlateau. The objective to minimize is the binary_crossentropy plus the cost from the weight regularization. This makes the reported loss on the train set bigger than the reported loss on the validation set, which can make it confusing to see if there’s potential overfitting. So binary_crossentropy is also added as an additional metric. Keras reports this metric at the end of each epoch, which makes it possible to compare the training loss with the validation loss.
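The training setup described here might look roughly like the sketch below. To keep it quick to run, it uses a tiny stand-in model and random placeholder data rather than the Tox21 fingerprints. The monitored quantity (val_loss), the reduction factor, the minimum learning rate, the optimizer and the metric name "bce" are assumptions; only the patience of 50 epochs and the use of ReduceLROnPlateau come from the text.

```python
import numpy as np
import keras
from keras import layers, regularizers

# Random placeholder data standing in for fingerprints and labels (not Tox21).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 64)).astype("float32")
y = rng.integers(0, 2, size=(200,)).astype("float32")

# Tiny stand-in model with L2 weight regularization.
model = keras.Sequential([
    keras.Input(shape=(64,)),
    layers.Dense(8, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-3)),
    layers.Dense(1, activation="sigmoid"),
])

# The loss includes the regularization cost; the extra metric tracks the
# plain cross-entropy without that cost.
model.compile(
    optimizer="adam",  # assumed optimizer
    loss="binary_crossentropy",
    metrics=[keras.metrics.BinaryCrossentropy(name="bce")],
)

# Reduce the learning rate when the monitored loss has not improved
# for 50 epochs (factor and min_lr are assumed values).
reduce_lr = keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.5, patience=50, min_lr=1e-6)

history = model.fit(X, y, validation_split=0.25, epochs=3, batch_size=32,
                    callbacks=[reduce_lr], verbose=0)

# Per-epoch values for training and validation cross-entropy.
print(history.history["bce"])
print(history.history["val_bce"])
```

Plotting history.history["bce"] against history.history["val_bce"] then gives a like-for-like view of training and validation cross-entropy.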

For the logistic regression model, the AUC on the test set is 0.63, which is a lot lower than that of the neural network model. One of the overall best performing algorithms of the Tox21 challenge was a deep neural network. However, its authors used extra “tricks” such as more neurons, a lot more features in the form of descriptors and fingerprints, and co-modelling of endpoints, as well as careful optimization with cross validation between compound classes found with ECFP4 similarity. Their paper is open access here: doi: 10.3389/fenvs.2015.00080
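The logistic regression comparison mentioned in the introduction could be set up roughly as follows with scikit-learn. This is only a sketch on synthetic data generated with make_classification, so the AUC it prints says nothing about Tox21; the value C=0.1 and the data dimensions are assumptions, and in practice the fingerprint matrix and toxicity labels would take the place of X and y.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic placeholder data (not Tox21).
X, y = make_classification(n_samples=500, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# scikit-learn applies an L2 penalty by default; a smaller C means
# stronger regularization.
clf = LogisticRegression(C=0.1, max_iter=1000)
clf.fit(X_train, y_train)

# AUC is computed from the predicted probability of the positive class.
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(f"test AUC: {auc:.2f}")
```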

Please comment and let me know if you find better regularization or network settings when you try this example. I haven’t done any systematic search or optimization.

Hi Dan, thanks for your feedback, and I’m glad you like it. I’ve updated the examples above with a code example showing how to convert the SDF files available from https://tripod.nih.gov/tox21/challenge/data.jsp into pandas data frames and save them. Let me know how you fare. The data clean-up and curation part could probably use a bit more attention, but my focus was on showcasing the Keras neural network modelling.

Hi Abdul, Thank you for commenting.
I think that the molvs and RDKit documentation would be a much better and more up-to-date place to read about the installation options. I personally prefer to use my Linux distribution’s package manager (apt-get install python-rdkit) or, failing that, pip (pip install molvs). For some pip installs I use a virtualenv, as some pip packages interfere with my system packages.
(I also have a bleeding edge RDKit compiled from source with some patches that I have applied myself, but that’s not necessary for the code in this blog-post.)
The “from rdkit import Chem,&nbsp; DataStructs” import fails because WordPress doesn’t like my source code and has inserted a lot of “&nbsp;” entities, which shouldn’t be there. Thank you for drawing my attention to that; I’ll correct it.