A multiprocessing example and more

Recently, I had to search for a given chemical structure in a list of structures. Using the python chemoinformatics packages pybel and rdkit, I was easily able to do so, but the operation took a little too much time for my liking. Wondering how I could search faster, I immediately thought about Jean-Philippe’s previous blog post titled Put Those CPUs to Good Use. I decided to follow his instructions and give it a try.

Goal

Look for a molecule (a given chemical structure) in a list of molecules.

Sequential implementation

Note that, absent one constraint, I could have written this function using only pybel or only rdkit. In my current project, however, the molecules in my list are pybel molecule objects while my query is an rdkit molecule object, so the function has to handle both.
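The original code is not shown here, so the following is only a minimal sketch of the sequential search. The `is_match` helper is a hypothetical stand-in for the real comparison (in the post, each pybel molecule is fingerprinted and compared to the converted rdkit query):

```python
def is_match(query, mol):
    # Stand-in for the real pybel/rdkit comparison: in the original,
    # a fingerprint is computed for each list molecule and compared
    # to the (converted) query's fingerprint.
    return query == mol

def search(query, molecules):
    """Return the indices of every molecule in the list matching the query."""
    hits = []
    for i, mol in enumerate(molecules):
        if is_match(query, mol):
            hits.append(i)
    return hits

print(search("CCO", ["CCC", "CCO", "CCN", "CCO"]))  # -> [1, 3]
```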

Running this function took 29.8 seconds to find the three occurrences of my query among the 25791 molecules in the list. I am patient, but only to a point: waiting 30 seconds every time I need to find a molecule is too long.

Multiprocessing implementation

To run on more than one CPU using the multiprocessing python package, I only need to make two modifications to my previous code: (1) define a wrapper function and (2) define the tasks that will be run in parallel.
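Again, the post’s code is not reproduced here, so this is a sketch of the pattern described: a wrapper function that processes one chunk, and a task list of chunks handed to a `multiprocessing.Pool`. The `is_match` helper is the same hypothetical stand-in for the fingerprint comparison:

```python
from multiprocessing import Pool

def is_match(query, mol):
    # Stand-in for the real pybel/rdkit fingerprint comparison.
    return query == mol

def search_chunk(args):
    # Wrapper function: Pool.map passes a single argument, so the
    # query, the chunk's offset in the full list, and the chunk
    # itself are bundled into one tuple.
    query, offset, chunk = args
    return [offset + i for i, mol in enumerate(chunk) if is_match(query, mol)]

def parallel_search(query, molecules, chunk_size=1000, processes=4):
    # Each task is one independent slice of the molecule list; this
    # division is valid because each comparison is independent.
    tasks = [(query, start, molecules[start:start + chunk_size])
             for start in range(0, len(molecules), chunk_size)]
    with Pool(processes) as pool:
        results = pool.map(search_chunk, tasks)
    # Flatten the per-chunk hit lists back into one list of indices.
    return [i for chunk_hits in results for i in chunk_hits]
```

On platforms where multiprocessing starts workers with `spawn` (e.g. Windows, recent macOS), the call to `parallel_search` should sit under an `if __name__ == "__main__":` guard.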

With this code, I still identify the three occurrences of my query, but it now takes 7.8 seconds: I divided my list of molecules into smaller lists of 1000 molecules each and distributed them over 4 CPUs. This division is possible because each comparison of the query against a molecule in the list is an independent operation.

Final remarks

Even though there are other options to speed up my search, my intent here was to present a (hopefully) simple example of the multiprocessing package. You can see that it is quite easy to alter existing code to take advantage of the available CPUs in order to reduce waiting time.

Another option is to use a code profiler to identify the operations taking the most time and see whether those can be modified to optimize the code.

The most time-consuming operations are the generation of the fingerprint fp1 for all the molecules and the conversion of the query. Both are done 25791 times and account for about 94% of the 30 seconds. The conversion should be moved outside the loop, as it needs to be done only once. As for the fingerprint, I only need to compute it when the formulas of the compared molecules are the same; otherwise, I already know the molecules are different. Hence, I do not need to compute the fingerprint for all 25791 molecules. Optimizing the code following an inspection with the code profiler reduces the search time from 29.8 seconds to 0.137 seconds. In this case, there is no gain in using more CPUs: it might even take longer!
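The profiling fix boils down to two changes: hoist the one-time query conversion out of the loop, and use the cheap formula comparison as a filter before the expensive fingerprint step. A schematic version, with `formula_of` and `fingerprint_of` as pure-python stand-ins for the real pybel/rdkit calls (both names are hypothetical):

```python
def optimized_search(query, molecules, formula_of, fingerprint_of):
    """Search with the two profiler-driven fixes applied.

    formula_of:     cheap per-molecule key (the molecular formula in the post)
    fingerprint_of: expensive per-molecule key (the fingerprint in the post)
    """
    # Fix 1: convert/fingerprint the query ONCE, outside the loop.
    query_formula = formula_of(query)
    query_fp = fingerprint_of(query)

    hits = []
    for i, mol in enumerate(molecules):
        # Fix 2: cheap formula check first; only pay for the
        # expensive fingerprint when the formulas agree.
        if formula_of(mol) != query_formula:
            continue
        if fingerprint_of(mol) == query_fp:
            hits.append(i)
    return hits
```

With strings standing in for molecules, `formula_of` could be a character count and `fingerprint_of` the exact string, showing how the cheap filter screens out most candidates before the exact comparison runs.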

I started in biochemistry, but it is as a bioinformatician that I’ve been having fun for several years now: whether doing data analysis and visualization in R, building interactive web interfaces in JavaScript, or exploring machine learning in Python.