ChEMBL Resources

Monday, 10 November 2014

A couple of us attended the 3rd RDKit UGM, hosted by Merck in Darmstadt this year. It was an excellent opportunity to catch up with RDKit developments and applications and meet up with other loyal "RDKitters".

I presented a talk-torial there and went through an IPython Notebook, which some of you may find useful. It uses patent chemistry data extracted from SureChEMBL and, after a series of filtering steps, applies a few "traditional" chemoinformatics approaches to a set of claimed compounds. My ultimate aim was to identify "key compounds" in patents using compound information alone, inspired by papers such as this and this. The crucial difference is that these authors used commercial data and software, whereas in this implementation everything is free and open. At the same time, I wanted to show off what the combination of pandas, scikit-learn, mpld3, Beaker, RDKit, IPython Notebook and SureChEMBL can do nowadays (hint: a lot).

So, here is the Notebook and here are the associated slides which give a bit of background and context.

Obviously, the logic and steps can be reimplemented with other toolkits or workflow tools, such as KNIME.

Wednesday, 5 November 2014

It is common practice for organizations and companies to make use of proxy servers to connect to services outside their network. This can cause problems for users of the ChEMBL web services who sit behind a proxy server. So to help those users who have asked, we provide the following quick guide, which demonstrates how to access ChEMBL web services via a proxy.

Most software libraries respect proxy settings from environment variables. You can set one proxy variable, normally HTTP_PROXY, and then reuse its value for the related proxy environment variables, such as HTTPS_PROXY and FTP_PROXY.
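As a minimal sketch of the set-once-and-reuse pattern, the Python snippet below sets the variables for the current process only and then asks the standard library which proxies it detects. The proxy address is a placeholder (an assumption for illustration), not a real server.

```python
import os
import urllib.request

# Placeholder proxy address (assumption); replace with your organisation's proxy.
PROXY_URL = "http://proxy.example.com:3128"

# Set the value once and reuse it for the related variables.
# Both spellings are set because some tools read only the lower-case names.
for name in ("http_proxy", "https_proxy", "ftp_proxy"):
    os.environ[name] = PROXY_URL
    os.environ[name.upper()] = PROXY_URL

# The standard library reads the *_proxy variables from the environment.
detected = urllib.request.getproxies_environment()
print(detected["http"], detected["https"])
```

Variables set this way affect only the running Python process and its child processes, so the rest of your environment stays untouched.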

If different proxies are responsible for different protocols, set each variable (for example HTTP_PROXY and HTTPS_PROXY) to the address of its own proxy.

On Windows, use the 'set' command to define the same variables.

If you are accessing the ChEMBL web services programmatically and you prefer not to clutter your environment, you can add the proxy settings to your scripts instead. Here are some Python-based recipes:

1. Official ChEMBL client library

If you are working in a Python-based environment, we recommend using our client library (chembl_webresource_client) for accessing the ChEMBL web services. It already offers many advantages over accessing the ChEMBL web services directly, and handling proxies is yet another. All you need to do is configure the proxies once and you are done.
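A minimal sketch of the one-off configuration follows. It assumes the client exposes its configuration through a Settings singleton with a PROXIES attribute; please check this against the version you have installed. The helper name and the proxy addresses are hypothetical.

```python
# Placeholder proxy addresses (assumption); replace with your own.
PROXIES = {
    "http": "http://proxy.example.com:3128",
    "https": "http://proxy.example.com:3128",
}


def configure_chembl_client(proxies):
    """Point chembl_webresource_client at the given proxies (assumed API)."""
    # Imported here so the sketch can be loaded even where the client
    # library is not installed.
    from chembl_webresource_client.settings import Settings

    # Assumption: the settings singleton has a PROXIES attribute that the
    # client passes on to its underlying HTTP library.
    settings = Settings.Instance()
    settings.PROXIES = proxies
    return settings
```

Calling configure_chembl_client(PROXIES) once at the start of a script should be enough; later calls made through the client would then go through the proxy.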

2. Python requests library

If you decide to use requests, you have to pass the 'proxies' parameter to every 'get' and 'post' function call.
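A minimal sketch with requests is shown below. The proxy addresses are placeholders and the ChEMBL URL is an assumption used only for illustration; substitute the endpoint you actually call. The sketch also shows a Session, which stores the proxies once so that they need not be repeated on every call.

```python
import requests

# Placeholder proxy addresses (assumption); replace with your own.
PROXIES = {
    "http": "http://proxy.example.com:3128",
    "https": "http://proxy.example.com:3128",
}

# Assumed example endpoint; use whichever ChEMBL web service URL you need.
STATUS_URL = "https://www.ebi.ac.uk/chembl/api/data/status.json"


def fetch(url):
    """Fetch a URL, passing the proxies explicitly on the call."""
    response = requests.get(url, proxies=PROXIES, timeout=10)
    response.raise_for_status()
    return response.text


# Alternative: keep the proxies on a session and reuse it for every call.
session = requests.Session()
session.proxies.update(PROXIES)
```

With the session, session.get(STATUS_URL) and session.post(...) pick up the proxies automatically.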

3. Python urllib2 library

Finally, in the lowest-level library, 'urllib2', you can create a ProxyHandler and register it with a URL opener.
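The sketch below uses urllib.request, which is where the urllib2 classes live in Python 3. The proxy addresses are placeholders (an assumption), and the fetch helper is a hypothetical name introduced for illustration.

```python
import urllib.request

# Placeholder proxy addresses (assumption); replace with your own.
PROXIES = {
    "http": "http://proxy.example.com:3128",
    "https": "http://proxy.example.com:3128",
}

# Create a handler that knows about the proxies and register it with an opener.
proxy_handler = urllib.request.ProxyHandler(PROXIES)
opener = urllib.request.build_opener(proxy_handler)


def fetch(url):
    """Fetch a URL through the proxy-aware opener and return the raw bytes."""
    with opener.open(url, timeout=10) as response:
        return response.read()
```

If every call in the script should use the proxy, urllib.request.install_opener(opener) makes this opener the default for urllib.request.urlopen.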

We would like to thank Dr. Christine Rudolph for the idea and for providing the code snippets.

Tuesday, 4 November 2014

PPDMs has been in the making for more than a year and is a follow-up on a conference paper we published in 2012. As in 2012, our objective is to map small molecule binding sites to protein domains, the structural units that form recurring building blocks in the evolution of proteins. An application note describing PPDMs is just out in Bioinformatics.

Mapping small molecule binding to protein domains

The mapping facilitates the functional interpretation of small molecule-protein interactions - if you understand which domain in a protein is targeted, you are in a better position to anticipate the downstream effect. Mapping small molecule binding to protein domains also provides a technical advantage to machine-learning approaches that incorporate protein sequence information as a descriptor to predict small molecule bioactivity. Reducing the sequence descriptor to the part that mediates small molecule binding increases the informative content of the descriptor. This is best exemplified by the domain-poisoning problem, illustrated below.

Result of a hypothetical BLAST query against the ChEMBL target dictionary, using the rat Tyrosine-protein phosphatase Syp (P35235) as input, together with one of the hits retrieved, the rat Tyrosine-protein kinase SYK (Q64725). The significant E-value for this hit results from high-scoring alignments of the SH2 domains. At the same time, the overlap between small molecules binding both proteins is expected to be low.

A simple heuristic

For individual experiments, it is often quite trivial to decide which domain was targeted. For example, medicinal chemists know whether their compound is a kinase inhibitor or one of a handful of SH2 inhibitors. This knowledge, while easily gleaned by the expert, is implicit and cannot be accessed programmatically. Hence we were motivated to implement a solution that could achieve this across as many measured bioactivities as possible.

Our initial implementation of mapping small molecules to protein domains consisted of a simple heuristic: Identify domains with known small molecule interaction and use these domains as a look-up when mapping measured bioactivities to protein domains. This process is illustrated in the figure below.

A catalogue of validated domains was extracted from assays against single-domain proteins (step 1, 2) and projected onto measured bioactivities in ChEMBL (step 3). Three possible outcomes are: i) A successful mapping if exactly one of the Pfam-A domain models from the catalogue matches the sequence; ii) No mapping if none of the Pfam-A domain models from the catalogue match the sequence; iii) A conflicting mapping if multiple domain models from the catalogue match the sequence.
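The three steps and three outcomes can be sketched in a few lines of Python. This is an illustration of the heuristic as described here, not the code used for ChEMBL or PPDMs; the function names and the toy input are assumptions, and "Hypothetical_dom" is a made-up domain name.

```python
def build_catalogue(assayed_architectures):
    """Steps 1 and 2: collect the domains of assayed single-domain proteins."""
    return {arch[0] for arch in assayed_architectures if len(arch) == 1}


def map_to_domain(architecture, catalogue):
    """Step 3: project the catalogue onto one protein's domain architecture."""
    matches = sorted(set(architecture) & catalogue)
    if len(matches) == 1:
        return "mapped", matches      # outcome i): exactly one catalogue domain
    if not matches:
        return "no mapping", matches  # outcome ii): no catalogue domain
    return "conflict", matches        # outcome iii): several catalogue domains


# Toy input: domain architectures of proteins that were targeted in assays.
# The multi-domain entry is skipped when the catalogue is built.
assayed = [("DHFR",), ("Thymidylate_synt",), ("DHFR", "Thymidylate_synt")]
catalogue = build_catalogue(assayed)

print(map_to_domain(("DHFR",), catalogue))
print(map_to_domain(("Hypothetical_dom",), catalogue))
print(map_to_domain(("DHFR", "Thymidylate_synt"), catalogue))
```

The last call returns a conflict because both domains of the architecture are in the catalogue, so the heuristic alone cannot decide which one mediates binding.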

Despite its simplicity, this method works surprisingly well, owing to the fact that protein domains that are relevant to drug discovery are prioritised in Pfam-A model curation. Another factor that contributes here is the conservative route taken by many drug discovery projects that focus on targets that are in well characterised protein families. However, as illustrated by the cases labelled ii) and iii), some constellations are not covered by the simple heuristic.

A public platform to review and improve mappings

Measured activities in ChEMBL falling into category iii) from the illustration above amount to only a fraction of the total but often reflect interesting biology. DHFR-TS for example is a multi-functional enzyme combining both a DHFR and Thymidylate_synt domain that occurs in the group of bikonts, which includes Trypanosoma and Plasmodium. In humans (and all metazoa), these domains occur as separate enzymes.

Small molecule inhibitors exist for both domains, DHFR (yellow, with Pyrimethamine) and Thymidylate synthase (blue, with Deoxyuridine monophosphate).

We built PPDMs as a platform to resolve such cases. PPDMs aggregates information that supports manual mapping assignments based on medicinal chemistry knowledge. New mappings can be committed to the PPDMs logs and then transferred to the ChEMBL database in future releases.

The Conflicts section on the website summarises conflicts (cases that correspond to category iii as discussed above) that were encountered when the mapping was applied to measured activities in the ChEMBL database and offers an interface to resolve them.

The Evidence section provides the full catalogue of domains for which we found evidence of small molecule binding. Evidence for the majority of domains in this list is provided in the form of measured bioactivities in ChEMBL, while in a few cases we provide a reference to the literature. These are cases where well-known domains occur exclusively in multi-domain architectures, such as 7tm_2 and 7tm_3. The catalogue can be downloaded in full from this section.

PPDMs also provides logs of individual assignments - these can be queried by date, user and comments left when the assignment was made. A log of all assigned mappings can be downloaded from this section. Another way to review assigned mappings is through the Resolved section, where assignments are grouped by domain architecture.

We invite everyone with an interest in the matter to sign up with PPDMs, whether it's simply for playing around, resolving remaining conflicts, or reviewing existing assignments. Please get in touch and we'll sort out a login for you!