Pages

Sunday, July 17, 2016

A long time ago Martijn van Iersel wrote a PathVisio plugin that visualizes 2D chemical structures of metabolites in pathways as found on WikiPathways. Some time ago I tried to update it to a more recent CDK version, but did not have enough time at the time to get it going. However, John May's helpful DepictionGenerator made it a lot easier, so I set out this morning in updating the code base to use this class and CDK 1.5.13 (well, strictly speaking it's running a prerelease (snapshot) of CDK 1.5.14). With success:

The released version is a bit more tweaked and shows the 2D structure diagram more filling the Structure tab. I have submitted the plugin to the PathVisio Plugin Repository.

Now, you may know that these GPML pathways only contain identifiers, and no chemical structures. But this is where the metabolite identifier mapping database helps (doi:10.6084/m9.figshare.3413668.v1): it contains SMILES strings for many of the compounds. It does not contains SMILES string from Wikidata, but I will start adding those in upcoming releases too. The current SMILES strings come from HMDB.

To show how all this works, check out the below PathVisio screenshot. The selected node in the pathway has a label uracil and the left most front dialog was used to search in the metabolite identifier mapping database and it found many hits in HMDB and Wikidata (middle dialog). The Wikidata identifier was chosen for the data node, allowing PathVisio to "interpret" the biological nature of that node in the pathway. However, along with many mapped identifiers (see the Backpage on the right), this also provides a SMILES that is used by the updated ChemPaint plugin.

It will give some output on the console, including a webpage with SPARQL endpoint, upload form etc.

That it tracks past queries is a nice extra.

Step 2: there is no step twoStep 3: OK, OK, you also want to try a SPARQL from the command line
Now, I have to say, the webpage does not have a "Download CSV" button on the SPARQL endpoint. That would be great, but doing so from the command line is not too hard either.

But it would be nice if you would not have to copy/paste the query into a file, or go to the command line in the first place. Also, I had some trouble finding the correct SPARQL endpoint URL, as it seems to have changed at least twice in recent history, given the (outdated) documentation I found online (common problem; no complaint!).

HT to Andra who first mentioned Blazegraph to me, and the Blazegraph team.

Friday, July 08, 2016

A good conference needs some time to digest. A previous supervisor advised me that a conference travel of 5 days takes 5 full day to follow up on everything. I think he is right, though few of us actually block our schedules to make time for that. Anyway, I started following up on things last weekend, resulting in a first two blog posts:

The second was pretty much how I have been blogging a lot: it's my electronic lab notebook. The first is about how people can link out to WikiPathways. That post explains how people can create links between identifiers and pathways.

Saturday, July 02, 2016

Biological knowledge should not only be capturedin nice graphics, but should be machine readable.Public domain image from Wikipedia.

WikiPathways described biological processes. Entities in these processes are genes, gene products, like miRNAs, proteins, and metabolites. The pathways do not describe what these entities are, but only provide identifiers in external databases allowing you to study the identity in those databases. Therefore, for metabolites you will not find chemical graphs but identifiers from HDMB, CAS, KEGG, ChEBI, and others.

Therefore, for these databases it is easy to make links between those identifiers and the pathways in which entities with those identifiers are found. For example, to create a link between Ensembl identifiers and pathways, we could do something like:

Doing searches in RDF stores is commonly done with SPARQL queries. I have been using this with the semantic web translation of WikiPathways by Andra to find common content issues, though sometimes combined with some additional Java code. For example, find PubMed identifiers that are not numbers.

Based on Ryan's work on interactions, a more complex curation query I recently wrote in reply to issues that Alex ran into with converting pathways to BioPax, is to find interactions that convert a gene to another gene. Such occurred in WikiPathways because graphically you do not see the difference. I originally had this query:

This query properly found all gene-gene conversions to be fixed. However, it was also horribly slow with my JUnit/Apache Jena set up. The queries runs very efficiently on the Virtuoso-based SPARQL end point. I had been trying to speed it up in the past, but without much success. Instead, I ended up batching the testing on our Jenkins instance. But this got a bit silly, with at some point subsets of less than 100 pathways.

Observation #1
So, I turned to twitter, and quite soon got threeusefulleads. The first two suggestions did not help, but helped me rule out the problem. Of course, there is literature about optimizing, like this recent paper by Antonis (doi:10.1016/j.websem.2014.11.003), but I haven't been able to convert this knowledge into practical steps either. After ruling out these options (though I kept the sameTerm() suggestion), and realized it had to be the first two triples with the variables ?gene1 and ?gene2. So, I tried using FILTER there too, resulting with this query:

That did it! The time to run a query halved. Not so surprising, in retrospect, but it all depends on the SPARQL engine: which parts does it run first. Apparently, Jena's SPARQL engine starts at the top. This seems to be confirmed by the third comment I got. However, I always understood engine can also start at the bottom.

Observation #2
But that's not all. This speed up made me wonder something else. The problem clearly seems to engine approach to run parts of the query. So, what if I remove further choices in what to run first? That leads me to a second observation. It helps significantly if you reduce the number of subgraphs it should later "merge". Instead, if possible, use property paths. That again, about halved the runtime of the query. I ended up with the below query, which, obviously, no longer give me access to the pathway resources, but I can live with that:

Search This Blog

This blog deals with chemblaics in the broader sense. Chemblaics (pronounced chem-bla-ics) is the science that uses computers to solve problems in chemistry, biochemistry and related fields. The big difference between chemblaics and areas such as chem(o)?informatics, chemometrics, computational chemistry, etc, is that chemblaics only uses open source software, open data, and open standards, making experimental results reproducible and validatable. And this is a big difference!

About Me

Assistant professor at the Dept of Bioinformatics - BiGCaT at NUTRIM, Maastricht University, studying biology at an unsupervised and atomic level. Open Science is my main hobby resulting in participation in, among many others, Bioclipse, CDK and WikiPathways. ORCID:0000-0001-7542-0286. Posts on G+ are personal.

Cookies

In the EU there is a directive upcoming requiring websites to warn people about HTTP cookies. This website uses the Blogger.com platform, Google Adsense (not that is it actually paying anything significantly), and a few scripts to count how often a blog post was tweeted, using Topsy and LinkedIn. These services undoubtedly make use of cookies, which you can disallow in your browser.