Archive for the ChemSpider Services Category

For some time now it has been possible to access relevant SureChem patent information from a ChemSpider compound page in the Patents Infobox. ChemSpider compounds are also linked to and from the relevant RSC articles, which has allowed us to form a new partnership between RSC Publishing and SureChem which relies on ChemSpider taking the pivotal role of linking internet chemistry together.

In the RSC article landing pages there is a “Compounds” tab which shows the key compounds that the article is about – as shown in this example. For each compound there is now a link to view the SureChem patent information associated with that compound as below:

SureChem and SureChem’s new free offering, SureChemOpen, offer a suite of patent chemistry data solutions, for example allowing their patents to be found from a structure or substructure search. Now, for each compound returned from such a search it is possible to view any linked ChemSpider compound pages and the number of associated RSC publications (and follow a link to view these articles).

This linking between SureChem and the RSC publication platform relies on ChemSpider (and the standard InChI chemical identifier) providing a bridging link to both, which ensures that the system is accessible, standards-based and scalable, making it easy for future partners to join.

As the ChemSpider content and data mappings have continued to expand, the demands on our web services have increased dramatically. With the popularity of the site continuing to increase we anticipate even heavier usage of our web services. This is true for our involvement with the Open PHACTS project as well as from a number of software packages served up by analytical instrument vendors, especially in the mass spectrometry domain. Because of the increasing load on our systems, we have taken steps to prevent us from outgrowing our existing infrastructure and have implemented a new scalable, future-proof web services offering that your applications can rely upon.

Continual availability and business continuity for subscribers and academics

We have reinvented our web service infrastructure using Microsoft SQL Server replication technology in order to maintain multiple copies of the ChemSpider database. As a result, all system resources are dedicated purely to web services, with no background tasks running to affect performance. Also, the databases are read-only, which eliminates database lock contention entirely.

A standalone and scalable web service establishment for faster response times

The ChemSpider servers run on the VMware virtualization platform, which allows us to scale out the hardware by assigning more resources as required. In the future we can easily provide a consistently high-performance service even as usage further increases.

Over 1/4 million calls in the first 18 hours

Although ChemSpider web services are fast becoming a priority for us, we are still dedicated to ensuring the website experience is optimal. The changes we have implemented will reduce traffic to the website so you should already have noticed improvements in website performance and reliability.

Some examples of implementations of ChemSpider web service usage can be found here.

Access to the ChemSpider API is free to academic users; for commercial use please contact us at chemspider-at-rsc.org.

The RSC’s objective is to advance the chemical sciences, not only at a research level but also to provide tools to train the next generation of chemists. ChemSpider contains a lot of useful information for students learning Chemistry but there is also a lot of information which is not relevant to their studies which might be confusing and distracting. For some time we have been considering the concept of an educational version of ChemSpider, aimed at students (and their teachers or lecturers) in their last years of school, and first years of university (ages 16-19), which restricts the compounds and the properties, spectra and links displayed for each, to those relevant to their studies. As a result, we are pleased to announce the launch of the Learn Chemistry Wiki which not only fulfils this aim, but also takes it further. This project was developed in a collaboration between Dr Martin Walker at the State University of New York at Potsdam, ChemSpider and the Royal Society of Chemistry’s Education team.
The Learn Chemistry Wiki contains over 2000 “substance” pages which correspond to simple compounds that would commonly be encountered during the last years of school and first years of university. Each of these pages corresponds to a ChemSpider compound, from which it dynamically retrieves compound images, a summary of its properties (molecular formula, mass, IUPAC name, appearance, melting and boiling points, solubility, etc.) and links to view safety sheets and spectra. It also contains text from Wikipedia to display in the substance page based on the Wikipedia links in ChemSpider.

The Learn Chemistry Wiki also goes a step further: it does not present compound information in isolation but also contains laboratory experiments (each with parallel sections containing an overview, teachers’ notes and students’ handouts), quizzes, and tutorials, all linked to the compound information to put them into context. The wiki is based on the MediaWiki platform (which allows multiple users to contribute collaboratively, since the website is intended to be a community website), but extends it to incorporate functionality similar to that of ChemSpider, invoked via custom-made extensions. For example, it is possible to draw structures using GGA’s Ketcher in order to find structures, or to draw answers to quiz questions (for example to specify the product of a particular reaction). It is also possible to include an interactive spectrum retrieved from ChemSpider in any wiki page, using the ChemDoodle spectrum viewing widget in browsers which support canvases or the JSpecView applet in those that don’t.

The Learn Chemistry Wiki is part of the RSC’s new Learn Chemistry platform, which provides a central access point and search facility to make it easier to access the various RSC teaching resources.

KNIME is an open-source data integration, processing, analysis, and exploration platform which can be used to create workflows to analyse data.

We have experimented with adding a node to a KNIME project which calls the ChemSpider web services to perform a simple search, and the instructions below outline how to reproduce our experiment. This was done with KNIME 2.5.0, with the KNIME extension “Generic Webservice Client” installed.

From the Node Repository, find the “Generic Webservice Client” under the “Misc” folder and drag it into the KNIME project to add a new node.

Right-click on this “Generic Webservice Client” and click on the “Configure…” option.

The WSDL for each ChemSpider webservice can be found using the link from the page for the appropriate webservice. For example, the WSDL for the Search webservice is at http://www.chemspider.com/Search.asmx. However, if you enter this as the WSDL location you’ll get an error when you click the “Analyze” button (due to a SOAP exception “undefined simple or complex type ‘soapenc:Array’”). This is something that we’re looking into addressing in ChemSpider, but for now a workaround is to copy the WSDL, replace the old-fashioned soapenc:Array type with tns:ArrayOfString, and save and use this amended WSDL locally. I have done this with the Search webservice and the resulting WSDL is available for download here. This file should be downloaded and extracted somewhere locally. It can then be entered in the “WSDL Location” field of the Generic Webservice Client in KNIME (using a location of the form: file:/C:/temp/ChemSpiderSearchWSDL_no_soapencArray.WSDL), which will then be processed correctly on clicking the “Analyze” button.
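If you would rather script the workaround than edit the WSDL by hand, the substitution can be sketched in a few lines of Python. This is only a rough sketch: it assumes a plain text replacement of soapenc:Array with tns:ArrayOfString is all that is needed (as described above), and the function names and file paths are hypothetical.

```python
from pathlib import Path

OLD_TYPE = "soapenc:Array"       # type that the KNIME client rejects
NEW_TYPE = "tns:ArrayOfString"   # replacement type suggested in the post


def patch_wsdl_text(wsdl_text: str) -> str:
    """Return the WSDL text with every soapenc:Array reference replaced."""
    return wsdl_text.replace(OLD_TYPE, NEW_TYPE)


def patch_wsdl_file(source: Path, target: Path) -> int:
    """Write a patched copy of `source` to `target`.

    Returns the number of replacements made, so a result of 0 signals
    that the downloaded WSDL did not contain the problematic type.
    """
    text = source.read_text(encoding="utf-8")
    target.write_text(patch_wsdl_text(text), encoding="utf-8")
    return text.count(OLD_TYPE)
```

For example, `patch_wsdl_file(Path("Search.wsdl"), Path("Search_no_soapencArray.wsdl"))` would produce a local copy whose location can then be entered in the “WSDL Location” field (both file names here are made up for illustration).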

Set the Port, operation, inputs and outputs as required – see the screen capture below for the settings used in my demonstration. Note that you should use your own token as the value for the token input – if you don’t have one already, see the instructions here.

Add input and output nodes which connect to and from this Generic WebService Client node as required. For example, you could add a FileReader node as the initial node, which reads in the contents of a text file that simply contains a search term as an input (and adapt the Input value accepted as the query input value of the SimpleSearch to map to this column, rather than hardcoding a value to search for). The output csid could then be written to a csv file using a CSV Writer node.

On executing the workflow, an output csv file is created which contains the ChemSpider ID(s) of any compounds that match the search term. In the case of a search for “benzene” the csid retrieved is 236.
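For readers who think more easily in scripts than in nodes, the same three-step workflow (read search terms from a text file, search, write the csids to a csv file) can be sketched in Python. This is an illustration of the data flow only, not of KNIME itself: the function names are hypothetical, and `fake_search` is a stand-in for the real SimpleSearch call that only knows the benzene result quoted above.

```python
import csv
from pathlib import Path
from typing import Callable, Iterable, List, Tuple

# Any function that maps a search term to a list of ChemSpider IDs.
SearchFn = Callable[[str], List[int]]


def read_terms(path: Path) -> List[str]:
    """Read one search term per line, skipping blank lines (the file reader step)."""
    lines = path.read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def collect_rows(terms: Iterable[str], search: SearchFn) -> List[Tuple[str, int]]:
    """Pair each term with every csid returned for it (the web service step)."""
    return [(term, csid) for term in terms for csid in search(term)]


def write_rows(rows: Iterable[Tuple[str, int]], out_path: Path) -> None:
    """Write the (query, csid) pairs to a csv file (the CSV Writer step)."""
    with out_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["query", "csid"])
        writer.writerows(rows)


def fake_search(term: str) -> List[int]:
    """Hypothetical stand-in for SimpleSearch; only 'benzene' -> 236 is known."""
    return {"benzene": [236]}.get(term.lower(), [])
```

Keeping the search step as a function argument means the stand-in can be swapped for a real web service call without touching the file handling.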

The functionality of electronic lab notebooks (ELNs) and that of ChemSpider overlap to a certain extent – both store chemical information including structures, data, spectra and reactions. However, the focus of most ELNs is to manage, track and audit that data, and that of ChemSpider is to publish and disseminate it to the world. We have been considering how best to use this complementary functionality to integrate an ELN with ChemSpider.

Some ELNs already look up information from, and link to, ChemSpider. One example is the blog3 Web-logging (“blogging”) engine by Jeremy Frey, Simon Coles and Mark Borkum at Southampton University, which allows links to compounds from the ChemSpider database to be embedded directly into the content of a post. When a link to ChemSpider is detected, blog3 follows the link to retrieve additional information that is relevant to the compound, including: experimental and theoretical data; two- and three-dimensional depictions; and links to papers and journal articles. Another example is the eScience tool that Stephen Wan from CSIRO has developed with the University of New South Wales to text mine LabTrove ELN blog posts to identify chemical names and link these to the relevant ChemSpider compounds.

At “The Smart Laboratory: Towards a national ELN” meeting (organised as part of the Dial-a-Molecule EPSRC Grand Challenge) in August this year, the seeds were sown to take the integration between ELNs and ChemSpider a step further. Cambridge University has the first Chemistry department in the UK to roll out a department-wide electronic lab notebook system, and the software that they’re using is IDBS’s E-WorkBook Suite. In collaboration with IDBS and Cambridge’s Chemistry department, we at ChemSpider have made a plug-in which can both dynamically retrieve information from ChemSpider into their ELN and publish from the ELN to ChemSpider. The Chemistry department at Cambridge (Dr Tim Dickens, Dr Brian Brooks, Prof Bobby Glenn and Prof Steven Ley) have been very helpful in granting access to their ELN to write the plug-in, and will be its first users, but the results will be freely available for any existing IDBS E-WorkBook Suite user.

About the extension Prof Bobby Glenn has said: “Much of Chemistry is lost, it is simply not published and languishes in forgotten lab notebooks. Capturing novel molecules soon after synthesis on a searchable database like Chemspider is now an effortless process directly from the ELN, which will greatly encourage sharing of compounds, synthetic methods and all the associated data. It’s instant messaging for chemists”. Antony Williams (Vice-President of Strategic Development of ChemSpider) added “The ability to now publish compound data from the IDBS ELN directly to ChemSpider offers a path to direct exposure of novel chemistry as well as the chemist doing the work. This public compound registration capability is the first move towards ultimately exposing synthetic methods and associated experimental data to the community. Our vision is coming to fruition through this collaboration.”

Compounds can be published to ChemSpider if they have been drawn out in full in an experiment – whether this is as an individual structure or part of a reaction, and whether they are simply uploaded into the experiment as a reaction file, or included in, for example, a spreadsheet item. Likewise, compound structures can be automatically loaded into a search of ChemSpider if you would like to find out more information about compounds that have been drawn out in full in an experiment, or if you have published a compound to ChemSpider and wish to see the resulting compound pages. The resulting compound pages in ChemSpider will have the data source “IDBS E-WorkBook Suite”. The external ID will show the ID of the experiment from which the structures came, and the depositor details as defined in the ChemSpider Settings of the ELN.

The ChemSpider IDBS E-WorkBook Suite plug-in is freely available to customers of IDBS E-WorkBook Suite by downloading it from IDBS, and copying it to the appropriate place in their IDBS E-WorkBook Suite program files. It is compatible with E-WorkBook Suite versions 9.0 and 9.1.

This plug-in is an initial proof-of-concept to demonstrate that we can pass compound information between ChemSpider and an ELN in both directions. Future versions will allow more of the information within an experiment to be published to ChemSpider – for example to allow reactions along with a description of their methods to be published to ChemSpider SyntheticPages, or to deposit spectra along with compounds to ChemSpider. We would also like to integrate other ELNs with ChemSpider.

In a way this is a taster, as we’re looking at our Search as part of the refresh of ChemSpider, and more detail will follow. Another motivation for posting was a couple of recent requests for ChemSpider functionality which is already available – a great pointer to how we think about offering massive functionality in a clear interface. The two requests? One was that it would be great if a user could do a search from an input image (so to load an image, convert to structure and launch a search). The other wanted a way to just look for compounds with a specific element included. Both of these can be done on ChemSpider – and tragically both came in as anonymous feature requests. So, because I don’t think they’ve even been fully itemised before, let me count the ways by which you can search ChemSpider.

18. Use our web services for mass spectrometry to search by molecular mass or elemental composition within ChemSpider or within particular data sources.
19. Use our web services to search by chemical identifier, retrieve information about a ChemSpider record, or retrieve the chemical structure thumbnail.
20. Use our web services for spectra to return all Open Data spectral information from ChemSpider, return spectral information on a compound, or return identified spectra.
21. You can show all spectra of a particular type on the spectra page.

We’ve regenerated all of the InChIs in the database with version 1.03 of the InChI code.

What does that mean?

The InChI (international chemical identifier) is a short piece of text that describes the structure of a molecule. Each one is generated by a free and open-source computer program, which guarantees that it should be the same and there shouldn’t be conflicting InChIs for the same molecule. You can’t really write them by hand, because they look like this:

ChemSpider is built on InChIs. If two molecules have the same InChI, then they’re the same record in ChemSpider, and if you can’t InChIfy it, you can’t put it in ChemSpider. That’s why we can’t do, for example, polymers yet.
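To make the “same InChI, same record” idea concrete, here is a small Python sketch that groups deposited names by their InChI. It is an illustration of the principle only, not how the ChemSpider database is actually implemented; the function name and the sample depositions are made up, and the two InChI strings are the standard InChIs for benzene and toluene.

```python
from typing import Dict, Iterable, List, Tuple

BENZENE = "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H"
TOLUENE = "InChI=1S/C7H8/c1-7-5-3-2-4-6-7/h2-6H,1H3"


def group_by_inchi(depositions: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Collect deposited names under their InChI.

    Each distinct InChI becomes one record; depositions that share an
    InChI are merged into the same record rather than creating a new one.
    """
    records: Dict[str, List[str]] = {}
    for name, inchi in depositions:
        records.setdefault(inchi, []).append(name)
    return records


records = group_by_inchi([
    ("benzene", BENZENE),
    ("benzol", BENZENE),    # same structure under another name: same record
    ("toluene", TOLUENE),
])
print(len(records))  # 2
```

This is also why a bug fix that changes how an InChI is generated matters: if two drawings of the same molecule give different InChIs, they end up as two records instead of one.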

We’re proud to be founder members of the InChI Trust, which supports this critical element in the sharing of chemical compound information.

What does all this mean for ChemSpider?

Because there is an active community supporting InChI who look out for these things, version 1.03 contained some bug fixes which mean that a very small number of the InChIs themselves, only a few dozen out of the whole database, have changed.

P+–O– and P+–S– bonds are now treated slightly differently. This means that it will be easier to find the exact molecule you’re looking for, regardless of how it’s been drawn. (In principle this will also apply to analogous bonds containing arsenic, selenium, tellurium and antimony, but I can’t see any examples of this in the database.)

There was a small bug where the InChI generated for a molecule with an azide group in it sometimes varied according to the input drawing. But that doesn’t happen now.

This regeneration has also allowed us to catch and clean up some errors in the data.

What happens next?

Version 1.04 of the InChI code will be released soon. With our new framework for processing large amounts of data we’ll be able to update our InChIs much quicker. The main changes in 1.04 that affect the InChI are to how it handles radical atoms in aromatic rings, nobelium, lawrencium and rutherfordium, so we anticipate that there shouldn’t be very many changed InChIs!

The RSC’s free chemical database ChemSpider has added RDF functionality to its interface, in collaboration with the University of Southampton’s School of Chemistry. The availability of RDF allows the database records to be found and understood by semantic web tools, another step in ChemSpider’s mission to create a public chemical information infrastructure.

Richard Kidd, Informatics Manager at the RSC says “we are delighted to work with top academic teams pushing forward what’s possible with semantic chemistry, and we hope others will use the RDF representation of ChemSpider to support their own developments”

ChemSpider as a Linked Data source for oreChem

The machine-processable representation was specifically developed in order to leverage the core competencies of the ChemSpider database: resolvable identifiers; high-quality, curated metadata; and rich linking to the extensive RSC corpus. Furthermore, as part of the Microsoft Research-funded oreChem project, OAI-ORE technology is being used to facilitate the discovery and re-use of the chemical information in the correct context.

Prof Jeremy Frey and Dr Simon Coles commented “it is a pleasure for Southampton to work with the RSC’s ChemSpider as a culmination of our contribution to the Microsoft-funded oreChem project. As a member of the Southampton Chemistry eResearch team, this work forms the core of graduate student Mark Borkum’s PhD thesis.”

“Enabling open, semantic chemistry in this way is a monumental step forward for the domain,” notes Lee Dirks, director of Education & Scholarly Communication for Microsoft Research, “We’re thrilled to have played a role in facilitating the creation of this resource and extremely pleased to see Southampton and the RSC innovating and leading the field.”

Another oreChem participant, Carl Lagoze, the Associate Professor, Cornell University Information Science, Co-Director Open Archives Initiative added “it’s wonderful to see the results of our work on OAI-ORE in this exciting application. It fulfils our goal of making the results of research easier to disseminate and reuse”

Dotmatics Limited is pleased to announce that it will provide its web-based structure drawing tool, Elemental, to the leading chemistry community website ChemSpider. Elemental provides a zero install drawing tool that lets users draw simple chemical structures or complex structure queries directly within a webpage.

Antony Williams, Vice President of Strategic Development for ChemSpider comments “Elemental offers ease of deployment and flexibility in structure drawing to our community of users and we are happy to embrace this web-based structure drawing platform as an entry point to the rich resources of ChemSpider.”

Dr Mike Hartshorn, Director and CSO of Dotmatics, said “We are delighted to be working with such a well-known chemistry resource as ChemSpider. The new tools will allow simple access to the wide range of structures and related information that is maintained by ChemSpider and the RSC”.

About Dotmatics
Dotmatics Limited (www.dotmatics.com) is a leading provider of web-based database integration and visualisation tools for use within the life sciences industry.

About the Royal Society of Chemistry
The Royal Society of Chemistry is the UK Professional Body for chemical scientists and an international Learned Society for the chemical sciences with more than 47,500 members worldwide. It is a major international publisher of chemical information, supports the teaching of chemical sciences at all levels and is a leader in bringing science to the public. www.rsc.org

About ChemSpider
ChemSpider offers a structure-centric community for chemists to resource data. Offering access to over 25 million unique chemical entities from over 400 data sources and by providing a platform for crowd-sourced deposition, annotation, and curation, it is the richest source of free integrated chemistry information available online. ChemSpider delivers data and services to enable the semantic web for chemistry. www.chemspider.com

The ChemSpider web services are intended to allow you to use the functionality of ChemSpider and query the data in it in your own website or program or script. There are many different webservices as described here, and also many different ways to use them.

One example of how to use them was sent to us by Jimmy Moore from the University of Manchester. He includes a call to the SimpleSearch operation of the Search web service in a Perl script. This searches the whole of ChemSpider by an input value, which can be the molecule’s name, SMILES string, InChI or InChIKey, and returns the ChemSpider ID:

For further background, and also an example of a perl script which uses the SMILESToInChI operation of the InChI web service see his blog page.

Please note that to use this (and some of the other) web services you need to obtain a token, by registering with ChemSpider (if you have not already), and then logging into ChemSpider and viewing your Profile page. The Security Token shown needs to be copied into the perl script itself in Jimmy’s example.

Also note that you will need to install the SOAP::Lite modules into your Perl library to run this script if you don’t already have them.
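If Perl is not your language of choice, the same SimpleSearch call can be sketched in Python using only the standard library. Please treat this as a rough sketch rather than a reference implementation: it assumes the operation is also exposed over plain HTTP GET under the Search.asmx address, and that the response is an XML list of `int` elements holding the ChemSpider IDs. The `query` and `token` inputs are the ones the operation takes; the function names are hypothetical.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET
from typing import List

# Assumed HTTP GET binding for the SimpleSearch operation.
SEARCH_ENDPOINT = "http://www.chemspider.com/Search.asmx/SimpleSearch"


def build_simple_search_url(query: str, token: str) -> str:
    """Build the request URL, URL-encoding the search term and security token."""
    params = urllib.parse.urlencode({"query": query, "token": token})
    return f"{SEARCH_ENDPOINT}?{params}"


def parse_csids(xml_text: str) -> List[int]:
    """Extract ChemSpider IDs from the response.

    Assumes each ID arrives in an <int> element; the XML namespace,
    if any, is ignored so the parser does not depend on its exact value.
    """
    root = ET.fromstring(xml_text)
    return [
        int(element.text)
        for element in root.iter()
        if element.tag.rsplit("}", 1)[-1] == "int" and element.text
    ]


def simple_search(query: str, token: str, timeout: float = 30.0) -> List[int]:
    """Call SimpleSearch and return the matching ChemSpider IDs."""
    url = build_simple_search_url(query, token)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_csids(response.read().decode("utf-8"))
```

As with the Perl example, you would pass your own Security Token from your Profile page, for example `simple_search("benzene", "your-token-here")`.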

If you have an example of how you have used the ChemSpider web services then please reply to this ChemSpider forum post. More examples will inspire more new ideas, and also make it easier for other people trying to do similar things.

The Chemicalize website from ChemAxon is gaining interest (1,2) and, likely, LOTS of users! Chemicalize is both a website for recognizing chemical names and converting to chemical structures as well as an integration path to their property prediction algorithms. Some basic testing of chemicalize shows that their chemical name detection and conversion to structures using either name to structure conversion (algorithmically) or name lookup (via dictionaries) is very good. Not perfect, but very good. Perfect chemical name lookups are impossible as the associated dictionaries grow every time a new natural product is found for example, or a new drug is released.

Now, with ChemSpider, ChemAxon were kind enough (and I mean applaud them, acknowledge them and send flowers!!) to give us a way to pass through a structure and initiate the predictions on the Chemicalize site. This is tremendous news for you all! Under the Properties Infobox we provide a list of properties from ACD/Labs, a list of properties from EPISuite, a list of experimental properties sourced from various places and, now, the link to Predict Properties using Chemicalize.

Clicking the tab for Predict Properties from ChemAxon displays the link through to Chemicalize as shown below.

So, now we have sets of prediction capabilities linked up to ChemSpider. The ACD/Labs predictions are pre-calculated, and every time there is an update to the algorithms in theory we would have to recalculate across the database and publish. This would take weeks across the almost 25 million structures, so it is not a frequent task. It is the same issue with EPISuite. With the Chemicalize integration, however, the predictions are live, run on the structure at the time it is passed to the algorithms. This has the advantage that the prediction algorithms can be incrementally improved and you will always get the latest and greatest results. However, having the predicted values from ACD/Labs available allows flexible searching as shown below. We are grateful to ChemAxon for allowing us to integrate Chemicalize. It gives LIVE access to the latest and greatest predictions as well as access to a whole series of new predictions for which we don’t have data on the database…especially pKa values, topology analysis, geometry and others. Thanks ChemAxon!

In the first of many integration projects presently underway inside the RSC to bring together the benefits of ChemSpider with existing systems we’re happy to announce that the Prospected compound pages are now using structure images from ChemSpider as shown below. We spent a lot of time creating aesthetically pleasing structure images for ChemSpider and especially for display on webpages and blogs so we’re happy to see them show up in other venues too.

Following on from other posts in this series from this week I’m going to continue to list new functionality over the holiday season. I’ll continue with the “Social Widget”. What IS the Social Widget? Well…it’s this thing to the left….it is an AddThis Button that is available for every compound page on ChemSpider now. If there is a particular chemical of interest on ChemSpider that you want to include into your social networking then you can do so by choosing the social networking site of interest and “adding” the link in there. For some it posts the link and for others it posts a thumbnail of the structure there that is linked back directly into ChemSpider.

So, if I posted to Friendfeed it will send the link directly into Friendfeed. I just did it..worked perfectly. For Facebook it actually carries the thumbnail as shown below on my Facebook page. SO, deposit some of your molecules onto ChemSpider and let the world know! Add some data, tell a story, post a reaction…and use AddThis to tell your network!

While some say “Silence is Golden” some of us find it deafening! One of my common statements regarding Press Releases and political commentaries is there is as much said in the “unsaid”. Why this lead in to this blog post? Well….the truth is we haven’t been very productive in the past few weeks with the delivery of new functionality onto ChemSpider and people have been asking me why we haven’t been so prolific with our updates. Well….in this case Silence is Golden based on the new functionality and data rolling out soon!

Historically we were introducing new functionality every few days and rolling it out with a “continuous beta” approach to delivery. We were also working on only three computers and were challenged with issues of uptime and handling. At the RSC we have access to development, test and live environments, we have a stable compute environment supporting the system that provides power support where previously we would have been at risk of outages. We have a support team who have “got our backs” and we are not dealing with all of the issues regarding keeping the environment healthy for the ChemSpider platform. With our new hosted environment and the drive to move away from our previous constant and ongoing updates to a more controlled process for rollout, specifically including internal testing prior to going Live, we have been working on procedures to ensure the best delivery. In parallel we have been working on a series of internal projects that are very exciting and you should see the results soon!

With our new processes in place, and our new systems now established, we have been working on new functionality development and are happy to announce that we will now be moving towards regular updates, every few weeks. We’re starting this week with the roll out of a set of new capabilities for you to try out. I’ll highlight these in a series of blog posts over the coming days. Let’s start with this one…

We are happy to announce an improved integration to the patent web service provided to us via our collaboration with SureChem. We announced our initial integration to this service at the ACS meeting last fall in Washington and received a lot of positive feedback regarding the implementation. That rollout only provided integration to a subset of the entire collection, the USPTO. SureChem host data from a number of patent agencies and the collection includes USPTO Granted, USPTO Applications, European Granted, European Applications, WO/PCT and Japanese Abstracts. Thanks to their web service we now have the ability to retrieve information regarding those sources also. The image below shows the patents retrieved for Xanax. Check it out…give us your feedback and extend holiday cheer to SureChem also for their contribution to the community.

As an active member of the Wikipedia Chemistry team I continue to be impressed with the dedication and commitment that the members have to improving the quality AND quantity of information available on Wikipedia for chemists. The number of lost hours of sleep freely given to the benefit of Wikipedia, and in this specific case to the chemistry community, is immense. The number of “Compound Pages” on Wikipedia dedicated to drugs/chemicals has continued to grow and, despite a sincere effort on our part to keep everything linked up from ChemSpider to Wikipedia it’s a little like chasing the Road Runner….we’re always behind!

We have been working with the WikiChem team of late to embed links from Wikipedia back to ChemSpider. I am humbled to know that our hard work to establish ChemSpider as a source of quality information has reached a level of trust such that Wikipedia now links from the ChemBoxes out to ChemSpider. The links are being updated on an ongoing basis at present, with hundreds of new links already established and more being generated all the time. Wikipedia User: Beetstra has written a ‘bot that is inserting ChemSpiderIDs across the database (see below) and we ARE doing rigorous checking of all of the links. This was using a file that we generated on our side showing links to Wikipedia from ChemSpider.

We will then be able to generate a list of all ChemBoxes/DrugBoxes without links from Wikipedia to ChemSpider and we will then make the links on our side, manually curating the structures, and then hand back a file to finish all linking. At this point we will have the backfile under control and we can perform ongoing updates as new compound pages are created on ChemSpider and, if we curate and find errors on Wikipedia or ChemSpider, making a few manual edits is easy.

There are very dedicated teams on Wikipedia and ChemSpider carefully poring over data with their robots and eyeballs to create a linked data set of quality chemistry. It’s long, tedious AND important work. When it’s done we will have an expanded set of data to semantically link from RSC articles when we do markup.

I’ve been in discussions with JC Bradley and Andy Lang about the Open Notebook Science Solubility Data project. Specifically we’ve been comparing logP predictions from the CDK versus those listed on ChemSpider. We actually have six values of logP listed for some records. For example, for toluene we have 4 predicted values, 1 experimental value from a database and 1 experimental value from a publication. These are shown below:

There are three predicted logP values from three different algorithms (ACD/LogP, XlogP and AlogPs) as shown at the top of the figure. There is a predicted value and a database value from the EPISuite from the EPA (middle of the figure) and there is a LogP value from a publication with the link out indicated by the arrow (this datum was deposited by Egon Willighagen when he deposited the data from his publication). If you examine the list of data, both experimental and predicted, you will see a general value of around 2.65+/- error. This should be compared with the CDK value listed in the ONS spreadsheet that gives a predicted value of 0.64. This was the primary reason that we were discussing the comparison…the values of predicted logP from CDK were different from the predicted values listed on ChemSpider for a number of examples in the spreadsheet.

Egon and I exchanged a couple of emails discussing the fact that logP predictions could be generated by a number of parties if there was a good Open Data training set available. A recent publication entitled “Calculation of Molecular Lipophilicity: State of the Art and Comparison of Log P Methods on More Than 96000 Compounds” performed a thorough analysis of different logP methods on a very large dataset. The publication is available online here. They compared “the predictive power of representative methods for one public (N = 266) and two in house datasets from Nycomed (N = 882) and Pfizer (N = 95 809). A total of 30 and 18 methods were tested for public and industrial datasets, respectively.” During the work they derived a simple equation based on the number of carbon atoms, NC, and the number of hetero atoms, NHET: log P = 1.46(±0.02) + 0.11(±0.001) NC – 0.11(±0.001) NHET. This equation was shown to outperform a large number of programs benchmarked in this study. This would certainly be easy to implement on ChemSpider and, just out of interest, applying this equation to toluene (NC = 7, NHET = 0) gives us a value of 2.23. Compare this with the values listed above.
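To show just how easy the equation would be to implement, here is a minimal Python sketch. The function name simple_logp is my own invention for illustration, and the sketch uses only the central coefficients quoted above, ignoring the ± uncertainties.

```python
def simple_logp(n_carbon: int, n_hetero: int) -> float:
    """Estimate log P from atom counts: 1.46 + 0.11*NC - 0.11*NHET.

    Uses only the central coefficient values quoted in the text;
    the quoted uncertainties are not propagated.
    """
    return 1.46 + 0.11 * n_carbon - 0.11 * n_hetero


# Toluene (C7H8): seven carbon atoms, no hetero atoms.
print(round(simple_logp(7, 0), 2))  # 2.23
```

Counting the atoms is the only real work, and any cheminformatics toolkit can do that from a structure.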

However, even though there are a lot of predictors available it still makes sense to gather data and provide it as an experimental dataset, made available as Open Data so that the developers of such algorithms can take advantage of structural diversity and fresh data to potentially improve their models. If you have any logP data available please point me to the data to download or contact me offline to discuss. We are presently working on enhancing our data model to provide improved access to experimental data on ChemSpider as well as access to the predicted data via web services. More to follow…

Last week I had the pleasure of being on an agenda with a number of people whose work I applaud and who I genuinely enjoy spending time with and sharing thoughts about “what if?” Martin Walker, one of the people I collaborate with on Wikipedia, invited me to speak in his session “Publishing and Promoting Chemistry in the Internet Age”. Martin gave an introduction to the session and spoke about Chemistry on the Internet. Beth Brown gave an overview of the Chemist’s Toolkit for Publishing and Promoting your work on the Internet. I followed with an overview about what’s going on with ChemSpider and the issues of connectedness and quality of chemistry on the internet. JC Bradley spoke about transparency and Open Notebook Science. My hat’s off to Martin for arranging the speakers in that order. Considering we didn’t coordinate our talks it was an excellent trajectory throughout the session and very much an integrated overview of activities regarding chemistry on the internet.

My talk is posted on SlideShare here and is available below. Any comments and questions are welcomed.

Beth Brown has her talk online here and JC Bradley will post his online here.

JC Bradley and I had a good talk about ways we can collaborate together more closely on Open Notebook Science. We have a path forward so that ChemSpider can provide additional support and will be discussing the path forward offline.

Google are riding the surf associated with their release of Wave, even to a very small group of testers. Just do a search of Google Wave and you’ll see what I mean. There is a certain amount of “wave envy” in our domain right now as people want to get accounts to test. Test accounts are however being freed up quite quickly and there will be a number of cheminformaticians eager to insert their code into Wave as robots and enable specific integrations. When I was at Scifoo a few weeks ago we were granted Wave accounts to play around. I was impressed with the possibilities but found the system to be a little underwhelming in terms of stability and a little unfriendly in terms of usability. But, these are issues acknowledged by the team and, like many things Google, we are sure to see Wave get picked up by the masses when it’s released. And it WILL release, with great fanfare.

Cameron Neylon has been the most vocal advocate of Google Wave ever since the first announcements were made about the platform. He has been pivotal in getting a voice for science with the Google Wave team and coordinated a meeting for us with members of the dev team at SciFoo. It was clear in that meeting that the meshing of ChemSpider web services into Google Wave would enable Waves to be enhanced with (semi-)semantic markups so that, at a minimum, chemical names could be used to look up chemicals on ChemSpider and retrieve a structure image so that hovering over the name in the document would show the structure image. Unfortunately we’ve been swamped with migrating ChemSpider to RSC servers and preparing for and attending the IUPAC Congress and ACS Fall Meeting in Washington. So, we got a grand sum of nothing done integrating Wave and ChemSpider.

Fortunately, we did well when the web services were built and Cameron has moved ahead with coding up ChemSpidey on his own. He announced that ChemSpidey is alive and kicking, with all eight legs, in his blog post here. Stealing shamelessly from Cameron’s post:

“If ChemSpidey is added to a wave it watches for text of the form “chem[ChemicalName{;weight {m}g}]” where the curly bracketed parts are optional. When a blip is submitted by hitting the “done” button ChemSpidey searches through the blip looking for this text and if it finds it, strips out the name and sends it to the ChemSpider SimpleSearch service. ChemSpider returns a list of database ids and the robot currently just pulls the top one off the list and adds the text ChemicalName (csid:####) to the wave, where the id is linked back to ChemSpider. If there is a weight present it asks the ChemSpider MassSpec API for the nominal molecular weight calculates the number of moles and inserts that. You can see video of it working here (look along the timeline for the ChemSpidey tag).”
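For anyone curious about the text-matching step Cameron describes, here is a rough Python sketch of how the “chem[ChemicalName{;weight {m}g}]” pattern could be picked out of a blip. This is not Cameron’s code. The regular expression, the function names parse_chem_tags and moles, and the choice to normalize weights to grams are all my assumptions, and the molecular weight is passed in by hand rather than fetched from any ChemSpider service.

```python
import re

# Assumed reading of "chem[ChemicalName{;weight {m}g}]": a name, then an
# optional ";<number>g" or ";<number>mg" part before the closing bracket.
CHEM_TAG = re.compile(
    r"chem\[(?P<name>[^\];]+)"
    r"(?:;\s*(?P<amount>\d+(?:\.\d+)?)\s*(?P<unit>mg|g))?\]"
)


def parse_chem_tags(text):
    """Return (name, grams) pairs; grams is None when no weight is given."""
    found = []
    for match in CHEM_TAG.finditer(text):
        grams = None
        if match.group("amount") is not None:
            grams = float(match.group("amount"))
            if match.group("unit") == "mg":
                grams /= 1000.0  # convert milligrams to grams
        found.append((match.group("name").strip(), grams))
    return found


def moles(grams, molecular_weight):
    """Number of moles, given a weight in grams and a molecular weight in g/mol."""
    return grams / molecular_weight


print(parse_chem_tags("Add chem[toluene;920mg] to chem[benzene]"))
```

In the real robot the name would go to the search service and the molecular weight would come back from the MassSpec API; here both steps are left to the caller.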

Go and watch the movie. You’ll likely have to watch it while zoomed in to see what is going on. Cameron went further than I’d originally considered by pulling back Mw from our MassSpec Web service in order to do calculations on the fly etc. The display of the structure by hovering over the CSID embedded in the Wave is not yet implemented and we need to cover this for sure.

This is a good start to build on and some things that we have to work on…

1) If a call is made to retrieve a chemical based on a chemical name and there are MULTIPLE compounds with that name then figure out how to allow the user to select the one they want

2) Display the structure image with direct link back to ChemSpider – and if appropriate extend to include links to PubChem, Wikipedia, RSC journal articles etc, presence of analytical data etc. (all the things we were going to do with ChemMantis!)

3) Change data model to mark “Fully Curated” structures so that when a structure image and associated meta data are passed to ChemSpidey the robot knows that this isn’t just a name-structure relationship but that humans have curated the data and say “it’s correct”. Then of course…humans can be wrong too!

We are now working in multiweek development sprints and will look to include some time for ChemSpidey enhancement/development in a future sprint. I have a lot of faith in what Google Wave will bring to us all and, despite the early teething troubles, as with all things Google (as far as I can tell) it will improve in terms of stability and usability but may be in perpetual beta for a few years!

For those of you who have been using ChemSpider for the past few months you will be aware that historically we had an integration in place to SureChem’s Patent Portal. A few months ago that integration was unfortunately broken as SureChem improved their service. Also, we were un-synchronized with their growing set of chemical structures as they updated their patents. The previous integration was very limited in nature anyway as it simply showed the presence of patents associated with the ChemSpider structure in the SureChem database. Certainly a more ideal solution is the one that we introduced just in time for the ACS meeting in Washington.

The new solution not only lists the number of patents containing the chemical compound shown in the ChemSpider record but also shows the first 10 patents, by title, and provides direct link-throughs to the patents on SureChem. This is a much improved integration and we hope you enjoy it. The next stage is to deposit the latest SureChem structure collection that has grown significantly since our last deposition. Thanks to our collaborators at SureChem for offering you, our users, access to their service.

The InChI resolver was rolled out to the community in March 2009 with the purpose of providing a centralized resource for chemists to resolve InChIs (International Chemical Identifiers). This presentation will provide an overview of the development of the underlying technologies associated with the InChI resolver, and how the resolver is being used, integrated and enhanced to provide additional value to the chemistry community. We will discuss present limitations to application of the resolver for providing access to databases and chemistry information distributed across the internet and define our vision for enhancing interconnectivity across Open databases using the InChI resolver as the glue.

ChemSpider: Building a knowledge-based community for chemists using social and data networking technologies (Link to Slideshare)

In less than 2 years ChemSpider has become one of the primary online resources for chemists providing access to an unsurpassed aggregate of free-access knowledge and data. ChemSpider was developed with the intention of providing a structure centric community for chemists that would be enhanced by data depositions, curations and annotations by the community. The system presently hosts over 21.5 million chemical compounds from over 200 data sources. Working with a network of advisors, collaborators and data providers ChemSpider has created a unique resource of integrated information for chemists. These efforts have enabled us to support the curation of the Wikipedia chemistry pages, the production of a community supported Open Access chemistry journal and provision of web services integrated to spectrometer systems distributed around the world. This talk will provide an overview of how ChemSpider utilized social and data networking to create a community for chemistry.

Building an integrated system for chemistry markup and online publishing integrated to online chemistry resources (Link to Slideshare)

The extraction of chemical entities from documents such as patents and publications has been pursued for a number of years. We wish to report on ChemMantis, an integrated system for chemistry-based entity extraction and document mark-up enabling access to the rich resource of online chemistry known as ChemSpider. We will discuss the development of the platform from its inception as a series of dictionaries to the integration of an entity extraction algorithm and its expansion to a public deposition and publishing platform for chemistry. Chemistry articles can now be deposited, marked-up and exposed to the public within a few minutes in many cases making it an ideal platform for communicating research and providing integrated access to data sources including PubChem, ChEBI, Wikipedia and Entrez.

ChemSpider will go offline today for the next 24 hours. We will switch the servers off at around 11am today (give or take some latitude). We will do a differential backup and restore to the RSC servers all changes to the database and switch over to their systems overnight. Testing performed over the weekend has proceeded rather well and we are hoping for a seamless transition, acknowledging that we will have this one day of downtime.

We apologize in advance for any disruptions, particularly to the many people now using ChemSpider services to feed their own systems. We expect improved service for all when this transition is complete.

We’ll see you on the other side of this transition in just over 24 hours. Wish us luck…

I blogged yesterday about our release of Wikipedia Services on ChemSpider and how we are working to support authors on Wikipedia articles. Of course there are MANY languages of Wikipedia (as shown below) and we are willing to produce multilingual support. All we need is someone from the specific language version of Wikipedia to contact us and map the ChemBoxes and Drugboxes into their relevant languages. Let us know if you are interested.

Wikipedia is great. I use it regularly. I’ve been working, with a team of experts, on curating and validating the “structure-based data” in the ChemBoxes and DrugBoxes for almost a year and a half. It’s been a long path and on the journey I have met some great people and made some true friends. I also HAVE NOT met most of the people I share the IRC chats with. We are a highly opinionated bunch of people but with a common focus of making Wikipedia better and making the data and content as accurate as possible.

We have the Wikipedia article lead in thousands of records on ChemSpider now. They are updated regularly as Wikipedia itself expands. One of the areas we have been focused on since the inception of the work was getting correct structures in place with the associated data. This includes the molecular formula, molecular weight, SMILES, InChI String, InChIKey, systematic name and so on. In order to help the process of expanding Wikipedia with new records and to provide a lot of these data automatically we have set about providing a Wikipedia Service so that Wikipedians can use ChemSpider as the source of the chemical structures of interest and generate the DrugBox and ChemBox content from ChemSpider. It’s a rather simple process…

Assume that you wanted to create a ChemBox for Domoic Acid. You would search for Domoic Acid on ChemSpider. You would then validate whether the structure on ChemSpider named domoic acid is correct and, if so, you would generate the Wikibox by clicking on the link to the right of the Quick Links.

Following this simple button click the user is shown a new window displaying the “Design Wikibox” functionality. There are various flavors of ChemBoxes and Drugboxes which can be generated and the image below shows the “Simple ChemBox”

At present we fill the box with those data we have easy access to from ChemSpider and based on the chemical structure. We list all other fields for Wiki depositors to populate. For the Simple ChemBox this looks like this for Domoic Acid

We insert the PubChemID associated with the particular structure if there is a related PubChem record. We also insert the ChemSpider ID in case the user wants to link back to ChemSpider. A Full ChemBox is much longer:

The user can also use the ChemSpider image and can resize it and click on the image to download it as a PNG file. We believe that our images are attractive and appropriate for web display. Wikipedia presently favors the ACS format, so based on feedback we can change the config file behind the image generator to produce a different format for display.

We are considering extending the system to support direct uploads of Molfiles and/or other structure formats rather than depending on a compound being on ChemSpider. However, it is VERY likely that chemical compounds of value to the Wikipedia encyclopedic content already exist on ChemSpider. The trick is to find them since they may not have the Wikipedia article chemical name associated with the record. An InChI-based, SMILES-based or alternative name search might help locate the record. Alternatively a full structure search via the applet will find the record OR the user can DEPOSIT the structure to ChemSpider and work from there. The system is flexible enough.
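For those who like to see the idea in code, here is a hedged Python sketch of the “fill in what we know, leave the rest blank” approach described above. It is not the code behind the Design Wikibox page. The function name simple_chembox is hypothetical, and the template and field names (Chembox Identifiers, Chembox Properties, PubChem, ChemSpiderID and so on) are my assumption of the Wikipedia template layout and should be checked against the current template documentation.

```python
def simple_chembox(name, formula=None, molar_mass=None, smiles=None,
                   inchi=None, pubchem_id=None, chemspider_id=None):
    """Build Chembox wikitext; unknown values are left blank for editors to fill."""

    def field(key, value):
        # A blank right-hand side signals "to be populated by the depositor".
        return f"| {key} = {'' if value is None else value}"

    lines = [
        "{{Chembox",
        field("Name", name),
        "| Section1 = {{Chembox Identifiers",
        field("PubChem", pubchem_id),
        field("ChemSpiderID", chemspider_id),
        field("SMILES", smiles),
        field("InChI", inchi),
        "}}",
        "| Section2 = {{Chembox Properties",
        field("Formula", formula),
        field("MolarMass", molar_mass),
        "}}",
        "}}",
    ]
    return "\n".join(lines)


# Only the formula and molar mass are supplied here; identifiers stay blank.
print(simple_chembox("Domoic acid", formula="C15H21NO6", molar_mass=311.33))
```

The identifiers are deliberately left empty in the usage line; in practice they would be filled from the ChemSpider record being viewed.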

This is our first release of the Wikipedia Services so we welcome any and all feedback. It’s one more way we are giving back to the Wikipedia community for their service. The outcome for us will also be crowdsourced curation of ChemSpider…as Wikipedia articles are written we will clean up related structures on ChemSpider. Everyone wins.

By the way…compare OUR structure for Domoic Acid with the one on Wikipedia. Does anyone know which is correct?

I’ve blogged previously about ChemSpider in your hand. I use ChemSpider in my hand daily via Safari on an iPhone but a mobile app is under development by James Jack (Symyx consultant). James has been burning the candle at both ends progressing the iPhone application…and not without a lot of hurdles. In Skype discussions with him yesterday he has progressed well and will be finished shortly. The first screenshots through the iPhone emulator look good and one is shown below.