This is a presentation I gave today at Bio-IT 2014 here in Boston. I was in the company of a number of my favorite people to be on the agenda with… Steve Heller, Steve Boyer, Evan Bolton and Chris Southan.

The Importance of the InChI Identifier as a Foundation Technology for eScience Platforms at the Royal Society of Chemistry

The Royal Society of Chemistry hosts one of the largest online chemistry databases containing almost 30 million unique chemical structures. The database, ChemSpider, provides the underpinning for a series of eScience projects allowing for the integration of chemical compounds with our archive of scientific publications, the delivery of a reaction database containing millions of reactions as well as a chemical validation and standardization platform developed to help improve the quality of structural representations on the internet. The InChI has been a fundamental part of each of our projects and has been pivotal in our support of international projects such as the Open PHACTS semantic web project integrating chemistry and biology data and the PharmaSea project focused on identifying novel chemical components from the ocean with the intention of identifying new antibiotics. This presentation will provide an overview of the importance of InChI in the development of many of our eScience platforms and how we have used it specifically in the ChemSpider project to provide integration across hundreds of websites and chemistry databases across the web. We will discuss how we are now expanding our efforts to develop a Global Chemistry Network encompassing efforts in Open Source Drug Discovery and the support of data management for neglected diseases.

As part of a collaborative project with Jean-Claude Bradley from Drexel University (and member of our Advisory Group) we are in the process of delivering new capabilities for the upload of “activity data” associated with one or more structures, the display of these data with the associated structures on ChemSpider and the download of these data to the desktop. Our efforts are part of JC’s overall workflow outlined here.

So far we have enabled the upload of CSV files containing SMILES strings and associated data, converting the CSV files on the fly to SDF files for deposition onto ChemSpider. We have also enabled the download of collections of data, using checkbox selection, as an SDF file with associated properties. Rather than insert images into this blog posting, please click here to see the PDF of the PowerPoint overview.
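
For readers curious what a CSV-to-SDF conversion of this kind involves, here is a minimal Python sketch. It is not the ChemSpider implementation: the function names, the `SMILES` column name and the sample values are all assumptions made for illustration, and the step that turns a SMILES string into a connection table is left as a caller-supplied function (a real pipeline would use a cheminformatics toolkit there). The stand-in below only emits an empty molblock so the SDF layout itself can be shown.

```python
import csv
import io


def placeholder_molblock(smiles):
    """Stand-in only: an empty V2000 molblock with the SMILES as its title.

    A real pipeline would call a cheminformatics toolkit here to build the
    atom and bond tables from the SMILES string.
    """
    return "\n".join([
        smiles,                                     # title line
        "  placeholder",                            # program/timestamp line
        "",                                         # comment line
        "  0  0  0  0  0  0  0  0  0  0999 V2000",  # counts line: no atoms, no bonds
        "M  END",
    ])


def csv_to_sdf(csv_text, smiles_to_molblock, smiles_column="SMILES"):
    """Convert CSV rows (one SMILES plus data columns) into SDF text.

    Assumes a well-formed CSV with a header row; `smiles_column` is a
    hypothetical default, not a documented ChemSpider requirement.
    """
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        smiles = row[smiles_column].strip()
        lines = [smiles_to_molblock(smiles).rstrip("\n")]
        for name, value in row.items():
            if name == smiles_column:
                continue
            # An SDF data item is a header line, the value, then a blank line.
            lines += [f">  <{name}>", value or "", ""]
        lines.append("$$$$")  # record terminator
        records.append("\n".join(lines))
    return "\n".join(records) + "\n"


if __name__ == "__main__":
    example = "SMILES,logP\nCCO,-0.31\nc1ccccc1,2.13\n"  # illustrative values
    print(csv_to_sdf(example, placeholder_molblock))
```

Each CSV row becomes one SDF record, and every non-SMILES column is carried along as a named data item, which is how the "associated data" travels with the structure.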

While we have the deposition and download processes essentially completed (except for testing), we now need to resolve the process for de-duplicating the submitted data against the database (or generating new structure records) as well as defining the format for display on the site. Watch this space.
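
To make the "de-duplicate or create a new record" decision concrete, here is a small Python sketch. It assumes that an InChI string is used as the identity key for a structure; that is our illustrative assumption rather than a description of the production system, and the class names, method names and sample data are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class StructureRecord:
    record_id: int
    inchi: str
    sources: list = field(default_factory=list)


class StructureRegistry:
    """Toy in-memory registry keyed by InChI string (an assumption)."""

    def __init__(self):
        self._by_inchi = {}
        self._next_id = 1

    def deposit(self, inchi, source):
        """Return (record, created).

        If the InChI is already known, the existing record is reused and the
        new source is attached to it; otherwise a new record is generated.
        """
        key = inchi.strip()
        record = self._by_inchi.get(key)
        created = record is None
        if created:
            record = StructureRecord(self._next_id, key)
            self._by_inchi[key] = record
            self._next_id += 1
        if source not in record.sources:
            record.sources.append(source)
        return record, created


if __name__ == "__main__":
    registry = StructureRegistry()
    ethanol = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
    first, created_first = registry.deposit(ethanol, "depositor-A")
    second, created_second = registry.deposit(ethanol, "depositor-B")
    print(created_first, created_second, first.record_id == second.record_id)
    print(first.sources)
```

The second deposition of the same structure does not create a new record; it only adds another source to the existing one.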

When we first started the ChemSpider project we made a commitment to “Build a Structure Centric Community for Chemists”. We believe we are well on the way to facilitating that. We have talked about a “wiki” environment for collaboration. In this framework we take “wiki” to mean a “collaborative environment”, not necessarily adherence to a specific wiki platform. Our intention is to provide the ability for users of ChemSpider to collaborate in the co-management of content on the ChemSpider site. A number of our readers have taken our statements to indicate that we will be using the same wiki platform as that utilized on Wikipedia. We have looked at and considered a number of “wiki” tools, platforms, interfaces and user experiences. At this time we have made a decision to utilize Microsoft SharePoint as the platform on which to construct our wiki environment. With a clear commitment to Web 2.0 already declared and our platform built on SQL Server and ASP.NET, we feel it is the appropriate platform for us to build on. We believe our technology choices have already demonstrated that we can deploy a good solution very quickly.

Now, we realize that this might result in a series of jabs about us not using Open Source solutions and so on but we are more focused on delivering an appropriate scalable solution than building ChemSpider only on Open Source software. We will support anyone who wishes to do the same on Open Source though.

We will keep you informed of our progress. Now we need to migrate ourselves to .NET3, and we hope the switch-over will cause only a short-term disruption. Watch this space.

Since we went live in March we have had a link on the web page for Registration. It has only now been enabled, so our apologies for the delay. Was it worth the wait? Absolutely. Visit the registration page and sign up to benefit from a number of advantages as well as to become a “contributor” to the site.

I have posted a number of times about the intention for ChemSpider to become Wiki-enabled. While we have not yet layered the full wiki capability onto the system we are about to unveil single structure deposition, multiple-structure deposition and spectrum association with a structure. We already have beta testers testing the spectral association and over 40 spectra have already been added to the database.

The reason we require registration should be fairly obvious. For the purpose of providing access to beta-testers and granting the appropriate rights to submitters and curators we need to have traceability regarding who is making the submissions, the comments and the edits. While we understand this might be slightly invasive it is appropriate to retain a certain level of control over what might show up on the database. We hope we have your support.

Some side benefits of registration include an additional way for us to deliver you updated information regarding new capabilities being introduced to ChemSpider, updates regarding content enhancements and, when the time is right, delivering information to you via a ChemSpider newsletter.

If you are a frequent user of ChemSpider you have likely been using the text-based searches to query the system. However, there are definitely those of you who are more adventurous and have initiated structure and substructure searches utilizing input via SMILES, InChI or the Structure Drawing Applet (See Section 2.3.3 of the online ChemSpider manual).

You might have discovered that some of those substructure searches can be a little time-consuming and you might want to let the search continue and “move on”. Alternatively, you might have performed a number of searches during a particular session with ChemSpider or may want to return to some results from searches performed in previous sessions. In any case, what you need is access to the details of the searches you have performed. The screen grab below speaks volumes.

As an example of a saved search in MY personal history of search transactions please visit the link listed here…note its unique nature in terms of identifier: http://www.chemspider.com/Search.aspx?rid=105ecb44-303b-4239-9559-75753bc9777c
While this search was very fast in reality, we are about to introduce Structure Similarity Searching onto ChemSpider. This can be a very time-consuming operation and certainly the deferred transaction tracking is likely to be of value when this new feature is rolled out.
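
For those wondering how a deferred, revisitable search might be tracked, here is a minimal Python sketch. It is not the ChemSpider code: it simply assumes that each search is given a random unique identifier (the `rid` in the link above has that general shape) and that results are stored against it so they can be fetched later. The class, its methods and the sample values are hypothetical.

```python
import uuid


class SearchHistory:
    """Toy in-memory log of search transactions keyed by a unique id."""

    def __init__(self):
        self._entries = {}

    def start(self, user, query):
        """Register a search and return its id before any results exist."""
        rid = str(uuid.uuid4())
        self._entries[rid] = {
            "user": user,
            "query": query,
            "status": "running",
            "hits": None,
        }
        return rid

    def finish(self, rid, hits):
        """Attach results once the (possibly slow) search has completed."""
        entry = self._entries[rid]
        entry["status"] = "done"
        entry["hits"] = list(hits)

    def get(self, rid):
        return self._entries[rid]

    def for_user(self, user):
        """Ids of every search a user has run, oldest first."""
        return [rid for rid, e in self._entries.items() if e["user"] == user]


if __name__ == "__main__":
    history = SearchHistory()
    rid = history.start("some-user", "c1ccccc1")   # substructure query as SMILES
    print(history.get(rid)["status"])              # user can "move on" here
    history.finish(rid, [101, 202])                # illustrative record ids
    print(history.get(rid))
```

Because the identifier is handed out before the search finishes, the user can leave and come back to the same link once the results are ready.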

For now, give us feedback on access to your personal history of transactions. Simply visit the link http://www.chemspider.com/history.aspx after you have performed some transactions on the system and you should see your list of actions. Enjoy…

So, ChemSpider is still in beta even though we are moving fairly quickly in the “back room”, and the development team has grown. There is a lot of infrastructure work going on despite what you see on the site when you come to crawl with the Spider. You should sense performance improvements when you use it though.

I’m the whiner on the ChemSpider team. That means I use the “whine” function in Bugzilla, our Open Source bug-tracking and feature-request tracking system. It’s a great system for our needs and if you need a system out-of-the-box which will suffice for most of your needs go grab it. So, what am I whining about? Here’s my list of Five Things I Don’t Like About ChemSpider.

1) Many of our molecules are simply ugly. The connection table is correct but some of our displays of the molecule are far from perfect – take a look at the two structures below. The one on the left does NOT really have a chlorine attached to the oxygen. The one on the right is simply a mess.

This is an issue of Structure Cleaning. It IS difficult. Even the drawing package vendors struggle with this. It needs improving…whine…

2) We really need to do something with the curated data. People have been curating data on our site for a few weeks. We need to do something with it to show that their efforts matter….whine, whine…

3) We need to allow people to deposit data. There are people wanting to flood the system with data…in some cases one structure and in many cases thousands of structures. Right now it’s a very manual process…data has to come to us. I want a user to be able to submit their own data…sure we’ll validate and review but let people deposit the data at least….whine, whine, whine…

4) I want people to add content and information to structures. Not everything can be done with robots and process…people need to contribute. Wikipedia is all about community participation. Chemists have information about structures, about reactions…about connected data. When they have info they want to associate/link/dump and connect to a structure I want them to be able to do it….whine, whine, whine, whine…

5) ChemSpider was a rushed release. We’ve never hidden it…it was rolled out in time for the ACS in Chicago…and rushed. It went live with a lot of holes. It’s still beta. It’s working, and it’s moving but it’s time for some of the work flows to be improved and the website to be “prettied up”……whine, whine, whine, whine, whine…

Ok…those are my top five whines….and none of them are from a bottle. So, with these in mind it’s what we’re off to work on. Our short to midterm efforts will be in these areas.

What are the things YOU don’t like about ChemSpider? Whine away…we’ll Bugzilla your comments!

ChemSpider was released with the ability to initiate searching of structures using two drawing tools – an Applet and ACD/ChemSketch. Other ways to submit structures include the copying and pasting of either SMILES strings or InChI strings. We have just celebrated 2 months online and are averaging about 800 users per day at present. Examination of the usage has shown that users submit structures in the following rank order: SMILES strings, applet, ChemSketch and then InChI.

During the two month beta period we have received numerous suggestions about how to improve the system. A number of these have included new ways to query the database with a chemical structure input. Specifically, some of the statements have been: 1) Use a better structure drawing applet, 2) Provide integration to ChemDraw, 3) Provide integration to ISIS, 4) Allow copying and pasting of a molfile. All are feasible of course…it is all about priorities.

In terms of applets we already have permission to use Peter Ertl’s JME applet and are aware of other potential options including Marvin, the JChemPaint applet and the MCDL applet. We’re very happy with the present applet ourselves but welcome your comments if you believe that other applets should be made available as an option.

Your comments are welcome either as a response to this blog posting or directly at development AT chemspider DOT COM (longhanded to prevent spam).

I was in an exchange with a friend this weekend about his interest in depositing data onto ChemSpider. Due to our travel schedules and family commitments we rarely talk by phone. This gentleman is a retired chemist, though highly active. He is an expert in nomenclature, has an incredible eye for quality and is a master curator of chemistry databases.

So, he is very interested in ChemSpider and the potential of exposing his databases. However, his expressed concern is that he will lose all the efforts he has invested in developing the databases. Again, these are manually curated, with an expert’s eye and, based on my experiences working with him, are of the highest quality. They amount to tens if not hundreds of hours of work and are a source of revenue for this gentleman.

With this in mind, and based on other blog posts I have seen, it appears that we have not clearly defined the intention of ChemSpider. What we are NOT doing is aggregating all data from all publicly available data sources or even supplied databases. Our intention for the immediate future is to form a structure centric environment linking out to the initial data source providers via the chemical structure. The individual providers continue to provide their content and retain their value proposition.

For example, the NIST Webbook is a container for a lot of information including spectral data. As discussed in another post about the sodium chloride dimer, ChemSpider will provide the link to the Webbook to display relevant data for this gas phase species. A search for diazepam will provide links out to all original data sources as shown here and they include ChemBank, ChEBI, NIST Webbook and many others.

ChemSpider is an aggregator of chemical structures and associated identifiers (enabling connectivity to other sites). We are NOT duplicating all content available at other sites. This removes the burden of updating associated data across multiple data sources as individual providers curate and update their own sources. It also keeps ChemSpider on task of linking together multiple sources of data via chemical structures rather than grabbing the work of other groups and reposting.

So, back to my friend who is worried about depositing data on ChemSpider. All we will be taking delivery of are the structures, the structure IDs (if available) and a link to information about the database. In this way we are directing individuals to rich sources of information for ChemSpider users to pursue as they see fit. Just as many depositions into the public online databases are from chemical vendors intending to potentially sell their materials the same model applies to database providers. After all, if information content is of value it is up to the user to choose to pay for the right to access.

Taking this one step further one has to consider the following question. For the large database providers (Beilstein – now MDL, Derwent, CAS, Cambridge Crystallographic Databases, DiscoveryGate and others) why not put their structure collections into the public domain for the purpose of searching and connecting back to the actual content of value? The structures themselves, as far as I know, are in all cases in the public domain since they are published (I might be wrong here but I cannot find statements to the contrary). The value comes from the information associated with the structures – one or more publications, reaction details, experimental or predicted properties, connection to a patent, and other such content association.

What’s at risk in providing public access to the structure database(s) for searching and charging the appropriate fees to access the information once identified? There is little value in simply knowing that a structure exists in a database, is there? Isn’t it the information associated that has value? If this wasn’t true then that would suggest that a large database of algorithmically generated structures created with something like MolGen or the structure generator in Structure Elucidator would have value. In fact it does….see the work of Reymond et al in their “Virtual Exploration of the Chemical Universe up to 11 Atoms of C, N, O, F“. The value however comes not from the computer structure itself but rather the virtual screening response.

I judge there are two challenges – a decision at the management level to expose the large structural repositories and the enormous hurdles in migrating certain classes of chemical structures to SDF format to be hosted by general services – specifically polymers, organometallic complexes and inorganics (also all challenges for ChemSpider!). I think the primary challenge is the decision to expose the data…I judge it’s the right decision to make with the increasing availability of Open Access databases such as ChemSpider. It’s a BIG decision …

We’re at 5 weeks since we let people onto ChemSpider to crawl our web. The service is now enabling over 650 visitors per day on average in the past week to search the database and utilize the services. At present the database has grown to 10.6 million structures. However, as stated at the time of release we would limit the database to around 10 million structures solely for the purpose of testing.

At present we have 800,000 molecules from 4 new contributors waiting to be added to the database after prediction of all associated properties and then de-duplicated with the existing content. We are in the process of converting over 10 million NEW structures from SMILES format. These will also be passed through all prediction algorithms, added to the database and then de-duplicated. There are a number of other databases to be delivered to ChemSpider for preparation in the next few days.

With full disclosure you should be aware that ChemSpider is served up from two Dell servers. They host the transactions, the database, the web server, the webpage, our email system. They are also, in parallel, converting SMILES to connection tables (millions of them) and are predicting a series of properties on every structure (check ChemSpider…you’ll see them all). Some of these properties take many seconds since they are complex calculations.

Bottom line…these servers are in dire need of air conditioning systems. They are running flat out. Our ability to provide fast searches, especially structure and substructure searches, while also being able to perform transaction-based predictions is already starting to fail. People are reporting performance issues so we have moved our predictions to the evening. It is clear we need a new server already (likely two) and much earlier than expected. I guess that’s what you call one of the struggles of success.

I spent today reading about how one of the founders of YouTube kept extending his credit card bill to cover their technical costs (Time, January 2007). Well, we are not YouTube, this is not Silicon Valley and we’re not putting our families at risk. As already discussed on this blog we may need to seek sponsorship. As it is we made the painful decision today that we have to start some form of advertising. We will be judged on this. And we acknowledge it. But our intention is to stay faithful to the community to have the service remain free but offer the best services and throughput that we can. Only more computing power will allow this at this stage. We will stumble along for the next month with what we have but if the dataset grows to the expected 15-20 million structures we will need to expand our plastic boxes. Hopefully our users will understand.