The National Chemical Database Service Allowing Depositions

20 Oct

The UK National Chemical Database Service (available here) has been online since 2012. When I worked at RSC I was intimately involved in writing the technical response to the EPSRC call for the service and, in this blog, I outlined a lot of intentions for the project. From my point of view, a key part of the project was to deliver a repository to store structures, spectra, reactions, CIF files, etc., as I outlined in the blog post.

“Our intention is to allow the repository to host data including chemicals, syntheses, property data, analytical data and various other types of chemistry related data. The details of this will be scoped out with the user-community, prioritized and delivered to the best of our abilities during the lifetime of the tender. With storage of structured data comes the ability to generate models, to deliver reference data as the community contributes to its validation, and to integrate and disseminate the data, as allowed by both licensing and technology, to a growing internet of the chemical sciences.”

In March 2014 at the ACS Meeting in Dallas I presented on our progress towards providing the repository (see this slide deck). ChemSpider has been online for over ten years; we were accepting structure depositions within the first 3 months and spectra a few weeks later (see blogpost). The ability to deposit structures as molfiles or SDF files has been available on ChemSpider for a long time, and we delivered the ability to validate and standardize them using the CVSP platform (http://cvsp.chemspider.com/), which we submitted for publication three years ago (October 28th, 2014) and which is published here: https://jcheminf.springeropen.com/articles/10.1186/s13321-015-0072-8. With structure and spectra deposition in place for over a decade, a validation and standardization platform made public three years ago, and a lot of experience with depositing data onto ChemSpider, all the building blocks for the repository have been in place.

Today I received an email announcing “Compound and Spectra Deposition into ChemSpider”. I read it with interest, as I guessed it meant deposition was “going mainstream” in some way, given that it has been around for a decade as a capability. Refactoring should be a constant for any mature platform, so my expectation was that this would show a more seamless process for depositing various types of data, a more beautiful interface, and new whizz-bang visualization widgets building on a decade of legacy development. I expected it to take the best of what we built for data registration, structure validation and standardization (and all of its lessons!) and to include rebuilds of some of the spectral display components that we had. It’s not quite what I found when I tested it.

Here’s my review.

My expectation was that I could go to http://deposit.chemspider.com and deposit data to ChemSpider. The website is simply a blue button with “Log in with your ORCID”. There is language recognizing that the OpenPHACTS project funded the validation and standardization platform work, which is definitely appropriate, but some MORE GUIDANCE as to what the site is would be good!

“Validation and standardisation of the chemical structures was developed as part of the Open PHACTS project and received support from the Innovative Medicines Initiative Joint Undertaking under grant agreement no. 115191, resources of which are composed of financial contribution from the European Union’s Seventh Framework Programme (FP7/2007-2013) and EFPIA companies’ in-kind contribution.”

This means that it should be possible to deposit a molfile, have it checked (validated) and standardized then deposited into ChemSpider, having passed through CVSP. So what happened?

I downloaded the structure of Chlorothalonil from our dashboard and loaded it. The result is shown below. The structure was standardized and correctly recognized as a V3000 molfile. The original structure was not visible, there were no errors or warnings and the structure DID standardize.

The original isotope labels were removed, the layout was recognized as congested and partially defined stereo was recognized. But it wouldn’t deposit. I tried many others and none of them would deposit. I was going to give up but tried Benzene, as a V2000 molfile downloaded from ChemSpider. And….YAY….it went in. The result is below.
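Since the one file that went in was a V2000 molfile and the Chlorothalonil file was V3000, it is at least worth knowing which flavor you are about to upload. Whether the version had anything to do with the failures is NOT something I established. The following is a minimal sketch in plain Python, assuming the standard molfile layout in which the fourth line (the counts line) ends with a `V2000` or `V3000` tag; the function name and the methane examples are hypothetical and purely for illustration.

```python
def molfile_version(molfile_text: str) -> str:
    """Return "V2000", "V3000" or "unknown" based on the counts line.

    Assumes the standard layout: three header lines, then the counts line.
    """
    lines = molfile_text.splitlines()
    if len(lines) < 4:
        raise ValueError("expected a 3-line header followed by a counts line")
    counts_line = lines[3].rstrip()
    for tag in ("V3000", "V2000"):
        if counts_line.endswith(tag):
            return tag
    # Some older files carry no version tag at all.
    return "unknown"


# Hypothetical single-atom examples (methane carbon only), for illustration.
V2000_EXAMPLE = "\n".join([
    "methane",
    "  hypothetical-example",
    "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
])

V3000_EXAMPLE = "\n".join([
    "methane",
    "  hypothetical-example",
    "",
    "  0  0  0     0  0            999 V3000",
    "M  V30 BEGIN CTAB",
    "M  V30 COUNTS 1 0 0 0 0",
    "M  V30 BEGIN ATOM",
    "M  V30 1 C 0 0 0 0",
    "M  V30 END ATOM",
    "M  V30 END CTAB",
    "M  END",
])

print(molfile_version(V2000_EXAMPLE))
print(molfile_version(V3000_EXAMPLE))
```

This only reads the version tag; it does not validate the structure in any way.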

A unique DOI is issued to the record, associated with my name. It is NOT deposited into ChemSpider as far as I can tell because the structure is already in ChemSpider. There is also no link from ChemSpider back to my deposition that I can find. My next try was to find a chemical NOT in ChemSpider and to deposit that. That failed. I tried Benzene again and it worked a second time. I judged that maybe a simple alkyl chain would work for deposition. The result is below.

The warning “Contains completely undefined stereo: mixtures” does not make sense at all for this chemical. PLUS it wouldn’t deposit.

I then tried to register a sugar as a projection with the result shown below. I consider this one to have some real errors and do not AT ALL like the standardized version.

I tried a simple inorganic. I think KCl should be recognized as an ionic compound, K+Cl-, or there should be at least SOME warning!?

The testing I did took about an hour overall. I identified a LOT of issues. I think this release, while it may be a beta release for feedback, is way premature and needs a lot more testing. I am hopeful that more people will fully test the platform, as the ABILITY to deposit data, get a DOI, and associate it with your ORCID account is valuable. However, it’s not obvious that anything is linked back to ORCID; it appears to be used for nothing more than login.

I did NOT test spectral deposition but am concerned that the request seems to be for original data. In binary vendor file format? Uh-oh. That’s not a good idea!

I hope this blog motivates the community to test, give feedback and push the deposition system to deal with complex chemistries, so that at least the boundary conditions of performance for Deposit.ChemSpider.Com can be defined, the system can be improved and a community can be built around the functionality. As far as I can find (?), there is no real connection to ChemSpider, so the site appears to be more about writing a chemical to some other repository.

Building public domain chemistry databases is hard work. User feedback and guidance are essential. Please give your feedback and test the system.

About tony

Antony (Tony) J. Williams received his BSc in 1985 from the University of Liverpool (UK) and PhD in 1988 from the University of London (UK). His PhD research interests were in studying the effects of high pressure on molecular motions within lubricant related systems using Nuclear Magnetic Resonance. He moved to Ottawa, Canada to work for the National Research Council performing fundamental research on the electron paramagnetic resonance of radicals trapped in single crystals. Following his postdoctoral position he became the NMR Facility Manager for Ottawa University.
Tony joined the Eastman Kodak Company in Rochester, New York as their NMR Technology Leader. He led the laboratory to develop quality control across multiple spectroscopy labs and helped establish walk-up laboratories providing NMR, LC-MS and other forms of spectroscopy to hundreds of chemists across multiple sites. This included the delivery of spectroscopic data to the desktop, automated processing and his initial interests in computer-assisted structure elucidation (CASE) systems. He also worked with a team to develop the world’s first web-based LIMS system, WIMS, capable of allowing chemical structure searching and spectral display.
With his developing cheminformatic skills and passion for data management he left corporate America to join a small start-up company working out of Toronto, Canada. He joined ACD/Labs as their NMR Product Manager and held various roles, including Chief Science Officer, during his 10 years with the company. His responsibilities included managing over 50 products at one time prior to developing a product management team, managing sales, marketing, technical support and technical services. ACD/Labs was one of Canada’s Fast 50 Tech Companies, and Forbes Fast 500 companies in 2001. His primary passions during his tenure with ACD/Labs were the continued adoption of web-based technologies and the development of automated structure verification and elucidation platforms. While at ACD/Labs he suggested the possibility of developing a public resource for chemists attempting to integrate internet available chemical data. He finally pursued this vision with some close friends as a hobby project in the evenings and the result was the ChemSpider database (www.chemspider.com). Even while running out of a basement on hand-built servers the website developed a large community following that eventually culminated in the acquisition of the website by the Royal Society of Chemistry (RSC) based in Cambridge, United Kingdom. Tony joined the organization, together with some of the other ChemSpider team, and became their Vice President of Strategic Development.
At RSC he continued to develop cheminformatics tools, specifically ChemSpider, and was the technical lead for the chemistry aspects of the Open PHACTS project (http://www.openphacts.org), a project focused on the delivery of open data, open source and open systems to support the pharmaceutical sciences. He was also the technical lead for the UK National Chemical Database Service (http://cds.rsc.org/) and the RSC lead for the PharmaSea project (http://www.pharma-sea.eu/) attempting to identify novel natural products from the ocean.
He left RSC in 2015 to become a Computational Chemist in the National Center of Computational Toxicology at the Environmental Protection Agency where he is bringing his skills to bear working with a team on the delivery of a new software architecture for the management and delivery of data, algorithms and visualization tools. The “Chemistry Dashboard” was released on April 1st, no fooling, at https://comptox.epa.gov, and provides access to over 700,000 chemicals, experimental and predicted properties and a developing link network to support the environmental sciences.
Tony remains passionate about computer-assisted structure elucidation and verification approaches and continues to publish in this area. He is also passionate about teaching scientists to benefit from the developing array of social networking tools for scientists and is known as the ChemConnector on the networks. Over the years he has had adjunct roles at a number of institutions and presently enjoys working with scientists at both UNC Chapel Hill and NC State University. He is widely published with over 200 papers and book chapters and was the recipient of the Jim Gray Award for eScience in 2012. In 2016 he was awarded the North Carolina ACS Distinguished Speaker Award.

4 Responses to The National Chemical Database Service Allowing Depositions

Emma Schymanski

October 21, 2017 at 1:13 pm

I received an email from my husband, sitting in an Open Science workshop, pointing me to this service. He knows I have been a huge ChemSpider fan and user over many years now – and wondered if I knew about it.
I found the entry screen off-putting (just the login, no clarification what the service offers, what format etc) and could only see a contact form, no details about who was involved (I was curious). I didn’t have my ORCID details on hand and waited until a few days later, on a laptop with the ID integrated, to try deposition. Only one mol file at a time?!?! As a heavy user, I would at least like SDF or SMILES or InChI or some choice. I rarely have just mol files lying around, so also tried depositing a structure Tony and I helped get into the Dashboard, because I knew it wasn’t in ChemSpider. I received the same “Oops”, tried a couple of times, then used the contact button to report the issue. I received a nice reply saying that the issue is their side and they are working on it and will let me know when they know more. A good response.
One mol at a time will not serve my deposition needs. There are so many other options that could make this more efficient, I would wish for some ability to add more at once.
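Until batch upload exists, one workaround for anyone holding an SDF file is to split it into individual molfiles locally and feed them in one by one. Below is a minimal sketch in plain Python, assuming the usual SDF convention that records are separated by a line containing only `$$$$` and that the molfile part of each record ends at `M  END`; the function names and the two tiny example records are hypothetical, and nothing here talks to the deposition site itself.

```python
from typing import List


def split_sdf(sdf_text: str) -> List[str]:
    """Split SDF text into records using the '$$$$' delimiter lines."""
    records: List[str] = []
    current: List[str] = []
    for line in sdf_text.splitlines():
        if line.strip() == "$$$$":
            # Keep the record only if it has some non-blank content.
            if any(l.strip() for l in current):
                records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    # Tolerate a final record that lacks a trailing delimiter.
    if any(l.strip() for l in current):
        records.append("\n".join(current))
    return records


def molfile_part(record: str) -> str:
    """Return the molfile portion of a record, dropping any data fields."""
    lines = record.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("M  END"):
            return "\n".join(lines[: i + 1])
    raise ValueError("record has no 'M  END' terminator")


# Hypothetical two-record SDF (single-atom structures), for illustration.
SDF_EXAMPLE = "\n".join([
    "methane",
    "  hypothetical-example",
    "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
    ">  <NAME>",
    "methane",
    "",
    "$$$$",
    "water",
    "  hypothetical-example",
    "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
    "$$$$",
])

molfiles = [molfile_part(record) for record in split_sdf(SDF_EXAMPLE)]
print(len(molfiles))
```

Each entry in `molfiles` could then be written to its own `.mol` file. Note that the data fields (such as `<NAME>` above) are dropped in the process, which is one more reason a native SDF upload would be preferable.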

PS: I first tried to comment this ~10 hrs ago and failed twice during identification authorisation (two different methods). Not only structure deposition has its hiccups!

With regards to this comment:
“I did NOT test spectral deposition but am concerned that the request seems to be for original data. In binary vendor file format? Uh-oh. That’s not a good idea!”

If the choice is to offer one or the other as a first step, probably asking for the raw (original data) is the best choice. From our experience as a workflow using mzML, it is often best for us to have the raw data to validate and do the conversion ourselves – with the converter we know and trust. That way one avoids potential losses if the user chose a different converter that lost information. Plus you still have the raw data if converters get better in the future. However, that being said, there are still some cases where specific vendors offer better conversion options or data quality within their vendor software. Thus, the ability to upload either raw or open (or both) would be beneficial. Requiring both by default is a burden and will put some people off.

Tony (and Emma), thanks for the comprehensive feedback. As you alluded to, it is a pilot system with only basic features. Community feedback will help us to prioritise which features to add in the future.

Hi Mark, it would be great if you would also allow a user to comment with the upload, in addition to the DOI or a spectrum as a file. I would have liked to have been able to add a URL pointing to more resources for the compound I wished to upload, but there was no place for this. Sometimes we as a user also need to communicate something with the deposition. Furthermore, please add the ability to select multiple files at once for the spectra. MS/MS is best communicated with multiple collision energies and thus multiple files…
