What’s the value of redepositing the same structures to PubChem? There are actually many benefits: structures on PubChem that link out to ChemSpider will now be connected to patents, to analytical data, to additional identifiers not available on PubChem, to curated identifiers (compare the list of names for methane on PubChem versus ChemSpider), to Supplementary Data, and to additional predicted properties. So, there is actually a LOT of value in having the links back to ChemSpider from PubChem.

Our best estimate is that there will be about 8 million new structures finding their way to PubChem from ChemSpider.

Now, I was close to thinking that we could declare that the ChemSpider structure collection was Open Data. I’ve posted about Open Data and its definitions and challenges previously (1,2). PMR, one of the primary evangelists of Open Data, is continuing to refine its definitions. Extracting from Peter’s post:

“I reiterate some guidelines. I’m still working these out and would welcome comment. (I don’t feel we should stray too far from The Open Knowledge Foundation guidelines.) As a start I would suggest the following:

There must be some mechanism whereby the community could, if it wished, capture the resource for public archival without permission. This could be as simple as spidering the site, or a relational dump, or a massive file, or an iterator.

There must be no permission barriers to re-use including commercial re-use.

The data must either be the whole work (at a given point in time) or be clearly bounded (i.e. there should be no hidden data that the world cannot get access to in the same way).

There should be no time limits on access and re-use.”

For right now I am giving up on trying to track where Open Data might end up. Based on my previous discussions with Peter Suber regarding navigating the complexities of Open Access definitions, I understand there is a need to define our own policies. I’m not going to do that here, but what I will be clear about is that once the ChemSpider structure set is deposited in PubChem we are at the mercy of THEIR data sharing policies. I believe Peter holds up PubChem as the primary example of Open Data (but maybe not). So, I believe it should be true to say that the ChemSpider structure set IS Open Data when accessed/downloaded/shared from PubChem. I understand that it will then be the PubChem data set and all association with us will likely be lost. But that is fully acceptable!

This entry was posted on Wednesday, November 28th, 2007 at 8:46 pm and is filed under Uncategorized.

4 Responses to “The Entire ChemSpider Database is On Its Way to PubChem!”

So one way would be for Substance records to link back to the appropriate ChemSpider compound summary page. Another would be for ChemSpider compound IDs to be added to the PUBCHEM_EXT_DATASOURCE_REGID field. Or will another mechanism be used?

Regardless of how exactly linkage occurs, the end result would be that any third party could, independently of ChemSpider, reconstruct the entire ChemSpider compound database. By using the ChemSpider Web APIs, they could develop a parallel service that re-processes the ChemSpider analytical data and patent/primary literature data, possibly mashing up the data from other sources as well.

This sets the bar very high for Open data in chemistry. I’m not sure what to call it, but it’s a game-changer.

I believe PubChem updates on a fairly regular basis, but they’ve never dealt with over 17.5 M structures so I don’t know how long it will take them.

Regarding the linkage to patents: PubChem will link back to ChemSpider via the ChemSpider ID and Data Source connection. Since the SureChem patent structures are on ChemSpider, users will land on ChemSpider from PubChem and only discover the link to patents at that point. A direct link from PubChem to SureChem patents would come if SureChem chose to deposit their own structure collection directly.

Regarding “This sets the bar very high for Open data in chemistry. I’m not sure what to call it, but it’s a game-changer.” You’ve likely seen that over the past few months we’ve been challenged on Open Access and Open Data. We’re declaring neither. We’re declaring Free Access. Our Web APIs have opened up access to certain data only. If people choose to scrape other data, such as predicted data, they are violating the rights of other groups who have allowed us to use their algorithms. As long as SureChem remains free access, mashups of some form might be possible but would likely take a lot of work.

Last Friday I updated my local copy of the PubChem data (FTP download of compounds) in an Oracle database server using JChem (ChemAxon tools). My statistics (excluding the error structures detected by JChem) are reproduced below (total: 11,623,278). Each row represents one million counts. Please note that rows 13-16 have a low count of actual entries for every one million molecules expected. To my surprise, today the total number of molecules available for download at PubChem (compounds) increased to 23 million. I am in the process of updating the local database for actual counts.
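
For anyone wanting to reproduce this kind of per-million tally without an Oracle/JChem setup, here is a minimal Python sketch. It is not the workflow used above, only an illustration. It assumes that each "row" corresponds to a block of one million consecutive compound IDs (row 1 covers IDs 1 to 1,000,000, and so on) and that the IDs have already been extracted into a plain list of integers. The names `count_per_million` and `sparse_rows`, and the 50% threshold, are hypothetical choices made for this example.

```python
from collections import Counter

BIN_SIZE = 1_000_000  # assumed width of each row, in compound IDs


def count_per_million(cids, bin_size=BIN_SIZE):
    """Return {row_number: count}; row 1 covers IDs 1..bin_size, row 2 the next block, etc."""
    # (cid - 1) // bin_size maps IDs 1..bin_size to 0, so add 1 for 1-based rows.
    counts = Counter((cid - 1) // bin_size + 1 for cid in cids)
    return dict(sorted(counts.items()))


def sparse_rows(counts, bin_size=BIN_SIZE, threshold=0.5):
    """Return row numbers holding fewer than threshold * bin_size entries."""
    return [row for row, n in counts.items() if n < threshold * bin_size]


if __name__ == "__main__":
    # Tiny made-up ID list, purely to show the call pattern (not real PubChem data).
    sample_ids = [1, 2, 1_000_000, 1_000_001, 2_500_000]
    per_row = count_per_million(sample_ids)
    for row, n in per_row.items():
        print(f"row {row}: {n} entries")
    print("sparse rows:", sparse_rows(per_row))
```

Rows that come out well below one million entries would correspond to the sparsely populated ID ranges mentioned above.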