Welcome

The Blue Obelisk Exchange is the place to ask about the use and development of Open Data, Open Source, and Open Standards: how to perform tasks and solve chemical problems with these, or whether an ODOSOS tool is available for some task. Or even to ask if someone can provide such a tool. Questions do not have to be about Blue Obelisk solutions themselves; they can be about any ODOSOS chemistry tool, service, or database.

4 answers

The Journal of Chemical Education Digital Library has two listings for collections of pKa data: an extensive set of aqueous pKa data compiled by R. Williams, available as a PDF, and a link to pKa values in DMSO measured by F. Bordwell.

Hej Tony, I have started with the compilation by R. Williams, which has many references. The QSAR literature is largely useless, not only because of the non-Open Data issues but because the datasets are compilations without source annotation.

I agree the Williams pKa collection is nice, but it's a PDF, the semantics destroyer (well, it's actually not PDF's fault), with no molecule coordinates, and it's rather a handout for the organic chemistry lab. It is probably the most widely known and available solution.

A large collection can be found in commercial DBs (ACDLabs, MDL MDDR, CMC3D, and others).

A large freely available pKa database can be found in the EPA EPI Suite's PHYSPROP distribution. SMILES or CAS numbers have to be entered manually.
* PHYSPROP DB (free; it covers not only logP but also pKa)
http://esc.syrres.com/interkow/physdemo.htm
http://www.epa.gov/opptintr/exposure/docs/episuite.htm

The largest publicly available dataset is the freely available ChEMBL DB by EMBL-EBI. It is a database of ca. 500,000 bioactive compounds and ADMET values. It is available online with a query tool, and also offline as SDF, MySQL, and Oracle datasets.

The free DB contains around 4600 curated pKa values with full annotation: solvents, SMILES and SDF structure codes, and literature references. You can download it from ftp.ebi.ac.uk/pub/databases/chembl/ or export only the pKa values to an Excel file. This is done with the query function by selecting

The annotation set is very important. The pKa values in the database cannot be taken "as is" to create QSPR models. There are 360 different (case-insensitive) annotations, and it is up to the modeler to cluster them. See PS1. If any new model claims to be built on 4179 pKa values from the EMBL-EBI DB, check whether the authors considered solvents (EtOH, DMF, water), temperatures, and so on. Furthermore, values are sometimes reported for the same compound at different temperatures.
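As a rough illustration of the clustering step, here is a minimal Python sketch. The record layout (compound id, annotation, solvent, temperature, pKa) and all values are invented for the example and are not taken from the ChEMBL schema. It merges annotations that differ only in case or whitespace, then keeps only measurements made in one solvent near one temperature.

```python
from collections import defaultdict

# Hypothetical records: (compound_id, annotation, solvent, temperature_C, pKa).
# Layout and values are made up for illustration only.
records = [
    ("CPD1", "pKa (dissociation constant) of the compound", "water", 25.0, 4.76),
    ("CPD1", "PKA (Dissociation Constant)  of the compound", "water", 37.0, 4.70),
    ("CPD2", "pKa of the compound in DMF", "DMF", 25.0, 10.2),
    ("CPD3", "pKa (dissociation constant) of the compound", "EtOH", 25.0, 7.1),
]


def normalize(annotation):
    """Lower-case and collapse whitespace so trivially different labels merge."""
    return " ".join(annotation.lower().split())


def group_by_annotation(rows):
    """Map each normalized annotation to the list of rows that carry it."""
    groups = defaultdict(list)
    for row in rows:
        groups[normalize(row[1])].append(row)
    return dict(groups)


def select(rows, solvent, temperature_c, tolerance=0.5):
    """Keep rows measured in the given solvent within +/- tolerance degrees C."""
    return [
        row for row in rows
        if row[2].lower() == solvent.lower()
        and abs(row[3] - temperature_c) <= tolerance
    ]


groups = group_by_annotation(records)
for label, rows in sorted(groups.items()):
    print(len(rows), label)

aqueous_25 = select(records, "water", 25.0)
```

This only removes trivial spelling differences. Deciding which of the remaining annotation groups describe comparable experiments still needs a human reading each label.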

The database was obtained with a 9.3 million dollar gift from the Wellcome Trust to EMBL. The StARlite DB was obtained from Galapagos for 2.8 million dollars and is now curated and maintained at EBI. The DB is licensed under CC-BY-SA, meaning

"Share Alike — If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license." That is a "hard" license. I guess there was a lot of discussion about the license form (why not CC-BY, which allows commercial and non-commercial use, for the greater good, for science, for the world).

The post here itself is licensed under CC0 (CC-0) creativecommons.org/choose/zero
So I hope this is considered a similar or compatible license to CC-BY-SA. ("Creative Commons has not approved any licenses for compatibility")

The database includes other ADMET values (IC50, logP, solubility, GI50, logD), so it can be used for QSAR and QSPR model building and validation. See PS2 below.

Cheers
Tobias

PS1: That means the DB values cannot be taken "as is" to create QSPR models.
The pKa database table contains the following annotations:
pKa(dissociation constant) of the compound

pKa (association constant) value of the compound at a physiological pH of 7.4

With all respect to the amount of information, I think it's a mess. Is there ANY ontology mapping between those readouts? Which can be combined, and which cannot? Do I read this correctly, in that every new line is a different type of experiment?

I’m interested to know whether anyone was able to grab the data from ChEMBL that Tobias was discussing. A poke around doesn’t turn up the data in the way outlined. Has anyone ever figured out how to assemble the data? It would be great to get hold of a copy of the SDF.