Transcript of "Data model"

2.
Why is there a need for a new data model?
• Details of the proposed new data model
• Concept of Substance and Compound
• Standardization workflow
• The benefit of having parents (we all know this)

3.
Deposited Record – “Substance”
A unique Substance Identifier (SID) is assigned to each deposited record (a record identified by the combination of DataSource and the depositor’s internal database registry identifier).
Benefits of keeping separate, independent layers of deposited record data (“Substances”) and standardized record data (“Compounds”), i.e. the archive model:
• Depositors’ records (Substances) are preserved as they are, with no alteration
• Depositors’ records are versioned when changes occur. Only the most up-to-date version is used in links and calculations
• The same chemical may be deposited by several depositors. Each deposit gets a different SID, but all of them are linked to the same standardized compound
• Any record can be accepted, even those that do not produce an InChI (e.g. plant extracts, blood samples, polymers)
• The Substance identifier (SID) is guaranteed not to change
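The archive model above can be sketched in a few lines of Python. This is only an illustration of the rules on this slide, not the ChemSpider implementation: the class names (`Registry`, `Substance`, `SubstanceVersion`) and the sequential SID counter are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class SubstanceVersion:
    """One immutable snapshot of a deposited record (hypothetical type)."""
    version: int
    payload: str                 # the depositor's record, stored unaltered
    csid: Optional[int] = None   # linked Compound; None if nothing could be standardized


@dataclass
class Substance:
    """A deposited record, keyed by (data source, depositor registry ID)."""
    sid: int                     # assigned once and never changed
    data_source: str
    registry_id: str
    versions: List[SubstanceVersion] = field(default_factory=list)

    @property
    def current(self) -> SubstanceVersion:
        # Only the latest version is used in links and calculations.
        return self.versions[-1]


class Registry:
    """Toy in-memory registry; the SID counter is an assumption."""

    def __init__(self) -> None:
        self._by_key = {}
        self._next_sid = 1

    def deposit(self, data_source, registry_id, payload, csid=None) -> Substance:
        key = (data_source, registry_id)
        sub = self._by_key.get(key)
        if sub is None:
            # First deposit for this key: allocate a new, permanent SID.
            sub = Substance(self._next_sid, data_source, registry_id)
            self._by_key[key] = sub
            self._next_sid += 1
        # Every deposit appends a version; earlier versions stay untouched.
        sub.versions.append(SubstanceVersion(len(sub.versions) + 1, payload, csid))
        return sub
```

Depositing the same (data source, registry ID) pair twice returns the same `Substance` with a second version, while two depositors submitting the same chemical get different SIDs that can carry the same `csid`. A record with `csid=None` stands in for deposits such as plant extracts that produce no InChI.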

5.
What happens when standardization rules are adjusted?
Would that affect Substance-Compound relationships? Would the SID-CSID mapping change? Yep, it is possible!!
• After an occasional total ChemSpider re-standardization we can’t guarantee that the same standardized compound (CSID) will be linked to a Substance; the mapping may change. This, however, will not in any way affect depositors’ SIDs.
• Depositors should be encouraged to use their substance identifiers (SIDs) when referring to ChemSpider
• We need to develop compound permalinks (URLs) that depositors can always use to get to their up-to-date CSIDs via SIDs. In that case, our re-standardization wouldn’t affect external references.
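A minimal sketch of the permalink idea, assuming a simple SID-to-CSID lookup that is rebuilt on re-standardization. The class name, the `permalink` helper and the `example.org` URL pattern are all hypothetical; the slides do not specify a URL scheme.

```python
class SubstanceCompoundMap:
    """Hypothetical SID -> CSID mapping that can be rebuilt at any time."""

    def __init__(self) -> None:
        self._sid_to_csid = {}

    def link(self, sid: int, csid: int) -> None:
        self._sid_to_csid[sid] = csid

    def restandardize(self, new_links: dict) -> None:
        """Replace the whole mapping; the set of SIDs must stay the same."""
        if set(new_links) != set(self._sid_to_csid):
            raise ValueError("re-standardization must not add or drop SIDs")
        self._sid_to_csid = dict(new_links)

    def resolve(self, sid: int) -> int:
        """Return the CSID currently linked to this SID."""
        return self._sid_to_csid[sid]


def permalink(sid: int, base: str = "https://example.org") -> str:
    # The link is built from the SID only, so it survives CSID changes.
    # The URL layout is an assumption made for this example.
    return f"{base}/substance/{sid}/compound"
```

Because the link contains only the SID, a depositor's external reference keeps working after re-standardization: `permalink(7)` is the same string before and after, while `resolve(7)` may return a different CSID.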

6.
What happens when a depositor revokes a Substance (SID)?
• Revoking still versions the substance record: a new version of the record is created with a “not alive” flag.
• Revoked substances are no longer indexed
• If no Substances point at the Compound any more, the Compound gets deleted. Otherwise, the data from the revoked substance is pulled off the compound
• If a revoked substance gets re-deposited, a new version is created with a “live” flag
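The revocation rules above can be illustrated with a toy in-memory store. This is a sketch under assumptions: the `Store` class, its attribute names and the tuple layout for versions are invented for the example and say nothing about how ChemSpider stores the data.

```python
from collections import defaultdict


class Store:
    """Toy store illustrating revoke and re-deposit (hypothetical names)."""

    def __init__(self) -> None:
        self.versions = defaultdict(list)   # sid -> [(version number, alive flag)]
        self.links = {}                     # sid -> csid
        self.compounds = defaultdict(set)   # csid -> SIDs of live substances

    def _add_version(self, sid: int, alive: bool) -> None:
        self.versions[sid].append((len(self.versions[sid]) + 1, alive))

    def deposit(self, sid: int, csid: int) -> None:
        # A deposit (or re-deposit) creates a new version flagged as live.
        self._add_version(sid, alive=True)
        self.links[sid] = csid
        self.compounds[csid].add(sid)

    def revoke(self, sid: int) -> None:
        csid = self.links[sid]              # KeyError for an unknown SID
        # Revoking is still versioning: add a "not alive" version.
        self._add_version(sid, alive=False)
        # Pull the revoked substance's contribution off the compound.
        self.compounds[csid].discard(sid)
        # Delete the compound once no live substance points at it.
        if not self.compounds[csid]:
            del self.compounds[csid]

    def is_indexed(self, sid: int) -> bool:
        # Only substances whose latest version is live are indexed.
        return self.versions[sid][-1][1]
```

With two substances linked to one compound, revoking the first leaves the compound in place with only the second substance attached; revoking the second removes the compound, and re-depositing it adds a third, live version and recreates the link.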

7.
FDA Substance Registration System
Version 5c, 2007
• This guide is used to standardize the entry of substances into the Food and Drug Administration (FDA) Substance Registration System (SRS)
• The primary purpose of this guide is to prevent duplicate entries of a single substance
• Conventions for drawing structures and for organizing the characteristics of substances are included
• The lack of a standardization system at the FDA gave birth to the SRS SOP, which served as guidelines for curators to draw chemicals the same way and avoid duplication in the database

8.
Standardization – is it possible to please all interested parties?
Depending on the area of specialization:
• Some folks may insist on neutralizing charges while others may feel differently
• Some may think that the canonical tautomer should always be in a specific form
• We believe that “mild” standardization supplemented with parents may be the right choice to please as many interested parties as possible

10.
Standardization – Step II (CVSP only)
Tautomer Canonicalization
• In CVSP, tautomer canonicalization is a part of standardization
• In the OpenPHACTS model, tautomer canonicalization is not part of standardization. Instead, a tautomer-unsensitive (canonicalized tautomer) parent is generated.
Why is the OpenPHACTS approach different?
• Mapping different tautomers of the same family to different standardized compounds gives better tautomer-specific annotation mapping (e.g. tautomer-specific NMR spectra, calculated properties)
• Standardized compounds representing the same tautomeric family will have the same tautomeric parent, the canonicalized tautomer

13.
For each Compound (CSID) parent generation is attempted
“Tautomerism in large databases”, Sitzmann and others, J. Comput. Aided Mol. Des. (2010)
Parent: Description
• Fragment-Unsensitive: The largest fragment is identified and set as the fragment parent. The parent is set to the biggest organic fragment.
• Charge-Unsensitive: An attempt is made to neutralize ionized acids and bases. Envisioned as an ongoing improvement as new cases appear.
• Isotope-Unsensitive: Isotopes are replaced by the common weight.
• Stereo-Unsensitive: Stereo is stripped.
• Tautomer-Unsensitive: Tautomer canonicalization attempts to generate a “reasonable” tautomer.
• Super-Unsensitive: This parent is all of the above.
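How the parents compose can be shown on a toy record instead of a real chemical structure. Everything here is an assumption made for illustration: the `Fragment` and `Compound` types, the order in which the steps are applied, and the crude "set charge to 0" rule. Real charge neutralization and tautomer canonicalization need a cheminformatics toolkit, so the tautomer step is left as a pluggable function that defaults to doing nothing.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Fragment:
    label: str                      # toy name standing in for a structure
    heavy_atoms: int
    organic: bool = True
    charge: int = 0
    isotopes: Tuple[str, ...] = ()
    stereo: str = ""


@dataclass(frozen=True)
class Compound:
    fragments: Tuple[Fragment, ...]


def _update_all(c: Compound, **changes) -> Compound:
    """Return a copy of c with the given fields changed on every fragment."""
    return Compound(tuple(replace(f, **changes) for f in c.fragments))


def fragment_parent(c: Compound) -> Compound:
    # Prefer organic fragments; fall back to all fragments if none is organic.
    candidates = [f for f in c.fragments if f.organic] or list(c.fragments)
    return Compound((max(candidates, key=lambda f: f.heavy_atoms),))


def charge_parent(c: Compound) -> Compound:
    # Toy neutralization: a real implementation adjusts protons per rule.
    return _update_all(c, charge=0)


def isotope_parent(c: Compound) -> Compound:
    return _update_all(c, isotopes=())


def stereo_parent(c: Compound) -> Compound:
    return _update_all(c, stereo="")


def super_parent(c: Compound, tautomer_parent=lambda x: x) -> Compound:
    # Application order is an assumption, not taken from the slides.
    for step in (fragment_parent, charge_parent, isotope_parent,
                 stereo_parent, tautomer_parent):
        c = step(c)
    return c
```

Each single parent removes one kind of detail and leaves the rest intact, so a labelled, charged, chiral salt keeps its isotope label under `stereo_parent` but loses counter-ion, charge, label and stereo together under `super_parent`.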