Biochemical ontologies aim to capture and represent biochemical entities, and the relations that exist between them, accurately and precisely. A fundamental starting point is the use of identifiers that precisely and uniquely identify a biochemical entity, whether it is a substance, a quality or a biological process. Yet our current approach to generating such identifiers is often haphazard and incomplete. This prevents us from accurately integrating knowledge and also leads to underspecification of our knowledge. This talk aims to initiate a discussion on plausible structure-based strategies for biochemical identity, ultimately to generate identifiers in an automatic and curator/database-independent fashion, whether at the level of the molecule or of some part thereof (e.g. residues, collections of residues, atoms, collections of atoms, functional groups). With structure-based identifiers in hand, we will be in a position to accurately capture specific biochemical knowledge, such as how a set of residues in a binding site is involved in a chemical reaction, including the fact that a key nitrogen atom must first be deprotonated. This will enhance our current representation of biochemical knowledge and make it fundamentally more useful.

6.
NO!!!! <ul><li>These are structurally different </li></ul><ul><li>Each exhibits distinct functionality! </li></ul><ul><li>Yet most databases (UniProt / GenBank) don’t have separate identifiers for them </li></ul><ul><li>Reactome has an internal identifier for referring to different forms, but it links to UniProt entries and doesn’t provide an explicit description of the structure that each identifier corresponds to! </li></ul>01/04/2009 NCBO Seminar Series::Michel Dumontier

7.
So <ul><li>We have a clear need for being able to refer to distinct biochemical entities, based at least on their structure. </li></ul><ul><li>We also need to refer to arbitrary structural parts. </li></ul><ul><li>Should we generate all the combinations a priori??? </li></ul><ul><li> NO!! </li></ul><ul><li>Should we be able to automatically generate the identifier from the structural attributes? </li></ul><ul><li>-> YES!!! </li></ul><ul><li>Should we semantically annotate (manually or otherwise) those forms known to be involved in specific processes??? </li></ul><ul><li>-> YES!!! </li></ul><ul><li>What identifiers are unique for a given structure? </li></ul>

12.
InChI for Proteins??? <ul><li>Possible... but a 1000-residue protein would contain ~15,000 atoms on average.... </li></ul><ul><ul><li>OpenBabel seemed to struggle with anything over 100 residues </li></ul></ul><ul><ul><ul><li>Maybe needs some performance tweaking? </li></ul></ul></ul><ul><ul><li>Size of the string will be enormous </li></ul></ul><ul><ul><ul><li>We can use InChIKeys (SHA-256 hash), but a hash cannot be reversed, so we then need to provide a you-submit-InChI, we-store-both and they-look-it-up service. </li></ul></ul></ul><ul><ul><li>Modularize InChI construction for (linear) polymers? </li></ul></ul><ul><ul><ul><li>Make InChI strings for each residue and concatenate them, renaming the atoms according to the residue position </li></ul></ul></ul><ul><ul><li>We still need to translate the InChI string ... </li></ul></ul>
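The "you-submit-InChI, we-store-both, they-look-it-up" service mentioned on this slide could be sketched roughly as follows. This is a minimal in-memory illustration, not the talk's implementation: the class name `StructureKeyRegistry` and its methods are hypothetical, and the key is a plain SHA-256 hex digest of the submitted string rather than a real InChIKey, which has its own block layout and encoding.

```python
import hashlib


class StructureKeyRegistry:
    """Hypothetical lookup service: stores each full structure string under a short hash key."""

    def __init__(self):
        self._by_key = {}  # hash key -> full structure string

    def submit(self, structure_string):
        """Store the string and return its key. The key is NOT a real InChIKey."""
        key = hashlib.sha256(structure_string.encode("utf-8")).hexdigest()
        stored = self._by_key.setdefault(key, structure_string)
        if stored != structure_string:
            # Two different strings with the same digest: extremely unlikely,
            # but a real service would have to handle it.
            raise ValueError("hash collision for key " + key)
        return key

    def lookup(self, key):
        """Return the full structure string for a key, or None if it was never submitted."""
        return self._by_key.get(key)


# Standard InChI for glycine, used here only as a small example input.
GLYCINE_INCHI = "InChI=1S/C2H5NO2/c3-1-2(4)5/h1,3H2,(H,4,5)"

registry = StructureKeyRegistry()
glycine_key = registry.submit(GLYCINE_INCHI)
recovered = registry.lookup(glycine_key)
```

The key has a fixed length regardless of how large the protein's structure string becomes, which is the point of hashing; the cost is that the structure can only be recovered from a key that someone has previously submitted.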

24.
Identifiers for Atoms <ul><li>Atom identifiers can be consistently retrieved from the OpenBabel model. </li></ul><ul><ul><li>Canonical numbering means we can reliably refer to a specific region rather than a (possibly degenerate) sub-graph match. </li></ul></ul><ul><ul><li>In our plugin, URI component naming was based on the assigned molecule identifier </li></ul></ul><ul><ul><ul><li>e.g. pubchemid#aN, where N is the canonical atom number </li></ul></ul></ul><ul><ul><li>Use InChIKey as base? </li></ul></ul><ul><ul><ul><li>e.g. InChIKey#aN </li></ul></ul></ul>
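The `base#aN` naming pattern from this slide can be illustrated with a short sketch. The namespace `http://example.org/molecule/`, the function names and the identifier `MOLECULE-ID` are all hypothetical placeholders, and atom numbers are assumed to be 1-based canonical numbers supplied by whatever toolkit assigned them.

```python
# Hypothetical namespace; a real deployment would choose its own base URI.
BASE_NAMESPACE = "http://example.org/molecule/"


def atom_uri(molecule_id, atom_number):
    """Build a URI for one atom, following the base#aN pattern."""
    if atom_number < 1:
        raise ValueError("atom numbers are assumed to start at 1")
    return f"{BASE_NAMESPACE}{molecule_id}#a{atom_number}"


def atom_set_uris(molecule_id, atom_numbers):
    """Build URIs for a collection of atoms, de-duplicated and in ascending order."""
    return [atom_uri(molecule_id, n) for n in sorted(set(atom_numbers))]


def parse_atom_uri(uri):
    """Split an atom URI back into (molecule_id, atom_number)."""
    base, separator, number = uri.rpartition("#a")
    if not separator or not base.startswith(BASE_NAMESPACE) or not number.isdigit():
        raise ValueError("not an atom URI: " + uri)
    return base[len(BASE_NAMESPACE):], int(number)


# "MOLECULE-ID" stands in for a PubChem identifier or an InChIKey.
single = atom_uri("MOLECULE-ID", 7)
several = atom_set_uris("MOLECULE-ID", [3, 1, 3])
round_trip = parse_atom_uri(single)
```

Because the numbering is canonical, the same atom always receives the same URI for a given base identifier, so a set of atoms (for example a functional group) can be referenced as a sorted list of such URIs.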

29.
But what if we have a modification that isn’t contained in the ontology? <ul><li>No problem... define your own term, with the corresponding structural description (InChI, SMILES), and add it to an ontology document... </li></ul><ul><ul><li>If you’re using OWL, you can add the import statement and publish it. </li></ul></ul><ul><li>And, of course, you should submit it to the appropriate ontology development teams (and later make your term equivalent to the one they adopt). </li></ul>

31.
So what if... <ul><li>we describe the structural features of the molecule with OWL (sequence + PTMs), and generate an identifier from one of its serializations (RDF/XML?) </li></ul><ul><li>that way we have the explicit description as the identifier in a form that is compatible with the semantic web. </li></ul>
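One way to picture the idea on this slide is to derive the identifier from a canonical serialization of the structural description. The sketch below is an assumption-laden stand-in: it uses sorted, compact JSON instead of an OWL or RDF/XML serialization (which would first need canonicalizing, since the same description can be written in many equivalent ways), and the function name `biopolymer_identifier` and the modification names are hypothetical.

```python
import hashlib
import json


def biopolymer_identifier(sequence, ptms):
    """Derive an identifier from a sequence plus its post-translational modifications.

    `ptms` is an iterable of (position, modification_name) pairs with 1-based positions.
    """
    normalized_ptms = []
    for position, name in ptms:
        if not 1 <= position <= len(sequence):
            raise ValueError(f"PTM position {position} is outside the sequence")
        normalized_ptms.append([int(position), str(name)])

    description = {
        "sequence": sequence.upper(),
        # Sorting makes the result independent of the order in which PTMs are listed.
        "ptms": sorted(normalized_ptms),
    }
    # Stand-in for a canonical OWL/RDF serialization: compact JSON with sorted keys.
    canonical = json.dumps(description, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Toy sequence and illustrative modification names.
unmodified = biopolymer_identifier("MKTAYS", [])
form_a = biopolymer_identifier("MKTAYS", [(6, "phosphoserine"), (1, "N-acetylmethionine")])
form_b = biopolymer_identifier("MKTAYS", [(1, "N-acetylmethionine"), (6, "phosphoserine")])
```

Here `form_a` and `form_b` describe the same modified form and therefore share an identifier, while the unmodified sequence receives a different one. Hashing is optional in this scheme: the canonical string itself could serve as the identifier if its length is acceptable, which keeps the explicit description readable.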

36.
Summary <ul><li>We need a precise method to generate identifiers for biopolymers and arbitrary sets of their parts. </li></ul><ul><li>Consistent identifier generation will allow anybody to report findings in terms of the biopolymer in which they were observed, whether or not it exists in a database, and will allow us to link biochemical knowledge at finer levels of granularity. </li></ul><ul><li>(At least) two identifier schemes were put forward to initiate discussion, with the goal of setting a standard naming convention. </li></ul>