
This is the second in a series of blog posts in which we will try to introduce you to the concepts behind the TOPSAN Protein Syntax and the TOPSAN semantic notation system. Our first entry introduced how to embed semantic web notations into TOPSAN (Part I). In this article we will describe how to obtain and use the semantic information that has been embedded in TOPSAN, compose queries, and analyze the available information.

SPARQL

The main reason to embed semantic information into TOPSAN pages is to allow for easy extraction of information from the site. Every day, all of the embedded semantic information on the site is scraped and compiled into a single file, which you can find at http://files.topsan.org/topsan.n3.gz. This ‘N3’ format file is a collection of all the semantic triples found on the TOPSAN site. Once you’ve downloaded and unzipped it, it is very easy to import it into many different semantic web ‘triple store’ systems. These systems can parse database queries written in the ‘SPARQL’ syntax. One such system is RDFLib, found at http://www.rdflib.net/. Once you have installed this library, you can use it to open and query the topsan.n3 file.

Calling the testSparql.py script with the path to the topsan.n3 file scans the file for all of the molecular weights assigned to proteins stored in TOPSAN. You should get something that looks like:

$ ./testSparql.py topsan.n3
http://topsan.org/purl/TPS11889 : 21055.88
http://topsan.org/purl/TPS11896 : 66251.10
http://topsan.org/purl/TPS11891 : 21321.14
http://topsan.org/purl/TPS11892 : 20223.71
http://topsan.org/purl/TPS11894 : 56539.67

….

Example 2: Retrieve PFAM to GO mapping

A more complicated call would be to extract all of the PFAM to GO mappings stored in TOPSAN. To do this, replace the queryStr contents from ‘testSparql.py’ with the following query:

Example 3: Finding Domains of Unknown Function with PDB structures.

This query finds all PFAM families whose names start with ‘DUF’ and that have associated PDB structures on TOPSAN. The associations are found through a ‘seeAlso’ type link to a common tag, whose title is filtered with the regular expression “^PF”.

(Note: this query involves external data sources that are not part of the ‘core’ TOPSAN extract files. We will outline how to obtain the extended database needed for these types of queries in later posts.)

At this point, the query will take several minutes and over a GB of memory. In the next article we’ll demonstrate how to set up a Joseki-based server to query Semantic TOPSAN data.