We have some existing documentation ([http://biopython.org/DIST/docs/biosql/python_biosql_basic.html HTML], [http://biopython.org/DIST/docs/biosql/python_biosql_basic.pdf PDF]) for the Biopython interfaces to BioSQL, covering installing Python database adaptors and basic usage of BioSQL. This is a little old, and I am hoping to use this wiki page to update the above documentation in future.

We have some existing documentation ([http://biopython.org/DIST/docs/biosql/python_biosql_basic.html HTML], [http://biopython.org/DIST/docs/biosql/python_biosql_basic.pdf PDF]) for the Biopython interfaces to BioSQL, covering installing Python database adaptors and basic usage of BioSQL. This is a little old, and I am hoping to use this wiki page to update the above documentation in future.

−

NOTE - At the time of writing, there are a few problems with BioSQL and Biopython 1.44 which are being fixed in CVS.

+

The following text applies to Biopython 1.45 or later (and won't work with Biopython 1.44).

= Installation =

= Installation =

Line 29:

Line 29:

Once the software is installed, your next task is to setup a database and import the BioSQL schema (i.e. setup the relevant tables within the database). See [http://www.biosql.org/wiki/Downloads BioSQL downloads] -- you'll need to unzip the archive.

Once the software is installed, your next task is to setup a database and import the BioSQL schema (i.e. setup the relevant tables within the database). See [http://www.biosql.org/wiki/Downloads BioSQL downloads] -- you'll need to unzip the archive.

−

Alternatively to get the very latest BioSQL, check out their SVN repository.

+

Alternatively to get the very latest BioSQL, check out their git repository.

−

Or, navigate to the relevant schema file for your database and download just that, e.g. [http://code.open-bio.org/svnweb/index.cgi/biosql/view/biosql-schema/trunk/sql/biosqldb-mysql.sql biosqldb-mysql.sql] for MySQL. You will also want the NCBI Taxonomy loading perl script, [http://code.open-bio.org/svnweb/index.cgi/biosql/view/biosql-schema/trunk/scripts/load_ncbi_taxonomy.pl load_ncbi_taxonomy.pl].

+

Or, navigate to the relevant schema file for your database and download just that, e.g. [https://raw.github.com/biosql/biosql/master/sql/biosqldb-mysql.sql biosqldb-mysql.sql] for MySQL. You will also want the NCBI Taxonomy loading perl script, [https://raw.github.com/biosql/biosql/master/scripts/load_ncbi_taxonomy.pl load_ncbi_taxonomy.pl].

== Creating the empty database ==

== Creating the empty database ==

−

Assuming you are using MySQL, the following command line should create a new database on your own computer called ''bioseqdb'', belonging to the ''root'' user account:

+

=== MySQL ===

+

+

The following command line should create a new database on your own computer called ''bioseqdb'', belonging to the ''root'' user account:

mysqladmin -u root create bioseqdb

mysqladmin -u root create bioseqdb

Line 86:

Line 88:

This should return no rows as the table is empty.

This should return no rows as the table is empty.

+

+

=== PostgreSQL ===

+

+

+

'''IMPORTANT NOTE FOR POSTRESQL USERS''': Before loading the biosqldb-pg.sql schema into Postgres you must delete the two RULES named '''rule_bioentry_i1''' and '''rule_bioentry_i2'''; lines 771-791 in biosqldb-pg.sql BioSQL version 1.0.1

+

+

First you need to set up user permissions, if you are not sure how to do this, try:

+

+

su - postgres

+

createuser <your user name>

+

+

Then, assuming you are logged-in as <your user name> and Postgres is running on the local machine, you should be able to do the following:

+

+

createdb biosqldb

+

psql biosqldb < biosqldb-pg.sql

+

+

Run ''psql'' and type enter ''\d <ENTER>'' to see all the entities created.

== NCBI Taxonomy ==

== NCBI Taxonomy ==

−

Before you start trying to load sequences into the database, it is a good idea to load the NCBI taxonomy database using the scripts/load_ncbi_taxonomy.pl script in the BioSQL package.

+

The BioSQL package includes a perl script under scripts/load_ncbi_taxonomy.pl to download and update the taxonomy tables.

+

The script should be able to download the files it needs from the [ftp://ftp.ncbi.nih.gov/pub/taxonomy/ NCBI taxonomy FTP site] automatically.

−

The script should be able to download the files it needs from the [ftp://ftp.ncbi.nih.gov/pub/taxonomy/ NCBI taxonomy FTP site] automatically. Change to the scripts subdirectory from the unzipped BioSQL download, then:

+

Prior to Biopython 1.49, if you wanted to work with the NCBI taxonomy database it was good idea to pre-load the NCBI taxonomy before you start trying to load sequences into the database. This isn't so important with Biopython 1.49 onwards, where you can instead opt to have the information needed downloaded as needed from Entrez.

+

+

To update the NCBI taxonomy, change to the scripts subdirectory from the unzipped BioSQL download, then:

There is about 10MB to fetch, so it can take a little while (and doesn't give any feedback while this happens). If you are worried, open a file browser window and check to see it is downloading a file called taxdump.tar.gz to the taxdata subdirectory.

There is about 10MB to fetch, so it can take a little while (and doesn't give any feedback while this happens). If you are worried, open a file browser window and check to see it is downloading a file called taxdump.tar.gz to the taxdata subdirectory.

Line 113:

Line 137:

This might be a good point for a tea break - I didn't time this but it was over ten minutes.

This might be a good point for a tea break - I didn't time this but it was over ten minutes.

−

== Running the unit tests ==

+

One the initial tables have been populated, re-running the script is much faster. You can run this script again to update the taxonomy tables, which the NCBI do add to regularly. You may want to setup a scheduled job to do this automatically (say once a fortnight).

−

Because there are so many ways you could have setup your BioSQL database, you my have to tell the unit test a few bits of information. If you have followed these instructions, then the unit test should just work (using CVS, what will be the next release after Biopython 1.44).

+

P.S. It is a particularly good idea to do update the taxonomy if you will be working with the left/right values in the taxon table (see also [http://bugzilla.open-bio.org/show_bug.cgi?id=2493 BioSQL enhancement request 2493]). Biopython ignores these optional fields when loading or retrieving sequences - instead using just the parent link. See http://www.oreillynet.com/pub/a/network/2002/11/27/bioconf.html for more about how this alternative tree representation works.

+

+

== Running the unit tests ==

−

If you have done things differently (e.g. PostgreSQL instead of MySQL, or using a different database username and password), then you need to edit the script Tests/setup_BioSQL.py and fill in the following fields:

+

Because there are so many ways you could have setup your BioSQL database, you have to tell the unit test a few bits of information by editing the file Tests/setup_BioSQL.py and filling in the following fields:

<python>

<python>

Line 136:

Line 162:

python run_tests.py test_BioSQL test_BioSQL_SeqIO

python run_tests.py test_BioSQL test_BioSQL_SeqIO

+

+

For PostgreSQL, use:

+

+

<python>

+

DBDRIVER = 'psycopg2'

+

DBTYPE = 'pg'

+

</python>

= Creating a (sub) database =

= Creating a (sub) database =

−

BioSQL lets us define named "sub" databases within the single SQL database (which we called ''bioseqdb'' earlier). For this example, lets create a one for some orchid sequences:

+

BioSQL lets us define named "sub" databases or "namespaces" within the single SQL database (which we called ''bioseqdb'' earlier). For this example, lets create a one for some orchid sequences:

<python>

<python>

Line 146:

Line 179:

passwd = "", host = "localhost", db="bioseqdb")

passwd = "", host = "localhost", db="bioseqdb")

db = server.new_database("orchids", description="Just for testing")

db = server.new_database("orchids", description="Just for testing")

−

server.adaptor.commit()

+

server.commit() #On Biopython 1.49 or older, server.adaptor.commit()

</python>

</python>

−

The call ''server.adaptor.commit()'' tells the database to save the changes so far (commit the SQL transaction). It is up to you to decide when to commit the SQL transaction(s), and/or rollback changes, rather than having Biopython try and decide for you and risk getting it wrong. See ''Explicit is better than implicit'' ([http://www.python.org/dev/peps/pep-0020/ The Zen of Python]) and [http://bugzilla.open-bio.org/show_bug.cgi?id=2395 bug 2395].

+

(If you are using PostgreSQL rather than MySQL, just change the driver argument to "psycopg2" instead. The same applies to the other examples in this document)

+

+

The ''commit'' call tells the database to save the changes so far (commit the SQL transaction). It is up to you to decide when to commit the SQL transaction(s), and/or rollback changes, rather than having Biopython try and decide for you and risk getting it wrong. See ''Explicit is better than implicit'' ([http://www.python.org/dev/peps/pep-0020/ The Zen of Python]) and [http://bugzilla.open-bio.org/show_bug.cgi?id=2395 bug 2395].

There should now be a single row in the ''biodatabase'' table for

There should now be a single row in the ''biodatabase'' table for

−

our new orchid database. You can check this at the command line:

+

our new orchid namespace. You can check this at the command line:

+

+

MySQL:

mysql --user=root bioseqdb -e "select * from biodatabase;"

mysql --user=root bioseqdb -e "select * from biodatabase;"

+

+

PostgreSQL:

+

+

psql -c "SELECT * FROM biodatabase;" bioseqdb

Which should give something like this (assuming you haven't done any other testing yet):

Which should give something like this (assuming you haven't done any other testing yet):

Line 164:

Line 205:

+----------------+---------+-----------+------------------+

+----------------+---------+-----------+------------------+

−

Now that we have setup an ''orchids'' database within our ''biosqldb'' MySQL database, lets add some sequences to it.

+

Now that we have setup an ''orchids'' namespace within our ''biosqldb'' MySQL database, lets add some sequences to it.

Again, you must explicitly call ''server.adaptor.commit()'' to record the SQL transaction which is otherwise left pending.

+

Again, you must explicitly call ''commit'' to record the SQL transaction which is otherwise left pending.

The ''db.load()'' function should have returned the number of records loaded (three in this example), and again have a look in the database and you should see new rows in several tables.

The ''db.load()'' function should have returned the number of records loaded (three in this example), and again have a look in the database and you should see new rows in several tables.

Line 221:

Line 263:

= Extracting Sequences from the database =

= Extracting Sequences from the database =

−

This continues from the previous example, where we loaded three records into an ''orchids'' database:

+

This continues from the previous example, where we loaded three records into an ''orchids'' database (namespace):

<python>

<python>

Line 244:

Line 286:

The objects you get back from BioSQL act like a [[SeqRecord]] object with a [[Seq]] object as the sequence, but they are not exactly the same. You actually get their BioSQL database equivalent, a '''DBSeqRecord''' object with a '''DBSeq''' object for the sequence. These will only load the sequence and annotation from the database on demand.

The objects you get back from BioSQL act like a [[SeqRecord]] object with a [[Seq]] object as the sequence, but they are not exactly the same. You actually get their BioSQL database equivalent, a '''DBSeqRecord''' object with a '''DBSeq''' object for the sequence. These will only load the sequence and annotation from the database on demand.

+

+

There are other ways to pull out records - the 'db' object here acts somewhat like a dictionary (including supporting deleting entries from their key). The python-dictionary-keys are actually the database-primary-keys used inside the database for the bioentry table. e.g.

As mentioned above, BioSQL lets us define named "sub" databases within the single SQL database (which we called ''bioseqdb''). In the previous example, we created a sub-database for some orchid sequences. The following code will delete the ''orchid'' database (and all the records in it):

+

As mentioned above, BioSQL lets us define named "sub" databases (aka namespaces) within the single SQL database (which we called ''bioseqdb''). In the previous example, we created a sub-database for some orchid sequences. The following code will delete the ''orchid'' database (and all the records in it):

<python>

<python>

Line 254:

Line 308:

passwd = "", host = "localhost", db="bioseqdb")

passwd = "", host = "localhost", db="bioseqdb")

server.remove_database("orchids")

server.remove_database("orchids")

−

server.adaptor.commit()

+

server.commit() #On Biopython 1.49 or older, server.adaptor.commit()

</python>

</python>

−

Again, you must explicitly finialise the SQL transaction with ''server.adaptor.commit()'' to make this change.

+

Again, you must explicitly finialise the SQL transaction with a commit to apply this change.

There should now be one less row in the ''biodatabase'' table, check this at the command line:

There should now be one less row in the ''biodatabase'' table, check this at the command line:

Line 267:

Line 321:

= How is the data stored =

= How is the data stored =

−

If you need or want to go direct to the data, bypassing the Biopython methods to retrieve records, then it would help to know more about how the underlying tables are used. For this, we refer you to the BioSQL documentation, starting with their page on [http://www.biosql.org/wiki/Annotation_Mapping Annotation Mapping].

+

If you need or want to go direct to the data, bypassing the Biopython methods to retrieve records, then it would help to know more about how the underlying tables are used. For this, we refer you to the BioSQL documentation, starting with their [http://www.biosql.org/wiki/Schema_Overview Schema Overview] and the page on [http://www.biosql.org/wiki/Annotation_Mapping Annotation Mapping].

= MySQL Tips and Tricks =

= MySQL Tips and Tricks =

Line 280:

Line 334:

Use at your own risk!

Use at your own risk!

+

[[Category:Wiki Documentation]]

Revision as of 21:36, 12 June 2012

BioSQL is a joint effort between the OBF projects (BioPerl, BioJava etc) to support a shared database schema for storing sequence data. In theory, you could load a GenBank file into the database with BioPerl, then using Biopython extract this from the database as a record object with featues - and get more or less the same thing as if you had loaded the GenBank file directly as a SeqRecord using SeqIO.

We have some existing documentation (HTML, PDF) for the Biopython interfaces to BioSQL, covering installing Python database adaptors and basic usage of BioSQL. This is a little old, and I am hoping to use this wiki page to update the above documentation in future.

The following text applies to Biopython 1.45 or later (and won't work with Biopython 1.44).

Installation

This is fairly complicated - partly because there are so many options. For example, you can use a range of different SQL database packages (we'll focus on MySQL), you can have the database on your own computer (the assumption here) or on a separate server, and of course there are usernames and passwords associated with database. And finally the details will also vary depending on your operating system.

Installing Required Software

You will need to install some database software plus the associated python library so that Biopython can "talk" to the database. In this example we'll talk about the most common choice, MySQL. How you do this will also depend on your operating system, for example on a Debian or Ubuntu Linux machine try this:

sudo apt-get install mysql-common mysql-server python-mysqldb

It will also be important to have perl (to run some of the setup scripts). Again, on a Debian or Ubuntu Linux machine try this:

Downloading the BioSQL Schema & Scripts

Once the software is installed, your next task is to setup a database and import the BioSQL schema (i.e. setup the relevant tables within the database). See BioSQL downloads -- you'll need to unzip the archive.

Alternatively to get the very latest BioSQL, check out their git repository.
Or, navigate to the relevant schema file for your database and download just that, e.g. biosqldb-mysql.sql for MySQL. You will also want the NCBI Taxonomy loading perl script, load_ncbi_taxonomy.pl.

Creating the empty database

MySQL

The following command line should create a new database on your own computer called bioseqdb, belonging to the root user account:

mysqladmin -u root create bioseqdb

We can then tell MySQL to load the BioSQL scheme we downloaded above. Change to the scripts subdirectory from the unzipped BioSQL download, then:

mysql -u root bioseqdb < biosqldb-mysql.sql

You can have a quick play using the mysql command line tool, for example:

PostgreSQL

IMPORTANT NOTE FOR POSTRESQL USERS: Before loading the biosqldb-pg.sql schema into Postgres you must delete the two RULES named rule_bioentry_i1 and rule_bioentry_i2; lines 771-791 in biosqldb-pg.sql BioSQL version 1.0.1

First you need to set up user permissions, if you are not sure how to do this, try:

su - postgres
createuser <your user name>

Then, assuming you are logged-in as <your user name> and Postgres is running on the local machine, you should be able to do the following:

createdb biosqldb
psql biosqldb < biosqldb-pg.sql

Run psql and type enter \d <ENTER> to see all the entities created.

NCBI Taxonomy

The BioSQL package includes a perl script under scripts/load_ncbi_taxonomy.pl to download and update the taxonomy tables.
The script should be able to download the files it needs from the NCBI taxonomy FTP site automatically.

Prior to Biopython 1.49, if you wanted to work with the NCBI taxonomy database it was good idea to pre-load the NCBI taxonomy before you start trying to load sequences into the database. This isn't so important with Biopython 1.49 onwards, where you can instead opt to have the information needed downloaded as needed from Entrez.

To update the NCBI taxonomy, change to the scripts subdirectory from the unzipped BioSQL download, then:

There is about 10MB to fetch, so it can take a little while (and doesn't give any feedback while this happens). If you are worried, open a file browser window and check to see it is downloading a file called taxdump.tar.gz to the taxdata subdirectory.

You should see this output at the command prompt - be warned that some of these steps do take a while (especially rebuilding nested set left/right values):

This might be a good point for a tea break - I didn't time this but it was over ten minutes.

One the initial tables have been populated, re-running the script is much faster. You can run this script again to update the taxonomy tables, which the NCBI do add to regularly. You may want to setup a scheduled job to do this automatically (say once a fortnight).

Running the unit tests

Because there are so many ways you could have setup your BioSQL database, you have to tell the unit test a few bits of information by editing the file Tests/setup_BioSQL.py and filling in the following fields:

(If you are using PostgreSQL rather than MySQL, just change the driver argument to "psycopg2" instead. The same applies to the other examples in this document)

The commit call tells the database to save the changes so far (commit the SQL transaction). It is up to you to decide when to commit the SQL transaction(s), and/or rollback changes, rather than having Biopython try and decide for you and risk getting it wrong. See Explicit is better than implicit (The Zen of Python) and bug 2395.

There should now be a single row in the biodatabase table for
our new orchid namespace. You can check this at the command line:

MySQL:

mysql --user=root bioseqdb -e "select * from biodatabase;"

PostgreSQL:

psql -c "SELECT * FROM biodatabase;" bioseqdb

Which should give something like this (assuming you haven't done any other testing yet):

Now that we have setup an orchids namespace within our biosqldb MySQL database, lets add some sequences to it.

Loading Sequences into a database

When loading sequences into a BioSQL database with Biopython we have to provide annotated SeqRecord objects. This gives us another excuse to use the SeqIO module! A quick recap on reading in sequences as SeqRecords, based on one of the orchid examples in the Biopython Tutorial:

The objects you get back from BioSQL act like a SeqRecord object with a Seq object as the sequence, but they are not exactly the same. You actually get their BioSQL database equivalent, a DBSeqRecord object with a DBSeq object for the sequence. These will only load the sequence and annotation from the database on demand.

There are other ways to pull out records - the 'db' object here acts somewhat like a dictionary (including supporting deleting entries from their key). The python-dictionary-keys are actually the database-primary-keys used inside the database for the bioentry table. e.g.

Deleting a (sub) database

As mentioned above, BioSQL lets us define named "sub" databases (aka namespaces) within the single SQL database (which we called bioseqdb). In the previous example, we created a sub-database for some orchid sequences. The following code will delete the orchid database (and all the records in it):

Again, you must explicitly finialise the SQL transaction with a commit to apply this change.

There should now be one less row in the biodatabase table, check this at the command line:

mysql --user=root bioseqdb -e "select * from biodatabase;"

You can also check that the three orchid sequences have gone from the other tables.

How is the data stored

If you need or want to go direct to the data, bypassing the Biopython methods to retrieve records, then it would help to know more about how the underlying tables are used. For this, we refer you to the BioSQL documentation, starting with their Schema Overview and the page on Annotation Mapping.

MySQL Tips and Tricks

If you are getting timeout errors, check to see if your SQL server has any orphaned threads.