One the software is installed, your next task is to setup a database and import the BioSQL schema (i.e. setup the relevant tables within the database). See [http://www.biosql.org/wiki/Downloads BioSQL downloads] -- you'll need to unzip the archive.

+

Once the software is installed, your next task is to setup a database and import the BioSQL schema (i.e. setup the relevant tables within the database). See [http://www.biosql.org/wiki/Downloads BioSQL downloads] -- you'll need to unzip the archive.

Alternatively to get the very latest BioSQL, check out their SVN repository.

Alternatively to get the very latest BioSQL, check out their SVN repository.

−

Or, navigate to the relevant schema file for your database and download just that, e.g. [http://code.open-bio.org/svnweb/index.cgi/biosql/view/biosql-schema/trunk/sql/biosqldb-mysql.sql] for MySQL. You will also want the NCBI Taxonomy loading perl script, [http://code.open-bio.org/svnweb/index.cgi/biosql/view/biosql-schema/trunk/scripts/load_ncbi_taxonomy.pl].

+

Or, navigate to the relevant schema file for your database and download just that, e.g. [http://code.open-bio.org/svnweb/index.cgi/biosql/view/biosql-schema/trunk/sql/biosqldb-mysql.sql biosqldb-mysql.sql] for MySQL. You will also want the NCBI Taxonomy loading perl script, [http://code.open-bio.org/svnweb/index.cgi/biosql/view/biosql-schema/trunk/scripts/load_ncbi_taxonomy.pl load_ncbi_taxonomy.pl].

== Creating the empty database ==

== Creating the empty database ==

Revision as of 17:37, 11 March 2008

BioSQL is a joint effort between the OBF projects (BioPerl, BioJava etc) to support a shared database schema for storing sequence data. In theory, you could load a GenBank file into the database with BioPerl, then using Biopython extract this from the database as a record object with featues - and get more or less the same thing as if you had loaded the GenBank file directly as a SeqRecord using SeqIO.

We have some existing documentation (HTML, PDF) for the Biopython interfaces to BioSQL, covering installing Python database adaptors and basic usage of BioSQL. This is a little old, and I am hoping to use this wiki page to update the above documentation in future.

NOTE - At the time of writing, there are a few problems with BioSQL and Biopython 1.44 which are being fixed in CVS.

Installation

This is fairly complicated - partly because there are so many options. For example, you can use a range of different SQL database packages (we'll focus on MySQL), you can have the database on your own computer (the assumption here) or on a separate server, and of course there are usernames and passwords associated with database. And finally the details will also vary depending on your operating system.

Installing Required Software

You will need to install some database software plus the associated python library so that Biopython can "talk" to the database. In this example we'll talk about the most common choice, MySQL. How you do this will also depend on your operating system, for example on a Debian or Ubuntu Linux machine try this:

sudo apt-get install mysql-common mysql-server python-mysqldb

It will also be important to have perl (to run some of the setup scripts). Again, on a Debian or Ubuntu Linux machine try this:

Downloading the BioSQL Schema & Scripts

Once the software is installed, your next task is to setup a database and import the BioSQL schema (i.e. setup the relevant tables within the database). See BioSQL downloads -- you'll need to unzip the archive.

Alternatively to get the very latest BioSQL, check out their SVN repository.
Or, navigate to the relevant schema file for your database and download just that, e.g. biosqldb-mysql.sql for MySQL. You will also want the NCBI Taxonomy loading perl script, load_ncbi_taxonomy.pl.

Creating the empty database

Assuming you are using MySQL, the following command line should create a new database on your own computer called bioseqdb, belonging to the root user account:

mysqladmin -u root create bioseqdb

We can then tell MySQL to load the BioSQL scheme we downloaded above. Change to the scripts subdirectory from the unzipped BioSQL download, then:

mysql -u root bioseqdb < biosqldb-mysql.sql

You can have a quick play using the mysql command line tool, for example:

There is about 10MB to fetch, so it can take a little while (and doesn't give any feedback while this happens). If you are worried, open a file browser window and check to see it is downloading a file called taxdump.tar.gz to the taxdata subdirectory.

You should see this output at the command prompt - be warned that some of these steps do take a while (especially rebuilding nested set left/right values):

This might be a good point for a tea break - I didn't time this but it was over ten minutes.

Running the unit tests

Because there are so many ways you could have setup your BioSQL database, you my have to tell the unit test a few bits of information. If you have followed these instructions, then the unit test should just work (using CVS, what will be the next release after Biopython 1.44).

If you have done things differently (e.g. PostgreSQL instead of MySQL, or using a different database username and password), then you need to edit the script Tests/setup_BioSQL.py and fill in the following fields:

The call server.adaptor.commit() tells the database to save the changes so far (commit the SQL transaction). It is up to you to decide when to commit the SQL transaction(s), and/or rollback changes, rather than having Biopython try and decide for you and risk getting it wrong. See Explicit is better than implicit (The Zen of Python) and bug 2395.

There should now be a single row in the biodatabase table for
our new orchid database. You can check this at the command line:

mysql --user=root bioseqdb -e "select * from biodatabase;"

Which should give something like this (assuming you haven't done any other testing yet):

Now that we have setup an orchids database within our biosqldb MySQL database, lets add some sequences to it.

Loading Sequences into a database

When loading sequences into a BioSQL database with Biopython we have to provide annotated SeqRecord objects. This gives us another excuse to use the SeqIO module! A quick recap on reading in sequences as SeqRecords, based on one of the orchid examples in the Biopython Tutorial:

The objects you get back from BioSQL act like a SeqRecord object with a Seq object as the sequence, but they are not exactly the same. You actually get their BioSQL database equivalent, a DBSeqRecord object with a DBSeq object for the sequence. These will only load the sequence and annotation from the database on demand.

Deleting a (sub) database

As mentioned above, BioSQL lets us define named "sub" databases within the single SQL database (which we called bioseqdb). In the previous example, we created a sub-database for some orchid sequences. The following code will delete the orchid database (and all the records in it):

Again, you must explicitly finialise the SQL transaction with server.adaptor.commit() to make this change.

There should now be one less row in the biodatabase table, check this at the command line:

mysql --user=root bioseqdb -e "select * from biodatabase;"

You can also check that the three orchid sequences have gone from the other tables.

How is the data stored

If you need or want to go direct to the data, bypassing the Biopython methods to retrieve records, then it would help to know more about how the underlying tables are used. For this, we refer you to the BioSQL documentation, starting with their page on Annotation Mapping.

MySQL Tips and Tricks

If you are getting timeout errors, check to see if your SQL server has any orphaned threads.