This should all work on Windows, Linux and Mac OS X, although you may need to adjust path or file names accordingly.

What is BLAST?

The NCBI's Basic Local Alignment Search Tool (BLAST) is a suite of programs used to find regions of local similarity between biological sequences. The program usually compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches.

One of the tool's many variants is RPS-BLAST (reversed position specific blast). In this case you submit one or more query protein sequences for comparison against a database of Hidden Markov Models (HMM) representing many different protein domains or families. The NCBI have a web version, but you can also install the BLAST programs on your own computer.

In this article I'll go quickly go through installing standalone BLAST, building an RPS-BLAST database, and then using the rpsblast program with Biopython to search bacterial genomes for particular protein motifs.

Installing Standalone Blast

On their downloads page you can get standalone BLAST for all the major operating systems including Windows and Linux.

Once you have the program installed, you also need a database or two. For "normal" BLAST you can download blast sequence databases or make your own using the supplied formatdb program.

The NCBI also make available ready made RPS-BLAST databases for PFAM, SMART, COG, KOG and their own meta-domain database, CDD. You can download these rpsblast databases ready to use on your own machine.

Preparing a reduced RPS-BLAST database

One nice trick if you are running RPS-BLAST on your own machine is you can build a sub-database containing only selected protein domains. This has two major advantages: you need less RAM, and it will run much faster - which means even my old laptop computer can manage!

To make your own sub-database, you'll first need to download the all the raw HMM models as position specific scoring matrix (PSSM) text files in this archive cdd.tar.gz - while the archive is only 250MB, be warned that it unzips to about 2GB! If you are using Windows, then WinZip will cope with this file (eventually).

The idea is to select particular domains of interest, get their *.smp files from the archive, and then use the formatrpsdb program to build a little database the rpsblast program can use.

For example, suppose you are interested in searching for sigma factors, and had decided that you only want to look for the following PFAM domains, typically found in Sigma 70 proteins:

Next, extract just these *.smp files from the large archive (cdd.tar.gz).

Then run the formatrpsdb tool to build a database:

formatrpsdb -t Sigma.v001 -i Sigma.pn -o T -f 9.82 -n Sigma -S 100.0

The above assumed you are in the same directory as the Sigma.pn file and all the *.smp files. You may need to provide an explicit path for the formatrpsdb program, e.g. C:\Blast\bin\formatrpsdb.exe

These options are based on the examples in the README file. What this does is creates the eight files: Sigma.aux, Sigma.loo, Sigma.phr, Sigma.pin, Sigma.psd, Sigma.psi, Sigma.psq and Sigma.rps which together make up the database.

It will also create (or add to) the file formatrpsdb.log, why not open this and check what it says. For example confirm it wrote the expected number of models in the database.

Running RPS-BLAST at the command line

We're going to use standalone BLAST, but before that for an idea of what the results should look like, goto the online RPS-BLAST and enter the search term "39546362" (the GI number for rpoD, a sigma factor in Salmonella typhimurium) and pick the PFAM database. You should get strong hits to the six PFAM domains listed above.

Now we'll run the same search with standalone blast. You'll need to prepare an input file in FASTA format - prepare a text file called rpoD.faa and paste in this:

You can now run rpsblast with this input file (-i rpoD.faa) using the Sigma factor database we prepared above (-d Sigma) and a harsh expectation cut-off (-e 0.00001) with the command:

rpsblast -i rpoD.faa -d Sigma -e 0.00001

This should print lots of text on screen which should agree pretty well with what the web version told you earlier. I have assumed in this example that all the files are in the current directory - you may need to provide explicit paths depending on where you have put them, e.g.

You might want to save the BLAST output as a plain text file, for example rpoD.txt, and to do that we can use the -o option to specify the output file:

rpsblast -i rpoD.faa -d Sigma -e 0.00001 -o rpoD.txt

And let's also save the output as an XML file called rpoD.xml (using output format 7), ready for the next step where we'll read this data in with Biopython:

rpsblast -i rpoD.faa -d Sigma -e 0.00001 -m 7 -o rpoD.xml

If instead of using the Sigma mini-database, you wanted to use the full PFAM database (see download links above - you'll need quite a lot of RAM), simply change the database argument:

rpsblast -i rpoD.faa -d Pfam -e 0.00001

For this particular gene, the results should be the same (baring slight variations in the estimated expectation p-values). However, it will take a lot longer because this time rpsblast is searching thousands of domain motifs instead of just the handful in our little database.

Next let's download an entire genomes worth of protein sequences, Salmonella typhimurium - FASTA file NC_003197.faa (1MB), and search all of them for Sigma 70 domains:

rpsblast -i NC_003197.faa -d Sigma -e 0.00001 -o NC_003197.txt

or:

rpsblast -i NC_003197.faa -d Sigma -e 0.00001 -m 7 -o NC_003197.xml

This will of course take longer than just searching one gene, but as the Sigma database is very small it still takes less than a minute on my laptop.

Analyzing RPS-BLAST output with Biopython

The BLAST tools' default output is a simple plain text format designed for people to read (see the files rpoD.txt and NC_003197.txt created above as an example). While you can write a computer program to read this in, the NCBI have a habit of tweaking the file format fairly often and 'breaking' things - in fact Biopython probably won't like these files.

Instead, we asked BLAST to produce XML output (see files rpoD.xml and NC_003197.xml created above) which is intended to be read by another program (such as Biopython). With a bit of practice you can actually read XML files by hand, but its not pleasant.

Lets start with the small example rpoD.xml - you can read in this file, and print out some of its information in Biopython like this:

Running RPS-BLAST from Biopython

Its great that we can read the BLAST XML output from within Biopython, but having to type cryptic command lines to run the BLAST program isn't very nice. Instead, we can get Biopython to do this for us too!

The NCBIStandalone.rpsblast() function runs RPS-BLAST for us (using XML output by default), and captures its standard output (and any error output) as two handle objects. We can treat these like handles to files opened for reading, so we can re-use the code from before:

QUERY: gi|39546362|ref|NP_462126.2| rpoD from Salmonella typhimurium
gnl|CDD|68130 HSP, e=0.000000, from position 139 to 350
gnl|CDD|68124 HSP, e=0.000000, from position 455 to 537
gnl|CDD|68127 HSP, e=0.000000, from position 381 to 451
gnl|CDD|68129 HSP, e=0.000000, from position 549 to 602
gnl|CDD|64025 HSP, e=0.000000, from position 97 to 128
Done

You'll notice that in this example we don't ever save the BLAST's XML output as a file on disc. You might want to do this while debugging your code (e.g. use blast_out.read() to get all the data as a string, write this to a file, and then open the file). If anything fails, its often worth checking the for any error mesages printed by rpsblast using:

print error_handle.read()

So, searching all the genes in a Bactera for Sigma70 PFAM domains was quick and easy. You can of course go one step further, and download FASTA files for all the sequenced Bacteria, all.faa.tar.gz (nearly 400MB compressed), and then search them all!

Notes on versions

You will need BioPython 1.43 or later, but any recent version of python and the NCBI's standalone BLAST tools should work too.