Boot an m1.medium instance on Jetstream and connect to your shell prompt.

Make sure you are starting in your home directory:

cd ~/

and let’s make a new subdirectory to work in:

mkdir -p ~/blast
cd ~/blast

Creating a subdirectory will allow us to keep our home directory tidy
and help keep us organized. Staying organized will make it easier to
locate important files and prevent us from being overwhelmed. As you will
find, we will create and use many files.

BLAST is the Basic Local Alignment Search Tool.
It uses an index to rapidly search large sequence databases;
it starts by finding small matches between the two sequences and extending those matches.
For more information on how BLAST works and the different BLAST functionality,
check out the summary on Wikipedia or
the NCBI’s list of BLAST resources.

BLAST can be helpful for identifying the source of a sequence,
or finding a similar sequence in another organism.
In this lesson, we will use BLAST to find zebrafish proteins that
are similar to a small set of mouse proteins.

The three files you just downloaded are the last three on the list - the
.faa.gz files.

All three of the files are FASTA protein files (that’s what the .faa
suggests) that are compressed with gzip (that’s what the .gz means).
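
As a small illustration of how gzip-compressed files behave, here is a sketch that uses a throwaway file. The name demo.faa and its contents are made up for this example and are not part of the downloaded data. It shows that gzip replaces the original file with a .gz version, that you can peek inside without uncompressing on disk, and that gunzip restores the original.

```shell
# Create a tiny throwaway FASTA file (demo.faa is a hypothetical name).
printf '>seq1\nMKTAYIAKQR\n' > demo.faa

# Compress it: demo.faa is replaced by demo.faa.gz (-f overwrites an old copy).
gzip -f demo.faa

# Peek at the compressed contents without writing an uncompressed file.
first_line=$(gzip -cd demo.faa.gz | head -n 1)
echo "$first_line"

# Uncompress: demo.faa comes back and demo.faa.gz is removed.
gunzip demo.faa.gz
```

The same gzip -cd ... | head pattern should work on any of the .faa.gz files if you want a quick look before uncompressing them.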

Uncompress them:

gunzip *.faa.gz

and let’s look at the first few sequences in the file:

head mouse.1.protein.faa

These are protein sequences in FASTA format. FASTA format is something
many of you have probably seen in one form or another – it’s pretty
ubiquitous. It’s a text file, containing records; each record
starts with a line beginning with a ‘>’, and then contains one or more
lines of sequence text.
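
To make that record structure concrete, here is a minimal sketch that writes a tiny FASTA file and counts its records. The file name example.faa and both sequences are invented for illustration. Because every record starts with a ‘>’ line, counting those lines counts the records.

```shell
# Write a small, made-up FASTA file with two records (hypothetical data).
cat > example.faa <<'EOF'
>seq1 hypothetical protein A
MKTAYIAKQR
QISFVKSHFS
>seq2 hypothetical protein B
MSDNELLKQA
EOF

# Each record begins with a header line starting with '>',
# so the number of such lines is the number of records.
record_count=$(grep -c '^>' example.faa)
echo "records: $record_count"
```

The same grep -c '^>' command can be pointed at mouse.1.protein.faa to see how many protein records it holds.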

Let’s take those first two sequences and save them to a file. We’ll
do this using output redirection with ‘>’, which says “take
all the output and put it into this file here.”

head -n 11 mouse.1.protein.faa > mm-first.faa

So now, for example, you can do cat mm-first.faa to see the contents of
that file (or less mm-first.faa). TIP: if you try less mm-first.faa, you will need to exit by pressing the q key on your keyboard.
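
One thing worth knowing about ‘>’ redirection is that it replaces whatever was already in the target file, while ‘>>’ adds to the end instead. The sketch below demonstrates the difference with a scratch file; redirect-demo.txt is a made-up name used only for this example.

```shell
# '>' creates the file if needed, or replaces its contents if it already exists.
echo "first" > redirect-demo.txt
echo "second" > redirect-demo.txt    # overwrites: the file now holds only "second"
after_overwrite=$(cat redirect-demo.txt)

# '>>' appends to the end of the file instead of replacing it.
echo "third" >> redirect-demo.txt
line_count=$(wc -l < redirect-demo.txt | tr -d ' ')

echo "after overwrite: $after_overwrite"
echo "lines after append: $line_count"
```

So if you re-run a command that redirects into mm-first.faa with ‘>’, the earlier contents of that file are replaced rather than added to.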

Now let’s BLAST these two sequences against the entire zebrafish
protein data set. First, we need to tell BLAST that the zebrafish
sequences are (a) a database, and (b) a protein database. That’s done
by calling ‘makeblastdb’:

makeblastdb -in zebrafish.1.protein.faa -dbtype prot

Next, we call BLAST to do the search:

blastp -query mm-first.faa -db zebrafish.1.protein.faa

This should run pretty quickly, but you’re going to get a lot of output!
To save it instead of watching it scroll past on the screen,
ask BLAST to write the output to a file that we’ll name mm-first.x.zebrafish.txt: