Multiple sequence alignment exercises

Aligning sequences by hand

To give you practice spotting patterns in protein sequences by eye, try loading this set of short peptide sequences into a text editor (type "kate" at a command prompt to start one).

By adding gap characters ("-") at appropriate places within the sequences, try to find a good alignment - you may find it helpful to consult the chart of amino acid properties we have provided.

Once you have your favourite alignment of the sequences, convert them to FASTA format: add an extra row above each sequence that begins with a ">" followed by a name for the sequence (e.g. ">seq1"). Note that all sequence names must be different from each other. Then load them into CLUSTALX to see how it judges conservation by assigning colour to the residues.
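If you would rather not add the header rows by hand, the conversion can be scripted. The following is a minimal sketch only: the function name `to_fasta`, the "seq" prefix and the three example peptides are made up for illustration and are not the course data set.

```python
def to_fasta(sequences, prefix="seq"):
    """Return FASTA text for a list of (possibly gapped) sequences."""
    records = []
    for index, sequence in enumerate(sequences, start=1):
        # Numbering the names guarantees that every name is different.
        records.append(f">{prefix}{index}\n{sequence.strip()}")
    return "\n".join(records) + "\n"


# Hypothetical hand-aligned peptides (assumption: one sequence per line).
aligned = ["MKV-LAG", "MRVALAG", "MKI-LSG"]
fasta_text = to_fasta(aligned)
print(fasta_text)
```

Saving the printed text to a file gives something that can be loaded into an alignment viewer in the same way as a FASTA file written by hand.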

Remove all gaps from the alignment in CLUSTALX and realign the sequences using CLUSTALX - does it come up with the same alignment you did by hand?
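Gap removal can also be done outside CLUSTALX. The sketch below assumes the alignment is stored as FASTA text and that gaps are written as "-" or "."; the helper names `parse_fasta` and `remove_gaps` are illustrative and do not come from any particular package.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def remove_gaps(records, gap_chars="-."):
    """Strip gap characters from every sequence, keeping the names."""
    return [
        (name, "".join(c for c in sequence if c not in gap_chars))
        for name, sequence in records
    ]


# Hypothetical aligned input, not the course data set.
example = ">seq1\nMKV-LAG\n>seq2\nMRVALAG\n>seq3\nMKI-LSG\n"
ungapped = remove_gaps(parse_fasta(example))
for name, sequence in ungapped:
    print(f">{name}\n{sequence}")
```

The ungapped sequences can then be handed to whichever alignment program you want to test.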

(To get the sequence of this, or other domains, try querying your sequence against SMART and following the links)

Use the sequence as a query with BLAST, either at NCBI or EBI, against the SwissProt database.

Collect sequences in FASTA format (do not use all of them - 20 or 30 should be plenty) and align them, either using the MUSCLE or MAFFT servers at EBI or by loading the sequences into ClustalX and aligning them locally.

During the presentation, you saw that different secondary structure elements have different characteristics in alignments of this kind.

With this in mind, examine the alignment you have generated, and find regions where you are reasonably confident you can predict the secondary structure.
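One rough numerical aid when scanning an alignment is the fraction of gaps in each column, since insertions and deletions tend to accumulate in loops rather than inside helices or strands. This is a general heuristic and an assumption here, not necessarily the criterion used in the presentation. The function name and the example alignment below are hypothetical.

```python
def gap_fraction_per_column(aligned_sequences, gap="-"):
    """Return, for each alignment column, the fraction of sequences with a gap."""
    length = len(aligned_sequences[0])
    if any(len(sequence) != length for sequence in aligned_sequences):
        raise ValueError("all aligned sequences must have the same length")
    count = len(aligned_sequences)
    return [
        sum(sequence[i] == gap for sequence in aligned_sequences) / count
        for i in range(length)
    ]


# Hypothetical alignment, not the course data set.
alignment = ["MKV-LAG", "MRVALAG", "MKI-LSG"]
fractions = gap_fraction_per_column(alignment)
for position, fraction in enumerate(fractions, start=1):
    print(position, round(fraction, 2))
```

Columns with a high gap fraction are candidates for loop regions; long gap-free stretches are where a helix or strand guess is more defensible, although they are no proof of either.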

Then look up the PDB entry for this domain at the Macromolecular Structure Database. You can get an overview there of the secondary structure assigned to the sequence by following the "Sequence" link on the left side of the page.

You can also compare the results of your "guessing", along with the secondary structures described in PDB, with the output of secondary structure prediction software. Try submitting the IRPL1_HUMAN sequence to JPRED, a secondary-structure prediction server available online.

If you have time after trying the next exercises, come back to this section and try the same thing on the following sequences

What kinds of sequences make it easier (or more difficult) to come up with a reasonable guess at the secondary structure?

Hand-editing alignments

Assuming you have already obtained an automatically calculated MSA that contains a set of sequences appropriate for your analysis, the next step is to examine the alignment in detail, to identify regions that may be misaligned, and to alter (and hence hopefully correct) these regions.

Load the alignment into CLUSTALX and examine the alignment for regions that may have been misaligned.

Load the alignment into SEAVIEW (keep it open in CLUSTALX), save it under a different name (e.g. add "ed" just before the ".aln" section of the name), and edit the alignment to correct regions you think are misaligned.

Repeat this process (editing alignment in SEAVIEW, saving, reloading into CLUSTALX) until you feel you have prepared the best possible alignment of your sequences.

Compare your version of the alignment with this hand-edited alignment of KH-domain sequences. How similar are your alignment and this one? Which one do you think is better? (Please feel free to ask the demonstrators to help you judge this).

Here are some other sets of sequences for you to try hand-editing. Each of these files is a high-quality manually-curated alignment based on comparison to structures. To obtain a low-quality alignment that needs improving, remove all gaps from the alignment using CLUSTALX, and then realign using your alignment software of choice. Then try to fix the resulting (presumably non-ideal) alignment and compare it to the initial alignment. You might also like to try editing several alignments of the same set of sequences, each produced by different alignment software.

Comparing automatic MSA software

You should also note that using different automatic MSA software on the same set of sequences will almost always result in a different MSA. Using different parameters for the same software will also usually give different alignments.

To get some feeling for the variation in the results obtained from different pieces of software, use the following sequences (a set of S1 proteases) to examine the relative performance of CLUSTALW, MAFFT, MUSCLE, and T-COFFEE, four of the most widely-used software packages of this type.

For each of these programmes, carry out an automatic alignment of these sequences.

Compare these alignments with each other, and also with this hand-edited version of the alignment.

Which programme gives the best alignment?

When considering the quality of the alignments, think about:

How many columns the calculated alignments have that are the same as in the reference alignment

Whether there are any sequences that are clearly mis-aligned for the majority of their length
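Both criteria above can be estimated with a short script instead of by eye. The sketch below makes several assumptions: the calculated and reference alignments contain the same sequences in the same order, gaps are written as "-", and a column counts as "the same" only if it holds exactly the same residue of every sequence. The function names and the toy alignments are hypothetical.

```python
def column_signatures(alignment, gap="-"):
    """Describe each column by the residue index it holds for every sequence.

    A gap is recorded as None; columns made only of gaps are skipped.
    """
    counters = [0] * len(alignment)
    signatures = []
    for column in zip(*alignment):
        signature = []
        for row, char in enumerate(column):
            if char == gap:
                signature.append(None)
            else:
                signature.append(counters[row])
                counters[row] += 1
        if any(index is not None for index in signature):
            signatures.append(tuple(signature))
    return signatures


def shared_column_count(test, reference):
    """Count columns of `test` that also occur in `reference`."""
    reference_columns = set(column_signatures(reference))
    return sum(sig in reference_columns for sig in column_signatures(test))


def residue_pairs_by_sequence(alignment):
    """For each sequence, collect (own residue, other sequence, other residue)."""
    pairs = [set() for _ in alignment]
    for signature in column_signatures(alignment):
        for a, i in enumerate(signature):
            if i is None:
                continue
            for b, j in enumerate(signature):
                if b != a and j is not None:
                    pairs[a].add((i, b, j))
    return pairs


def per_sequence_recovery(test, reference):
    """Fraction of each sequence's reference residue pairings kept in `test`."""
    test_pairs = residue_pairs_by_sequence(test)
    reference_pairs = residue_pairs_by_sequence(reference)
    return [
        len(t & r) / len(r) if r else 1.0
        for t, r in zip(test_pairs, reference_pairs)
    ]


# Toy alignments (assumption: same sequences, same order in both).
reference = ["MKV-LAG", "MRVALAG", "MKI-LSG"]
calculated = ["MKVLAG-", "MRVALAG", "MKI-LSG"]
print(shared_column_count(calculated, reference))
print(per_sequence_recovery(calculated, reference))
```

A sequence whose recovery value is much lower than the others is a candidate for being mis-aligned over most of its length. The shared column count is a strict measure, so a single shifted sequence lowers it for every column that sequence passes through.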