Monday, 5 August 2013

Clustering proteins using blastclust

A simple way to cluster proteins is using the blastclust program from NCBI. For example, if you have a fasta file of proteins, proteins.fa, you can cluster them by typing:
% blastclust -i proteins.fa -o proteins.fa.blastclust -p T -L .9 -b T -S 95
where '-o proteins.fa.blastclust' means the output file will be proteins.fa.blastclust; '-p T' means the proteins.fa file contains protein sequences; '-L .9 -S 95' means proteins are clustered together if they are >=95% identical over >=90% of their length; and '-b T' means that for two proteins A and B to be clustered, the length threshold must be reached with respect to both A and B.