Simple uses for EMBOSS

I’m a big fan of EMBOSS and I’m always finding new uses for it. Here’s a really simple fix that you might call “clean-up”.
Let’s say that you have a fasta file with an ID like “>gi|15232491|ref|NP_188759.1|”. Run that through any EMBOSS application (e.g. iep) and you’ll get output such as:

IEP of NP_188759.1 from 1 to 348
Isoelectric Point = 8.8631

Hmm. The application has decided to strip down the fasta ID. What if we want to parse the output, grab the ID and match it to the original fasta sequence? Well, we could try some regex matching and string processing, but that’s error-prone, especially if we don’t know in advance what ID formats we’ll be dealing with.

Seqret to the rescue. Seqret is a deceptively simple-looking EMBOSS app that can retrieve, read and write sequences. We can feed our fasta file to seqret like so:

seqret -sequence myfasta.fa -outseq myfasta2.fa

All that we’ve done is read in a fasta sequence and write it out again. However, because all EMBOSS apps strip fasta headers in the same way, the ID of our sequence in myfasta2.fa will read “>NP_188759.1”. Now when we pass myfasta2.fa to other EMBOSS apps, the IDs in their output will match the IDs in the file. If you wanted, it wouldn’t be hard to create e.g. a Perl hash mapping the original IDs in myfasta.fa to the stripped IDs in myfasta2.fa.
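
To make that mapping idea concrete, here is a minimal sketch in Python rather than Perl. It is only an illustration under a couple of assumptions: that seqret writes records in the same order it read them, and that the ID is the first whitespace-delimited token on each header line. The function names fasta_ids and map_ids are made up for this example and are not part of EMBOSS.

```python
def fasta_ids(path):
    """Return the ID of every record in a fasta file, in file order.

    The ID is taken to be the first whitespace-delimited token after '>'
    (an assumption; adjust if your headers differ).
    """
    ids = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                ids.append(fields[0] if fields else "")
    return ids


def map_ids(original_path, cleaned_path):
    """Map each stripped ID to its original ID by record position.

    Assumes the cleaned file holds the same records in the same order
    as the original, which is what a plain seqret pass should produce.
    """
    original = fasta_ids(original_path)
    cleaned = fasta_ids(cleaned_path)
    if len(original) != len(cleaned):
        raise ValueError("the two files contain different numbers of records")
    # Keyed by stripped ID, because that is what shows up in EMBOSS output.
    return dict(zip(cleaned, original))
```

Calling map_ids("myfasta.fa", "myfasta2.fa") returns a dictionary keyed by the stripped ID, so an ID pulled from an EMBOSS report can be looked up to recover the original header ID. If two different original IDs happen to strip down to the same short ID, the later one silently wins, so it is worth checking for duplicates on real data.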

3 thoughts on “Simple uses for EMBOSS”

Interesting. I’ve never gotten round to playing with EMBOSS, although you’re starting to change my mind. Fasta IDs have been problematic for me in the past, and I’ve always used kludgy regex workarounds, which seem both ugly and unstable. I shall have to investigate…
Mind you, my current source of despair is the perennial multiple ID sources problem, where I have a set of expression probes imperfectly annotated from a variety of sources. If I really wanted to look up the probes for a list of genes, I’d have to get all their various IDs to compensate for possible holes in my (presupplied) lookup table… Head hurts now.

EMBOSS rocks. It’s a toolkit in the true *NIX sense: hundreds of sturdy, fast, reliable tools that each do one thing well and pipe to and from each other. You can also whip up some great Bioperl pipelines by feeding sequence objects directly to a Bio::Factory::EMBOSS object.