How hard can it be?

Sunday evening I sent the post-doc who's been learning Perl an email asking if we could write a little 'script' that would take a long DNA sequence and break it into shorter fragments of a specified length. (Short Perl programs are often called 'scripts'; I suspect this may be a more general term for program modules.) The sequences need to be in a style called 'FASTA format', which requires each sequence to start on a new line, and to be preceded by a line starting with ">". Even with our very limited Perl skills I figured this should only take us a little while.

I need this done because the 'unbiased motif search' program I'm using to look at G-USSs (or whatever we're calling them) in the genome will only accept sequences that are ≤10,000bp long. The genome is about 1,830,000bp long, and I didn't want to have to go through it by hand finding the right places to insert carriage returns and ">"s. I'm also starting to think we should do similar motif searches on lots of other genomes, so having this process automated would be a big help.

This afternoon we (the post-doc, mostly) finally got it to work. The problem wasn't so much our lack of Perl skills as our realization that we had to include other steps that would clean up the sequence before it could be used.

For example, the motif-search program rejects any sequences containing letters other than A, C, G, T and N. A, C, G and T are the bases in DNA, and N is used when we're not sure what base is present at a particular position. The sequences we're using include other letters indicating other kinds of uncertainties: N means "could be any of A, C, G or T", W means "could be A or T", Y means "could be C or T", etc. So we needed our script to go through the sequence, find all the Ys and Ws (and Ks and Rs and Ss and Ms) and replace them with Ns.
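A minimal sketch of this clean-up step, written in Python rather than Perl and not the script described in this post. The function name is hypothetical. It assumes the sequence can safely be upper-cased, and it treats B, D, H and V (the remaining IUPAC ambiguity codes, not named above) the same way as the two-base codes.

```python
import re

# IUPAC ambiguity codes other than N: the two-base codes R, Y, S, W, K, M
# and the three-base codes B, D, H, V.
AMBIGUITY_CODES = re.compile(r"[RYSWKMBDHV]")

def mask_ambiguity_codes(sequence: str) -> str:
    """Return the sequence with every ambiguity code replaced by N.

    Assumption: upper-casing is acceptable for the downstream program.
    Characters that are not IUPAC letters are left untouched.
    """
    return AMBIGUITY_CODES.sub("N", sequence.upper())

print(mask_ambiguity_codes("ACGTWYKRSMN"))  # prints ACGTNNNNNNN
```

Listing the codes explicitly, instead of replacing "anything that isn't A, C, G, T or N", means stray characters such as line endings are not silently turned into Ns.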

Then we realized that we also had to get rid of any end-of-line characters (line feeds and carriage returns) in the input sequence. These can be invisible in the sequence but will send Perl into a tailspin, especially if they're the peculiar carriage-return line endings that Macs use. We finally gave up on trying to get Perl to remove them, but found an easy way to have Unix remove them from the sequence before we give it to the Perl script.
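For illustration only, here is one way to do the same job in Python; it is not the Unix approach mentioned above, and the function name is hypothetical. Unix files end lines with a line feed (LF), Windows files with a carriage return plus line feed (CR LF), and classic Mac OS files with a carriage return (CR) alone, so the sketch deletes both characters wherever they occur.

```python
# Translation table that deletes carriage returns and line feeds.
_DELETE_LINE_ENDINGS = str.maketrans("", "", "\r\n")

def strip_line_endings(text: str) -> str:
    """Remove every CR and LF so the sequence becomes one unbroken string."""
    return text.translate(_DELETE_LINE_ENDINGS)

# Mixed classic-Mac (CR), Windows (CR LF) and Unix (LF) line endings.
print(strip_line_endings("ACGT\rACGT\r\nAC\nGT"))  # prints ACGTACGTACGT
```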

One big obstacle was getting the script to insert the ">" breaks where we wanted them. Our strategy was to have the script divide the base number by the interval at which we wanted the breaks, and only put a break when the remainder equaled zero. (For example, say the desired interval is 500bp. Only at positions 500, 1000, 1500, 2000 etc. will the remainder after dividing by 500 be 0.) Coding this should have been easy, because algorithms for calculating remainders are built into Perl. But either we had the instructions for calculating the remainder backwards, or we just got 'divisor' and 'dividend' confused (hey, grade 4 was a long time ago).
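The remainder strategy can be sketched as follows, in Python rather than Perl and not the actual script. The function name and the `fragment_1`, `fragment_2` header names are hypothetical. In both languages the remainder operator is written `%`, with the dividend (the base position) on the left and the divisor (the interval) on the right.

```python
def split_to_fasta(sequence: str, interval: int, name: str = "fragment") -> str:
    """Break a sequence into FASTA records of at most `interval` bases each."""
    if interval <= 0:
        raise ValueError("interval must be a positive number of bases")

    fragments = []
    current = []
    # Count positions from 1 so that breaks fall at interval, 2*interval, ...
    for position, base in enumerate(sequence, start=1):
        current.append(base)
        # dividend % divisor: the remainder is 0 only at multiples of interval.
        if position % interval == 0:
            fragments.append("".join(current))
            current = []
    if current:
        # Leftover bases after the last full interval form a shorter record.
        fragments.append("".join(current))

    # Each record is a ">" header line followed by the fragment on its own line.
    return "".join(
        f">{name}_{number}\n{fragment}\n"
        for number, fragment in enumerate(fragments, start=1)
    )

print(split_to_fasta("ACGTACGTAC", 4), end="")
```

With a 10-base input and an interval of 4, this produces two 4-base records and a final 2-base record. Slicing the string in steps of `interval` would be shorter, but the loop mirrors the position-by-position remainder check described above.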

But now it works.

I think I also discovered why I wasn't getting responses from the online interface to the motif search software. RTFMing (RingTFM?) revealed that I had been putting the wrong kind of numbers into a couple of the boxes. Because the responses from this interface usually take a day to arrive (by email) I won't know until tomorrow whether that was really the problem - I know it was A problem, but I don't know if it was the reason I got no responses, as I also got no error messages.

The post-doc also may have succeeded in getting a version of the motif-search program running just for us on the whiz-bang computer system WestGrid. It will be a bit more cumbersome to set up than the online version, but we should be able to get lots of results fast.

Added the next day: The guy in the next office just showed me some bioinformatics software that, among other things, does just what our new Perl script does. Oh well, we're still proud of having done it ourselves.

"Short Perl programs are often called 'scripts'; I suspect this may be a more general term for program modules"

No, a module in Perl is something different. It's basically a self-contained unit of code that you can reuse in many applications. Most module files end in ".pm" - if you see "use XXX;" in a Perl script, there's a corresponding module file, XXX.pm.

Once you've been using Perl for a while, you'll soon get into modules. There are tens of thousands readily available from a site called CPAN (cpan.org), and they're a fantastic time saver. The first rule is - check if someone has written a module that does what you want before trying to do it yourself!

For biologists who use Perl, you can't beat the Bioperl modules (bioperl.org). With Bioperl experience, you could easily solve your sequence splitting problem in a few lines.