Speeding up the simulation of USS evolution

Our Perl model of USS evolution now runs fine and does just about everything we want it to. Unfortunately it takes a long time to do this. To date we haven't bothered much about speed, being more concerned with just being able to do the job. But now it's time to improve its efficiency. I see several points where big improvements can be made, but first I should describe the steps the simulation presently takes.

1. Create genome sequence. This is done only once per run so speed isn't important.

In each cycle: (Efficiency is much more important for steps done every cycle.)

2. Mutate the genome, and choose and mutate a specified number of fragments of the genome. I think the mutation step is already quite efficient. The number of fragments that need to be processed each cycle can probably be reduced - I'll describe this below.

3. Score each fragment using a sliding window. Score the whole genome too. This scoring is probably by far the most time-consuming step.

4. Decide whether a fragment should recombine back into the genome, and do the recombination. Not a big time-suck, I think.

5. Make decisions about whether results should be reported and whether the simulation has reached equilibrium. Then report, etc. Not a big time-suck, I think.

I can see three ways to reduce the amount of time used by the sliding-window scoring:

1. Don't score the genome every cycle. The genome score is used to give the operator a sense of how the run is progressing - this needs only be done at the reporting intervals, rather than after every cycle - and to decide whether equilibrium has been reached. To make this decision the simulation calculates a mean score over specified intervals and over the whole run. I think we could devise a way to get most of the information the equilibrium decision needs using genome scores calculated much less often, perhaps even calculated only at the reporting intervals.

2. Score only the number of fragments needed for recombination. The specified number of fragments to be mutated and scored in each run (F) is usually much larger than the maximum number of fragments allowed to recombine with the genome (M); at present these values are set at F=100 and M=10. When the genome has few or no strong matches to the USS consensus (at the start of a run), few or no fragments meet the recombination criterion, so every fragment needs to be considered. But later in the run much time is wasted scoring fragments that won't be considered for recombination.

Rather than mutating and scoring the full F fragments, then considering them for recombination until the M limit is reached, the simulation could just get M fragments, mutate, score and consider those, and then get another M fragments, until a total of M fragments have recombined.

3. Stop scoring each fragment once its score reaches a preset maximum. The program already has an implicit maximum score for fragments, set as the range of the probability a fragment will recombine. But at present the sliding window scores the whole fragment. It could instead stop scoring each fragment once this limit is reached. As in improvement 2, this won't have much effect at the start of a run, when few USS-like sequences are present, but should speed things up a lot later in the run.

4. Improve the efficiency of the sliding-window scoring algorithm. This is where we'd appreciate advice from any programing masters who read this blog. I'm going to paste the code for this bit below, but unfortunately I don't know how to format it so it's easily readable (all the tabs collapse). Maybe I'll first put the code without comments (tidy and compact) and then put it again with the comments.

my $t = 0; # t is the position of each base contained in the window, with respect to the 5' end of the window. my $forwardScore = 1; # Score for the 10 bp sequence at each window position, compared to forward USS.

# This code scores the 10 bases of the fragment that are in the window, from t=0 to t=9. # It first gets the base at each position t, and then gets a score for this base from the scoring matrix.

while($t < $windowSize)

{

my $currentBaseInWindow = substr($fragment,($counter+$t),1); # The base will be A or T or G or C.

4 comments:

It's a good idea to eliminate the "probably" and "I think" qualifiers from what you know about the time consuming parts of the program before you proceed. Imagine spending hours speeding up the part you think *should* be the bottleneck, only to find that it has little overall effect because the real bottleneck was somewhere unexpected.

Why guess, when you can measure? You can run the program in the Perl profiler, built in to Perl (run 'man Devel::DProf' to see the manpage). You run the script like this

perl -d:DProf myscript.pl

and it will create a big data file called tmon.out which you can analyse with a tool like dprofpp ("display perl profile data", which also comes with Perl). This has various display options and will tell you exactly where the time is being spent.

Armed with this knowledge you're ready to concentrate your effort to best effect. Then you can start to see where you can pick a more suitable algorithm or datastructure to overcome a bottleneck.

I'm not sure that you can improve the speed very much. Fact is, Perl is rather slow when it comes to mathematical algorithms such as your sliding window. It's great for parsing text (which is what it was designed for), not so great at CPU-intensive calculations.

Best solution is probably one of your ideas about scoring less often. Or else, find a helpful C programmer.

By the way, I notice that you are using switch(). Which must mean that:

- your version of Perl has switch()- you cunningly coded your own switch()- or you are using Switch.pm

If the latter - be careful. That module used to be riddled with bugs. I don't think it's a problem in your code, but many people quit using Switch.pm years ago. Do a Google search for "Switch.pm bug" to see what I mean.