Today we started using Komodo Edit as our editor for Perl. It's much better than mi: better colouring (the numbers are all in red), more reliable indentation, and it catches syntax errors on the fly. And, very cute, when I typed "if (", it automatically added the closing bracket ")" on the far side of my cursor, reminding me that I needed to close the if-condition brackets.

Our most immediate goal is to transfer some of the features of our old/abandoned program into this new one. This includes some basic/sensible features: reading the parameter settings for each run from a separate file, rather than changing the code of the program, and having a mechanism to neatly end the program and save the interim work when it's interrupted with a control-C.
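In Perl the clean-exit part can be done with a signal handler; here's a minimal sketch (the variable names and the commented-out save routine are placeholders, not our actual code):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Flag set by the control-C handler; checked once per cycle so the
# program can finish the current cycle and save before exiting.
my $interrupted = 0;
$SIG{INT} = sub { $interrupted = 1 };

my $cycle = 0;
while ($cycle < 100) {
    $cycle++;
    # ... one cycle of mutation, scoring and recombination ...
    if ($interrupted) {
        print "Interrupted at cycle $cycle; saving interim results\n";
        # save_interim_results($cycle);   # placeholder save routine
        last;
    }
}
```

Because the flag is only checked between cycles, a control-C never leaves the genome half-updated.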

There are also a couple of quite clever features specific to the issues our model raises. One is when to print interim reports on the model genome's status. Depending on the parameters being tested, runs can take anywhere from a few hundred to a hundred thousand or more cycles, with rapid change only at the beginning. Rather than printing reports at a fixed interval, we have the reports printed whenever the number of cycles increases by a specified percentage. This gives very frequent updates at the beginning of a run, and reports at increasingly long intervals as the number of cycles gets large.
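In Perl, the percentage-based schedule might look something like this (the 20% threshold and the variable names are just illustrative):

```perl
use strict;
use warnings;

my $report_fraction = 0.2;   # report when the cycle count has grown by 20%
my $next_report     = 1;     # first report at cycle 1
my @report_cycles;

for my $cycle (1 .. 1000) {
    if ($cycle >= $next_report) {
        push @report_cycles, $cycle;   # stand-in for printing a report
        # schedule the next report for when the count has grown by the fraction
        $next_report = int($cycle * (1 + $report_fraction)) + 1;
    }
}
# Early reports are dense, later ones increasingly far apart:
# 1, 2, 3, 4, 5, 7, 9, 11, 14, 17, 21, ...
```

The reports end up roughly evenly spaced on a log scale, which suits a process that changes fastest at the start.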

The other clever feature is how the model decides that the evolving features of the genome have reached an equilibrium. This is checked by comparing the recent state of the genome (mean of some attribute in the most recent print interval) with the mean of the same attribute in the previous interval and over the whole run. By using the increasing print interval as the unit of measure, we get sensitivity appropriate to the rate of change.
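A sketch of how such an equilibrium test could look in Perl (the attribute means and the tolerance parameter here are my stand-ins, not the model's actual names):

```perl
use strict;
use warnings;

# Decide whether the run has reached equilibrium by comparing the mean
# score in the latest print interval with the mean in the previous
# interval and with the mean over the whole run so far.
sub at_equilibrium {
    my ($recent_mean, $previous_mean, $run_mean, $tolerance) = @_;
    return 0 if $run_mean == 0;    # too early in the run to judge
    my $vs_previous = abs($recent_mean - $previous_mean) / $run_mean;
    my $vs_run      = abs($recent_mean - $run_mean)      / $run_mean;
    return ($vs_previous < $tolerance && $vs_run < $tolerance) ? 1 : 0;
}
```

Because the print interval itself grows geometrically, each comparison averages over more cycles as the run progresses, which is what gives the sensitivity appropriate to the rate of change.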

But I did spend some time with a post-doc working on the Discussion of her manuscript. I had forgotten that last time we worked on it we tore the existing Discussion into shreds and came up with a new organization. So this morning I was discouraged to see that we had improved our Discussion out of existence, but our new organization is so much better that we soon had at least half of the text in place.

On the Perl side of things, a very helpful commenter (Neil) pointed out that finding the missing/extraneous curly bracket would have been easy if we were using an editor that highlights and validates syntax. We're using an editor called 'mi'. It lets us specify that our text is Perl, and uses colours to distinguish between different kinds of text (comments are red, text to be printed is grey, functions are green, operators are sort of purplish, 'while's and 'if's and 'my's are blue), but it doesn't sort out hierarchical stuff like the levels of brackets, and the levels of indentation keep going to hell (possibly my own fault). I'd love to hear about a better Perl editor for Macs, if any reader knows of one.

First the undergrad and I added the code that tallies up the scores of every sliding-window position in the genome. It didn't take us long to get it running (a few stray semicolons, etc.). However, after Keith's comment on yesterday's post, I think there are more efficient ways to do what we've done.

Then we created code that does the recombination a completely different way, so each of the fragments being tested has its own chance to recombine (probability of recombination depends on its score). The code wasn't tricky, but getting it running took ages of tracking down stray curly brackets and discovering that I can't nest two 'while' loops (I changed one of them to an internal 'if' test).

I still need to add in the feature I came up with yesterday, that writes out the genome sequence and the tally of scores only at specified intervals, but now I have to dash off to a 'visioning education' meeting imposed on us by the administration.

The simplest version of our new Perl model of USS evolution has progressed to the state where it runs correctly. This afternoon I've been doing lots of runs, both with a 'positive control' version that replaces a random genome position with a single perfect USS core in every cycle, and with a test version that mutates random fragments and scores them for goodness of match to the USS motif, and then recombines the best-matched one back into the genome. Tomorrow the undergrad and I are going to create a modified version, to try a different way of having the fragments' scores determine whether they recombine with the genome.

With the positive-control version I've been examining the effect of changing the genomic mutation rate. If the mutation rate is zero, the only limit to USS accumulation is the way insertion of new USS disrupts existing USSs. (This happens only because each 10bp fragment is changed to a perfect USS before recombination, and so bears no relation to the original sequence at that position.) Not surprisingly, more USSs are present at equilibrium when the mutation rate is zero, and fewer when the mutation rate is 0.01 or 0.05 changes per position per cycle. The rate of increase in genome score is largely independent of the mutation rate. Because only a single USS is inserted per cycle, the number of cycles to equilibrium depends on the length of the genome.

Wait - good idea! I think we need to add some code to give us the frequency of each sliding-window score at the end of the run. This would let us make a histogram of how many USS-like sequences the genome has at the beginning of the run, and how many it has accumulated at the end. Basically, as the sliding window is scoring the match to the motif at each position, it should record the score in a tally (number of sites scoring 0, number scoring 1, number scoring 2, ... number scoring 10). I could write some inefficient code to do this (barring about a thousand syntax errors - I really should go back and reread the first few chapters of Beginning Perl for Bioinformatics), but this sounds like something the undergrad might have learned an efficient way to do.
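Since the scores run from 0 to 10, the tally itself only needs an array and one increment inside the scoring loop; a sketch with made-up scores standing in for the sliding-window output:

```perl
use strict;
use warnings;

my @tally = (0) x 11;    # counts for scores 0 .. 10

# In the real program these would come from the sliding window as it
# moves along the genome; here they're just example scores.
my @window_scores = (0, 2, 2, 5, 10, 2, 0, 7);

$tally[$_]++ for @window_scores;

for my $score (0 .. 10) {
    print "sites scoring $score: $tally[$score]\n";
}
```

The one-line increment (`$tally[$_]++`) is the efficient idiom; no extra loop over the possible scores is needed while tallying.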

Did I learn anything else from the positive control runs? If the genome is very short the program runs very fast but the scores are noisy (no surprise there). I learned that I have no practical insight into how a sequence's USS 'score' reflects the quality of its matches to the motif - that's why we need the tally. I played around with the 'threshold' we use as a cutoff for insignificant matches to the USS consensus, but I think we can't really understand what this accomplishes until we have the score tally.

I also did some runs with the test version (not the positive control). The results of these mostly served to reinforce the importance of the genomic mutations. Under the present recombination system, USS can't accumulate in the genome because they mutate away faster than they're improved by recombination. I tried turning off the mutation of the genome, so that mutation only happens to fragments that are about to be scored for possible uptake. Even with this 'cheating', the genome's USS score crept up slowly and then plateaued at what looks (without the tally) to be a genome with only weak USS sites.

I spent the weekend, with the rest of my lab, at the Evo-WIBO meeting of evolutionary biologists. Over breakfast we got into a discussion of the role (non-role?) of 'model systems'. I did my usual rant about how evolutionary biologists don't even understand the concept of model systems, but I'm wondering whether the problem is partly just the nature of evolution research.

Molecular biology, biochemistry, cell biology and physiology have made dramatic advances, largely because many researchers in these fields work on the same organisms, so that the findings of one study can be directly used as the groundwork for more studies. Evolutionary biologists (and ecologists) almost always work on different organisms, and although they publish lots of nice papers these rarely can be applied to studies by other research groups.

This is partly tradition - a mark of academic independence in evolutionary biology seems to be choosing your own research system (organism+field site+questions of interest), but it might also partly arise from the nature of the field. The process of evolution is intrinsically tied more to variation than to shared properties - natural selection acts on differences, not similarities. So maybe choosing to work on different systems just looks like the sensible thing to do.

But its consequences are unfortunate, because although every nice bit of research claims to have big-picture implications, the lack of transferability means we haven't really gotten anywhere.

Below is just a list of the main sections of the program, in its present 'test' incarnation.

MAIN PROGRAM:

1. Get parameter settings from a file (except it doesn't, the settings are hard-coded in this version).

2. Create a random-sequence 'genome' of the specified length and base composition.

3. Simulate a set number of cycles (presently 100), each consisting of genome mutation, fragment creation, mutagenesis and scoring, and recombination.

3A. Mutate the genome by randomly changing bases with a specified probability. (This step should be later in the cycle, not here.)

3B. Select a specified number of segments of the same length, from random positions in the genome. These will represent fragments in the external gene pool.

3C. Record each fragment's sequence and 5' end position.

3D. Mutate each fragment's sequence.

3E. Score each fragment's sequence for goodness of match to the uptake sequence motif, using a sliding window. I think the sliding-window scores are not being correctly accumulated.

3F. Choose the fragment with the highest score. Put its sequence at the corresponding genome position, replacing the original genome sequence of this fragment. (This is a simulated form of recombination by gene conversion.) Any mutations of this fragment that occurred in step 3D will thus become changes in the genome sequence.

3G. Score the genome for how well its sequence matches the USS motif.

4. At the end of all the specified cycles, stop and report.
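A toy Perl skeleton of the cycle above (the subroutine names, the motif, and all parameter values are my own stand-ins, not the program's real code):

```perl
use strict;
use warnings;

my @bases = qw(A T G C);
my $motif = 'AAGTGCGGTT';    # stand-in 10-bp uptake-sequence motif

sub make_random_genome {     # step 2 (equal base composition assumed)
    my ($len) = @_;
    return join '', map { $bases[ int rand 4 ] } 1 .. $len;
}

sub mutate_sequence {        # steps 3A and 3D (a redraw may repeat the same base)
    my ($seq, $rate) = @_;
    my @s = split //, $seq;
    for my $i (0 .. $#s) {
        $s[$i] = $bases[ int rand 4 ] if rand() < $rate;
    }
    return join '', @s;
}

sub score_sequence {         # steps 3E and 3G: best sliding-window match
    my ($seq) = @_;
    my $best = 0;
    for my $i (0 .. length($seq) - length($motif)) {
        my $window = substr $seq, $i, length $motif;
        my $score  = grep { substr($window, $_, 1) eq substr($motif, $_, 1) }
                     0 .. length($motif) - 1;
        $best = $score if $score > $best;
    }
    return $best;
}

my $genome   = make_random_genome(100);                          # steps 1-2
my $frag_len = 20;
for my $cycle (1 .. 10) {                                        # step 3
    $genome = mutate_sequence($genome, 0.001);                   # 3A
    my @frags;
    for (1 .. 5) {                                               # 3B, 3C
        my $pos = int rand(length($genome) - $frag_len + 1);
        push @frags, { pos => $pos, seq => substr($genome, $pos, $frag_len) };
    }
    $_->{seq}   = mutate_sequence($_->{seq}, 0.01) for @frags;   # 3D
    $_->{score} = score_sequence($_->{seq})        for @frags;   # 3E
    my ($best) = sort { $b->{score} <=> $a->{score} } @frags;    # 3F
    substr($genome, $best->{pos}, $frag_len) = $best->{seq};
    my $genome_score = score_sequence($genome);                  # 3G
}
```

The four-argument-free lvalue `substr` in step 3F is what makes the gene-conversion-style replacement a one-liner: the fragment overwrites the genome at its recorded 5' position, carrying its mutations with it.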

-----------------------------------------------------

SUBROUTINES:

I. Creating the original random genome sequence: This is pretty simple; it just picks bases randomly, with probabilities specified by the base composition.

II. Mutating the genome or fragment sequence: This is more complex, partly because the mutations need to maintain the base composition (see subroutine III), but mainly because it does it in a relatively non-obvious but more efficient way. It first decides how many mutations to make, by multiplying the genome length by the mutation rate and taking the integer value. (Oops, this will only work if the genome or fragment is big enough to expect at least one mutation per cycle. The 'test' version has only a 100nt genome and a specified mutation rate of 0.001, so it has a real mutation rate of zero.) The subroutine then randomly chooses positions for this number of mutations, and makes the mutations at these positions.
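A sketch of that calculation, with a possible fix for the small-genome problem (treating the fractional part of the expected count as the chance of one extra mutation - my suggestion, not what the program does):

```perl
use strict;
use warnings;

# Expected mutations per cycle = length * rate (note: multiplied, not divided).
sub mutation_count {
    my ($length, $rate) = @_;
    my $expected = $length * $rate;
    my $count    = int $expected;
    # Treat the fractional remainder as the probability of one more
    # mutation, so short genomes and fragments still mutate occasionally.
    $count++ if rand() < ($expected - $count);
    return $count;
}
```

With this fix, a 100-nt genome at rate 0.001 gets one mutation in roughly one cycle out of ten, instead of never.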

III. Doing the calculations for the mutagenesis probabilities: This creates arrays holding the mutation probabilities for each base (A or G or C or T) to mutate to each other base.

I'm working my way through the undergraduate's Perl code, annotating it with detailed comments to explain to myself what I think is going on.

I'm only about 30% through the code, but I think I've found a number of problems, one of them big. Tomorrow morning I'll tie down the undergraduate and go over everything with him. Or maybe tomorrow early afternoon - tomorrow morning I'm supposed to help one of the post-docs polish her talk for this weekend's Evo-WIBO meeting.

Our Perl-programming undergrad has just sent me a copy of the latest version of his program simulating the evolution of uptake sequences by molecular drive. So far I've gotten to about line 100 and found several trivial typos and one source of confusion (to me). I had thought that the order of steps in each cycle was as follows:

Choose random fragments of a specified length from genome and mutate them (as if they came from different daughter cells).

Score each fragment for its match to the USS consensus.

Mutate the original genome according to the same procedure used on the fragments.

Replace the corresponding segment of the genome with the fragment that has the best USS score.

But the standard version of this code seems to instead do the following:

Mutate the whole genome.

Choose random fragments and mutate them (again).

Score each fragment for its match to the USS consensus.

Replace the corresponding segment of the genome with the fragment that has the best USS score.

So the fragments are getting mutated twice.

In actuality, this 'test' version of the code has a couple of steps commented out, and short-circuits the fragment-generation and mutation steps by simply specifying the sequence of every fragment (as a perfect USS). I think this makes it a lot easier to confirm that the code that does the scoring is working as intended. Tomorrow I'll sit down with the undergrad and go over it.

The guy in the next office just got off the phone with a colleague, arguing about whether atheists should rise up and express their views. Triggered by the pending visit of Richard Dawkins to UBC. This horrible movie is one of the reasons I think we should.

Let's see if I can get back to posting every day. First let's see if I can remember what's been on the back burner.

The USS-evolution computer simulation is up and running, and the programming assistant will be able to spend more time working with me and the post-doc to get it doing what we want. It's close, so I'm hoping for lots of advances and discoveries. I've signed up to give a short talk about this at an upcoming meeting on sex and recombination (in Iowa, just before the big evolution meeting in Minnesota at the end of June).

The "How do USS constrain genome function" project with the out-of-town bioinformaticist is ready for its final polishing. A couple of weeks ago she sent me an email which (I think) contains the final data, and it really shouldn't take long now to have the manuscript ready for submission. I've signed up to give a short talk on this work at the evolution meeting.

The post-doc who's been analysing the variation in competence in a diverse set of H. influenzae strains now has a manuscript that doesn't need a lot more work. She's going to give a short talk on this work at next week's Evo-WIBO meeting (evolutionary biologists in Washington, Idaho, British Columbia and Oregon).

Both of the molecular biology post-docs have data that has yet to be put into manuscripts (at least I have yet to see the manuscripts). One has been analysing how CRP binds to recognition sites, and the other has been doing microarrays to find out how CRP and Sxy regulate genes in E. coli.

On the teaching front, I still have to: fix up the final exam so it's a valid assessment tool for our homework-research project as well as for students' understanding; grade 17 term papers that are evaluating intelligent design as a scientific alternative to natural selection; help the graders grade the other ~75 project reports; help our homework grader finish grading the last two homeworks, and prepare detailed keys for these; analyse and post the marks for the 'clicker' questions the students have been answering in classes; administer and help grade final exams for about 360 students; and get all the grades analysed and submitted.

And on the homework-research project front, I (and the wonderful teaching fellow I'm working with) still have to finish a proposal for a small grant to hire assistants to assess the quality of writing in papers and exams by this year's and last year's students; find someone with sufficient Excel skills to transform our clicker-collected survey data into something we can work with; read the literature (I'm hoping the teaching fellow will point me to the appropriate papers); analyse the data; and write the paper.

We are blessed this week by a visit from a potential collaborator - a computer scientist who's done work on uptake sequences. When we described our computer-simulation model of uptake sequence evolution, he quickly discovered a serious problem arising from the way we have set up the uptake bias function to work.

In our model, a pool of DNA fragments is created each cycle, and the fragment with the best uptake sequence score gets to replace the homologous segment of the genome whose evolution is being simulated. The uptake sequence score is determined by using a sliding window to compare the fragment's sequence to the designated uptake sequence. For a long fragment this score is expected to reflect the combined effect of multiple uptake sequences. But at present the model is using short fragments, so the score is likely to just be that of a single uptake sequence.

This means that, once the genome contains a 'perfect' uptake sequence at one position, a fragment homologous to that position is expected to always out-score any other fragments in the pool, and thus always be the one that replaces the resident sequence. Thus one good match prevents the gradual evolution of other not-quite-as-good matches at other positions.

There's a different way to do the competition that doesn't have this problem. Rather than having multiple fragments 'compete' for the best score, with the winner taking the only opportunity to recombine with the genome, we can have each fragment in the pool independently challenge the odds of being taken up. Uptake of one fragment would not affect the chance of any other fragment being taken up in the same cycle.
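In Perl the change amounts to replacing a single winner-take-all choice with an independent random trial per fragment; a sketch in which the score-to-probability mapping (score over the maximum possible score) is just my guess at one reasonable choice:

```perl
use strict;
use warnings;

my $max_score = 10;    # score of a perfect USS match

# Each fragment independently tests its luck; a high-scoring fragment
# elsewhere in the pool no longer blocks it from recombining.
sub fragments_taken_up {
    my (@fragments) = @_;
    my @taken;
    for my $frag (@fragments) {
        my $p_uptake = $frag->{score} / $max_score;
        push @taken, $frag if rand() < $p_uptake;
    }
    return @taken;
}
```

Zero or several fragments can recombine in the same cycle, so the 'perfect USS shuts out everything else' problem disappears.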

This will require some rewriting of the code. But it shouldn't be a big deal, and luckily our undergraduate programmer will have more time to work for us once exams are over at the end of the month.