(I'm posting this from a Starbucks in Iowa City, where the meeting on Sex and Recombination starts this evening (well, the meeting is in Iowa City but not at the Starbucks).)

I've been doing lots of Perl-simulation runs on my laptop, investigating how the amount of recombination per cycle affects the equilibrium accumulation of uptake sequences. I had originally thought that it might just affect the time to equilibrium, but instead it dramatically affects the state at equilibrium. More recombination gives a higher score, with score being independent of mutation rate over at least the two mutation rates I've been able to test (0.01 and 0.001 mutations per bp per cycle). This will become a nice graph.

My queries about finding a faster computer to run the simulations on gave lots of replies.

(I've deleted the rest of this post because I absentmindedly posted the same info a few days later.)

Here's a better way of looking at how the mutation rate affects (doesn't affect) the outcome of the simulations. It was suggested by a mathematically sophisticated family member.

Previously I plotted the score as a function of the number of cycles run, for each of the three mutation rates, with 3 or 4 replicate runs plotted on the same graph. I've done several things differently in the graph on the left (which PowerPoint has inexplicably warped). The first two differences are trivial - I've plotted the means of the replicate runs, so there are only three lines, and I've put the x-axis on a log scale so the points are spread out evenly.

The third difference is the important one. I've changed the x-axis so instead of being the cycle number it's the total number of mutations each genome has been exposed to, expressed per 100 bp. The µ=0.01 scale didn't change, but the scale for the µ=0.001 runs decreased by 10-fold and the scale for the µ=0.0001 runs decreased 100-fold.
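The conversion itself is trivial; here's a sketch of it (in Python rather than the Perl the simulations use, with the rates from my runs):

```python
# Rescale a run's x-axis from cycle number to cumulative mutation
# exposure, expressed as mutations per 100 bp, so runs with different
# mutation rates can be compared on a common axis.

def exposure_per_100bp(mu, cycle):
    """Total mutations each position has been exposed to, per 100 bp,
    after `cycle` cycles at `mu` mutations per bp per cycle."""
    return mu * cycle * 100

# With mu = 0.01 the axis values match 100x the cycle count; a 10-fold
# lower rate maps the same cycle number to a 10-fold smaller exposure.
for mu in (0.01, 0.001, 0.0001):
    print(mu, exposure_per_100bp(mu, 10000))
```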

Now we see that the runs with three different mutation rates have all given very similar (superimposable) results. I'll present this figure in the paper, as it will nicely justify our decision to do the bulk of the other runs with µ=0.001.

This presentation also shows that, even after 20,000 cycles, the runs with µ=0.0001 are only just reaching equilibrium - I'll try to run them for longer, even if this means running them for a week or two on our lab computer.

In the old days, computer programs ran on big centrally located computers that belonged to universities or big corporations, not on personal desktops or laptops. It just occurred to me that this might still be possible. I'd gladly pay a modest sum to run a few of my simulations on something that was, say, 10-100 times faster than the MacBook Pro I'm typing this on. I tried googling "buy time on fast computer" and other permutations, but couldn't find anything (lots of "When is it time to buy a new fast computer" pages).

I think that there must be places to do this. Perhaps one of my occasional readers will know. But, in case you don't, I'm going to send an email to the evoldir list, which reaches thousands of evolutionary biologists.

I've been running a lot of simulations to confirm the preliminary conclusion that mutation rate doesn't change the equilibrium uptake sequence score of our simulated genomes.

These runs all used a 20 kb random-sequence 'genome' and our simple high-bias uptake matrix. Each cycle recombined 100 fragments of 100 bp each - that's half of the genome. And the bias decreased by a factor of 0.75 for each step of the cycle that didn't give enough recombination. I ran three replicates with mutation rate 0.01, three with rate 0.001, and four with rate 0.0001, each for 10,000 cycles.
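For readers who want the mechanics of the bias-reduction step, here's a rough sketch of how I think about it (Python rather than the actual Perl, and treating the bias as a simple acceptance threshold on fragment scores is my simplification, not the real program's logic):

```python
import random

# Sketch of one cycle's recombination step: sets of candidate fragments
# are tested against the current bias, and each set that leaves the
# quota of recombinants unmet multiplies the bias by 0.75, so ever-
# poorer fragments start to qualify.  Illustrative only.

def recombination_step(draw_scores, needed, bias, factor=0.75):
    """Test sets of fragment scores until `needed` have recombined;
    return the final bias and the number of passes it took."""
    recombined, passes = 0, 0
    while recombined < needed:
        passes += 1
        for score in draw_scores():
            if score >= bias and recombined < needed:
                recombined += 1
        if recombined < needed:
            bias *= factor  # not enough recombination: relax the bias
    return bias, passes

# Illustrative use: scores uniform on (0,1), 100 fragments per set,
# asking for 100 recombinants from a stringent starting bias of 0.9.
random.seed(1)
final_bias, passes = recombination_step(
    lambda: [random.random() for _ in range(100)], needed=100, bias=0.9)
print(final_bias, passes)
```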

Results: All three rates are indeed giving similar equilibrium scores. Each point in the graphs to the left shows the mean scores over the interval since the previous point - that's why the scatter gets less as the points get farther apart. I've calculated the means and standard deviations for each rate, but the error bars are just as big as you would expect from the graphs.

As expected, the time to equilibrium depends on the mutation rate, and the noise is highest for the lowest rate. The lowest rate runs hadn't really reached equilibrium after the 10,000 cycles, so I've taken the final sequences these runs produced and used them to initiate new 10,000-cycle runs, to give results for 20,000 cycles. These aren't finished yet, but they do define an equilibrium in the same range as for the higher-rate runs.

Now that these runs have established that mutation rate doesn't have a (big) effect on outcome, we can discuss the results the former post-doc obtained using a mutation rate of 0.001 and a genome size of 200,000 bp. This larger genome size dramatically decreases the noise in the runs, in the same way that the higher mutation rate does (compare µ=0.001 and µ=0.01 in the graph), giving us more confidence in the equilibrium scores we determine.

I'm at the part of the manuscript where we introduce the analyses done with our simulation model, and I want to begin by considering which parameters affect how fast the runs are, because run speed is what limits our ability to fully investigate how the different factors affect uptake sequence evolution.

Mutation rate. Runs with lower mutation rates take longer to reach equilibrium. I expect the effect to be proportional - a run with a ten-fold lower mutation rate will take 10 times as many cycles to reach equilibrium.

Amount of DNA recombined per cycle. The individual cycles will take longer if more DNA needs to be recombined. The amount of DNA is the product of the number of fragments to be recombined and the length of the fragments; although cycles using many short fragments may take a bit longer than the same amount of DNA in a few long fragments, I expect this effect to be minor. On the other hand, fewer cycles should be needed when more DNA is recombined in each cycle, but this only applies when the total amount of recombination is substantially smaller than the length of the genome.

Present state of the genome. More fragments will need to be tested per cycle when the genome has few good uptake sequences.

Length of the genome. Treated in isolation, this effect should be minor. Runs with long genomes will take longer to score, but scoring is only done once every 10 cycles. And longer genomes give better estimates of equilibrium US density. But since we set the amount of DNA recombined to be a constant fraction of the genome, genome length indirectly has a very large effect on run length.

Number of fragments tested at a time. Taken in isolation, this is largely irrelevant to run time, because it should only make a difference at the start of runs initiated with genomes very rich in uptake sequences. But it determines how fast the bias is decreased, which determines how many fragments have to be tested to get the required recombination, so fewer fragments per set will make the run go faster because bias will decrease faster.

Recombination adjustment factor. This directly controls how quickly the bias is reduced within each cycle; a value close to 1 will slow the bias reduction and so make the cycles take longer than with a factor of, say, 0.5.

Recombination multiplier. Higher values (e.g. 10 or 100) will make each cycle take longer because the initial bias value for the cycles will be higher. But I think we should just keep it constant at 10.

Length of fragments. Short fragments are faster to score, but are less likely to contain uptake sequences so will require testing more fragments.

Now, which of these factors might affect the nature of the equilibrium?

Mutation rate. Yes it certainly could. In earlier versions of the model it made a big difference, but the former post-doc's statistical analysis of the assorted preliminary runs found that it didn't matter with this version. I'm doing several runs right now specifically to confirm this, asking whether rates of 0.01, 0.001 and 0.0001 give the same equilibrium scores.

Amount of DNA recombined per cycle. This should be very important, because the bias only exerts its effect through recombination, whereas mutation is independent of recombination (I'm using the same rates for the genome and for the fragments).

Present state of the genome. By definition this shouldn't affect the equilibrium.

Length of the genome. The score isn't normalized for genome length; it's simply the sum of the scores of all the positions of the sliding window. (Aside on how the scoring is done: the window is the same size as the uptake sequence, and each window-position score is the product of the scores of the individual bases in the window. Because a matched base is worth 100 times more than a mismatched base, only the highest scoring matches to the matrix make a significant contribution to the score.) We're interested in uptake sequence density, not absolute number, so this shouldn't matter. We can adjust the scores for genome length at the manuscript level if appropriate.

Number of fragments tested at a time. Because of its effect on how quickly bias is reduced, this should matter.

Recombination adjustment factor. This should matter because it sets how quickly bias is reduced.

Recombination multiplier. We'll keep this constant at 10.

Length of fragments. This should matter because the bias acts on the whole fragment recombined, not just on the uptake sequence whose score determined the probability that recombination would happen. When a long fragment recombines it may contain mutations that worsen other uptake sequences it spans. So we might expect that short fragments would give a higher density of uptake sequences and thus higher scores.
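Since the sliding-window scoring comes up in several of the points above, here's a sketch of the scheme as described (a Python stand-in for the Perl; the toy 3-bp matrix and the exact 100:1 weights are illustrative):

```python
# Sliding-window genome scoring: the window is the width of the uptake-
# sequence matrix, each window's score is the product of its per-base
# matrix scores, and the genome score is the sum over all window
# positions.  A matched base scores 100 and a mismatch 1, so only
# near-perfect matches contribute noticeably to the total.

def window_score(window, matrix):
    score = 1
    for base, column in zip(window, matrix):
        score *= column.get(base, 1)
    return score

def genome_score(genome, matrix):
    w = len(matrix)
    return sum(window_score(genome[i:i + w], matrix)
               for i in range(len(genome) - w + 1))

# Toy 3-bp 'uptake sequence' AAG with simple high-bias columns:
matrix = [{'A': 100, 'C': 1, 'G': 1, 'T': 1},
          {'A': 100, 'C': 1, 'G': 1, 'T': 1},
          {'G': 100, 'A': 1, 'C': 1, 'T': 1}]
print(genome_score("TTAAGTT", matrix))  # dominated by the one AAG window
```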

If all goes well, my mutation-rate test will confirm that rate doesn't matter, and I can write up the former post-doc's set of analyses with a rate of 0.001. Then I'll add a few tests with real intergenic sequences and call it quits. The simulation results are only one part of the manuscript, so I shouldn't get carried away with trying to improve them more than necessary.

A poster from Lauren Bakaletz's group: T4P-dependent twitching motility can be assessed by formation of satellite colonies on special alkaline agar plates. Perhaps we should try this with our hypercompetent Rd strains.

Neutrophil nets: Neutrophils have a form of apoptosis in which they extrude DNA and other cell materials, forming nets that cover bacterial biofilms growing on epithelia. Probably related: some Campylobacter mutants produce biofilms that are held together by networks of DNA fibers. Neisseria secretes a very stable DNase (one of yesterday's posters), maybe to cut through these nets.

From the "$1 Bacterial Genome" session: Right now, Illumina sequencing can do pools of 15-20 bacterial genomes per lane (~130 genomes total in 8 lanes) at about $200/genome. That's for 50-fold coverage, which is what George Weinstock recommends for Illumina. The big costs aren't the sequencing but the sample acquisition (the genome DNA preps need to be tagged with sequence 'barcodes' before they're pooled) and strain curation and the post-sequencing informatics. But I think we wouldn't need barcoding for our project. The H. influenzae genome is on the small side, so for our pooled recombinant genomes I think that means we could do an Illumina lane and get about 2000-fold coverage of the pool for about $3000!
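Here's my back-of-envelope arithmetic behind that 2000-fold figure (the per-lane numbers are the session's; the ~4 Mb 'average' pooled genome and the exact H. influenzae size are my assumptions):

```python
# Rough check of the lane arithmetic.  Assumptions (mine, not from the
# session): a typical pooled bacterial genome of ~4 Mb behind the
# $200/genome figure, 17 genomes per lane (mid-range of 15-20) at
# 50-fold coverage, and an H. influenzae genome of ~1.83 Mb.

avg_genome_bp    = 4.0e6   # assumed average genome in the per-lane pools
genomes_per_lane = 17
coverage         = 50
hflu_bp          = 1.83e6

lane_yield_bp = genomes_per_lane * coverage * avg_genome_bp
pool_coverage = lane_yield_bp / hflu_bp
print(f"lane yield ~{lane_yield_bp / 1e9:.1f} Gb -> "
      f"~{pool_coverage:.0f}-fold coverage of a pooled H. influenzae sample")
```

Under those assumptions a lane yields about 3.4 Gb, which works out to just under 2000-fold coverage of the pool.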

Today (at the ASM GM) I went to two posters about DNA secretion by Neisseria. I'm interested because of the associated interpretation (assumption? conclusion?) that Neisseria have evolved the ability to secrete DNA because this promotes genetic exchange by natural transformation.

The first poster was about the genetic machinery responsible for secreting DNA, initially characterized in N. gonorrhoeae. It's a large 'island' of genes that strongly resemble the transfer genes of conjugative plasmids (like the E. coli F factor); they encode a type IV secretion system like that used by these plasmids to transfer a single DNA strand into a new host cell. The island appears to have been integrated into the genome by a single crossover with chromosomal DNA, likely mediated by the site-specific recombinase that normally functions to separate newly-replicated daughter DNAs when replication is finished. (This would mean that the island was originally a circular plasmid-like molecule.) Most but not all N. gonorrhoeae strains have some version of this island. Most N. meningitidis strains have it too, but usually with major deletions or other changes expected to make it non-functional.

So is this an adaptation to promote genetic exchange? The evidence looks good that this element does cause cells to release DNA into the medium. The student whose poster I was at said that the DNA is single-stranded and originates from a specific oriT-like site in the element, just as we would expect for a conjugation system.

Amount of DNA released: The first paper reports that cultures after 4 hr growth contained about 150-200 ng of DNA/ml; this was almost entirely eliminated by a mutation in a component of the secretion system. A later paper quantitates the DNA produced by different strains in log phase; producers had 0.2-0.6 ng DNA per µg of cell protein.

Transforming activity of the released DNA: Transformation frequencies are high, between 10^-4 and 10^-3, consistent with the presence of high concentrations of DNA in the medium. Transformation is reduced 250-fold by addition of DNase I, confirming that this is not cryptic conjugation.

Timing of DNA release: The DNA appears in the medium in log phase, whereas cell death is expected mainly when growth slows and stops. The secretion mutant doesn't produce much DNA until the end of growth.

Mechanism of DNA release: The genetic analysis supports the hypothesis that the secretion system causes the DNA release; cells with mutations in secretion genes don't release DNA. But it doesn't distinguish between active secretion and release caused by toxic effects of the element.

Strandedness of the released DNA: This is a critical point. If all the DNA that appears in the medium in log phase is released by secretion, it should all be single-stranded. But single-stranded DNA transforms poorly - a 1981 paper by Biswas and Sparling says it's 100-fold worse than double-stranded DNA - so perhaps much of the released DNA is double-stranded. If so, it's probably not secreted DNA. If it is secreted it should also all be from the same strand of the chromosome - I don't think this has been checked.

The actual state of the released DNA is not clear. The student whose poster this was said it was all single-stranded, and that the dye used to measure it was specific for single-stranded DNA. But the papers describe use of a dye, PicoGreen, that is advertised for measuring double-stranded DNA, and although the authors say that in their hands it also detects single-stranded DNA, they appear to have used double-stranded DNA to calibrate their assay.

I've just sent an email to the senior author on these papers, asking all these questions.

In a couple of hours I'll be introducing myself as part of the round table discussion on Open Science at this year's Annual General Meeting of the American Society for Microbiology. So I'd better figure out what I'm going to say in the 5 minutes I've been allocated.

The first thing to say (after my name) is that my role here is as a representative of the ordinary scientist. I'm not a publisher or an editor or an expert in science communication. I don't have a big lab at a prestigious university. My research is on an evolutionary question, why bacteria exchange genes ("Do bacteria have sex?"), and most people think I'm wrong.

But why me and not some other ordinary scientist? Probably because my lab is doing science in a more open way than most others. This openness manifests itself in several ways:

First, being as open as possible about what we've done. This means publishing in open-access journals such as PLoS where that's reasonable, and paying large sums for open-access publication of the papers we publish in subscription-based journals. It also means posting pdf copies of our papers on our web pages even where the copyright forms we signed say we mustn't.

Second, being as open as possible about what we're presently doing. I have a research blog, where I write about the experiments and analyses I'm doing (what I'm going to do today, the results of what I did yesterday). I also write about the other aspects of doing science, but I try to stick to my own research-related experiences. The other members of my lab have research blogs too.

Third, being as open as possible about what we're planning. So I blog about the grant proposals I'm writing, and when I'm finished writing them I post them on our web pages at the same time that I submit them.

Why do I do this? Partly from principle: I think these are the right things to do. The taxpayers deserve to see what they paid for, and science done openly is more likely to make a difference, and more likely to be good science. And partly for practical reasons: I find that writing about my research helps me think more clearly about it, both before I do it and when I'm thinking about what the results mean. I love reading the blog posts of my lab members - maybe blogging also helps them think about their research, or maybe it just lets me see how smart they really are. I also keep hoping that potential collaborators will read the proposals I've posted and then contact me.

Now, if I only had a printer in my hotel room I could take a copy of this post to the round table. But I don't, so I'd better spend the next few minutes copying out the main points for my 5-minute introduction.

I'm in Philly for the American Society for Microbiology meeting (talking in the panel on Open Science tomorrow afternoon). The meeting hasn't started yet, and I've been playing around with my Perl simulation of uptake sequence evolution, trying half-heartedly to figure out why uptake sequences aren't being maintained. But this isn't getting me much of anywhere, and what I really need to do is buckle down and sort out what we learned from all the runs the post-doc did before she moved on.

The playing around was largely motivated by a worry that something was wrong with the program component that gradually reduces the uptake bias when not enough fragments have recombined. The bias seemed to be becoming very low very fast, and this could explain why the model couldn't maintain the uptake sequences in the genome segment I've been giving it. I've now convinced myself that the bias-reduction component is behaving as it should, so the low bias must mean that the program is going through MANY sets of fragments before it finds enough it likes. Which probably means it's recombining the same fragments over and over within a single evolutionary cycle, not at all what I want to happen.

My new binder now has notes and graphs for both the Perl simulation results and the Gibbs motif sampler analysis.

The new N. meningitidis Gibbs searches specifying small numbers of 'expected' occurrences succeeded in finding the DUS motif. The searches expecting 200 found about 2500 occurrences, and those expecting 500 found about 2600. I've now queued up some searches expecting more (1000, 1500 and 2000), only because my analyses of other genomes have used 1.5 times the number of perfect cores, and for N. meningitidis this needs 2809 occurrences.

The motifs Gibbs found from a genome sequence with the RS3 repeats removed are very similar to those from the full genome sequence. The 'residue frequencies', expressed as percent of each base at each position, are identical, and the 'motif probability models', expressed as probabilities to three decimal places, differ by no more than 0.001. This means that the RS3 repeats are not perturbing the results of whole-genome searches, but they may be perturbing the intergenic-sequence searches. I ran a bunch of intergenic searches last night, with and without RS3s removed. Some of these runs ran into 'segmentation fault' errors, which I remember dealing with a couple of years ago, and I haven't analyzed the outputs yet. I'll try to do that quickly because I really need to focus on the Perl simulation work today.

Update: I had two OK searches with the RS3 repeats and one without. They found nearly identical numbers of occurrences (1570, 1573 and 1572). I made logos of one of each type and they are indistinguishable, so I can conclude that RS3 repeats don't affect the searches of intergenic sequences, at least when the search is expecting a small number of repeats.

The analyses of N. meningitidis coding sequences also showed that it's better to expect an unreasonably low number. With exp=1500, the searches found about 6500, but most were not DUS and the motif was only strong for four of the 12 positions. But with exp=100 the replicate searches found 900 and 901 occurrences, and these gave a very strong DUS-type motif.

When I do benchwork I consistently keep pretty good notes. I write down everything I do as I do it, on numbered and dated sheets of paper that go into looseleaf binders, organized by experiment. If I make notes on a scrap of paper, I tape them in. I write a brief plan (a few sentences about the point and design of the experiment) and end with some sort of conclusion or summary.

But I don't seem to be able to apply these good record-keeping habits when I'm working with computers. Instead everything I do feels 'exploratory', as if everything I do is just a preliminary check to see what effect a modification will have, before I do something worth writing down. The settings I've used for a particular test get overwritten, or lost, or buried in some output file labelled only with the date and time. Sometimes I print out the results and scribble a few words on the sheet to remind me of its significance, but mostly I just rush on to the next test. Occasionally I resolve to keep better records, but I just write a few sheets of notes and then go back to the exploratory rush. The various printouts and scribbled notes eventually get shoved in a folder but are too disorganized to be much use.

I hope I'll do better with the work I'm now doing for the US variation manuscript. I have several relatively well structured goals this time, and a pretty good sense of how to go about accomplishing them. So yesterday I set aside a new binder, and a pad of proper science-notebook paper (not one of those ephemeral yellow pads). The first task for this morning is to organize all the sheets of paper on my desk into either this binder or the recycling bin. All the papers unrelated to the US variation work have already been moved to a pile on the floor, where I'm doing my best to ignore them.

Well, my four Gibbs searches of the whole genome (with and without the RS3 repeats) hit the 36 hour wall when they were only about 1/3 of the way through their 100 seeds (= 100 replicate searches). And, judging by the scores reported for the results with each seed, none of them found the DUS.

So I went back to old blog posts, to see whether this is a problem I had already solved. (Yes, I know that's pathetic; I should be able to remember what I've learned, but the ability to find forgotten results in blog posts is one of the big benefits of blogging about my research.) In this post I considered using a prior that specifies the motif, but decided instead to seed the search with a few hundred bp enriched for the DUS and later remove these occurrences from the output. I'm not really sure that this is the best approach, so maybe I'll try both now (using only 25 seeds and specifying a 48 hr walltime).

I'll queue up a lot of combinations of priors (just-length and base-frequencies), numbers of expected occurrences (200 and 2838), and seeded and unseeded sequences, but I'll do this only for the sequences with the RS3 repeats removed. That's because I already have results from sequences with the repeats, and the purpose of these new runs is just to find out whether removing the repeats makes a difference. If one or more of the runs succeeds in finding the DUS, I'll do the same run with the sequence with the repeats and see if the motifs differ.

I'd rather find out why Gibbs can't find the DUS but can find the less frequent and more complex USS with no trouble. But I don't have any ideas.

On the Gibbs front, my new analysis of the coding sequence DUSs did find them. Both replicates did: the trick was to reduce the expected number of occurrences (not something I would have predicted). I may try that with other difficult searches. My analyses of the whole genome are still running. I hope they don't run over the 36 hours I specified when I put them in the Westgrid queue, because they'll just be aborted and I'll have to start them over again.

On the Perl simulations front, I've got the program running and used it to do the control simulations. The first controls use random sequences the same lengths and base compositions as the concatenated H. influenzae or N. meningitidis intergenic sequences, run with matrices specifying the corresponding USS or DUS core but with no recombination. These controls tell us what the baseline USS or DUS score is for a genome that hasn't experienced any accumulation. The second controls use the real H. influenzae or N. meningitidis intergenic sequences instead of random sequences, and run for a long time to see how long the sequences take to degenerate to the predetermined baselines (i.e. to become randomized with respect to USS or DUS). The score isn't a very sensitive indicator for this degeneration, as the genome may still contain an excess of the imperfectly matched cores, but I'll be able to tell this from the final analysis done at the end of the run.
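Generating such a control 'genome' is simple; here's a sketch (Python rather than the simulation's Perl, with an invented AT-rich base composition standing in for the real intergenic frequencies):

```python
import random

# Generate a random-sequence control 'genome' with the same length and
# base composition as a real intergenic dataset, for measuring the
# baseline uptake-sequence score of an unevolved sequence.

def random_genome(length, base_freqs, seed=None):
    """Random sequence of `length` bases drawn with the given
    per-base frequencies (a dict like {'A': 0.34, ...})."""
    rng = random.Random(seed)
    bases, weights = zip(*base_freqs.items())
    return "".join(rng.choices(bases, weights=weights, k=length))

# Invented AT-rich composition, loosely like H. influenzae intergenic DNA:
freqs = {'A': 0.34, 'T': 0.34, 'G': 0.16, 'C': 0.16}
g = random_genome(20000, freqs, seed=42)
print(len(g), g[:30])
```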

After screwing up the settings many times this afternoon (e.g. specifying the N. meningitidis sequence and matrix but forgetting to change to the corresponding base composition), I realized that I could save myself a lot of wasted time by making two versions of the program, each with its own matrix and sequence files and with a settings file that specifies the appropriate genome size, base composition, and matrix and sequence files. So I did. All of the analyses I've planned will be simulating the evolution of either USS in H. influenzae intergenic sequences or DUS in N. meningitidis intergenic sequences, so now I just need to open the right folder.

Quite a bit of progress on the Gibbs motif sampler analyses yesterday. I figured out what I'd done to remove the RS3 repeats from the N. meningitidis genome sequence (used Word to delete all occurrences of ATTCCCnCnnnnGnGGGAAT). So I then ran some small-scale Gibbs searches on my laptop and the fastest of the lab computers, to see whether removing the RS3 repeats changed the motif it found. But none of the searches found any DUS-like motif at all even when I used a prior file that specified the DUS motif base frequencies. So now I'm rerunning these on a much larger scale (2x100 replicates) on the Westgrid computers, with a prior that specifies the DUS size but not its sequence. The Westgrid computers are slow, but I can have multiple searches running simultaneously, freeing up my own computers to run some Perl simulations (see below).
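For the record, the same repeat removal can be done with a regex instead of Word's find-and-replace (a Python sketch; conveniently the RS3 consensus ATTCCCnCnnnnGnGGGAAT is its own reverse complement, so one pattern covers both strands):

```python
import re

# Delete all occurrences of the RS3 repeat ATTCCCnCnnnnGnGGGAAT from a
# genome sequence, with n standing for any base.
RS3 = re.compile(r"ATTCCC[ACGT]C[ACGT]{4}G[ACGT]GGGAAT")

def strip_rs3(seq):
    """Return the sequence with every RS3 match removed."""
    return RS3.sub("", seq)

# Toy demonstration on a sequence containing one embedded repeat:
demo = "GGGG" + "ATTCCCACTTTTGAGGGAAT" + "TTTT"
print(strip_rs3(demo))  # the flanking GGGG and TTTT survive
```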

I discovered that I don't need to repeat the leading/lagging strand analyses after all. I had forgotten that I'd already redone the N. meningitidis ones (showing that the original surprising result was a fluke), and I decided that the H. influenzae ones I've done don't need to be repeated.

I started analysis of the DUS in the N. meningitidis coding sequences. I ran 2x100 replicates overnight on Westgrid, but they didn't find the DUS even though they used the prior that specified its sequence. Instead they found about 7000 instances of sequences that resemble it only in containing GCCG. I think the problem may be the low density of DUS in coding sequences (~650 perfect 10-mers in ~1.74 megabases; 0.37/kb); the whole genome has ~1900 in 2.2 megabases; 0.89/kb. So I've set up a couple more runs, this time telling the program to expect only about 100 occurrences (yesterday I told it to expect 1500).

Now I'm going to try to get some Perl simulations running, after at least skimming the copious notes and data the former post-doc left me.

I've been trying to sort out the massive jumble of files that might be relevant to the US-variation manuscript, but I've pretty much decided that it's a waste of time. Instead I'm just going to focus on finding the files and sheets of paper where we have summarized what's been done and what we learned. And then I'll try to either find or create the specific files I need (depending on whether finding or re-creating looks easier).

For today, I'm not going to try to do anything about the Perl simulation work. Instead I'll just focus on the Gibbs analyses. I did find my instructions-to-myself of how to do these. I haven't heard back from the RS3 expert (apparently because his spam filter didn't like the urls in my email signature), so the first step now is to find what I've done to analyze these repeats. In particular, I have an N. meningitidis genome sequence from which the RS3 repeats have been removed; maybe I can figure out how I did this, and then test whether it makes a difference to the Gibbs results. And I also need to figure out why I seem to have been using two slightly different counts of the number of perfect 10-mer DUSs in the N. meningitidis genome.

A conversation with the new post-doc, and then with the RA, raised questions about how each of us does the serial dilutions and platings we use to measure cell numbers. These measurements are so fundamental to our research that we haven't questioned whether we're all doing them right.

Do we always use a pipettor and disposable tips, or do we sometimes use glass pipettes?

Do we use relatively large volumes, in culture tubes, or small volumes in microfuge tubes?

Do we always make 1-in-10 dilutions, or sometimes make 1-in-100 or other proportions?

When we use a pipettor, do we use a fresh pipette tip every time, or do we only change tips when we think it matters? Every time we sample from a different tube? Only if the new tube has a higher concentration of cells? Only if the new tube has a lower concentration of cells? Only if the volume we need to measure changes? What about when we're using glass pipettes - do we use a fresh one every time?

Do we pipette liquid up and down in the tip or pipette before removing our sample? Do we pipette liquid up and down to rinse out the tip or pipette after putting the sample into the new tube?

Do we always plate the same volume of dilution onto each agar plate, or do we use different volumes to refine our measurements?

This afternoon we're going to do a test, to find out whether these differences matter. I've grown cultures of two E. coli strains overnight, one wildtype and one resistant to kanamycin. And I've poured lots of plates, with and without kanamycin. I'm going to mix a bit of the KanR culture into the wildtype culture, and then we're all going to dilute and plate the mixed culture to estimate the density of wildtype and KanR cells.

If we all get the same answers, we'll know that our differences in technique don't matter. But if we get different answers then we'll need to investigate further.
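Whatever the differences in technique, the back-calculation we'll all do this afternoon is the same; here's a sketch with invented numbers:

```python
# Back-calculate the density of cells in the original culture from a
# colony count on one plate of a dilution series.

def cfu_per_ml(colonies, dilution, volume_plated_ml):
    """Colony-forming units per ml of the undiluted culture, given the
    colonies counted, the overall dilution (e.g. 1e-6 for six 1-in-10
    steps), and the volume plated in ml."""
    return colonies / (dilution * volume_plated_ml)

# Invented example: 87 colonies from 0.1 ml of a 10^-6 dilution.
print(cfu_per_ml(87, 1e-6, 0.1))
```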

I'm going to sort out what data this manuscript needs before I do another thing!

........

OK. Progress.

Genome analyses needed: I need to reanalyze the Neisseria meningitidis genome with the Gibbs motif sampler, but not until I've decided whether or not to first remove the copies of the RS3 repeat. I've emailed the person who discovered them, asking him whether he thinks they are insertions or arise in situ like uptake sequences. If the former, I'll use the genome sequence that I've already removed them from. I'll do the analysis on the whole genome, and then on the strands sorted by their direction of replication. I did this before and got weird results; if the same thing happens this time I'll investigate further.

I've already done the corresponding analyses for H. influenzae, though I should probably repeat the replication-direction analysis because that was done with a slightly different dataset.

I should also analyze both genome datasets for the numbers of one-off and two-off motifs (singly and doubly mismatched); that will be easy because we have a little Perl script (somewhere) to do that now.
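In case that Perl script stays lost, the counting itself is simple to redo: slide a window over the genome and tally windows at each mismatch distance from the motif. A sketch in Python (the AAGTGCGGT core is the H. influenzae uptake-sequence core; the demo string and the function name are made up):

```python
def count_off_by(genome, motif, k):
    """Count genome windows differing from the motif at exactly k positions."""
    m = len(motif)
    count = 0
    for i in range(len(genome) - m + 1):
        mismatches = sum(1 for a, b in zip(genome[i:i + m], motif) if a != b)
        if mismatches == k:
            count += 1
    return count

uss_core = "AAGTGCGGT"  # H. influenzae uptake-sequence core
demo = "TTAAGTGCGGTCCAAGTGAGGTTT"  # one perfect copy, one one-off copy
print(count_off_by(demo, uss_core, 0), count_off_by(demo, uss_core, 1))  # 1 1
```

For a real genome this should also be run on the reverse complement, and one-off counts would be compared to the expectation from base composition.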

I should look at the effect of coding constraints by doing Gibbs searches with the coding and intergenic subsets of both genomes. But I won't split up the coding subset by the different reading frames - this is messy and not very informative.

The analysis of covariation has been done for Neisseria. I can't remember whether the H. influenzae covariation analysis was only done with the old dataset and so should be redone. The control analysis for Neisseria showed an odd pattern of weak covariation between every third position of random sequence segments. I don't think it's due to coding effects because I see the same pattern, a bit weaker, in the noncoding dataset. Maybe it's those blasted RS3 elements, so perhaps I should redo the Neisseria analysis with the RS3-deleted dataset.
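I don't remember offhand exactly which statistic the covariation analysis computes, but the generic version scores each pair of alignment columns by how far their joint base frequencies depart from independence; mutual information is one standard choice. A sketch under that assumption (function name and toy alignment are mine):

```python
from collections import Counter
from math import log2

def mutual_information(seqs, i, j):
    """Covariation between alignment columns i and j, in bits.

    seqs: equal-length aligned sequences. Returns 0 for independent
    columns, up to 2 bits for perfectly covarying DNA columns.
    """
    n = len(seqs)
    pi = Counter(s[i] for s in seqs)
    pj = Counter(s[j] for s in seqs)
    pij = Counter((s[i], s[j]) for s in seqs)
    mi = 0.0
    for (a, b), c in pij.items():
        p_ab = c / n
        mi += p_ab * log2(p_ab / ((pi[a] / n) * (pj[b] / n)))
    return mi

# Toy alignment: columns 0 and 1 covary perfectly (A pairs with C, G with T)
seqs = ["AC", "AC", "GT", "GT"]
print(mutual_information(seqs, 0, 1))  # 1.0 bit
```

Run on the random-segment control, any statistic of this family should be flat across position pairs, which is what makes that every-third-position pattern so suspicious.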

The analysis of within-species variation at uptake sequences in H. influenzae is done, and there's no N. meningitidis equivalent to do.

And finally, what needs to be done with the Perl simulation of uptake sequence evolution? The few paragraphs I've found in the manuscript (written by me last fall) say that I'm going to take 200kb of intergenic sequence (or maybe all the intergenic sequence) of H. influenzae and of N. meningitidis, and find out what combinations of mutation rates and uptake bias the simulation needs to maintain their present levels of uptake sequences. Sounds straightforward, though I bet it isn't really.

Last night I reread what I've written so far on the manuscript about variation of uptake sequences. It's a bit of a mess, because the different parts were intended for different manuscripts. I got discouraged by all the work that still needs to be done, but this morning I'm more optimistic.

One big job is to finish up all the analyses based on Gibbs motif searches. I need to do the basic searches on the H. parasuis genome, do several analyses of subsets of the N. meningitidis genome, and repeat analyses of the H. influenzae genome that were originally done with a different dataset than the one I'm now using.

A second big job is to do more simulations with the Perl model of uptake sequence evolution, to give a clear picture of how a few basic factors affect their accumulation. Right now I don't even remember which factors this should address, but I think everything is clearly set out in earlier posts and in the documents the former post-doc left me.

And the final big job is to weave the different parts together to make a manuscript that tells a coherent story.

I've scored all my transformation plates and found no evidence of transformation. The Tet plates have no colonies at all. This could be because I used 20 µg/ml when I should have used 10; I'm checking the resistance of RR902 to see if I should repeat this transformation. The Kan plates had lots of colonies when cells were plated undiluted, but the numbers were similar for cells with and without DNA. With DH5alpha the colonies were all tiny even after two days, but quite a few of the colonies produced by the BW25113 derivative RR3015 were reasonably large.

What next? The RA is also going to repeat these experiments, using exactly the same method that gave apparent transformants previously. And I'm going to streak some of my KanR colonies onto MacConkey maltose (the recipients are all Lac-) to see if any are crp- as true transformants should be.

I've been reading over the fellowship application our new post-doc submitted to NIH. NIH didn't fund it, but the comments were quite supportive so we want to fix it up and send it in for the next deadline (June sometime?). For me this is also an opportunity to think carefully about the immediate and long-term experiments we're planning. The immediate experiments will provide preliminary data that should make the proposal more compelling, and will also be a foundation for the full NIH R01 proposal we hope to submit in the fall. The medium-term is the experiments the proposals actually propose to do, and the long-term is where they'll lead, both for the post-doc's career (an issue raised in the reviews of his application) and for my lab's ongoing research.

One preliminary analysis we should do is a comparison of the two genomes he'll use. Both are sequenced, and it would be good to provide a table and/or a figure giving specific numbers of SNPs (is it called a polymorphism when you're only comparing two individuals?), numbers and lengths of indels, and information about specific differences relevant to the proposed analysis. This could be an appendix if such are allowed, or a small table in the text.
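Once the two genomes are aligned, tallying those numbers is mechanical. A sketch of the bookkeeping (my own hypothetical helper, assuming '-' marks gaps in a pairwise alignment):

```python
def diff_summary(a, b):
    """Summarize differences between two aligned sequences ('-' = gap).

    Returns (snps, indels, indel_lengths): number of substitutions,
    number of gap runs, and the length of each gap run.
    """
    assert len(a) == len(b), "sequences must be aligned to equal length"
    snps, indels, lengths = 0, 0, []
    in_gap = False
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            if not in_gap:          # start of a new indel
                indels += 1
                lengths.append(0)
                in_gap = True
            lengths[-1] += 1
        else:
            in_gap = False
            if x != y:              # aligned but different: a SNP
                snps += 1
    return snps, indels, lengths

# Toy example: 2 SNPs and one 2-bp indel
print(diff_summary("ACGT--ACGA", "ACTTGGACGC"))  # (2, 1, [2])
```

For the proposal the table would just report these totals plus the subset of differences falling in the regions the analysis cares about.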

Another thing the proposal needs is a more explicit description of the calculations that underlie its claims that the scope of the sequencing is sufficient for the information desired. He'll be creating pools of DNA from different stages and sequencing these. Depending on the source of the pool of DNA, this will require a lot of sequencing, a ton of sequencing, and what until recently would have been an absurd amount of sequencing. We can afford to do some sequencing on our present budget, and the R01 proposal is mainly to get funding for the massive sequencing. I don't understand the sequencing methods he's proposing as well as I need to, and I haven't seen any of the calculations yet. We need to lay them out in enough detail that the reviewers will have confidence in us. Perhaps we can also create some simple diagrams illustrating how the different pools will be analyzed.
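Whatever his calculations turn out to be, the standard back-of-the-envelope starting point is Lander-Waterman: mean depth is reads times read length over genome size, and the expected fraction of bases covered at least once is 1 - e^(-depth). A sketch with purely hypothetical numbers:

```python
from math import exp

def coverage(reads, read_len, genome_size):
    """Mean sequencing depth and expected fraction of bases covered
    at least once, under the Lander-Waterman (Poisson) model."""
    depth = reads * read_len / genome_size
    return depth, 1 - exp(-depth)

# Hypothetical: 2 million 50-bp reads against a 2-Mb genome
depth, frac = coverage(2_000_000, 50, 2_000_000)
print(depth, frac)  # depth is 50.0; essentially every base covered
```

For pooled DNA the same formula applies per pool, but the depth needed scales with how rare the variants of interest are within each pool, which is presumably where the "absurd amount of sequencing" comes in.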

I'm trying to do the follow-up experiment to the previous unsuccessful attempt to transform E. coli with chromosomal DNA. This 'follow-up' is really a 'fall-back', as I'm now trying to replicate the RA's previous apparently successful transformations. I'm growing the two strains she's transformed previously (derivatives of BW25113 and DH5alpha, each containing a low-copy sxy expression plasmid), and will transform them with the two DNAs she's used successfully, one from RR902, an old E. coli strain containing a Tn10 insertion in purE*, and the other from RR1314, a BW25113 derivative containing a KanR cassette in crp. Both genotypes have been confirmed; RR902 grows fine on minimal medium supplemented with the purine inosine but not at all without inosine, and PCR shows the size of the crp fragment in RR1314 to be that predicted by the disruption.

Because she's found plating anomalies depending on cell density, I'm planning to plate a wide range of dilutions, and I'm also going to use two kanamycin concentrations - the 10 µg/ml she's already used and also 20 µg/ml.

BUT... one of the recipient strains (RR3013, the DH5alpha derivative) refuses to grow in LB with 20 µg/ml chloramphenicol (the selection for the sxy plasmid), although the other strain, which carries the same plasmid, is growing just fine. Both were inoculated at the same time, each from a plump single colony on an LB Cm10 plate. And when I tried to look at the non-growing culture under our very expensive microscope, I discovered that the high-power lens has somehow become all crudded up, and that lens cleaning solution doesn't help.

So I guess I'll just do the transformations into the other strain (RR3015). Update: my reinoculation of RR3013 crept up to the appropriate density so I included it too.

* Our strain list says this strain is also called NK6051 (constructed by Nancy Kleckner), and the E. coli Genetic Stock Center says NK6051 has its Tn10 insertion in purK, not purE. That's fine, purK is the gene next door to purE, so the original mapping was probably an error.