Nothing that has generated any big insights. The RA and I are working on the E. coli competence question: Given that E. coli appears to have all the genes needed for DNA uptake and transformation, can we detect competence or transformation? We have several strategies: 1. Artificially induce sxy expression to turn on the competence genes, using one of two IPTG-inducible sxy plasmid constructs. 2. Use a recombineering protein to increase the efficiency of recombination. 3. Screen strains of the ECOR collection.

I've been working through the data on regulation of competence in Vibrio cholerae. V. cholerae has most of the same competence genes as H. influenzae, and some of these have been shown to be needed for competence (especially the type IV pilin system and the inner membrane transport protein (Rec2 homolog)). Most of these genes are controlled by promoters with sequences resembling the Sxy-dependent CRP-S sites we have characterized in H. influenzae. Sxy is known to be needed for competence development, as is the complex carbohydrate chitin.

Chitin is a polymer of N-acetyl-glucosamine (GlcNac) subunits, and is the main component of the exoskeletons of most arthropods, including the marine crustaceans that V. cholerae forms biofilms on. Because V. cholerae can break down and metabolize chitin, this is thought to be a major nutrient source in biofilms. So how does chitin availability regulate competence? Does it regulate sxy? Does anything else regulate sxy?

A recent paper by Yamamoto et al. (Gene 457:42-49, 2010) investigated transcriptional and translational control of sxy expression in V. cholerae, and I've spent the afternoon coming to grips with what they did and what they concluded. First they showed that the GlcNac dimer induces competence but the monomer does not. But the transformation frequencies are very low by H. influenzae standards, 1.4x10^-8 and 4.4x10^-8 for two different wildtype strains, 14- and 44-fold above the detection limit of 10^-9. Transformation frequencies were 70- and 136-fold higher with a GlcNac tetramer. Perhaps because of the result in the next paragraph, the authors concluded that the activator was the GlcNac dimer, not the tetramer, and used this for the rest of their experiments.
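Just to spell out the arithmetic behind those numbers (the strain labels below are mine, since I haven't named the strains here):

```python
# Transformation frequencies from the paper, divided by the detection
# limit to get the fold-above-detection values quoted above.
detection_limit = 1e-9
frequencies = {"wildtype strain 1": 1.4e-8, "wildtype strain 2": 4.4e-8}
for strain, freq in frequencies.items():
    print(strain, round(freq / detection_limit))  # 14 and 44

# Frequencies implied by the tetramer's "70- and 136-fold higher":
print(70 * 1.4e-8, 136 * 4.4e-8)  # roughly 1e-6 and 6e-6
```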

Expression of a transcriptional fusion to lacZ was induced about 2-fold by both the dimer and the tetramer of GlcNac (2.1x and 2.3x), but expression of a translational fusion was induced 25- and 34-fold. This tells us that chitin's main contribution to the induction of competence is by increasing the translation of sxy mRNA. The GlcNac tetramer wasn't much more effective than the dimer - the difference between its effect on sxy and on competence may mean that it independently regulates another component of competence.

They also mapped the start site of sxy transcription to 104 nt upstream of the GTG start codon (so the V. cholerae sxy mRNA has a long untranslated leader like the H. influenzae and E. coli sxy mRNAs). They identified various candidate regulatory elements: the -35 and -10 elements of the promoter, the Shine-Dalgarno sequence beside the start codon, and several inverted repeats that they hypothesized had regulatory roles.

They next analyzed a large set of transcriptional and translational fusions of parts of the sxy gene to lacZ. These showed the following:

First, removing the candidate sxy promoter eliminated expression, and replacing it with the Ptac promoter increased expression of all fusions about 10-fold. This tells us that they have correctly identified the sxy promoter, and that it is relatively weak or not fully induced under the conditions used.

With the sxy promoter and the transcriptional fusion, the GlcNac dimer increased expression only 2-fold, but with the translational fusion the dimer increased expression 25-fold. With the Ptac promoter the effects were 0.9-fold and 30-fold. These effects again tell us that chitin's effect is mainly on translation. Nevertheless the authors concluded that chitin dimers act at the promoter to regulate sxy transcription. I think this conclusion is not justified by the small effect seen only with one fusion.

Deletion of only the second inverted repeat had no effect on either kind of fusion. But deletions between this and the third inverted repeat dramatically increased translation in the absence of GlcNac dimers (making translation constitutive), and deletions of coding sequences further downstream had the opposite effect, eliminating translation entirely.

In their inspection of the sxy sequence for candidate regulatory elements, the authors overlooked a very strong potential CRP-N site 50 nt upstream of the promoter. This suggests that V. cholerae sxy transcription may be regulated by CRP/cAMP and thus by the phosphotransferase carbohydrate-utilization system (the PTS). The H. influenzae sxy gene is also regulated by CRP and the PTS, and the E. coli gene has a partial CRP site whose role hasn't been tested yet.

Bottom line: Transcription of V. cholerae sxy is likely regulated by CRP, and translation is tightly regulated by chitin dimers. Chitin tetramers may separately regulate another component of competence. As in E. coli and H. influenzae, CRP activation is likely a signal of nutritional stress (that preferred sugar sources are unavailable). Regulation of competence by chitin may have evolved because of its role as a nutrient, but it may also signal that the cell is in a biofilm, where DNA is usually abundant.

The RA and I have summarized our various results about the E. coli CRP-S regulon and its competence (or lack of competence).

Although we (she, really) have done a lot of work, we still have no experimental evidence that the E. coli sxy gene is inducible at all. And we have no evidence that artificial induction of sxy causes competence. We need something positive for our paper, otherwise it's just a string of negative results that's not nearly comprehensive enough to warrant publication.

The only sxy-dependent phenotypes we have are (1) a pseudo-natural plasmid transformation, where the sxy+ frequency of 9x10^-8 is reduced to less than 2x10^-9 in sxy- cells, and (2) competitive fitness in long-term co-culture, where a sxy- mutant is outcompeted by a sxy+ strain (~10-fold difference in cfu after 6 days, ~100-fold difference after 15 days). Palchevskiy and Finkel have shown that Sxy-induced genes are needed for cells to use DNA as a nutrient, but we haven't been able to replicate this. And we have shown that artificial induction of sxy induces all of the genes in the CRP-S regulon, and that the pilin protein encoded by one of these, ppdD, is translated and correctly processed.

The strong conservation of the CRP-S regulon (including sxy) and the very strong parallels to the H. influenzae competence regulon justify our hypothesis that sxy expression is inducible, and that this induction causes E. coli cells to take up DNA from their environment. To make progress we need at least partial answers to these questions:

Question 1: What factors induce E. coli sxy expression? Another lab has already done extensive testing of culture conditions, assaying for pilin expression from ppdD, but found no induction. I've also tested some conditions for induction of other Sxy-regulated gene fusions, and the RA and former postdoc tested growth-condition dependence using quantitative PCR of sxy itself, again with no good evidence of induction. What we need to do now is to bring our molecular expertise to bear on this question.

We expect to find that E. coli sxy has both transcriptional and translational regulation, because such dual regulation has been demonstrated in both H. influenzae and V. cholerae, and because its transcript has a long untranslated leader (116 nt) like these species. Transcription may be regulated by CRP and cAMP, because the sxy promoter has a partial CRP site. This site looks like a reversed E. coli CRP-S site rather than like a CRP-N site, which might mean that sxy transcription is autoregulated by Sxy itself. Because the site is inverted relative to the sites found in the CRP-S promoters identified by our microarray analysis, this autoregulation may be negative (high Sxy may cause reduced transcription). We haven't yet directly tested either cAMP/CRP or Sxy for direct effects on sxy transcription. We can't do this with either of our sxy-expression plasmids, because neither has an intact sxy promoter, but we could test wildtype cells for altered expression of the chromosomal sxy gene. We could also test whether cells with an internal insertion/deletion in sxy (but an intact promoter and 5' end) have more or less sxy transcript. Unfortunately we still don't have a Sxy antibody so we can't test for protein.

Question 2: Does Sxy induce competence? So far we have not found any evidence of genetic transformation in cells artificially induced to express moderate levels of Sxy. This could be because Sxy doesn't induce DNA uptake in E. coli, but it could also be because the level of Sxy is too low, or because the cells take up DNA but don't recombine it, or because Sxy doesn't induce DNA uptake in the K-12 strain but does in other strains.

The Sxy expression level we tested was low enough that it didn't interfere at all with growth. This was fully-induced expression from a low copy number plasmid with an IPTG-inducible lac promoter. The higher expression level we've used (for the microarray analysis) was very toxic and we couldn't test for transformation at all.
We can clarify some of these alternatives. We can measure DNA uptake directly using radiolabelled DNA. In principle this is not nearly as sensitive as measuring transformation, but that's only true if transformation is very efficient. So maybe Sxy expression is making E. coli competent, and it's taking up lots of DNA, but just not producing any transformants.

We can also measure sxy transcript levels in cells artificially induced to different extents, and use this information to decide which conditions we should use to examine DNA uptake. The high copy plasmid gives massive induction and toxicity, and lower concentrations of the inducer still reduce the growth rate quite a bit. The low copy plasmid doesn't reduce the growth rate at all. I think we should examine an intermediate level, using a low concentration of inducer with the high copy plasmid. And we should add cAMP to these cultures as well as inducer - cAMP isn't needed for sxy induction from the plasmid but it may help with expression of some or all of the CRP-S genes.

There are lots of other E. coli strains we could test in addition to K-12. The RA has already screened the entire ECOR collection (~70 strains) for the level of pilin expression in overnight cultures. She found no detectable expression in any strain. It would be good to test at least one strain more thoroughly, for transformation and DNA uptake, but how to decide which strain(s) to test?

I sat down with the RA yesterday, planning to look for riboswitches in her E. coli sxy mRNA sequence. But a couple of discoveries made this unnecessary.

First, she reminded me that Vibrio cholerae has two sxy homologs, not just one, and we quickly realized that the c-di-GMP ('GEMM') riboswitch is in the one that isn't known to have anything to do with competence.

We also checked the supplementary files for the paper that characterized this riboswitch, and discovered that the authors had done a very extensive search for GEMM riboswitches, not just in all the published bacterial genomes but in all microbial genomes and in a wide assortment of environmental genomics datasets. They found no GEMM riboswitches in any Pasteurellaceae; this isn't surprising because there's no evidence of c-di-GMP in this family. But they also found no GEMM riboswitches in any of the Enterobacteriaceae.

So we decided that E. coli is very unlikely to regulate its sxy gene by a GEMM riboswitch.

I've received a review request for a manuscript submitted to Frontiers in Antimicrobials, Resistance and Chemotherapy. The manuscript is in my area so normally I'd just say OK, but there are a lot of weird things about this 'journal'.

I put 'journal' in quotes because this appears to be one of many nascent efforts of the Frontiers online publishing group. Their home page has headings for 'Science' (7 Fields with a total of 117 Specialty journals, including the one that has contacted me for a review), 'Medicine' (3 Fields, 58 Specialty journals) and 'Technology, Society and Culture' (as yet no Fields and no Specialty journals). Each Specialty Journal has an Editor and a panel of Associate Editors.

These are the same people who keep spamming us with Frontiers in Neuroscience notifications.

The review process described on the Review Guidelines pages is novel and very open. All submitted manuscripts are sent out for review after a simple filtering by the Editor to eliminate obvious junk. As soon as the reviewers have submitted their reviews the manuscript's Abstract is posted under a 'Paper Pending' heading and the manuscript and the reviews are placed in an Interactive Review Forum, where the (still anonymous) reviewers and the authors are supposed to discuss the manuscript (I think the Editor/Associate Editors can join in here). Eventually an agreement is reached on revisions. The authors then submit the final manuscript, which is formatted and published online, along with the names of the reviewers. If no agreement can be reached the Editor may overrule the reviewers, or the paper may be withdrawn by the authors or rejected by the Editors.

Most of the specialty journals have published no original research articles and few or no opinion/review articles. Many of the journal web pages look like they may just be place-holders. I chose one Microbiology journal at random - it has an Editor and 19 Associate Editors, but has published only one paper (an Opinion piece by the Editor) and has one original research Paper Pending.

I clicked on what I thought would be another information page about the reviewing process, and instead found myself with a 15-page pdf of instructions for budding journal editors in the Frontiers system. It's like a pyramid scheme, with the instructions explicitly recommending that the editors build their prestige by recruiting Associate Editors and soliciting authors and articles. This is how the Frontiers enterprise makes its money: by charging authors to publish their papers. Because much of the work on the individual papers is done by unpaid editors and reviewers, the more papers Frontiers publishes the more money they make. No wonder they have so many 'journals'.

Nevertheless I think that, as an advocate of new forms of scientific communication, I should give this a try. I hope it's not too time-consuming.

LATER: Well, the paper was bad. Really really bad. Luckily it was also very short. And did you know that, if your institution subscribes to Turnitin, you can use this service to find evidence of plagiarism in manuscripts as well as in student submissions?

I've (at last) gotten back into working on our review article about the regulation of competence. This morning I was reading about competence regulation in Vibrio, and found out that the 5'-end of the sxy (tfoX) mRNA has a riboswitch secondary structure that responds to cyclic-di-GMP (a 'GEMM' riboswitch).

The H. influenzae sxy mRNA has a long untranslated 5' leader whose secondary structure limits translation - might this be because it's also a GEMM riboswitch? A few years ago we checked it for similarities to the then-known riboswitches, and it didn't fit the pattern at all. But I found a useful genome-survey paper by Michael Galperin which found that H. influenzae (and the other sequenced Pasteurellaceae) have no homologs of the proteins that synthesize and break down c-di-GMP. So they're very unlikely to have any riboswitches that recognize this molecule.

However there are several reasons to suspect that c-di-GMP might regulate sxy expression and competence in E. coli. Like Vibrio species, E. coli strains have multiple proteins predicted to synthesize c-di-GMP. E. coli sxy mRNA also has a long leader. In many bacteria, increased levels of c-di-GMP repress flagellar genes, as does sxy overexpression in E. coli.

In principle we could check for regulatory effects by adding c-di-GMP to cultures of E. coli (or H. influenzae) and look for changes in expression of sxy or of genes it regulates. BUT, very few of the papers I've been reading today did this. Instead the researchers went to a lot of work to genetically engineer cells to produce abnormally high or low levels of c-di-GMP, which makes me suspect that cells may not be permeable to c-di-GMP. Even the few papers that did add it to cultures didn't directly measure changes in gene expression, but just described phenotypic changes such as alterations in biofilm formation. But the papers don't come out and say that exogenous c-di-GMP can (or can't) enter cells. Perhaps I should email some authors about this.

I also should check the E. coli sxy mRNA leader sequence to see if it has the properties expected of a GEMM riboswitch. The RA, always ahead of the game, has already gone to a lot of effort to map the 5' end of this mRNA, so we can sit down with the sequence tomorrow.

This morning the RA and I discussed our immediate research goals. We agreed that it's time to pull together all the work we've done on competence in E. coli, and see what more we need to make a good paper. Although we don't know anything about the properties of competent E. coli (because we have not been able to make E. coli detectably competent), we have accumulated quite a lot of relevant information, and we think that even a negative result paper could be worthwhile.

The basic situation is that E. coli has apparently intact homologs of all of the genes H. influenzae needs to become competent, and all of these are induced when the Sxy activator is overexpressed. One of these genes, ppdD, has received attention from other labs, because it encodes a type IV pilin that appears to be functional (it can be assembled into pili by Pseudomonas aeruginosa), but these labs haven't been able to turn the gene on in E. coli.

One component of the recombination hotspot model presented on Friday was fertility selection. If hotspots are not present to cause crossovers between homologous chromosomes at meiosis, the chromosomes segregate randomly into the two daughter cells, so that half of the time one cell gets both homologs and the other gets neither, creating a defective gamete. This 50% reduction in fertility creates very strong selection for active hotspots.

In our original model, this selection acted directly on the hotspot alleles, but wasn't quite strong enough to preserve the active alleles in the face of their self-destructive mode of action (Boulton et al. 1997; Pineda-Krch and Redfield 2005). In the new model presented at the seminar, this selection instead acts on a modifier locus which determines which hotspot alleles are active. The hotspot alleles undergo mutation that changes their sequence, and mutations at the modifier locus change its specificity so that formerly inactive hotspot alleles sometimes become active. If this occurs when the previously active hotspot has self-destructively converted itself into an inactive allele that's now activated by the mutant modifier, this creates fertility selection for the new modifier allele.

The model presented on Friday was able to reproduce the key features of hotspot evolution - rapid turnover of individual hotspots (replacement of active alleles by inactive ones) and preservation of a reasonable recombination rate. (But I can't remember how high this recombination rate was...). But it depended on fertility selection acting on the modifier.

In the talk I raised one issue that I think is very important, the strength of fertility selection, but I'm not sure how coherently I explained it. Many models of natural selection incorporate a step that restores the population size in each generation, after selection has removed some individuals. In a deterministic model this can be done simply by normalizing the numbers, but in a model that follows individuals stochastically, new individuals must be added to the population in each generation to replace those that have died or failed to reproduce. This implicitly assumes that population size is not limited by selection. This is a dangerous assumption because it eliminates the risk that the population will go extinct if selection is too severe. In most models this is only a theoretical concern, because selection is relatively weak. We usually think of strong selection as a positive force for evolutionary change, but it can also be a negative force causing extinction. In fact, extinction might be the usual outcome, with only those lucky populations that happen to have the right alleles escaping it.
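Here's a toy sketch of the point (my own illustration, not the model from the talk): each individual reproduces with probability w per generation, leaving two offspring or none. If we don't quietly top the population back up to N each generation, severe selection (w below 0.5) drives it extinct.

```python
import random

def generations_to_extinction(n0, w, max_gen=1000, seed=1):
    """Return the generation at which the population dies out, or None if
    it persists (grows to 10x its starting size, or outlasts max_gen)."""
    rng = random.Random(seed)
    n = n0
    for gen in range(1, max_gen + 1):
        # each individual leaves two offspring with probability w, else none
        n = sum(2 for _ in range(n) if rng.random() < w)
        if n == 0:
            return gen
        if n >= 10 * n0:
            return None  # growing; treat as persisting
    return None

print(generations_to_extinction(100, 0.45))  # subcritical: goes extinct
print(generations_to_extinction(100, 0.60))  # supercritical: persists (None)
```

A model that restores the population to N every generation never sees the first outcome, which is exactly the worry about very strong fertility selection.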

Models of hotspot-dependent recombination can incorporate very severe selection, as we discussed in our two hotspot papers. If even a single chromosome loses enough active hotspots that it usually has no crossovers, the population's fertility will be reduced by 50%; if several chromosomes have this problem, fertility will be so low that extinction becomes likely.

Two aspects of the model presented on Friday raised red flags about the strength of fertility selection. First, the modifier locus was assigned a very high rate of mutations that changed its sequence specificity (I think 10^-2 per generation), but never suffered mutations that reduced its activity. This is very unrealistic; everything we know about gene function predicts that loss-of-function mutations will be much more common than change-of-specificity mutations, and nothing about the PRDM9 gene suggests that it should be exempt from this principle. Loss-of-function mutations at the modifier locus would be expected to cause sterility, as is well established for PRDM9. Second, only a single chromosome was modeled, but I think the fertility cost will increase dramatically (exponentially?) as the number of chromosomes increases.
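The chromosome-number cost is easy to see with a back-of-the-envelope calculation. If each of n chromosomes independently lacks a crossover with probability q, and each achiasmate chromosome halves the chance of a viable gamete (random segregation), then expected viability works out to (1 - q/2)^n, which does decline exponentially with n:

```python
# Expected gamete viability with n chromosomes, each independently lacking
# a crossover with probability q. An achiasmate chromosome segregates
# randomly, halving the chance the gamete gets the right dose; averaging
# over the binomial number of failures gives (1 - q/2)**n.
def gamete_viability(n_chromosomes, q_no_crossover):
    return (1 - q_no_crossover / 2) ** n_chromosomes

for n in (1, 5, 10, 23):
    print(n, round(gamete_viability(n, 0.5), 4))  # 0.75, 0.2373, 0.0563, 0.0013
```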

I still think the model is very important, because it incorporates the same features implicated by the PRDM9 work. But it won't be realistic until it considers the real cost of the fertility selection it depends on. It should be easy to modify the model to monitor the fraction of the population that fails to reproduce in each generation. If this fraction is substantial (I'm being deliberately vague here because I don't know how large would be too large), then introduction of a modifier locus hasn't really resolved the paradox.

Yesterday morning was very stimulating - conversation with and seminar by Francisco Ubeda, a visiting theoretician whose focus on the evolution of intra-genomic conflict led him to a very nice model of the evolution of recombination hotspots. The seminar's audience was great too. I'm going to try to make a stop-motion animation of his model, but first I have to model eukaryotic chromosome replication, then mitosis, then meiosis, then crossing over, then initiation of crossing-over by double-strand break repair, then the role of hotspots in initiation, and then the hotspot conversion paradox. At that point I can make a model that incorporates a trans-acting sequence-specific modifier of hotspot activity. This may take some time....

But yesterday afternoon was very frustrating. I had brought my new batch of DNA-coated polystyrene beads to the optical tweezers apparatus, in the hope of testing whether competent cells would bind to them. But I never got to try this, because it was so difficult to get the tweezers to hold on to a bead. Almost every bead that got close to the laser focus was drawn in to it and then immediately spit out again (probably drawn in one side and out the other; the beads appeared to pop through the trap rather than sticking at its focus). I halfway remember my biophysicist colleague telling me that the bead should approach the focus point from the front side, but I had no way of telling whether an out-of-focus bead was in front of the focus or behind it.

First I tried a chamber with B. subtilis cells and beads, then a chamber with beads but no cells. The beads were sufficiently sparse that finding ones to try to trap was inefficient, so I concentrated my bead stock and filled a fresh chamber. This only resulted in lots more beads popping through the trap.

The plane of focus is both the focus of the visible light that illuminates the image and of the laser that traps the bead. I'd been advised that trapping worked best when this plane was about 5 µm behind the coverslip surface (the top of the chamber), so I tried to maintain this position. It wasn't always at exactly the same setting on the micrometer that controls the focus position, because of minor variations in the thickness of the parafilm sheets that form the sides of the chamber. When I didn't have cells attached to the coverslip, I could still check this position once I had trapped a bead, by bringing the focus forward to a position where the coverslip pushed the bead back out of the trap (the laser focus point).

But even with the focus perfectly positioned, only very few beads stayed in the trap for even a few seconds. My colleague suggested trying 3 µm beads (mine were 2.1 µm) as she's had consistent success with them. But I couldn't get them to work much better than mine. Eventually I gave up. I think it may be time to set this whole project aside until we find a graduate student to take it on.

I needed to write a short paragraph describing how our research area fits with UBC's new program in Genome Science and Technology (GSAT). Here it is:

My research group uses genomic technology to investigate the different ways that recombination shapes bacterial genomes, focusing on the natural transformation system of Haemophilus influenzae and using DNA sequencing as an experimental tool to identify the causes and consequences of DNA uptake and recombination. One project aims to fully characterize the recombination tracts produced when cells of one strain take up DNA from another, using Illumina sequencing of many independent recombinant genomes. A second project uses these recombinant sequences in genome-wide searches for the loci responsible for the differing abilities of natural bacterial strains to be transformed. A third project is characterizing the sequence specificity of DNA uptake by applying deep sequencing to DNA fragments that have been preferentially taken up by competent cells. Finally, we are using optical tweezers technology to physically characterize the process of DNA uptake by naturally competent cells.

So yesterday I incubated some streptavidin-coated polystyrene beads (2.1 µm diameter) with some biotin-tagged EcoRI-cut H. influenzae DNA. How many beads? About 3.5 x 10^8. How much DNA? About 2 µg. The mixture was incubated at 37 °C for about 4 hr, first undiluted and then diluted to 50 ml in TE buffer. I washed the unbound DNA off the beads by drawing the mixture through a filter with 0.2 µm pores; I expected this to retain the beads but allow the unbound DNA to pass through. I washed the filter by drawing 25 ml of TE buffer through it, five times. This was a slightly less thorough series of washes than it sounds, because I tried to always leave a little bit of buffer on the filter, worrying that if it was sucked dry the beads might be difficult to recover. But I didn't always succeed, so the washes were pretty thorough. The thoroughness of the washes becomes important below.

I put the filter into a tube with a ml of TE and agitated it a lot to resuspend as many of the beads as possible. I then used the hemocytometer to count the resuspended beads and a comparable input bead suspension. This showed that I'd recovered more than 90% of the beads. Then I used Picogreen to measure the amount of DNA in the bead suspension: 330 ng/ml. This let me calculate how much DNA was on each bead: about 1000 kb! The average fragment size of EcoRI-cut H. influenzae DNA is about 3-4 kb, so this is about 300 DNA fragments per bead.
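For the record, here's the calculation (assuming the usual 650 g/mol per base pair, and taking 3.5 kb as the midpoint of the 3-4 kb fragment-size range):

```python
AVOGADRO = 6.022e23
BP_MOLAR_MASS = 650.0    # g/mol per base pair (average)

dna_ng = 330.0           # Picogreen reading, ng in the ~1 ml bead suspension
beads = 3.5e8
mean_fragment_bp = 3500  # EcoRI-cut H. influenzae DNA averages 3-4 kb

g_per_bp = BP_MOLAR_MASS / AVOGADRO
bp_per_bead = (dna_ng * 1e-9) / beads / g_per_bp
print(round(bp_per_bead / 1e3), "kb per bead")  # ~870 kb, i.e. "about 1000 kb"
print(round(bp_per_bead / mean_fragment_bp), "fragments per bead")  # ~250
```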

I was initially quite pleased with this result, but then started worrying that this was too high to be true. Is there even room on a 2.1 µm bead for this many fragments? And 1000 kb is more than half a H. influenzae genome. I rechecked my calculations, and they all seemed correct. I had never tested whether the filter-washing procedure worked as I thought it should - might much of the DNA have been trapped on the filter rather than being washed away, and might it then have been resuspended along with the beads? If so, most of the DNA in my washed beads prep might not be bound to the beads.

So this morning I pelleted the beads, removing all but ~20 µl of the supernatant before resuspending them in another ml of TE, and did Picogreen assays on both the bead-free supernatant and the resuspended beads. This showed that about 75% of the DNA was indeed on the beads. This means I have lots of beads with lots of DNA on them, ready for many tweezers experiments.

Yesterday I also used the washed beads to transform competent cells. Preliminary colony counts (the plates need longer incubation) suggest that the transformation frequency was very low (about 10^-7), much lower than expected for the presumed DNA concentration of ~100 ng/ml. This is consistent with much of the DNA being inaccessible to the cells. (But I should go back and check a control transformation I did in another experiment, in case the EcoRI-cut DNA always transforms poorly*.)

* Indeed, EcoRI cuts within the gyrB (NovR) gene, and even when the DNA was not bound to beads the transformation frequency was only 1.3 x 10^-7.

I'll be going to the university across town on Friday, partly to hear an informal talk about the evolution of recombination hotspots (a problem we pioneered) and partly to try to get cells to attach to DNA-coated beads using the optical tweezers.

I haven't done anything with the tweezers since before we submitted our latest CIHR grant proposal (Beads and cells). That attempt used beads that had been incubated with DNA and thoroughly washed, but I hadn't taken the time to check how much (if any) DNA was actually bound to the beads. This time I want to be sure that there's DNA on the beads, so after I wash them I'll use Picogreen to measure the bound DNA.

So I'll first incubate the streptavidin-coated polystyrene beads (2.1 µm diameter) with biotin-tagged chromosomal DNA (how much?) for a couple of hours, inverting the mixture on the roller wheel to keep the beads from clumping. Then I'll dilute the beads and wash them by trapping them on a 0.2 µm filter, pouring lots of TE through them. Then I'll resuspend the beads in a small volume of TE and measure the DNA concentration. Maybe I'll also use the beads in a transformation assay to check that DNA is present and the cells can take it up.

The postdoc just gave me a copy of a short article by Sean Eddy titled "What is a hidden Markov model?" (Nature Biotechnology 22:315-316). It's only two pages long, and the heading "Primer" flags it as something for beginners. But I'm struggling to understand it, even with help from the postdoc. So this post will probably be a mix of attempts to explain what a hidden Markov model (HMM) is and does, and complaints that Eddy has failed to weed out much of the jargon from his explanation.

Below I've pasted the figure he uses to illustrate his explanation. We assume that we have a short DNA sequence (look in the middle of the figure, below the arrows), and we want to infer which of the bases in it are exon, 5' splice site, or intron. Because we're told that the sequence starts as exon and contains only one 5' splice site, the only decision we need to make is the location of this splice site.

I think this is how an HMM would do this. It independently considers all of the possible functions for every position, assigns them probabilities (based on the base it finds in the given sequence at that position), and then picks the combination of functions with the best probability score. Because there are 24 bases and 3 possible functions for each, there are (I think) 3^24 different combinations to be considered. Fortunately many of these never arise because of several constraints that the model has been given. First, only Gs and As can be 5' splice sites, as shown in the base probabilities given at the top of the figure. Second, there can be only one splice site. Third, the 'exon' function ('E') can only occur before the splice site ('5'), and the 'intron' function ('I') can only occur after it. This last constraint is indicated by the horizontal and circular arrows that connect these symbols (below the base probabilities); these specify how the state of one position affects the probabilities associated with states at the next position.
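To convince myself of how much the constraints help, here's a quick Python sketch. The sequence is my own transcription of the figure's 24 bases (so it's an assumption, and may be off); I just count the positions that could legally be the splice site:

```python
# My assumed transcription of the figure's 24-base sequence -- check it
# against the actual figure before trusting the counts below.
seq = "CTTCATGTGAAAGCAGACGTAAGT"

# Without any constraints, every position could be E, 5 or I.
unconstrained = 3 ** len(seq)   # 3^24 = 282429536481

# With the constraints (one splice site, E before it, I after it,
# and only an A or a G can be the splice site), each legal combination
# is fully determined by where the splice site is.
candidates = [pos for pos, base in enumerate(seq, start=1) if base in "AG"]

print(unconstrained)    # 282429536481
print(len(candidates))  # 14
```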

After describing the figure Eddy says 'It's useful to imagine a HMM generating a sequence', but I don't think this is what he means. Or rather, I suppose that he's using the words 'generating' and 'sequence' in some special sense that he hasn't told the reader about. By 'sequence' he doesn't seem to mean the sequence of bases we were given. Maybe he means one of the many possible combinations of functions the model will assign to these bases for the purpose of calculating the combination's probability, given the set of constraints the model is using.

He then says 'When we visit a state, we emit a residue from the state's emission probability distribution.' OK, he did define the 'emission probability distribution' - it's the base probabilities at the top of the figure. But what can he mean by 'visit a state' and 'emit a residue'? The postdoc says that 'emit' is jargon that roughly means 'report'. But we already know the residues - they're the bases of the sequence specified in the figure. Maybe the HMM is moving along the 24 positions, and at each position it 'visits' it asks what the base is ('emits a residue'). It then considers the probabilities of all three potential states, given both the state assigned to the previous position and the probabilities of finding that specific base given the state it's considering.

Maybe this will make more sense if I consider starting at the 5' end of the sequence and applying the model...

OK, start at position 1. What states might it have, and with what probabilities? According to the transition probability arrows, it will have state E with probability 1.0, so we don't need to consider any influence of which base is present at this position (it's a C). What about the next base (position 2)? The arrows tell us that there's a 0.1 chance of a transition to state 5, and a 0.9 chance of this position being in state E like position 1. This position has base T, which means it can't have state 5 and so must be state E. The same logic applies to positions 3 and 4 (T and C respectively).

Position 5 has base A, so now we start to consider the first branching of alternative strings of state assignments, one where position 5 has state E (call this branch A) and one where it has state 5 (call this branch B). What are the probabilities of these two branches? To get the probability of the state 5 alternative I guess we multiply the 0.1 probability of a state transition by the 0.05 probability that a state 5 position will have base A. So the probability of the state 5 branch is only 0.005, which must mean that the probability of the state E branch is 0.995.

Position 6 has base T. In branch B, this position must have state I, because it follows a state 5 position. All the bases after this must also be state I, so the total probability of the splice site being at position 5 is 0.005. In branch A, position 6 must be state E, because a T can't be a splice site.

Position 7 has base G, so we must calculate the probability that it is the splice site as we did for position 5. We multiply the probability of the transition (0.1) by the probability that a state 5 position will have base G (0.95), giving a branch probability of 0.095 (call this branch C). But we need to take into account the probability of branch A that we already calculated (0.995), so the total probability of branch C is 0.095 x 0.995 = 0.094525. The other branch can still be called branch A; it has probability 0.995 x 0.905 = 0.900475. [Quick check - the probabilities so far sum to 1.0 as they should.]
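Here's the same arithmetic as a few lines of Python, using the numbers I've taken from the figure (E-to-5 transition probability 0.1, splice-site emission probabilities A = 0.05 and G = 0.95), mostly as a check that the branches really do sum to 1.0:

```python
p_E_to_5 = 0.1                    # transition probability E -> 5 (from the figure)
emit5 = {"A": 0.05, "G": 0.95}    # splice-site emission probabilities (from the figure)

# Branch B: splice site at position 5 (an A).
p_B = p_E_to_5 * emit5["A"]                 # ~0.005
p_A5 = 1 - p_B                              # still exon after position 5: ~0.995

# Branch C: splice site at position 7 (a G), given we stayed in E.
p_C = p_A5 * p_E_to_5 * emit5["G"]          # ~0.094525
p_A7 = p_A5 * (1 - p_E_to_5 * emit5["G"])   # still exon: ~0.900475

print(p_B + p_C + p_A7)                     # should be 1.0 (up to rounding)
```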

Position 8 is a T; in branch A it must have state E. Position 9 is another G... OK, I see where this is going. I think this example might be a bit too simple because only one branch continues (we don't have to calculate probabilities for multiple simultaneously ramifying branches). There are only 14 possible combinations of states, one for each of the As and Gs in the sequence, because only these are potential splice sites.
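Generalizing the bookkeeping above into a loop (again using my assumed transcription of the figure's sequence): walk along the positions, keep a running probability that we're still in the exon state, and peel off a branch at every A or G:

```python
# My assumed transcription of the figure's sequence (check against the figure).
seq = "CTTCATGTGAAAGCAGACGTAAGT"
emit5 = {"A": 0.05, "G": 0.95}    # splice-site emission probabilities

p_still_E = 1.0   # running probability that we're still in state E
branches = {}     # 1-based position -> probability that it's the splice site

for pos, base in enumerate(seq, start=1):
    if pos > 1 and base in emit5:          # position 1 is E with probability 1
        branches[pos] = p_still_E * 0.1 * emit5[base]
        p_still_E -= branches[pos]

print(len(branches))               # 14 branches, one per A or G
print(branches[5], branches[7])    # ~0.005 and ~0.094525, as calculated above
```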

Anyway... Did this exercise help me understand what Eddy is trying to explain? If what I've written above is correct, then yes, I guess I sort of understand the rest of the article (except for the sentences immediately following the ones I quoted above). If what I've written is wrong, then, of course, no.

In the next paragraph he explains why this is called a Markov chain (because each state depends only on the preceding state), and why it's 'hidden' (because we don't know the true states). And the later paragraphs are mostly clearer, except for one place where he lapses back into the jargon about residues being emitted by states.

He explains that the 'posterior decoding' columns at the bottom of the figure are the probabilities that each of the Gs is the true splice site. But the probability I've calculated for position 7 (0.095) is larger than indicated by the corresponding column (about 0.03-0.04?), so I might have done something wrong in calculating the probability for this splice site.

Aha. I've overlooked the different probabilities for the bases in the I state. I think I have to modify the probability that positions 5 and 7 are splice sites by the probabilities that the bases that follow them are introns. I shouldn't just calculate the probability that position 5 is a splice site from the position 4-to-5 transition probability and the position 5 emission probability for a splice site (p = 0.005), and then assume that the following positions are intron sites. Instead I need to modify the calculated probability of 0.005 by the probabilities associated with each of the following positions being in the intron state, according to their known base identities.
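Here's my attempt to redo the calculation that way, in Python. The exon emissions (0.25 each), intron emissions (A 0.4, C 0.1, G 0.1, T 0.4) and transitions (E-to-E 0.9, E-to-5 0.1, 5-to-I 1.0, I-to-I 0.9, I-to-end 0.1) are my reading of the figure, and the sequence is my assumed transcription, so all of these numbers should be checked against the actual figure:

```python
seq = "CTTCATGTGAAAGCAGACGTAAGT"                    # assumed transcription
emitE = {b: 0.25 for b in "ACGT"}                   # exon emissions (assumed uniform)
emit5 = {"A": 0.05, "G": 0.95}                      # splice-site emissions
emitI = {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}    # intron emissions (assumed)

def joint(k):
    """Joint probability of the path with the splice site at 1-based position k."""
    p = 1.0
    for j, base in enumerate(seq, start=1):
        if j < k:
            p *= (0.9 if j > 1 else 1.0) * emitE[base]       # stay in E, emit base
        elif j == k:
            p *= 0.1 * emit5[base]                           # E -> 5, emit base
        else:
            p *= (1.0 if j == k + 1 else 0.9) * emitI[base]  # enter/stay in I, emit
    return p * 0.1                                           # final I -> end

# Candidate splice sites: every A or G that has exon before it and intron after it.
candidates = [j for j, b in enumerate(seq, start=1) if b in emit5 and 1 < j < len(seq)]
total = sum(joint(k) for k in candidates)
posterior = {k: joint(k) / total for k in candidates}
```

If my reading of the figure is right, posterior[7] should now come out much closer to the height of the position-7 column at the bottom of the figure than the 0.095 I calculated above.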

OK, I take back most of my initially harsh criticism of this article. There's one big flaw in the middle, where he slips into technical jargon, using what appear to be simple English words ('emit', 'visit') with very specialized technical meanings that cannot be inferred from the context. But otherwise it's good.