"One of the most intimate relationships that our body has with the outside world is through our gut. Our gastrointestinal tracts harbor a vast and still largely unexplored microbial world known as the human microbiome that scientists are only just beginning to understand. Researchers are recognizing the integral role of the microbiome in human physiology, health, and disease — with microbes playing critical roles in many host metabolic pathways — and the intimate nature of the relationships between the microbiome and both host physiology and host diet. While there is still a great deal to learn, the newfound knowledge already is being used to develop dietary interventions aimed at preventing and modifying disease risk by leveraging the microbiome.

The IOM’s Food Forum held a public workshop on February 22-23, 2012, to explore current and emerging knowledge on the human microbiome, its role in human health, its interaction with the diet, and the translation of new research findings into tools and products that improve the healthfulness of the food supply. This document summarizes the workshop."

I was unable to go but am very interested in the topic. Fortunately one can get the report for free. And I will be reading it ASAP.

Thursday, October 25, 2012

I assume if you pay any attention to science satire/humor you are familiar with PhD Comics by Jorge Cham. If not, you must check it out. It is simply brilliant stuff. And thus I was completely floored when I was contacted about whether I wanted to be interviewed by Jorge for a video he was commissioned to make as part of Open Access week activities. I mean - I frigging say no to almost everything these days but I said yes to this almost immediately.
And so I did a phone interview with him and Nick Shockey from SPARC.
And then Jorge worked his magic -- and here it is.

Monday, October 22, 2012

Uggh. Just read this: Specific bacterial species may initiate, maintain Crohn's. Basically it reports on a paper that showed a correlation between bacterial taxa and early Crohn's disease. The paper makes a big deal out of showing a correlation in the severity of pediatric Crohn's and the types of microbes found. Good. That is useful. But here is the thing. It is a $&*#($@(& correlation. They have NO IDEA if this is the result of the CD or the cause (or both). To go around pushing the idea that this is about bacteria initiating CD is misleading.

The news release says "The work may ultimately lead to treatment involving manipulation of the intestinal bacteria." True. The work may ultimately also lead to my screaming. Oh wait. It did already.

As I have said many times, I believe the human microbiome is VERY important. I believe it probably plays a role in all sorts of human issues - health and disease. But we need to be careful not to be misleading about what we know and don't know ...

Saturday, October 20, 2012

Yesterday I recorded a review session for a class with iTalk https://itunes.apple.com/us/app/italk-recorder/id293673304?mt=8. I recorded it on my iPhone 4S.

When I got home to upload the file and to convert it to an MP3 to share with the class I discovered that it seemed to not be there in the iTalk file list.

I thought - maybe I never formally "saved" the file but maybe iTalk kept the recording somewhere.

So I opened up iTunes, connected to my phone, and there it was in the Apps file area.

I then copied the file to my desktop and no matter what I do I cannot seem to open it and/or extract audio out of it. I have tried to open it a million ways with all sorts of desktop and online programs and nothing works. My guess is somehow the file was not closed out correctly and thus even though it is 430 MB it is viewed as empty by all the programs I have tried.

Friday, October 19, 2012

Interesting story in the BBC News on a paper from PLoS Pathogens: BBC News - Faecal transplant clue to treating gut bug (seems that the article has disappeared - maybe they jumped the embargo? --- anyone --- found another version here). In the work, researchers from the Sanger Institute infected mice with Clostridium difficile and then treated them with different combinations of microbes isolated from mouse feces. In the end they are reported to have identified a combination of six strains that was highly effective in clearing the C. difficile infections. I say "reported to have ..." because I cannot find the PLoS Pathogens paper, again suggesting to me that the BBC story may have somehow jumped the embargo. Will post more when more comes out.

"What are the most efficient methods to extract microbial DNA that accurately represents the community it is isolated from? Janelle Weaver reports on efforts to identify the best methods for DNA extraction from unknown frontiers in the human body and across the globe."

Background New computational resources are needed to manage the increasing volume of biological data from genome sequencing projects. One fundamental challenge is the ability to maintain a complete and current catalog of protein diversity. We developed a new approach for the identification of protein families that focuses on the rapid discovery of homologous protein sequences.

Results We implemented fully automated and high-throughput procedures to de novo cluster proteins into families based upon global alignment similarity. Our approach employs an iterative clustering strategy in which homologs of known families are sifted out of the search for new families. The resulting reduction in computational complexity enables us to rapidly identify novel protein families found in new genomes and to perform efficient, automated updates that keep pace with genome sequencing. We refer to protein families identified through this approach as "Sifting Families," or SFams. Our analysis of ~10.5 million protein sequences from 2,928 genomes identified 436,360 SFams, many of which are not represented in other protein family databases. We validated the quality of SFam clustering through statistical as well as network topology--based analyses.

Conclusions We describe the rapid identification of SFams and demonstrate how they can be used to annotate genomes and metagenomes. The SFam database catalogs protein-family quality metrics, multiple sequence alignments, hidden Markov models, and phylogenetic trees. Our source code and database are publicly available and will be subject to frequent updates (http://edhar.genomecenter.ucdavis.edu/sifting_families/).
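
To make the "sifting" idea in the abstract a bit more concrete, here is a toy Python sketch. It is NOT the SFams pipeline: the shared k-mer similarity score stands in for the global alignment similarity used in the paper, and the function names, the 0.5 threshold, and the greedy "compare to the first member" rule are all assumptions made up for illustration. The only point is the order of operations: new sequences that match a known family are sifted out first, and only the leftovers go into de novo clustering.

```python
def kmers(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=3):
    """Jaccard similarity of k-mer sets (a crude stand-in for alignment similarity)."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def sift_and_cluster(families, new_seqs, threshold=0.5):
    """Add new sequences to `families` (a list of lists), sifting first.

    Step 1: sequences similar to the first member of a known family are
            added to that family and removed from further consideration.
    Step 2: the leftovers are greedily clustered into new families.
    Returns the number of new families created.
    """
    leftovers = []
    for seq in new_seqs:
        for fam in families:
            if similarity(seq, fam[0]) >= threshold:
                fam.append(seq)  # homolog of a known family: sifted out
                break
        else:
            leftovers.append(seq)

    new_families = []
    for seq in leftovers:
        for fam in new_families:
            if similarity(seq, fam[0]) >= threshold:
                fam.append(seq)
                break
        else:
            new_families.append([seq])  # seed a new family

    families.extend(new_families)
    return len(new_families)


# Hypothetical toy data: one known family, three incoming sequences.
known = [["MKTAYIAKQR"]]
created = sift_and_cluster(known, ["MKTAYIAKQK", "GGSWLLPPVV", "GGSWLLPPVA"])
```

In this toy run only the two sequences that do not match the known family ever reach the de novo step, which is where the reduction in work comes from as the catalog of known families grows.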

Will try to write more on this soon but am in the middle of teaching a 700 person course so a bit overwhelmed with other things.

Thanks to the Gordon and Betty Moore Foundation for support for this work.

A student in my Intro Bio class is interested in learning more about the origin and evolution of the genetic code. I am looking for some relatively recent papers to suggest to her.
I have found the following:

Thursday, October 11, 2012

I have just joined an advisory group for the UC Davis Magazine and I am really happy with their new direction. They are trying to make the magazine a little less "UC Davis is awesome" and more "Here are some interesting things to think about, with a UC Davis angle". The new Fall Issue is a good example of this. There is for example a nice article by Sasha Abramsky about student involvement (or the lack thereof). Plus there is a little video interview to go with the article.
Perhaps even more "interesting" is the article on the future of higher education by Clifton Parker. Not exactly a glory piece about UC or UC Davis.
Anyway ... just thought I would put this out there. Any opinions on the magazine please send them my way - the staff there seem great and really interested in feedback.

Thursday, October 04, 2012

Below is another in my series on "The Story Behind the Paper" with guest posts from authors of open access papers of relevance to this blog. Today's post comes from David Pollock in the Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine. Note - I went to graduate school with David and he is a long time friend. This is why he apparently feels okay to call me Jon even though I go by Jonathan. I have modified his text only to format it. Plus I added some figures from his paper.

The fundamentals of the paper
Okay then, why do I think adaptation and convergence in regulatory systems is cool and important? Well, first because I think a lot of important changes in evolution have to have come about through regulatory evolution, and yet there are huge gaps in our knowledge of how this change might happen. And I say this as someone who has spent most of his career studying (and still believing in the importance of) sequence evolution. Second, a lot of people seem to think that evolution of whole regulatory systems should be hard, because there are so many interactions that would need to change at the same time. Remember, transcription factors can interact with hundreds of different binding sites to regulate hundreds of different proteins. It makes sense that evolution of such a system should be hard. In this paper, I think we go a long way towards demonstrating that this intuitive sense is wrong, that functional evolution of regulatory systems can happen quite easily, that it has happened in a system with around a thousand functional binding sites, and that some of the details of how it happens are really interesting.

Aside from promoting the science, the other reason I want to blog about this paper is that I think it is a great demonstration of how fun and how diverse science can be. To support its points, it brings in many different types of evidence, from genomics to population genetics to protein structure prediction. It is also a good example of using only publicly available data, plus a lot of novel analysis, to see something interesting that was just sitting there. I think there must be a lot more stories like that. All of that Encode data, for example, is bound to have some interesting undiscovered stories, even with 30 papers already published. The most fun part, though, which I don't think I can fully recreate without jumping up and down in front of you, was just the thrill of discovery, and the thrill of having so many predictions fall into place with data from so many different sources.

I don't recall another project where we would say so many times, "well, if that is the explanation, then let's look at this other thing," and bam!, we look at the other thing and it fits in too. It started with Ken Yokoyama, the first author, walking into the lab having just published some pretty good evidence that the preference for the SP1-associated binding site (the GC-box) was newly evolved in the ancestor to eutherian mammals. Well, if that's true, there ought to be a change in the SP1 protein sequence that can explain it. Sure enough, there is, and SP1 is a very conserved protein that doesn't change a lot. Hmm, we have more sequences now, let's look to see if preferences changed anywhere else on the phylogenetic tree. Yes, in birds. Well, there ought to be a change in bird SP1 that can explain that; sure enough, there is, and it's at the homologous position in the protein. Looking good, but is it in the right place in the protein? Yes, in the right domain (zinc finger 2, or zf2), right behind the alpha helix that binds the nucleotide for which the preference changed. And before you ask, Ken ran a protein structure prediction algorithm on the amino acid replacements in SP1, and the predicted functional replacements in bird and mammal are predicted to bend the protein right at the point where it binds the nucleotide at which the preference changed. You might then ask if this amino acid replacement does anything to the binding function, and the answer again is "yes". This time, though, we were able to rely on existing functional studies, which showed that human SP1 binds 3x better to the GC-box binding site than it does to the ancestral GA-box (more on this below).

Fig. 2. Birth-death rates of the SP1 binding motif in mammals. Birth rates (α) denote the probability (per year) that an unoccupied position will gain a binding site; death rates (β) give the probability (per year) that an existing binding site is lost. Branches in the mammalian phylogeny were partitioned into three groups: early eutherian mammals (red), late eutherian mammals (black), and GA box-preferring non-eutherian mammals (blue). Birth and death rates of each group were estimated for the GC box (GGGCGG), GA box (GGGAGG), and the non-functional motif GGGTGG (Letovsky et al. 1989; Wierstra 2008).
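
For readers who have not met birth-death rates before, here is a minimal Python sketch of what the α and β in the legend imply if they are held constant. This is a generic two-state model and not the estimation procedure from the paper; the function names and the rate values are invented for illustration (real per-year rates would be far smaller). If p is the fraction of positions occupied by a binding site, each year it gains α(1 - p) and loses βp, so it settles at α / (α + β).

```python
def occupancy_over_time(alpha, beta, p0, years):
    """Expected fraction of occupied positions, year by year.

    alpha: per-year probability an unoccupied position gains a site (birth)
    beta:  per-year probability an occupied position loses its site (death)
    p0:    starting fraction of occupied positions
    """
    p = p0
    history = [p]
    for _ in range(years):
        p = p + alpha * (1.0 - p) - beta * p  # gains minus losses
        history.append(p)
    return history


def equilibrium_occupancy(alpha, beta):
    """Occupancy at which yearly gains and losses balance."""
    return alpha / (alpha + beta)


# Made-up rates, chosen only so the curve converges quickly.
history = occupancy_over_time(alpha=0.02, beta=0.06, p0=0.0, years=500)
target = equilibrium_occupancy(0.02, 0.06)
```

Read this way, a burst in the birth rate of one motif on a set of branches pushes that motif toward a higher occupancy, while a motif whose death rate exceeds its birth rate slowly fades.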

The coup de grace on this residue position as the source of convergent functional changes, though, came with consideration of the other transcription factors that interact in this regulatory system (that is, they bind to the same binding sites to modify transcription). If there was a functional change in the transcription factor, driving modified changes in the binding sites, then it seems that this should affect the other transcription factors in the system. It could have been hard to figure anything out about these other transcription factors, but luckily they consist of SP3 and SP4, two paralogs of SP1. This means that they are ancient duplicates of ancestral SP proteins, they share a great deal of conserved sequence with SP1, and they bind with similar affinities to the SP1 binding consensus. And they have not just one or two, but between the two proteins, in birds and mammals, at least eight convergent amino acid replacements at the homologous position that putatively modified binding in SP1. And the substitution that occurs is the same replacement that occurred in bird SP1. Based on sequences from jawed fish and frogs, this position was almost completely conserved in the SP3 and SP4 paralogs for 360 or 450 million years of evolution. The convergent changes all occurred in only the last 100 million years or less of eutherian and bird lineages. We believe that the simplest interpretation is that, over tens of millions of years, a functional replacement occurred at the SP1 protein, adaptively driving hundreds of SP1 binding sites to convert from ancestral GA-boxes to derived GC-boxes, and that this then drove the same functional replacement in coregulatory paralogs SP3 and SP4.

Timing and a mechanism

A question often comes up at this point: "how do we know the order of these events?" The simplest piece of evidence for the order of events comes from the order of fixation of substitutions. The amino acid substitutions in SP1 are fixed in all eutherian mammals and all birds, indicating that they occurred on the branches leading to these taxon groups. The increase in GC-boxes occurred over time at different loci, mostly on the branch leading to eutherian mammals and on the branches immediately after that split the most ancient eutherian mammal groups. The replacements in SP3 and SP4 occurred later in the evolution of eutherian mammals and birds, and did not occur in all lineages. One might be able to come up with complicated scenarios whereby changes in some SP1 binding sites occurred first, driving the fixation of the SP1 replacement, followed by further selected changes in other SP1 binding sites, but we think our hypothesis is simpler.

Fig. 3. Population frequencies of an adaptive mutant transcription factor and its binding sites. (NOTE - SOME DETAIL OF LEGEND LOST IN COPY/PASTE - SEE PAPER). (A) Shown are the population frequencies of the adaptive mutant transcription factor allele (blue), which first occurs in a single heterozygous individual at generation (population size: ). The total population frequency of the novel binding consensus (BOXC) and the initial wild-type binding motif (BOXA) are shown in red and black, respectively. We assume a small adaptive benefit for the adaptive transcription factor SPC binding to BOXC (relative fitness , where ) over the wild-type transcription factor and its motif (relative fitness ). Maladaptive binding events (SPC binding to BOXA or the wild-type transcription factor binding to BOXC) have reduced fitness ( , where ). Population frequencies of SPC, BOXA, and BOXC are given on the left for the first 20,000 generations and on the right for 150,000 generations. (B) Evolution of the adaptive trans-factor and binding sites under a semi-dominant model. SPC binding to BOXC is assigned relative fitness for individuals heterozygous for the transcription factor genotype ( ) and for individuals homozygous for the mutant transcription factor. (C) The single binding site locus model. In contrast to the previous model, each locus is restricted to no more than one binding motif (either BOXA or BOXC).

Other pieces of evidence also come into play. The question about the order of events can be rephrased as a question of whether neutral forces, such as changes in mutation rates at binding sites, could have altered the frequency of the alternative binding sites, with SP1 (and then SP3 and SP4) playing functional catch-up to better match the new binding site frequencies. It seems to us that such a model would predict that the binding sites would have changed irrespective of the function of the proteins that they regulate. (As an aside here, we note that our binding site data set is best described not as a definitive set of SP1-regulated promoters, but as a set that is highly enriched in functional SP1 binding sites. We don't trust binding site function predictions, and the putative binding sites inclusively considered were those that had either the ancestral GA-box or the derived GC-box in the functionally relevant region prior to the transcription start site. Such sites are highly enriched for categories of genes known to be under SP1 control.) But the binding sites that shift from GA-boxes to GC-boxes are even further enriched for categories of genes under SP1 control. This is not compatible with the neutral mutational shift model, but is compatible with the idea that the subset of our sites for which SP1 regulation is most important are the ones that were most likely to adaptively shift box type when the SP1 with altered function became more frequent.

The mutational driver model also predicts a simple shift in frequencies driven by mutation. For example, GA-boxes might tend to mutate into GC-boxes, and conversely, GC-boxes might tend to be conserved and not mutate to GA-boxes. What I haven't told you yet, though, is that the excess GC-boxes do not tend to be produced by mutation from GA-boxes, but rather they tend to be produced as de novo mutations from non-SP1 box sequences. They are produced by a wide variety of mutations from a wide variety of different sequences that are slightly different from the canonical SP1 binding sites. Furthermore, the GC-boxes appear in a burst of birth early in eutherian evolution, but the GA-boxes don't disappear in a burst at the same time. Rather, they simply fade away slowly over time in lineages that have evolved GC-boxes. It is not clear to us that this can be explained using a mutation model, but it is easily explained by a model in which the SP1 replacement has adaptively driven hundreds of binding site convergent events. This is then followed by the slow mutational degradation of the GA-boxes, which don't matter so much to function anymore. It is also worth mentioning that the GC box preference doesn't seem to correlate with GC content, as several fish lineages are just as high in GC content as humans, but do not have the GC-box preference.

Fig. 4. Structural changes of SP1 zinc finger 2 (zf2) following replacements at site -13. (Top) Comparisons of predicted lowest-energy zf2 structures between the native human peptide (-13M), and peptides following replacements to the ancestral valine (M-13V) and bird isoleucine (M-13I) at site -13. Structural alignments were conducted according to residues on the 5’ end of the peptide (residues -16 to -12). Both -13M and M-13I peptides showed displacement of residues 5’ to the DNA-contacting alpha-helix (sites -6 to -1) compared to the ancestral valine peptide. No such displacement was seen between -13M and M-13I. All three peptides aligned closely at the 3’ end of the alpha-helix (sites +6 to +10), reflecting structural modifications at the 5’ end of the alpha-helix. (Bottom) Distances between alpha carbons prior to and within the alpha-helix (blue and orange, respectively). Comparisons between the native human peptide and M-13V (left) and between M-13I and M-13V (center) show closely-aligned residues at the 3’ end of the alpha-helix and increasing displacement towards the 5’ end. These modifications begin around site +3, which directly contacts the A/C evolving site of the SP1 binding motif (Philipsen et al. 1999; Bouwman et al. 2002; Dhanasekaran et al. 2006). No such region-specific displacement was observed between -13M and M-13I (right).

The observed pattern of binding site evolution is also predicted by a model we developed to determine if the evolution of transcription factors and their binding sites could be explained in a population genetics framework. We asked, what is a possible mechanism by which these changes might occur? At the beginning of this post, I noted that a lot of people seem to think that evolution of complex multi-genic regulatory systems should be hard. We reasoned, though, that if the beneficial effects of a newly evolved binding interaction were dominant or semi-dominant (that is, the beneficial effects in the heterozygote were at least partly visible to selection), then it might be possible for evolution to be achieved through a transition period in which both the transcription factor and its cognate binding sites were polymorphic.
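
To see why dominance matters so much here, the following is a minimal Python sketch of the textbook single-locus diploid selection recursion. It is NOT the multi-locus model from the paper; the function names, the selection coefficient s, the dominance coefficient h, and the starting frequency are illustrative assumptions. Genotype fitnesses are taken as 1 + s (mutant homozygote), 1 + hs (heterozygote), and 1 (wild-type homozygote). A rare allele sits almost entirely in heterozygotes, so selection can only "see" it if h > 0.

```python
def next_freq(p, s, h):
    """One generation of selection on a mutant allele at frequency p."""
    q = 1.0 - p
    w_mm = 1.0 + s      # mutant homozygote
    w_mw = 1.0 + h * s  # heterozygote (h = 0 recessive, h = 0.5 semi-dominant)
    w_ww = 1.0          # wild-type homozygote
    mean_w = p * p * w_mm + 2.0 * p * q * w_mw + q * q * w_ww
    return (p * p * w_mm + p * q * w_mw) / mean_w


def trajectory(p0, s, h, generations):
    """Allele frequency in every generation, starting from p0."""
    freqs = [p0]
    for _ in range(generations):
        freqs.append(next_freq(freqs[-1], s, h))
    return freqs


# Same small benefit (s = 0.01), same rare starting frequency,
# differing only in whether heterozygotes show the benefit.
semi_dominant = trajectory(p0=1e-4, s=0.01, h=0.5, generations=2000)
recessive = trajectory(p0=1e-4, s=0.01, h=0.0, generations=2000)
```

Comparing the last entries of the two lists shows the semi-dominant allele climbing while the fully recessive one barely moves from its starting frequency over the same period.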

We developed both a deterministic and a stochastic model, and found that, indeed, even small (nearly neutral) selective benefits per locus can drive the entire system to fixation. What happens is that as long as there are some binding loci with the new binding box in the system, then a new variant of SP1 with a preference for the new binding box, and an associated small selective benefit, will at first rapidly increase in frequency. It won't immediately fix though, but rather will maintain a temporary steady state, kept down in frequency by the deleterious effects that new variant homozygotes would have with the large number of binding loci that are homozygous for the ancestral binding box. Once it has reached this steady state, it exerts selective pressure on all the binding loci to increase the frequencies of the new binding boxes at each locus. Much more slowly then, the frequency of the new transcription factor variant increases, in step with the frequencies of new binding boxes at all the binding loci.

Although our studies do not prove that our population genetics model is the exact mechanism for adaptive changes in SP1, it provides proof of concept that it is not difficult for such a mechanism to exist.

Where does this leave us?
At the broader level, this paper shows that small selective benefits can drive the evolution of complex regulatory systems (in diploids, at least; sorry to leave out the micro folks, Jon). Furthermore, it demonstrates, we believe convincingly, that adaptation has driven the evolution of the SP1 regulatory system, driving convergent evolution at many hundreds of promoters, and in SP3 and SP4. It thus strongly counters prevailing notions that such evolution is hard. We hope that this work (along with other work of this kind) will drive others to further pursue the broad questions in regulatory evolution. Are the details of the SP1 system common to other regulatory systems?

A particularly important question, which we did not focus on here, is whether the evolution we have described involves only static maintenance of the status quo in terms of which genes are regulated. One has to wonder, though: if it is easy to evolve a static regulatory system, is it not also easier than previously believed to modify regulatory connections in a complex regulatory system? There are hints of such changes here, in that genes that may have gained novel SP1 regulation (that is, gained a GC-box when they did not have the ancestral GA-box) tend to be enriched in certain GO categories (see Table 2 in the paper).

For SP1, it will be interesting to see if good stories can be developed to explain why this adaptation should have occurred specifically in birds and eutherian mammals. The ideal story should include both a biophysical mechanism, and a physiology-based mechanism, such as the possibility that warm-bloodedness played a role. Both of these avenues promise to be complicated, if addressed properly. For example, we believe that it will be more meaningful if a biophysical mechanism can address the need for specificity as well as strength of binding, perhaps by utilizing next-generation sequencing approaches to measure affinities for all relevant binding site mutations (see, among others, our recent paper on this topic, Pollock et al., 2011). Are there interactive roles for selection on transcription factor concentration as well as efficiency and selectivity? What trade-offs exist among binding efficiency and binding site duplication? Do different types of regulatory connections evolve differently? These are all great questions for future research.

Addendum

I'll try to add further comments if questions or issues come up. I'm particularly interested to see how this non-press release guest blog post works as an experiment to promote the paper and the work. I also hope it will promote Ken Yokoyama's career (he's now at Illinois, and will probably be looking for an academic job in the next year or two). He did an awesomely diverse amount of work on this, learning how to work in totally new areas for him, such as population genetics, birth/death models, and protein structure prediction. He dove into these areas unhesitatingly to pursue the logical scientific questions, developed novel analyses, and did a great job. This paper represents a fundamental contribution and a fantastic advertisement for Ken's abilities.