The earlier article dealt with several experimentally-confirmed functional interactions determined in Escherichia coli: genes in operons, genes whose products physically interact, genes regulated by the same transcription factor (regulons), and genes coding for transcription factors and their regulated genes. In that study we found that the associations involving transcription factors tend to be much less conserved than any of the other associations studied. Our work is not the first to suggest this lack of conservation, but is the first to compare conservation across different kinds of associations, and thus show that those mediated by transcriptional regulation are the least conserved.

The most recent article was an expansion of the association between genes coding for transcription factors and other genes. The idea was to extend the study to as many other prokaryotes as possible. But how could we determine conservation between genes coding for transcription factors and other genes without experimentally-determined interactions? We knew that at least some transcription factors could be predicted from their possessing a DNA binding domain. But what about their associations? Our prior experience has been that target genes are hard to predict even when there’s information on some characterized binding sites (sites that we like calling operators for tradition’s sake). So what to do if we have only the transcription factors? Well, to answer that we should first explain how we measured relative evolutionary conservation.

To measure evolutionary conservation we used a measure of co-occurrence called mutual information. For any two genes, the higher the mutual information, the less the observed co-occurrence looks random. Since we obtained mutual information scores for all gene pairs in the genomes we analyzed, we decided that instead of something as hard as predicting operators, and matching them to predicted transcription factors, we could use top scoring gene pairs as representatives of the most conserved interactions between our predicted transcription factors and anything else. This allowed us to compare the most conserved interactions involving transcription factors against the conservation of other interactions. Our findings suggest that interactions involving transcription factors evolve quickly in most-if-not-all of the genomes analyzed.
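To make the idea of mutual information a bit more concrete, here is a minimal sketch in Python. It assumes that each gene is represented by a presence/absence profile across a set of genomes (1 if the gene is found in that genome, 0 if not). The function name and the toy profiles are made up for illustration, and this is not necessarily how the scores in the article were computed.

```python
from collections import Counter
from math import log2


def mutual_information(profile_a, profile_b):
    """Mutual information (in bits) between two presence/absence profiles.

    Assumption: each profile is a sequence of 0/1 values, one per genome,
    in the same genome order for both genes.
    """
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genomes")
    n = len(profile_a)
    joint = Counter(zip(profile_a, profile_b))  # counts of (a, b) pairs
    count_a = Counter(profile_a)                # marginal counts for gene A
    count_b = Counter(profile_b)                # marginal counts for gene B
    mi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi


# Toy profiles over four genomes (hypothetical data).
gene_x = [1, 1, 0, 0]
gene_y = [1, 1, 0, 0]  # always co-occurs with gene_x
gene_z = [1, 0, 1, 0]  # presence unrelated to gene_x

mi_xy = mutual_information(gene_x, gene_y)
mi_xz = mutual_information(gene_x, gene_z)
```

In this toy case the pair that always co-occurs scores higher than the pair whose presence looks unrelated, which is the property we relied on when picking top scoring gene pairs.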

N. Ward, G. Moreno-Hagelsieb, Quickly Finding Orthologs as Reciprocal Best Hits with BLAT, LAST, and UBLAST: How Much Do We Miss? PLoS ONE 9, e101850 (2014).

The story goes as follows. At a talk by some group I heard that they were using UBLAST to quickly find members of some protein families rather than use a Hidden Markov Model approach. They said it was much faster, so I became curious. I downloaded USEARCH 5 back then to try and test it for the things I commonly do with NCBI’s BLAST. I was surprised at how fast this program ran. In any event, I thought that testing this program for some task would be good work for an undergrad student. That became Natalie’s undergrad thesis. Back then it was about using different options under USEARCH to try and get as much coverage with UBLAST as with NCBI’s BLAST (UBLAST was not an option in USEARCH 5; rather, a local alignment search had to be done). We became more ambitious, and decided to test a few more programs. BLAT was something I was already playing with, while an article by Jonathan Eisen (Darling et al., PhyloSift: phylogenetic analysis of genomes and metagenomes. PeerJ 2, e243. 2014) pointed me in LAST’s direction (besides reviewers asking for more programs to be tested).

Later on, at some other talk (I think it was a talk by Robert Beiko), the speaker mentioned something about BLAST being too slow for some task, and I asked him why not try UBLAST. He said something to the effect of not knowing how much they might miss.

The articles we published cover one task each. One is the task of finding orthologs as reciprocal best hits. Pretty straightforward: how many orthologs are found by each program when compared to BLAST? Essentially, finding orthologs as reciprocal best hits does not require finding every possible match. Top matches would be enough. So, if UBLAST, for example, found just a few top matches (under version 5, we could control the number of matches found before the program stops looking), that would be enough to determine the best, and thus figure out reciprocal best hits. We thought we might miss many matches, but still find most of the reciprocal best hits, and that’s what we found to be the case except between evolutionarily distant genomes (see second reference above).
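For those who have not worked with reciprocal best hits, the logic can be sketched in a few lines of Python. This is a simplified illustration, not the pipeline from the article: it assumes the hits were already parsed into (query, subject, score) tuples, it picks the best hit by score alone, and the function names and toy hits are made up.

```python
def best_hits(hits):
    """Return {query: best subject}, keeping the highest-scoring hit per query.

    Assumption: `hits` is an iterable of (query, subject, score) tuples,
    for example parsed from the tabular output of a search program.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is the best hit of a, and a is the best hit of b."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Toy hits between genome A and genome B (hypothetical identifiers and scores).
a_vs_b = [("a1", "b1", 200), ("a1", "b2", 150), ("a2", "b2", 90), ("a3", "b1", 80)]
b_vs_a = [("b1", "a1", 198), ("b1", "a3", 75), ("b2", "a2", 95), ("b2", "a1", 60)]

orthologs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

Note that only the top hit per query matters here, which is why a program that stops after a few top matches can still recover most reciprocal best hits.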

For the test on overannotation, the main idea was that for that task we compare proportions, not total number of matches. Thus, if UBLAST, LAST, and BLAT missed potential homologs, but still found equivalent proportions to those found by NCBI’s BLAST, then the programs would work fine for estimating overannotation. Well, that’s what we found.

Finally, why democratic genomics? Well, with tools that can run sequence comparisons in a fraction of the time that BLAST takes, and on a desktop computer at that, comparative genomics at a much larger scale becomes available to most if not all bioinformaticians. Why would I care? Well, because the more people can participate, the higher the number of ideas that can make it into the field. Not everybody has access to computer clusters. There are other avenues towards this democracy, like the availability of some precomputed homologies and orthologies. Yet, people will want to do their own tests for many reasons, from doubting the quality of existing data, to testing genomes and protein sequences not already available in databases. Maybe there’s also a good chance that genome and protein comparisons will be done via cloud computing, and be quite accessible to mere mortals. Maybe web-based tools like RAST and MG-RAST are good enough for these tasks instead of having our own thing. I don’t know. For now I think that the more options the better. These two articles are not enough. Strategies should also be developed to avoid wasting time and effort comparing sequences. As we develop our ideas and test programs, we will publish our results either in articles, or, if not enough for a publication proper, in blog entries.

Since Julie was leaving on Saturday, those present in the lab last Thursday had lunch together.

Julie is a PhD student co-supervised by me and Dr. Santoyo. She came from Mexico for a few months to learn some bioinformatics that she will apply to her PhD project on the rhizospheric microbiome associated with a few crops.

Marc presented his thesis defense last Wednesday (Oct 30). All is well. Some corrections to make, but that’s that. Anyway, the photo presents the undergrad force of the lab of Computational conSequences (Brigitte, Erum, and Thomas), plus Marc. Taken that very day.

Several members of The Lab of Computational conSequences went to the Canadian Society of Microbiologists conference in Ottawa last week: Lisa, Jenny, Marc, Scott, and honorary members Mike Lynch and Laura (Lisa’s sister). All of them presented posters, Jenny gave her first talk at a scientific conference, and Mike gave a talk that I missed on exploring “the rare biosphere” (your homework to figure out what that means).

Posters were successful. Marc, who is working on the evolution of regulation of transcription by, ahem, transcription factors, had lots of visitors. The twins (Lisa and Laura) presented work on the gene cluster for cellulose biosynthesis in Bacteria, Jenny talked about 16S rRNA genes, and Scott presented a bit about phage and horizontal gene transfer.

We shall talk about these projects some time soon. We are preparing several articles and will post something about them as they are finished and submitted.