Admixture of stochastic dictionaries for modelling regulatory regions

Functional selection acting on regulatory regions such as cis-regulatory modules (CRMs) causes differential enrichment of nucleotide content in these regions across evolutionarily related organisms. The exact impact of such selection on gene regulatory mechanisms is not yet clearly known, but one important characteristic of CRMs in higher organisms is that they are often multi-functional: under different conditions and at different times, the same CRM sequence can drive different regulatory functions by recruiting different combinations of transcription regulatory proteins. Existing models of transcription factor binding sites (TFBSs), such as position weight matrices (PWMs) or single dictionaries of oligomers, cannot capture this multi-functionality and offer no insight into the evolutionary mechanism behind the phenomenon. In this paper, we develop a novel Admixture of Stochastic Dictionaries (ASD) model for CRMs and the motifs therein, which succinctly extracts and exposes the sequence-compositional basis of such multi-functionality.

We have developed algorithms for learning the Admixture of Stochastic Dictionaries within one organism and across multiple evolutionarily related organisms. These algorithms allow us to examine the multi-functionality of CRMs and the way it evolves, by analyzing the extent of change of every functionality-specific dictionary in the ASD models across organisms. We show that the learned component dictionaries in our model are indeed functionally discriminative and can be used for predicting regulatory regions. We further show that this discriminative power is based on their TF binding affinity scores. We find that corresponding functionality-specific dictionaries across species have similar (but non-identical) distributions over oligomers, such that regulatory information from one species can be used to predict regulatory regions in other species. We conclude that our model is easy to estimate and interpret, and that it serves both as a good platform for modeling functional evolution of the regulatory genome and as a useful tool for identifying regulatory function based on these properties.
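As a rough illustration of the admixture idea, the sketch below treats each dictionary as a fixed probability distribution over oligomers and estimates, for a single sequence, how much weight each dictionary receives. It is a minimal toy under stated assumptions, not the ASD learning algorithm itself: the function names, the 2-mer dictionaries, their probabilities, and the `floor` smoothing value are all hypothetical, and the dictionaries are held fixed rather than learned.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count overlapping k-mers (oligomers) in a DNA sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def estimate_mixing_weights(counts, dictionaries, n_iter=50, floor=1e-6):
    """EM estimate of admixture weights for one sequence.

    counts:       Counter mapping k-mer -> number of occurrences.
    dictionaries: list of dicts mapping k-mer -> probability
                  (hypothetical toy values, held fixed here).
    floor:        small probability used for k-mers absent from a
                  dictionary (an assumption to avoid division by zero).
    Returns one weight per dictionary; the weights sum to 1.
    """
    n_dict = len(dictionaries)
    theta = [1.0 / n_dict] * n_dict
    total = sum(counts.values())
    for _ in range(n_iter):
        expected = [0.0] * n_dict
        for word, c in counts.items():
            # E-step: responsibility of each dictionary for this k-mer.
            joint = [theta[d] * dictionaries[d].get(word, floor)
                     for d in range(n_dict)]
            z = sum(joint)
            for d in range(n_dict):
                expected[d] += c * joint[d] / z
        # M-step: renormalise the expected counts into weights.
        theta = [e / total for e in expected]
    return theta


# Two toy dictionaries over 2-mers: one AT-rich, one GC-rich.
dict_at = {"AT": 0.4, "TA": 0.4, "AA": 0.1, "TT": 0.1}
dict_gc = {"GC": 0.4, "CG": 0.4, "GG": 0.1, "CC": 0.1}

theta_at = estimate_mixing_weights(kmer_counts("ATATATAT", 2), [dict_at, dict_gc])
theta_mixed = estimate_mixing_weights(kmer_counts("ATATGCGC", 2), [dict_at, dict_gc])
print(theta_at, theta_mixed)
```

In this toy setting, a purely AT-rich sequence puts almost all of its weight on the first dictionary, while a sequence that is half AT-rich and half GC-rich splits its weight between the two. That per-sequence mixing is what a single dictionary cannot express.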

Motivation: Identifying transcription factor binding sites (TFBSs) that encode complex regulatory signals in metazoan genomes remains a challenging problem in computational genomics. Because of the degeneracy of nucleotide content among binding site instances (motifs) and the intricate "grammatical organization" of motifs within cis-regulatory modules, existing in-silico motif search methods based on pattern matching often suffer from impractically high false positive rates, especially when analyzing large genomic datasets with noisy position weight matrices characterizing the binding sites. Here, we try to address this problem with a framework that maximally utilizes the information content of the genomic DNA in the query region, taking cues from various biologically meaningful genetic and epigenetic factors in that region, such as clade-specific evolutionary parameters, the presence or absence of nearby coding regions, etc. We present a new method for TFBS prediction in metazoan genomes which utilizes both the cis-regulatory module (CRM) architecture of sequences and a variety of features of individual motifs. Our proposed approach is based on a discriminative probabilistic model known as conditional random fields, which explicitly optimizes the predictive probability of motif presence in large sequences based on the joint effect of all such features.

Results: This model overcomes weaknesses in earlier methods based on less effective statistical formalisms that are sensitive to spurious signals in the data. We evaluate our method on both simulated CRMs and real Drosophila sequences in comparison with a wide spectrum of existing models, and it outperforms the state of the art by 22% in F1-score.
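To make the false-positive problem of pattern-matching search concrete, the sketch below scans a sequence with a position weight matrix using standard log-odds scoring against a uniform background. It illustrates the kind of baseline the Motivation paragraph refers to, not the conditional random field method itself. The 3-column matrix, the toy sequence, and both thresholds are hypothetical values chosen for illustration.

```python
import math


def log_odds_matrix(pwm, background=0.25):
    """Convert per-column base probabilities into log2-odds scores."""
    return [{b: math.log2(col[b] / background) for b in "ACGT"} for col in pwm]


def scan(seq, pwm, threshold):
    """Return (position, score) for every window scoring >= threshold."""
    lod = log_odds_matrix(pwm)
    width = len(lod)
    hits = []
    for i in range(len(seq) - width + 1):
        score = sum(lod[j][seq[i + j]] for j in range(width))
        if score >= threshold:
            hits.append((i, score))
    return hits


# Hypothetical PWM whose consensus is "ACG".
toy_pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
toy_seq = "TTACGTTAAGTT"

strict_hits = scan(toy_seq, toy_pwm, threshold=4.0)
lenient_hits = scan(toy_seq, toy_pwm, threshold=1.0)
print([pos for pos, _ in strict_hits], [pos for pos, _ in lenient_hits])
```

With the strict threshold only the exact consensus match is reported. Lowering the threshold to tolerate degenerate sites also admits the one-mismatch window "AAG", and on genome-scale input such near matches are what drive the false positive rate up.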

CSMET: Conditional Shadowing via Multi-resolution Evolutionary Trees

Functional turnover of transcription factor binding sites (TFBSs), such as whole-motif loss or gain, is a common event during genome evolution. Conventional probabilistic phylogenetic shadowing methods model the evolution of genomes only at the nucleotide level, and lack the ability to capture the evolutionary dynamics of functional turnover of aligned sequence entities. As a result, comparative genomic search for non-conserved motifs across evolutionarily related taxa remains a difficult challenge, especially in higher eukaryotes, where the cis-regulatory regions containing motifs can be long and divergent. Existing methods rely heavily on specialized pattern-driven heuristic search or sampling algorithms, which can be difficult to generalize and hard to interpret based on phylogenetic principles. We propose a new method, Conditional Shadowing via Multi-resolution Evolutionary Trees, or CSMET, which uses a context-dependent probabilistic graphical model that allows aligned sites from different taxa in a multiple alignment to be modeled by either a background or an appropriate motif phylogeny, conditioned on the functional specifications of each taxon. The functional specifications themselves are the output of a phylogeny which models the evolution not of individual nucleotides, but of the overall functionality (e.g., functional retention or loss) of the aligned sequence segments over lineages. By combining this method with a hidden Markov model that autocorrelates evolutionary rates on successive sites in the genome, CSMET offers a principled way to take lineage-specific evolution of TFBSs into consideration during motif detection, and a readily computable analytical form of the posterior distribution of motifs under TFBS turnover. On both simulated and real Drosophila cis-regulatory modules, CSMET outperforms other state-of-the-art comparative genomic motif finders.

The transcriptional regulatory sequences in metazoan genomes often consist of multiple cis-regulatory modules (CRMs). Each CRM contains locally enriched occurrences of binding sites (motifs) for a certain array of regulatory proteins, and is capable of integrating, amplifying or attenuating multiple regulatory signals via combinatorial interaction with these proteins. The architecture of CRM organization is reminiscent of the grammatical rules underlying a natural language, and presents a particular challenge to computational motif and CRM identification in metazoan genomes. In this paper, we present BayCis, a Bayesian hierarchical HMM that attempts to capture the stochastic syntactic rules of CRM organization. Under the BayCis model, all candidate sites are evaluated based on a posterior probability measure that takes into consideration their similarity to known binding sites, their contrast against the local genomic context, their first-order dependencies on upstream sequence elements, and priors reflecting general knowledge of CRM structure. We compare our approach to five existing methods for the discovery of CRMs, and demonstrate competitive or superior prediction results evaluated against experimentally based annotations on a comprehensive selection of Drosophila regulatory regions.
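The idea of scoring candidate sites by a posterior probability can be sketched with a much simpler model than the one described above: a flat two-state HMM (background versus site) decoded with the forward-backward algorithm. This is a deliberately reduced stand-in, not the BayCis hierarchical model. The state names, emission and transition probabilities, and the toy sequence are all hypothetical.

```python
def posterior_states(seq, emit, trans, init):
    """Per-position posterior state probabilities via scaled forward-backward.

    emit[s][base], trans[r][s] and init[s] are plain probability tables.
    Returns a list with one {state: posterior} dict per position.
    """
    states = list(init)
    n = len(seq)

    # Forward pass; each row is rescaled to sum to 1 for numerical stability.
    fwd, scales, prev = [], [], None
    for t, base in enumerate(seq):
        row = {}
        for s in states:
            prior = init[s] if t == 0 else sum(prev[r] * trans[r][s] for r in states)
            row[s] = prior * emit[s][base]
        z = sum(row.values())
        prev = {s: v / z for s, v in row.items()}
        fwd.append(prev)
        scales.append(z)

    # Backward pass, reusing the forward scaling factors.
    bwd = [None] * n
    bwd[n - 1] = {s: 1.0 for s in states}
    for t in range(n - 2, -1, -1):
        nxt, base = bwd[t + 1], seq[t + 1]
        bwd[t] = {
            s: sum(trans[s][r] * emit[r][base] * nxt[r] for r in states) / scales[t + 1]
            for s in states
        }

    # Combine and normalise to obtain the posterior at each position.
    post = []
    for t in range(n):
        row = {s: fwd[t][s] * bwd[t][s] for s in states}
        z = sum(row.values())
        post.append({s: v / z for s, v in row.items()})
    return post


# Hypothetical parameters: uniform background, GC-rich "site" state.
emit = {
    "background": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "site": {"A": 0.05, "C": 0.45, "G": 0.45, "T": 0.05},
}
trans = {
    "background": {"background": 0.9, "site": 0.1},
    "site": {"background": 0.2, "site": 0.8},
}
init = {"background": 0.9, "site": 0.1}

post = posterior_states("ATATGCGCGCATAT", emit, trans, init)
print([round(p["site"], 3) for p in post])
```

The posterior for the site state rises over the GC-rich stretch and falls in the flanking AT-rich sequence. This shows how local context and transition structure, not just a per-window match score, shape the score of each position.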