The regulation of transcription is central to the proper functioning of all cells. Identifying the DNA binding sites for all transcription factors (TFs) would greatly facilitate our understanding of regulatory networks and of the variations in gene expression, in both normal and disease states, that accompany genetic differences. New high-throughput technologies are generating data about the DNA binding specificity of transcription factors at a greatly increased rate, but good computational methods are required to maximize the biological information extracted from those data. In the previous funding period we developed new and improved methods for the analysis of three different types of high-throughput specificity data. In this proposal we will expand on those methods in several ways, including methods for analyzing additional types of data and the development of more complex models that are required to adequately represent the specificity of some factors. More complex models are needed for TFs whose specificity is not well represented by position weight matrices (PWMs), which impose the constraint that the positions within the binding site contribute independently to binding. We will develop models for TFs that allow for higher-order interactions, as well as for TFs that can bind in alternative modes and require multiple, independent models to represent them. The improved models will be compared with in vivo location analysis for TFs to better assess which binding sites are indirect or require cooperative binding with other factors. We will also take advantage of the greatly increased data to develop improved recognition models that can predict the specificity of TFs from their protein sequence and aid in the design of new factors with novel specificity. This will be done initially for homeodomain and zinc finger proteins, the two largest families of TFs in eukaryotic genomes and the ones with the most available specificity information.
We will also take advantage of the vast information available for bacterial genomes to develop specificity models for various bacterial TF families. A new experimental method will be employed to more comprehensively assess the non-independent interactions between protein residues and binding site base-pairs, which should lead to further improvements in recognition modeling. We will continue collaborating with experimental biologists, which helps them use our programs to further their research goals and helps us identify the limitations of the current methods and foster improvements. We also have a new collaboration that seeks to improve methods for predicting specificity in protein-DNA interactions based on molecular modeling, combining our collaborators' expertise in thermodynamic and structural modeling with our extensive models of TF binding specificity.
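
To illustrate the independence constraint that PWMs impose, the following is a minimal Python sketch, not the methods developed in this project. It assumes a log-odds PWM estimated from a small set of made-up aligned sites with a uniform background, and shows that the PWM score of a site is simply the sum of per-position terms. The function names (`pwm_from_sites`, `pwm_score`, `pairwise_score`) and the `pair_terms` correction table are hypothetical, introduced only to show one simple way a model could add a non-independent contribution between two positions.

```python
import math

BASES = "ACGT"

def pwm_from_sites(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM (one dict per position) from aligned, equal-length sites."""
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2(counts[b] / total / background) for b in BASES})
    return pwm

def pwm_score(pwm, seq):
    """Additive score: every position contributes independently of the others."""
    return sum(column[base] for column, base in zip(pwm, seq))

def pairwise_score(pwm, pair_terms, seq):
    """PWM score plus hypothetical correction terms for specific base pairs.

    pair_terms maps (i, j, base_i, base_j) to an extra score that applies only
    when both positions carry the named bases, so it cannot be expressed as a
    sum of single-position terms.
    """
    extra = sum(
        value
        for (i, j, base_i, base_j), value in pair_terms.items()
        if seq[i] == base_i and seq[j] == base_j
    )
    return pwm_score(pwm, seq) + extra

# Made-up example sites, for illustration only.
sites = ["TAATTA", "TAATTG", "TAATGA", "CAATTA"]
pwm = pwm_from_sites(sites)
pair_terms = {(0, 5, "T", "A"): 0.5}  # hypothetical interaction between positions 0 and 5

print(pwm_score(pwm, "TAATTA"))
print(pairwise_score(pwm, pair_terms, "TAATTA"))
```

In this sketch the pairwise term changes the score only for sites with T at the first position and A at the last, which is the kind of dependence a PWM alone cannot capture.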

Public Health Relevance

Transcription factors control the expression of genes and are essential to the proper functioning of cells. Identifying the DNA sequences that they bind can lead to a better understanding of the normal regulatory network and of how it is altered by genetic variation and disease. Recent technological advances have greatly increased the amount of data on transcription factor binding sites, but good computer programs are required to maximize the biological information obtained from those experiments. We are developing improved computational methods to extract the most important information from high-throughput experiments, with the goal of enhancing our understanding and modeling of the normal control of gene expression and its variation. We are also using that information to help design novel transcription factors with desired characteristics.