High-resolution models of transcription factor-DNA affinities
improve in vitro and in vivo binding predictions

Abstract

Accurately modeling the DNA sequence preferences of transcription
factors (TFs), and using these models to predict in vivo genomic
binding sites for TFs, are key pieces in deciphering the regulatory
code. These efforts have been frustrated by the limited availability
and accuracy of TF binding site motifs, usually represented as
position-specific scoring matrices (PSSMs), which may match large
numbers of sites and produce an unreliable list of target
genes. Recently, protein binding microarray (PBM) experiments have
emerged as a new source of high-resolution data on in vitro TF binding
specificities. PBM data has been analyzed either by estimating PSSMs
or via rank statistics on probe intensities, so that individual
sequence patterns are assigned enrichment scores (E-scores). This
representation is informative but unwieldy because every TF is
assigned a list of thousands of scored sequence patterns. Meanwhile,
high-resolution in vivo TF occupancy data from ChIP-seq experiments is
also increasingly available. We have developed a flexible
discriminative framework for learning TF binding preferences from
high-resolution in vitro and in vivo data. We first trained support vector
regression (SVR) models on PBM data to learn the mapping from probe
sequences to binding intensities. We used a novel k-mer based string
kernel called the di-mismatch kernel to represent probe sequence
similarities. The SVR models are more compact than E-scores, more
expressive than PSSMs, and can be readily used to scan genomic
regions to predict in vivo occupancy. Using a large data set of yeast
and mouse TFs, we found that our SVR models can better predict probe
intensity than the E-score method or PBM-derived PSSMs. Moreover, by
using SVRs to score yeast, mouse, and human genomic regions, we were
better able to predict genomic occupancy as measured by ChIP-chip and
ChIP-seq experiments. Finally, we found that by training kernel-based
models directly on ChIP-seq data, we greatly improved in vivo
occupancy prediction, and by comparing a TF's in vitro and in vivo
models, we could identify cofactors and disambiguate direct and
indirect binding.
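
To make the idea of a k-mer based string kernel and of scanning genomic regions concrete, the following is a minimal sketch only. It uses the plain spectrum kernel (the inner product of k-mer count vectors) as a simplified stand-in; it is not the di-mismatch kernel, which the abstract does not define, and it does not train an SVR. The function names kmer_counts, spectrum_kernel, and scan_region, and the toy sequences, are hypothetical and chosen for illustration.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a, b, k):
    """Inner product of the k-mer count vectors of two sequences.

    Assumption: this is a simplified stand-in for a k-mer string kernel,
    not the di-mismatch kernel named in the abstract.
    """
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    return sum(n * cb[kmer] for kmer, n in ca.items())


def scan_region(region, window, score):
    """Score every window of a region and return (best_score, offset).

    `score` is any callable mapping a sequence to a number; in a real
    pipeline it would be a trained model's prediction function.
    """
    scored = [
        (score(region[i:i + window]), i)
        for i in range(len(region) - window + 1)
    ]
    return max(scored)


# Toy usage: score windows by similarity to a hypothetical site "ACGT".
best = scan_region("TTACGTTT", 4, lambda w: spectrum_kernel(w, "ACGT", 2))
```

In this toy run the highest-scoring window starts at offset 2, where the region contains the site itself. A trained regression model would replace the lambda with its own scoring function, and the per-window scores (their maximum, for example) would serve as the predicted occupancy of the region.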