Advances in genomics allow researchers to measure the complete set of
transcripts in cells. These transcripts include messenger RNAs (which
encode proteins) and microRNAs, short RNAs that play an important
regulatory role in cellular networks. While this data is a great resource
for reconstructing the activity of networks in cells, it also presents
several computational challenges. These challenges include handling the
incomplete and noisy measurements that the data collection stage often
produces, developing methods to integrate several experiments within and
across species, and designing methods that can use this data to map the
interactions and networks that are activated in specific conditions. Novel
and efficient
algorithms are required to successfully address these challenges.

In this thesis, we present probabilistic models to address the set of
challenges associated with expression data. First, we present a novel
probabilistic error correction method for RNA-Seq reads. RNA-Seq generates
large and comprehensive datasets that have revolutionized our ability to
accurately recover the set of transcripts in cells. However, sequencing
reads inevitably contain errors, which affect all downstream analyses. To
address this problem, we develop an efficient hidden Markov model-based
error correction method for RNA-Seq data. Second, for the analysis of
expression data across species, we develop clustering and distance function
learning methods for querying large expression databases. The methods use
a Dirichlet Process Mixture Model with latent matchings and infer soft
assignments between genes in two species to allow comparison and clustering
across species. Third, we introduce new probabilistic models to integrate
expression and interaction data in order to predict targets and networks
regulated by microRNAs.

Together, the methods developed in this thesis cover the main stages of the
expression analysis pipeline that experimentalists follow when performing
expression experiments.