New Project Ideas

Predictive models for gene regulation

My research group has recently developed a method for learning
predictive models for gene regulation from gene expression data
and regulatory sequence data in simple organisms like yeast.
By "predictive", we mean that we
learn to predict which genes will be up- or down-regulated under
different experimental conditions. The method uses boosting, a
classification algorithm from machine learning, with alternating
decision trees to represent the learned predictive model. There
are many directions for extension or application to other simple
organisms. This work is to appear in ISMB 2004 (the largest
international conference on computational biology).

Integrating data types for learning models of regulation

Combining multiple sources of data -- such as sequence data from
promoter regions containing transcription factor binding sites,
gene expression data, and binding localization data --
in a learning approach can lead to improved understanding of transcriptional
regulation. The REDUCE paper was presented earlier in the semester in
lecture. The reference from
Daphne Koller's lab is technically more difficult, but there are interesting
ideas -- you may be able to try a simpler model. We have also done
work on joint clustering models in our research group -- see me for
additional ideas and references.

Predicting Protein-Protein Interactions

Learning to predict which pairs of proteins will interact is
an important but difficult new problem. There are new high-throughput
techniques like yeast two-hybrid screens for detecting pairwise
protein interactions, but these assays are notoriously noisy -- that is,
the + and - labels (for interaction and non-interaction) are uncertain. Some
recent efforts have focused on combining different kinds of evidence
for supervised learning (see Janssen reference below) or incorporating
protein motif data (unpublished, but see abstract referenced below).
A possible new approach for this problem would be to avoid using labels
and instead weigh different sources of evidence for consistency in order
to predict interactions. Dr. Phil Long (at Columbia's
CLASS research center) and collaborators have developed a machine learning
technique for weighing evidence that could be applied to this problem.
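To make the label-free idea concrete, here is one very simple (and
purely illustrative) heuristic -- not Dr. Long's actual technique:
weight each evidence source by how often it agrees with the consensus
of the other sources, then combine votes with those weights.

```python
# Hypothetical sketch of label-free evidence weighting. Each source
# votes +1/-1 on every candidate protein pair; a source's weight is
# its agreement rate with the unweighted consensus of the others.

def consensus_weights(votes):
    """votes[s][i] = +1/-1 vote of source s on candidate pair i."""
    n_sources = len(votes)
    n_pairs = len(votes[0])
    weights = []
    for s in range(n_sources):
        agree = 0
        for i in range(n_pairs):
            others = sum(votes[t][i] for t in range(n_sources) if t != s)
            if others != 0 and votes[s][i] == (1 if others > 0 else -1):
                agree += 1
        weights.append(agree / n_pairs)
    return weights

def predict_pair(votes, weights, i):
    # weighted vote: positive score -> predicted interaction
    score = sum(w * votes[s][i] for s, w in enumerate(weights))
    return 1 if score >= 0 else -1
```

A noisy assay ends up down-weighted automatically, without ever seeing
a trusted + or - label.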

Inferring Regulatory Networks from Expression Data

A new and exciting area of computational biology is the problem of
inferring regulatory networks in the cell from gene expression data.
The theory of Bayes nets (or "graphical models") -- a probabilistic
generative model that describes a joint probability distribution
for an acyclic network of random variables -- provides a framework
for learning such networks. Unfortunately, structure learning in
graphical models is quite involved, and limited and noisy data make the
inference problem difficult; currently, only a handful of groups
in the world have strong expertise in this area, and they use their
own internal (unavailable) software for computations.

Given the advanced nature of the model and the importance of the
problem, it would be a worthwhile project simply to try to implement
a Matlab prototype of the learning algorithms discussed in one of
the references below and to try to reproduce the results on the
(publicly available) datasets that these papers use. As a starting
point, download and study Kevin Murphy's Bayes Net Toolbox for Matlab --
he has code to set up graphical models, learn parameters from data,
and even do a few types of structure learning.

The easier project would be to reproduce results in the Hartemink paper:
set up a set of candidate models for the small network that they
consider and calculate the "Bayesian" score
for each model to try to rank the candidates. Ideally, you would find
a second biological example on which to validate this method.
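As a sketch of what the scoring step involves, the following computes
the BIC approximation to the Bayesian score for candidate structures
on binary data. (The Hartemink paper's actual score adds a Dirichlet
prior; the variable coding here is illustrative.)

```python
# Score candidate network structures on binary data with the BIC
# approximation to the Bayesian score: log-likelihood of the data
# under each structure, minus a complexity penalty per free parameter.
import math
from collections import Counter

def bic_score(data, structure):
    """data: list of dicts {var: 0/1}; structure: {var: [parent vars]}."""
    n = len(data)
    score = 0.0
    for var, parents in structure.items():
        # count joint configurations of (parent values, var value)
        counts = Counter()
        for row in data:
            key = tuple(row[p] for p in parents)
            counts[(key, row[var])] += 1
        for key in {k for k, _ in counts}:
            total = counts[(key, 0)] + counts[(key, 1)]
            for v in (0, 1):
                c = counts[(key, v)]
                if c > 0:
                    score += c * math.log(c / total)
        # one free parameter per parent configuration (binary variables)
        score -= 0.5 * math.log(n) * (2 ** len(parents))
    return score
```

Ranking candidates is then just evaluating this score on each
structure and sorting; a structure that captures a real dependence
beats the independent model despite its extra parameters.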

A more involved project would be to model "interventions" as discussed
in the Pe'er paper for dealing with knock-out data. You would want
to try to set up the bootstrapping (sampling) process to calculate
confidence scores for small features of the network and see if you
can validate the high-confidence features that the authors obtained
for the mating response and/or ergosterol cycles.
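The bootstrap machinery itself is simple to set up; the sketch below
resamples the dataset, relearns a structure on each resample via a
placeholder `learn_structure` routine (standing in for whatever
structure learner you implement), and reports how often each directed
edge appears.

```python
# Bootstrap confidence for network features: resample rows with
# replacement, relearn the structure each time, and report the
# fraction of resamples in which each edge appears.
import random

def edge_confidence(data, learn_structure, n_boot=100, seed=0):
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_boot):
        sample = [rng.choice(data) for _ in data]   # resample rows
        for edge in learn_structure(sample):        # e.g. set of (a, b)
            counts[edge] = counts.get(edge, 0) + 1
    return {e: c / n_boot for e, c in counts.items()}
```

Features with confidence near 1.0 are the ones worth trying to
validate biologically, as the Pe'er paper does.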

A newer paper (the Minreg paper of Pe'er et al.) used prior knowledge of
transcription factors in yeast to learn a simpler network structure.

Learning rankings for protein sequence searches

Algorithms like PSI-BLAST use pairwise similarity measures to produce
a ranking of protein sequences from a database relative to a sequence
query. Sequences near the top of the list (the top "hits") are most
likely to be homologs of the query.
We have recently developed a graph-based algorithm called RankProp
for learning to improve the ranking of protein sequences
returned by algorithms
like PSI-BLAST. RankProp defines a graph on the space of all protein
sequences, where edge weights are derived from pairwise similarity
scores, and uses this global structure to learn an improved ranking.
This work is to appear in PNAS. Due to PNAS rules, I cannot post
a preprint online, but see me for more information and project ideas.
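The propagation idea is easy to convey in a few lines. The sketch
below is schematic, not the exact RankProp update (the normalization
and the value of alpha are illustrative): start from the query's
direct similarity scores and iteratively diffuse them over the
similarity graph, so that sequences connected to many strong hits
move up the ranking.

```python
# Schematic graph-based rank propagation over a similarity graph.
def rank_propagate(W, y0, alpha=0.8, iters=50):
    """W[i][j]: symmetric similarity weights; y0: initial query scores."""
    n = len(y0)
    # row-normalize the weight matrix
    P = []
    for row in W:
        s = sum(row) or 1.0
        P.append([v / s for v in row])
    y = y0[:]
    for _ in range(iters):
        # keep an anchor on the original scores, diffuse the rest
        y = [(1 - alpha) * y0[i] +
             alpha * sum(P[j][i] * y[j] for j in range(n))
             for i in range(n)]
    return y
```

Sorting the final scores gives the improved ranking: a sequence with
no direct similarity to the query can still rank highly if it sits in
a dense neighborhood of strong hits.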

Contacts:

Rui (Ray) Kuang (PhD student), Dr. Jason Weston, Dr. Christina Leslie

Inference from Single Nucleotide Polymorphism Data

The human genome project has led to considerable progress in understanding
and characterizing variation in the human genome. A dense collection of
sequence variants (i.e., genetic markers) has been mapped across the
genome, which will aid researchers in identifying disease-causing sequence
variants. Most stable variation in the genome occurs in the form of
single-nucleotide polymorphisms (SNPs), which represent about 90% of the
common variation in the genome. This variation
arises through a single mutation event in the history of the human
population. The likelihood of recurrent mutation at the same site is low.
Consequently, SNPs are stable genetic markers.

The extensive repository of these SNP markers provides a tool for
discovering the genetic basis of common complex diseases (due to multiple
interacting genes and the environment). The approach involves typing a
large number of SNP markers, in case-control samples, across a set of
candidate genes thought to be functionally significant in the disease
of interest. The expectation is that SNPs associated with the
disease would have a different profile in the case vs. the control sample.

There are a number of possible learning-based approaches to this problem.
One could view the SNPs as (typically binary-valued) features for a
multi-class classification problem (the classes correspond to phenotypes
or diseases), and one could use standard supervised learning techniques to
train a classifier. More meaningful to a medical researcher or biologist,
perhaps, would be to develop a probabilistic model for this data that is
somewhat more involved than the one implicitly used by the population
geneticists -- for example, a graphical model that would allow for
interaction between SNPs in different candidate genes, producing SNP-based
network configurations that distinguish the case and control populations.
Any interesting learning approach applied to this data would be a
novel contribution and an interesting project.
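As a baseline for the supervised view, here is a minimal naive Bayes
classifier over binary SNP features for a case/control study. The 0/1
genotype coding and the tiny example are illustrative; a real dataset
would need a genotype encoding chosen with the biology in mind.

```python
# Naive Bayes on binary SNP features, with add-one (Laplace) smoothing.
import math

def train_nb(X, y):
    classes = sorted(set(y))
    model = {}
    for c in classes:
        rows = [x for x, yi in zip(X, y) if yi == c]
        prior = len(rows) / len(X)
        # P(snp_j = 1 | class), smoothed
        p1 = [(sum(r[j] for r in rows) + 1) / (len(rows) + 2)
              for j in range(len(X[0]))]
        model[c] = (math.log(prior), p1)
    return model

def classify(model, x):
    def loglik(c):
        logprior, p1 = model[c]
        return logprior + sum(
            math.log(p if xj == 1 else 1 - p) for xj, p in zip(x, p1))
    return max(model, key=loglik)
```

Note that naive Bayes assumes the SNPs are independent given the
phenotype -- exactly the assumption a graphical model over interacting
SNPs would relax.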

Haplotype Mapping

Here is some information about the Haplotype Mapping project ("HapMap")
from the National Institutes of Health: "Sites in the genome where individuals differ in their DNA sequence by a single
base are called single nucleotide polymorphisms (SNPs).
Recent work has shown
that there are about 10 million SNPs that are common in human populations.
SNPs
are not inherited independently; rather, sets of adjacent
SNPs are inherited in
blocks. The specific pattern of particular SNP alleles in a block is called a
haplotype. Recent studies show that most haplotype blocks in the human genome
have been transmitted through many generations without recombination.
Furthermore, each block has only a few common haplotypes. This means that
although a block may contain many SNPs, it takes only a few SNPs to uniquely
identify or 'tag' each of the haplotypes in the block."

Computational approaches are being developed
for determining haplotype blocks from genotype data from many
individuals, as well as for associating haplotypes with disease.
I list just a few references; more references can be found in the bibliography
for the second paper. A possible project would be to implement one of
these haplotype algorithms and test on a small dataset; extensions of these
methods or comparisons between methods would be very interesting.
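To make the "tag SNP" idea from the quoted passage concrete, here is a
small greedy sketch: given the common haplotypes of a block, choose
SNP positions until every pair of haplotypes is distinguished. The
greedy set-cover heuristic is illustrative, not the exact method of
any published algorithm.

```python
# Greedy tag-SNP selection: cover all haplotype pairs with as few
# SNP positions as possible, picking the most-discriminating position
# at each step.
from itertools import combinations

def tag_snps(haplotypes):
    """haplotypes: list of equal-length strings over {'0','1'}."""
    pairs = set(combinations(range(len(haplotypes)), 2))
    chosen = []
    positions = range(len(haplotypes[0]))
    while pairs:
        # position separating the most still-confused haplotype pairs
        best = max(positions, key=lambda p: sum(
            1 for a, b in pairs if haplotypes[a][p] != haplotypes[b][p]))
        newly = {(a, b) for a, b in pairs
                 if haplotypes[a][best] != haplotypes[b][best]}
        if not newly:
            break   # remaining haplotypes are identical
        chosen.append(best)
        pairs -= newly
    return chosen
```

On a block with many SNPs but few common haplotypes, the chosen set is
much smaller than the block, which is exactly why tag SNPs make
genotyping economical.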

Motif models and discovery

Probabilistic and combinatorial models for regulatory motifs (e.g., binding
sites for transcription factors) have been used to search for new signals
in promoter regions and full genomes. We'll probably cover one EM-based
approach, called MEME, later in the semester.
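The basic probabilistic machinery is a position weight matrix (PWM):
build it from example binding sites, then scan a promoter for the
best-scoring window. The example sites and sequence below are made up
for illustration; MEME's EM step alternates between such scans and
re-estimating the matrix.

```python
# Build a log-odds PWM from aligned binding sites and scan a sequence.
import math

def build_pwm(sites, pseudo=1.0):
    w = len(sites[0])
    pwm = []
    for j in range(w):
        col = {b: pseudo for b in "ACGT"}   # pseudocounts
        for s in sites:
            col[s[j]] += 1
        tot = sum(col.values())
        # log-odds against a uniform background
        pwm.append({b: math.log((col[b] / tot) / 0.25) for b in "ACGT"})
    return pwm

def best_hit(pwm, seq):
    w = len(pwm)
    scored = [(sum(pwm[j][seq[i + j]] for j in range(w)), i)
              for i in range(len(seq) - w + 1)]
    return max(scored)   # (score, position)
```

A positive best score means the window looks more like the motif than
like background; a genome-wide scan is this same loop over every
promoter.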

String Kernels for Sequence Data

String kernels are functions that implicitly map a pair
of sequences to a feature space and take their inner product in this
space; they allow us to use learning algorithms and techniques
for vector-valued
data (SVMs, clustering, principal component analysis) on sequence
data. Various string kernels have recently been used in
computational biology for applications such as
protein classification (many were introduced by our group at
Columbia) and peptide cleavage
site recognition; they have also appeared in natural language processing
for text classification. New string kernels, extensions of existing
string kernels, and new biological applications of string kernels
would all make interesting subjects for a project. More recently, our
group has been involved in developing profile-based string kernels and
semi-supervised approaches to building kernels -- see me for additional
newer references.

I list only a few references below -- I can provide more to interested
students. Among other things, there is a connection between these kernels and
non-deterministic finite state automata.
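The simplest example of the family is the k-spectrum kernel: map each
sequence to its vector of k-mer counts and take inner products in that
space. A minimal sketch:

```python
# k-spectrum kernel: inner product of k-mer count vectors. The
# mismatch and profile kernels extend this same idea.
from collections import Counter

def spectrum(seq, k):
    # count every length-k substring
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(x, y, k=3):
    cx, cy = spectrum(x, k), spectrum(y, k)
    return sum(c * cy[kmer] for kmer, c in cx.items())
```

In practice one usually normalizes, using
K(x,y) / sqrt(K(x,x) K(y,y)), so that sequence length does not
dominate the similarity.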

New Approaches for Time Series Expression Data

Many labs are now producing time series gene expression data sets, where
multiple microarray assays are made at different time points in some
biological process. One should be able to learn more from time series data
than from the same number of unrelated replicates, since we can see the
evolution of a process. However, the data is too sparse and noisy (and the
genes too numerous) to use many standard time series analysis techniques.

Below I list two clustering algorithms specifically designed to deal with
time series data. Implementation of the spline-based clustering is available
from my lab. I can make specific suggestions for projects related to the
spline approach if a group is interested. Otherwise, any implementation or
comparison of time series clustering or analysis techniques for this data
could be interesting.
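As a baseline to compare against, here is a k-means-style clustering
of expression profiles that uses Pearson correlation as the
similarity, so genes with the same temporal *shape* cluster together
regardless of amplitude. The naive seeding is for illustration only.

```python
# Correlation-based k-means for time-course expression profiles.
import math

def corr(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    su = math.sqrt(sum((a - mu) ** 2 for a in u)) or 1.0
    sv = math.sqrt(sum((b - mv) ** 2 for b in v)) or 1.0
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (su * sv)

def cluster(profiles, k, iters=20):
    centers = profiles[:k]             # naive seeding, for illustration
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in profiles:
            best = max(range(k), key=lambda c: corr(p, centers[c]))
            groups[best].append(p)
        # recompute each center as the mean profile of its group
        centers = [[sum(col) / len(g) for col in zip(*g)] if g else c
                   for g, c in zip(groups, centers)]
    return [max(range(k), key=lambda c: corr(p, centers[c]))
            for p in profiles]
```

The spline-based method improves on this baseline by modeling each
profile as a smooth function of time, which handles unevenly spaced
and missing time points more gracefully.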