Spectral Clustering and Biological Data

Matt Mahoney

Genetics, Dartmouth Medical School

High-throughput gene expression profiling is rapidly becoming a standard tool
in biology. Gene expression data provides a large-scale snapshot of a vast
number of molecular processes within cells and tissues, leading to the hope
that it will give insights into basic biological processes as well as
clues to dysfunction
in disease. Several studies over the past decade have demonstrated
that many intractable diseases, including multiple cancers and autoimmune
disorders, have molecularly distinct sub-types, suggesting multiple disease
mechanisms. Consequently, clustering the data to identify these sub-types is a
ubiquitous first step in data processing, and one that deserves careful
attention. This talk will present the ongoing work of an erstwhile mathematician
to implement Spectral Clustering for high-throughput gene expression data.
I will focus on the perennial problem of identifying the number of clusters in
a given data set and how features of gene expression data can inform certain
choices for accomplishing the task.

DISCLAIMER: All buzzwords above (including the mathematical ones)
will be defined for non-biologists!