In high-dimensional data analysis, one is often faced with the problem that
real data is noisy and in many cases given in coordinates that are not
informative for understanding the data structure itself or for performing later
tasks, such as clustering, classification and regression. The combination of
noise and very high dimensionality (e.g., more than 1000 dimensions) presents challenges for data
analysis and calls for efficient dimensionality reduction tools that take the
inherent geometry of natural data into account. In this talk, I will
first describe a data-driven multi-scale basis that can be used for feature
extraction of smooth data as well as data where the coordinates may be randomly
ordered. In the second half of the talk, I will describe a general
framework for dimensionality reduction, data set parameterization and
clustering that combines many ideas from eigenmaps,
spectral graph theory and harmonic analysis. Our construction is based on a
Markov random walk on the data, and allows one to define a system of coordinates
that is robust to noise, and that reflects the intrinsic geometry or
connectivity of the data points in a diffusion process. Examples will be taken
from image analysis, word-document clustering and spectroscopy. (Part of this
work is joint with R.R. Coifman and S. Lafon.)
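
As a rough illustration of the random-walk construction mentioned above, the following Python sketch builds a Markov matrix on a point cloud and uses its leading nontrivial eigenvectors as coordinates. It is not the implementation used in the talk: the Gaussian kernel, the bandwidth `epsilon`, the diffusion time `t`, and the function name `diffusion_map` are all assumptions chosen for illustration.

```python
import numpy as np

def diffusion_map(X, epsilon, n_coords=2, t=1):
    """Illustrative diffusion-style embedding (assumed Gaussian kernel).

    X        : (n_points, n_features) data array
    epsilon  : kernel bandwidth (hypothetical parameter, must be tuned)
    n_coords : number of embedding coordinates to return
    t        : number of random-walk steps used to scale the coordinates
    """
    # Pairwise squared Euclidean distances between data points.
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)

    # Gaussian affinities; row sums give the degree of each point.
    K = np.exp(-d2 / epsilon)
    d = K.sum(axis=1)

    # The random walk P = D^{-1} K is not symmetric, so work with its
    # symmetric conjugate S = D^{-1/2} K D^{-1/2}, which has the same
    # eigenvalues as P.
    S = K / np.sqrt(np.outer(d, d))
    vals, vecs = np.linalg.eigh(S)

    # Sort eigenpairs by decreasing eigenvalue.
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]

    # Right eigenvectors of P are recovered as D^{-1/2} v.
    psi = vecs / np.sqrt(d)[:, None]

    # Skip the trivial constant eigenvector (eigenvalue 1) and scale the
    # remaining ones by their eigenvalues raised to the power t.
    coords = psi[:, 1:n_coords + 1] * vals[1:n_coords + 1] ** t
    return coords, vals
```

On data made of two well-separated groups, the first returned coordinate would be expected to take opposite signs on the two groups, which is one way to see how connectivity under the random walk translates into a clustering.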