Research Abstracts Online
January 2010 - March 2011

PI: Vipin Kumar, Fellow

Data Mining for Earth-Science, Clinical, and Biological Data

The primary objective of this research is to develop novel, high-performance data-mining algorithms and tools for large-scale datasets that arise in a variety of applications. Examples include gigabyte datasets collected by earth-observing satellites, which must be processed to better understand global-scale changes in biosphere processes and patterns; data generated by scientific simulations, which can be used to gain insight into the underlying physical processes; network traffic data monitored to detect illegal network activities; and large collections of text and hypertext analyzed to extract relevant information. The key technical challenges in mining these datasets include high volume, dimensionality, and heterogeneity; the spatio-temporal aspect of the data; possibly skewed class distributions; the distributed nature of the data; and the complexity of converting raw collected data into high-level features. High-performance data mining is essential for analyzing this growing volume of data and for providing analysts with automated tools that facilitate some of the steps needed for hypothesis generation and evaluation.

Data mining has also become a key tool for analyzing biomedical data. In collaboration with the Mayo Clinic in Rochester, Minnesota, these researchers are developing advanced data-mining techniques for several medical problems. They are also working to identify data mining’s impact on the automatic prediction of protein function from proteomics data and on genetic and genomic marker discovery from SNP and gene-expression data. Computational challenges imposed by the large size of the datasets are addressed by building upon the group’s past research on highly parallel formulations of key data-mining kernels (anomaly/outlier detection, association-pattern discovery, clustering, and rare-class predictive modeling) that can take advantage of high-performance computers.