Data Mining for Biological Data Learning: Algorithm and Application

Doctoral Dissertation

Abstract

Owing to rapid technological development, large amounts of experimental data on complex biological systems have become increasingly available. For example, microarray technology enables biologists to monitor the expression profiles of thousands of genes simultaneously, generating large volumes of gene expression data, and next-generation sequencing technology is producing a deluge of DNA sequence data. Life science researchers are accumulating massive data sets on the assumption that something in the data will stimulate important questions and insights. This presents both opportunities and challenges in leveraging these data efficiently and effectively for novel discovery.

Data mining, the process of analyzing data from different perspectives and summarizing them into useful information and patterns, is of immense importance in bioinformatics and in biomedical science more generally. In particular, supervised data mining has been used to great effect in numerous bioinformatics prediction problems. As more sources of data, and more diverse ones, accumulate every day, making sense of them requires sophisticated computational analysis and data mining. One major bottleneck so far is how to analyze these huge, noisy, and heterogeneous data sets quickly and accurately. My PhD research focuses on applying data mining algorithms and tools to tackle these challenging and interesting computational problems in bioinformatics.

We first present a two-stage data mining approach for pathway analysis. In the first stage, informative genes that can represent a pathway are selected using feature selection methods. In the second stage, pathways are ranked based on their representative genes using classification methods. Then, we demonstrate a machine learning framework for trait-based microbial ecology using whole-genome sequence data. We use this framework to quantitatively link genotypes with functional traits. Finally, we extend this framework to handle continuous functional traits. Specifically, we use Random Forest Regression to predict continuous functional traits based solely on whole-genome sequences and to identify a small set of biomarkers that are relevant to those traits. We also incorporate network analysis, which supplies correlation information, to further narrow down the results.
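
The two-stage pathway analysis described above can be illustrated with a minimal sketch. It is not the dissertation's implementation: the abstract does not name the feature selection or classification methods, so the t-like filter score, the nearest-centroid classifier, the leave-one-out evaluation, and all function names, gene names, and expression values below are assumptions chosen only to keep the example small and self-contained.

```python
from statistics import mean, stdev

def gene_score(values, labels):
    """Absolute two-class t-like score for one gene (assumed filter criterion)."""
    a = [v for v, y in zip(values, labels) if y == 0]
    b = [v for v, y in zip(values, labels) if y == 1]
    spread = (stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)) ** 0.5
    return abs(mean(a) - mean(b)) / (spread + 1e-12)  # epsilon avoids division by zero

def select_genes(expr, labels, genes, k):
    """Stage 1: keep the k highest-scoring genes as the pathway's representatives."""
    return sorted(genes, key=lambda g: gene_score(expr[g], labels), reverse=True)[:k]

def loo_accuracy(expr, labels, genes):
    """Leave-one-out accuracy of a nearest-centroid classifier (assumed classifier)."""
    n = len(labels)
    correct = 0
    for i in range(n):
        # Class centroids are computed without the held-out sample i.
        centroids = {}
        for c in set(labels):
            idx = [j for j in range(n) if j != i and labels[j] == c]
            centroids[c] = [mean(expr[g][j] for j in idx) for g in genes]
        x = [expr[g][i] for g in genes]
        pred = min(
            centroids,
            key=lambda c: sum((xi - ci) ** 2 for xi, ci in zip(x, centroids[c])),
        )
        correct += pred == labels[i]
    return correct / n

def rank_pathways(expr, labels, pathways, k=2):
    """Stage 2: rank pathways by how well their representative genes classify samples.

    Simplification: genes are selected once on all samples, outside the
    leave-one-out loop, so the reported accuracies are optimistic.
    """
    scored = []
    for name, genes in pathways.items():
        chosen = select_genes(expr, labels, genes, k)
        scored.append((name, loo_accuracy(expr, labels, chosen), chosen))
    return sorted(scored, key=lambda t: t[1], reverse=True)

# Hypothetical toy data: 6 samples (3 per class), 5 genes, 2 pathways.
labels = [0, 0, 0, 1, 1, 1]
expr = {
    "g1": [1.0, 1.2, 0.9, 3.0, 3.1, 2.8],  # separates the classes
    "g2": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],  # uninformative
    "g3": [0.5, 0.4, 0.6, 1.5, 1.6, 1.4],  # separates the classes
    "g4": [5.0, 4.0, 6.0, 5.5, 4.5, 5.0],  # uninformative
    "g5": [1.0, 3.0, 2.0, 2.0, 1.0, 3.0],  # uninformative
}
pathways = {"pathway_A": ["g1", "g2", "g3"], "pathway_B": ["g2", "g4", "g5"]}

ranking = rank_pathways(expr, labels, pathways, k=2)
for name, acc, chosen in ranking:
    print(name, round(acc, 2), chosen)
```

In this toy setting the pathway containing the two class-separating genes is ranked first. In a real analysis the gene selection step would be nested inside the cross-validation loop so that held-out samples do not influence which genes are chosen.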