Author

Access Type

Date of Award

Degree Type

Degree Name

Ph.D.

Department

Computer Science

First Advisor

Farshad Fotouhi

Second Advisor

Chandan K. Reddy

Abstract

Hierarchical multi-label classification is a variant of traditional classification in which instances can belong to several labels, which are in turn organized in a hierarchy. Functional classification of genes is a challenging problem in functional genomics for several reasons. First, each gene participates in multiple biological activities, so prediction models should support multi-label classification. Second, genes are organized and classified according to a hierarchical classification scheme that represents the relationships between gene functions, and prediction models should maintain these relationships. In addition, various biomolecular data sources, such as gene expression data and protein-protein interaction data, can be used to assign biological functions to genes. Integrating multiple data sources is therefore required to obtain a precise picture of the roles of genes in living organisms by uncovering novel biology in the form of previously unknown functional annotations. To address these issues, the presented work deals with hierarchical multi-label classification.

The purpose of this thesis is threefold. First, a Hierarchical Multi-Label classification algorithm using Boosting classifiers, HML-Boosting, is proposed for the hierarchical multi-label classification problem in the context of gene function prediction. HML-Boosting exploits the predefined hierarchical dependencies among the classes. Using two approaches for correcting class-membership inconsistencies during the testing phase, the top-down approach and the bottom-up approach, the author demonstrates that the HML-Boosting algorithm outperforms the flat classifier approach. Second, the author proposes the HiBLADE algorithm (Hierarchical multi-label Boosting with LAbel DEpendency), a novel algorithm that takes advantage of the pre-established hierarchical taxonomy of the classes and also exploits the hidden correlations among the classes that are not shown through the class hierarchy, thereby improving the quality of the predictions. In the proposed approach, the predefined hierarchical taxonomy of the labels is first used to decide upon the training set for each classifier. The dependencies among the children of each label in the hierarchy are then captured and analyzed using a Bayesian method and instance-based similarity. The primary objective of the proposed algorithm is to find and share a number of base models across the correlated labels. HiBLADE differs from conventional algorithms in two ways. First, it allows the prediction of multiple functions for genes at the same time while maintaining the hierarchy constraint. Second, the classifiers are built based on the label under study and its most similar sibling. Experimental results on several real-world biomolecular datasets show that the proposed method can improve the performance of hierarchical multi-label classification.
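As an illustration only, the two inconsistency-correction strategies named above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the thesis's implementation: the toy hierarchy `PARENT`, the label names, and the helper functions `ancestors`, `top_down`, and `bottom_up` are all hypothetical, and the sketch assumes binary predictions on a tree-shaped hierarchy. Here the top-down reading drops a predicted label whenever one of its ancestors was not predicted, while the bottom-up reading adds every ancestor of a predicted label.

```python
from typing import Dict, List, Optional, Set

# Hypothetical toy hierarchy: child label -> parent label (None marks a root).
PARENT: Dict[str, Optional[str]] = {
    "A": None,
    "A.1": "A",
    "A.1.1": "A.1",
    "B": None,
    "B.1": "B",
}


def ancestors(label: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return the ancestors of `label`, nearest first."""
    chain: List[str] = []
    node = parent[label]
    while node is not None:
        chain.append(node)
        node = parent[node]
    return chain


def top_down(predicted: Set[str], parent: Dict[str, Optional[str]]) -> Set[str]:
    """Keep a label only if all of its ancestors were also predicted."""
    return {
        label
        for label in predicted
        if all(a in predicted for a in ancestors(label, parent))
    }


def bottom_up(predicted: Set[str], parent: Dict[str, Optional[str]]) -> Set[str]:
    """Add every ancestor of each predicted label."""
    corrected = set(predicted)
    for label in predicted:
        corrected.update(ancestors(label, parent))
    return corrected


# An inconsistent raw prediction: "A.1.1" without "A.1", and "B.1" without "B".
raw = {"A", "A.1.1", "B.1"}
consistent_top_down = top_down(raw, PARENT)
consistent_bottom_up = bottom_up(raw, PARENT)
```

Both corrected sets satisfy the hierarchy constraint; the top-down variant is the more conservative of the two because it only removes labels, whereas the bottom-up variant only adds them.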

More important, however, is the third part, which focuses on the integration of multiple heterogeneous data sources for improving hierarchical multi-label classification. Unlike most previous works, which consider a single data source for gene function prediction, the author explores the integration of heterogeneous data sources for genome-wide gene function prediction. This integration is addressed with a novel Hierarchical Bayesian iNtegration algorithm, HiBiN, a general framework that uses Bayesian reasoning to integrate heterogeneous data sources for accurate gene function prediction. The system uses posterior probabilities to assign class memberships to samples from multiple data sources while maintaining the hierarchical constraint that governs the annotation of the genes. The author demonstrates, through HiBiN, that the integration of the diverse datasets significantly improves the classification quality for hierarchical gene function prediction in terms of several measures, compared with the baselines: single-source prediction models and a fused-flat model. Moreover, the system has been extended to include a weighting scheme that controls the contribution of each data source according to its relevance to the label under study. The results show that the new weighting scheme compares favorably with the other approach along