Simultaneous partitioning and modeling: a framework for learning from complex data

Abstract

While a single learned model is adequate for simple prediction problems, it may not be sufficient to represent the heterogeneous populations that difficult classification or regression problems often involve. In such scenarios, practitioners often adopt a "divide and conquer" strategy that segments the data into relatively homogeneous groups and then builds a model for each group. This two-step procedure usually results in simpler, more interpretable and actionable models without any loss in accuracy. We consider prediction problems on bi-modal or dyadic data with covariates, e.g., predicting customer behavior across products, where the independent variables can be naturally partitioned along the modes. A pivoting operation then places the target variable in the entries of a "customer by product" data matrix. We present a model-based co-clustering framework that interleaves partitioning (clustering) along each mode with the construction of prediction models, iteratively improving both the cluster assignments and the fit of the models. This Simultaneous CO-clustering And Learning (SCOAL) framework generalizes co-clustering and collaborative filtering to model-based co-clustering, and is shown to be better than independently clustering the data first and then building models. Our framework applies to a wide range of bi-modal and multi-modal data, and can be easily specialized to address classification and regression problems in domains like recommender systems, fraud detection and marketing.
Further, we note that in several datasets not all of the data is useful for the learning problem, and ignoring outliers and non-informative values may lead to better models. We explore extensions of SCOAL that automatically identify and discard irrelevant data points and features while modeling, in order to improve prediction accuracy. Next, we leverage the multiple models provided by the SCOAL technique to address two prediction problems on dyadic data: (i) ranking predictions based on their reliability, and (ii) active learning. We also extend SCOAL to predictive modeling of multi-modal data where one of the modes is implicitly ordered, e.g., time series data. Finally, we describe our implementation of a parallel version of SCOAL based on the Google MapReduce framework and developed on the open source Hadoop platform. We demonstrate the effectiveness of specific instances of the SCOAL framework on prediction problems through experimentation on real and synthetic data.
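
The interleaved loop described in the abstract (fit a model per co-cluster, then reassign rows and columns to the clusters whose models fit them best) can be sketched as follows. This is only a minimal illustration under stated assumptions, not the thesis implementation: it assumes a squared-error objective, a single scalar covariate per (customer, product) cell, and one least-squares line per co-cluster. The names `fit_line` and `scoal`, the random initialization, and the fixed iteration count are hypothetical choices made for the example.

```python
import random


def fit_line(points):
    """Least-squares fit of y = a + b*x to (x, y) pairs.

    Returns (a, b). Falls back to a constant model when there are no
    points or when x has no spread (assumed handling of degenerate cases).
    """
    n = len(points)
    if n == 0:
        return 0.0, 0.0
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    if sxx < 1e-12:
        return my, 0.0
    b = sum((x - mx) * (y - my) for x, y in points) / sxx
    return my - b * mx, b


def scoal(X, Y, k, l, iters=20, seed=0):
    """Alternate model fitting with row and column reassignment.

    X[i][j] is the (assumed scalar) covariate and Y[i][j] the target for
    row i (e.g. a customer) and column j (e.g. a product). Returns the row
    labels, column labels, per-co-cluster models, and the total squared
    error recorded at the end of each iteration.
    """
    rng = random.Random(seed)
    m, n = len(Y), len(Y[0])
    rows = [rng.randrange(k) for _ in range(m)]  # random initial row clusters
    cols = [rng.randrange(l) for _ in range(n)]  # random initial column clusters
    models, history = [], []

    def err(i, j, model):
        a, b = model
        return (Y[i][j] - (a + b * X[i][j])) ** 2

    for _ in range(iters):
        # Step 1: fit one model to the cells of each (row cluster, column cluster) block.
        models = [
            [
                fit_line([(X[i][j], Y[i][j])
                          for i in range(m) if rows[i] == g
                          for j in range(n) if cols[j] == h])
                for h in range(l)
            ]
            for g in range(k)
        ]
        # Step 2: move each row to the row cluster whose models give it the lowest error.
        rows = [
            min(range(k), key=lambda g: sum(err(i, j, models[g][cols[j]]) for j in range(n)))
            for i in range(m)
        ]
        # Step 3: move each column to the column cluster with the lowest error.
        cols = [
            min(range(l), key=lambda h: sum(err(i, j, models[rows[i]][h]) for i in range(m)))
            for j in range(n)
        ]
        history.append(sum(err(i, j, models[rows[i]][cols[j]])
                           for i in range(m) for j in range(n)))
    return rows, cols, models, history
```

In this sketch each of the three steps can only lower or preserve the total squared error, so the recorded `history` is non-increasing (up to floating-point rounding). Like other alternating schemes, it may settle in a local minimum that depends on the initial assignment.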