Abstract

Active learning (AL) is a promising way to efficiently
build up training sets with minimal supervision. A learner
deliberately queries specific instances to tune the classifier’s
model using as few labels as possible. In the streaming setting, one
challenge is that the data distribution may evolve over time, so
the model must adapt. Another challenge is sampling bias,
where the sampled training set does not reflect the underlying
data distribution. In the presence of concept drift, sampling bias is
more likely to occur as the training set needs to represent the
whole evolving data. To tackle these challenges, we propose a
novel bi-criteria AL approach (BAL) that relies on two selection
criteria, namely a label uncertainty criterion and a density-based
criterion. While the first criterion selects instances that are the most
uncertain in terms of class membership, the second dynamically
curbs the sampling bias by weighting the samples to reflect the
true underlying distribution. To design and implement these two
criteria for learning from streams, BAL adopts a Bayesian online
learning approach and combines online classification and online
clustering through the use of online logistic regression and online
growing Gaussian mixture models, respectively. Empirical results
obtained on standard synthetic and real-world benchmarks show
the high performance of the proposed BAL method compared to
state-of-the-art AL methods.
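
As a rough illustration of how an uncertainty criterion and a density-based criterion might be combined into a single query score, the following minimal Python sketch may help. It is not the BAL formulation: the function names, the spherical Gaussian mixture, the fixed (non-online) parameters, and the choice of multiplying the two terms are all assumptions made for this example.

```python
import numpy as np

def predictive_entropy(w, x):
    """Binary entropy of a logistic model's prediction p(y=1 | x)."""
    p = 1.0 / (1.0 + np.exp(-np.dot(w, x)))
    p = np.clip(p, 1e-12, 1.0 - 1e-12)  # avoid log(0)
    return float(-(p * np.log(p) + (1.0 - p) * np.log(1.0 - p)))

def mixture_density(x, means, variances, weights):
    """Density of x under a spherical Gaussian mixture (assumed form)."""
    d = x.shape[0]
    sq_dist = np.sum((means - x) ** 2, axis=1)
    norm = (2.0 * np.pi * variances) ** (-d / 2.0)
    return float(np.sum(weights * norm * np.exp(-0.5 * sq_dist / variances)))

def query_score(x, w, means, variances, weights):
    """Hypothetical combined score: uncertainty scaled by local density."""
    return predictive_entropy(w, x) * mixture_density(x, means, variances, weights)

# Toy setting: one unit-variance component at the origin, untrained classifier.
w = np.zeros(2)
means = np.zeros((1, 2))
variances = np.array([1.0])
weights = np.array([1.0])

near = query_score(np.array([0.0, 0.0]), w, means, variances, weights)
far = query_score(np.array([3.0, 3.0]), w, means, variances, weights)
```

With an untrained classifier both points are equally uncertain, so the score favors the one lying in the denser region. In a streaming setting the classifier weights and mixture parameters would be updated online as instances arrive, which this sketch leaves out.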