Abstract: Comparing two sets of multivariate samples is a central problem in data analysis. From a statistical
standpoint, the simplest way to perform such a comparison is to resort to a non-parametric two-sample
test (TST), which checks whether the two sets can be seen as i.i.d. samples of an identical unknown
distribution (the null hypothesis). If the null is rejected, one wishes to identify regions accounting for
this difference. This paper presents a two-stage method providing feedback on this difference, based
upon a combination of statistical learning (regression) and computational topology methods.
Consider two populations, each given as a point cloud in R^d. In the first step, we assign a label
to each set and we compute, for each sample point, a discrepancy measure based on comparing
an estimate of the conditional probability distribution of the label given a position versus the global
unconditional label distribution. In the second step, we study the height function defined at each point
by the aforementioned estimated discrepancy. Topological persistence is used to identify persistent
local minima of this height function, their basins defining regions of points with high discrepancy and
in spatial proximity.
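
The two steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the conditional label distribution is estimated with a brute-force k-nearest-neighbor average, the discrepancy is taken to be the Kullback-Leibler divergence between that estimate and the global label proportion, the height function is taken to be the negated discrepancy, and persistence of local minima is computed with a union-find sweep over the symmetrized k-NN graph. The function names `knn_indices`, `local_discrepancy` and `persistent_minima` and the parameter `k` are hypothetical.

```python
import numpy as np

def knn_indices(points, k):
    """Indices of the k nearest neighbors of each point (brute force)."""
    sq = (points ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * points @ points.T
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor
    return np.argsort(d2, axis=1)[:, :k]

def local_discrepancy(labels, nbrs):
    """Step 1 (assumed form): KL divergence between the k-NN estimate of
    P(label | x) and the global label distribution, for 0/1 labels."""
    labels = np.asarray(labels, dtype=float)
    theta0 = labels.mean()  # global P(label = 1), must lie in (0, 1)
    p = np.clip(labels[nbrs].mean(axis=1), 1e-12, 1 - 1e-12)
    return p * np.log(p / theta0) + (1 - p) * np.log((1 - p) / (1 - theta0))

def persistent_minima(height, nbrs):
    """Step 2 (assumed form): persistence of each local minimum of `height`
    on the undirected graph induced by the neighbor lists `nbrs`.
    Returns {index of minimum: persistence}; survivors get infinity."""
    n = len(height)
    adj = [set() for _ in range(n)]
    for i in range(n):
        for j in nbrs[i]:
            adj[i].add(int(j))
            adj[int(j)].add(i)
    parent = [-1] * n  # -1 marks a vertex not yet swept

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    result = {}
    for v in np.argsort(height, kind="stable"):  # sweep by increasing height
        v = int(v)
        parent[v] = v
        for u in adj[v]:
            if parent[u] == -1:
                continue
            ru, rv = find(u), find(v)
            if ru == rv:
                continue
            # Each root is the lowest vertex of its component; the component
            # with the higher minimum dies at the current height.
            old, young = (ru, rv) if height[ru] <= height[rv] else (rv, ru)
            if young != v:  # skip zero-persistence pairs
                result[young] = float(height[v] - height[young])
            parent[young] = old
    for v in range(n):
        if find(v) == v:
            result[v] = float("inf")  # one survivor per connected component
    return result
```

Given a point array `X` and 0/1 labels `y`, one would compute `nbrs = knn_indices(X, k)`, then `persistent_minima(-local_discrepancy(y, nbrs), nbrs)`, and keep only the minima whose persistence exceeds a chosen threshold. The choice of `k` and of that threshold is left open here.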
Experiments are reported on both synthetic and real data (satellite images and handwritten digit
images), ranging in dimension from d = 2 to d = 784, illustrating the ability of our method to localize
discrepancies.
From a general perspective, the ability to provide feedback downstream of a TST may prove of broad
interest in exploratory statistics and data science.