Significance

Under the general heading of “topological data analysis” (TDA), the recent adoption of topological methods for the analysis
of large, complex, and high-dimensional data sets has established that the abstract concepts of algebraic topology provide
powerful tools for data analysis. However, despite the successes of TDA, most applications have lacked formal statistical
veracity, primarily due to difficulties in deriving distributional information about topological descriptors. We present an
approach, Replicating Statistical Topology (RST), which takes the most basic descriptor of TDA, the persistence diagram, and,
using models based on Gibbs distributions and Markov chain Monte Carlo, provides replications of it. These allow for formal
statistical hypothesis testing, without requiring costly, or perhaps intrinsically unavailable, replications of the original
data set.

Abstract

Under the banner of “big data,” the detection and classification of structure in extremely large, high-dimensional data sets
are two of the central statistical challenges of our times. Among the most intriguing new approaches to these challenges is
“TDA,” or “topological data analysis,” one of the primary aims of which is providing nonmetric, but topologically informative,
preanalyses of data which make later, more quantitative, analyses feasible. While TDA rests on strong mathematical foundations
from topology, in applications, it has faced challenges due to difficulties in handling issues of statistical reliability
and robustness, often leading to an inability to make scientific claims with verifiable levels of statistical confidence.
We propose a methodology for the parametric representation, estimation, and replication of persistence diagrams, the main
diagnostic tool of TDA. The power of the methodology lies in the fact that even if only one persistence diagram is available
for analysis—the typical case for big data applications—the replications permit conventional statistical hypothesis testing.
The methodology is conceptually simple and computationally practical, and provides a broadly effective statistical framework
for persistence-diagram-based TDA. We demonstrate the basic ideas on a toy example, and the power of the parametric approach
to TDA modeling in an analysis of cosmic microwave background (CMB) nonhomogeneity.
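
As a rough illustration of the replication idea only, the following minimal sketch draws replicates of a single persistence diagram from a Gibbs-type density proportional to exp(-H) using a Metropolis sampler. Everything specific here is an assumption made for illustration and is not taken from the text above: the energy H (a parameter theta times the sum of nearest-neighbor distances), the bounding value `upper`, the proposal step size, and the function names `energy` and `replicate`. The parameter theta is simply supplied, so the estimation step mentioned in the abstract is omitted.

```python
import math
import random


def energy(points, theta):
    """Hypothetical energy: theta times the sum of nearest-neighbor distances.

    Requires at least two points. This form is an assumption for illustration.
    """
    total = 0.0
    for i, (bi, di) in enumerate(points):
        total += min(
            math.hypot(bi - bj, di - dj)
            for j, (bj, dj) in enumerate(points)
            if j != i
        )
    return theta * total


def replicate(diagram, theta, upper=1.0, n_sweeps=50, step=0.05, rng=None):
    """Return one replicate of `diagram` via single-point Metropolis updates.

    The target density is proportional to exp(-energy) on the region
    0 <= birth < death <= upper; proposals leaving the region are rejected.
    """
    rng = rng or random.Random()
    points = list(diagram)
    current = energy(points, theta)
    for _ in range(n_sweeps):
        for i in range(len(points)):
            b, d = points[i]
            nb, nd = b + rng.gauss(0.0, step), d + rng.gauss(0.0, step)
            if not (0.0 <= nb < nd <= upper):
                continue  # outside the allowed region: keep the old point
            old = points[i]
            points[i] = (nb, nd)
            proposed = energy(points, theta)
            # Metropolis rule: accept with probability min(1, exp(current - proposed))
            if rng.random() < math.exp(min(0.0, current - proposed)):
                current = proposed
            else:
                points[i] = old
    return points


if __name__ == "__main__":
    # Made-up (birth, death) pairs standing in for one observed diagram.
    observed = [(0.10, 0.30), (0.20, 0.90), (0.15, 0.35), (0.40, 0.50)]
    rng = random.Random(0)
    replicates = [replicate(observed, theta=2.0, rng=rng) for _ in range(100)]
    # Spread of a summary statistic (largest persistence) across replicates.
    stats = sorted(max(d - b for b, d in r) for r in replicates)
    print(stats[0], stats[len(stats) // 2], stats[-1])
```

The empirical spread of a summary statistic across such replicates is the kind of distributional information a conventional test needs. Which statistic and which null hypothesis are appropriate depends on the application and is not specified by this sketch.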