Topological data analysis

In applied mathematics, topological data analysis (TDA) is an approach to the analysis of datasets using techniques from topology. Extraction of information from datasets that are high-dimensional, incomplete and noisy is generally challenging. TDA provides a general framework to analyze such data in a manner that is insensitive to the particular metric chosen and provides dimensionality reduction and robustness to noise. Beyond this, it inherits functoriality, a fundamental concept of modern mathematics, from its topological nature, which allows it to adapt to new mathematical tools.

The initial motivation is to study the shape of data. TDA has combined algebraic topology and other tools from pure mathematics to allow mathematically rigorous study of "shape". The main tool is persistent homology, an adaptation of homology to point cloud data. Persistent homology has been applied to many types of data across many fields. Moreover, its mathematical foundation is also of theoretical importance. The unique features of TDA make it a promising bridge between topology and geometry.

The premise underlying TDA is that shape matters. Real data in high dimensions is nearly always sparse, and tends to have relevant low dimensional features. One task of TDA is to provide a precise characterization of this fact. An illustrative example is a simple predator-prey system governed by the Lotka–Volterra equations.[1] One can easily observe that the trajectory of the system forms a closed circle in state space. TDA provides tools to detect and quantify such recurrent motion.[2]

Many algorithms for data analysis, including those used in TDA, require the choice of various parameters. Without prior domain knowledge, the correct collection of parameters for a data set is difficult to choose. The main insight of persistent homology is that we can use the information obtained from all values of a parameter. Of course this insight alone is easy to make; the hard part is encoding this huge amount of information into an understandable and easy-to-represent form. With TDA, there is a mathematical interpretation when the information is a homology group. In general, the assumption is that features that persist for a wide range of parameters are "true" features. Features persisting for only a narrow range of parameters are presumed to be noise, although the theoretical justification for this is unclear.[3]

Precursors to the full concept of persistent homology appeared gradually over time.[4] In 1990, Patrizio Frosini introduced the size function, which is equivalent to the 0th persistent homology.[5] Nearly a decade later, Vanessa Robins studied the images of homomorphisms induced by inclusion.[6] Finally, shortly thereafter, Edelsbrunner et al. introduced the concept of persistent homology together with an efficient algorithm and its visualization as a persistence diagram.[7] Carlsson et al. reformulated the initial definition and gave an equivalent visualization method called persistence barcodes,[8] interpreting persistence in the language of commutative algebra.[9]

A persistence module 𝕌 indexed by ℤ is a vector space U_t for each t ∈ ℤ, together with a linear map u_t^s : U_s → U_t whenever s ≤ t, such that u_t^t = 1 for all t and u_t^s u_s^r = u_t^r whenever r ≤ s ≤ t.[10] An equivalent definition is a functor from ℤ, considered as a partially ordered set, to the category of vector spaces.
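The axioms above can be checked mechanically. The following Python sketch (the toy example and all names are ours, not from any TDA library) encodes a persistence module over a finite index set as matrices and verifies the identity and composition laws:

```python
import itertools

# Toy persistence module over the index set {0, 1, 2}, all pieces 1-dimensional.
# maps[(s, t)] holds the matrix of u_t^s : U_s -> U_t (here 1x1).
dims = {0: 1, 1: 1, 2: 1}
maps = {
    (0, 0): [[1]], (1, 1): [[1]], (2, 2): [[1]],
    (0, 1): [[1]],   # the class born at index 0 survives to index 1 ...
    (1, 2): [[0]],   # ... and dies entering index 2
    (0, 2): [[0]],   # forced by the composition law
}

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def is_persistence_module(dims, maps):
    """Check u_t^t = id and u_t^s u_s^r = u_t^r for all r <= s <= t."""
    for t, d in dims.items():
        identity = [[1 if i == j else 0 for j in range(d)] for i in range(d)]
        if maps[(t, t)] != identity:
            return False
    return all(matmul(maps[(s, t)], maps[(r, s)]) == maps[(r, t)]
               for r, s, t in itertools.combinations_with_replacement(sorted(dims), 3))
```

Changing any single map (say, setting maps[(0, 2)] to the identity) breaks the composition law and the check fails.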

The persistent homology group PH of a point cloud is the persistence module defined as PH_k(X) = ∏ H_k(X_r), where X_r is the Čech complex of radius r of the point cloud X and H_k is the homology group.

A persistence barcode is a multiset of intervals in ℝ, and a persistence diagram is a multiset of points in Δ := {(u, v) ∈ ℝ² | u, v ≥ 0, u ≤ v}.

The Wasserstein distance between two persistence diagrams X and Y is defined as W_p[L_q](X, Y) := inf_{φ : X → Y} [ Σ_{x∈X} ‖x − φ(x)‖_q^p ]^{1/p}, where 1 ≤ p, q ≤ ∞ and φ ranges over bijections between X and Y (points may be matched to the diagonal). For p = ∞ this is the bottleneck distance.
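For small diagrams the infimum over matchings can be computed by brute force. The sketch below (our own, exponential-time, purely illustrative) pads each diagram with diagonal slots and minimizes the matching cost over all permutations, using the L∞ distance between diagram points:

```python
import itertools
import math

def diag_dist(p):
    """L-infinity distance from a diagram point (b, d) to the diagonal u = v."""
    return (p[1] - p[0]) / 2

def point_dist(p, q):
    """L-infinity distance between two diagram points."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

def wasserstein(X, Y, p=2):
    """Degree-p Wasserstein distance between small persistence diagrams.

    Brute force: pad each diagram with diagonal slots, then minimize the
    matching cost over all permutations. Exponential in the number of
    points, and p must be finite -- illustration only.
    """
    n = len(X) + len(Y)
    def cost(i, j):
        if i < len(X) and j < len(Y):
            return point_dist(X[i], Y[j])
        if i < len(X):
            return diag_dist(X[i])   # X[i] matched to the diagonal
        if j < len(Y):
            return diag_dist(Y[j])   # Y[j] matched to the diagonal
        return 0.0                   # diagonal slot matched to diagonal slot
    best = math.inf
    for perm in itertools.permutations(range(n)):
        best = min(best, sum(cost(i, perm[i]) ** p for i in range(n)) ** (1 / p))
    return best
```

For example, a diagram with the single point (0, 2) compared against the empty diagram gives distance 1: the point must be matched to its diagonal projection.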

By the structure theorem, a persistence module of finitely generated type over a field F decomposes as a direct sum of free and torsion graded F[t]-modules, ⊕_i Σ^{t_i} F[t] ⊕ ⊕_j Σ^{r_j}(F[t]/(t^{s_j})). Intuitively, the free parts correspond to the homology generators that appear at filtration level t_i and never disappear, while the torsion parts correspond to those that appear at filtration level r_j and last for s_j steps of the filtration (or equivalently, disappear at filtration level s_j + r_j).

Persistent homology is visualized through a barcode or persistence diagram. The barcode has its root in abstract mathematics, though not at first sight; essentially, the derived category of chain complexes over a field is equivalent to the graded category of vector spaces.[12]

Stability is desirable because it provides robustness against noise. If X is any space homeomorphic to a simplicial complex, and f, g : X → ℝ are continuous tame[13] functions, then the persistence vector spaces {H_k(f⁻¹([0, r]))} and {H_k(g⁻¹([0, r]))} are finitely presented, and W_∞(D(f), D(g)) ≤ ‖f − g‖_∞, where W_∞ refers to the bottleneck distance[14] and D is the map taking a continuous tame function to the persistence diagram of its k-th homology.

If X is a point cloud, replace X with a nested family of simplicial complexes X_r (such as the Čech or Vietoris-Rips complex). This process converts the point cloud into a filtration of simplicial complexes. Taking the homology of each complex in this filtration gives a persistence module.
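As a from-scratch illustration (our own sketch, restricted to the 0th Betti number rather than full persistent homology), the following Python code builds the edges of a Vietoris-Rips complex at several scales and tracks the number of connected components across the filtration:

```python
import math
from itertools import combinations

def vietoris_rips_edges(points, r):
    """Edges of the Vietoris-Rips complex at scale r (pairs within distance r)."""
    return [(i, j) for i, j in combinations(range(len(points)), 2)
            if math.dist(points[i], points[j]) <= r]

def betti0(n, edges):
    """Number of connected components (0th Betti number) via union-find."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j in edges:
        parent[find(i)] = find(j)
    return len({find(x) for x in range(n)})

# Two well-separated pairs of points: beta_0 drops from 4 to 2 to 1 as r grows.
pts = [(0, 0), (0.1, 0), (5, 0), (5.1, 0)]
for r in (0.05, 0.5, 6.0):
    print(r, betti0(len(pts), vietoris_rips_edges(pts, r)))
```

The 0-dimensional barcode of this point cloud records exactly these merge events: two short bars die when the pairs connect, and one bar persists forever.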

The first algorithm for persistent homology over 𝔽₂ was given by Edelsbrunner et al.[7] Zomorodian and Carlsson gave the first practical algorithm to compute persistent homology over all fields.[9] Edelsbrunner and Harer's book gives general guidance on computational topology.[17]

One issue that arises in computation is the choice of complex. The Čech complex and the Vietoris-Rips complex are the most natural at first glance; however, their size grows rapidly with the number of data points. The Vietoris-Rips complex is often preferred over the Čech complex because its definition is simpler, and the Čech complex requires extra effort to define in a general finite metric space. Efficient ways to lower the computational cost of homology have been studied. For example, the α-complex and the witness complex are used to reduce the dimension and size of complexes.[18]

Recently, discrete Morse theory has shown promise for computational homology because it can reduce a given simplicial complex to a much smaller cellular complex that is homotopy equivalent to the original one.[19] This reduction can in fact be performed as the complex is constructed, by using matroid theory, leading to further performance increases.[20] Another recent algorithm saves time by ignoring the homology classes with low persistence.[21]

Various software packages are available, such as javaPlex, Dionysus, Perseus, PHAT, DIPHA, Gudhi, Ripser, and TDAstats. A comparison of these tools has been carried out by Otter et al.[22] The R package TDA can calculate recently introduced concepts like the persistence landscape and the kernel distance estimator.[23] The Topology ToolKit is specialized for continuous data defined on manifolds of low dimension (1, 2 or 3), as typically found in scientific visualization. Also, the recently released R package TDAstats uses the fast C++ Ripser library to calculate persistent homology.[24] It also uses the ubiquitous ggplot2 package to generate reproducible, customizable, publication-quality visualizations of persistent homology, specifically topological barcodes and persistence diagrams. The sample code below gives an example of how the R programming language can be used to compute persistent homology.

# install TDAstats package from CRAN and load
install.packages("TDAstats")
library("TDAstats")

# make code reproducible
set.seed(1)

# create dataset: point cloud with 100 2-dimensional uniformly distributed
# points in a unit square
unif.2d <- cbind(runif(100), runif(100))

# create dataset: 100 points on a circle in 2 dimensions
angles <- runif(100, 0, 2 * pi)
circ.2d <- cbind(cos(angles), sin(angles))

# calculate persistent homology for both datasets
# dim = 1 is the maximum dimension of cycles for 2-dimensional data; increase
# appropriately for larger dimensions (dim := dimension - 1)
unif.hom <- calculate_homology(unif.2d, dim = 1)
circ.hom <- calculate_homology(circ.2d, dim = 1)

# plot uniformly distributed point cloud as persistence diagram
# although some features might look significant, the units on this diagram
# and the following barcode clarify that all the features in this diagram
# are short-lived and represent noise
plot_persist(unif.hom)

# plot circle point cloud as topological barcode
# we see a single persistent bar, as expected for a circle (single 1-cycle/loop)
plot_barcode(circ.hom)

Persistence diagram created by sample code (unif.2d dataset) for a set of 100 points uniformly distributed within a 2-dimensional unit square. None of the 0-cycles or 1-cycles are considered true signal (none truly exist within a unit square point cloud). Although some features appear to persist, the axis ticks show that the most persistent feature persists for less than 0.20 units, which is relatively small for a point cloud on a unit square.

Topological barcode created by sample code (circ.2d dataset) for a set of 100 points uniformly distributed around the circumference of a circle. The single, long 1-dimensional feature at the top of the barcode represents the only 1-cycle present in a circle.

High-dimensional data is impossible to visualize directly. Many methods have been invented to extract a low-dimensional structure from the data set, such as principal component analysis and multidimensional scaling.[25] However, it is important to note that the problem itself is ill-posed, since many different topological features can be found in the same data set. Thus, the study of visualization of high-dimensional spaces is of central importance to TDA, although it does not necessarily involve the use of persistent homology. However, recent attempts have been made to use persistent homology in data visualization.[26]

Carlsson et al. have proposed a general method called MAPPER.[27] It inherits the idea of Serre that a covering preserves homotopy.[28] A generalized formulation of MAPPER is as follows:

Let X and Z be topological spaces and let f : X → Z be a continuous map. Let 𝕌 = {U_α}_{α∈A} be a finite open covering of Z. The output of MAPPER is the nerve of the pullback cover M(𝕌, f) := N(f⁻¹(𝕌)), where each preimage is split into its connected components.[26] This is a very general concept, of which the Reeb graph and merge trees are special cases.

This is not quite the original definition.[27] Carlsson et al. choose Z to be ℝ or ℝ², and cover it with open sets such that at most two intersect.[3] This restriction means that the output is in the form of a complex network. Because the topology of a finite point cloud is trivial, clustering methods (such as single linkage) are used to produce the analogue of connected sets in the preimage f⁻¹(U) when MAPPER is applied to actual data.

Mathematically speaking, MAPPER is a variation of the Reeb graph: if M(𝕌, f) is at most one-dimensional, then for each i ≥ 0 its homology admits an explicit description in terms of the cover.[29]

The added flexibility also has disadvantages. One problem is instability: a small change in the choice of the cover can lead to a major change in the output of the algorithm.[30] Work has been done to overcome this problem.[26]

Three successful applications of MAPPER can be found in Carlsson et al.[31] A comment on the applications in this paper by J. Curry is that "a common feature of interest in applications is the presence of flares or tendrils."[32]

A free implementation of MAPPER is available online written by Daniel Müllner and Aravindakshan Babu. MAPPER also forms the basis of Ayasdi's AI platform.

Multidimensional persistence is important to TDA. The concept arises in both theory and practice. The first investigation of multidimensional persistence was early in the development of TDA,[33] and is one of the founding papers of TDA.[9] The first application to appear in the literature is a method for shape comparison, similar to the invention of TDA.[34]

The definition of an n-dimensional persistence module indexed by ℝⁿ is:[32] a vector space V_s is assigned to each point s = (s₁, …, sₙ), and a linear map ρ_s^t : V_s → V_t is assigned whenever s ≤ t (componentwise), subject to ρ_s^s = id and ρ_t^u ρ_s^t = ρ_s^u whenever s ≤ t ≤ u.

It is worth noting that there is some controversy over the definition of multidimensional persistence.[32]

One of the advantages of one-dimensional persistence is its representability by a diagram or barcode. However, no discrete complete invariant of multidimensional persistence modules exists.[35] The main reason is that the structure of the collection of indecomposables is extremely complicated, by Gabriel's theorem in the theory of quiver representations,[36] even though a finitely generated n-dimensional persistence module can be uniquely decomposed into a direct sum of indecomposables by the Krull–Schmidt theorem.[37]

Nonetheless, many results have been established. Carlsson and Zomorodian introduced the rank invariant ρ_M(u, v), defined as ρ_M(u, v) = rank(x^{v−u} : M_u → M_v), in which M is a finitely generated n-graded module. In one dimension, it is equivalent to the barcode. In the literature, the rank invariant is often referred to as the persistent Betti numbers (PBNs).[17] In many theoretical works, authors have used a more restricted definition, an analogue from sublevel-set persistence. Specifically, the persistence Betti numbers of a function f : X → ℝᵏ are given by the function β_f : Δ⁺ → ℕ, taking each (u, v) ∈ Δ⁺ to β_f(u, v) := rank(H(X(f ≤ u)) → H(X(f ≤ v))), where Δ⁺ := {(u, v) ∈ ℝᵏ × ℝᵏ : u ≤ v} and X(f ≤ u) := {x ∈ X : f(x) ≤ u}.

Some basic properties include monotonicity and diagonal jump.[38] Persistent Betti numbers will be finite if X is a compact and locally contractible subspace of ℝⁿ.[39]

Using a foliation method, the k-dimensional PBNs can be decomposed into a family of one-dimensional PBNs by dimensionality reduction.[40] This method has also led to a proof that multidimensional PBNs are stable.[41] The discontinuities of PBNs occur only at points (u, v) (with u ≤ v) where either u is a discontinuity point of ρ_M(⋆, v) or v is a discontinuity point of ρ(u, ⋆), under the assumption that f ∈ C⁰(X, ℝᵏ) and X is a compact, triangulable topological space.[42]

Persistence space, a generalization of the persistence diagram, is defined as the multiset of all points with multiplicity larger than 0, together with the diagonal.[43] It provides a stable and complete representation of PBNs. Ongoing work by Carlsson et al. is trying to give a geometric interpretation of persistent homology, which might provide insight into how to combine machine learning with topological data analysis.[44]

The first practical algorithm to compute multidimensional persistence was invented very early.[45] Since then, many other algorithms have been proposed, based on concepts such as discrete Morse theory[46] and finite-sample estimation.[47]

The nonzero maps in a persistence module are restricted by the preorder relationship in the category. However, mathematicians have found that unanimity of direction is not essential to many results. "The philosophical point is that the decomposition theory of graph representations is somewhat independent of the orientation of the graph edges."[48] Zigzag persistence is important to the theoretical side. The examples given in Carlsson's review paper to illustrate the importance of functoriality all share some of its features.[3]

It is natural to extend persistent homology to other basic concepts in algebraic topology, such as cohomology and relative homology/cohomology.[50] An interesting application is the computation of circular coordinates for a data set via the first persistent cohomology group.[51]

Normal persistent homology studies real-valued functions. The circle-valued map might be useful: "persistence theory for circle-valued maps promises to play the role for some vector fields as does the standard persistence theory for scalar fields", as commented in D. Burghelea et al.[52] The main difference is that Jordan cells (very similar in format to the Jordan blocks of linear algebra) are nontrivial for circle-valued functions, whereas they would be zero in the real-valued case; combining Jordan cells with barcodes gives the invariants of a tame map, under moderate conditions.[52]

Two techniques they use are Morse–Novikov theory[53] and graph representation theory.[54] More recent results can be found in D. Burghelea et al.[55] For example, the tameness requirement can be replaced by the much weaker condition of continuity.

The proof of the structure theorem relies on the base domain being a field, so not many attempts have been made on persistent homology with torsion. Frosini defined a pseudometric on this specific module and proved its stability.[56] One of its novelties is that it does not depend on a classification theory to define the metric.[57]

One advantage of category theory is its ability to lift concrete results to a higher level, showing relationships between seemingly unconnected objects. Bubenik et al.[58] offer a short introduction to category theory tailored to TDA.

Category theory is the language of modern algebra, and has been widely used in the study of algebraic geometry and topology. It has been noted that "the key observation of [9] is that the persistence diagram produced by [7] depends only on the algebraic structure carried by this diagram."[59] The use of category theory in TDA has proved to be fruitful.[58][59]

Following the notation of Bubenik et al.,[59] the indexing category P is any preordered set (not necessarily ℕ or ℝ), the target category D is any category (instead of the commonly used Vect_𝔽), and functors P → D are called generalized persistence modules in D over P.

One advantage of using category theory in TDA is a clearer understanding of concepts and the discovery of new relationships between proofs. Take two examples for illustration. The understanding of the correspondence between interleaving and matching is of huge importance, since matching was the method used in the beginning (modified from Morse theory). A summary of this work can be found in Vin de Silva et al.[60] Many theorems can be proved much more easily in this more intuitive setting.[57] Another example is the relationship between the construction of different complexes from point clouds. It has long been noticed that the Čech and Vietoris-Rips complexes are related; specifically, V_r(X) ⊂ C_{√2 r}(X) ⊂ V_{2r}(X).[61] The essential relationship between the Čech and Rips complexes can be seen much more clearly in categorical language.[60]

The language of category theory also helps cast results in terms recognizable to the broader mathematical community. Bottleneck distance is widely used in TDA because of the results on stability with respect to the bottleneck distance.[10][14] In fact, the interleaving distance is the terminal object in a poset category of stable metrics on multidimensional persistence modules in a prime field.[57][62]

Sheaves, a central concept in modern algebraic geometry, are intrinsically related to category theory. Roughly speaking, sheaves are the mathematical tool for understanding how local information determines global information. Justin Curry regards level-set persistence as the study of fibers of continuous functions. The objects that he studies are very similar to those produced by MAPPER, but with sheaf theory as the theoretical foundation.[32] Although no breakthrough in the theory of TDA has yet used sheaf theory, it is promising, since there are many beautiful theorems in algebraic geometry relating to sheaf theory. For example, a natural theoretical question is whether different filtration methods result in the same output.[63]

Stability is of central importance to data analysis, since real data carry noise. Using category theory, Bubenik et al. have distinguished between soft and hard stability theorems, and proved that soft cases are formal.[59] Specifically, the general workflow of TDA is:

data ⟶F topological persistence module ⟶H algebraic persistence module ⟶J discrete invariant

The soft stability theorem asserts that HF{\displaystyle HF} is Lipschitz continuous, and the hard stability theorem asserts that J{\displaystyle J} is Lipschitz continuous.

Bottleneck distance is widely used in TDA. The isometry theorem asserts that the interleaving distance d_I is equal to the bottleneck distance.[57] Bubenik et al. have abstracted the definition to that between functors F, G : P → D when P is equipped with a sublinear projection or superlinear family, in which case it still remains a pseudometric.[59] Given the desirable properties of the interleaving distance,[64] we introduce here its general definition (instead of the one first introduced):[10] Let Γ, K ∈ Trans_P (the monotone functions from P to P satisfying x ≤ Γ(x) for all x ∈ P). A (Γ, K)-interleaving between F and G consists of natural transformations φ : F ⇒ GΓ and ψ : G ⇒ FK, such that (ψΓ)φ = Fη_{KΓ} and (φK)ψ = Gη_{ΓK}.

Let P be a preordered set with a sublinear projection or superlinear family. Let H : D → E be a functor between arbitrary categories D, E. Then for any two functors F, G : P → D, we have d_I(HF, HG) ≤ d_I(F, G).

The structure theorem is of central importance to TDA; as commented by G. Carlsson, "what makes homology useful as a discriminator between topological spaces is the fact that there is a classification theorem for finitely generated abelian groups."[3] (see the fundamental theorem of finitely generated abelian groups).

In general, not every persistence module can be decomposed into intervals.[65] Many attempts have been made at relaxing the restrictions of the original structure theorem. The case of pointwise finite-dimensional persistence modules indexed by a locally finite subset of ℝ is solved based on the work of Webb.[66] The most notable result was obtained by Crawley-Boevey, who solved the case of ℝ: Crawley-Boevey's theorem states that any pointwise finite-dimensional persistence module is a direct sum of interval modules.[67]

To understand the statement of this theorem, some concepts need introducing. An interval in (ℝ, ≤) is defined as a subset I ⊂ ℝ with the property that if r, t ∈ I and s ∈ ℝ satisfies r ≤ s ≤ t, then s ∈ I as well. An interval module k_I assigns the vector space k to each element s ∈ I and the zero vector space to elements of ℝ ∖ I. All maps ρ_s^t are the zero map, unless s, t ∈ I and s ≤ t, in which case ρ_s^t is the identity map.[32] Interval modules are indecomposable.[68]
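The structure maps of an interval module can be written down directly. The following sketch (our own notation, taking I = [1, 3) as an example with 1-dimensional pieces, so each map is a scalar) checks the composition law ρ_t^u ρ_s^t = ρ_s^u on a grid of indices:

```python
def interval_module(I_lo, I_hi, s, t):
    """Structure map rho_s^t of the interval module k_I for I = [I_lo, I_hi):
    the identity on k (here, the scalar 1) when both s and t lie in I,
    and the zero map (the scalar 0) otherwise."""
    assert s <= t
    in_I = lambda x: I_lo <= x < I_hi
    return 1 if in_I(s) and in_I(t) else 0

# Composition law rho_t^u * rho_s^t == rho_s^u for all s <= t <= u:
# it holds precisely because I is an interval (s, u in I forces t in I).
grid = [i / 2 for i in range(10)]
ok = all(interval_module(1, 3, t, u) * interval_module(1, 3, s, t)
         == interval_module(1, 3, s, u)
         for s in grid for t in grid for u in grid if s <= t <= u)
```

Had I not been an interval (say, two disjoint pieces glued into one summand), the check would fail, which is one way to see why intervals are the natural building blocks here.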

Although the result of Crawley-Boevey is a very powerful theorem, it still does not extend to the q-tame case.[65] A persistence module is q-tame if the rank of ρ_s^t is finite for all s < t. There are examples of q-tame persistence modules that fail to be pointwise finite.[69] However, a similar structure theorem still holds if the features that exist only at one index value are removed;[68] this holds because the infinite-dimensional parts at each index value do not persist, due to the finite-rank condition.[70] Formally, the observable category Ob is defined as Pers/Eph, in which Eph denotes the full subcategory of Pers whose objects are the ephemeral modules (ρ_s^t = 0 whenever s < t).[68]

Note that the extended results listed here do not apply to zigzag persistence, since the analogue of a zigzag persistence module over R{\displaystyle \mathbb {R} } is not immediately obvious.

Real data is always finite, and so its study requires us to take stochasticity into account. Statistical analysis gives us the ability to separate true features of the data from artifacts introduced by random noise. Persistent homology has no inherent mechanism to distinguish between low-probability features and high-probability features.

One way to apply statistics to topological data analysis is to study the statistical properties of topological features of point clouds. The study of random simplicial complexes offers some insight into statistical topology. K. Turner et al.[71] offer a summary of work in this vein.

Another way is to study probability distributions on the persistence space. The persistence space B_∞ is ∐_n B_n/∼, where B_n is the space of all barcodes containing exactly n intervals and the equivalences are {[x₁, y₁], [x₂, y₂], …, [xₙ, yₙ]} ∼ {[x₁, y₁], [x₂, y₂], …, [x_{n−1}, y_{n−1}]} if xₙ = yₙ.[72] This space is fairly complicated; for example, it is not complete under the bottleneck metric. The first attempt to study it was made by Y. Mileyko et al.,[73] who define the space of persistence diagrams D_p as the set of diagrams whose total degree-p distance to the diagonal is finite, D_p := {d : Σ_{x∈d} (inf_{y∈Δ} ‖x − y‖)^p < ∞}, where Δ is the diagonal line in ℝ². A nice property is that D_p is complete and separable in the Wasserstein metric W_p(u, v) = (inf_{γ∈Γ(u,v)} ∫_{X×X} ρ^p(x, y) dγ(x, y))^{1/p}. Expectation, variance, and conditional probability can be defined in the Fréchet sense. This allows many statistical tools to be ported to TDA. Works on null hypothesis significance tests,[74] confidence intervals,[75] and robust estimates[76] are notable steps.

Persistence landscapes, introduced by Peter Bubenik, are a different way to represent barcodes, more amenable to statistical analysis.[77] The persistence landscape of a persistence module M is defined as a function λ : ℕ × ℝ → ℝ̄, λ(k, t) := sup(m ≥ 0 | β^{t−m,t+m} ≥ k), where ℝ̄ denotes the extended real line and β^{a,b} = dim(im(M(a ≤ b))). The space of persistence landscapes is very nice: it inherits all the good properties of the barcode representation (stability, easy representation, etc.), but statistical quantities can be readily defined, and some problems in Y. Mileyko et al.'s work, such as the non-uniqueness of expectations,[73] can be overcome. Effective algorithms for computation with persistence landscapes are available.[78] Another approach is to use revised persistence, namely image, kernel and cokernel persistence.[79]
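For an interval-decomposable module with barcode {(b_j, d_j)}, the landscape λ(k, t) works out to the k-th largest value of max(0, min(t − b_j, d_j − t)), i.e. the k-th largest of the "tent" functions raised over the bars. A short sketch (our own function names, illustrative only):

```python
def landscape(barcode, k, t):
    """Persistence landscape lambda(k, t): the k-th largest (1-indexed) value of
    the tent functions min(t - b, d - t), clipped at 0, over the bars (b, d)."""
    tents = sorted((max(0.0, min(t - b, d - t)) for b, d in barcode), reverse=True)
    return tents[k - 1] if k <= len(tents) else 0.0

bars = [(0.0, 4.0), (1.0, 3.0)]
# The first landscape peaks at the midpoint of the longest bar: min(2, 2) = 2.
# The second landscape sees the shorter bar's tent there: min(2 - 1, 3 - 2) = 1.
```

Because λ(k, ·) is a function ℝ → ℝ for each k, pointwise means and other statistics of a sample of landscapes are straightforward to define, which is exactly what makes this representation statistically convenient.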

One direction is the study of homological invariants of individual data sets; the other is the use of homological invariants in the study of databases where the data points themselves have geometric structure.

Ayasdi is a data analysis company relying heavily on TDA, cofounded by a number of leading researchers in the field. Several features of recent applications of TDA are notable:

Combining tools from several branches of mathematics. Besides the obvious need for algebra and topology, partial differential equations,[104] algebraic geometry,[35] representation theory,[48] statistics, combinatorics, and Riemannian geometry[70] have all found use in TDA.

Quantitative analysis. Topology is considered to be very soft, since many concepts are invariant under homotopy. However, persistent topology is able to record the birth (appearance) and death (disappearance) of topological features, so extra geometric information is embedded in it. One piece of theoretical evidence is a partially positive result on the uniqueness of reconstruction of curves;[105] two applications are the quantitative analysis of fullerene stability and the quantitative analysis of self-similarity.[98][106]

The role of short persistence. Short persistence has also been found to be useful, despite the common belief that such short-lived features are merely noise.[107] This is of interest for the mathematical theory.

One of the main fields of data analysis today is machine learning. Some examples of machine learning in TDA can be found in Adcock et al.[108] A conference is dedicated to the link between TDA and machine learning. In order to apply tools from machine learning, the information obtained from TDA should be represented in vector form. An ongoing and promising attempt is the persistence landscape discussed above. Another attempt uses the concept of persistence images.[109] However, one problem with this method is the loss of stability, since the hard stability theorem depends on the barcode representation.

Topological data analysis and persistent homology have had impacts on Morse theory, which has played a very important role in the theory of TDA, including its computation. Some work in persistent homology has extended results about Morse functions to tame functions or even to continuous functions. A forgotten result of R. Deheuvels, long predating the invention of persistent homology, extends Morse theory to all continuous functions.[110]

One recent result is that the category of Reeb graphs is equivalent to a particular class of cosheaves.[111] This is motivated by theoretical work in TDA, since the Reeb graph is related to Morse theory and MAPPER is derived from it. The proof of this theorem relies on the interleaving distance.
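On barcodes, the interleaving distance agrees with the bottleneck distance between persistence diagrams (the isometry theorem). A brute-force sketch for tiny diagrams follows; it is illustrative only (real implementations use efficient matching algorithms), and the convention that each point may instead be matched to its projection onto the diagonal is the standard one.

```python
import itertools

def bottleneck(diag_a, diag_b):
    """Brute-force bottleneck distance between two tiny persistence
    diagrams.  Each point may be matched either to a point of the
    other diagram or to its own projection onto the diagonal; the
    cost of a matched pair is the l-infinity distance, and two
    diagonal projections match at cost 0.  Exponential in the number
    of points: for illustration only."""
    def diag_proj(p):
        m = (p[0] + p[1]) / 2
        return (m, m)
    # Augment each side with diagonal projections of the other's points.
    # The boolean flag marks diagonal (dummy) points.
    a = [(p, False) for p in diag_a] + [(diag_proj(q), True) for q in diag_b]
    b = [(q, False) for q in diag_b] + [(diag_proj(p), True) for p in diag_a]

    def cost(p, q):
        if p[1] and q[1]:                 # diagonal matched to diagonal
            return 0.0
        return max(abs(p[0][0] - q[0][0]), abs(p[0][1] - q[0][1]))

    best = float("inf")
    for perm in itertools.permutations(range(len(b))):
        worst = max(cost(a[i], b[j]) for i, j in enumerate(perm))
        best = min(best, worst)
    return best

d = bottleneck([(0.0, 1.0)], [(0.0, 1.1)])
print(d)   # close to 0.1: the bars differ only in death time
```

Stability theorems in TDA are phrased in terms of this distance: small perturbations of the input produce diagrams that are close in bottleneck distance.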

Persistent homology is closely related to spectral sequences.[112] Zigzag persistence may prove to be of theoretical importance to spectral sequences.


