
The Shape of Data

Data is a large part of our everyday lives. We take in all kinds of data every day: weather forecasts, political polls, meal costs, and much more. Companies collect data as well. Medical, consumer, weather, and other kinds of data have been collected for years. Recently, we have started learning how to use this data to model and predict future outcomes. This is a hot area in industry, known as big data and data science. One particular field of study is concerned with the shape of data.

The anthem of Topological Data Analysis (TDA) is that data has shape and that shape matters. We would like to take a data sample and describe the topological space it was sampled from. This will help us predict where new data may land. TDA has been used in many fields, such as medical imaging [1], sensor networks [2], sports analysis [3], disease progression [4], image analysis [5], signal analysis [6], and many others. In this post, we are just going to give the basic idea. Suppose we are given a data set that looks like this.

It seems obvious to the human eye that this data has been sampled from a circular object. This is because we are wired to recognize patterns, especially ones as easy as this one. But how could we get a computer to understand this pattern? This is where TDA comes in. Imagine that we begin growing balls around the points.

As the balls grow, they will intersect. When two balls intersect, we place a line segment (an edge) between their centers. When three balls intersect, we place a triangle. When four balls intersect, we place a tetrahedron, and so on.
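The rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the points live in the plane, the balls are closed, the sample (8 evenly spaced points on the unit circle) is made up for the example, and the names `enclosing_radius` and `cech_complex` are hypothetical. Two balls of radius r meet when their centers are at most 2r apart, and three balls share a common point when the smallest circle enclosing the three centers has radius at most r.

```python
from itertools import combinations
from math import cos, dist, pi, sin


def enclosing_radius(p, q, s):
    """Radius of the smallest circle containing three points in the plane."""
    a, b, c = sorted((dist(p, q), dist(q, s), dist(p, s)))
    if a * a + b * b <= c * c:
        # Right, obtuse, or degenerate triangle: the longest side is a diameter.
        return c / 2
    # Acute triangle: use the circumscribed circle (area from the cross product).
    area = abs((q[0] - p[0]) * (s[1] - p[1]) - (q[1] - p[1]) * (s[0] - p[0])) / 2
    return a * b * c / (4 * area)


def cech_complex(points, r):
    """Edges and triangles formed by closed balls of radius r around 2D points."""
    idx = range(len(points))
    # Two balls intersect when their centers are at most 2r apart.
    edges = [(i, j) for i, j in combinations(idx, 2)
             if dist(points[i], points[j]) <= 2 * r]
    # Three balls share a point when the centers fit in a circle of radius r.
    triangles = [(i, j, k) for i, j, k in combinations(idx, 3)
                 if enclosing_radius(points[i], points[j], points[k]) <= r]
    return edges, triangles


# Hypothetical sample: 8 evenly spaced points on the unit circle.
points = [(cos(2 * pi * k / 8), sin(2 * pi * k / 8)) for k in range(8)]
edges, triangles = cech_complex(points, 0.4)
print(len(edges), "edges,", len(triangles), "triangles")
```

With small balls we only get the edges between neighboring points, so the complex is a loop of edges. Triangles start to appear as the radius grows. Tetrahedra and higher simplices are left out here to keep the sketch short.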

Eventually, the balls will have grown enough to bound a gap.

As we continue growing the balls, the gap will eventually close. Beyond this point nothing changes topologically, so we can tell the computer to stop here. What we have created is called a filtration, which is simply an increasing chain of spaces. To capture the topological properties, we apply homology, which counts holes, to each space in the filtration. Then, more or less, we measure how long the holes last. The idea is that the longer-lasting holes are more important to the topological properties of the space the data was sampled from. This process is aptly called persistent homology. There are, of course, some fine details excluded from this summary, especially the fact that TDA neither begins nor ends at persistent homology. If you would like to know more, please check out some of the references I am leaving at the bottom. I will be making a post (or series of posts) soon that will go a little deeper into the theory.
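To make "count the holes at each step" concrete, here is a rough sketch in plain Python. It is not full persistent homology, and several things in it are assumptions made for the example: the sample is 8 evenly spaced points on the unit circle, the helper names `rank_gf2` and `betti_numbers` are hypothetical, and the complex is the Vietoris-Rips simplification (a triangle is added as soon as its three edges are present) rather than the exact ball-intersection rule. The `scale` parameter is the distance between centers, that is, twice the ball radius. Homology is computed with coefficients in Z/2, which is enough to count pieces (b0) and holes (b1).

```python
from itertools import combinations
from math import cos, dist, pi, sin


def rank_gf2(rows):
    """Rank over Z/2 of a matrix whose rows are integers used as bit vectors."""
    pivots = {}  # leading bit position -> reduced row with that leading bit
    for row in rows:
        while row:
            high = row.bit_length() - 1
            if high not in pivots:
                pivots[high] = row
                break
            row ^= pivots[high]  # cancel the leading bit and keep reducing
    return len(pivots)


def betti_numbers(points, scale):
    """(b0, b1) of the Vietoris-Rips complex at the given distance scale."""
    n = len(points)
    edges = [e for e in combinations(range(n), 2)
             if dist(points[e[0]], points[e[1]]) <= scale]
    edge_index = {e: i for i, e in enumerate(edges)}
    triangles = [t for t in combinations(range(n), 3)
                 if all(pair in edge_index for pair in combinations(t, 2))]
    # Boundary of an edge: its two endpoints. Boundary of a triangle: its edges.
    d1 = [(1 << i) | (1 << j) for i, j in edges]
    d2 = [sum(1 << edge_index[pair] for pair in combinations(t, 2))
          for t in triangles]
    rank_d1, rank_d2 = rank_gf2(d1), rank_gf2(d2)
    b0 = n - rank_d1                    # connected pieces
    b1 = len(edges) - rank_d1 - rank_d2  # holes (independent loops not filled in)
    return b0, b1


# Hypothetical sample and a hand-picked list of scales (the filtration steps).
points = [(cos(2 * pi * k / 8), sin(2 * pi * k / 8)) for k in range(8)]
scales = [0.5, 0.8, 1.5, 1.9]
history = [(s, *betti_numbers(points, s)) for s in scales]
alive = [s for s, _, b1 in history if b1 > 0]  # scales at which a hole exists
for s, b0, b1 in history:
    print(f"scale {s}: {b0} piece(s), {b1} hole(s)")
```

The list `alive` shows the range of scales over which a hole is present, which is a crude stand-in for its lifetime. Real persistent homology also tracks which hole is which from one scale to the next and reports an exact birth and death for each; libraries such as GUDHI or Ripser are built for that.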