1st Semester 2012/13: Complexity of Complex Networks

The graph, for all its simplicity, is a powerful method of representation. It allows us to bring complicated phenomena into the realm of mathematics and computation. We can represent any object and relation in our field of study without having to massage our data. Our brains are graphs of neurons and axons. Our thoughts deal with graphs of objects and relations. Our bodies regulate themselves through complicated protein networks, and we interact with each other through vast social and technological networks. But all that power comes at a price. For tabular data, we can find a mean. We can assume a distribution and calculate sufficient statistics. We can find optimal models and bound our certainty with significance estimates. For complex networks, we are in a hairier position. Is there such a thing as an average of a series of networks? Or a variance? What if we only have one example of a given type of network? Can we say anything about the process that generated it? For numeric, tabular and sequential data, we have methods to answer all these questions. For network data, all we have is a variety of metrics and a few algorithms for generating random graphs.

Minimum Description Length (MDL) is a statistical principle which holds that any method that compresses a dataset must have found some meaningful structure in it. That is, learning and compression are equivalent. MDL makes no assumptions about the nature of the data, only that the compressor must be a computable method. This allows us to use the methods of computer science to perform statistics on data in whatever form it comes.
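The idea can be seen in miniature with an off-the-shelf compressor standing in for a computable model (a crude sketch of our own, not a method from the literature): data with a pattern gets a much shorter description than patternless data of the same length.

```python
import random
import zlib

random.seed(0)

n = 4096
# A byte string with an obvious repeating pattern.
structured = bytes(i % 16 for i in range(n))
# A byte string of uniform noise: no structure for the compressor to find.
patternless = bytes(random.randrange(256) for _ in range(n))

len_structured = len(zlib.compress(structured, 9))
len_patternless = len(zlib.compress(patternless, 9))

# The compressor "learned" the pattern: a far shorter description.
print(len_structured, len_patternless)
```

In MDL terms, the compressed length is the description length of the data under the model implicit in the compressor; the shorter it is, the more structure the model has captured.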

Most MDL techniques are designed for sequential data, which makes them unsuitable for graphs. If we assume that the ordering of nodes in our data file is arbitrary, then any structure a simple sequential model finds will disappear when we replace that ordering with another. If we want to perform statistics on the meaningful structure of the graph, we need models that analyse our data at that level.
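A small experiment (our own illustration, with an arbitrary edge-list encoding) makes the point concrete: the same path graph, written out under two node labellings, compresses very differently under a sequential compressor, even though the two files describe isomorphic graphs.

```python
import random
import zlib

random.seed(0)

n = 500
# A path graph: edges 0-1, 1-2, ..., under its natural node ordering.
edges = [(i, i + 1) for i in range(n)]

def edge_list_bytes(edge_list):
    """Serialise an edge list as plain text, one edge per line."""
    return "\n".join(f"{u} {v}" for u, v in edge_list).encode()

natural = edge_list_bytes(edges)

# The same graph after a random relabelling of the nodes: the textual
# regularity that zlib exploited is gone, but the graph is unchanged.
perm = list(range(n + 1))
random.shuffle(perm)
relabelled = edge_list_bytes([(perm[u], perm[v]) for u, v in edges])

c_natural = len(zlib.compress(natural, 9))
c_relabelled = len(zlib.compress(relabelled, 9))
print(c_natural, c_relabelled)
```

The compression gain on the first file reflects an accident of the node ordering, not a property of the graph; a graph-level model should assign both files the same description length.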

For this project we want to look at simple models. Our aim is not to find deep and meaningful structure in data, but to validate the basic principle of (crude) MDL methods on graph data. The main model we are interested in investigating is the common subgraph. If we store a subgraph that occurs often in our data separately, and replace each occurrence by a special node, it should compress the graph, and thus provide some statistical insight into our data.
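A back-of-the-envelope calculation shows why this should work. Assume a crude encoding of our own choosing (not the scheme from the literature): an edge list over V labelled nodes costs 2·log2(V) bits per edge. For a graph built from many copies of a small motif, storing the motif once and replacing each occurrence by a special node shortens the description considerably.

```python
import math

def edge_list_bits(num_edges, num_nodes):
    """Bits to encode num_edges edges over num_nodes labelled nodes,
    at 2 * ceil(log2(num_nodes)) bits per edge (a crude code)."""
    return num_edges * 2 * math.ceil(math.log2(num_nodes))

# A toy graph: 50 triangle motifs, chained together by one bridge edge each.
motifs = 50
V = 3 * motifs                  # 3 nodes per triangle
E = 3 * motifs + (motifs - 1)   # 3 edges per triangle, plus the bridges

baseline = edge_list_bits(E, V)

# Two-part code: store the triangle once, then the quotient graph in
# which each triangle is collapsed to a single special node.
dictionary = edge_list_bits(3, 3)                  # the triangle itself
quotient = edge_list_bits(motifs - 1, motifs)      # bridges between motifs
compressed = dictionary + quotient

print(baseline, compressed)
```

The gap between the two description lengths is exactly the kind of compression the project would use as statistical evidence that the motif is a real structure in the data rather than noise.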

Prerequisites

We are looking for students with good programming skills and an interest in complexity science. Language choice is open, but Python will likely be the best choice, because of available libraries.

References

S.H. Strogatz. Exploring Complex Networks. Nature, 2001. A good introduction to some of the cornerstones of complex network research.

P. Grünwald. A Tutorial Introduction to the Minimum Description Length Principle, 2005. A good introduction to the basics of MDL. Overly technical for this project, but the fundamentals are intuitively explained.

L.B. Holder, D.J. Cook, J. Gonzalez, and I. Jonyer. Structural Pattern Recognition in Graphs. Pattern Recognition and String Matching, 2002. A description of the basic algorithm for frequent subgraph discovery that we would like to use.

R. Cilibrasi and P. Vitányi. Clustering by Compression. IEEE Transactions on Information Theory, 2005. A slightly different approach to statistics through compression that is nevertheless very effective.