Spectral Graph Partitioning and the Laplacian with Matlab

Finding a Partition

In this segment, we'll plant a partition in a graph, and then use the second smallest eigenvector to find it.

As always, the first step is to generate our dataset. In this example, we'll be a little more ambitious
and use a larger number of vertices.

>> n = 1000;

To plant an artificial partition in our dataset, we need to determine the vertices in each of the groups.
To do this, we randomly generate a permutation of all the vertices and select one set as group 1
and the rest as group 2. The variable gs controls the size of the first group.
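That step might look something like the following sketch (the value of gs and the group index
variable names are assumptions, not the original code):

>> gs = 450;              % size of the planted group (value assumed)
>> p = randperm(n);       % random permutation of the vertices
>> group1 = p(1:gs);      % the first gs vertices form group 1
>> group2 = p(gs+1:end);  % the rest form group 2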

Now, we need to decide on the probabilities of edges within each group and between the two groups. Because we are
planting a partition, the probabilities of edges between the groups should be much lower than the probability of
edges within each group.
Suppose that group 1 is a little more tightly connected than group 2. (Please insert your own amusing names
for an actual identification of group 1 and group 2, e.g. politicians and mathematicians.)

>> p_group1 = 0.5;
>> p_group2 = 0.4;
>> p_between = 0.1;

With these probabilities in hand, we can construct our graph. The last few operations symmetrize the matrix.
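One way this construction could go (a sketch, not necessarily the original code; it assumes the
group1 and group2 index vectors from the permutation step above):

>> A = rand(n) < p_between;                   % between-group probability everywhere to start
>> A(group1, group1) = rand(gs) < p_group1;   % overwrite the two within-group blocks
>> A(group2, group2) = rand(n-gs) < p_group2;
>> A = triu(A, 1);                            % keep only the strict upper triangle...
>> A = A + A';                                % ...and symmetrize; the diagonal stays zero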

The large second smallest eigenvalue suggests that we
shouldn't expect to find any very good cuts in our dataset.
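(The eigenvector matrix V used below comes from the graph Laplacian; here is a sketch of how it
might be computed, using a dense eigensolver, which is fine for a graph this small:)

>> L = diag(sum(A)) - A;   % combinatorial Laplacian: degree matrix minus adjacency
>> [V, D] = eig(full(L));  % for symmetric L, eigenvalues come back in ascending order
>> D(2,2)                  % the second smallest eigenvalue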
To see what we found, let's plot the second smallest eigenvector V(:,2).

>> plot(V(:,2), '.-');

That isn't all that helpful. Maybe sorting the vector...

>> plot(sort(V(:,2)), '.-');

Now, that picture is much more informative! We see a large gap in the middle of the values.
Interestingly enough, the number of points to the right of the gap is the same as gs, the size of
our planted group. Let's see what happens when we permute the vertices of the graph
into this ordering.

>> [ignore p] = sort(V(:,2));
>> spy(A(p,p));

Looks like we found our partitions!

Meaningful Partitions of Real Datasets

The previous example was rather compelling. However, it was also fake. We
planted a partition there! How do we know such things occur in real data?
In the next two sections, we'll see what happens when we try the same procedure
on two real matrices. Unfortunately, the data for these sets is private,
so it's difficult to "follow along" with the Matlab. Nevertheless,
the pictures should tell the story.

The dataset for this section is a bipartite graph
from Yahoo! Search Marketing (formerly known as Overture).
The two types of nodes are advertisers and keywords. Advertisers
place bids on keywords.

The functions readSMAT and readList are my own functions
for reading in datafiles from disk.

Here, the second smallest eigenvalue is significantly smaller than in the first dataset.
This means that we expect to find some fairly small cuts or, equivalently, some rather tight clusters.
In the previous test, we found that plotting the sorted eigenvector was most informative.

>> plot(sort(V(:,2)), '.-');

Generalizing liberally from our previous results, the two large gaps in the
sorted eigenvector suggest that we'd expect three large groups in our dataset.

>> [ignore p] = sort(V(:,2));
>> spy(B(p,p));

As expected, we see three large groups. For this graph, we can
validate the identity of each group using the terms associated
with some of the vertices.

Because the graph is bipartite, we need to find the
terms that correspond to the vertices in the "tight" groups
in the lower right. The first 3000 vertices in
the graph are the term vertices, so they are the ones that carry term labels.

While this shows us a little about the data, it isn't all that much.
Presumably, there is more structure in the rest of the dataset besides
just one gospel group of artists.

To expose the remainder of the structure, we apply the second
smallest eigenvector recursively. That is, use the second smallest
eigenvector of the full graph to determine a good way to split the
graph into two pieces (e.g. the gospel cluster and the rest of the data)
and then repeat the process on each subgraph.
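A simplified Matlab version of this recursion might look like the following sketch. Unlike the
optimized C program, it splits at the median of the spectral ordering rather than hunting for the
best gap, and the function name is my own invention:

function p = recurse_fiedler(A)
% RECURSE_FIEDLER  Order vertices by recursive spectral bisection (a sketch).
n = size(A,1);
if n <= 2, p = (1:n)'; return; end
L = diag(sum(A)) - A;               % Laplacian of the current subgraph
[V, ignore] = eig(full(L));         % eigenvectors, ascending eigenvalues
[ignore, s] = sort(V(:,2));         % order by the second smallest eigenvector
half = floor(n/2);
left = s(1:half);                   % split the spectral ordering in half...
right = s(half+1:end);
pl = recurse_fiedler(A(left,left)); % ...recurse on each piece...
pr = recurse_fiedler(A(right,right));
p = [left(pl); right(pr)];          % ...and concatenate the orderings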

If we do this for the entire dataset using
an optimized program written in C, we get the following
adjacency matrix. (The file "au-out.rperm" is the
final permutation produced by the recursive program.)

>> p = load('au-out.rperm')+1;
>> spy(A(p,p))

In the new ordering, related artists should always
be close by. Let's check this hypothesis for
Metallica (mainstream rock), Britney Spears
(teen pop), and DJ Tiesto (techno).