Clustering KNN Encodings

In this article, we are going to use the partitioner to cluster data points which have been encoded using a K-Nearest Neighbors (KNN) encoder. First we outline how a KNN encoding is achieved before jumping into a number of examples.

KNN Encoding

The method of K nearest neighbors encoding creates spike patterns for multi-dimensional points by assigning active spikes to its spike pattern for each of its k nearest points. Spike patterns for a point are thus relative to the distribution from which it is drawn.

In practice, we don’t use all the points from the distribution but instead define a fixed library size and choose k from this library.

A KNN encoding is heavily dependent on the “measure of distance” we decide to use. We should set this measure to suit our needs. For instance, some spatial measure, such a euclidean distance, creates similar spike patterns for points which are spatially close together, likewise, for discrete variables, such as for bags of words in text classification, another metric can be used, such as the overlap metric (or Hamming distance). Or in the context of gene expression microarray data, correlation coefficients such as Pearson and Spearman.

The Knowm API comes with a built in KNN spike encoder which encodes floating point values from either L1 (Manhattan Distance), L2 (Euclidean Distance), and Cosine distances.

Java

1

publicclassKNNEncoder implementsEncoder

We can instantiate this encoder by defining k, the library_size, and our choice of measure.

Java

1

2

3

4

5

6

7

8

9

10

11

12

// Instantiation

intk=5;

intlibrary_size=100;

KNNEncoder knnEncoder=newKNNEncoder(k,library_size,Measure.L2);

...

// Encode first 100 points to fill library

...

// Encode

float[]point=newfloat[]{0.1f,0.2f,0.3f,0.4f,....};

int[]spikes=knnEncoder.encoder(point);

KNN adaption

The KNN library stores the first library_size number of points that it encodes directly into the library. Afterward, new values replace their closest neighbor in the library.

Java

1

2

// Adapt library to distribution

library.get(0).center=value;// library is sorted by distance to value

When the distribution is stationary and there are few outliers, this causes the library points to fit the shape of the distribution. Otherwise, we may find the library begins to fill up with non-informationally important points. Consider a non-stationary distribution, for instance, which moves linearly in one direction, the KNN library will continually replace those that are closest to the distribution, while points that are farther away will remain and trail behind the distribution.

As an initial implementation, these problems are remedied if the distribution is limited to a finite space, and the library is sufficiently large to cover the varying positions the distribtion moves through.

KNN Clustering

The spike patterns returned by the KNN encoder do not represent clusters yet. To do this, we need to train our partitioner class on the KNN spike stream. Over time, the partitioner will converge on regularities in the spike stream and output unique spike patterns for objects from within the same cluster. These spike patterns can be converted directly into a spike label by reading each spike pattern as a binary number.

Java

1

2

3

4

5

6

7

float[]dataPoint=newfloat[]{0.1,0.2,0.3,.....,0.n};

int[]knnSpikes=knnEncoder.encode(dataPoint);

int[]partitionerSpikes=partitioner.encode(knnSpikes);

intlabel=binaryEncoder.encode(partitionerSpikes)[0];

The above process is encapsulated by a Float_Vector_KNN_Partitioner. We can use its encode() method to create our cluster labels as we encode data points from our distribution.

Clustering App

We’ve built a generic clustering suite ClusteringPanel2d.java which we can use to build and test our clustering algorithms on two-dimensional data distributions. This wrapper is responsible for plotting and coloring our points. It extends a JPanel object which we can pass to our animation frame when we start the App.

The following class uses this wrapper to build and cluster a simple distribution:

This distribution is composed of three Gaussian clusters. Each is defined using a Blod2d.java class. We pass these objects to ClusteringPanel2D when getBlobs() is called.

Java

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

privatefinalList<Blob2d>blobCenters=newArrayList<Blob2d>();

....

/**

* Constructor

*/

publicMickyMouse2d(){

for(inti=0;i<5;i++){

blobCenters.add(newBlob2d(0,0,.1f));

blobCenters.add(newBlob2d(.5f,-.5f,.05f));

blobCenters.add(newBlob2d(-.5f,-.5f,.05f));

}

//And pass them to the 2-D panel by overriding getBlobs

@Override

publicList<Blob2d>getBlobs(){

returnblobCenters;

}

Each Blob is instantiated by setting its frame-relative coordinates, with x in (-1, 1) and y in (-1, 1) and an additional radius parameter. The radius acts as a bound on each Gaussian cluster’s variance.

Clustering

We pass our clustering encoder to the clustering panel by overriding the getEncoder() method.

Results

Running the App creates our animation frame. Then, data points are pulled from our blob objects, given a cluster label from our Float_Vector_KNN_Partitioner object, and then finally added to the animation frame. Quickly these values will form visual clusters as the partitioner object adapts to the KNN regularities.

Knowm API- KNN clustering Example 1

Notice that some of the values between the clusters are being assigned unique colors. This is because these points are have unique nearest neighbor encodings – some points from both clusters.

Sparsity

We can reduce the number of unique clusters we see by increasing the number of neighbors we use to encode each example.

Java

1

floatsparsity=0.7f;// k = sparsity * library_size

Increasing the sparsity constant here will increase the number of common neighbors (K) points we choose per spike pattern. If the sparsity were equal to 1.0 then every point would have the same encoding: all points. As we add more points to each encoding we create more commonality between spike encodings. This will create a clustering convergence: fewer unique clusters.

If we decrease the sparsity to 0.1, our spike patterns will contain fewer common independent components. We see the opposite effect: a divergence.

Over time, KNN encoder adapts its library to contain points which span the distribution. As this happens a re-convergence to our usual groupings may occur.

Knowm API: Re-convergence

However, if we reduce the sparsity too far, points within clusters rarely share common neighbors and we find complete divergence.

Knowm API – Complete Divergence

Number of AHaH Nodes

Each AHaH node in our Partitioner object is attempting to maximize the margin between opposing components in our spike stream. Decision boundaries have binary output and thus a one node Paritioner will see two clusters.

Knowm API – Partitioner with one AHaH node

This single AHaH node has found a margin which separated the KNN encodings of objects in the head cluster and everything else. Only if we increase the number of AHaH nodes is it possible to cluster the three regions. Specifically a maximum of 2^n unique clusters is possible with n AHaH nodes. There is no guarantee we find 2^n unique clusters since the attractor states are not always unique. In practice, the more AHaH nodes we use the more likely we are to find all regularities in the data.

Using different KNN encodings

We could also change the way our KNN encodings measure the distance between objects. If we use cosine distance, for instance, then the similarity between points is measured by the angle between them.

Java

1

knnEncoder=newFloat_Vector_KNN(library_size,k,Measure.Cosine);

This creates a clustering effect similar to spherical k-means.

Knowm API – Sphirical Clustering

Many Clusters

We can increase the number of clusters and distribute them randomly across the panel by changing the blob objects when we instantiate the app. StaticRandom2d.java uses the following clusters:

Non-Gaussian Distributions

Each blob object reflects a gaussian distribution, however, by placing them at specific locations we can create a non-gaussian distribution. In the APP CurveDot2d.java we create a non-gaussian distribution as follows.

Dynamic Clustering

We’ve seen that clustering with the Knowm API is an adaptive procedure. We can harness this to build a dynamic clustering system. Our APP Sliding2D.java implements this idea. Here we implement Slideable and an additional method getSlideRate().

Java

1

publicclassSliding2d extendsClusteringPanel2d implementsSlideable

Our app now implements a getSlideRate() method this creates a lateral pixels/cycle movement for our distribution.

Java

1

2

3

4

@Override

publicfloatgetSlideRate(){

return.00005f;

}

These additions add movement to our blob clusters.

In the following short video, the sliding distribution moves from one side of the screen to the other, shortly being eclipsed by the middle distribution. The library size has been increased to 1000 points here.

The next video shows a full cycle across the screen, as the distribution moves laterally, our groupings adapt, converging when close together and diverging when farther apart.

In the KNN Adaption section, it was discussed how the method of library point-adoption could cause problems if the distribution was non-stationary and the library was not large enough to distribute points evenly. In the following video we have decreased the library size down to 250 points.

In the first few cycles across the screen, the moving distribution has trouble remaining uniform at the far left side of the screen. This is caused by a bunching of library points at that location, which was created when we first instantiated our distribution. Here, KNN encodings are highly diverse, we observe divergence. Over time, however, the continual return of the moving distribution causes the library to spread, and we eventually reach our desired result.

Discussion of Results

A K-nearest neighbors encoding scheme is paired with a Partitioner object and Binary encoder. These are used to form a simple adaptive clustering object we called the Float_Vector_KNN_Partitioner.

This object was capable of clustering a number of distributions. These are highlighted in demos: MickeyMouse2d.java, CurveDot2D.java, StaticRandom2D.java, and Sliding2d.java.

During this article, we explored the effect of the KNN encoding hyper-parameters K and library size through a sparsity constant. Increasing or decreasing the number of active spikes was shown to cause clustering convergence and divergence respectively.

A single AHaH node was used to demonstrate the effect of changing the size of our Partitioner object. Changing the number of AHaH nodes allowed us to see a single decision boundary formed in the input space.

Demo StaticRandom2D.java showed how clustering multiple distribution centers of variable size and density were possible – Although convergence and noisy colours appear in areas of uncertainty between clusters.

Demo CurveDot2D.java was used to demonstrate non-uniform cluster groupings. This demonstrates both that 1) non-linear decision boundaries are possible in the input space and 2) non-uniform clusters can be formed by a KNN encoding.

Demo Sliding2d.java was created with moving cluster centers via a simple horizontal slide across the frame. The adoption of AHaH nodes to the changing distribution is visually apparent.