CS 251: Assignment #7

Clustering

Due Monday 10 April 2017

The goal of this week's project is to add basic K-means clustering
analysis to your program. During the lab you will implement your own
version of K-means clustering.

Tasks

You will implement two versions of K-means clustering. One will use
the built-in Numpy functions, one you will write yourself. You can use
the Numpy version as a reference while debugging (and as a potentially
faster version for very large data sets.

Add a function to your analysis module that uses Numpy's built-in
k-means capabilities. It should take as arguments a data set, the
headers to use for clustering, and the number of clusters to
create. It will return the cluster means, the cluster ID of each
point, and the representation error.

import scipy.cluster.vq as vq

Write the following kmeans_numpy function.

def kmeans_numpy( d, headers, K, whiten = True):
'''Takes in a Data object, a set of headers, and the number of clusters to create
Computes and returns the codebook, codes, and representation error.
'''
# assign to A the result of getting the data from your Data object
# assign to W the result of calling vq.whiten on A
# assign to codebook, bookerror the result of calling vq.kmeans with W and K
# assign to codes, error the result of calling vq.vq with W and the codebook
# return codebook, codes, and error

Now write your own K-means function. The basic K-means algorithm is as follows.

initialize a set of K mean vectors
loop
identify the closest cluster mean for each data point
compute a new set of cluster means
calculate the error between the old and new cluster means
if the error is small enough, break
identify the closest cluster mean for each data point
return the new cluster means and the cluster IDs and distances for each data point

To make the above function easier, write two helper functions:
kmeans_init and kmeans_classify.

The kmeans_init should take in the data, the number of
clusters K, and an optional set of categories (cluster labels
for each data point) and return a numpy matrix with K rows,
each one repesenting a cluster mean. If no categories are
given, a simple way to select the means is to randomly choose
K data points (rows of the data matrix) to be the first K
cluster means. You want to avoid have two of the initial means
being the same.

If you are given an Nx1 matrix of categories/labels, then
compute the mean values of each category and return those as
the initial set of means. You can assume the categories
given in the data are zero-indexed and range from 0 to K-1.

The kmeans_classify should take in the data and cluster means
and return a list or matrix (your choice) of ID values and
distances. The IDs should be the index of the closest cluster
mean to the data point. The default distance metric should be
sum-squared distance to the nearest cluster mean.

Once you have the above functions written, then write the top-level
kmeans function, which should be similar to the numpy version
from step 1.

def kmeans(d, headers, K, whiten=True, categories = ''):
'''Takes in a Data object, a set of headers, and the number of clusters to create
Computes and returns the codebook, codes and representation errors.
If given an Nx1 matrix of categories, it uses the category labels
to calculate the initial cluster means.
'''
# assign to A the result getting the data given the headers
# if whiten is True
# assign to W the result of calling vq.whiten on the data
# else
# assign to W the matrix A
# assign to codebook the result of calling kmeans_init with W, K, and categories
# assign to codebook, codes, errors, the result of calling kmeans_algorithm with W and codebook
# return the codebook, codes, and representation error

Test your kmeans function using this
file. Your results should be pretty much identical to the
numpy version.

Note that if you whiten your data before clustering, then the
cluster means will be in the whitened coordinate system. The
scipy whiten function divides each column by its standard
deviation (according to the documentation it does not subtract
off the mean first. Therefore, if you want to plot those
cluster means in the original data space you need to do the
following for each dimension, where c_i is the cluster mean in
the original data space for dimension i, cw_i is the whitened
cluster mean, and s_i is the standard deviation.

c_i = c_w*s_i

Note that, before whitening, you may want to subtract the
original data mean and store it. In that case, you need to add
the data mean to the right side of the expression above to get
the clusters in the original data space.