The Gaussian EM clusterer models the vectors as being produced by
a mixture of k Gaussian sources. The parameters of these sources
(prior probability, mean and covariance matrix) are then found to
maximise the likelihood of the given data. This is done with the
expectation maximisation algorithm. It starts with k arbitrarily
chosen means, priors and covariance matrices. It then calculates
the membership probabilities for each vector in each of the
clusters; this is the ‘E’ step. The cluster parameters are then
updated in the ‘M’ step using the maximum likelihood estimate from
the cluster membership probabilities. This process continues until
the likelihood of the data does not significantly increase.

The Group Average Agglomerative starts with each of the N vectors as singleton
clusters. It then iteratively merges pairs of clusters which have the
closest centroids. This continues until there is only one cluster. The
order of merges gives rise to a dendrogram: a tree with the earlier merges
lower than later merges. The membership of a given number of clusters c, 1
<= c <= N, can be found by cutting the dendrogram at depth c.

This clusterer uses the cosine similarity metric only, which allows for
efficient speed-up in the clustering process.

The K-means clusterer starts with k arbitrary chosen means then allocates
each vector to the cluster with the closest mean. It then recalculates the
means of each cluster as the centroid of the vectors in the cluster. This
process repeats until the cluster memberships stabilise. This is a
hill-climbing algorithm which may converge to a local maximum. Hence the
clustering is often repeated with random initial means and the most
commonly occurring output means are chosen.

Represents a dendrogram, a tree with a specified branching order. This
must be initialised with the leaf items, then iteratively call merge for
each branch. This class constructs a tree representing the order of calls
to the merge function.

This module contains a number of basic clustering algorithms. Clustering
describes the task of discovering groups of similar items with a large
collection. It is also describe as unsupervised machine learning, as the data
from which it learns is unannotated with class information, as is the case for
supervised learning. Annotated data is difficult and expensive to obtain in
the quantities required for the majority of supervised learning algorithms.
This problem, the knowledge acquisition bottleneck, is common to most natural
language processing tasks, thus fueling the need for quality unsupervised
approaches.

This module contains a k-means clusterer, E-M clusterer and a group average
agglomerative clusterer (GAAC). All these clusterers involve finding good
cluster groupings for a set of vectors in multi-dimensional space.

The K-means clusterer starts with k arbitrary chosen means then allocates each
vector to the cluster with the closest mean. It then recalculates the means of
each cluster as the centroid of the vectors in the cluster. This process
repeats until the cluster memberships stabilise. This is a hill-climbing
algorithm which may converge to a local maximum. Hence the clustering is
often repeated with random initial means and the most commonly occurring
output means are chosen.

The GAAC clusterer starts with each of the N vectors as singleton clusters.
It then iteratively merges pairs of clusters which have the closest centroids.
This continues until there is only one cluster. The order of merges gives rise
to a dendrogram - a tree with the earlier merges lower than later merges. The
membership of a given number of clusters c, 1 <= c <= N, can be found by
cutting the dendrogram at depth c.

The Gaussian EM clusterer models the vectors as being produced by a mixture
of k Gaussian sources. The parameters of these sources (prior probability,
mean and covariance matrix) are then found to maximise the likelihood of the
given data. This is done with the expectation maximisation algorithm. It
starts with k arbitrarily chosen means, priors and covariance matrices. It
then calculates the membership probabilities for each vector in each of the
clusters - this is the ‘E’ step. The cluster parameters are then updated in
the ‘M’ step using the maximum likelihood estimate from the cluster membership
probabilities. This process continues until the likelihood of the data does
not significantly increase.

They all extend the ClusterI interface which defines common operations
available with each clusterer. These operations include.

The current existing classifiers also extend cluster.VectorSpace, an
abstract class which allows for singular value decomposition (SVD) and vector
normalisation. SVD is used to reduce the dimensionality of the vector space in
such a manner as to preserve as much of the variation as possible, by
reparameterising the axes in order of variability and discarding all bar the
first d dimensions. Normalisation ensures that vectors fall in the unit
hypersphere.