Resumen:

Clustering is an old data analysis problem that has been extensively studied during
the last decades. However, there is not a single algorithm that provides a satisfactory result
for every data set. Moreover, there exist some problems related to cluster analClustering is an old data analysis problem that has been extensively studied during
the last decades. However, there is not a single algorithm that provides a satisfactory result
for every data set. Moreover, there exist some problems related to cluster analysis that also
remain unsolved. In this monograph we study some of such problems as they commonly appear
in practice, and test how they work when applied to gene expression data analysis, where
clustering is widely used.
Different clustering algorithms often lead to different results, and in order to make sense
out of them it is important to understand how clusters from one analysis relate to those from a
different one. A comparison method to find and visualize many-to-many relationships between
two clusterings, either two flat clusterings or a flat and a hierarchical clustering, is presented.
The similarities between clusters are represented by a weighted bipartite graph, where the
nodes are the clusters and an edge weight shows the number of elements in common to the
connected nodes. To visualize the relationships between clusterings the number of edge crossings
is minimized. When applied to the case of comparing a hierarchical and a flat clustering we use
a criterion based either on the graph layout aesthetics or in the mutual information, to decide
where to cut the hierarchical tree.
Since iterative methods are sensitive to the initial parameters, we have developed two refinement algorithms designed to improve this initial state, based on the notion of data depth.
One of these algorithms looks for initial points in the same data space, while the second one,
using the bootstrap technique, selects the initial seeds in a new space of bootstrap centroids.
Also, this second approach allows to construct a soft (non-hard) clustering of the data, that
assigns to each point a probability of belonging to each cluster, and thus a single point may
partially belong to more than one cluster.
On the other hand, the number of clusters underlying in a data set is usually unknown.
Using ideas from the clustering comparison method previously proposed and from the data
depth concept, we present three procedures to estimate the number of real groups. The first two
methods consist basically in sampling pairs of clusterings from a population and successively
performing comparisons between them to find a consensus in the number of clusters, and the
third one looks for representative subsets of the clusters whose diameter is used to estimate
the optimal number of real groups.
The extensive study we carried out in simulated and real gene expression data shows that
the techniques presented here are useful and e±cient. The results that we obtained with real
data make sense not only from a statistical point of view, but they have proven to have a biological meaning.
______________________________________________[+][-]