A cohesion-based clustering technique for categorical data

Abstract

Clustering aims to partition a given dataset of objects into groups of similar objects. In this work, we consider categorical data, whose values, unlike numerical values, have no inherent order. This makes clustering such data a more challenging task. We propose a clustering technique for categorical data that uses a novel similarity function, called cohesion, to measure the degree to which objects "stick" to clusters. We have implemented this technique, which we refer to as CLUC (CLUstering with Cohesion). To evaluate CLUC, we compared its results with those produced by well-known clustering algorithms. The results of our extensive experiments on real and synthetic datasets show that CLUC generates high-quality clusters that conform more closely to clusterings by human experts than those of the compared algorithms. For some well-known real datasets, CLUC even discovers clusterings identical to those provided by experts. Our results also indicate that CLUC is generally order-insensitive and scales well as the dataset grows in size (the number of objects) and/or dimensionality (the number of attributes).