Clustering corpus data with hierarchical cluster analysis

Hierarchical cluster analysis (HCA) belongs to the family of multifactorial exploratory approaches. It clusters individuals based on the distances between them. I illustrate HCA with the preposition data set described here.

Hierarchical Cluster Analysis

HCA comes in two flavors: agglomerative (or ascending) and divisive (or descending). Agglomerative clustering fuses the individuals into groups, whereas divisive clustering separates the individuals into finer groups. What these two methods have in common is that they allow the researcher to find an optimal number of clusters to help explore a given data set. For reasons of space, and also because it is far more popular, I focus on agglomerative HCA.

HCA takes as input a table T that consists of i individuals (rows) and j variables (columns). The table can be a count matrix, a table of real numbers (with decimals), or a table containing both integers and real numbers. The table is first converted into a distance matrix.1 The distance matrix is then amalgamated in such a way that the individuals in the distance object are merged into clusters. The analysis starts with each individual in its own cluster (represented by an uppercase letter in Fig. 1) and then progressively combines individuals into larger clusters until a final stage where all individuals are merged into a single group. This stepwise process is represented graphically in the form of a tree-like diagram known as a dendrogram.

In Fig. 1, the individuals are represented by uppercase letters. The plot should be read from bottom to top. The further up you go, the larger the clusters. For this reason, the method is called ascending/agglomerative hierarchical cluster analysis.

Fig. 1 A generic dendrogram
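A dendrogram of this kind can be sketched with a few lines of R. The data below are hypothetical: six made-up individuals labelled A to F, described by two random variables named v1 and v2 for the purpose of illustration only. This is not the data behind Fig. 1.

```r
# Toy data: six hypothetical individuals (A-F) and two made-up variables.
set.seed(1)
toy <- matrix(rnorm(12), nrow = 6,
              dimnames = list(LETTERS[1:6], c("v1", "v2")))

toy.dist <- dist(toy)             # pairwise distances (Euclidean by default)
toy.clusters <- hclust(toy.dist)  # agglomerative clustering (complete linkage by default)

# Six individuals need five successive fusions to end up in a single cluster.
nrow(toy.clusters$merge)

# Read the plot from bottom (individuals) to top (one all-inclusive cluster).
plot(toy.clusters, hang = -1, main = "A generic dendrogram")
```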

HCA is available through several R functions and packages: hclust() in the stats package, agnes() and diana() in the cluster package, and the pvclust package. In this section, I show how to use hclust() because the stats package ships with base R.2
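For comparison, here is a minimal sketch of the two functions from the cluster package. It assumes that cluster is installed (it is a recommended package and comes with most R installations) and it uses hypothetical toy data, not the preposition data set.

```r
library(cluster)  # provides agnes() and diana()

# Toy data: six hypothetical individuals and two made-up variables.
set.seed(1)
toy <- matrix(rnorm(12), nrow = 6,
              dimnames = list(LETTERS[1:6], c("v1", "v2")))
toy.dist <- dist(toy)

ascending  <- agnes(toy.dist, method = "complete")  # agglomerative (bottom-up)
descending <- diana(toy.dist)                       # divisive (top-down)

# Both results can be converted to the class returned by hclust()
# and plotted in the same way.
plot(as.hclust(ascending), hang = -1, main = "agnes()")
plot(as.hclust(descending), hang = -1, main = "diana()")
```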

Text categories and prepositions in a corpus of US English

To illustrate HCA, we return to the preposition data set used here previously. This time, we focus on prepositions in the Brown corpus. Our goal is to cluster the fifteen text categories based on the prepositions that appear in each of them. We ignore the lengths of the prepositions.

Creating a distance matrix

Step 1 is done with the dist() function. At a minimum, it takes the input matrix as its argument.

dist.mat <- dist(mat)

By default, the distance measure that is used to generate the distance matrix is the Euclidean metric. It is the simplest and most commonly used measure. For two individuals, the Euclidean distance is the square root of the sum of the squared differences between the pairs of corresponding values (Divjak and Fieller 2014, 417).
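The definition can be checked by hand. The sketch below uses two hypothetical individuals, A and B, with three made-up values each, and compares the manual computation with what dist() returns.

```r
# Two hypothetical individuals described by three made-up values.
toy <- rbind(A = c(1, 5, 2),
             B = c(4, 1, 2))

# Square root of the sum of the squared differences:
# sqrt((1 - 4)^2 + (5 - 1)^2 + (2 - 2)^2) = sqrt(9 + 16 + 0) = 5
by.hand <- sqrt(sum((toy["A", ] - toy["B", ])^2))
by.hand

# dist() returns the same value with its default settings.
as.numeric(dist(toy))
```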

Amalgamating the clusters

Step 2 is done with the hclust() function, which clusters individuals in the distance matrix by means of an agglomeration method.

clusters <- hclust(dist.mat)

The default agglomeration method is complete linkage: the distance between two clusters is defined as the greatest distance between a member of one cluster and a member of the other cluster (Everitt et al. 2011, 76).
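A three-point example makes the definition concrete. The individuals A, B, and C and their values are hypothetical. Single linkage is shown only as a point of contrast.

```r
# Three hypothetical individuals measured on a single made-up variable.
toy <- matrix(c(0, 1, 5), ncol = 1,
              dimnames = list(c("A", "B", "C"), "v1"))
d <- as.matrix(dist(toy))  # d(A,B) = 1, d(B,C) = 4, d(A,C) = 5

hc <- hclust(dist(toy))    # complete linkage by default

# First fusion: A and B, the closest pair (height 1).
# Second fusion: C joins {A, B} at the greatest distance between C
# and a member of {A, B}, i.e. max(5, 4) = 5.
complete.height <- max(d["C", c("A", "B")])
hc$height

# With single linkage, the smallest such distance (4) is used instead.
single.heights <- hclust(dist(toy), method = "single")$height
single.heights
```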

Plotting the dendrogram

It is now time to plot the dendrogram with plot(). By specifying a negative value for the hang argument (hang = -1), the labels hang down from 0 and are neatly aligned (Fig. 2). The Height axis corresponds to the distance at which each fusion is observed.

plot(clusters, hang = -1)

Fig. 2 A cluster dendrogram of text categories in the Brown corpus based on the distribution of prepositions
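The values plotted on the Height axis are stored in the object returned by hclust(). The sketch below, which relies on hypothetical toy data rather than on the Brown matrix, shows where to find them.

```r
# Toy data: six hypothetical individuals and two made-up variables.
set.seed(1)
toy <- matrix(rnorm(12), nrow = 6,
              dimnames = list(LETTERS[1:6], c("v1", "v2")))
toy.clusters <- hclust(dist(toy))

# One height per fusion: the distance at which two clusters were merged.
toy.clusters$height

# The fusion history: negative entries are individuals,
# positive entries refer to clusters formed at an earlier step.
toy.clusters$merge

plot(toy.clusters, hang = -1)
```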

Choosing the right measure

One issue with HCA is the choice of an appropriate distance measure and an appropriate amalgamation method. The default combination with dist() and hclust() is Euclidean–complete.

However, dist() offers five other distance measures (maximum, manhattan, canberra, binary, and minkowski) and hclust() seven other amalgamation methods (ward.D, ward.D2, single, average, mcquitty, median, and centroid). The choice of one measure over another has an impact on the shape of the dendrogram, as evidenced in Fig. 3, which compares all six distance measures and applies Ward’s agglomeration method (Ward, 1963). As you can see, belles_lettres and learned_scientific are part of the same immediate cluster with all distance measures except binary.
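A comparison of this kind can be sketched by looping over the distance measures. The matrix below is a small hypothetical stand-in (five made-up categories, five made-up prepositions), and "ward.D2" is an assumption: the text does not say which of the two Ward variants was used.

```r
# Hypothetical count matrix: 5 made-up categories x 5 made-up prepositions.
toy <- rbind(cat1 = c(12, 0, 3, 0, 1),
             cat2 = c(10, 1, 4, 0, 0),
             cat3 = c(0, 8, 0, 5, 2),
             cat4 = c(1, 9, 0, 4, 0),
             cat5 = c(3, 3, 2, 2, 3))
colnames(toy) <- paste0("prep", 1:5)

measures <- c("euclidean", "maximum", "manhattan",
              "canberra", "binary", "minkowski")

op <- par(mfrow = c(2, 3))  # one panel per distance measure
trees <- lapply(measures, function(m) {
  # "ward.D2" is an assumption; "ward.D" is the other Ward variant
  hc <- hclust(dist(toy, method = m), method = "ward.D2")
  plot(hc, hang = -1, main = m, xlab = "", sub = "")
  hc
})
par(op)
names(trees) <- measures
```

Note that with the default p = 2, the minkowski distance is the same as the euclidean distance, so these two panels only differ if p is set to another value.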

If you start toying with agglomeration methods too, you realize that the number of combinations is high (6 distance measures × 8 agglomeration methods = 48 combinations).3 There are good theoretical reasons for choosing one measure over the others. For an inventory of distance metrics, see Divjak and Fieller (2014, 417–418). For an inventory of agglomeration methods, see Everitt et al. (2011, 79).
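The count can be verified by enumerating the method names accepted by dist() and hclust(). The object names in this sketch are illustrative.

```r
distances <- c("euclidean", "maximum", "manhattan",
               "canberra", "binary", "minkowski")
agglomerations <- c("ward.D", "ward.D2", "single", "complete",
                    "average", "mcquitty", "median", "centroid")

# Every pairing of a distance measure with an agglomeration method.
combos <- expand.grid(distance = distances,
                      agglomeration = agglomerations,
                      stringsAsFactors = FALSE)
nrow(combos)  # 6 x 8 = 48
head(combos)
```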

In practice, however, the input matrices that tend to be compiled in corpus linguistics are sparse (i.e. matrices in which most of the elements are zero). In our input matrix, 2080 cells out of 3885 are zeros. Because the Canberra distance metric handles the relatively large number of empty occurrences well, it is an interesting option (Desagulier 2014, 163). With respect to the agglomeration method, Ward’s is widely used. Although sensitive to outliers (i.e. observations that deviate significantly from the other members of the sample in which they occur), it has the advantage of generating clusters of moderate size. As Divjak and Fieller (2014, 417–418) put it: “[u]se of squared distances penalises spread out clusters and so results in compact clusters without being as restrictive as complete linkage.”
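The proportion of zeros in a matrix is easy to compute. The sketch below does so on a small hypothetical matrix and then applies the same arithmetic to the two figures reported above.

```r
# Hypothetical count matrix: 5 made-up categories x 5 made-up prepositions.
toy <- rbind(cat1 = c(12, 0, 3, 0, 1),
             cat2 = c(10, 1, 4, 0, 0),
             cat3 = c(0, 8, 0, 5, 2),
             cat4 = c(1, 9, 0, 4, 0),
             cat5 = c(3, 3, 2, 2, 3))

n.zeros <- sum(toy == 0)  # number of empty cells
n.cells <- length(toy)    # rows x columns
n.zeros / n.cells         # proportion of zeros in the toy matrix

# Same arithmetic with the counts given in the text.
reported.sparsity <- 2080 / 3885
round(reported.sparsity, 3)  # roughly 0.535, i.e. just over half of the cells
```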

In R, we select the distance measure as an argument of dist() and the amalgamation method as an argument of hclust(). Let us select the combination Canberra–Ward. We obtain Fig. 4.
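A minimal sketch of this selection follows. Two things are assumptions: mat is replaced here by a small hypothetical stand-in so that the code runs by itself, and "ward.D2" is chosen for Ward's method, although "ward.D" is the other variant offered by hclust(). The object name canberra.ward matches the one passed to rect.hclust() later in this section.

```r
# Hypothetical stand-in for the text-category-by-preposition matrix.
mat <- rbind(cat1 = c(12, 0, 3, 0, 1),
             cat2 = c(10, 1, 4, 0, 0),
             cat3 = c(0, 8, 0, 5, 2),
             cat4 = c(1, 9, 0, 4, 0),
             cat5 = c(3, 3, 2, 2, 3),
             cat6 = c(2, 4, 3, 1, 3))

# The distance measure is an argument of dist().
dist.canberra <- dist(mat, method = "canberra")

# The amalgamation method is an argument of hclust().
# Assumption: "ward.D2"; replace with "ward.D" for the other Ward variant.
canberra.ward <- hclust(dist.canberra, method = "ward.D2")

plot(canberra.ward, hang = -1)
```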

Fig. 4 A cluster dendrogram of text categories in the Brown corpus based on the distribution of prepositions (distance: Canberra; amalgamation: Ward)

Divjak and Fieller (2014, 426) note that the choice of a metric does not have much influence on the shape of the clusters. If it does, “thought must be given to why such differences occur and which of the methods is the most appropriate for the research questions of interest”. While this is true to some extent, Fig. 3 shows that, depending on the kind of data, choosing a metric is not a trivial step in the analysis.

Grouping clusters into classes

The rect.hclust() function allows you to group clusters into user-defined classes. The code below draws five red rectangles around the branches of the dendrogram, highlighting five cluster classes.

cluster.classes <- rect.hclust(canberra.ward, 5)

The result is displayed in Fig. 5.

Fig. 5 A cluster dendrogram of text categories in the Brown corpus based on the distribution of prepositions with 5 cluster classes (distance: Canberra; amalgamation: Ward)
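Class membership can also be retrieved as data rather than read off the plot. The sketch below uses cutree() alongside rect.hclust() on a hypothetical six-row matrix with three classes instead of five. The choice of "ward.D2" is an assumption.

```r
# Hypothetical count matrix: 6 made-up categories x 5 made-up prepositions.
toy <- rbind(cat1 = c(12, 0, 3, 0, 1),
             cat2 = c(10, 1, 4, 0, 0),
             cat3 = c(0, 8, 0, 5, 2),
             cat4 = c(1, 9, 0, 4, 0),
             cat5 = c(3, 3, 2, 2, 3),
             cat6 = c(2, 4, 3, 1, 3))

# Assumption: "ward.D2" stands in for Ward's method.
hc <- hclust(dist(toy, method = "canberra"), method = "ward.D2")

# rect.hclust() draws on an existing plot, so the dendrogram comes first.
plot(hc, hang = -1)
rects <- rect.hclust(hc, k = 3)  # list with the members of each rectangle

# cutree() returns the class of each individual as a named integer vector.
classes <- cutree(hc, k = 3)
classes
table(classes)  # class sizes
```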

Inspection of the dendrogram reveals that the use of prepositions does not match the neat delimitation of text categories in the Brown corpus. For example, prepositions are not used identically in all press subgenres or all fiction subgenres. The most consistent cluster is the one in the middle of the dendrogram (fiction).