I am experimenting with different distance measures, different parameter weights and different hierarcical methods, to understand their impact on the final clusters / structure/view of the tree (dendrogram). My question whether there is a standard calculation / measure to calculate the difference between different hierarchical trees, and how to implement this in R (e.g. to quantify that some trees are nearly identical, and that some are drastically different).

1 Answer
1

To compare the similarity of two hierarchical (tree-like) structures, measures based on cophenetic correlation idea are used. But is it correct to perform comparison of dendrograms in order to select the "right" method or distance measure in hierarchical clustering?

There are some points - hidden snags - regarding hierarchical cluster analysis that I would hold quite important:

Never compare (in order to select the method giving stronger partition) dendrograms obtained by different agglomeration
methods visually. It won't tell which method is "better" at that. Each method has its
own "prototypical" tree look: the trees will differ consistently even when
the data have no cluster structure or have random cluster structure. (And I don' think there exist a standardization or measure that would take off these intrinsic differences.). You may, however, compare dendrogram looks of results produced by the same method but different data. Maxim: direct, appearance comparing of dendrograms after different methods is unacceptable.

Do not decide on the number of clusters (i.e. where to cut the tree) by looking at the dendrogram of Ward method. In Ward, the tree shows the growth of the summative, not the averaged, coefficient of colligation; and the consequence is that since later clusters are bigger by the number of points, the later clusters look misleadingly "better" on the tree. To standardize Ward's dendrogramm appropriately, divide the coefficient growth at each step by the overall number of points in the two clusters being combined (such standardized Ward dendrogram, though, may be hard to implement graphically).$^1$ Maxim: choosing a cut level by contemplating a dendrogram appearance, while possible, is not the best method to select the partition, and for some methods may be misleading. It is recommended to rely on some formal internal clustering criterion instead.

Albeit no-one can forbid you "experimenting" with distance measures or agglomerative methods, it is better to select the distance and the method consciously, not blind trying. The distance should reflect the aspects of difference you are interested in, and the method - one must be aware - implies a specific archetype of a cluster (e.g. the metaphor of a Ward cluster is, I would say, type; cluster after complete linkage would be circle [by hobby or plot]; cluster after single linkage would be spectrum [chain]; cluster after centroid method would be proximity of platforms [politics]; an average linkage cluster is conceptually most undifferentiated and would be generally united class).

Some methods call for right distance measures and/or right type of data. Ward and centroid, for example, logically require (squared) euclidean distance - because these methods engage in computation of centroids in euclidean space. And computation of geometric centroids is incongruous with, for example, binary data; the data should be scale/continuous. Maxim: data/distance/method assumptions and correspondence is very important and not so easy question.

Preprocessing (such as centering, scaling and other forms of transformation of variables/features) prior computation of a distance matrix and doing the clustering is extremely important question, too. It can dramatically influence the results. Think over what preprocessing may help you and will make sense from the interpretation point of view. Also, never be shy to carefully inspect you data graphically before attempting to do cluster analysis.

Not all methods of agglomerative clustering can be equally seen as giving you hierarchical classification... on philosophical grounds. For example, centroid method do gives hierarchy in a sense, because cluster centre is an emergent and defining feature of a cluster as a whole, and merging clusters is driven by that feature. Complete linkage, on the other hand, "dismisses" both subclusters when it merges them - by virtue of distancing among individual objects of the two. Thus, complete linkage dendrogram is merely a history of collection and not a parent-child sort of taxonomy. Maxim: hierarchical agglomerative cluster analysis, generally, expects that you make a partition based on its result, rather than see the result as hierarchical taxonomy.

Hierarchical clustering is typical greedy algorithm that makes the best choice among alternatives appearing on each step in the hope to get close to optimal solution in the end. However, the "best" choice appearing on a high level step is likely to be poorer than global optimum theoretically possible on that step. The greater is the step, the greater is the suboptimality, as a rule. Given that we usually want few clusters last steps are important; and, as just said, they are expected to be relatively poor if the steps' number is high (say, thousandth step). That's why hierarchical clustering is generally not recommended for large samples of objects (numbering thousands of objects) even if the program could handle such a big distance matrix.

If after the above precautions you continue to think that you want a measure of similarity between hierarchical classifications you might google on 'comparing of dendrograms' and 'comparing hierarchical classifications'. One most suggesting itself idea may be based on the cophenetic correlation: having two dendrograms for the same dataset of n objects, let $X_{ij}$ be coefficient of colligation (or maybe its rank, the step number) between every pair of objects ij in one dendrogram, and $Y_{ij}$ likewise be the same in the other dendrogram. Compute correlation or cosine.

$^1$ Later update on the problem of dendrogram of Wards's method. Different clustering programs may output differently transformed aglomeration coefficients for Ward's method. Hence their dendrograms will look somewhat differently despite that the clustering history and results are the same. For example, SPSS doesn't take the root from the ultrametric coefficients, and it cumulates them in the output. Another tradition (found in some R packages, for example) is to take the root (so called "Ward-2" implementations) and not to cumulate. To repeat again, such differences affect only the general shape/looks of the dendrogram, not the clustering results. But the looks of the dendrogram might influence your decision about the number of clusters. The moral is that it would be safe not to rely on dendrogram in Ward's method at all, unless you know exactly what are these coefficients out of your program and how to interpret them correctly.

$\begingroup$I can second pretty much all of this. Two points that I want to add: A) what you seem to do is some kind of overfitting. By systematically evaluating measures, weights and methods, there is a high risk that the parameters you end up with are highly specific for your current data, and may be useless on other data or even later data. B) what good is it to know the similarity of the dendrograms. Consider what you want to do with them afterwards, and then try evaluating at the end result. Evaluating interim results may be misleading.$\endgroup$
– Anony-MousseJul 7 '13 at 14:47

1

$\begingroup$If you're still interested in the topic, I thought that you might find my recent answer on DS SE helpful, especially as it offers a coverage, albeit limited, of both frequentist and Bayesian approaches to hierarchical topical models (with embedded class information) and selecting similarity measures.$\endgroup$
– Aleksandr BlekhJan 9 '15 at 4:57

$\begingroup$@ttnphns, Could you please explain in more details how to use "coefficient of colligation" to compare two hierarchical classifications?$\endgroup$
– bassirAug 12 '15 at 13:17