(Dis)similarity & distance

Choosing the right measure

Your choice of (dis)similarity measure is likely to have a major impact on your results. Understanding how each measure treats your data, and which one is suitable, is an essential part of many analyses. This page discusses some of these measures. If you are unable to decide on a measure, consider using our (dis)similarity wizard to help you decide which sort of measure may be most appropriate.

(Dis)similarity, distance, and dependence measures are powerful tools for determining ecological association and resemblance. Choosing an appropriate measure is essential, as it will strongly affect how your data are treated during analysis and what kinds of interpretation are meaningful. Non-metric multidimensional scaling, principal coordinates analysis, and cluster analysis are examples of analyses that are strongly influenced by the choice of (dis)similarity measure. Note that while these measures may draw out certain types of relationships in your raw data, they may do so at the expense of other information present therein. Below, several key measures for assessing ecological resemblance are introduced. For a more complete overview, see chapter seven in Legendre & Legendre's Numerical Ecology (1998). For a critical view on the use of dissimilarity and distance, see Warton et al. (2012).

When choosing a distance measure, ensure that the measure reflects the ecological relationships you are concerned with. Further, some measures have mathematical properties that make them unsuitable for certain analyses. Similarly, certain analyses will only produce meaningful results when certain measures are used. If a measure listed below sounds suited to your data, use more detailed resources to learn about its properties and limitations before drawing any conclusions from analyses based upon it. The list below is not exhaustive, but aims to familiarise you with a set of commonly used measures and their uses.

Key terminology

The following terminology is used in the measure descriptions.

Q mode analysis

An analysis type which focuses on relationships between objects.

R mode analysis

An analysis type which focuses on relationships between variables. "R" refers to measures of dependence, such as correlation coefficients, with "0" indicating no dependence. Pearson's r coefficient is one such measure.

Euclidean space

Euclidean spaces, the most familiar kind of metric space, have axes which are quantitative and may be used for standard measurement. In other words, distances between points in Euclidean space are literal, straight-line distances and are directly interpretable.

Dependence coefficients

Dependence coefficients indicate the association between variables in R mode analyses. This is evaluated by analysing the values of a given set of variables across a collection of objects. Pearson's r and other correlation coefficients are examples of dependence coefficients.

Similarity coefficients

Similarity coefficients, used in Q mode analyses, indicate the degree to which objects resemble one another. This is evaluated by analysing a given set of objects based on the values of the variables that describe them. Similarity coefficients reach their maximum when objects have identical values across their variables. Similarity coefficients are never metric, as they violate the Euclidean assumption that the shortest path between two points is a straight line. For example, suppose two objects, A and B, are three times more similar to one another (i.e. have a similarity score that is three times larger) than either is to another object, C. If similarities were read as path lengths, the indirect paths ACB and BCA (two units) would be shorter than the direct paths AB and BA (three units). Hence, "object x object" similarity matrices cannot be embedded in Euclidean space. Converting similarities to dissimilarities or, more appropriately, distances can allow metric representation.

Dissimilarity coefficients

Dissimilarity coefficients are the conceptual (and often mathematical) inverse of similarity coefficients. These reach their maxima when objects share no similar variable values. Dissimilarity measures may or may not be metric. When they are metric, they are more correctly called distance coefficients.

Distance coefficients

A special case of dissimilarity coefficients, distances must satisfy the triangle inequality. This specifies that for any triangle with vertices A, B, and C, the path defined by the line AB is never longer than the path defined by AC + CB.

For some variables, such as species counts or presence/absence values, zero values shared between objects (double zeros) are not necessarily indications of true similarity between these objects. Zeros may reflect a true absence of a variable's effect, but may also merely indicate that the variable's effect was not observed or measured. Further, even when a 'true' zero value is reported, it provides no information on why a given object is not affected by a variable. The object may be beyond the influence of the variable in an arbitrary direction; for example, a "site" (object) may have zero instances of a "species" (variable) because the local conditions are too hot or too cold. Asserting, based on zero species values, that hot and cold sites are similar is highly questionable.

Metric coefficients

Metric coefficients, which are distance coefficients, satisfy four criteria: 1) the distance between identical objects is "0", which is the minimum of the coefficient; 2) if objects are not identical, a metric coefficient will have a positive value; 3) symmetry, whereby the distance between A and B is the same as that between B and A; 4) conformance to the triangle inequality, which specifies that for any triangle with vertices A, B, and C, the path defined by the line AB is never longer than the path defined by AC + CB.
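
The four criteria can be checked mechanically on a small matrix of pairwise values. The following is a minimal sketch, not part of the original guide: the function name is_metric and the two example matrices are hypothetical, and the check assumes that every row of the matrix represents a distinct, non-identical object.

```python
from itertools import permutations


def is_metric(d, tol=1e-12):
    """Check the four metric criteria on a square matrix of pairwise values.

    Assumption: each row/column is a distinct (non-identical) object.
    """
    n = len(d)
    for i in range(n):
        if abs(d[i][i]) > tol:                      # 1) zero for identical objects
            return False
        for j in range(n):
            if i != j and d[i][j] <= 0:             # 2) positive for different objects
                return False
            if abs(d[i][j] - d[j][i]) > tol:        # 3) symmetry
                return False
    # 4) triangle inequality: AB is never longer than AC + CB
    return all(d[a][b] <= d[a][c] + d[c][b] + tol
               for a, b, c in permutations(range(n), 3))


# Hypothetical example matrices (objects A, B, C)
metric_like = [[0, 3, 4],
               [3, 0, 5],
               [4, 5, 0]]
semimetric_like = [[0, 1, 5],
                   [1, 0, 1],
                   [5, 1, 0]]   # A-C (5) is longer than A-B + B-C (2)

print(is_metric(metric_like), is_metric(semimetric_like))
```

The second matrix fails only the fourth criterion, which is the behaviour described for semimetric coefficients.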

Semimetric coefficients

Semimetric coefficients do not always satisfy the triangle inequality and hence cannot (reliably) be used to ordinate objects in a Euclidean space.

Nonmetric coefficients

Nonmetric coefficients may assume negative values and violate other metric criteria. Hence, these coefficients cannot (reliably) be used to ordinate objects in a Euclidean space.

Q mode similarity measures

As noted above, similarity measures (S) are never metric; thus, objects cannot be ordinated in a metric or Euclidean space based on their similarities. Converting similarities to distances can allow such ordination. This can be done simply by taking the one-complement of the similarity (1-S) or the square root of that one-complement. A few common measures are described below. For an extensive overview, see Legendre and Legendre (1998).
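
As a small illustration of the conversion just described, the sketch below turns a similarity into a dissimilarity by the one-complement or its square root. The function name similarity_to_distance is hypothetical, and the sketch assumes similarities scaled between 0 and 1.

```python
import math


def similarity_to_distance(s, square_root=False):
    """Convert a similarity S (assumed to lie in [0, 1]) to 1 - S or sqrt(1 - S)."""
    if not 0.0 <= s <= 1.0:
        raise ValueError("this sketch assumes a similarity between 0 and 1")
    d = 1.0 - s
    return math.sqrt(d) if square_root else d


print(similarity_to_distance(0.75))                    # one-complement
print(similarity_to_distance(0.75, square_root=True))  # square root of the one-complement
```

Whether the result behaves metrically depends on the similarity coefficient used, so the properties of the converted measure should be checked in a detailed reference.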

Binary measures

Binary measures are appropriate for data sets where variables can only take the values "1" or "0", such as presence/absence data sets.

Simple-matching coefficient

This coefficient gives equal weight to both forms of match - double zeros and double ones, and is thus a symmetrical coefficient.

Jaccard coefficient

This coefficient excludes double zeros, giving equal weight to non-zero agreements ("1", "1") and disagreements ("1", "0" and "0", "1") when comparing two objects. Given a "sites x species" matrix, the Jaccard coefficient can be used to express species/OTU turnover.

Sørensen / Dice coefficient

This coefficient is similar to the Jaccard coefficient, but gives double weight to non-zero agreements. This asserts that the co-occurrence or coincidence of variable states among objects is more informative or important than disagreements. It is based on the logic of the harmonic mean and is thus suitable for data sets with large-valued outliers. It may, however, increase the influence of small-valued outliers.

Other binary measures are available which treat double-zero agreements, double-one agreements, and disagreements differently for a variety of reasons. Consider carefully if any special meaning is indicated by the different matching states of the binary variables in your data set and ensure that the measure chosen adequately reflects these.
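
The three binary coefficients described above differ only in how they weight the four possible matching states. The sketch below uses the common a, b, c, d notation (a = double ones, b and c = the two forms of disagreement, d = double zeros); the function names and the two example sites are hypothetical.

```python
def match_counts(x, y):
    """Count matching states between two binary vectors of equal length."""
    a = sum(1 for p, q in zip(x, y) if p == 1 and q == 1)  # double ones
    b = sum(1 for p, q in zip(x, y) if p == 1 and q == 0)  # present in x only
    c = sum(1 for p, q in zip(x, y) if p == 0 and q == 1)  # present in y only
    d = sum(1 for p, q in zip(x, y) if p == 0 and q == 0)  # double zeros
    return a, b, c, d


def simple_matching(x, y):
    a, b, c, d = match_counts(x, y)
    return (a + d) / (a + b + c + d)        # double zeros count as agreement


def jaccard(x, y):
    a, b, c, _ = match_counts(x, y)
    return a / (a + b + c)                  # double zeros excluded


def sorensen(x, y):
    a, b, c, _ = match_counts(x, y)
    return 2 * a / (2 * a + b + c)          # double ones weighted twice


# Hypothetical presence/absence records for two sites (7 species)
site1 = [1, 1, 0, 0, 0, 0, 1]
site2 = [1, 0, 1, 0, 0, 0, 1]
print(simple_matching(site1, site2), jaccard(site1, site2), sorensen(site1, site2))
```

Note that jaccard and sorensen are undefined (division by zero) when both objects contain only zeros; the sketch leaves that case unhandled.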

Quantitative measures

Quantitative coefficients take into account values other than "0" and "1". Some quantitative measures lessen the effect of relatively large or small variable values in a data set to preserve overall interpretability. However, other measures are sensitive to large quantitative differences and perform better on transformed data.

Gower coefficient

This coefficient may be used for heterogeneous data sets (i.e. data sets including numerous variable types). It calculates a partial similarity value of two objects for each variable describing them. The final similarity score is the average of all partial similarities. Binary, qualitative and semi-quantitative, and quantitative variables are treated differently.

Binary variables can be evaluated symmetrically or asymmetrically.

Qualitative and semi-quantitative variables will have a similarity score of "1" when their values are equivalent between two objects, and "0" otherwise.

For quantitative variables, a dissimilarity is calculated by dividing the absolute difference between a given variable's values describing two objects by the range of this variable across all objects. The one-complement of this dissimilarity is then taken as the similarity value.

Missing data may be accounted for by integrating Kronecker's delta into the implementation.

Gower, 1971
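
The averaging of partial similarities can be sketched as follows. This is an illustrative reading of the description above rather than a reference implementation: the function name, the variable-type labels, and the example objects are hypothetical, missing values are represented by None, and each missing comparison is simply given a weight of zero.

```python
def gower_similarity(obj1, obj2, kinds, ranges):
    """Average the partial similarities of two objects over their variables.

    kinds[j] is one of "binary_sym", "binary_asym", "qualitative",
    "quantitative" (hypothetical labels). ranges[j] is the range of
    variable j across all objects (only used for quantitative variables).
    """
    total, used = 0.0, 0
    for v1, v2, kind, rng in zip(obj1, obj2, kinds, ranges):
        if v1 is None or v2 is None:
            continue                      # missing value: comparison gets weight 0
        if kind == "binary_asym" and v1 == 0 and v2 == 0:
            continue                      # asymmetric treatment: skip double zeros
        if kind == "quantitative":
            partial = 1.0 - abs(v1 - v2) / rng   # one-complement of range-scaled difference
        else:
            partial = 1.0 if v1 == v2 else 0.0   # match / mismatch
        total += partial
        used += 1
    if used == 0:
        raise ValueError("no comparable variables")
    return total / used


# Hypothetical objects: (symmetric binary, qualitative, quantitative, asymmetric binary)
kinds = ["binary_sym", "qualitative", "quantitative", "binary_asym"]
ranges = [None, None, 20.0, None]
obj_a = [1, "sand", 10.0, 0]
obj_b = [1, "clay", 14.0, 0]
print(gower_similarity(obj_a, obj_b, kinds, ranges))
```

In this example the partial similarities are 1, 0, and 0.8, and the double zero in the asymmetric binary variable is left out of the average.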

Steinhaus coefficient

This asymmetric coefficient is widely used for raw count data. It compares the sum of the per-variable minimum values between two objects to the mean of the two objects' totals (the sum of all variable values describing each object). If applied to binary data, this is equivalent to the Sørensen coefficient. The one-complement of this coefficient is the popular Bray-Curtis dissimilarity measure.
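
A minimal sketch of this calculation, assuming the usual form S = 2W / (A + B), where W is the sum of the per-variable minima and A and B are the totals of the two objects. The function names and example counts are hypothetical.

```python
def steinhaus(x, y):
    """Steinhaus similarity: 2W / (A + B) for two vectors of non-negative counts."""
    w = sum(min(p, q) for p, q in zip(x, y))   # W: sum of per-variable minima
    total = sum(x) + sum(y)                    # A + B: the two object totals
    if total == 0:
        raise ValueError("both objects are empty")
    return 2 * w / total


def bray_curtis(x, y):
    """Bray-Curtis dissimilarity as the one-complement of Steinhaus similarity."""
    return 1 - steinhaus(x, y)


# Hypothetical count data for two sites
counts1 = [10, 0, 5, 3]
counts2 = [4, 2, 5, 0]
print(steinhaus(counts1, counts2), bray_curtis(counts1, counts2))

# On binary data the result matches the Sørensen coefficient, 2a / (2a + b + c)
binary1 = [1, 1, 0, 0, 0, 0, 1]
binary2 = [1, 0, 1, 0, 0, 0, 1]
print(steinhaus(binary1, binary2))
```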

Q mode dissimilarity and distance measures

There are three groups of dissimilarity measures: metric, semimetric, and nonmetric. See the "Key terminology" section of this page for definitions.

Metric distances

Euclidean distance

A simple, symmetrical metric using the Pythagorean formula. The more variables present in a data set, the larger one may expect Euclidean distances to be. Further, double zeros result in decreased distances. This property makes the Euclidean distance unsuitable for many ecological data sets and ecologically-motivated transformations should be considered. Principal components analysis and redundancy analysis ordinate objects using Euclidean distances.
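
One way to see why the Euclidean distance can mislead with abundance data is a small, hypothetical three-site example in which two sites sharing no species end up closer than two sites with identical species composition. The sketch below is illustrative only; the site vectors are invented.

```python
import math


def euclidean(x, y):
    """Euclidean (Pythagorean) distance between two vectors of equal length."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))


# Hypothetical abundances of three species at three sites
site1 = [0, 1, 1]
site2 = [1, 0, 0]   # shares no species with site1
site3 = [0, 4, 4]   # same species as site1, higher abundances

d12 = euclidean(site1, site2)
d13 = euclidean(site1, site3)
print(d12, d13)     # site1 is "closer" to site2 than to site3

# A species absent from both sites (a double zero) adds nothing to the distance
print(euclidean(site1 + [0], site2 + [0]) == d12)
```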

Chord distance

This asymmetric distance measure is simply the Euclidean distance calculated for a row-standardised matrix (see the chord transformation), in which each row is scaled to unit length by dividing its values by the square root of the row's sum of squared values. Rather than comparing absolute values, the chord distance thus compares objects based on the relative magnitude of each value within the row corresponding to that object. Even if objects have different raw values of two or more variables, as long as these values are proportionately equivalent when standardised, the objects will be considered similar. The chord distance is insensitive to double zeros.

Orlóci, 1967
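
A minimal sketch of the chord distance, assuming the chord transformation scales each object (row) to unit Euclidean length before the Euclidean distance is taken. The function names and site vectors are hypothetical.

```python
import math


def chord_transform(x):
    """Scale a vector to unit length (assumed form of the chord transformation)."""
    norm = math.sqrt(sum(v ** 2 for v in x))
    if norm == 0:
        raise ValueError("cannot standardise an all-zero row")
    return [v / norm for v in x]


def chord_distance(x, y):
    """Euclidean distance between the chord-transformed vectors."""
    tx, ty = chord_transform(x), chord_transform(y)
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(tx, ty)))


# Hypothetical abundances of three species at three sites
site1 = [0, 1, 1]
site2 = [1, 0, 0]   # shares no species with site1
site3 = [0, 4, 4]   # same proportions as site1

print(chord_distance(site1, site3))  # proportionally identical sites
print(chord_distance(site1, site2))  # sites with no species in common
```

Under this assumption the distance ranges from 0 (identical proportions) to the square root of 2 (nothing in common).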

Mahalanobis distance

Appropriate for comparing groups of objects described by the same variables, this coefficient eliminates the effect of correlations between variables and is arrived at through the calculation of a covariance matrix from the input matrix. It also eliminates differences in scale between variables. Alternative forms of this measure may be used to calculate the distance between a group and a single object.

Mahalanobis, 1936

Coefficient of racial likeness

Appropriate for comparing groups of objects described by the same variables, this coefficient does not eliminate the effect of correlations between variables. This may be desirable when samples are too small to effectively remove correlative effects (see e.g. Penrose, 1952).

Pearson, 1926

χ2 metric

The calculation of this asymmetric metric transforms a matrix of quantitative values into a matrix of conditional probabilities (i.e. the quotient of a given value in a cell and either the row or column totals). A weighted Euclidean distance measure is then computed based on the values in the rows (or columns in R mode analysis) of the conditional probability matrix. Weights, which are the reciprocal of the variable (column) totals from the raw data matrix, serve to reduce the influence of the highest values measured.

χ2 distance

This asymmetric distance is similar to the χ2 metric; however, the weights in the weighted Euclidean distances are multiplied by the total of all values in the raw data matrix (equivalently, the distances themselves are multiplied by the square root of this total). This converts the weights in the Euclidean distances to probabilities rather than column totals. This is the measure used in correspondence analysis and related analyses.

Lebart & Fénelon, 1971
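
The relationship between the χ2 metric and the χ2 distance can be sketched as below. This assumes the formulation in which row profiles (values divided by row totals) are compared with weights equal to the reciprocal of the column totals, and the χ2 distance equals the χ2 metric multiplied by the square root of the grand total. Function names and the data matrix are hypothetical.

```python
import math


def chi_square_metric(matrix, i, k):
    """Chi-square metric between rows i and k of an objects x variables matrix."""
    col_totals = [sum(col) for col in zip(*matrix)]
    row_i, row_k = sum(matrix[i]), sum(matrix[k])
    squared = sum(
        (matrix[i][j] / row_i - matrix[k][j] / row_k) ** 2 / col_totals[j]
        for j in range(len(col_totals))
        if col_totals[j] > 0              # skip variables that are zero everywhere
    )
    return math.sqrt(squared)


def chi_square_distance(matrix, i, k):
    """Chi-square distance: the metric rescaled by the square root of the grand total."""
    grand_total = sum(sum(row) for row in matrix)
    return math.sqrt(grand_total) * chi_square_metric(matrix, i, k)


# Hypothetical "sites x species" counts; rows 0 and 1 have identical profiles
data = [[10, 20, 30],
        [1, 2, 3],
        [5, 0, 5]]
print(chi_square_metric(data, 0, 1))    # identical row profiles
print(chi_square_distance(data, 0, 2))
```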

Hellinger distance

This asymmetric distance is similar to the χ2 metric. While no weights are applied, the square roots of conditional probabilities are used as variance-stabilising data transformations. This distance measure performs well in linear ordination. Variables with few non-zero counts (such as rare species) are given lower weights.

Hellinger, 1909; Rao, 1995
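
A minimal sketch of the Hellinger distance, assuming it is computed as the Euclidean distance between the square roots of the row-wise conditional probabilities (each value divided by its row total). The function name and site vectors are hypothetical.

```python
import math


def hellinger_distance(x, y):
    """Euclidean distance between square-rooted row profiles of two count vectors."""
    total_x, total_y = sum(x), sum(y)
    if total_x == 0 or total_y == 0:
        raise ValueError("row totals must be positive")
    return math.sqrt(sum(
        (math.sqrt(p / total_x) - math.sqrt(q / total_y)) ** 2
        for p, q in zip(x, y)
    ))


# Hypothetical abundances of three species at three sites
site1 = [0, 1, 1]
site2 = [1, 0, 0]   # shares no species with site1
site3 = [0, 4, 4]   # same proportions as site1

print(hellinger_distance(site1, site3))
print(hellinger_distance(site1, site2))
```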

Manhattan metric

Similar to the Euclidean distance; however, rather than using the Pythagorean formula, the Manhattan distance simply sums the absolute differences between the paired variable values of two objects. Just like the Euclidean distance, this metric suffers from the double zero problem, and the distances reported will increase with the number of variables assessed.

Canberra metric

This metric excludes double zeros and increases the effect of differences between variables with low values or many zeroes.

Lance & Williams, 1966
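
The contrast between the Manhattan and Canberra metrics can be illustrated with a short sketch. It assumes non-negative values and the unscaled form of the Canberra metric, in which each absolute difference is divided by the sum of the two values and double zeros are skipped. Function names and example values are hypothetical.

```python
def manhattan(x, y):
    """Sum of absolute differences between paired values."""
    return sum(abs(p - q) for p, q in zip(x, y))


def canberra(x, y):
    """Sum of |x - y| / (x + y) over variables that are not double zeros.

    Assumes non-negative values.
    """
    return sum(abs(p - q) / (p + q) for p, q in zip(x, y) if p + q != 0)


# Hypothetical counts: one abundant variable, one rare variable, one double zero
obj1 = [100, 2, 0]
obj2 = [90, 0, 0]

print(manhattan(obj1, obj2))  # dominated by the abundant variable (10 of 12)
print(canberra(obj1, obj2))   # dominated by the rare variable (1 of about 1.05)
```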

Jaccard distance

The one-complement of the Jaccard similarity (described above) is a metric distance.

Semimetric measures

As described above, semimetric measures do not always satisfy the triangle inequality and hence cannot be fully relied upon to represent dissimilarities in a Euclidean space without appropriate transformation. That being said, they often do behave metrically and can be used in principal coordinates analysis (following an adjustment for negative eigenvalues if necessary) and non-metric multidimensional scaling.

Bray-Curtis dissimilarity

This is an asymmetrical measure often used for raw count data. This is the one-complement of the Steinhaus similarity coefficient and a popular measure of dissimilarity in ecology. This measure treats differences between high and low variable values equally.

Bray & Curtis, 1957

Sørensen dissimilarity

The one complement of the Sørensen similarity coefficient (described above) is a semimetric dissimilarity measure.

Nonmetric measures

As noted by Legendre and Legendre (1998), nonmetric dissimilarity measures, such as a binary coefficient proposed by Kulczynski (1928) which is the quotient of double presences and disagreements, may assume negative values. As negative dissimilarities are intuitively nonsensical, they are problematic for interpretation. In general, these should be avoided unless there is a very clear reason to use them.

R mode measures of dependence

R mode measures express the relationships between variables. With some exceptions, Q-mode measures are generally not useful or meaningful in R-mode analysis. See Legendre and Legendre (1998) and Ludwig and Reynolds (1988) for an explanation of what constitutes a permissible R-mode measure. Often, R-mode measures are referred to as dependence coefficients as they express how much the values of one variable can be said to depend on the states of another variable. Well-known correlation measures are examples of R mode measures.

Pearson's r

This familiar measure of linear correlation between two variables is suitable only for detecting linear relationships. It is the covariance between two variables divided by the product of their standard deviations. If your variables have many zeros, this correlation coefficient will not be reliable, as double zeros will be understood as an "agreement" when, in fact, they are simply the absence of an observation. This will inflate the correlation coefficient.
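
The effect of double zeros on this coefficient can be seen in a small sketch. The function name and the abundance vectors are hypothetical: two species that never co-occur are strongly negatively correlated across four sites, but adding sites where both are absent pulls the coefficient upward, toward zero.

```python
import math


def pearson_r(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((p - mean_x) * (q - mean_y) for p, q in zip(x, y))
    ss_x = sum((p - mean_x) ** 2 for p in x)
    ss_y = sum((q - mean_y) ** 2 for q in y)
    # The 1/(n-1) factors of the covariance and standard deviations cancel
    return cov / math.sqrt(ss_x * ss_y)


# Hypothetical abundances of two species that never occur together
species1 = [5, 0, 3, 0]
species2 = [0, 4, 0, 6]
r_sampled = pearson_r(species1, species2)

# The same data plus 20 sites where neither species was observed
padding = [0] * 20
r_padded = pearson_r(species1 + padding, species2 + padding)

print(r_sampled, r_padded)
```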

Spearman's rho

This is a non-parametric measure of correlation which uses ranks rather than the original variable values. Variables should have monotonic relationships: that is, their ranks should either go up or down across objects, but not necessarily in a linear fashion. Like Pearson's r, Spearman's rho is based on the principle of least squares, but is concerned with how strongly the rankings between two variables disagree. The larger the disagreement, the lower the rho value. This statistic is sensitive to large disagreements. That is, if one variable ranks an object as "1" and another variable ranks the same object as "100", the correlation reported by Spearman's rho will be strongly affected (relative to Kendall's tau, for example), even if these variables agree on all other ranks. This measure is suitable for raw or standardized abundance data and any monotonically related variables.

Kendall's τ

Like Spearman's rho, Kendall's tau uses ranked values to calculate correlation. This measure, however, is not based on the principle of least squares and instead expresses the degree of concordance between two rankings. The tau statistic is the quotient of 1) the difference between the numbers of concordant and discordant pairs (i.e. pairs of objects whose relative order agrees or differs between the two rankings) and 2) the total number of pairs compared. This statistic is not sensitive to the scale of the disagreement. As above, variables should have monotonic relationships: that is, their ranks should either go up or down across objects, but not necessarily in a linear fashion. This measure is suitable for raw or standardized abundance data and any monotonically related variables.
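
The different sensitivity of the two rank correlations to a single large disagreement can be sketched as follows. The sketch assumes there are no tied values, uses the classic rank-difference formula for Spearman's rho, and computes Kendall's tau as described above (concordant minus discordant pairs, divided by the total number of pairs). Function names and the example rankings are hypothetical.

```python
from itertools import combinations


def ranks(values):
    """Rank values from 1 (smallest) to n; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """1 - 6 * sum(d^2) / (n * (n^2 - 1)), with d the rank differences (no ties)."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


def kendall_tau(x, y):
    """(concordant - discordant) / total number of pairs (no ties)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs


# Hypothetical rankings of 10 objects: the variables agree on the relative order
# of every object except the first, which is ranked 1 by one and 10 by the other
var1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
var2 = [10, 1, 2, 3, 4, 5, 6, 7, 8, 9]

print(spearman_rho(var1, var2), kendall_tau(var1, var2))
```

In this example the single large disagreement lowers rho more than tau.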

χ2 similarity, metric, and distance

The χ2 similarity, metric, and distance measures (see above for description) may also be used for R-mode analysis. These are useful when monotonic relationships are not present and are appropriate for raw abundance data, qualitative and ordinal data.

Hellinger distance

Described above, the Hellinger distance is useful for variables populated with abundance data.

Symmetric uncertainty coefficient

This coefficient is based in the logic of information theory. It expresses the amount of information shared between two variables using contingency tables and Shannon's information formula. Resorting to contingency tables is useful when dealing with qualitative variables with no monotonic relationships. Probabilities of association can be calculated and then translated into measures of dependence. Legendre and Legendre (1998) offer a developed discussion on information theory in numerical ecology.
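
As an illustration, the sketch below computes a symmetric uncertainty coefficient for two qualitative variables, assuming the common definition of twice the shared (mutual) information divided by the sum of the two variables' Shannon entropies. This definition, the function names, and the example variables are assumptions; consult Legendre and Legendre (1998) for the formulation used in numerical ecology.

```python
import math
from collections import Counter


def shannon_entropy(states):
    """Shannon entropy (natural log) of a sequence of qualitative states."""
    n = len(states)
    return -sum((count / n) * math.log(count / n)
                for count in Counter(states).values())


def symmetric_uncertainty(x, y):
    """2 * I(x; y) / (H(x) + H(y)), built from the contingency table of x and y."""
    h_x, h_y = shannon_entropy(x), shannon_entropy(y)
    h_xy = shannon_entropy(list(zip(x, y)))   # joint entropy from paired states
    if h_x + h_y == 0:
        raise ValueError("both variables are constant")
    shared = h_x + h_y - h_xy                 # information shared by x and y
    return 2 * shared / (h_x + h_y)


# Hypothetical qualitative variables recorded at four sites
soil = ["sand", "sand", "clay", "clay"]
cover = ["open", "closed", "open", "closed"]   # unrelated to soil in this example

print(symmetric_uncertainty(soil, soil))    # a variable shares all information with itself
print(symmetric_uncertainty(soil, cover))   # no shared information
```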

Jaccard coefficient

Described above as a Q mode measure, this coefficient excludes double zeros, giving equal weight to non-zero agreements ("1", "1") and disagreements ("1", "0" and "0", "1"). In R mode, it compares two binary variables across objects rather than two objects across variables; given a "sites x species" matrix, this means comparing two species across sites.

Dice coefficient

This coefficient (the Sørensen / Dice coefficient described above) is similar to the Jaccard coefficient, but gives double weight to non-zero agreements. This asserts that the co-occurrence or coincidence of variable states is more informative or important than disagreements. It is based on the logic of the harmonic mean and is thus suitable for data sets with large-valued outliers. It may, however, increase the influence of small-valued outliers.

Ochiai index

The Ochiai index is the quotient of the total non-zero agreements ("1", "1") between two variables and the product of the square-rooted sums of non-zero agreements and each form of disagreement (i.e. "0", "1" and "1", "0"). Thus, this measure is based on the logic of the geometric mean, and values with different ranges will be normalised before a central value is proposed. This is particularly suitable when the ranges and variances of agreements and disagreements are very different from one another.
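
The quotient described above can be written as a / sqrt((a + b)(a + c)), where a is the number of non-zero agreements and b and c are the two forms of disagreement. The sketch below follows that reading; the function name and the two example species are hypothetical.

```python
import math


def ochiai(x, y):
    """a / sqrt((a + b) * (a + c)) for two binary variables of equal length."""
    a = sum(1 for p, q in zip(x, y) if p == 1 and q == 1)  # non-zero agreements
    b = sum(1 for p, q in zip(x, y) if p == 1 and q == 0)  # present in x only
    c = sum(1 for p, q in zip(x, y) if p == 0 and q == 1)  # present in y only
    if (a + b) == 0 or (a + c) == 0:
        raise ValueError("each variable needs at least one non-zero value")
    return a / math.sqrt((a + b) * (a + c))


# Hypothetical presence/absence of two species across six sites
common_species = [1, 1, 1, 1, 0, 0]
rare_species = [1, 0, 0, 0, 0, 0]

print(ochiai(common_species, rare_species))
```

For these vectors a = 1, b = 3, and c = 0, so the denominator is the geometric mean of the two species' occurrence totals (4 and 1).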

Implementations

R

vegdist() in the vegan package

dist() in the stats package

distance() or bcdist() in the ecodist package

daisy() in the cluster package can compute a Gower index for both quantitative and categorical variables