17.2. Similarity Measures#

This section provides a review of popular similarity and dissimilarity measures in literature.

17.2.1. Introduction# Similarity Function#

The following is an informal definition of a similarity function.

Definition 17.5 (Similarity function)

Let \(\bX\) be a dataset of \(n\) data points sampled from the Euclidean space \(\RR^m\). Then a similarity function is some function \(s : \RR^m \times \RR^m \to \RR\) for any pair of data points \(\bx, \by \in \RR^m\) with a structure

\[ s(\bx, \by) = s(x_1, \dots, x_d, y_1, \dots, y_d) \]

which grows as the data points look similar to each other and satisfies the following properties

  1. \(0 \leq s(\bx, \by) \leq 1\).

  2. \(s (\bx, \bx) = 1\) for every \(\bx \in \RR^m\).

  3. \(s(\bx, \by) = s(\by, \bx)\).

Normally, we require that the similarity function is symmetric; i.e.,

\[s(\bx, \by) = s(\by, \bx).\]

There are rare cases where asymmetric similarity functions have been used.

Example 17.6 (Similarity from metrics)

Let \(d\) be a metric (distance function) defined on \(\RR^n\). Then we have for any \(\bx, \by, \bz \in \RR^m\),

  1. \(d (\bx, \bx) = 0\).

  2. \(d (\bx, \by) \geq 0\).

  3. \(d (\bx, \by) = d(\by, \bx)\).

  4. \(d (\bx, \by) \leq d(\bx, \bz) + d(\bz, \by)\).

One way to induce a similarity function is:

\[ s(\bx, \by) = \exp (- d(\bx, \by)). \]

We can see that

  1. Since \(\exp(t) \geq 0\) for every \(t \in \RR\), hence \(s(\bx, \by) \geq 0\).

  2. Since \(d(\bx, \by) \geq 0\), hence \(s(\bx, \by) \leq 1\).

  3. Since \(d\) is symmetric, hence \(s\) is symmetric.

  4. Since \(d(\bx, \bx) = 0\), hence \(s (\bx, \bx) = 1\).

Thus, \(s\) is indeed a similarity function.

Note that a metric satisfies the triangle inequality. However, this property was not required for the similarity function induced from the metric. Thus, a metric imposes an additional structure which is more stringent than the needs for a similarity measure.

For example, the squared distance function \(d^2\) satisfies the identity of indiscernibles, is nonnegative and symmetric. It doesn’t satisfy triangle inequality. However, we can construct a similarity function

\[ s'(\bx, \by) = \exp (- d^2(\bx, \by)). \] Proximity Matrices#

Definition 17.6 (Distance matrix)

Let \(d\) be a metric on the space \(\RR^m\). Let \(\bX\) be a dataset of \(n\) points \(\{ \bx_1, \dots, \bx_n \}\). Then the distance matrix for \(\bX\) is defined as

\[\begin{split} \bM_{\dist}(\bX) = \begin{bmatrix} 0 & d_{1 2} & \dots & d_{1 n}\\ d_{2 1} & 0 & \dots & d_{2 n} \\ \vdots & \vdots & \ddots & \vdots \\ d_{n 1} & d_{n 2} & \dots & 0 \end{bmatrix} \end{split}\]

where \(d_{i j} = d(\bx_i, \bx_j)\).

Definition 17.7 (Similarity matrix)

Let \(s\) be a similarity function on the space \(\RR^m\). Let \(\bX\) be a dataset of \(n\) points \(\{ \bx_1, \dots, \bx_n \}\). Then the similarity matrix for \(\bX\) is defined as

\[\begin{split} \bM_{\simf}(\bX) = \begin{bmatrix} 1 & s_{1 2} & \dots & s_{1 n}\\ s_{2 1} & 1 & \dots & s_{2 n} \\ \vdots & \vdots & \ddots & \vdots \\ s_{n 1} & s_{n 2} & \dots & 1 \end{bmatrix} \end{split}\]

where \(s_{i j} = s(\bx_i, \bx_j)\).

Definition 17.8 (Proximity matrix)

Let \(\bX\) be a dataset of \(n\) points \(\{ \bx_1, \dots, \bx_n \}\). Then an \(n \times n\) square matrix \(\bM\) where every \((i,j)\)-th element is some measure of similarity or distance between \(\bx_i\) and \(\bx_j\) is known as a proximity matrix.

In other words, a proximity matrix is either a distance matrix or a similarity matrix.

Each entry in a proximity matrix is called a proximity index.

Remark 17.1 (Asymmetry in proximity)

There are several problems where a proximity metric may not be symmetric. For example, when the objects in the dataset are locations in a hilly area and the measure of the distance is the time taken in going from point A to point B, the time taken may be quite different when going uphill or downhill. Such problems do give rise to asymmetrical proximity matrices. However, we will normally be concerned with symmetric cases.

Definition 17.9 (Proximity graph)

Let \(\bX\) be a dataset of \(n\) points \(\{ \bx_1, \dots, \bx_n \}\). Let \(\bM\) be a proximity matrix associated with the dataset \(\bX\). Then the corresponding proximity graph \(\GGG\) is a weighted graph whose nodes/vertices are the data points \(\bx_i\) and the edges have weights equal to the proximity indices.

A symmetric proximity matrix leads to an undirected graph while an asymmetric proximity matrix leads to a directed graph. Scatter and Covariance#

Definition 17.10 (Scatter matrix)

Let \(\bX\) be a dataset of \(n\) points \(\{ \bx_1, \dots, \bx_n \}\) sampled from the Euclidean space \(\RR^n\). Let \(\bar{\bx}\) be the arithmetic mean of the data points.

\[ \bar{\bx} = \frac{1}{n}\sum_{i=1}^n \bx_i. \]

Then the scatter matrix for \(\bX\) is defined as

\[ \bM_t(\bX) = \sum_{i=1}^n (\bx_i - \bar{\bx}) (\bx_i - \bar{\bx})^T. \]

Definition 17.11 (Statistical scatter)

The trace of the scatter matrix is known as the statistical scatter of the data set. It is given by

\[ \Trace(\bM_t(\bX)) = \sum_{i=1}^n (\bx_i - \bar{\bx})^T (\bx_i - \bar{\bx}). \]

Recall that \(\Trace(\bA \bB) = \Trace (\bB \bA)\).


\[ \Trace(\bM_t(\bX)) = \Trace\left ( \sum_{i=1}^n (\bx_i - \bar{\bx}) (\bx_i - \bar{\bx})^T \right ) = \Trace \left ( \sum_{i=1}^n (\bx_i - \bar{\bx})^T (\bx_i - \bar{\bx}) \right ) = \sum_{i=1}^n (\bx_i - \bar{\bx})^T (\bx_i - \bar{\bx}) \]

since this quantity is a scalar.

Let \(\CCC = \{C_1, \dots, C_k \}\) be a (hard) clustering (partition) of \(\bX\) into \(k\) clusters. Then for a given cluster \(C_j\), the within-scatter-matrix is given by

\[ \bM_t(C_j) = \sum_{i \in C_j} (\bx_i - \bz_j) (\bx_i - \bz_j)^T \]

where \(\bz_j\) is the mean of the cluster \(C_i\)

\[ \bz_j = \frac{1}{|C_j|} \sum_{i \in C_j} \bx_i. \]

The within-cluster scatter matrix of the partition is given by

(17.2)#\[\bM_w(\CCC) = \sum_{j=1}^k \bM_t(C_j) = \sum_{j=1}^k \sum_{i \in C_j} (\bx_i - \bz_j) (\bx_i - \bz_j)^T.\]

The within-cluster scatter is the trace of this matrix given by

(17.3)#\[\Trace (\bM_w(\CCC)) = \sum_{j=1}^k \sum_{i \in C_j} (\bx_i - \bz_j)^T (\bx_i - \bz_j) \]

The between-cluster scatter matrix for the partition \(\CCC\) is given by

\[ \bM_b(\CCC) = \bM_t(\bX) - \bM_w(\CCC). \]

We shall denote the dataset after mean removal as

\[ \widetilde{\bX} = \bX - \bar{\bx} \]

where the matrix vector subtraction is done according to the standard broadcasting rules.

The covariance matrix of the dataset is given by

(17.4)#\[\Sigma = \frac{1}{n} \widetilde{\bX} \widetilde{\bX}^T.\]

Sample covariance matrix is often used in statistics which is given by

(17.5)#\[\bS = \frac{1}{n - 1} \widetilde{\bX} \widetilde{\bX}^T = \frac{n}{n - 1} \Sigma.\]

17.2.2. Common Proximity Measures#

We look at some of the common proximity measures for numerical data. Euclidean Distance#

(17.6)#\[d_2(\bx, \by) = \sqrt{\sum_{i=1}^m (x_i - y_i)^2} = \sqrt{ (\bx - \by)^T (\bx - \by)}. \] Squared Euclidean Distance#

(17.7)#\[d_2^2(\bx, \by) = \sum_{i=1}^m (x_i - y_i)^2 = (\bx - \by)^T (\bx - \by). \] Manhattan Distance#

This is also known as city-block distance

(17.8)#\[d_1(\bx, \by) = \sum_{i=1}^m |x_i - y_i|.\]

Manhattan segmental distance is a variant in which only some of the dimensions selected by a subset \(P \subseteq \{ 1,\dots, m\}\) are used.

(17.9)#\[d_P(\bx, \by) = \frac{1}{|P|} \sum_{i \in P} |x_i - y_i|.\] Maximum Distance#

This is also known as the Chebyshev distance.

(17.10)#\[d_{\max}(\bx, \by) = \max_{i=1}^m |x_i - y_i|.\] Minkowski Distance#

This is the generalization of Euclidean, Manhattan and Chebyshev distances for some \(p \geq 1\).

(17.11)#\[d_p(\bx, \by) = \left ( \sum_{i=1}^m |x_i - y_i|^p \right )^{\frac{1}{p}}.\] Mahalanobis Distance#

As discussed in Example 17.5, Mahalanobis distance can address the issues related to the scale differences between different features and the correlation among features. This is given by

(17.12)#\[d_{\mah}(\bx, \by) = \sqrt{ (\bx - \by)^T \Sigma^{-1} (\bx - \by)}.\]

Mahalanobis distance is invariant to nonsingular transformations. Let \(\bC\) be any \(m \times m\) non-singular matrix.

Let \(\bY = \{ \by_1, \dots, \by_n \}\) be a new dataset constructed by transforming the dataset \(\bX\) as

\[ \by_i = \bC \by_i. \]


\[ d_{\mah}(\bx_i, \bx_j) = d_{\mah}(\by_i, \by_j). \] Chord Distance#

We normalize the data points to project them to the unit sphere in \(\RR^m\). Then we measure the length of the chord between the two points on the sphere. It is given by

(17.13)#\[d_{\chord}(\bx, \by) = \sqrt{ 2 - 2 \frac{\langle \bx, \by \rangle}{\| \bx \|_2 \| \by\|_2}}.\]

It is easy to see how this formula is arrived at. Assume \(\bx, \by\) lie on the unit sphere. Then

\[ \langle \bx - \by, \bx - \by \rangle = \| \bx \|_2^2 + \| \by \|_2^2 - 2 \langle \bx, \by \rangle = 2 - 2 \langle \bx, \by \rangle. \]

When they are not normalized, we normalize them

\[ 2 - 2 \frac{\langle \bx, \by \rangle}{\| \bx \|_2 \| \by\|_2}. \]

Finally, for computing the length of the chord, we need to take the square root. Geodesic Distance#

Once the data points have been normalized to the unit sphere, one can calculate the shortest path (arc) between any two data points along the surface of the unit sphere. This is known as the geodesic distance. It is given by

(17.14)#\[d_{\geo}(\bx, \by) = \arccos \left ( 1 - \frac{d_{\chord}(\bx, \by) }{2} \right).\] Cosine Similarity#

The cosine similarity between two points is given by

(17.15)#\[s_{\cos}(\bx, \by) = \frac{|\langle \bx, \by \rangle | }{\| \bx \|_2 \| \by\|_2}.\]

It is easy to see that cosine similarity is bound between \(0\) and \(1\), is symmetric and \(s_{\cos}(\bx, \bx) = 1\).

17.2.3. Proximity Between Clusters#

Many clustering algorithms are hierarchical. In the context of agglomerative hierarchical clustering, one needs to merge clusters which are similar to each other in each iteration. Thus, it is imperative to have some measures of similarity or dissimilarity between clusters.

In this subsection, we shall consider two arbitrary clusters of the dataset \(\bX\) denoted by \(C_1\) and \(C_2\) where

\[ C_1 = \{ \by_1, \dots, \by_r \} \text{ and } C_2 = \{ \bz_1, \dots, \bz_s \}. \]

Let the mean of \(C_1\) be given by

\[ \bar{\by} = \frac{1}{r} \sum_{i=1}^r \by_i. \]

Let the mean of \(C_2\) be given by

\[ \bar{\bz} = \frac{1}{s} \sum_{i=1}^s \bz_i. \] Mean Based Distance#

The mean based distance between two clusters is defined as

(17.16)#\[d_{\mean}(C_1, C_2) = d(\bar{\by}, \bar{\bz}).\] Nearest Neighbor Distance#

(17.17)#\[d_{\nn}(C_1, C_2) = \min_{1 \leq i \leq r, 1 \leq j \leq s} d(\by_i, \bz_j).\] Farthest Neighbor Distance#

(17.18)#\[d_{\fn}(C_1, C_2) = \max_{1 \leq i \leq r, 1 \leq j \leq s} d(\by_i, \bz_j).\] Average Neighbor Distance#

(17.19)#\[d_{\ave}(C_1, C_2) = \frac{1}{r s} \sum_{i=1}^r \sum_{j=1}^s d(\by_i, \bz_j).\] Statistical Distance#

(17.20)#\[d_{\stat}(C_1, C_2) = \frac{r s}{r + s} (\bar{\by} - \bar{\bz})^T (\bar{\by} - \bar{\bz}).\] Lance-Williams Formula#

Consider a step in an agglomerative hierarchical algorithm where clusters \(C_i\) and \(C_j\) are being merged into a new cluster \(C\) and we wish to consider the distance between another cluster \(C_k\) and the new cluster \(C\).

The following formula provides a recursive definition such that we can compute the distance \(d(C_k, C)\) using existing distance information.

(17.21)#\[d(C_k, C_i \cup C_j) = \alpha_i d( C_k, C_i) + \alpha_j d(C_k, C_j) + \beta d( C_i, C_j) + \gamma | d(C_k, C_i) - d(C_k, C_j) |\]

where \(d\) is some distance function between two clusters.

Some common choices of the parameters \(\alpha_i, \alpha_j, \beta, \gamma\) are given in the following table.

















\(\frac{n_i}{n_i + n_j}\)

\(\frac{n_j}{n_i + n_j}\)



Weighted group average






\(\frac{n_i}{n_i + n_j}\)

\(\frac{n_j}{n_i + n_j}\)

\(\frac{-n_i n_j}{(n_i + n_j)^2}\)