17.2. Similarity Measures#
This section provides a review of popular similarity and dissimilarity measures in the literature.
17.2.1. Introduction#
17.2.1.1. Similarity Function#
The following is an informal definition of a similarity function.
(Similarity function)
Let \(\bX\) be a dataset of \(n\) data points sampled from the Euclidean space \(\RR^m\). Then a similarity function is a function \(s : \RR^m \times \RR^m \to \RR\) whose value \(s(\bx, \by)\) grows as the data points \(\bx, \by \in \RR^m\) look more similar to each other and which satisfies the following properties:
\(0 \leq s(\bx, \by) \leq 1\).
\(s (\bx, \bx) = 1\) for every \(\bx \in \RR^m\).
\(s(\bx, \by) = s(\by, \bx)\).
Normally, we require that the similarity function is symmetric; i.e., \(s(\bx, \by) = s(\by, \bx)\) for every pair of data points.
There are rare cases where asymmetric similarity functions have been used.
(Similarity from metrics)
Let \(d\) be a metric (distance function) defined on \(\RR^m\). Then for any \(\bx, \by, \bz \in \RR^m\) we have
\(d (\bx, \bx) = 0\).
\(d (\bx, \by) \geq 0\).
\(d (\bx, \by) = d(\by, \bx)\).
\(d (\bx, \by) \leq d(\bx, \bz) + d(\bz, \by)\).
One way to induce a similarity function from the metric is \(s(\bx, \by) = \exp(- d(\bx, \by))\).
We can see that
Since \(\exp(t) > 0\) for every \(t \in \RR\), we have \(s(\bx, \by) \geq 0\).
Since \(d(\bx, \by) \geq 0\), we have \(\exp(-d(\bx, \by)) \leq \exp(0) = 1\); hence \(s(\bx, \by) \leq 1\).
Since \(d\) is symmetric, \(s\) is symmetric.
Since \(d(\bx, \bx) = 0\), we have \(s(\bx, \bx) = \exp(0) = 1\).
Thus, \(s\) is indeed a similarity function.
Note that a metric satisfies the triangle inequality. However, this property is not required for inducing a similarity function from the metric. Thus, a metric imposes additional structure that is more stringent than what a similarity measure requires.
For example, the squared distance function \(d^2\) satisfies the identity of indiscernibles, is nonnegative, and is symmetric, but it does not satisfy the triangle inequality. Nevertheless, we can still construct a similarity function from it as \(s(\bx, \by) = \exp(- d^2(\bx, \by))\).
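As a small illustration, here is a minimal sketch (assuming NumPy is available; the function names and sample points are mine, not from the text) of inducing similarity values from the Euclidean metric and from the squared distance:

```python
import numpy as np

def similarity_from_metric(x, y):
    # s(x, y) = exp(-d(x, y)) with d the Euclidean metric
    return np.exp(-np.linalg.norm(x - y))

def similarity_from_squared_distance(x, y):
    # s(x, y) = exp(-d^2(x, y)); d^2 is not a metric (no triangle
    # inequality) but still induces a valid similarity function
    return np.exp(-np.linalg.norm(x - y) ** 2)

x = np.array([1.0, 2.0])
y = np.array([1.5, 2.5])
print(similarity_from_metric(x, x))            # 1.0
print(similarity_from_metric(x, y))            # a value in (0, 1]
print(similarity_from_squared_distance(x, y))  # also a value in (0, 1]
```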
17.2.1.2. Proximity Matrices#
(Distance matrix)
Let \(d\) be a metric on the space \(\RR^m\). Let \(\bX\) be a dataset of \(n\) points \(\{ \bx_1, \dots, \bx_n \}\). Then the distance matrix for \(\bX\) is defined as the \(n \times n\) matrix whose \((i, j)\)-th entry is \(d_{i j} = d(\bx_i, \bx_j)\).
(Similarity matrix)
Let \(s\) be a similarity function on the space \(\RR^m\). Let \(\bX\) be a dataset of \(n\) points \(\{ \bx_1, \dots, \bx_n \}\). Then the similarity matrix for \(\bX\) is defined as the \(n \times n\) matrix whose \((i, j)\)-th entry is \(s_{i j} = s(\bx_i, \bx_j)\).
(Proximity matrix)
Let \(\bX\) be a dataset of \(n\) points \(\{ \bx_1, \dots, \bx_n \}\). Then an \(n \times n\) square matrix \(\bM\) where every \((i,j)\)-th element is some measure of similarity or distance between \(\bx_i\) and \(\bx_j\) is known as a proximity matrix.
In other words, a proximity matrix is either a distance matrix or a similarity matrix.
Each entry in a proximity matrix is called a proximity index.
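As a quick illustration, the following sketch (assuming NumPy and SciPy are available; the dataset and variable names are illustrative) builds a distance matrix with the Euclidean metric and the similarity matrix induced by \(s = \exp(-d)\):

```python
import numpy as np
from scipy.spatial.distance import cdist

# dataset of n = 4 points in R^2, one point per row
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])

D = cdist(X, X, metric='euclidean')   # distance matrix, d_ij = d(x_i, x_j)
S = np.exp(-D)                        # similarity matrix induced by s = exp(-d)

assert np.allclose(np.diag(D), 0.0)   # d(x_i, x_i) = 0
assert np.allclose(D, D.T)            # the distance matrix is symmetric
assert np.allclose(np.diag(S), 1.0)   # s(x_i, x_i) = 1
```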
(Asymmetry in proximity)
There are several problems where a proximity measure may not be symmetric. For example, when the objects in the dataset are locations in a hilly area and the measure of distance is the time taken to travel from point A to point B, the time may be quite different going uphill versus downhill. Such problems give rise to asymmetric proximity matrices. However, we will normally be concerned with symmetric cases.
(Proximity graph)
Let \(\bX\) be a dataset of \(n\) points \(\{ \bx_1, \dots, \bx_n \}\). Let \(\bM\) be a proximity matrix associated with the dataset \(\bX\). Then the corresponding proximity graph \(\GGG\) is a weighted graph whose nodes/vertices are the data points \(\bx_i\) and the edges have weights equal to the proximity indices.
A symmetric proximity matrix leads to an undirected graph while an asymmetric proximity matrix leads to a directed graph.
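If NetworkX is available, a proximity graph can be constructed directly from a proximity matrix. The following sketch (with illustrative data) builds an undirected weighted graph from a symmetric distance matrix:

```python
import numpy as np
import networkx as nx
from scipy.spatial.distance import cdist

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
D = cdist(X, X)                      # symmetric proximity (distance) matrix

G = nx.from_numpy_array(D)           # undirected weighted proximity graph
print(G[0][3]['weight'])             # proximity index between the 1st and 4th points

# an asymmetric proximity matrix would require a directed graph instead:
# G = nx.from_numpy_array(D, create_using=nx.DiGraph)
```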
17.2.1.3. Scatter and Covariance#
(Scatter matrix)
Let \(\bX\) be a dataset of \(n\) points \(\{ \bx_1, \dots, \bx_n \}\) sampled from the Euclidean space \(\RR^m\). Let \(\bar{\bx} = \frac{1}{n} \sum_{i=1}^n \bx_i\) be the arithmetic mean of the data points.
Then the scatter matrix for \(\bX\) is defined as the \(m \times m\) matrix \(\sum_{i=1}^n (\bx_i - \bar{\bx}) (\bx_i - \bar{\bx})^T\).
(Statistical scatter)
The trace of the scatter matrix is known as the statistical scatter of the data set. It is given by \(\sum_{i=1}^n \| \bx_i - \bar{\bx} \|_2^2\).
Recall that \(\Trace(\bA \bB) = \Trace (\bB \bA)\).
Hence \(\Trace \left( (\bx_i - \bar{\bx}) (\bx_i - \bar{\bx})^T \right) = \Trace \left( (\bx_i - \bar{\bx})^T (\bx_i - \bar{\bx}) \right) = \| \bx_i - \bar{\bx} \|_2^2\)
since this quantity is a scalar.
Let \(\CCC = \{C_1, \dots, C_k \}\) be a (hard) clustering (partition) of \(\bX\) into \(k\) clusters. Then for a given cluster \(C_j\), the within-scatter-matrix is given by \(\sum_{\bx \in C_j} (\bx - \bz_j) (\bx - \bz_j)^T\),
where \(\bz_j = \frac{1}{| C_j |} \sum_{\bx \in C_j} \bx\) is the mean of the cluster \(C_j\).
The within-cluster scatter matrix of the partition is given by the sum of the within-scatter-matrices of its clusters, \(\sum_{j=1}^k \sum_{\bx \in C_j} (\bx - \bz_j) (\bx - \bz_j)^T\).
The within-cluster scatter is the trace of this matrix, given by \(\sum_{j=1}^k \sum_{\bx \in C_j} \| \bx - \bz_j \|_2^2\).
The between-cluster scatter matrix for the partition \(\CCC\) is given by \(\sum_{j=1}^k | C_j | \, (\bz_j - \bar{\bx}) (\bz_j - \bar{\bx})^T\).
We shall denote the dataset after mean removal as
where the matrix vector subtraction is done according to the standard broadcasting rules.
The covariance matrix of the dataset is the scatter matrix scaled by \(\frac{1}{n}\), i.e., \(\frac{1}{n} \sum_{i=1}^n (\bx_i - \bar{\bx}) (\bx_i - \bar{\bx})^T\).
The sample covariance matrix, which is often used in statistics, is given by \(\frac{1}{n - 1} \sum_{i=1}^n (\bx_i - \bar{\bx}) (\bx_i - \bar{\bx})^T\).
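The following NumPy sketch (the dataset and variable names are mine) computes the scatter matrix, the statistical scatter, and the two covariance matrices for a small dataset, and verifies the relationships described above:

```python
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 4.0], [6.0, 2.0]])  # n = 4 points in R^2
n, m = X.shape
x_bar = X.mean(axis=0)                       # arithmetic mean of the data points
Xc = X - x_bar                               # mean-removed dataset (broadcasting)

S = Xc.T @ Xc                                # scatter matrix: sum of outer products
scatter = np.trace(S)                        # statistical scatter
assert np.isclose(scatter, np.sum(Xc ** 2))  # equals the sum of squared deviations

C = S / n                                    # covariance matrix
C_sample = S / (n - 1)                       # sample covariance matrix
assert np.allclose(C_sample, np.cov(X, rowvar=False))
```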
17.2.2. Common Proximity Measures#
We look at some of the common proximity measures for numerical data.
17.2.2.1. Euclidean Distance#
17.2.2.2. Squared Euclidean Distance#
17.2.2.3. Manhattan Distance#
This is also known as the city-block distance.
The Manhattan segmental distance is a variant in which only the dimensions selected by a subset \(P \subseteq \{ 1, \dots, m \}\) are used in the sum.
17.2.2.4. Maximum Distance#
This is also known as the Chebyshev distance.
17.2.2.5. Minkowski Distance#
This is the generalization of the Euclidean, Manhattan, and Chebyshev distances for some \(p \geq 1\), given by \(\left( \sum_{k=1}^m | x_k - y_k |^p \right)^{1/p}\), where \(x_k\) and \(y_k\) denote the \(k\)-th coordinates of \(\bx\) and \(\by\). The Euclidean, Manhattan, and Chebyshev distances correspond to \(p = 2\), \(p = 1\), and \(p \to \infty\) respectively.
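The distances in this family can be computed directly with SciPy; the following sketch (with arbitrary sample points) evaluates the Euclidean, squared Euclidean, Manhattan, Chebyshev, and Minkowski distances, plus a Manhattan segmental distance over a hypothetical subset \(P\) of dimensions:

```python
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 5.0])

print(distance.euclidean(x, y))       # Euclidean: sqrt(1 + 4 + 4) = 3.0
print(distance.sqeuclidean(x, y))     # squared Euclidean: 9.0
print(distance.cityblock(x, y))       # Manhattan / city-block: 1 + 2 + 2 = 5.0
print(distance.chebyshev(x, y))       # maximum / Chebyshev: 2.0
print(distance.minkowski(x, y, p=3))  # Minkowski with p = 3
print(np.sum(np.abs(x - y)[[0, 2]]))  # Manhattan segmental distance on P = {1, 3}
```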
17.2.2.6. Mahalanobis Distance#
As discussed in Example 17.5, the Mahalanobis distance can address issues related to scale differences between features and correlation among features. It is given by \(\sqrt{(\bx - \by)^T \Sigma^{-1} (\bx - \by)}\), where \(\Sigma\) denotes the covariance matrix of the dataset.
Mahalanobis distance is invariant to nonsingular linear transformations of the data. Let \(\bC\) be any \(m \times m\) non-singular matrix.
Let \(\bY = \{ \by_1, \dots, \by_n \}\) be a new dataset constructed by transforming the dataset \(\bX\) as \(\by_i = \bC \bx_i\) for \(i = 1, \dots, n\).
Then the Mahalanobis distance between \(\by_i\) and \(\by_j\) computed over \(\bY\) equals the Mahalanobis distance between \(\bx_i\) and \(\bx_j\) computed over \(\bX\).
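A small numerical sketch (NumPy/SciPy, with a randomly generated dataset and an arbitrarily chosen non-singular matrix) verifying this invariance:

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                   # dataset of 50 points in R^3
C = np.array([[2.0, 0.5, 0.0],
              [0.0, 1.0, 0.3],
              [0.0, 0.0, 4.0]])                # a non-singular 3 x 3 matrix
Y = X @ C.T                                    # transformed dataset, y_i = C x_i

VI_X = np.linalg.inv(np.cov(X, rowvar=False))  # inverse covariance for X
VI_Y = np.linalg.inv(np.cov(Y, rowvar=False))  # inverse covariance for Y

d_X = mahalanobis(X[0], X[1], VI_X)
d_Y = mahalanobis(Y[0], Y[1], VI_Y)
assert np.isclose(d_X, d_Y)                    # invariant under the transformation
print(d_X, d_Y)
```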
17.2.2.7. Chord Distance#
We normalize the data points to project them to the unit sphere in \(\RR^m\) and then measure the length of the chord between the two projected points on the sphere. It is given by \(\left( 2 - 2 \frac{\bx^T \by}{\| \bx \|_2 \| \by \|_2} \right)^{1/2}\).
It is easy to see how this formula is arrived at. Assume \(\bx, \by\) lie on the unit sphere. Then \(\| \bx - \by \|_2^2 = \| \bx \|_2^2 + \| \by \|_2^2 - 2 \bx^T \by = 2 - 2 \bx^T \by\).
When they are not normalized, we normalize them first, which replaces \(\bx^T \by\) by \(\frac{\bx^T \by}{\| \bx \|_2 \| \by \|_2}\).
Finally, to obtain the length of the chord, we take the square root.
17.2.2.8. Geodesic Distance#
Once the data points have been normalized to the unit sphere, one can calculate the shortest path (arc) between any two data points along the surface of the unit sphere. This is known as the geodesic distance. It is given by the arc length \(\arccos \left( \frac{\bx^T \by}{\| \bx \|_2 \| \by \|_2} \right)\), i.e., the angle (in radians) between the two normalized points.
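A minimal NumPy sketch (the function names are mine) computing the chord and geodesic distances by normalizing the points to the unit sphere:

```python
import numpy as np

def chord_distance(x, y):
    # length of the chord between x/||x|| and y/||y|| on the unit sphere
    cos_xy = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return np.sqrt(2.0 - 2.0 * cos_xy)

def geodesic_distance(x, y):
    # arc length (angle in radians) between the normalized points
    cos_xy = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return np.arccos(np.clip(cos_xy, -1.0, 1.0))

x = np.array([1.0, 0.0])
y = np.array([1.0, 1.0])
print(chord_distance(x, y))      # 2 * sin(pi/8), about 0.7654
print(geodesic_distance(x, y))   # pi/4, about 0.7854
```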
17.2.2.9. Cosine Similarity#
The cosine similarity between two points is given by \(s_{\cos}(\bx, \by) = \frac{\bx^T \by}{\| \bx \|_2 \| \by \|_2}\).
It is easy to see that cosine similarity is symmetric, satisfies \(s_{\cos}(\bx, \bx) = 1\), and always lies between \(-1\) and \(1\); for data points with nonnegative coordinates it lies between \(0\) and \(1\).
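A corresponding sketch for cosine similarity (again with illustrative names), which also shows that the chord distance computed above equals \(\sqrt{2 - 2 s_{\cos}(\bx, \by)}\):

```python
import numpy as np

def cosine_similarity(x, y):
    # s_cos(x, y) = <x, y> / (||x|| * ||y||)
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([1.0, 0.0])
y = np.array([1.0, 1.0])
s = cosine_similarity(x, y)
print(s)                            # 1/sqrt(2)
print(cosine_similarity(x, x))      # 1.0
print(np.sqrt(2.0 - 2.0 * s))       # equals the chord distance between x and y
```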
17.2.3. Proximity Between Clusters#
Many clustering algorithms are hierarchical. In the context of agglomerative hierarchical clustering, one needs to merge clusters which are similar to each other in each iteration. Thus, it is imperative to have some measures of similarity or dissimilarity between clusters.
In this subsection, we shall consider two arbitrary clusters of the dataset \(\bX\) denoted by \(C_1\) and \(C_2\), containing \(n_1\) and \(n_2\) points respectively.
Let the mean of \(C_1\) be given by \(\bz_1 = \frac{1}{n_1} \sum_{\bx \in C_1} \bx\).
Let the mean of \(C_2\) be given by \(\bz_2 = \frac{1}{n_2} \sum_{\bx \in C_2} \bx\).
17.2.3.1. Mean Based Distance#
The mean based distance between two clusters is defined as the distance between their means; with the Euclidean distance, it is \(\| \bz_1 - \bz_2 \|_2\).
17.2.3.2. Nearest Neighbor Distance#
17.2.3.3. Farthest Neighbor Distance#
17.2.3.4. Average Neighbor Distance#
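The following sketch (NumPy/SciPy, with two illustrative clusters) computes the mean based distance together with the standard nearest-neighbor (single-link), farthest-neighbor (complete-link), and average-neighbor distances discussed in the preceding subsections:

```python
import numpy as np
from scipy.spatial.distance import cdist

# two illustrative clusters, one point per row
C1 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
C2 = np.array([[4.0, 4.0], [5.0, 4.0]])

D = cdist(C1, C2)                        # pairwise distances between the two clusters

d_mean = np.linalg.norm(C1.mean(axis=0) - C2.mean(axis=0))  # mean based distance
d_min = D.min()                          # nearest neighbor (single-link) distance
d_max = D.max()                          # farthest neighbor (complete-link) distance
d_avg = D.mean()                         # average neighbor distance

print(d_mean, d_min, d_max, d_avg)
```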
17.2.3.5. Statistical Distance#
17.2.3.6. Lance-Williams Formula#
Consider a step in an agglomerative hierarchical algorithm where clusters \(C_i\) and \(C_j\) are being merged into a new cluster \(C\) and we wish to consider the distance between another cluster \(C_k\) and the new cluster \(C\).
The following formula provides a recursive definition such that we can compute the distance \(d(C_k, C)\) using existing distance information:
\(d(C_k, C) = \alpha_i \, d(C_k, C_i) + \alpha_j \, d(C_k, C_j) + \beta \, d(C_i, C_j) + \gamma \, | d(C_k, C_i) - d(C_k, C_j) |\),
where \(d\) is some distance function between two clusters.
Some common choices of the parameters \(\alpha_i, \alpha_j, \beta, \gamma\) are given in the following table, where \(n_i\) and \(n_j\) denote the number of points in \(C_i\) and \(C_j\) respectively.
| Algorithm | \(\alpha_i\) | \(\alpha_j\) | \(\beta\) | \(\gamma\) |
|---|---|---|---|---|
| Single-link | \(\frac{1}{2}\) | \(\frac{1}{2}\) | \(0\) | \(-\frac{1}{2}\) |
| Complete-link | \(\frac{1}{2}\) | \(\frac{1}{2}\) | \(0\) | \(\frac{1}{2}\) |
| Group-average | \(\frac{n_i}{n_i + n_j}\) | \(\frac{n_j}{n_i + n_j}\) | \(0\) | \(0\) |
| Weighted group average | \(\frac{1}{2}\) | \(\frac{1}{2}\) | \(0\) | \(0\) |
| Centroid | \(\frac{n_i}{n_i + n_j}\) | \(\frac{n_j}{n_i + n_j}\) | \(\frac{-n_i n_j}{(n_i + n_j)^2}\) | \(0\) |
| Median | \(\frac{1}{2}\) | \(\frac{1}{2}\) | \(-\frac{1}{4}\) | \(0\) |
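A minimal sketch (plain Python; the helper name is mine) of the Lance-Williams update, checked against the single-link and complete-link rules:

```python
def lance_williams(d_ki, d_kj, d_ij, alpha_i, alpha_j, beta, gamma):
    # d(C_k, C) for the merged cluster C = C_i union C_j
    return (alpha_i * d_ki + alpha_j * d_kj
            + beta * d_ij + gamma * abs(d_ki - d_kj))

d_ki, d_kj, d_ij = 3.0, 5.0, 2.0

# single-link parameters 1/2, 1/2, 0, -1/2 reproduce min(d_ki, d_kj)
print(lance_williams(d_ki, d_kj, d_ij, 0.5, 0.5, 0.0, -0.5))   # 3.0
# complete-link parameters 1/2, 1/2, 0, 1/2 reproduce max(d_ki, d_kj)
print(lance_williams(d_ki, d_kj, d_ij, 0.5, 0.5, 0.0, 0.5))    # 5.0
```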