f_diss: Euclidean, Mahalanobis and cosine dissimilarity measurements

View source: R/f_diss.R

f_diss R Documentation



Lifecycle: stable

This function computes the dissimilarity between observations using the Euclidean distance, the Mahalanobis distance or the cosine dissimilarity (a.k.a. spectral angle mapper).


f_diss(Xr, Xu = NULL, diss_method = "euclid",
       center = TRUE, scale = FALSE)



Xr: a matrix containing the (reference) data.

Xu: an optional matrix containing data of a second set of observations (samples).

diss_method: the method for computing the dissimilarity between observations. Options are "euclid" (Euclidean distance), "mahalanobis" (Mahalanobis distance) and "cosine" (cosine distance, a.k.a. spectral angle mapper). See details.

center: a logical indicating if the spectral data Xr (and Xu if specified) must be centered. If Xu is provided, the data is centered on the basis of $Xr \cup Xu$.

scale: a logical indicating if Xr (and Xu if specified) must be scaled. If Xu is provided, the data is scaled on the basis of $Xr \cup Xu$.
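As a rough illustration of what centering "on the basis of" the union of Xr and Xu can look like, the base R sketch below pools both matrices, centers the pooled data with column means, and splits the result again. It assumes that the pooled column statistics are what is meant; the toy matrices and object names are hypothetical, and this is not the internal code of f_diss.

```r
# Toy data (hypothetical): 2 reference observations, 1 additional observation
Xr_toy <- matrix(c(1, 2,
                   3, 4), ncol = 2, byrow = TRUE)
Xu_toy <- matrix(c(5, 6), ncol = 2, byrow = TRUE)

# Assumption: the centering statistics come from the pooled rows of Xr and Xu
pooled <- rbind(Xr_toy, Xu_toy)
pooled_centered <- scale(pooled, center = TRUE, scale = FALSE)

# Split the centered data back into the two sets
idx_r <- seq_len(nrow(Xr_toy))
Xr_centered <- pooled_centered[idx_r, , drop = FALSE]
Xu_centered <- pooled_centered[-idx_r, , drop = FALSE]
```

With pooled statistics, neither set needs to have zero column means on its own; only the pooled data does.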


The results obtained for the Euclidean dissimilarity are equivalent to those returned by the stats::dist() function, but are scaled differently: in f_diss, the squared dissimilarity scores are divided by the number of variables before the square root is taken. In addition, f_diss is considerably faster, which can be advantageous when computing dissimilarities for very large matrices. See the examples section for a comparison between stats::dist() and f_diss.

In the case of both the Euclidean and Mahalanobis distances, the scaled dissimilarity matrix $D$ between observations in a given matrix $X$ is computed as follows:


$$d(x_i, x_j)^{2} = \sum (x_i - x_j) M^{-1} (x_i - x_j)^{\mathrm{T}}$$

$$d_{scaled}(x_i, x_j) = \sqrt{\frac{1}{p} d(x_i, x_j)^{2}}$$
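The two equations above can be sketched in base R as follows. This is a minimal illustration, not the implementation used by f_diss: `scaled_diss` is a hypothetical helper, the loops are written for readability rather than speed, and the toy matrix is made up. With the identity matrix as $M$, the result should match stats::dist() divided by the square root of the number of variables.

```r
# Hypothetical helper (not part of resemble): scaled dissimilarity for a
# given matrix M, following the two equations above
scaled_diss <- function(X, M = diag(ncol(X))) {
  p <- ncol(X)
  n <- nrow(X)
  M_inv <- solve(M)
  D <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      delta <- X[i, ] - X[j, ]
      # squared distance, divided by p, then square root
      D[i, j] <- sqrt(drop(t(delta) %*% M_inv %*% delta) / p)
    }
  }
  D
}

# Toy data: 3 observations, 2 variables
X <- matrix(c(0, 0,
              3, 4,
              6, 8), ncol = 2, byrow = TRUE)

D_euclid <- scaled_diss(X)                     # M = identity
D_ref <- as.matrix(dist(X)) / sqrt(ncol(X))    # rescaled stats::dist()
max(abs(D_euclid - D_ref))
```

Passing `M = cov(X)` to the same helper would give the scaled Mahalanobis variant, provided the covariance matrix can be inverted.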

where $p$ is the number of variables in $X$, and $M$ is the identity matrix in the case of the Euclidean distance and the variance-covariance matrix of $X$ in the case of the Mahalanobis distance. The Mahalanobis distance can also be viewed as the Euclidean distance after applying a linear transformation to the original variables. This transformation is obtained from a factorization of the inverse covariance matrix, $M^{-1} = W^{\mathrm{T}} W$, where $W$ is merely the square root of $M^{-1}$, which can be found by using a singular value decomposition.
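The equivalence between the Mahalanobis distance and a Euclidean distance on linearly transformed variables can be checked with a short base R sketch. It assumes simulated data and builds the factor of the inverse covariance matrix from svd(); the object names are illustrative, and the final division by the number of variables is left out so that the result can be compared directly with stats::mahalanobis().

```r
set.seed(1)
X <- matrix(rnorm(50 * 3), ncol = 3)   # simulated data: 50 observations, 3 variables
M <- cov(X)
M_inv <- solve(M)

# Factor the inverse covariance matrix as t(W) %*% W through its SVD
s <- svd(M_inv)
W <- diag(sqrt(s$d)) %*% t(s$v)

# Transform the variables, then use plain Euclidean distances
Z <- X %*% t(W)
d_transformed <- unname(as.matrix(dist(Z))[1, ]^2)

# Reference: squared Mahalanobis distances from the first observation
d_reference <- mahalanobis(X, center = X[1, ], cov = M)

all.equal(d_transformed, d_reference)
```

Dividing the squared distances by `ncol(X)` before taking the square root would give the scaled version defined above.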

Note that when computing the Mahalanobis distance on a dataset with highly correlated variables (e.g. spectral variables), the variance-covariance matrix may be singular, in which case it cannot be inverted and the distance cannot be computed. This is also the case when the number of observations in the dataset is smaller than the number of variables.
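Both situations can be reproduced with simulated data. The sketch below is only an illustration in base R (the matrices and object names are made up): it checks the rank of the covariance matrix when there are fewer observations than variables, and the condition number when two variables are almost perfectly correlated.

```r
set.seed(1)

# Case 1: fewer observations (5) than variables (10)
X_wide <- matrix(rnorm(5 * 10), nrow = 5)
M_wide <- cov(X_wide)
rank_wide <- qr(M_wide)$rank   # cannot exceed nrow(X_wide) - 1, so below 10

# Case 2: two almost perfectly correlated variables
x1 <- rnorm(100)
X_corr <- cbind(x1, x1 + rnorm(100, sd = 1e-8))
cond_corr <- kappa(cov(X_corr), exact = TRUE)   # very large condition number

rank_wide < ncol(X_wide)
cond_corr
```

A rank below the number of variables, or a very large condition number, indicates that inverting the covariance matrix is not possible or not numerically reliable. Restricting the computation to a small number of variables, as done for the Mahalanobis distance in the examples section, is one way to avoid the problem.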

The Mahalanobis distance is computed using the linear transformation approach described above.

The cosine dissimilarity $c$ between two observations $x_i$ and $x_j$ is computed as follows:


$$c(x_i, x_j) = \cos^{-1} \frac{\sum_{k=1}^{p} x_{i,k} x_{j,k}}{\sqrt{\sum_{k=1}^{p} x_{i,k}^{2}} \sqrt{\sum_{k=1}^{p} x_{j,k}^{2}}}$$

where $p$ is the number of variables of the observations. The function does not accept input data containing missing values. NOTE: The computed distances are divided by the number of variables/columns in Xr.
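For a single pair of observations, the cosine dissimilarity formula can be written in base R as below. `cosine_diss` is a hypothetical helper used only for illustration; it returns the angle in radians and leaves out the division by the number of variables mentioned in the note, so its values are not expected to match the output of f_diss exactly.

```r
# Hypothetical helper (not part of resemble): angle between two observations
cosine_diss <- function(x_i, x_j) {
  cos_angle <- sum(x_i * x_j) / (sqrt(sum(x_i^2)) * sqrt(sum(x_j^2)))
  # Guard against rounding slightly outside [-1, 1] before acos()
  acos(min(max(cos_angle, -1), 1))
}

a <- c(1, 0)
b <- c(0, 1)

angle_orthogonal <- cosine_diss(a, b)      # pi / 2 for orthogonal vectors
angle_parallel <- cosine_diss(a, 2 * a)    # 0: rescaling does not change the angle
```

Because only the angle matters, two observations that differ by a multiplicative factor have a cosine dissimilarity of zero.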


a matrix of the computed dissimilarities.


Leonardo Ramirez-Lopez and Antoine Stevens



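library(resemble)
library(prospectr)
# NIRsoil is an example dataset provided by the prospectr package
data(NIRsoil)

Xu <- NIRsoil$spc[!as.logical(NIRsoil$train), ]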
Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ]

# Euclidean distances between all the observations in Xr

ed <- f_diss(Xr = Xr, diss_method = "euclid")

# Equivalence with the dist() function of base R
ed_dist <- (as.matrix(dist(Xr))^2 / ncol(Xr))^0.5
round(ed_dist - ed, 5)

# Comparing the computational time
iter <- 20
# time iter calls to f_diss()
tm <- proc.time()
for (i in 1:iter) {
  f_diss(Xr)
}
f_diss_time <- proc.time() - tm

# time iter calls to dist()
tm_2 <- proc.time()
for (i in 1:iter) {
  dist(Xr)
}
dist_time <- proc.time() - tm_2


# Euclidean distances between observations in Xr and observations in Xu
ed_xr_xu <- f_diss(Xr, Xu)

# Mahalanobis distance computed on the first 20 spectral variables
md_xr_xu <- f_diss(Xr[, 1:20], Xu[, 1:20], "mahalanobis")

# Cosine dissimilarity matrix
cdiss_xr_xu <- f_diss(Xr, Xu, "cosine")

resemble documentation built on May 29, 2024, 8:49 a.m.