dist.matrix    R Documentation
Compute a symmetric matrix of distances (or similarities) between the rows or columns of a matrix;
or compute cross-distances between the rows or columns of two different matrices.
This implementation is faster than dist and can operate on sparse matrices (in canonical DSM format).
dist.matrix(M, M2 = NULL, method = "cosine", p = 2,
normalized = FALSE, byrow = TRUE, convert = TRUE, as.dist = FALSE,
terms = NULL, terms2 = terms, skip.missing = FALSE)
M: a dense or sparse matrix representing a scored DSM, or an object of class dsm.
M2: an optional dense or sparse matrix representing a second scored DSM, or an object of class dsm. If specified, cross-distances between the rows (or columns) of M and those of M2 are computed.
method: distance or similarity measure to be used (see “Distance Measures” below for details).
p: exponent of the minkowski p-metric; see “Distance Measures” below for admissible values (0 <= p < Inf, with p=Inf selecting the maximum distance).
normalized: if TRUE, the row (or column) vectors of M and M2 are assumed to have already been normalized to unit length, which allows a more efficient computation of some measures (e.g. the denominator of the cosine similarity is omitted; see “Distance Measures”).
byrow: whether to calculate distances between row vectors (default) or between column vectors (byrow=FALSE).
convert: if TRUE (the default), similarity measures are converted to dissimilarities or distance metrics (see “Distance Measures” below for details).
as.dist: if TRUE, convert the full symmetric distance matrix to a compact object of class dist. This is only possible for a symmetric distance measure and without cross-distances.
terms: a character vector specifying rows of M for which distances are to be computed (or columns if byrow=FALSE).
terms2: a character vector specifying rows (or columns) of M2 for cross-distances; defaults to terms.
skip.missing: if TRUE, terms not found in the corresponding matrix are silently ignored; by default (FALSE) an error is raised.
By default, a numeric matrix of class dist.matrix, specifying distances or similarities between term vectors.
A similarity matrix is marked by an additional attribute similarity with value TRUE.
If the distance or similarity matrix is symmetric (i.e. neither a cross-distance matrix nor based on an asymmetric distance measure), it is marked by an attribute symmetric with value TRUE.
If as.dist=TRUE, the matrix is compacted to an object of class dist.
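As a hedged illustration of the returned object and its attributes (assuming the wordspace package is attached, as in the sketch above):

SM <- dist.matrix(DSM_TermTermMatrix, method="cosine", convert=FALSE)
attr(SM, "similarity")   # TRUE: entries are similarities rather than distances
attr(SM, "symmetric")    # TRUE: symmetric measure, no cross-distances
DM <- dist.matrix(DSM_TermTermMatrix, method="euclidean", as.dist=TRUE)
class(DM)                # compact "dist" object, e.g. for use with hclust()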
Given two DSM vectors x and y, the following distance metrics can be computed:
euclidean: The Euclidean distance, given by
d_2(x, y) = \sqrt{ \sum_i (x_i - y_i)^2 }
manhattan: The Manhattan (or “city block”) distance, given by
d_1(x, y) = \sum_i |x_i - y_i|
maximum: The maximum distance, given by
d_{\infty}(x, y) = \max_i |x_i - y_i|
minkowski: The Minkowski distance is a family of metrics determined by a parameter 0 \le p < \infty, which encompasses the Euclidean, Manhattan and maximum distances as special cases. Also known as the L_p-metric, it is defined by
d_p(x, y) = \left( \sum_i |x_i - y_i|^p \right)^{1/p}
for p \ge 1 and by
d_p(x, y) = \sum_i | x_i - y_i |^p
for 0 \le p < 1. In the latter case, it is not homogeneous and cannot be derived from a corresponding mathematical norm (cf. rowNorms).
Special cases include the Euclidean metric d_2(x, y) for p = 2 and the Manhattan metric d_1(x, y) for p = 1, but the dedicated methods above provide more efficient implementations. For p \to \infty, d_p(x, y) converges to the maximum distance d_{\infty}(x, y), which is also selected by setting p=Inf. For p = 0, d_p(x, y) corresponds to the Hamming distance, i.e. the number of differences
d_0(x, y) = \#\{ i | x_i \ne y_i \}
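The following base-R sketch mirrors these definitions for a single pair of vectors (an illustration of the formulas above, not the package implementation):

x <- c(1, 0, 2, 3)
y <- c(0, 1, 2, 5)
sum(abs(x - y)^3)^(1/3)  # Minkowski distance for p = 3 (case p >= 1, with the 1/p root)
sum(abs(x - y)^0.5)      # p = 0.5 (case 0 <= p < 1: no root, not homogeneous)
sum(x != y)              # p = 0: Hamming distance (number of differing coordinates)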
canberra: The Canberra metric has been implemented for compatibility with the dist function, even though it is probably not very useful for DSM vectors. It is given by
\sum_i \frac{|x_i - y_i|}{|x_i| + |y_i|}
(see https://en.wikipedia.org/wiki/Canberra_distance). Terms with x_i = y_i = 0 are silently dropped from the summation.
Note that dist uses a different formula
\sum_i \frac{|x_i - y_i|}{|x_i + y_i|}
which is highly problematic unless x and y are guaranteed to be non-negative. Terms with x_i = y_i = 0 are imputed, i.e. set to the average value of all nonzero terms.
In addition, the following similarity measures can be computed and optionally converted to a distance metric (or dissimilarity):
cosine (default): The cosine similarity, given by
\cos \phi = \frac{x^T y}{||x||_2 \cdot ||y||_2}
If normalized=TRUE, the denominator is omitted. If convert=TRUE (the default), the cosine similarity is converted to angular distance \phi, given in degrees ranging from 0 to 180.
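A small base-R sketch of the relationship between cosine similarity and angular distance for two vectors (illustration of the formula above only):

x <- c(1, 2, 0, 3)
y <- c(2, 1, 1, 1)
cos.sim <- sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))  # cosine similarity (convert=FALSE)
acos(min(max(cos.sim, -1), 1)) / pi * 180                  # angular distance in degrees (convert=TRUE); clamping guards against rounding error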
jaccard: The generalized Jaccard coefficient, given by
J(x, y) = \frac{ \sum_i \min(x_i, y_i) }{ \sum_i \max(x_i, y_i) }
which is only defined for non-negative vectors x and y. If convert=TRUE (the default), the Jaccard metric 1 - J(x,y) is returned (see Kosub 2016 for details). Note that J(0, 0) = 1.
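A base-R sketch of the generalized Jaccard coefficient for two non-negative vectors (illustration only):

x <- c(1, 2, 0, 3)
y <- c(2, 1, 1, 1)
J <- sum(pmin(x, y)) / sum(pmax(x, y))  # generalized Jaccard coefficient (convert=FALSE)
1 - J                                   # Jaccard metric returned with convert=TRUE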
overlap: An asymmetric measure of overlap, given by
o(x, y) = \frac{ \sum_i \min(x_i, y_i) }{ \sum_i x_i }
for non-negative vectors x and y. If convert=TRUE (the default), the result is converted into a dissimilarity measure 1 - o(x,y), which is not a metric, of course. Note that o(0, y) = 1 and in particular o(0, 0) = 1.
Overlap computes the proportion of the “mass” of x that is shared with y; as a consequence, o(x, y) = 1 whenever x \le y. If both vectors are normalized as probability distributions (||x||_1 = ||y||_1 = 1) then overlap is symmetric (o(x, y) = o(y, x)) and can be thought of as the shared probability mass of the two distributions. In this case, normalized=TRUE can be passed in order to simplify the computation to o(x, y) = \sum_i \min(x_i, y_i).
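A base-R sketch of overlap for two probability distributions (||x||_1 = ||y||_1 = 1), illustrating the symmetric “shared probability mass” interpretation (illustration only):

x <- c(0.5, 0.3, 0.2, 0.0)
y <- c(0.2, 0.3, 0.1, 0.4)
sum(pmin(x, y))            # o(x, y) = o(y, x) = shared probability mass (here 0.6)
sum(pmin(x, y)) / sum(x)   # the general definition yields the same value because sum(x) = 1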
Stephanie Evert (https://purl.org/stephanie.evert)
plot and head methods for distance matrices; nearest.neighbours and pair.distances also accept a precomputed dist.matrix object instead of a DSM matrix M
rowNorms for length normalization of DSM vectors, which is highly recommended for most distance metrics (and implicit in cosine)
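The following sketch shows explicit length normalization before computing cosine distances; it assumes a normalize.rows() helper from the same package (if unavailable in your version, the rows can instead be divided by rowNorms() manually):

Mn <- normalize.rows(DSM_TermTermMatrix, method="euclidean")     # assumed helper: rows scaled to unit Euclidean length
dist.matrix(Mn, method="cosine", normalized=TRUE, as.dist=TRUE)  # normalized=TRUE skips the cosine denominator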
library(wordspace)       # provides dist.matrix() and the example matrix DSM_TermTermMatrix
M <- DSM_TermTermMatrix
dist.matrix(M, as.dist=TRUE) # angular distance
dist.matrix(M, method="euclidean", as.dist=TRUE) # Euclidean distance
dist.matrix(M, method="manhattan", as.dist=TRUE) # Manhattan distance
dist.matrix(M, method="minkowski", p=1, as.dist=TRUE) # L_1 distance
dist.matrix(M, method="minkowski", p=99, as.dist=TRUE) # almost L_Inf
dist.matrix(M, method="maximum", as.dist=TRUE) # L_Inf (maximum)
dist.matrix(M, method="minkowski", p=.5, as.dist=TRUE) # L_0.5 distance
dist.matrix(M, method="minkowski", p=0, as.dist=TRUE) # Hamming distance
round(dist.matrix(M, method="cosine", convert=FALSE), 3) # cosine similarity
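A further hedged example for cross-distances between two term sets, reusing M from above (row names are taken from the matrix itself so the snippet stays generic):

words <- rownames(M)[1:3]
dist.matrix(M, M2=M, terms=words, terms2=rownames(M))  # 3 x nrow(M) cross-distance matrix (not symmetric)
nearest.neighbours(M, words[1], n=3)                   # nearest neighbours of a single term (see also above)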