dist.matrix | R Documentation |
Compute a symmetric matrix of distances (or similarities) between the rows or columns of a matrix; or compute cross-distances between the rows or columns of two different matrices.
This implementation is faster than dist and can operate on sparse matrices (in canonical DSM format).
dist.matrix(M, M2 = NULL, method = "cosine", p = 2,
            normalized = FALSE, byrow = TRUE, convert = TRUE, as.dist = FALSE,
            terms = NULL, terms2 = terms, skip.missing = FALSE)
M |
a dense or sparse matrix representing a scored DSM, or an object of class dsm |
M2 |
an optional dense or sparse matrix representing a second scored DSM, or an object of class dsm |
method |
distance or similarity measure to be used (see “Distance Measures” below for details) |
p |
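exponent of the minkowski L_p-metric, a numeric value in the range 0 \le p < \infty (see “Distance Measures” below for details) |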
normalized |
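if TRUE, assume that the vectors have already been normalized, which simplifies the computation (see “Distance Measures” below for details) |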
byrow |
whether to calculate distances between row vectors (default) or between column vectors (byrow=FALSE) |
convert |
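if TRUE (the default), similarity measures are converted to distances or dissimilarities (see “Distance Measures” below for details) |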
as.dist |
convert the full symmetric distance matrix to a compact object of class dist |
terms |
a character vector specifying the rows of M (or columns if byrow=FALSE) for which distances are computed |
terms2 |
a character vector specifying the rows of M2 (or columns if byrow=FALSE) for which cross-distances are computed; defaults to terms |
skip.missing |
if TRUE, terms not found in M (or M2) are silently skipped |
By default, a numeric matrix of class dist.matrix, specifying distances or similarities between term vectors. A similarity matrix is marked by an additional attribute similarity with value TRUE.
If the distance or similarity matrix is symmetric (i.e. neither a cross-distance matrix nor based on an asymmetric distance measure), it is marked by an attribute symmetric with value TRUE.
If as.dist=TRUE, the matrix is compacted to an object of class dist.
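The attributes described above can be inspected with attr(). The following is only a minimal sketch, assuming that the wordspace package (which is taken here to provide dist.matrix and the example matrix DSM_TermTermMatrix) is installed; the variable names sim and d are purely illustrative.

```r
# Guarded so that the snippet does nothing if wordspace is not installed
# (assumption: dist.matrix and DSM_TermTermMatrix come from that package)
if (requireNamespace("wordspace", quietly = TRUE)) {
  library(wordspace)
  M <- DSM_TermTermMatrix

  # cosine similarities: convert=FALSE keeps them as similarities
  sim <- dist.matrix(M, method = "cosine", convert = FALSE)
  print(attr(sim, "similarity"))  # attribute marking a similarity matrix
  print(attr(sim, "symmetric"))   # attribute marking a symmetric matrix

  # compact representation of the default angular distances
  d <- dist.matrix(M, as.dist = TRUE)
  print(class(d))
}
```

An object of class dist can be passed directly to functions such as hclust(), which is one reason to request the compact form.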
Given two DSM vectors x and y, the following distance metrics can be computed:
euclidean
The Euclidean distance given by
d_2(x, y) = \sqrt{ \sum_i (x_i - y_i)^2 }
manhattan
The Manhattan (or “city block”) distance given by
d_1(x, y) = \sum_i |x_i - y_i|
maximum
The maximum distance given by
d_{\infty}(x, y) = \max_i |x_i - y_i|
minkowski
The Minkowski distance is a family of metrics determined by a parameter 0 \le p < \infty, which encompasses the Euclidean, Manhattan and maximum distance as special cases. Also known as L_p-metric, it is defined by
d_p(x, y) = \left( \sum_i |x_i - y_i|^p \right)^{1/p}
for p \ge 1 and by
d_p(x, y) = \sum_i | x_i - y_i |^p
for 0 \le p < 1. In the latter case, it is not homogeneous and cannot be derived from a corresponding mathematical norm (cf. rowNorms).
Special cases include the Euclidean metric d_2(x, y) for p = 2 and the Manhattan metric d_1(x, y) for p = 1, but the dedicated methods above provide more efficient implementations. For p \to \infty, d_p(x, y) converges to the maximum distance d_{\infty}(x, y), which is also selected by setting p=Inf. For p = 0, d_p(x, y) corresponds to the Hamming distance, i.e. the number of differences
d_0(x, y) = \#\{ i | x_i \ne y_i \}
canberra
The Canberra metric has been implemented for compatibility with the dist function, even though it is probably not very useful for DSM vectors. It is given by
\sum_i \frac{|x_i - y_i|}{|x_i| + |y_i|}
(see https://en.wikipedia.org/wiki/Canberra_distance). Terms with x_i = y_i = 0 are silently dropped from the summation.
Note that dist uses a different formula
\sum_i \frac{|x_i - y_i|}{|x_i + y_i|}
which is highly problematic unless x and y are guaranteed to be non-negative. In dist, terms with x_i = y_i = 0 are imputed, i.e. set to the average value of all nonzero terms.
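The definitions above can be checked on small vectors with plain base R. The sketch below is not the optimized implementation behind dist.matrix; the helper names minkowski and canberra are hypothetical and simply follow the formulas given in this section.

```r
# Minkowski family d_p(x, y), following the case distinctions above
# (hypothetical helper, not part of any package)
minkowski <- function(x, y, p = 2) {
  stopifnot(length(x) == length(y), p >= 0)
  d <- abs(x - y)
  if (is.infinite(p)) return(max(d))   # maximum distance (p = Inf)
  if (p == 0) return(sum(d != 0))      # Hamming distance; note 0^0 == 1 in R
  s <- sum(d^p)
  if (p >= 1) s^(1 / p) else s         # no outer root for 0 < p < 1
}

# Canberra metric with |x_i| + |y_i| in the denominator;
# terms with x_i = y_i = 0 are dropped from the sum
canberra <- function(x, y) {
  stopifnot(length(x) == length(y))
  num <- abs(x - y)
  den <- abs(x) + abs(y)
  keep <- den > 0
  sum(num[keep] / den[keep])
}

x <- c(1, 0, 3, 0)
y <- c(0, 0, 1, 2)                     # |x - y| = (1, 0, 2, 2)

minkowski(x, y, p = 1)     # Manhattan: 1 + 0 + 2 + 2 = 5
minkowski(x, y, p = 2)     # Euclidean: sqrt(1 + 0 + 4 + 4) = 3
minkowski(x, y, p = Inf)   # maximum: 2
minkowski(x, y, p = 0)     # Hamming: 3 differing coordinates
minkowski(x, y, p = 0.5)   # 1 + 2 * sqrt(2)
canberra(x, y)             # 1/1 + 2/4 + 2/2 = 2.5
```

The explicit branch for p = 0 is needed in this sketch because R evaluates 0^0 as 1, so summing d^p would count identical coordinates as differences.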
In addition, the following similarity measures can be computed and optionally converted to a distance metric (or dissimilarity):
cosine
(default) The cosine similarity given by
\cos \phi = \frac{x^T y}{||x||_2 \cdot ||y||_2}
If normalized=TRUE, the denominator is omitted. If convert=TRUE (the default), the cosine similarity is converted to angular distance \phi, given in degrees ranging from 0 to 180.
jaccard
The generalized Jaccard coefficient given by
J(x, y) = \frac{ \sum_i \min(x_i, y_i) }{ \sum_i \max(x_i, y_i) }
which is only defined for non-negative vectors x and y. If convert=TRUE (the default), the Jaccard metric 1 - J(x,y) is returned (see Kosub 2016 for details). Note that J(0, 0) = 1.
overlap
An asymmetric measure of overlap given by
o(x, y) = \frac{ \sum_i \min(x_i, y_i) }{ \sum_i x_i }
for non-negative vectors x and y. If convert=TRUE (the default), the result is converted into a dissimilarity measure 1 - o(x,y), which is not a metric, of course. Note that o(0, y) = 1 and in particular o(0, 0) = 1.
Overlap computes the proportion of the “mass” of x that is shared with y; as a consequence, o(x, y) = 1 whenever x \le y. If both vectors are normalized as probability distributions (||x||_1 = ||y||_1 = 1) then overlap is symmetric (o(x, y) = o(y, x)) and can be thought of as the shared probability mass of the two distributions. In this case, normalized=TRUE can be passed in order to simplify the computation to o(x, y) = \sum_i \min(x_i, y_i).
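The similarity measures above can likewise be reproduced in base R for small vectors. This is an illustrative sketch only: the helper names cosine_sim, angular_dist, jaccard and overlap are hypothetical, and the handling of all-zero vectors simply follows the conventions J(0, 0) = 1 and o(0, y) = 1 stated in this section.

```r
# Cosine similarity and its conversion to angular distance in degrees
cosine_sim <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}
angular_dist <- function(x, y) {
  cs <- min(1, max(-1, cosine_sim(x, y)))  # clamp against rounding error
  acos(cs) * 180 / pi
}

# Generalized Jaccard coefficient for non-negative vectors
jaccard <- function(x, y) {
  stopifnot(all(x >= 0), all(y >= 0))
  den <- sum(pmax(x, y))
  if (den == 0) 1 else sum(pmin(x, y)) / den   # convention J(0, 0) = 1
}

# Asymmetric overlap: share of the mass of x that is also in y
overlap <- function(x, y) {
  stopifnot(all(x >= 0), all(y >= 0))
  den <- sum(x)
  if (den == 0) 1 else sum(pmin(x, y)) / den   # convention o(0, y) = 1
}

x <- c(1, 0, 1, 0)
y <- c(1, 1, 0, 0)

angular_dist(x, y)       # cos(phi) = 1/2, i.e. an angle of 60 degrees
1 - jaccard(x, y)        # Jaccard metric: 1 - 1/3
overlap(x, y)            # half of the mass of x is shared with y
overlap(x, x + y)        # equals 1 because x <= x + y
```

With convert=TRUE, the corresponding distances would be the angle itself, 1 - J(x, y) and 1 - o(x, y), as described above.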
Stephanie Evert (https://purl.org/stephanie.evert)
plot and head methods for distance matrices; nearest.neighbours and pair.distances also accept a precomputed dist.matrix object instead of a DSM matrix M
rowNorms for length normalization of DSM vectors, which is highly recommended for most distance metrics (and implicit in cosine)
M <- DSM_TermTermMatrix
dist.matrix(M, as.dist=TRUE) # angular distance
dist.matrix(M, method="euclidean", as.dist=TRUE) # Euclidean distance
dist.matrix(M, method="manhattan", as.dist=TRUE) # Manhattan distance
dist.matrix(M, method="minkowski", p=1, as.dist=TRUE) # L_1 distance
dist.matrix(M, method="minkowski", p=99, as.dist=TRUE) # almost L_Inf
dist.matrix(M, method="maximum", as.dist=TRUE) # L_Inf (maximum)
dist.matrix(M, method="minkowski", p=.5, as.dist=TRUE) # L_0.5 distance
dist.matrix(M, method="minkowski", p=0, as.dist=TRUE) # Hamming distance
round(dist.matrix(M, method="cosine", convert=FALSE), 3) # cosine similarity