dist.matrix | R Documentation |
Compute a symmetric matrix of distances (or similarities) between the rows or columns of a matrix;
or compute cross-distances between the rows or columns of two different matrices.
This implementation is faster than dist
and can operate on sparse matrices (in canonical DSM format).
dist.matrix(M, M2 = NULL, method = "cosine", p = 2, normalized = FALSE, byrow = TRUE, convert = TRUE, as.dist = FALSE, terms = NULL, terms2 = terms, skip.missing = FALSE)
M |
a dense or sparse matrix representing a scored DSM, or an object of class |
M2 |
an optional dense or sparse matrix representing a second scored DSM, or an object of class |
method |
distance or similarity measure to be used (see “Distance Measures” below for details) |
p |
exponent of the |
normalized |
if |
byrow |
whether to calculate distances between row vectors (default) or between column vectors ( |
convert |
if |
as.dist |
convert the full symmetric distance matrix to a compact object of class |
terms |
a character vector specifying rows of |
terms2 |
a character vector specifying rows of |
skip.missing |
if |
By default, a numeric matrix of class dist.matrix, specifying distances or similarities between term vectors.
A similarity matrix is marked by an additional attribute similarity with value TRUE.
If the distance or similarity matrix is symmetric (i.e. neither a cross-distance matrix nor based on an asymmetric distance measure), it is marked by an attribute symmetric with value TRUE.
If as.dist=TRUE, the matrix is compacted to an object of class dist.
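The compaction to class dist can be illustrated with base R alone. The following is a minimal sketch, assuming a small hand-made symmetric matrix (the term labels a, b, c and the values are hypothetical); it uses stats::as.dist directly and does not show how dist.matrix performs the conversion internally.

```r
# Toy symmetric distance matrix for three hypothetical terms
full <- matrix(c(0, 1, 2,
                 1, 0, 3,
                 2, 3, 0),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("a", "b", "c"), c("a", "b", "c")))

# stats::as.dist() stores only the lower triangle of the matrix
compact <- as.dist(full)

class(compact)       # object of class "dist"
length(compact)      # number of stored values (lower triangle only)
as.matrix(compact)   # expands back to a full symmetric matrix
```

The compact form is what functions such as hclust expect, which is the usual reason to request it.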
Given two DSM vectors x and y, the following distance metrics can be computed:
euclidean
The Euclidean distance given by
d_2(x, y) = sqrt( SUM(i) (x_i - y_i)^2 )
manhattan
The Manhattan (or “city block”) distance given by
d_1(x, y) = SUM(i) |x_i - y_i|
maximum
The maximum distance given by
d_Inf(x, y) = MAX(i) |x_i - y_i|
minkowski
The Minkowski distance is a family of metrics determined by a parameter 0 ≤ p < ∞, which encompasses the Euclidean, Manhattan and maximum distance as special cases. Also known as L_p-metric, it is defined by
d_p(x, y) = [ SUM(i) |x_i - y_i|^p ]^(1/p)
for p ≥ 1 and by
d_p(x, y) = SUM(i) |x_i - y_i|^p
for 0 ≤ p < 1. In the latter case, it is not homogeneous and cannot be derived from a corresponding mathematical norm (cf. rowNorms).
Special cases include the Euclidean metric d_2(x, y) for p = 2 and the Manhattan metric d_1(x, y) for p = 1, but the dedicated methods above provide more efficient implementations. For p -> Inf, d_p(x, y) converges to the maximum distance d_Inf(x, y), which is also selected by setting p=Inf. For p = 0, d_p(x, y) corresponds to the Hamming distance, i.e. the number of differences
#{i | x_i != y_i}
canberra
The Canberra metric has been implemented for compatibility with the dist function, even though it is probably not very useful for DSM vectors. It is given by
SUM(i) |x_i - y_i| / (|x_i| + |y_i|)
(see https://en.wikipedia.org/wiki/Canberra_distance). Terms with x_i = y_i = 0 are silently dropped from the summation.
Note that dist uses a different formula
SUM(i) |x_i - y_i| / |x_i + y_i|
which is highly problematic unless x and y are guaranteed to be non-negative. Terms with x_i = y_i = 0 are imputed, i.e. set to the average value of all nonzero terms.
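The case distinctions for the Minkowski family and the Canberra formula can be made concrete with a short base R sketch. The helper names minkowski and canberra are hypothetical and not part of the package; they follow the definitions given above for a single pair of vectors and make no claim about how dist.matrix is implemented.

```r
# Hypothetical helper: Minkowski distance between two numeric vectors,
# following the case distinction in the definitions above.
minkowski <- function(x, y, p = 2) {
  d <- abs(x - y)
  if (is.infinite(p)) return(max(d))       # maximum distance (p = Inf)
  if (p == 0)         return(sum(d != 0))  # Hamming distance: count differences
  if (p < 1)          return(sum(d^p))     # 0 < p < 1: no outer root
  sum(d^p)^(1 / p)                         # L_p metric for p >= 1
}

# Hypothetical helper: Canberra distance with denominator |x_i| + |y_i|;
# terms with x_i = y_i = 0 are dropped from the sum.
canberra <- function(x, y) {
  keep <- (x != 0) | (y != 0)
  sum(abs(x - y)[keep] / (abs(x) + abs(y))[keep])
}

x <- c(1, 0, 3, 0)
y <- c(4, 0, -1, 2)

minkowski(x, y, p = 1)    # Manhattan
minkowski(x, y, p = 2)    # Euclidean
minkowski(x, y, p = Inf)  # maximum
minkowski(x, y, p = 99)   # close to the maximum distance
minkowski(x, y, p = 0)    # Hamming
canberra(x, y)
```

The p = 0 case counts nonzero differences explicitly rather than evaluating d^0, because R evaluates 0^0 as 1 and would therefore count identical coordinates as differences.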
In addition, the following similarity measures can be computed and optionally converted to a distance metric (or dissimilarity):
cosine
(default) The cosine similarity given by
cos(phi) = x' y / (|x| |y|)
If normalized=TRUE, the denominator is omitted. If convert=TRUE (the default), the cosine similarity is converted to angular distance phi, given in degrees ranging from 0 to 180.
jaccard
The generalized Jaccard coefficient given by
J(x,y) = SUM(i) min(x_i, y_i) / SUM(i) max(x_i, y_i)
which is only defined for non-negative vectors x and y. If convert=TRUE (the default), the Jaccard metric 1 - J(x,y) is returned (see Kosub 2016 for details). Note that J(0, 0) = 1.
overlap
An asymmetric measure of overlap given by
o(x,y) = SUM(i) min(x_i, y_i) / SUM(i) x_i
for non-negative vectors x and y. If convert=TRUE (the default), the result is converted into a dissimilarity measure 1 - o(x,y), which is not a metric because overlap is asymmetric. Note that o(0, y) = 1 and in particular o(0, 0) = 1.
Overlap computes the proportion of the “mass” of x that is shared with y; as a consequence, o(x, y) = 1 whenever x <= y. If both vectors are normalized as probability distributions (|x|_1 = |y|_1 = 1) then overlap is symmetric (o(x, y) = o(y, x)) and can be thought of as the shared probability mass of the two distributions. In this case, normalized=TRUE can be passed in order to simplify the computation to o(x, y) = SUM(i) min(x_i, y_i).
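The three similarity measures and their conventions for zero vectors can be checked on a toy pair of vectors. This is a base R sketch under the definitions above; the helper names cosine_sim, angular_deg, jaccard and overlap are hypothetical and are not functions of the package.

```r
# Hypothetical helpers following the definitions above (non-negative
# vectors are assumed for jaccard and overlap).
cosine_sim <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

# Angular distance in degrees, as obtained from the cosine similarity
angular_deg <- function(x, y) {
  cs <- min(1, max(-1, cosine_sim(x, y)))  # clamp against rounding error
  acos(cs) * 180 / pi
}

jaccard <- function(x, y) {
  denom <- sum(pmax(x, y))
  if (denom == 0) return(1)                # convention J(0, 0) = 1
  sum(pmin(x, y)) / denom
}

overlap <- function(x, y) {
  denom <- sum(x)
  if (denom == 0) return(1)                # convention o(0, y) = 1
  sum(pmin(x, y)) / denom
}

x <- c(1, 0, 2)
y <- c(2, 1, 2)

cosine_sim(x, y)    # cosine similarity
angular_deg(x, y)   # corresponding angle in degrees
jaccard(x, y)       # generalized Jaccard coefficient
overlap(x, y)       # all of x's mass is shared with y, since x <= y
overlap(y, x)       # only part of y's mass is shared with x
```

Comparing overlap(x, y) with overlap(y, x) shows the asymmetry directly, which is why 1 - o(x, y) cannot serve as a metric.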
Stephanie Evert (https://purl.org/stephanie.evert)
plot and head methods for distance matrices; nearest.neighbours and pair.distances also accept a precomputed dist.matrix object instead of a DSM matrix M
rowNorms for length normalization of DSM vectors, which is highly recommended for most distance metrics (and implicit in cosine)
M <- DSM_TermTermMatrix
dist.matrix(M, as.dist=TRUE)                               # angular distance
dist.matrix(M, method="euclidean", as.dist=TRUE)           # Euclidean distance
dist.matrix(M, method="manhattan", as.dist=TRUE)           # Manhattan distance
dist.matrix(M, method="minkowski", p=1, as.dist=TRUE)      # L_1 distance
dist.matrix(M, method="minkowski", p=99, as.dist=TRUE)     # almost L_Inf
dist.matrix(M, method="maximum", as.dist=TRUE)             # L_Inf (maximum)
dist.matrix(M, method="minkowski", p=.5, as.dist=TRUE)     # L_0.5 distance
dist.matrix(M, method="minkowski", p=0, as.dist=TRUE)      # Hamming distance
round(dist.matrix(M, method="cosine", convert=FALSE), 3)   # cosine similarity