In wordspace: Distributional Semantic Models in R

dist.matrix

R Documentation

Distances/Similarities between Row or Column Vectors (wordspace)

Description

Compute a symmetric matrix of distances (or similarities) between the rows or columns of a matrix; or compute cross-distances between the rows or columns of two different matrices. This implementation is faster than dist and can operate on sparse matrices (in canonical DSM format).

Usage


dist.matrix(M, M2 = NULL, method = "cosine", p = 2, 
            normalized = FALSE, byrow = TRUE, convert = TRUE, as.dist = FALSE, 
            terms = NULL, terms2 = terms, skip.missing = FALSE)

Arguments

`M`	a dense or sparse matrix representing a scored DSM, or an object of class `dsm`
`M2`	an optional dense or sparse matrix representing a second scored DSM, or an object of class `dsm`. If present, cross-distances between the rows (or columns) of `M` and those of `M2` will be computed.
`method`	distance or similarity measure to be used (see “Distance Measures” below for details)
`p`	exponent of the `minkowski` L_p-metric, a numeric value in the range 0 ≤ p < ∞ (`p=Inf` is also accepted and selects the maximum distance). The range 0 ≤ p < 1 represents a generalization of the standard Minkowski distance, which cannot be derived from a proper mathematical norm (see details below).
`normalized`	if `TRUE`, assume that the row (or column) vectors of `M` and `M2` have been appropriately normalized (depending on the selected distance measure) in order to speed up calculations. This option is often used with the `cosine` metric, for which vectors must be normalized with respect to the Euclidean norm. It is also supported by the `overlap` measure (see “Distance Measures” below) and is currently ignored for all other distance measures.
`byrow`	whether to calculate distances between row vectors (default) or between column vectors (`byrow=FALSE`)
`convert`	if `TRUE`, similarity measures are automatically converted to distances in an appropriate way (see “Distance Measures” below for details). Note that this is the default setting and `convert=FALSE` has to be specified explicitly in order to obtain a similarity matrix.
`as.dist`	convert the full symmetric distance matrix to a compact object of class `dist`. This option cannot be used if cross-distances are calculated (with argument `M2`) or if a similarity measure has been selected (with option `convert=FALSE`).
`terms`	a character vector specifying rows of `M` for which the distance matrix is to be computed (or columns if `byrow=FALSE`)
`terms2`	a character vector specifying rows of `M2` for which the cross-distance matrix is to be computed (or columns if `byrow=FALSE`). If only the argument `terms` is specified, the same set of rows (or columns) will be selected from both `M` and `M2`; you can explicitly specify `terms2=NULL` in order to compute cross-distances for all rows (or columns) of `M2`.
`skip.missing`	if `TRUE`, silently ignores terms not found in `M` (or in `M2`). By default (`skip.missing=FALSE`) an error is raised in this case.
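
The following sketch shows how `M2`, `terms` and `skip.missing` might be combined. It assumes that the wordspace package is installed and uses its example matrix DSM_TermTermMatrix (as in the Examples section); the row labels "cat", "dog" and "animal" are assumed to be present in that matrix, and "unicorn" is a deliberately unknown term.

```r
library(wordspace)

M <- DSM_TermTermMatrix  # small example matrix shipped with wordspace

# symmetric distance matrix restricted to selected rows
# (assumption: these row labels exist in the example matrix)
D <- dist.matrix(M, terms = c("cat", "dog", "animal"))
dim(D)

# cross-distances between all rows of M and the rows of a second matrix
M2 <- M[c("cat", "dog"), , drop = FALSE]
X <- dist.matrix(M, M2)
dim(X)

# an unknown term raises an error unless skip.missing=TRUE is given
D2 <- dist.matrix(M, terms = c("cat", "dog", "unicorn"), skip.missing = TRUE)
dim(D2)
```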

Value

By default, a numeric matrix of class dist.matrix, specifying distances or similarities between term vectors. A similarity matrix is marked by an additional attribute similarity with value TRUE. If the distance or similarity matrix is symmetric (i.e. neither a cross-distance matrix nor based on an asymmetric distance measure), it is marked by an attribute symmetric with value TRUE.

If as.dist=TRUE, the matrix is compacted to an object of class dist.

Distance Measures

Given two DSM vectors x and y, the following distance metrics can be computed:

euclidean

The Euclidean distance given by

d_2(x, y) = sqrt( SUM(i) (x_i - y_i)^2 )

manhattan

The Manhattan (or “city block”) distance given by

d_1(x, y) = SUM(i) |x_i - y_i|

maximum

The maximum distance given by

d_Inf(x, y) = MAX(i) |x_i - y_i|
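
The three formulas above can be written out directly in base R. This is only an illustrative sketch for checking values by hand, not the package's implementation; the helper names `pairwise`, `euclidean`, `manhattan` and `maximum` and the toy matrix are made up for this example.

```r
# toy matrix with three row vectors (illustrative data)
M <- rbind(a = c(1, 0, 3, 0),
           b = c(1, 2, 0, 0),
           c = c(0, 0, 0, 1))

# apply a two-argument function f to every pair of rows
pairwise <- function(M, f) {
  n <- nrow(M)
  D <- matrix(0, n, n, dimnames = list(rownames(M), rownames(M)))
  for (i in seq_len(n)) for (j in seq_len(n)) D[i, j] <- f(M[i, ], M[j, ])
  D
}

euclidean <- function(x, y) sqrt(sum((x - y)^2))
manhattan <- function(x, y) sum(abs(x - y))
maximum   <- function(x, y) max(abs(x - y))

D.euc <- pairwise(M, euclidean)
D.man <- pairwise(M, manhattan)
D.max <- pairwise(M, maximum)

# cross-check against the reference implementation in stats::dist
all.equal(D.euc, as.matrix(dist(M, method = "euclidean")), check.attributes = FALSE)
```

The double loop is quadratic in the number of rows and is meant for small matrices only.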

minkowski

The Minkowski distance is a family of metrics determined by a parameter 0 ≤ p < ∞, which encompasses the Euclidean, Manhattan and maximum distance as special cases. Also known as L_p-metric, it is defined by

d_p(x, y) = [ SUM(i) |x_i - y_i|^p ]^(1/p)

for p ≥ 1 and by

d_p(x, y) = SUM(i) |x_i - y_i|^p

for 0 ≤ p < 1. In the latter case, it is not homogeneous and cannot be derived from a corresponding mathematical norm (cf. rowNorms).

Special cases include the Euclidean metric d_2(x, y) for p = 2 and the Manhattan metric d_1(x, y) for p = 1, but the dedicated methods above provide more efficient implementations. For p -> Inf, d_p(x, y) converges to the maximum distance d_Inf(x, y), which is also selected by setting p=Inf. For p = 0, d_p(x, y) corresponds to the Hamming distance, i.e. the number of differences

#{i | x_i != y_i}
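
The case distinctions for the Minkowski family can be sketched in base R as follows. The function name `minkowski` is a hypothetical helper for this illustration and does not reflect how the package computes the distance internally. Note that p = 0 needs separate handling because R evaluates 0^0 as 1.

```r
minkowski <- function(x, y, p = 2) {
  d <- abs(x - y)
  if (is.infinite(p)) return(max(d))   # maximum distance (p = Inf)
  if (p == 0) return(sum(d != 0))      # Hamming distance: number of differences
  if (p < 1) return(sum(d^p))          # generalization for 0 < p < 1, no root
  sum(d^p)^(1 / p)                     # standard L_p metric for p >= 1
}

x <- c(1, 0, 3, 0)
y <- c(1, 2, 0, 0)

minkowski(x, y, p = 2)    # sqrt(2^2 + 3^2) = sqrt(13)
minkowski(x, y, p = 1)    # 2 + 3 = 5
minkowski(x, y, p = Inf)  # max(2, 3) = 3
minkowski(x, y, p = 0)    # two positions differ
minkowski(x, y, p = 0.5)  # sqrt(2) + sqrt(3)
```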

canberra

The Canberra metric has been implemented for compatibility with the dist function, even though it is probably not very useful for DSM vectors. It is given by

SUM(i) |x_i - y_i| / (|x_i| + |y_i|)

(see https://en.wikipedia.org/wiki/Canberra_distance). Terms with x_i = y_i = 0 are silently dropped from the summation.

Note that dist uses a different formula

SUM(i) |x_i - y_i| / |x_i + y_i|

which is highly problematic unless x and y are guaranteed to be non-negative. In dist, terms with x_i = y_i = 0 are imputed, i.e. set to the average value of all nonzero terms.
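
The difference between the two denominators becomes visible as soon as a vector has negative entries. The sketch below writes out both formulas in base R; the helper `canberra` and the example vectors are made up for illustration, and the imputation step is not modelled.

```r
x <- c(1, -2, 0, 4)
y <- c(3,  2, 0, 0)

# Canberra metric with denominator |x_i| + |y_i|
canberra <- function(x, y) {
  num <- abs(x - y)
  den <- abs(x) + abs(y)
  keep <- den > 0              # drop terms with x_i = y_i = 0
  sum(num[keep] / den[keep])
}
canberra(x, y)                 # 2/4 + 4/4 + 4/4 = 2.5

# individual terms with the alternative denominator |x_i + y_i|:
# the second term is 4 / 0 = Inf because -2 and 2 cancel out,
# and the third term is 0 / 0 = NaN
alt <- abs(x - y) / abs(x + y)
alt
```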

In addition, the following similarity measures can be computed and optionally converted to a distance metric (or dissimilarity):

cosine (default)

The cosine similarity given by

cos(phi) = x' y / (|x| |y|)

If normalized=TRUE, the denominator is omitted. If convert=TRUE (the default), the cosine similarity is converted to the angular distance phi, given in degrees ranging from 0 to 180.
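
As an illustration, cosine similarity and its conversion to an angle in degrees can be sketched in base R. The helpers `cosine` and `angle` are hypothetical names for this example; clamping the cosine to [-1, 1] is an added safeguard against rounding errors and is an assumption of this sketch, not a documented detail of the package.

```r
cosine <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

angle <- function(x, y) {
  cs <- min(1, max(-1, cosine(x, y)))  # guard against rounding errors
  acos(cs) * 180 / pi                  # radians to degrees
}

x <- c(1, 0)
y <- c(1, 1)

cosine(x, y)   # 1 / sqrt(2)
angle(x, y)    # approximately 45 degrees
angle(x, -x)   # opposite directions: 180 degrees

# with vectors pre-normalized to Euclidean length 1,
# the inner product alone gives the cosine
x1 <- x / sqrt(sum(x^2))
y1 <- y / sqrt(sum(y^2))
sum(x1 * y1)
```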

jaccard

The generalized Jaccard coefficient given by

J(x,y) = SUM(i) min(x_i, y_i) / SUM(i) max(x_i, y_i)

which is only defined for non-negative vectors x and y. If convert=TRUE (the default), the Jaccard metric 1 - J(x,y) is returned (see Kosub 2016 for details). Note that J(0, 0) = 1.
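
A minimal base R sketch of the generalized Jaccard coefficient, including the convention J(0, 0) = 1. The helper name `jaccard` and the example vectors are made up for illustration.

```r
jaccard <- function(x, y) {
  stopifnot(all(x >= 0), all(y >= 0))  # only defined for non-negative vectors
  den <- sum(pmax(x, y))
  if (den == 0) return(1)              # convention J(0, 0) = 1
  sum(pmin(x, y)) / den
}

x <- c(1, 2, 0)
y <- c(2, 1, 1)

jaccard(x, y)       # (1 + 1 + 0) / (2 + 2 + 1) = 0.4
1 - jaccard(x, y)   # corresponding Jaccard distance
jaccard(c(0, 0, 0), c(0, 0, 0))
```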

overlap

An asymmetric measure of overlap given by

o(x,y) = SUM(i) min(x_i, y_i) / SUM(i) x_i

for non-negative vectors x and y. If convert=TRUE (the default), the result is converted into a dissimilarity measure 1 - o(x,y), which is not a metric, of course. Note that o(0, y) = 1 and in particular o(0, 0) = 1.

Overlap computes the proportion of the “mass” of x that is shared with y; as a consequence, o(x, y) = 1 whenever x <= y holds elementwise. If both vectors are normalized as probability distributions (|x|_1 = |y|_1 = 1) then overlap is symmetric (o(x, y) = o(y, x)) and can be thought of as the shared probability mass of the two distributions. In this case, normalized=TRUE can be passed in order to simplify the computation to o(x, y) = SUM(i) min(x_i, y_i).
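
The asymmetry of overlap, and its symmetry for probability distributions, can be checked with a small base R sketch. The helper `overlap` and the example vectors are hypothetical and only illustrate the formula.

```r
overlap <- function(x, y) {
  stopifnot(all(x >= 0), all(y >= 0))  # non-negative vectors only
  sx <- sum(x)
  if (sx == 0) return(1)               # convention o(0, y) = 1
  sum(pmin(x, y)) / sx
}

x <- c(1, 2, 0)
y <- c(2, 1, 1)

overlap(x, y)      # 2 / 3
overlap(y, x)      # 2 / 4, so the measure is asymmetric
overlap(x, x + 1)  # x <= x + 1 in every component, hence 1

# after normalization to probability distributions, overlap is symmetric
# and reduces to the sum of elementwise minima
px <- x / sum(x)
py <- y / sum(y)
overlap(px, py)
overlap(py, px)
sum(pmin(px, py))  # 1/3 + 1/4 + 0 = 7/12
```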

Author(s)

Stephanie Evert (https://purl.org/stephanie.evert)

Examples


M <- DSM_TermTermMatrix
dist.matrix(M, as.dist=TRUE)                     # angular distance
dist.matrix(M, method="euclidean", as.dist=TRUE) # Euclidean distance
dist.matrix(M, method="manhattan", as.dist=TRUE) # Manhattan distance
dist.matrix(M, method="minkowski", p=1, as.dist=TRUE)  # L_1 distance
dist.matrix(M, method="minkowski", p=99, as.dist=TRUE) # almost L_Inf
dist.matrix(M, method="maximum", as.dist=TRUE)         # L_Inf (maximum)
dist.matrix(M, method="minkowski", p=.5, as.dist=TRUE) # L_0.5 distance
dist.matrix(M, method="minkowski", p=0, as.dist=TRUE)  # Hamming distance

round(dist.matrix(M, method="cosine", convert=FALSE), 3) # cosine similarity

wordspace documentation built on Aug. 23, 2022, 1:06 a.m.