pair.distances | R Documentation |
Compute semantic distances (or similarities) between pairs of target terms based on a scored DSM matrix M, according to any of the distance measures supported by dist.matrix.
If one of the terms in a pair is not represented in the DSM, the distance is set to Inf (or to -Inf in the case of a similarity measure).
pair.distances(w1, w2, M, ..., transform = NULL, rank = c("none", "fwd", "bwd", "avg"), avg.method = c("arithmetic", "geometric", "harmonic"), batchsize = 10e6, verbose = FALSE)
w1 |
a character vector specifying the first term of each pair |
w2 |
a character vector of the same length as |
M |
a sparse or dense DSM matrix, suitable for passing to |
... |
further arguments are passed to |
rank |
whether to return the distance between the two terms ( |
transform |
an optional transformation function applied to the distance, similarity or rank values (e.g. |
avg.method |
with |
batchsize |
maximum number of similarity values to compute per batch. This parameter has an essential influence on efficiency and memory use of the algorithm and has to be tuned carefully for optimal performance. |
verbose |
if |
The rank argument controls whether semantic distance is measured directly by geometric distance (none), by forward neighbour rank (fwd), by backward neighbour rank (bwd), or by the average of forward and backward rank (avg).
Forward neighbour rank is the rank of w2 among the nearest neighbours of w1.
Backward neighbour rank is the rank of w1 among the nearest neighbours of w2.
The average can be computed as an arithmetic, geometric or harmonic mean, depending on avg.method.
Note that a transformation function is applied after averaging.
To obtain the arithmetic mean of log ranks, set transform=log10, rank="avg" and avg.method="geometric": the logarithm of the geometric mean of two ranks equals the arithmetic mean of their logarithms.
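The identity behind this setting can be checked in base R. The following is a minimal sketch with made-up rank values (fwd and bwd are hypothetical, not output of pair.distances); it only illustrates why averaging geometrically and then applying log10 gives the arithmetic mean of the log ranks.

```r
# Hypothetical forward and backward neighbour ranks for three word pairs
# (illustrative values only, not computed from a real DSM)
fwd <- c(1, 4, 100)
bwd <- c(10, 25, 1)

# Geometric mean of the two ranks, with log10 applied afterwards
# (averaging first, transformation second)
log_of_geometric <- log10(sqrt(fwd * bwd))

# Arithmetic mean of the individual log10 ranks
mean_of_logs <- (log10(fwd) + log10(bwd)) / 2

# The two quantities agree up to floating-point error
stopifnot(isTRUE(all.equal(log_of_geometric, mean_of_logs)))
```

Applying log10 before an arithmetic average is not possible through the transform argument, since the transformation always comes last; the geometric mean is the way to get the same result.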
Neighbour ranks assume that each target term is its own nearest neighbour and adjust ranks to account for this (i.e. w1 == w2 should return a rank of 0).
If M is a pre-computed distance matrix, the adjustment is only applied if it is also marked as symmetric (because otherwise w1 might not appear in the list of neighbours at all). This might lead to unexpected results once asymmetric measures are implemented in dist.matrix.
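The forward and backward ranks and the self-neighbour adjustment described above can be illustrated on a toy distance matrix. This is a base-R sketch under stated assumptions, not the implementation used by pair.distances: the matrix D, the term names and the helper neighbour_rank are hypothetical, and ties are broken with ties.method = "min" purely for the sake of the example.

```r
# Toy symmetric distance matrix over four hypothetical terms
terms <- c("cat", "dog", "car", "bus")
D <- matrix(c(0, 1, 4, 5,
              1, 0, 3, 6,
              4, 3, 0, 2,
              5, 6, 2, 0),
            nrow = 4, byrow = TRUE, dimnames = list(terms, terms))

# Hypothetical helper: rank of `b` among the neighbours of `a`,
# counting `a` itself as rank 0
neighbour_rank <- function(a, b, D) {
  d <- D[a, ]
  # rank() assigns rank 1 to the term itself (distance 0);
  # subtracting 1 applies the self-neighbour adjustment
  unname(rank(d, ties.method = "min")[b]) - 1
}

fwd  <- neighbour_rank("cat", "car", D)  # rank of "car" among neighbours of "cat"
bwd  <- neighbour_rank("car", "cat", D)  # rank of "cat" among neighbours of "car"
self <- neighbour_rank("cat", "cat", D)  # a term paired with itself
```

Even though D is symmetric, the forward and backward ranks of a pair generally differ, because each is computed within a different row of the matrix.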
For a sparse pre-computed similarity matrix M, only non-zero cells are considered as neighbours and all other ranks are set to Inf. This is consistent with the behaviour of nearest.neighbours.
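A small base-R sketch of this rule follows. The similarity values and term names are invented, the vector stands in for a single row of a sparse matrix, and the tie-breaking method is an assumption; the row does not contain the target term itself, so no self-neighbour adjustment is shown.

```r
# One row of a hypothetical sparse similarity matrix: similarities of one
# target term to four other terms, with 0 standing for cells that are not stored
sim <- c(dog = 0.8, car = 0, bus = 0.3, tree = 0)

# Start with rank Inf everywhere, then rank only the non-zero cells
ranks <- rep(Inf, length(sim))
names(ranks) <- names(sim)
nonzero <- sim != 0

# Higher similarity means a closer neighbour, so rank the negated values
ranks[nonzero] <- rank(-sim[nonzero], ties.method = "min")
```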
pair.distances is used as a default callback in several evaluation functions, which rely on the attribute similarity to distinguish between distance measures and similarity scores. For this reason, transformation functions should always be isotonic (order-preserving) so as not to mislead the evaluation procedure.
If rank="none" (the default), a numeric vector of the same length as w1 and w2 specifying the distances or similarities between the term pairs, according to the metric selected with the extra arguments (...).
Otherwise, an integer or numeric vector of the same length as w1 and w2 specifying forward, backward or average neighbour rank for the two terms.
In either case, a distance or rank of Inf (or a similarity of -Inf) is returned for any term pair not represented in the DSM.
Attribute similarity is set to TRUE if the returned values are similarity scores rather than distances.
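The shape of the return value for unrepresented terms can be sketched in base R. This is only an illustration under stated assumptions: the matrix M, the term names and the use of Euclidean distance are hypothetical choices for the example and do not reflect the measure selected by dist.matrix.

```r
# Toy DSM with three hypothetical target terms as row names
M <- matrix(c(1, 0,
              0, 1,
              1, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("cat", "dog", "car"), NULL))

w1 <- c("cat", "cat", "unicorn")  # "unicorn" is not a row of M
w2 <- c("dog", "car", "dog")

# A pair is usable only if both terms are rows of the matrix
known <- w1 %in% rownames(M) & w2 %in% rownames(M)

# Start from Inf and fill in Euclidean distances for the known pairs
# (Euclidean distance is an assumption made for this sketch)
dists <- rep(Inf, length(w1))
diffs <- M[w1[known], , drop = FALSE] - M[w2[known], , drop = FALSE]
dists[known] <- sqrt(rowSums(diffs^2))
```

The result has one entry per input pair, in the same order as w1 and w2, with Inf marking the pair that contains the unknown term.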
Stephanie Evert (https://purl.org/stephanie.evert)
dist.matrix, eval.similarity.correlation, eval.multiple.choice, nearest.neighbours
transform(RG65, angle=pair.distances(word1, word2, DSM_Vectors))