nearest_neighbours: Find Nearest Neighbours in DSM Space (wordspace)

nearest.neighboursR Documentation

Find Nearest Neighbours in DSM Space (wordspace)

Description

Find the nearest neighbours of a term vector in a DSM, given either as a scored cooccurrence matrix or a pre-computed distance matrix. The target term can be selected by name (in which case the cooccurrence or distance matrix must be labelled appropriately) or specified as a vector (if the DSM is given as a matrix).

Usage


nearest.neighbours(M, term, n = 10, M2 = NULL, byrow = TRUE,
                   drop = TRUE, skip.missing = FALSE, dist.matrix = FALSE,
                   ..., batchsize=50e6, verbose=FALSE)

Arguments

M

either a dense or sparse matrix representing a scored DSM (or an object of class dsm), or a pre-computed distance matrix returned by dist.matrix (as an object of class dist.matrix). Note that the compact representation produced by the dist function (class dist) is not accepted.

term

either a character vector specifying one or more target terms for which nearest neighbours will be found, or a matrix specifying the target vectors directly. A plain vector is interpreted as a single-row matrix.

n

an integer giving the number of nearest neighbours to be returned for each target term

M2

an optional dense or sparse matrix (or object of class dsm). If specified, nearest neighbours are found among the rows (default) or columns (byrow=FALSE) of M2, allowing for NN search in a cross-distance setting.

byrow

whether target terms are looked up in rows (default) or columns (byrow=FALSE) of M. NB: Target vectors in the term argument are always given as row vectors, even if byrow=FALSE.

drop

if TRUE, the return value is simplified to a vector (or distance matrix) if it contains nearest neighbours for exactly one target term (default). Set drop=FALSE to ensure that nearest.neighbours always returns a list.

skip.missing

if TRUE, silently ignores target terms not found in the DSM or distance matrix. By default (skip.missing=FALSE) an error is raised in this case.

dist.matrix

if TRUE, return a full distance matrix between the target term and its nearest neighbours (instead of a vector of neighbours). Note that a pre-computed distance matrix M must be symmetric in this case.

...

additional arguments are passed to dist.matrix if M is a scored DSM matrix. See the manpage of dist.matrix for details on available parameters and settings.

batchsize

if term is a long list of lookup terms, it will automatically be processed in batches. The number of terms per batch is chosen in such a way that approximately batchsize intermediate similarity values have to be computed and stored at a time (not used if M is a pre-computed distance matrix).

verbose

if TRUE, display some progress messages indicating how data are split into batches

Details

In most cases, the target term itself is automatically excluded from the list of neighbours. There are two exceptions:

  1. The target term is given as a vector rather than by name.

  2. Nearest neighbours are determined in a cross-distance setting. This is the case if (i) M2 is specified or (ii) M is a pre-computed distance matrix and not marked to be symmetric.

With dist.matrix=TRUE, the returned distance matrix always includes the target term.

M can also be a pre-computed distance or similarity matrix from an external source, which must be marked with as.distmat. If M is a sparse similarity matrix, only non-zero cells will be considered when looking for the nearest neighbours. Keep in mind that dist.matrix=TRUE is only valid if M is a symmetric matrix and marked as such.

Value

A list with one entry for each target term found in M, giving

  • dist.matrix=FALSE (default): the nearest neighbours as a numeric vector of distances or similarities labelled with the corresponding terms and ordered by distance

  • dist.matrix=TRUE: a full distance or similarity matrix for the target term and its nearest neighbours (as an object of class dist.matrix). An additional attribute selected contains a logical vector indicating the position of the target term in the matrix.

If drop=TRUE, a list containing only a single target term will be simplified to a plain vector or distance matrix.

Author(s)

Stephanie Evert (https://purl.org/stephanie.evert)

See Also

dist.matrix for more information on available distance metrics and similarity measures

Examples


nearest.neighbours(DSM_Vectors, c("apple_N", "walk_V"), n=10)

nearest.neighbours(DSM_Vectors, "apple_N", n=10, method="maximum")

as.dist(nearest.neighbours(DSM_Vectors, "apple_N", n=10, dist.matrix=TRUE))


wordspace documentation built on Aug. 23, 2022, 1:06 a.m.