tdist: Compute thresholded distances between rows or columns of a...


View source: R/tdist.R

Description

Compute and return the distances and indices of rows or columns that lie within a specified distance threshold under a chosen distance metric. The algorithm works best for Euclidean distance (the default). Set rank=TRUE to compute the t closest rows instead of applying an absolute threshold. Set columns=TRUE to compute distances between columns instead of rows, which is somewhat cheaper for this algorithm. Increasing p reduces the total number of candidate pairs evaluated, at the expense of a costlier truncated SVD.
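
For orientation, the kind of result described above can be reproduced by brute force on a small matrix. The following is a minimal base R sketch, not the package's algorithm: the helper name brute_tdist, the column names, the inclusive comparison (<=) and the threshold choice are all assumptions made for illustration.

```r
# Brute-force reference: all row pairs of x within Euclidean distance thr.
# Illustration only; this forms the full distance matrix, which is what a
# thresholded method is meant to avoid on large inputs.
brute_tdist <- function(x, thr) {
  d <- as.matrix(dist(x))                       # full m-by-m distance matrix
  idx <- which(d <= thr & upper.tri(d), arr.ind = TRUE)  # each pair once, i < j
  out <- cbind(idx, d[idx])                     # row index, row index, distance
  colnames(out) <- c("i", "j", "val")           # hypothetical column names
  out[order(out[, "val"]), , drop = FALSE]      # closest pairs first
}

set.seed(1)
x <- matrix(rnorm(100 * 20), nrow = 100)
thr <- quantile(dist(x), 0.01)   # hypothetical threshold: closest 1% of pairs
ref <- brute_tdist(x, thr)
print(head(ref))
```

A table like ref is a convenient ground truth when checking thresholded output on inputs small enough for dist.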

Usage

tdist(A, t, p = 10, filter = c("distributed", "local"),
  method = c("euclidean", "manhattan", "maximum"), rank = FALSE,
  dry_run = FALSE, max_iter = 4, columns = FALSE, restart, ...)

Arguments

A

an m by n real-valued dense or sparse matrix

t

a threshold distance value either in absolute distance (the default) or rank order (see rank below); if missing an estimate derived from a 1-d SVD projection will be used

p

projected subspace dimension

filter

"local" filters the candidate set sequentially; "distributed" computes thresholded distances in a parallel code section, which can be faster but requires that the data matrix is available (see notes).

method

the distance measure to be used, one of "euclidean", "manhattan", or "maximum". Any unambiguous substring can be given.

rank

when TRUE, the threshold t represents the top t closest vectors; otherwise t specifies an absolute distance. When rank=TRUE, t must be specified.

dry_run

set TRUE to return statistics and truncated SVD for tuning p (see notes)

max_iter

when rank=TRUE, a portion of the algorithm may iterate; this number sets the maximum number of such iterations

columns

set to TRUE to compute distances between matrix columns instead of rows, saving the expense of a matrix transpose (which can be significant if A is large)

restart

either output from a previous run of tdist with dry_run=TRUE, or direct output from irlba, used to restart the irlba algorithm when tuning p (see notes)

...

additional arguments passed to irlba
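
The trade-off controlled by p can be illustrated with a small base R sketch. It relies only on the fact that projecting onto orthonormal singular vectors never increases Euclidean distance, so a pair that is farther apart than the threshold in the projected space cannot meet the threshold in the original space. This is an illustration of why a larger p leaves fewer candidate pairs, not a description of the package's internals; the helper candidates, the threshold and the use of a full svd are assumptions.

```r
set.seed(1)
x <- matrix(rnorm(100 * 20), nrow = 100)
thr <- quantile(dist(x), 0.01)   # hypothetical threshold: closest 1% of pairs
s <- svd(x)                      # full SVD for simplicity; a truncated SVD
                                 # would be used on large matrices

# Number of row pairs whose distance, measured in the first p projected
# coordinates, is within thr. These are the pairs that cannot be ruled out.
candidates <- function(p) {
  proj <- x %*% s$v[, seq_len(p), drop = FALSE]   # m-by-p projected rows
  sum(dist(proj) <= thr)
}

ps <- c(1, 2, 5, 10, 20)
counts <- sapply(ps, candidates)
true_pairs <- sum(dist(x) <= thr)   # pairs that actually meet the threshold

print(rbind(p = ps, candidates = counts))
print(true_pairs)
```

The candidate count never drops below true_pairs and shrinks as p grows, while the cost of the decomposition grows with p.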

Value

A list with elements:

  1. indices A three-column matrix. The first two columns contain indices of rows meeting the distance threshold t, the third column contains the corresponding distance value (not returned when dry_run=TRUE).

  2. restart A truncated SVD returned by irlba, used to restart the algorithm (only returned when dry_run=TRUE).

  3. tot The total number of _possible_ vectors that meet the distance threshold identified by the algorithm.

  4. longest_run The largest number of successive entries in the ordered first singular vector within a projected distance defined by the distance threshold; equivalently, the number of n * p matrix-vector products employed in the algorithm, not counting the truncated SVD step.

  5. t The threshold value.

  6. svd_time Time to compute truncated SVD.

  7. total_time Total run time.

Note

When rank=TRUE the method returns at least, and perhaps more than, the top t closest indices and their distances, unless they could not be found within the iteration limit max_iter.

References

http://arxiv.org/abs/1512.07246 (preprint)

See Also

dist, tcor

Examples

library(tcor)
set.seed(1)  # make the random example reproducible
x <- matrix(rnorm(100 * 20), nrow=100)
# Find the top 10 closest vectors with respect to Euclidean distance:
td <- tdist(x, 10, rank=TRUE)
print(td$indices[1:10, ])

# Compare with distances from `dist`:
d <- dist(x)
print(sort(d)[1:10])

bwlewis/tcor documentation built on Sept. 6, 2020, 4:18 p.m.