tcor: Thresholded Correlation

View source: R/tcor.R

Description

Compute a thresholded correlation matrix, returning vector indices and correlation values that exceed the specified threshold t. If y is a matrix then the thresholded correlations between the columns of x and the columns of y are computed, otherwise the correlation matrix defined by the columns of x is computed.
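For reference, the two-matrix result can be reproduced by brute force with base R's cor. The sketch below is not how tcor works internally; it only illustrates the kind of output described above. The matrices x and y and the threshold thr are invented illustration values.

```r
# Brute-force reference for the two-matrix case, using only base R.
# x, y and thr are illustrative values, not package defaults.
set.seed(1)
x <- matrix(rnorm(50 * 6), nrow = 50)
y <- cbind(x[, 1] + 0.01 * rnorm(50),    # nearly identical to column 1 of x
           matrix(rnorm(50 * 3), nrow = 50))
thr <- 0.99

C <- cor(x, y)                           # 6 x 4 matrix of column correlations
idx <- which(C >= thr, arr.ind = TRUE)   # (column of x, column of y) pairs
brute <- cbind(idx, val = C[idx])        # indices plus correlation value
print(brute)
```

This forms the full correlation matrix, so it is only practical for small inputs; it is useful mainly as a correctness check.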

Usage

tcor(x, y = NULL, t = 0.99, p = 10, include_anti = FALSE,
  filter = c("distributed", "local"), dry_run = FALSE, rank = FALSE,
  max_iter = 4, restart, ...)

Arguments

x

an m by n real-valued dense or sparse matrix

y

NULL (default) or a matrix with compatible dimensions to x (same number of rows). The default is equivalent to y=x but more efficient.

t

a threshold value for correlation, -1 < t < 1, but usually t is near 1 (see include_anti below).

p

projected subspace dimension, p << n (if p >= n it will be reduced). Increasing p cuts down the total number of candidate pairs evaluated, at the expense of costlier matrix-vector products. See the notes on tuning p.

include_anti

logical value, if TRUE then return both correlated and anti-correlated values that meet the threshold in absolute value. NB Can be much more expensive when TRUE.

filter

"local" filters the candidate set sequentially; "distributed" computes thresholded correlations in a parallel code section, which can be faster but requires the data matrix to be available to the parallel workers (see notes).

dry_run

set TRUE to return statistics and truncated SVD for tuning p (see notes).

rank

when TRUE, the threshold t is a count and represents the top t closest vectors; otherwise t specifies an absolute correlation value.

max_iter

when rank=TRUE, a portion of the algorithm may iterate; this number sets the maximum number of such iterations.

restart

either output from a previous run of tcor with dry_run=TRUE, or direct output from irlba, used to restart the irlba algorithm when tuning p (see notes).

...

additional arguments passed to irlba.

Value

A list with elements:

  1. indices A three-column matrix. The first two columns contain indices of vectors meeting the correlation threshold t, the third column contains the corresponding correlation value (not returned when dry_run=TRUE).

  2. restart The truncated SVD from irlba, used to restart the irlba algorithm (only returned when dry_run=TRUE).

  3. longest_run The largest number of successive entries in the ordered first singular vector within a projected distance defined by the correlation threshold. This is the minimum number of n * p matrix-vector products required by the algorithm.

  4. tot The total number of _candidate_ vectors that met the correlation threshold identified by the algorithm, subsequently filtered down to just those indices corresponding to values meeting the threshold.

  5. t The threshold value.

  6. svd_time Time spent computing truncated SVD.

  7. total_time Total run time.
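For the single-matrix case, the three-column indices matrix can be expanded back into an ordinary matrix when needed. The sketch below uses a small hand-made matrix in the same three-column layout. The values are invented, n is an assumed column count, and columns are addressed by position because their names are not specified here.

```r
# Hypothetical stand-in for the `indices` element: two index columns
# followed by the correlation value. Values are invented for illustration.
indices <- rbind(c(1, 4,  0.995),
                 c(2, 3,  0.991),
                 c(2, 5, -0.993))
n <- 5                                   # assumed number of columns of x

# Expand into a dense n x n matrix holding only the retained entries.
S <- matrix(0, n, n)
S[indices[, 1:2, drop = FALSE]] <- indices[, 3]
S <- S + t(S)                            # mirror into the lower triangle
diag(S) <- 1                             # each column correlates perfectly with itself
print(S)
```

For large n a sparse representation would be a better fit than a dense matrix; the dense form is used here only to keep the example in base R.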

Note

Register a parallel backend with foreach before invoking tcor to run in parallel; otherwise it runs sequentially. When the data matrix x is large, use filter="local" to avoid copying it to the parallel R worker processes (unless the doMC parallel backend is used with foreach).

Specify dry_run=TRUE to compute and return a truncated SVD of rank p, a lower bound on the number of n * p matrix-vector products required by the full algorithm, and a lower-bound estimate of the number of unpruned candidate vector pairs to be evaluated by the algorithm. You can pass the returned value back in as input using the restart parameter to avoid fully recomputing a truncated SVD. Use these options to tune p for a balance between the matrix-vector product work and pruning efficiency.
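The trade-off that p controls can be understood through a standard identity: after centering each column and scaling it to unit length, the squared Euclidean distance between two columns equals 2 * (1 - r), where r is their correlation, and an orthogonal projection onto a lower-dimensional subspace can only shrink distances. The base R sketch below checks both facts numerically. It illustrates the general idea only, on the assumption that this is what makes projected distances usable for pruning; it is not the package's code, and all variable names are invented.

```r
set.seed(1)
A <- matrix(rnorm(100 * 20), nrow = 100)

# Center each column and scale it to unit Euclidean length.
Z <- scale(A, center = TRUE, scale = FALSE)
Z <- sweep(Z, 2, sqrt(colSums(Z^2)), "/")

# Identity: squared distance between columns i and j equals 2 * (1 - cor).
i <- 1; j <- 2
d2 <- sum((Z[, i] - Z[, j])^2)
print(all.equal(d2, 2 * (1 - cor(A[, i], A[, j]))))

# Project the columns onto the p leading left singular vectors.
p <- 3
U <- svd(Z, nu = p, nv = 0)$u            # 100 x p orthonormal basis
P <- crossprod(U, Z)                     # p x 20 projected coordinates

# Projected distances never exceed the full distances, so a pair that is
# already too far apart after projection cannot meet the threshold.
D_full <- as.matrix(dist(t(Z)))
D_proj <- as.matrix(dist(t(P)))
print(all(D_proj <= D_full + 1e-12))
```

A larger p makes the projected distances closer to the true ones, so fewer pairs survive as candidates, but each projected vector is longer and costs more to compare.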

When rank=TRUE, the method returns at least, and perhaps more than, the top t most correlated indices, unless they couldn't be found within max_iter iterations.
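As a point of comparison for rank=TRUE, the top-ranked pairs can be found by brute force in base R by sorting all pairwise correlations. This sketch only illustrates what "top t most correlated" means on a small invented matrix; the variable top plays the role of t, and nothing here reflects how tcor searches for the pairs.

```r
set.seed(1)
A <- matrix(rnorm(100 * 30), nrow = 100)
top <- 5                                 # plays the role of t when rank=TRUE

C <- cor(A)
ut <- which(upper.tri(C), arr.ind = TRUE)    # each unordered pair once
vals <- C[ut]
o <- order(vals, decreasing = TRUE)[seq_len(top)]
top_pairs <- cbind(ut[o, , drop = FALSE], val = vals[o])
print(top_pairs)
```

Unlike this exhaustive version, the method described above may return more than t pairs, or fewer if max_iter is reached first.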

References

http://arxiv.org/abs/1512.07246 (preprint)

See Also

cor, tdist

Examples

# Construct a 100 x 2,000 example matrix A:
set.seed(1)
s <- svd(matrix(rnorm(100 * 2000), nrow=100))
A <- s$u %*% (1 / (1:100) * t(s$v))

C <- cor(A)
C <- C * upper.tri(C)
# Compare i with x$indices below:
(i <- which(C >= 0.98, arr.ind=TRUE))
(x <- tcor(A, t=0.98))

# Same example with thresholded correlation _and_ anticorrelation
(i <- which(abs(C) >= 0.98, arr.ind=TRUE))
(x <- tcor(A, t=0.98, include_anti=TRUE))

# Example of tuning p with dry_run=TRUE:
x1 <- tcor(A, t=0.98, p=3, dry_run=TRUE)
print(x1$tot)
# 211, see how much we can reduce this without increasing p too much...
x1 <- tcor(A, t=0.98, p=5, dry_run=TRUE, restart=x1)
print(x1$tot)
# 39,  much better...
x1 <- tcor(A, t=0.98, p=10, dry_run=TRUE, restart=x1)
print(x1$tot)
# 3,   even better!

# Once tuned, compute the full thresholded correlation:
x <- tcor(A, t=0.98, p=10, restart=x1)

## Not run: 
# Optionally, register a parallel backend first:
library(doMC)
registerDoMC()
x <- tcor(A, t=0.98)  # Should now run faster on a multicore machine

## End(Not run)

bwlewis/tcor documentation built on Sept. 6, 2020, 4:18 p.m.