Compute a thresholded correlation matrix, returning vector indices
and correlation values that exceed the specified threshold t.
If y is a matrix then the thresholded correlations between the columns of x
and the columns of y are computed, otherwise the correlation matrix defined
by the columns of x is computed.
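As a point of comparison, the output described above can be reproduced by brute force on small inputs. The following is a minimal base R sketch, not the package's method: it forms the full correlation matrix and filters it, which is only practical when the number of columns is modest. The names `thr`, `ref`, and `ref_xy` are illustrative and not part of the package.

```r
# Brute-force reference for thresholded correlation (base R only).
set.seed(1)
x <- matrix(rnorm(50 * 8), nrow = 50)
x[, 2] <- x[, 1] + 0.05 * rnorm(50)    # make columns 1 and 2 highly correlated
thr <- 0.9                             # illustrative threshold

# One-matrix case: correlations among the columns of x, each pair counted once.
C <- cor(x)
C[!upper.tri(C)] <- NA                 # drop the diagonal and the lower triangle
idx <- which(C >= thr, arr.ind = TRUE) # which() skips the NA entries
ref <- cbind(idx, val = C[idx])        # three columns: i, j, correlation
print(ref)

# Two-matrix case: columns of x against columns of y.
y <- cbind(x[, 3] + 0.05 * rnorm(50), rnorm(50))
Cxy <- cor(x, y)
idx_xy <- which(Cxy >= thr, arr.ind = TRUE)
ref_xy <- cbind(idx_xy, val = Cxy[idx_xy])
print(ref_xy)
```

The three-column layout mirrors the indices element described under the return value, so a result from tcor on the same data can be compared row by row after sorting.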
x
an m by n real-valued dense or sparse matrix
y
t
a threshold value for correlation, -1 < t < 1, but usually t is near 1 (see
p
projected subspace dimension, p << n (if p >= n it will be reduced)
(Increase
include_anti
logical value, if
filter
"local" filters candidate set sequentially, "distributed" computes thresholded correlations in a parallel code section which can be faster but requires the data matrix (see notes).
dry_run
set
rank
when
max_iter
when
restart
either output from a previous run of
...
additional arguments passed to
A list with elements:
indices
A three-column matrix. The first two columns contain
indices of vectors meeting the correlation threshold t,
the third column contains the corresponding correlation value
(not returned when dry_run=TRUE).
restart
The truncated SVD from irlba, used to restart
the irlba algorithm (only returned when dry_run=TRUE).
longest_run
The largest number of successive entries in the
ordered first singular vector within a projected distance defined by the
correlation threshold. This is the minimum number of n * p matrix-vector
products required by the algorithm.
tot
The total number of _candidate_ vectors that met
the correlation threshold identified by the algorithm, subsequently filtered
down to just those indices corresponding to values meeting the threshold.
t
The threshold value.
svd_time
Time spent computing truncated SVD.
total_time
Total run time.
Register a parallel backend with foreach before invoking tcor
to run in parallel, otherwise it runs sequentially.
When A is large, use filter="local" to avoid copying A to the
parallel R worker processes (unless the doMC parallel backend is used with
foreach).
Specify dry_run=TRUE to compute and return a truncated SVD of rank p,
a lower bound on the number of n*p matrix-vector products required by the
full algorithm, and a lower-bound estimate on the number of unpruned candidate
vector pairs to be evaluated by the algorithm. You can pass the returned value
back in as input using the restart parameter to avoid fully recomputing a
truncated SVD. Use these options to tune p for a balance between
the matrix-vector product work and pruning efficiency.
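To make the trade-off in p concrete, the sketch below illustrates the geometric fact that allows pruning in a projected subspace. It is an illustration in base R under stated assumptions, not the package's implementation. After centering each column and scaling it to unit length, cor(a_i, a_j) >= t is equivalent to ||a_i - a_j|| <= sqrt(2 * (1 - t)), and distances in a rank-p SVD projection never exceed the full distances. Every pair that meets the threshold therefore survives the projected test, and a larger p leaves fewer false candidates. The names `Z`, `P`, `limit`, and `candidates` are hypothetical.

```r
# Necessary condition for cor >= thr in a rank-p projection (base R only).
set.seed(1)
s <- svd(matrix(rnorm(100 * 200), nrow = 100))
A <- s$u %*% (1 / (1:100) * t(s$v))    # 100 x 200 with a fast-decaying spectrum
thr <- 0.98
p <- 3

# Center each column and scale it to unit length, so cor(A) equals crossprod(Z).
Z <- scale(A, center = TRUE, scale = FALSE)
Z <- sweep(Z, 2, sqrt(colSums(Z^2)), "/")

# Row i of P holds the coordinates of column i of Z in the top-p subspace.
sv <- svd(Z, nu = 0, nv = p)
P <- sv$v %*% diag(sv$d[1:p], p)

# Pairs whose projected distance is within the bound are candidates.
limit <- sqrt(2 * (1 - thr)) + 1e-10   # small slack for floating-point rounding
D <- as.matrix(dist(P))
candidates <- which(D <= limit & upper.tri(D), arr.ind = TRUE)

# Exact answer by brute force, for comparison.
C <- cor(A)
exact <- which(C >= thr & upper.tri(C), arr.ind = TRUE)

cat("candidate pairs:", nrow(candidates), " exact pairs:", nrow(exact), "\n")
```

Rerunning the sketch with a larger p shows the candidate count shrinking toward the exact count, at the cost of a larger SVD. This is the balance that dry_run=TRUE is meant to help tune.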
When rank=TRUE, the method returns at least, and perhaps more than, the top t
most correlated indices, unless they couldn't be found within max_iter
iterations.
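For small problems, a rank-mode result can be checked against a brute-force ranking. The base R sketch below simply sorts all pairwise correlations and keeps the k largest; it does not use the package, and `k` stands in for the value that would be passed as t when rank=TRUE.

```r
# Brute-force top-k most correlated column pairs (base R only).
set.seed(1)
x <- matrix(rnorm(50 * 8), nrow = 50)
k <- 5                                  # illustrative count of pairs to keep

C <- cor(x)
ut <- which(upper.tri(C), arr.ind = TRUE)   # every pair (i, j) with i < j
vals <- C[ut]
top <- order(vals, decreasing = TRUE)[seq_len(k)]
ref <- cbind(ut[top, , drop = FALSE], val = vals[top])
print(ref)
```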
http://arxiv.org/abs/1512.07246 (preprint)
# Construct a 100 x 2,000 example matrix A:
set.seed(1)
s <- svd(matrix(rnorm(100 * 2000), nrow=100))
A <- s$u %*% (1 /( 1:100) * t(s$v))
C <- cor(A)
C <- C * upper.tri(C)
# Compare i with x$indices below:
(i <- which(C >= 0.98, arr.ind=TRUE))
(x <- tcor(A, t=0.98))
# Same example with thresholded correlation _and_ anticorrelation
(i <- which(abs(C) >= 0.98, arr.ind=TRUE))
(x <- tcor(A, t=0.98, include_anti=TRUE))
# Example of tuning p with dry_run=TRUE:
x1 <- tcor(A, t=0.98, p=3, dry_run=TRUE)
print(x1$tot)
# 211, see how much we can reduce this without increasing p too much...
x1 <- tcor(A, t=0.98, p=5, dry_run=TRUE, restart=x1)
print(x1$tot)
# 39, much better...
x1 <- tcor(A, t=0.98, p=10, dry_run=TRUE, restart=x1)
print(x1$tot)
# 3, even better!
# Once tuned, compute the full thresholded correlation:
x <- tcor(A, t=0.98, p=10, restart=x1)
## Not run:
# Optionally, register a parallel backend first:
library(doMC)
registerDoMC()
x <- tcor(A, t=0.98) # Should now run faster on a multicore machine
## End(Not run)