This function computes the Pearson product-moment correlation coefficients between the columns of sparse matrices. Performance-wise, this improves on the approach taken in the
cor function. However, because the resulting matrix is not sparse, this function still cannot be used with very large matrices.
X: a sparse matrix in a format of the Matrix package, whose columns are correlated.
Y: a second matrix in a format of the Matrix package (optional; when Y = NULL, the correlations between the columns of X itself are computed).
To compute the covariance matrix, the code uses the principle that

E[ (X - μ(X))' (Y - μ(Y)) ] = E[X' Y] - μ(X)' μ(Y)

With sample correction n/(n-1) this leads to the covariance between X and Y as

( X' Y - n * μ(X)' μ(Y) ) / (n-1)
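The point of this identity is that the expensive step, the cross-product X' Y, can stay sparse; only the small correction term is dense. A minimal sketch of the same algebra in Python with NumPy/SciPy (an illustration only, not the package's actual R source):

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
n = 200
# two sparse matrices sharing the same rows (observations)
X = sparse.random(n, 5, density=0.1, random_state=rng).tocsc()
Y = sparse.random(n, 3, density=0.1, random_state=rng).tocsc()

# column means as dense row vectors
mu_x = np.asarray(X.mean(axis=0))  # shape (1, 5)
mu_y = np.asarray(Y.mean(axis=0))  # shape (1, 3)

# covariance via the identity: ( X'Y - n * mu(X)' mu(Y) ) / (n - 1)
cov = ((X.T @ Y).toarray() - n * mu_x.T @ mu_y) / (n - 1)

# check against the direct, dense computation on centered columns
Xd, Yd = X.toarray(), Y.toarray()
direct = (Xd - Xd.mean(0)).T @ (Yd - Yd.mean(0)) / (n - 1)
assert np.allclose(cov, direct)
```

Only `X.T @ Y` touches the sparse data; the centering never materializes dense centered copies of X and Y.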
The computation of the standard deviations (needed to turn covariances into correlations) is trivial in the case
Y = NULL, as they are found on the diagonal of the covariance matrix. The case
Y != NULL uses the principle that

E[ (X - μ(X))^2 ] = E[X^2] - μ(X)^2

With sample correction n/(n-1) this leads to

sd^2 = ( Σ X^2 - n * μ(X)^2 ) / (n-1)
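Putting the two formulas together gives the full column-correlation computation. The following is a hedged sketch in Python/NumPy/SciPy of the Y = NULL case (the function name cor_sparse is invented here for illustration; it is not the package's R implementation):

```python
import numpy as np
from scipy import sparse

def cor_sparse(X):
    """Column correlations of a sparse matrix via ( X'X - n*mu'mu ) / (n-1)."""
    n = X.shape[0]
    mu = np.asarray(X.mean(axis=0))                       # column means, shape (1, p)
    cov = ((X.T @ X).toarray() - n * mu.T @ mu) / (n - 1)  # covariance matrix
    sd = np.sqrt(np.diag(cov))                            # sds from the diagonal
    return cov / np.outer(sd, sd)                         # normalize to correlations

rng = np.random.default_rng(1)
X = sparse.random(500, 4, density=0.2, random_state=rng).tocsc()

# agrees with the standard dense Pearson correlation of the columns
assert np.allclose(cor_sparse(X), np.corrcoef(X.toarray(), rowvar=False))
```

Note how, for Y = NULL, the standard deviations indeed fall out of the diagonal of the covariance matrix, so no separate Σ X^2 pass is needed.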
The result is a regular square (non-sparse!) Matrix with the Pearson product-moment correlation coefficients between the columns of X. If
Y is specified, the result is a rectangular (non-sparse!) Matrix of size ncol(X) by ncol(Y) with the correlation coefficients between the columns of X and Y. When
cov = TRUE, the result is a covariance matrix (i.e. a non-normalized correlation).
Because of the 'centering' of the Pearson correlation, the resulting Matrix is completely filled. This implies that this approach is normally not feasible when the resulting matrix has more than about 1e8 cells (except in dedicated computational environments with lots of RAM). However, in most sparse-data situations the cosine similarity as computed by
cosSparse will be almost identical to the Pearson correlation, so consider using that one instead. For a comparison, see the examples below.
For further usage, the many small coefficients are often unnecessary anyway, and can be removed for reasons of sparsity. Consider something like
M <- drop0(M, tol = value) on the resulting matrix
M (which removes all values between -value and +value). See the examples below.
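The same thresholding step can be sketched outside R as well (a hypothetical Python/SciPy equivalent of drop0, which itself belongs to the R Matrix package; the name drop_small is invented here):

```python
import numpy as np
from scipy import sparse

def drop_small(M, tol):
    """Zero out entries with |value| <= tol and drop them from the sparse structure."""
    M = sparse.csc_matrix(M)
    M.data[np.abs(M.data) <= tol] = 0.0
    M.eliminate_zeros()  # actually remove the stored zeros
    return M

M = sparse.csc_matrix(np.array([[0.9, 0.1],
                                [-0.2, 1.0]]))
M2 = drop_small(M, tol=0.4)
assert M2.nnz == 2  # only 0.9 and 1.0 survive
```

Setting stored values to zero is not enough on its own: the explicit `eliminate_zeros()` call is what shrinks the stored structure and hence the memory footprint.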
Slightly extended and optimized, based on code from a discussion at Stack Overflow.
## Not run:
# reasonably fast (though not instantly!) with
# sparse matrices up to a resulting matrix size of 1e8 cells.
# However, the calculations and the resulting matrix take up lots of memory

X <- rSparseMatrix(1e4, 1e4, 1e5)
system.time(M <- corSparse(X))
print(object.size(M), units = "auto") # more than 750 Mb

# Most values are low, so it often makes sense
# to remove low values to keep results sparse

M <- drop0(M, tol = 0.4)
print(object.size(M), units = "auto") # normally reduces size by half or more
length(M@x) / prod(dim(M)) # down to less than 0.05% non-zero entries

## End(Not run)

# comparison with other methods
# corSparse is much faster than cor from the stats package
# but cosSparse is even quicker than both!

X <- rSparseMatrix(1e3, 1e3, 1e4)
X2 <- as.matrix(X)

# if there is a warning, try again with different random X
system.time(McorRegular <- cor(X2))
system.time(McorSparse <- corSparse(X))
system.time(McosSparse <- cosSparse(X))

# cor and corSparse give identical results
all.equal(McorSparse, McorRegular)

# corSparse and cosSparse are not identical, but close
McosSparse <- as.matrix(McosSparse)
dimnames(McosSparse) <- NULL
all.equal(McorSparse, McosSparse)

# Actually, cosSparse and corSparse are *almost* identical!
cor(as.dist(McorSparse), as.dist(McosSparse))

# Visually it looks completely identical
# Note: this takes some time to plot
## Not run:
plot(as.dist(McorSparse), as.dist(McosSparse))
## End(Not run)

# So: consider using cosSparse instead of cor or corSparse.
# With sparse matrices, this gives mostly the same results,
# but much larger matrices are possible
# and the computations are quicker and more sparse