corSparse: Pearson correlation between columns (sparse matrices)

Description Usage Arguments Details Value Note Author(s) See Also Examples

View source: R/assoc.R

Description

This function computes the product-moment correlation coefficients between the columns of sparse matrices. Performance-wise, this improves over the approach taken in the cor function. However, because the resulting matrix is not-sparse, this function still cannot be used with very large matrices.

Usage

1
corSparse(X, Y = NULL, cov = FALSE)

Arguments

X

a sparse matrix in a format of the Matrix package, typically dgCMatrix . The correlations will be calculated between the columns of this matrix.

Y

a second matrix in a format of the Matrix package. When Y = NULL, then the correlations between the columns of X and itself will be taken. If Y is specified, the association between the columns of X and the columns of Y will be calculated.

cov

when TRUE the covariance matrix is returned, instead of the default correlation matrix.

Details

To compute the covariance matrix, the code uses the principle that

E[(X - μ(X))' (Y - μ(Y))] = E[X' Y] - μ(X') μ(Y)

With sample correction n/(n-1) this leads to the covariance between X and Y as

( X' Y - n * μ(X') μ(Y) ) / (n-1)

The computation of the standard deviation (to turn covariance into correlation) is trivial in the case Y = NULL, as they are found on the diagonal of the covariance matrix. In the case Y != NULL uses the principle that

E[X - μ(X)]^2 = E[X^2] - μ(X)^2

With sample correction n/(n-1) this leads to

sd^2 = ( X^2 - n * μ(X)^2 ) / (n-1)

Value

The result is a regular square (non-sparse!) Matrix with the Pearson product-moment correlation coefficients between the columns of X.

When Y is specified, the result is a rectangular (non-sparse!) Matrix of size nrow(X) by nrow(Y) with the correlation coefficients between the columns of X and Y.

When cov = T, the result is a covariance matrix (i.e. a non-normalized correlation).

Note

Because of the ‘centering’ of the Pearson correlation, the resulting Matrix is completely filled. This implies that this approach is normally not feasible with resulting matrices with more than 1e8 cells or so (except in dedicated computational environments with lots of RAM). However, in most sparse data situations, the cosine similarity cosSparse will almost be identical to the Pearson correlation, so consider using that one instead. For a comparison, see examples below.

For further usage, the many small coefficients are often unnecessary anyway, and can be removed for reasons of sparsity. Consider something like M <- drop0(M, tol = value) on the resulting M matrix (which removes all values between -value and +value). See examples below.

Author(s)

Michael Cysouw

Slightly extended and optimized, based on the code from a discussion at stackoverflow.

See Also

cor in the base packages, cosSparse, assocSparse for other sparse association measures.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
## Not run: 

# reasonably fast (though not instantly!) with
# sparse matrices up to a resulting matrix size of 1e8 cells.
# However, the calculations and the resulting matrix take up lots of memory

X <- rSparseMatrix(1e4, 1e4, 1e5)
system.time(M <- corSparse(X))
print(object.size(M), units = "auto") # more than 750 Mb

# Most values are low, so it often makes sense 
# to remove low values to keep results sparse

M <- drop0(M, tol = 0.4)
print(object.size(M), units = "auto") # normally reduces size by half or more
length(M@x) / prod(dim(M)) # down to less than 0.05% non-zero entries

## End(Not run)

# comparison with other methods
# corSparse is much faster than cor from the stats package
# but cosSparse is even quicker than both!

X <- rSparseMatrix(1e3, 1e3, 1e4)
X2 <- as.matrix(X)

# if there is a warning, try again with different random X
system.time(McorRegular <- cor(X2)) 
system.time(McorSparse <- corSparse(X))
system.time(McosSparse <- cosSparse(X))

# cor and corSparse give identical results

all.equal(McorSparse, McorRegular)

# corSparse and cosSparse are not identical, but close

McosSparse <- as.matrix(McosSparse)
dimnames(McosSparse) <- NULL
all.equal(McorSparse, McosSparse) 

# Actually, cosSparse and corSparse are *almost* identical!

cor(as.dist(McorSparse), as.dist(McosSparse))

# Visually it looks completely identical
# Note: this takes some time to plot

## Not run: 
plot(as.dist(McorSparse), as.dist(McosSparse))	

## End(Not run)

# So: consider using cosSparse instead of cor or corSparse.
# With sparse matrices, this gives mostly the same results, 
# but much larger matrices are possible
# and the computations are quicker and more sparse

cysouw/qlcMatrix documentation built on April 22, 2018, 4:59 a.m.