
This function computes the product-moment correlation coefficients between the columns of sparse matrices. Performance-wise, this improves over the approach taken in the `cor` function. However, because the resulting matrix is not sparse, this function still cannot be used with very large matrices.


`X` |
a sparse matrix in a format of the |

`Y` |
a second matrix in a format of the |

`cov` |
when |

To compute the covariance matrix, the code uses the principle that

*E[(X - μ(X))' (Y - μ(Y))] = E[X' Y] - μ(X') μ(Y)*

With sample correction n/(n-1) this leads to the covariance between X and Y as

*( X' Y - n * μ(X') μ(Y) ) / (n-1)*
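The formula above can be sketched in a few lines of R. This is only an illustration of the principle, not the package's actual source: `cov_sparse_sketch` is a hypothetical name, and the test matrix is generated with `rsparsematrix` from the Matrix package rather than with any qlcMatrix helper.

```r
library(Matrix)

# Hypothetical helper illustrating the formula; not the qlcMatrix source.
cov_sparse_sketch <- function(X, Y = X) {
  n <- nrow(X)
  muX <- colMeans(X)  # column means, computed without densifying X
  muY <- colMeans(Y)
  # X'Y is computed as a sparse product; only the result is made dense
  (as.matrix(crossprod(X, Y)) - n * outer(muX, muY)) / (n - 1)
}

set.seed(1)
X <- rsparsematrix(200, 20, density = 0.1)
C <- cov_sparse_sketch(X)

# Compare with the dense computation from the stats package
all.equal(C, cov(as.matrix(X)), check.attributes = FALSE)
```

The point of this arrangement is that the centered matrix `X - μ(X)`, which would be completely filled, is never formed; only the sparse cross-product and the outer product of the column means are needed.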

The standard deviations (needed to turn covariance into correlation) are trivial to obtain in the case `Y = NULL`, as the variances are found on the diagonal of the covariance matrix. In the case `Y != NULL`, the code uses the principle that

*E[(X - μ(X))^2] = E[X^2] - μ(X)^2*

With sample correction n/(n-1) this leads to

*sd^2 = ( Σ X^2 - n * μ(X)^2 ) / (n-1)*
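As a rough sketch of the two-matrix case, the covariance and standard-deviation formulas can be combined as follows. The function name `cor_sparse_sketch` is hypothetical, the code is an illustration under the assumption that no column has zero variance, and it is not taken from the package source.

```r
library(Matrix)

# Hypothetical helper illustrating the formulas; not the qlcMatrix source.
cor_sparse_sketch <- function(X, Y) {
  n <- nrow(X)
  muX <- colMeans(X)
  muY <- colMeans(Y)
  # covariance: (X'Y - n * mu(X') mu(Y)) / (n - 1)
  covXY <- (as.matrix(crossprod(X, Y)) - n * outer(muX, muY)) / (n - 1)
  # standard deviations from column sums of squares (X^2 stays sparse)
  sdX <- sqrt((colSums(X^2) - n * muX^2) / (n - 1))
  sdY <- sqrt((colSums(Y^2) - n * muY^2) / (n - 1))
  covXY / outer(sdX, sdY)
}

set.seed(2)
X <- rsparsematrix(200, 15, density = 0.1)
Y <- rsparsematrix(200, 10, density = 0.1)
R <- cor_sparse_sketch(X, Y)

dim(R)  # one row per column of X, one column per column of Y
all.equal(R, cor(as.matrix(X), as.matrix(Y)), check.attributes = FALSE)
```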

The result is a regular square (non-sparse!) Matrix with the Pearson product-moment correlation coefficients between the columns of `X`.

When `Y` is specified, the result is a rectangular (non-sparse!) Matrix of size `ncol(X)` by `ncol(Y)` with the correlation coefficients between the columns of `X` and `Y`.

When `cov = TRUE`, the result is a covariance matrix (i.e. a non-normalized correlation).

Because of the ‘centering’ of the Pearson correlation, the resulting Matrix is completely filled. This implies that this approach is normally not feasible when the resulting matrix has more than 1e8 cells or so (except in dedicated computational environments with lots of RAM). However, in most sparse data situations, the cosine similarity `cosSparse` will be almost identical to the Pearson correlation, so consider using that one instead. For a comparison, see examples below.

For further usage, the many small coefficients are often unnecessary anyway, and can be removed to restore sparsity. Consider something like `M <- drop0(M, tol = value)` on the resulting `M` matrix (which removes all values between -value and +value). See examples below.
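A minimal sketch of this thresholding, using a small made-up matrix of coefficients rather than real `corSparse` output (the values and the tolerance of 0.4 are chosen purely for illustration):

```r
library(Matrix)

# Made-up coefficients, stored column by column in a 2 x 3 sparse Matrix
M <- Matrix(c(0.9, 0.05, -0.3, 0.6, -0.45, 0.1), nrow = 2, sparse = TRUE)

# Drop every entry whose absolute value does not exceed the tolerance
M_sparse <- drop0(M, tol = 0.4)

as.matrix(M_sparse)  # only 0.9, 0.6 and -0.45 remain non-zero
length(M_sparse@x)   # number of stored entries after thresholding
```

Note that the threshold applies to the absolute value, so strongly negative coefficients are kept along with strongly positive ones.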

Michael Cysouw

Slightly extended and optimized, based on code from a discussion on Stack Overflow.

`cor` in the base packages; `cosSparse` and `assocSparse` for other sparse association measures.

```
library(qlcMatrix)
## Not run:
# reasonably fast (though not instantly!) with
# sparse matrices up to a resulting matrix size of 1e8 cells.
# However, the calculations and the resulting matrix take up lots of memory
X <- rSparseMatrix(1e4, 1e4, 1e5)
system.time(M <- corSparse(X))
print(object.size(M), units = "auto") # more than 750 Mb
# Most values are low, so it often makes sense
# to remove low values to keep results sparse
M <- drop0(M, tol = 0.4)
print(object.size(M), units = "auto") # normally reduces size by half or more
length(M@x) / prod(dim(M)) # down to less than 0.05% non-zero entries
## End(Not run)
# comparison with other methods
# corSparse is much faster than cor from the stats package
# but cosSparse is even quicker than both!
X <- rSparseMatrix(1e3, 1e3, 1e4)
X2 <- as.matrix(X)
# if there is a warning, try again with different random X
system.time(McorRegular <- cor(X2))
system.time(McorSparse <- corSparse(X))
system.time(McosSparse <- cosSparse(X))
# cor and corSparse give identical results
all.equal(McorSparse, McorRegular)
# corSparse and cosSparse are not identical, but close
McosSparse <- as.matrix(McosSparse)
dimnames(McosSparse) <- NULL
all.equal(McorSparse, McosSparse)
# Actually, cosSparse and corSparse are *almost* identical!
cor(as.dist(McorSparse), as.dist(McosSparse))
# Visually it looks completely identical
# Note: this takes some time to plot
## Not run:
plot(as.dist(McorSparse), as.dist(McosSparse))
## End(Not run)
# So: consider using cosSparse instead of cor or corSparse.
# With sparse matrices, this gives mostly the same results,
# but much larger matrices are possible
# and the computations are quicker and more sparse
```

cysouw/qlcMatrix documentation built on Dec. 18, 2017, 9:12 a.m.
