# corSparse: Pearson correlation between columns (sparse matrices)

In cysouw/qlcMatrix: Utility Sparse Matrix Functions for Quantitative Language Comparison

## Description

This function computes the product-moment correlation coefficients between the columns of sparse matrices. On sparse input it performs better than the `cor` function. However, because the resulting matrix is not sparse, this function still cannot be used with very large matrices.

## Usage

```r
corSparse(X, Y = NULL, cov = FALSE)
```

## Arguments

| Argument | Description |
|----------|-------------|
| `X` | a sparse matrix in a format of the `Matrix` package, typically `dgCMatrix`. The correlations will be calculated between the columns of this matrix. |
| `Y` | a second matrix in a format of the `Matrix` package. When `Y = NULL`, the correlations between the columns of `X` and itself will be taken. If `Y` is specified, the association between the columns of `X` and the columns of `Y` will be calculated. |
| `cov` | when `TRUE`, the covariance matrix is returned instead of the default correlation matrix. |

## Details

To compute the covariance matrix, the code uses the principle that

E[(X - μ(X))' (Y - μ(Y))] = E[X' Y] - μ(X') μ(Y)

With sample correction n/(n-1) this leads to the covariance between X and Y as

( X' Y - n * μ(X') μ(Y) ) / (n-1)
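The formula above can be checked on a small matrix. The following is a minimal sketch for the `Y = NULL` case, not the package's own code. It assumes only the `Matrix` package and uses its `rsparsematrix` generator in place of `rSparseMatrix`; the names `covSketch` and `covDense` are illustrative.

```r
library(Matrix)

set.seed(1)
# Small random sparse matrix: 50 rows, 6 columns, about 20% non-zero
X <- rsparsematrix(50, 6, density = 0.2)
n <- nrow(X)
muX <- colMeans(X)

# ( X'X - n * mu mu' ) / (n - 1): X itself is never centered,
# so the sparse structure is kept until the final (dense) result
covSketch <- (as.matrix(crossprod(X)) - n * tcrossprod(muX)) / (n - 1)

# Reference: covariance of the dense copy, from the stats package
covDense <- cov(as.matrix(X))

all.equal(covSketch, covDense, check.attributes = FALSE)
```

The point of the rearrangement is that subtracting the column means from `X` directly would fill in every zero, whereas `crossprod(X)` works on the stored non-zero entries only.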

The computation of the standard deviations (to turn covariance into correlation) is trivial in the case `Y = NULL`, as the variances are found on the diagonal of the covariance matrix. In the case `Y != NULL`, the code uses the principle that

E[(X - μ(X))^2] = E[X^2] - μ(X)^2

With sample correction n/(n-1) this leads, for each column, to

sd^2 = ( Σ X^2 - n * μ(X)^2 ) / (n-1)
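The `Y != NULL` case can be sketched in the same way. This is an illustration under the assumption that the sum of squares is taken per column (here via `colSums`); it uses only the `Matrix` package, and the variable names are hypothetical.

```r
library(Matrix)

set.seed(2)
# Two sparse matrices with the same number of rows
X <- rsparsematrix(40, 5, density = 0.3)
Y <- rsparsematrix(40, 3, density = 0.3)
n <- nrow(X)
muX <- colMeans(X)
muY <- colMeans(Y)

# Cross-covariance: ( X'Y - n * mu(X) mu(Y)' ) / (n - 1)
covXY <- (as.matrix(crossprod(X, Y)) - n * tcrossprod(muX, muY)) / (n - 1)

# Column standard deviations: sqrt( (sum of squares - n * mean^2) / (n - 1) )
sdX <- sqrt((colSums(X^2) - n * muX^2) / (n - 1))
sdY <- sqrt((colSums(Y^2) - n * muY^2) / (n - 1))

# Dividing by the outer product of the standard deviations gives correlations
corSketch <- covXY / tcrossprod(sdX, sdY)

dim(corSketch)  # ncol(X) rows, ncol(Y) columns
all.equal(corSketch, cor(as.matrix(X), as.matrix(Y)), check.attributes = FALSE)
```

Note that a column without any variation has a standard deviation of zero, so its correlations are undefined in this sketch, just as with `cor`.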

## Value

The result is a regular square (non-sparse!) Matrix with the Pearson product-moment correlation coefficients between the columns of `X`.

When `Y` is specified, the result is a rectangular (non-sparse!) Matrix of size `ncol(X)` by `ncol(Y)` with the correlation coefficients between the columns of `X` and `Y`.

When `cov = TRUE`, the result is a covariance matrix (i.e. a non-normalized correlation).

## Note

Because of the ‘centering’ of the Pearson correlation, the resulting Matrix is completely filled. This implies that this approach is normally not feasible when the resulting matrix has more than 1e8 cells or so (except in dedicated computational environments with lots of RAM). However, in most sparse data situations, the cosine similarity `cosSparse` will be almost identical to the Pearson correlation, so consider using that one instead. For a comparison, see examples below.
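As a rough back-of-the-envelope check of the 1e8-cell figure, the following sketch estimates the size of the dense result. It assumes each cell is stored as an 8-byte double (as in a base R numeric matrix) and ignores object overhead; the variable names are illustrative.

```r
# Correlating 1e4 columns with themselves gives a 1e4 x 1e4 result
cells <- 1e4 * 1e4

# Each cell is assumed to be an 8-byte double
bytes <- cells * 8

# Approximate size in MiB (2^20 bytes)
bytes / 2^20
```

Intermediate objects created during the computation need additional memory of a similar order.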

For further usage, the many small coefficients are often unnecessary anyway, and can be removed for reasons of sparsity. Consider something like `M <- drop0(M, tol = value)` on the resulting `M` matrix (which removes all values between -value and +value). See examples below.
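The thresholding step can be tried on a toy matrix. This is a minimal sketch using only the `Matrix` package; the values in `M` and the threshold are made up for illustration and are not output of `corSparse`.

```r
library(Matrix)

# A small dense matrix standing in for a rectangular correlation result
M <- matrix(c( 0.90, 0.05, -0.60,
              -0.02, 0.30,  0.08), nrow = 2, byrow = TRUE)

# Convert to a sparse class, then drop all entries with |value| <= tol
S <- drop0(Matrix(M, sparse = TRUE), tol = 0.1)

nnzero(S)                 # number of entries that survive the threshold
nnzero(S) / prod(dim(S))  # share of non-zero entries
```

Strong negative values such as `-0.60` are kept, because the threshold is applied to the absolute value.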

## Author(s)

Michael Cysouw

Slightly extended and optimized, based on the code from a discussion on Stack Overflow.

## See Also

`cor` in the base packages, `cosSparse`, `assocSparse` for other sparse association measures.

## Examples

```r
## Not run: 
# reasonably fast (though not instantly!) with
# sparse matrices up to a resulting matrix size of 1e8 cells.
# However, the calculations and the resulting matrix take up lots of memory

X <- rSparseMatrix(1e4, 1e4, 1e5)
system.time(M <- corSparse(X))
print(object.size(M), units = "auto") # more than 750 Mb

# Most values are low, so it often makes sense
# to remove low values to keep results sparse

M <- drop0(M, tol = 0.4)
print(object.size(M), units = "auto") # normally reduces size by half or more
length(M@x) / prod(dim(M)) # down to less than 0.05% non-zero entries

## End(Not run)

# comparison with other methods
# corSparse is much faster than cor from the stats package
# but cosSparse is even quicker than both!

X <- rSparseMatrix(1e3, 1e3, 1e4)
X2 <- as.matrix(X)

# if there is a warning, try again with different random X
system.time(McorRegular <- cor(X2))
system.time(McorSparse <- corSparse(X))
system.time(McosSparse <- cosSparse(X))

# cor and corSparse give identical results

all.equal(McorSparse, McorRegular)

# corSparse and cosSparse are not identical, but close

McosSparse <- as.matrix(McosSparse)
dimnames(McosSparse) <- NULL
all.equal(McorSparse, McosSparse)

# Actually, cosSparse and corSparse are *almost* identical!

cor(as.dist(McorSparse), as.dist(McosSparse))

# Visually it looks completely identical
# Note: this takes some time to plot

## Not run: 
plot(as.dist(McorSparse), as.dist(McosSparse))

## End(Not run)

# So: consider using cosSparse instead of cor or corSparse.
# With sparse matrices, this gives mostly the same results,
# but much larger matrices are possible
# and the computations are quicker and more sparse
```