# assocSparse: Association between columns (sparse matrices)

In qlcMatrix: Utility Sparse Matrix Functions for Quantitative Language Comparison

## Description

This function offers an interface to various measures of association between columns in sparse matrices (based on functions of ‘observed’ and ‘expected’ values). Currently, the following measures are available: pointwise mutual information (aka log-odds), weighted pointwise mutual information, a Poisson-based measure and Pearson residuals. Further measures can easily be defined by the user. The calculations are optimized to deal with large sparse matrices. Note that these association values are only sensibly defined for binary data.

## Usage

```r
assocSparse(X, Y = NULL, method = res, N = nrow(X), sparse = TRUE)
```

## Arguments

• `X`: a sparse matrix in a format of the `Matrix` package, typically a `dgCMatrix` with only zeros or ones. The association will be calculated between the columns of this matrix.

• `Y`: a second matrix in a format of the `Matrix` package with the same number of rows as `X`. When `Y = NULL`, the associations between the columns of `X` and itself will be taken. If `Y` is specified, the association between the columns of `X` and the columns of `Y` will be calculated.

• `method`: the method to be used for the calculation. Currently `res` (residuals), `poi` (Poisson), `pmi` (pointwise mutual information) and `wpmi` (weighted pointwise mutual information) are available, but further methods can be specified by the user. See Details for more information.

• `N`: variable that is needed for the calculation of the expected values. Only in exceptional situations should this be different from the default value (i.e. the number of rows of the matrix).

• `sparse`: by default, nothing is computed when the observed co-occurrence of two columns is zero. This keeps the computations and the resulting matrix sparse. However, for some measures (specifically the Pearson residuals `res`) this leads to incorrect results. Mostly the error is negligible, but if the correct behavior is necessary, choose `sparse = F`. Note that the result will then be a full matrix, so this is not feasible for large datasets.

## Details

Computations are based on a comparison of the observed interaction `crossprod(X,Y)` and the expected interaction. Expectation is in principle computed from the column totals as `tcrossprod(colSums(abs(X)),colSums(abs(Y)))/nrow(X)`, which has the same dimensions as the observed interaction, though in practice the code is more efficient than that.
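As a rough illustration of this comparison, the following sketch computes the observed and expected interaction directly with the `Matrix` package. It assumes that the expectation is built from column totals, so that it has the same shape as `crossprod(X, Y)`. The toy matrices are made up, and this is not the optimized code used inside `assocSparse`.

```r
library(Matrix)

# Toy binary matrices with 4 rows (hypothetical data for illustration)
X <- sparseMatrix(i = c(1, 2, 2, 3), j = c(1, 1, 2, 2),
                  x = rep(1, 4), dims = c(4, 2))
Y <- sparseMatrix(i = c(1, 2, 4), j = c(1, 1, 2),
                  x = rep(1, 3), dims = c(4, 2))

# Observed co-occurrence counts: ncol(X) by ncol(Y)
O <- crossprod(X, Y)

# Expected counts under independence, from the column totals (assumption)
E <- tcrossprod(colSums(X), colSums(Y)) / nrow(X)

# Pearson residuals computed on the full (dense) tables
R <- (as.matrix(O) - E) / sqrt(E)
R
```

Computing the dense tables like this is only feasible for small matrices.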

Note that calculating the observed interaction as `crossprod(X,Y)` really only makes sense for binary data (i.e. matrices with only ones and zeros). Currently, all input is coerced to such data by `as(X, "nMatrix")*1`, meaning that all values that are not one or zero are turned into one (including negative values!).
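The effect of this coercion can be checked on a small example. The sketch below uses a made-up matrix with non-binary and negative entries and only relies on the `Matrix` package.

```r
library(Matrix)

# Hypothetical non-binary input, including a negative value
X <- sparseMatrix(i = c(1, 2, 3), j = c(1, 1, 2),
                  x = c(5, -2, 0.5), dims = c(3, 2))

# Coerce to a pattern matrix and back to numeric:
# every stored non-zero entry becomes 1
B <- as(X, "nMatrix") * 1

as.matrix(B)
```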

Any method can be defined as a function with two arguments, `o` and `e`, e.g. simply by specifying `method = function(o,e){o/e}`. See below for more examples.

The predefined functions are:

• `pmi`: pointwise mutual information, aka log-odds in bioinformatics, defined as
`pmi <- function(o,e) { log(o/e) }`.

• `wpmi`: weighted pointwise mutual information, defined as
`wpmi <- function(o,e) { o * log(o/e) }`.

• `res`: Pearson residuals, defined as
`res <- function(o,e) { (o-e) / sqrt(e) }`.

• `poi`: association assuming a poisson-distribution of the values, defined as
`poi <- function(o,e) { sign(o-e) * (o * log(o/e) - (o-e)) }`.
This measure seems to be very useful when the non-zero data is strongly skewed along the rows, i.e. when some rows are much fuller than others. A short explanation of this method can be found in Prokić and Cysouw (2013).
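To get a feeling for how these measures behave, the definitions above can be evaluated on a few observed and expected values. This is a small sketch with made-up numbers; it only restates the formulas as plain R functions and does not call `assocSparse`.

```r
# The predefined measures, written out as plain R functions of
# observed (o) and expected (e) values
pmi  <- function(o, e) { log(o / e) }
wpmi <- function(o, e) { o * log(o / e) }
res  <- function(o, e) { (o - e) / sqrt(e) }
poi  <- function(o, e) { sign(o - e) * (o * log(o / e) - (o - e)) }

# Hypothetical values: observed is twice the expectation,
# equal to it, and half of it
o <- c(4, 2, 1)
e <- c(2, 2, 2)

scores <- rbind(pmi = pmi(o, e), wpmi = wpmi(o, e),
                res = res(o, e), poi = poi(o, e))
round(scores, 3)
```

For these values, all four measures are positive when `o > e`, zero when `o` equals `e`, and negative when `o < e`; they differ in how strongly they scale with the size of `o`.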

## Value

The result is a sparse matrix with the non-zero association values. Values range between -Inf and +Inf, with values close to zero indicating low association. The exact interpretation of the values depends on the method used.

When `Y = NULL`, the result is a symmetric matrix, so a matrix of type `dsCMatrix` with size `ncol(X)` by `ncol(X)` is returned. When `X` and `Y` are both specified, a matrix of type `dgCMatrix` with size `ncol(X)` by `ncol(Y)` is returned.

## Note

Care is taken in the implementation not to compute any association between columns that will end up with a value of zero anyway. However, very small association values will be computed. For further usage, these small values are often unnecessary, and can be removed for reasons of sparsity. Consider something like `X <- drop0(X, tol = value)` on the resulting `X` matrix (which removes all values between -value and +value). See examples below.
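A minimal sketch of this pruning step, using `drop0` from the `Matrix` package on a made-up matrix of association values (the threshold of 1 is an arbitrary choice for illustration):

```r
library(Matrix)

# Hypothetical association values, some of them close to zero
M <- sparseMatrix(i = c(1, 2, 3, 3), j = c(1, 2, 1, 3),
                  x = c(0.01, -0.5, 3, -4), dims = c(3, 3))

# Remove all stored values between -1 and +1
M_small <- drop0(M, tol = 1)

nnzero(M)        # 4 non-zero values before pruning
nnzero(M_small)  # 2 remain: 3 and -4
```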

It is important to realize that by default nothing is computed when the observed co-occurrence is zero. However, this leads to wrong results with `method = res`, as `(o-e)/sqrt(e)` will be a negative value when `o = 0`. In most practical situations this error will be small and unimportant. However, when needed, the option `sparse = F` will give the correct results (though the resulting matrix will not be sparse anymore). Note that with all other methods implemented here, the default behavior leads to correct results (i.e. for `log(0)` nothing is calculated).
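The size of this error can be seen on a toy case. The sketch below uses two made-up binary columns that never co-occur and evaluates the Pearson residual by hand, without calling `assocSparse`.

```r
# Pearson residual, as defined in the Details section
res <- function(o, e) { (o - e) / sqrt(e) }

# Two hypothetical binary columns that never co-occur
x <- c(1, 1, 0, 0)
y <- c(0, 0, 1, 1)

o <- sum(x * y)                   # observed co-occurrence: 0
e <- sum(x) * sum(y) / length(x)  # expected co-occurrence: 1

res(o, e)  # -1, whereas the sparse default leaves this cell at 0
```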

The current implementation will not lead to correct results with lots of missing data (that option is simply not yet implemented). See `cosMissing` for now.

## Author(s)

Michael Cysouw

## References

Prokić, Jelena & Michael Cysouw. 2013. Combining regular sound correspondences and geographic spread. Language Dynamics and Change 3(2). 147–168.

## See Also

See `assocCol` and `assocRow` for this measure defined for nominal data. Also, see `corSparse` and `cosSparse` for other sparse association measures.

## Examples

```r
# ----- reasonably fast with large very sparse matrices -----

X <- rSparseMatrix(1e6, 1e6, 1e6, NULL)
system.time(M <- assocSparse(X, method = poi))
length(M@x) / prod(dim(M)) # only one in 1e6 cells non-zero

## Not run:
# ----- reaching limits of sparsity -----

# watch out:
# with slightly less sparse matrices the result will not be very sparse,
# so this will easily fill up your RAM during computation!

X <- rSparseMatrix(1e4, 1e4, 1e6, NULL)
system.time(M <- assocSparse(X, method = poi))
print(object.size(M), units = "auto") # about 350 Mb
length(M@x) / prod(dim(M)) # 30% filled

# most values are low, so it often makes sense
# to remove low values to keep results sparse

M <- drop0(M, tol = 2)
print(object.size(M), units = "auto") # reduces to 10 Mb
length(M@x) / prod(dim(M)) # down to less than 1% filled

## End(Not run)

# ----- defining new methods -----

# Using the following simple 'div' method is the same as
# using a cosine similarity with a 1-norm, up to a factor nrow(X)

div <- function(o,e) {o/e}
X <- rSparseMatrix(10, 10, 30, NULL)
all.equal(
  assocSparse(X, method = div),
  cosSparse(X, norm = norm1) * nrow(X)
)

# ----- comparing methods -----

# Compare various methods on random data
# ignore values on diagonal, because different methods differ strongly here
# Note the different behaviour of pointwise mutual information (and division)

X <- rSparseMatrix(1e2, 1e2, 1e3, NULL)

p <- assocSparse(X, method = poi); diag(p) <- 0
r <- assocSparse(X, method = res); diag(r) <- 0
m <- assocSparse(X, method = pmi); diag(m) <- 0
w <- assocSparse(X, method = wpmi); diag(w) <- 0
d <- assocSparse(X, method = div); diag(d) <- 0

pairs(~w@x+p@x+r@x+d@x+m@x,
  labels=c("weighted pointwise\nmutual information","poisson","residuals",
           "division", "pointwise\nmutual\ninformation"), cex = 0.7)
```