cosSparse: Cosine similarity between columns (sparse matrices)


View source: R/assoc.R

Description

cosSparse computes the cosine similarity between the columns of sparse matrices. Different normalizations and weightings can be specified. Performance-wise, this strongly improves over the approach taken in the corSparse function, while the results are almost identical for large sparse matrices. cosMissing adds the possibility to deal with large amounts of missing data.

Usage

cosSparse(X, Y = NULL, norm = norm2, weight = NULL)
cosMissing(X, availX, Y = NULL, availY = NULL, norm = norm2, weight = NULL)

Arguments

X

a sparse matrix in a format of the Matrix package, typically dgCMatrix. The similarity will be calculated between the columns of this matrix.

Y

a second matrix in a format of the Matrix package. When Y = NULL, then the similarity between the columns of X and itself will be taken. If Y is specified, the similarity between the columns of X and the columns of Y will be calculated.

availX, availY

sparse Matrices (typically pattern matrices) of the same size as X and Y, respectively, indicating the available information for each matrix.

norm

The function to be used for the normalization of the columns. Currently norm2 (euclidean norm) and norm1 (manhattan norm) are available, but further methods can be easily specified by the user. See details for more information.

weight

The function to be used for the weighting of the rows. Currently idf (inverse document frequency) and isqrt (inverse square root) are available, but further methods can be easily specified by the user. See details for more information.

Details

This measure is called a ‘cosine’ similarity because it computes the cosine of the angle between high-dimensional vectors. It can also be considered a Pearson correlation without centering. Because centering removes sparsity, and because centering has almost no influence on highly sparse matrices, this cosine similarity performs much better than the Pearson correlation, both in terms of speed and memory consumption.

The variant cosMissing can be used when the available information itself is also sparse. In such a situation, a zero in the data matrix X, Y can mean either ‘zero value’ or ‘missing data’. To deal with the missing data, matrices indicating the available data can be specified. Note that this really only makes sense when the available data is sparse itself. When, say, 90% of the data is available, the availX matrix becomes very large, and the results do not differ strongly from the regular cosSparse, i.e. from simply ignoring the missing data.
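A minimal sketch of how such an availability matrix might be set up; the toy matrix and the choice of which zero counts as truly observed are made up here for illustration:

```r
library(Matrix)

# toy data: a zero can mean 'value zero' or 'not observed'
X <- Matrix(c(1, 0, 2,
              0, 0, 3,
              4, 5, 0), 3, 3, sparse = TRUE, byrow = TRUE)

# start from the nonzero pattern, then mark the one cell that is
# (hypothetically) known to be a true zero, here cell [1,2]
availX <- X != 0
availX[1, 2] <- TRUE

# cosMissing(X, availX) then treats the remaining zeros as missing,
# whereas cosSparse(X) would treat them as zero values
```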

Different normalizations of the columns and weightings of the rows can be specified.

The predefined normalizations are defined as a function of the matrix x and a ‘summation function’ s (to be specified as a sparse matrix or a vector). This slight complexity is needed to deal with missing data. With complete data, s = rep(1, nrow(X)), so that crossprod(X,s) == colSums(X).
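As a sketch, the two predefined norms follow this pattern (these definitions mirror the ones in the qlcMatrix sources, but check the package itself for the authoritative versions); the last lines verify the colSums identity from the paragraph above:

```r
library(Matrix)

# sketches of the predefined norms: both take the matrix x and a
# 'summation' vector/matrix s
norm2 <- function(x, s) { drop(crossprod(x ^ 2, s)) ^ 0.5 }  # euclidean norm
norm1 <- function(x, s) { abs(drop(crossprod(x, s))) }       # manhattan norm

# with complete data, s is simply a vector of ones, and the
# crossprod reduces to a plain column sum:
X <- rsparsematrix(10, 5, density = 0.3)
s <- rep(1, nrow(X))
all.equal(drop(crossprod(X, s)), colSums(X))  # TRUE
```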

The predefined weightings are defined as a function of the frequency of a row (s) and the number of columns (N):
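The formulas themselves did not survive the conversion of this page. As a sketch (again mirroring the qlcMatrix sources, so verify against the package), the two predefined weightings look like this:

```r
# sketches of the predefined weightings: s is the frequency of a
# row, N the number of columns
idf   <- function(s, N) { log(N / (1 + s)) }  # inverse document frequency
isqrt <- function(s, N) { s ^ -0.5 }          # inverse square root

idf(1, 10)    # log(5)
isqrt(4, 10)  # 0.5
```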

It is trivial to define further norms or weighting functions. For example, one might define the following norm:

normL <- function(x,s) { abs(drop(crossprod(x,s))) ^ (1/2) }

When used on a symmetric pattern Matrix X, using this norm will result in a matrix that is known in spectral clustering as a ‘normalized Laplacian’ of the graph X.
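To see why, note what this norm does to a 0/1 pattern matrix (a minimal sketch; the small adjacency matrix below is made up for illustration):

```r
library(Matrix)

# the custom norm from above
normL <- function(x, s) { abs(drop(crossprod(x, s))) ^ (1/2) }

# a symmetric pattern matrix: adjacency of a 3-node star graph
A <- Matrix(c(0, 1, 1,
              1, 0, 0,
              1, 0, 0), 3, 3, sparse = TRUE)

# on a 0/1 matrix this norm is the square root of the column sums,
# i.e. the square root of the node degrees -- the D^(-1/2) scaling
# used for the normalized Laplacian in spectral clustering
normL(A, rep(1, nrow(A)))  # sqrt(c(2, 1, 1))
```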

Value

The result is a sparse matrix with the non-zero association values. Values range between -1 and +1, with values close to zero indicating low association.

When Y = NULL, then the result is a symmetric matrix, so a matrix of type dsCMatrix with size ncol(X) by ncol(X) is returned. When X and Y are both specified, a matrix of type dgCMatrix with size ncol(X) by ncol(Y) is returned.

Note

For large sparse matrices, consider this as an alternative to cor. See corSparse for a comparison of performance and results.

Author(s)

Michael Cysouw

See Also

corSparse, assocSparse for other sparse association measures. See also cosRow, cosCol for variants of cosSparse dealing with nominal data.

Examples

# functional upto limits of RAM of the hardware used
# consider removing small values of result to improve sparsity

## Not run: 

X <- rSparseMatrix(1e6, 1e6, 1e7)
print(object.size(X), units = "auto") # 118 Mb
system.time(M <- cosSparse(X)) # might take half a minute, depending on hardware
print(object.size(M), units = "auto") # 587 Mb
M <- drop0(M, tol = 0.1) # remove small values
print(object.size(M), units = "auto") # reduced to 141 Mb

## End(Not run)

# Compare various weightings.
# with random data from a normal distribution there is almost no difference
#
# data from a normal distribution
# X <- rSparseMatrix(1e2, 1e2, 1e3) 

# with heavily skewed data there is a strong difference!
X <- rSparseMatrix(1e2, 1e2, 1e3,
	rand.x = function(n){round(rpois(1e3, 10), 2)})

w0 <- cosSparse(X, norm = norm2, weight = NULL)@x
wi <- cosSparse(X, norm = norm2, weight = idf)@x
ws <- cosSparse(X, norm = norm2, weight = isqrt)@x

pairs(~ w0 + wi + ws, 
  labels=c("no weighting","inverse\ndocument\nfrequency","inverse\nsquare root"))

cysouw/qlcMatrix documentation built on Dec. 18, 2017, 9:12 a.m.