sparse.cos: Cosine similarity on sparse matrices/vectors

View source: R/sparse.cos.R

Description

Cosine similarity between columns or rows of a single sparse matrix or a pair of sparse matrices and/or vectors

Usage

sparse.cos(x, y = NULL, return.sparse = FALSE)

Arguments

x

matrix or vector of, or coercible to, class "dgCMatrix" or "sparseVector"

y

(optional) matrix or vector of, or coercible to, class "dgCMatrix" or "sparseVector"

return.sparse

logical; if the result is a matrix, return it as a "dgeMatrix" when TRUE, otherwise as a dense "matrix"

Details

Cosine similarity can be computed very efficiently on sparse matrices because only the nonzero entries contribute to the dot products and norms.

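"sparse.cos" normalizes each vector by its Euclidean (L2) norm. The result is similar to Pearson correlation, except that the vectors are not mean-centered, so for non-negative input the similarities are restricted to the positive orthant and lie between 0 and 1.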
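
The relationship to Pearson correlation can be illustrated with a small base R sketch. It does not call "sparse.cos"; the helper `cosine` and the vectors `a` and `b` are hypothetical names and values chosen only for illustration. Pearson correlation is the cosine similarity of the mean-centered vectors, whereas plain cosine similarity skips the centering step.

```r
# Two non-negative example vectors (arbitrary illustrative values)
a <- c(1, 0, 2, 0, 3)
b <- c(0, 1, 2, 0, 4)

# Hypothetical helper: dot product divided by the product of Euclidean norms
cosine <- function(u, v) sum(u * v) / (sqrt(sum(u^2)) * sqrt(sum(v^2)))

cos_raw      <- cosine(a, b)                      # no centering
cos_centered <- cosine(a - mean(a), b - mean(b))  # center first, then cosine
pearson      <- cor(a, b)                         # stats::cor, Pearson by default

# Centered cosine and Pearson correlation agree up to floating-point error
all.equal(cos_centered, pearson)
```

Because `a` and `b` contain no negative entries, `cos_raw` cannot be negative, while `pearson` can take any value in [-1, 1].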

This function adopts the sparse matrix computational strategy applied by qlcMatrix::cosSparse, and extends it to sparse vectors.
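
As a rough illustration of how column-wise cosine similarity can be computed while staying in sparse form, the sketch below scales each column to unit Euclidean length and then takes a cross-product. It is not the LSMF or qlcMatrix source code; `cos_columns` is a hypothetical helper, and it assumes that no column is entirely zero.

```r
library(Matrix)

# Hypothetical helper, not part of LSMF or qlcMatrix.
# Assumes every column has at least one nonzero entry.
cos_columns <- function(x) {
  x <- as(x, "CsparseMatrix")
  norms <- sqrt(colSums(x^2))              # Euclidean norm of each column
  scaled <- x %*% Diagonal(x = 1 / norms)  # unit-length columns, still sparse
  crossprod(scaled)                        # t(scaled) %*% scaled
}

# Small deterministic 4 x 3 sparse matrix
m <- sparseMatrix(i = c(1, 3, 2, 3, 1, 4),
                  j = c(1, 1, 2, 2, 3, 3),
                  x = c(1, 2, 3, 4, 5, 6),
                  dims = c(4, 3))

sim <- as.matrix(cos_columns(m))

# Dense reference: cross-product divided by the outer product of column norms
dense <- as.matrix(m)
norms <- sqrt(colSums(dense^2))
ref <- crossprod(dense) / outer(norms, norms)

all.equal(sim, ref)
```

Scaling the columns first means the only large operation is a single sparse cross-product, which is why this kind of approach suits matrices with few nonzero entries.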

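Note that negative values may be returned when the input contains negative entries. In practice this usually only happens with random matrices, such as those generated by Matrix::rsparsematrix, whose default values are drawn from a normal distribution.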

Examples

## Not run: 
library(Matrix)

m1 <- rsparsematrix(1000, 10000, density = 0.1)
m2 <- rsparsematrix(1000, 100, density = 0.2)

# Input a vector and a vector
r <- sparse.cos(m1[,1],m1[,2])

# Input a vector and a matrix
r <- sparse.cos(m1[,1],m1[,1:100])

# Input a matrix and a vector
r <- sparse.cos(m1[,1:100],m1[,1])

# Input just a single matrix
res_m2 <- sparse.cos(m2)

# Input a matrix and a matrix
res <- sparse.cos(m1, m2)
# note that negative values are returned because these random matrices contain negative entries
plot(density(res@x))

# have a look at a non-random matrix.
# this matrix shows similarity of gene expression across cells from mouse embryos
data(moca7k)
res <- sparse.cos(moca7k[,1:1000])
plot(density(res@x))
# note how the non-random signal resulted in no negative values

# calculate distance from similarity
# subtract from 1 plus a very small number so that machine tolerance cannot cause negative distances
dist <- 1 + 1e-10 - res
lines(density(dist@x), col = "red")

# qlcMatrix::cosSparse is a great standard for comparison
# also consider wordspace::dist.matrix, but it is only faster under some conditions

library(qlcMatrix)
max(abs(as.matrix(qlcMatrix::cosSparse(moca7k[,1:1000])) - res))
# [1] 3.352874e-14

library(rbenchmark)

# compare to qlcMatrix::cosSparse
moca.sparse <- moca7k[,1:1000]
moca.dense <- as.matrix(moca.sparse)
benchmark(
   "lsmf::sparse.cos"     = sparse.cos(moca.sparse),
   "qlcMatrix::cosSparse" = qlcMatrix::cosSparse(moca.sparse),
   replications = 10)

#                   test replications elapsed relative 
# 1     lsmf::sparse.cos           10    2.98    1.000      
# 2 qlcMatrix::cosSparse           10    3.08    1.034     

# compare to cor() (from the stats package) on the dense matrix
benchmark(
   "lsmf::sparse.cos"     = sparse.cos(moca.sparse),
   "base::cor"            = cor(moca.dense),
   replications = 1)

#                test  replications elapsed relative 
# 1         base::cor             1    6.42   22.138     
# 2  lsmf::sparse.cos             1    0.29    1.000      

## End(Not run)

zdebruine/LSMF documentation built on Jan. 1, 2021, 1:50 p.m.