simil | R Documentation |
Fast similarity/distance computation function for large sparse matrices. You
can floor small similarity values to save computation time and storage
space by an arbitrary threshold (min_simil) or rank (rank). You
can specify the number of threads for parallel computing via
options(proxyC.threads).
simil(
  x,
  y = NULL,
  margin = 1,
  method = c("cosine", "correlation", "jaccard", "ejaccard", "fjaccard", "dice", "edice",
    "hamann", "faith", "simple matching"),
  min_simil = NULL,
  rank = NULL,
  drop0 = FALSE,
  diag = FALSE,
  use_nan = NULL,
  digits = 14
)
dist(
  x,
  y = NULL,
  margin = 1,
  method = c("euclidean", "chisquared", "kullback", "jeffreys", "jensen", "manhattan",
    "maximum", "canberra", "minkowski", "hamming"),
  p = 2,
  smooth = 0,
  drop0 = FALSE,
  diag = FALSE,
  use_nan = NULL,
  digits = 14
)
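As a minimal sketch of how x, y and margin interact in the signatures above: with margin = 1 the rows of x are compared with the rows of y, so both need the same number of columns; with margin = 2 the columns are compared, so both need the same number of rows. The matrix sizes, densities and the seed are arbitrary choices made for illustration.

```r
library(proxyC)

set.seed(1)

# margin = 1: compare the 10 rows of x with the 4 rows of y (20 columns each)
x <- Matrix::rsparsematrix(10, 20, 0.5)
y <- Matrix::rsparsematrix(4, 20, 0.5)
s_rows <- simil(x, y, margin = 1, method = "cosine")
dim(s_rows)

# margin = 2: compare the 20 columns of x2 with the 6 columns of y2 (10 rows each)
x2 <- Matrix::rsparsematrix(10, 20, 0.5)
y2 <- Matrix::rsparsematrix(10, 6, 0.5)
d_cols <- proxyC::dist(x2, y2, margin = 2, method = "euclidean")
dim(d_cols)
```

Calling proxyC::dist() with the namespace prefix avoids confusion with stats::dist(), which has the same name.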
x |
matrix or Matrix object. Dense matrices are converted to the CsparseMatrix-class internally. |
y |
if a matrix or Matrix object is provided, proximity
between documents or features in |
margin |
integer indicating the margin of similarity/distance computation: 1 indicates rows and 2 indicates columns. |
method |
method to compute similarity or distance |
min_simil |
the minimum similarity value to be recorded. |
rank |
an integer value specifying the top-n highest similarity values to be recorded. |
drop0 |
if |
diag |
if |
use_nan |
if |
digits |
determines rounding of small values towards zero. Use primarily to correct rounding errors in C++. See zapsmall. |
p |
weight for Minkowski distance |
smooth |
adds a fixed value to all the cells to avoid division by zero.
Only used when |
Similarity:
cosine: cosine similarity
correlation: Pearson's correlation
jaccard: Jaccard coefficient
ejaccard: the real value version of jaccard
fjaccard: Fuzzy Jaccard coefficient
dice: Dice coefficient
edice: the real value version of dice
hamann: Hamann similarity
faith: Faith similarity
simple matching: the percentage of common elements
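To make one of these measures concrete, the sketch below computes cosine similarity for a single pair of rows by hand (dot product divided by the product of the Euclidean norms) and compares it with the value returned by simil(). The two rows are made-up illustration data, and the comparison uses all.equal() because results are rounded according to digits.

```r
library(proxyC)

# Two small non-zero rows; values chosen only for illustration
m <- rbind(a = c(1, 2, 0, 3),
           b = c(2, 0, 1, 3))
sm <- Matrix::Matrix(m, sparse = TRUE)

# Cosine similarity by hand: dot product over the product of Euclidean norms
cos_manual <- sum(m["a", ] * m["b", ]) /
  (sqrt(sum(m["a", ]^2)) * sqrt(sum(m["b", ]^2)))

# Same pair via simil(); margin = 1 compares rows
cos_pkg <- as.matrix(simil(sm, method = "cosine", margin = 1))[1, 2]

all.equal(cos_manual, cos_pkg)
```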
Distance:
euclidean: Euclidean distance
chisquared: chi-squared distance
kullback: Kullback–Leibler divergence
jeffreys: Jeffreys divergence
jensen: Jensen–Shannon divergence
manhattan: Manhattan distance
maximum: the largest difference between values
canberra: Canberra distance
minkowski: Minkowski distance
hamming: Hamming distance
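As a rough check of two of the distance measures, the sketch below computes the Euclidean and Manhattan distances between two rows by hand, using the textbook definitions, and compares them with dist(). The rows are made-up illustration data.

```r
library(proxyC)

# Two small rows; values chosen only for illustration
m <- rbind(a = c(1, 2, 0, 3),
           b = c(2, 0, 1, 3))
sm <- Matrix::Matrix(m, sparse = TRUE)

# Element-wise difference between the two rows
d <- m["a", ] - m["b", ]

euc_manual <- sqrt(sum(d^2))  # square root of the summed squared differences
man_manual <- sum(abs(d))     # sum of the absolute differences

# Same pair via proxyC::dist(); the prefix avoids picking up stats::dist()
euc_pkg <- as.matrix(proxyC::dist(sm, method = "euclidean"))[1, 2]
man_pkg <- as.matrix(proxyC::dist(sm, method = "manhattan"))[1, 2]

all.equal(euc_manual, euc_pkg)
all.equal(man_manual, man_pkg)
```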
See the vignette for how the similarity and distance are computed:
vignette("measures", package = "proxyC")
It performs parallel computing using Intel oneAPI Threads Building Blocks.
The number of threads for parallel computing should be specified via
options(proxyC.threads) before calling the functions. If the value is -1,
all the available threads will be used. Unless the option is used, the
number of threads will be limited by the environment variables
(OMP_THREAD_LIMIT or RCPP_PARALLEL_NUM_THREADS) to comply with CRAN
policy and offer backward compatibility.
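A minimal sketch of setting the thread option before a call follows. The value 2, the matrix size and the seed are arbitrary choices for illustration; the previous option value is saved and restored so the setting does not leak into the rest of the session.

```r
library(proxyC)

# Request two threads; use -1 to request all available threads
old <- options(proxyC.threads = 2)

set.seed(1)
mt <- Matrix::rsparsematrix(100, 50, 0.2)
s <- simil(mt, method = "cosine")

# Restore whatever value (possibly unset) was in place before
options(old)
```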
zapsmall
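library(proxyC)
# Cosine similarity between the rows of a random sparse matrix
mt <- Matrix::rsparsematrix(100, 100, 0.01)
simil(mt, method = "cosine")[1:5, 1:5]
# Euclidean distance between the rows of a random sparse matrix
mt <- Matrix::rsparsematrix(100, 100, 0.01)
dist(mt, method = "euclidean")[1:5, 1:5]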
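The min_simil and rank arguments can be illustrated in the same way. The sketch below compares how many non-zero values are stored with and without them; the threshold 0.5, the rank 3, the matrix size and the seed are arbitrary. This page does not state how the top-n selection is applied (for example, per row or per column of the result), so the sketch only counts stored values and makes no claim about which ones are kept.

```r
library(proxyC)

set.seed(42)
mt <- Matrix::rsparsematrix(50, 20, 0.5)

# All pairwise cosine similarities between rows
full <- simil(mt, method = "cosine")

# Record only similarities of at least 0.5
thresholded <- simil(mt, method = "cosine", min_simil = 0.5)

# Record only the top 3 similarity values
top3 <- simil(mt, method = "cosine", rank = 3)

# Number of non-zero entries stored in each result
c(full = Matrix::nnzero(full),
  thresholded = Matrix::nnzero(thresholded),
  top3 = Matrix::nnzero(top3))
```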