| simil | R Documentation |
Fast similarity/distance computation function for large sparse matrices. You
can floor small similarity values to zero to save computation time and storage
space, either by an arbitrary threshold (min_simil) or by rank (rank). You
can specify the number of threads for parallel computing via
options(proxyC.threads).
simil(
x,
y = NULL,
margin = 1,
method = c("cosine", "correlation", "dice", "edice", "jaccard", "ejaccard", "fjaccard",
"hamann", "faith", "simple matching"),
mask = NULL,
min_simil = NULL,
rank = NULL,
drop0 = FALSE,
diag = FALSE,
use_nan = NULL,
sparse = TRUE,
digits = 14
)
dist(
x,
y = NULL,
margin = 1,
method = c("euclidean", "chisquared", "kullback", "jeffreys", "jensen", "manhattan",
"maximum", "canberra", "minkowski", "hamming"),
mask = NULL,
p = 2,
smooth = 0,
drop0 = FALSE,
diag = FALSE,
use_nan = NULL,
sparse = TRUE,
digits = 14
)
x |
a base::matrix or Matrix::Matrix object. Dense matrices are converted to Matrix::CsparseMatrix internally. |
y |
if a base::matrix or Matrix::Matrix object is provided, proximity
between documents or features in |
margin |
integer indicating the margin of similarity/distance computation: 1 indicates rows and 2 indicates columns. |
method |
method to compute similarity or distance |
mask |
a pattern matrix created using |
min_simil |
the minimum similarity value to be recorded. |
rank |
an integer value specifying the top-n largest similarity values to be recorded. |
drop0 |
if |
diag |
if |
use_nan |
if |
sparse |
if |
digits |
determines rounding of small values towards zero. Used primarily to correct floating point errors. Rounding is performed in C++ in a similar way to base::zapsmall. |
p |
weight for Minkowski distance. |
smooth |
adds a fixed value to all the cells to avoid division by zero.
Only used when |
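The interplay of x, y and margin can be illustrated with a minimal sketch. It assumes that, with margin = 2, x and y share the same number of rows and the result has one row per column of x and one column per column of y; the matrix sizes and the seed are arbitrary illustrative choices.

```r
library(proxyC)
library(Matrix)

set.seed(42)
x <- rsparsematrix(20, 6, 0.5)  # 20 rows, 6 columns
y <- rsparsematrix(20, 4, 0.5)  # same number of rows as x

# margin = 2: compare the columns of x with the columns of y
s_col <- simil(x, y, margin = 2, method = "cosine")
dim(s_col)

# margin = 1 with y = NULL: compare the rows of x with one another
s_row <- simil(x, margin = 1, method = "cosine")
dim(s_row)
```

Leaving y as NULL gives a square matrix of proximities within x, while supplying y restricts the computation to pairs drawn from x and y.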
Similarity:
cosine: cosine similarity
correlation: Pearson's correlation
jaccard: Jaccard coefficient
ejaccard: the real-valued version of jaccard
fjaccard: Fuzzy Jaccard coefficient
dice: Dice coefficient
edice: the real-valued version of dice
hamann: Hamann similarity
faith: Faith similarity
simple matching: the percentage of common elements
Distance:
euclidean: Euclidean distance
chisquared: chi-squared distance
kullback: Kullback–Leibler divergence
jeffreys: Jeffreys divergence
jensen: Jensen–Shannon divergence
manhattan: Manhattan distance
maximum: the largest difference between values
canberra: Canberra distance
minkowski: Minkowski distance
hamming: Hamming distance
See the vignette for how the similarity and distance are computed:
vignette("measures", package = "proxyC")
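As a quick sanity check, the sketch below compares two of the measures with their textbook definitions on a small hand-made matrix. The matrix m and the helper names cos_manual and euc_manual are illustrative only; the vignette remains the authoritative description of how each measure is computed.

```r
library(proxyC)

# Small dense matrix; simil() and dist() accept base matrices as well
m <- matrix(c(1, 2, 0,
              0, 1, 3,
              2, 0, 1), nrow = 3, byrow = TRUE)

# Textbook definitions applied to rows 1 and 2
a <- m[1, ]
b <- m[2, ]
cos_manual <- sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
euc_manual <- sqrt(sum((a - b)^2))

# dist() is called with its namespace because stats::dist has the same name
s <- as.matrix(simil(m, margin = 1, method = "cosine"))
d <- as.matrix(proxyC::dist(m, margin = 1, method = "euclidean"))

all.equal(s[1, 2], cos_manual)
all.equal(d[1, 2], euc_manual)
```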
Parallel computing is performed using Intel oneAPI Threading Building Blocks.
The number of threads for parallel computing should be specified via
options(proxyC.threads) before calling the functions. If the value is -1,
all the available threads will be used. If the option is not set, the
number of threads will be limited by the environment variables
(OMP_THREAD_LIMIT or RCPP_PARALLEL_NUM_THREADS) to comply with CRAN
policy and offer backward compatibility.
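A minimal sketch of setting the option before calling the functions follows. The thread count of 2 is an arbitrary illustrative value, and saving and restoring the previous setting is a general R idiom rather than something the package requires.

```r
library(proxyC)

mt <- Matrix::rsparsematrix(100, 100, 0.01)

# Limit subsequent proxyC calls to two threads (illustrative value)
old <- options(proxyC.threads = 2)
s <- simil(mt, method = "cosine")

# Use all available threads
options(proxyC.threads = -1)
d <- proxyC::dist(mt, method = "euclidean")

# Restore the previous setting
options(old)
```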
zapsmall
mt <- Matrix::rsparsematrix(100, 100, 0.01)
simil(mt, method = "cosine")[1:5, 1:5]
mt <- Matrix::rsparsematrix(100, 100, 0.01)
dist(mt, method = "euclidean")[1:5, 1:5]
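The effect of min_simil and rank on storage can be seen by counting the non-zero entries of the result. The sketch below uses a small hand-made matrix; the threshold 0.6 and the rank 2 are arbitrary illustrative values.

```r
library(proxyC)

m <- matrix(c(1, 2, 0, 1,
              0, 1, 3, 0,
              2, 0, 1, 1,
              1, 1, 1, 1), nrow = 4, byrow = TRUE)

s_all <- simil(m, method = "cosine")                   # keep everything
s_min <- simil(m, method = "cosine", min_simil = 0.6)  # keep values of at least 0.6
s_top <- simil(m, method = "cosine", rank = 2)         # keep only top-ranked values

# Number of non-zero entries under each setting
Matrix::nnzero(s_all)
Matrix::nnzero(s_min)
Matrix::nnzero(s_top)
```

Both arguments trade completeness for a sparser result, which matters most when x has many rows or columns.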