simil: Compute similarity/distance between rows or columns of large...

View source: R/proxy.R

simil R Documentation

Compute similarity/distance between rows or columns of large matrices

Description

Fast similarity/distance computation function for large sparse matrices. To save computation time and storage space, you can floor small similarity values to zero using an arbitrary threshold (min_simil) or rank (rank). For better performance, increase the number of threads using setThreadOptions.
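As a rough sketch of the thresholding described above, the following compares how many non-zero similarities are kept with and without min_simil and rank. The matrix size, density, seed and cut-off values are arbitrary choices for illustration, and the thread setting assumes that the RcppParallel package (which provides setThreadOptions) is installed.

```r
library(proxyC)

# Optional: raise the thread count with RcppParallel::setThreadOptions()
# (guarded, in case RcppParallel is not installed on this machine).
if (requireNamespace("RcppParallel", quietly = TRUE)) {
  RcppParallel::setThreadOptions(numThreads = 2)
}

set.seed(1)  # arbitrary seed so the sketch is reproducible
mt <- Matrix::rsparsematrix(100, 100, 0.1)

s_all <- simil(mt, method = "cosine")                   # no thresholding
s_min <- simil(mt, method = "cosine", min_simil = 0.5)  # record values >= 0.5 only
s_top <- simil(mt, method = "cosine", rank = 3)         # record top-ranked values only

# Number of non-zero similarities kept under each setting
c(all       = Matrix::nnzero(s_all),
  min_simil = Matrix::nnzero(s_min),
  rank      = Matrix::nnzero(s_top))
```

The thresholded results should hold far fewer non-zero values than the unrestricted one, which is where the savings in storage come from.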

Usage

simil(
  x,
  y = NULL,
  margin = 1,
  method = c("cosine", "correlation", "jaccard", "ejaccard", "fjaccard", "dice", "edice",
    "hamann", "faith", "simple matching"),
  min_simil = NULL,
  rank = NULL,
  drop0 = FALSE,
  diag = FALSE,
  use_nan = NULL,
  digits = 14
)

dist(
  x,
  y = NULL,
  margin = 1,
  method = c("euclidean", "chisquared", "kullback", "jeffreys", "jensen", "manhattan",
    "maximum", "canberra", "minkowski", "hamming"),
  p = 2,
  smooth = 0,
  drop0 = FALSE,
  diag = FALSE,
  use_nan = NULL,
  digits = 14
)

Arguments

x

matrix or Matrix object. Dense matrices are converted to the CsparseMatrix-class internally.

y

if a matrix or Matrix object is provided, proximity is computed between the rows or columns (e.g. documents or features) of x and those of y.

margin

integer indicating the margin of the similarity/distance computation: 1 indicates rows and 2 indicates columns.

method

method to compute similarity or distance

min_simil

the minimum similarity value to be recorded.

rank

an integer value specifying that only the top-n highest similarity values are recorded.

drop0

if TRUE, zero values are removed regardless of min_simil or rank.

diag

if TRUE, only compute diagonal elements of the similarity/distance matrix; useful when comparing corresponding rows or columns of x and y.

use_nan

if TRUE, return NaN when the standard deviation of a vector is zero (method "correlation"), or when all the values in a vector are zero (method "cosine", "chisquared", "kullback", "jeffreys" or "jensen"). Note that the use of NaN makes the similarity/distance matrix denser and therefore larger in RAM. If FALSE, return zero in the same situations. If NULL (default), also return zero but generate a warning.

digits

determines the rounding of small values towards zero. Used primarily to correct rounding errors in C++. See zapsmall.

p

weight for Minkowski distance, i.e. the exponent (power) applied to the absolute differences.

smooth

adds a fixed value to all the cells to avoid division by zero. Only used when method is "chisquared", "kullback", "jeffreys" or "jensen".
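To make the y, margin and diag arguments above more concrete, here is a minimal sketch. The matrix sizes, density and seed are arbitrary, and the hand-computed cosine is included only as a cross-check of the diag = TRUE result.

```r
library(proxyC)

set.seed(2)  # arbitrary seed so the sketch is reproducible
x <- Matrix::rsparsematrix(50, 20, 0.5)
y <- Matrix::rsparsematrix(50, 20, 0.5)

# margin = 1 compares rows, margin = 2 compares columns
s_rows <- simil(x, margin = 1)            # 50 x 50
s_cols <- simil(x, margin = 2)            # 20 x 20

# With y supplied, rows of x are compared with rows of y
s_xy <- simil(x, y[1:30, ], margin = 1)   # 50 x 30

# diag = TRUE computes only the pairs (row i of x, row i of y)
s_diag <- simil(x, y, margin = 1, method = "cosine", diag = TRUE)
paired <- Matrix::diag(s_diag)

# Cross-check against cosine similarity computed by hand for each row pair
manual <- Matrix::rowSums(x * y) /
  sqrt(Matrix::rowSums(x^2) * Matrix::rowSums(y^2))
all.equal(paired, manual, check.attributes = FALSE)
```

Using diag = TRUE avoids computing the off-diagonal pairs, which is the point when only corresponding rows or columns matter.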

Details

Available methods for similarity:

  • cosine: cosine similarity

  • correlation: Pearson's correlation

  • jaccard: Jaccard coefficient

  • ejaccard: the real value version of jaccard

  • fjaccard: Fuzzy Jaccard coefficient

  • dice: Dice coefficient

  • edice: the real value version of dice

  • hamann: Hamann similarity

  • faith: Faith similarity

  • simple matching: the percentage of common elements
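As an illustration of the similarity methods listed above, the sketch below checks cosine similarity against a value computed by hand and shows how an all-zero vector is reported under use_nan. The small matrix is made up for this purpose.

```r
library(proxyC)

# Rows 1 and 3 are ordinary vectors; row 2 is all zeros
m <- rbind(c(1, 2, 3),
           c(0, 0, 0),
           c(3, 2, 1))
x <- Matrix::Matrix(m, sparse = TRUE)

# Cosine similarity of rows 1 and 3 by hand:
# (1*3 + 2*2 + 3*1) / (sqrt(14) * sqrt(14)) = 10 / 14
by_hand <- sum(m[1, ] * m[3, ]) / sqrt(sum(m[1, ]^2) * sum(m[3, ]^2))

s_zero <- as.matrix(simil(x, method = "cosine", use_nan = FALSE))
s_nan  <- as.matrix(simil(x, method = "cosine", use_nan = TRUE))

all.equal(s_zero[1, 3], by_hand)  # proxyC agrees with the manual value
s_zero[1, 2]                      # pair involving the zero vector: 0
s_nan[1, 2]                       # same pair with use_nan = TRUE: NaN
```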

Available methods for distance:

  • euclidean: Euclidean distance

  • chisquared: chi-squared distance

  • kullback: Kullback–Leibler divergence

  • jeffreys: Jeffreys divergence

  • jensen: Jensen–Shannon divergence

  • manhattan: Manhattan distance

  • maximum: the largest difference between values

  • canberra: Canberra distance

  • minkowski: Minkowski distance

  • hamming: Hamming distance

See the vignette for how the similarity and distance are computed: vignette("measures", package = "proxyC")
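For the distance methods, a few sanity checks can be sketched in the same way. This assumes the usual definitions of the Euclidean, Manhattan and Minkowski distances (the vignette remains the authoritative description), and uses an arbitrary random matrix. Note that proxyC's dist has the same name as stats::dist, so the calls are namespaced here.

```r
library(proxyC)

set.seed(3)  # arbitrary seed so the sketch is reproducible
x <- Matrix::rsparsematrix(20, 10, 0.5)

d_euc <- as.matrix(proxyC::dist(x, method = "euclidean"))
d_man <- as.matrix(proxyC::dist(x, method = "manhattan"))
d_mk1 <- as.matrix(proxyC::dist(x, method = "minkowski", p = 1))
d_mk2 <- as.matrix(proxyC::dist(x, method = "minkowski", p = 2))

# Base R reference for the Euclidean distance between rows
d_base <- as.matrix(stats::dist(as.matrix(x), method = "euclidean"))

all.equal(d_mk1, d_man, check.attributes = FALSE)   # Minkowski p = 1 vs Manhattan
all.equal(d_mk2, d_euc, check.attributes = FALSE)   # Minkowski p = 2 vs Euclidean
all.equal(d_euc, d_base, check.attributes = FALSE)  # proxyC vs stats::dist
```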

See Also

zapsmall

Examples

library(proxyC)
mt <- Matrix::rsparsematrix(100, 100, 0.01)
# cosine similarity between rows; show the top-left 5 x 5 block
simil(mt, method = "cosine")[1:5, 1:5]
# Euclidean distance between rows of the same matrix
dist(mt, method = "euclidean")[1:5, 1:5]
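A further sketch for the divergence-based methods, which use the smooth argument. The count matrix is made up, and smooth = 1 is an arbitrary choice; every row contains zero cells, which is the situation smooth is meant to handle.

```r
library(proxyC)

# Small non-negative count matrix with zero cells in every row
counts <- Matrix::Matrix(rbind(c(4, 0, 1, 0),
                               c(0, 3, 0, 2),
                               c(1, 1, 0, 5)), sparse = TRUE)

# smooth = 1 adds a fixed value of one to all the cells
d_kl <- proxyC::dist(counts, method = "kullback", smooth = 1)
m_kl <- as.matrix(d_kl)
m_kl
```

With a positive smooth value, all pairwise divergences in this small example should come out finite.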

proxyC documentation built on Oct. 25, 2023, 9:06 a.m.