Locality Sensitive Hashing in R

LSHR is a fast, memory-efficient package for near-neighbor search in high-dimensional data. Two LSH schemes are implemented at the moment:

Minhashing for Jaccard similarity
Sketching (or random projections) for cosine similarity

Most of the ideas are based on the brilliant Mining of Massive Datasets book.
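The two schemes can be illustrated with a minimal base-R sketch. This is not LSHR's implementation: the helper names (`minhash_signature`, `jaccard`, `cosine`), the explicit permutation matrix, and the toy data are all assumptions made for illustration. It shows that the fraction of matching minhashes estimates Jaccard similarity, and that the fraction of matching sign bits under random hyperplanes estimates the angle between two vectors.

```r
# Illustrative base-R sketch; not LSHR's internals.
set.seed(42)

# Minhashing: estimate Jaccard similarity between two sets of integer ids.
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

minhash_signature <- function(set, perms) {
  # perms[i, k] is the rank of element i under random permutation k.
  # The minhash under permutation k is the smallest rank among the set's elements.
  apply(perms[set, , drop = FALSE], 2, min)
}

universe_size <- 1000
n_hashes <- 500
perms <- replicate(n_hashes, sample.int(universe_size))  # universe_size x n_hashes

a <- 1:300
b <- 101:400
sig_a <- minhash_signature(a, perms)
sig_b <- minhash_signature(b, perms)

jaccard_true <- jaccard(a, b)         # 200 shared ids / 400 distinct ids = 0.5
jaccard_est  <- mean(sig_a == sig_b)  # fraction of matching minhashes

# Sketching: estimate cosine similarity from signs of random projections.
cosine <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

dims <- 50
n_bits <- 2000
planes <- matrix(rnorm(n_bits * dims), nrow = n_bits)  # one random hyperplane per row

x <- rnorm(dims)
y <- x + rnorm(dims, sd = 0.5)

bits_x <- as.vector(planes %*% x) >= 0
bits_y <- as.vector(planes %*% y) >= 0

# P(bits agree) = 1 - theta / pi, so the angle theta follows from the agreement rate.
theta_est   <- pi * (1 - mean(bits_x == bits_y))
cosine_true <- cosine(x, y)
cosine_est  <- cos(theta_est)
```

Both estimates get tighter as the number of hash functions grows, which is why the signature length is the main accuracy knob in either scheme.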

Materials

Slides (in English) and video (in Russian) from my talk at the Moscow Data Science meetup.

Quick reference

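# devtools::install_github('dselivanov/text2vec')
# devtools::install_github('dselivanov/LSHR')
library(text2vec)
library(LSHR)
data("movie_review")
# Tokenize the lower-cased reviews into words.
it <- itoken(movie_review$review, preprocess_function = tolower, tokenizer = word_tokenizer)
# Build a document-term matrix using feature hashing.
dtm <- create_dtm(it, hash_vectorizer())
# Convert to row-oriented sparse storage.
dtm = as(dtm, "RsparseMatrix")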

hashfun_number = 120
s_curve <- get_s_curve(hashfun_number, n_bands_min = 5, n_rows_per_band_min = 5)
# Examine S-curve.
# Find tradeoff between accuracy and false-positive rate.

Figure: S-curves
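For readers who want to see what an S-curve encodes, here is a minimal base-R sketch of the standard banding formula from Mining of Massive Datasets: with b bands of r rows each, a pair whose individual hashes agree with probability s becomes a candidate with probability 1 - (1 - s^r)^b. The function `candidate_prob` and the way the splits are enumerated are assumptions for illustration; `get_s_curve` may compute or present this differently. For minhashing, s is the Jaccard similarity itself; for sign-based sketching, s is the per-bit agreement probability 1 - theta / pi.

```r
# Probability that a pair becomes a candidate under banding, given the
# per-hash agreement probability s, the number of bands, and rows per band.
candidate_prob <- function(s, n_bands, n_rows_per_band) {
  1 - (1 - s^n_rows_per_band)^n_bands
}

# Assumed enumeration: every exact split of 120 hash functions into bands
# with at least 5 bands and at least 5 rows per band.
n_hashes <- 120
n_bands <- Filter(
  function(b) n_hashes %% b == 0 && b >= 5 && n_hashes / b >= 5,
  seq_len(n_hashes)
)
splits <- data.frame(n_bands = n_bands, n_rows_per_band = n_hashes / n_bands)

# Approximate similarity at which each curve rises most steeply: (1 / b)^(1 / r).
splits$threshold <- (1 / splits$n_bands)^(1 / splits$n_rows_per_band)

# One column per split, one row per similarity value.
s <- seq(0, 1, by = 0.01)
curves <- sapply(seq_len(nrow(splits)), function(i) {
  candidate_prob(s, splits$n_bands[i], splits$n_rows_per_band[i])
})

# Uncomment to plot all curves on one chart:
# matplot(s, curves, type = "l", xlab = "similarity", ylab = "P(candidate)")
```

More bands with fewer rows shift the curve left (more candidates, more false positives); fewer bands with more rows shift it right (fewer candidates, more misses).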

seed = 1
pairs = get_similar_pairs(dtm, bands_number = 10, rows_per_band = 32, distance = 'cosine', seed = seed)

pairs[order(-N)]

#        id1  id2  N
#    1: 1054 1417 10
#    2: 1084 3462 10
#    3: 1291 1356 10
#    4: 1615 3846 10
#    5: 2805 4763  4
#   ---             
# 2304: 4767 4961  1
# 2305: 4772 4776  1
# 2306: 4810 4859  1
# 2307: 4854 4945  1
# 2308: 4905 4918  1
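
The output above has its largest N equal to bands_number, which suggests (this is an assumption, not something the README states) that N counts the bands in which a pair landed in the same bucket. A minimal base-R sketch of banding under that reading follows; the toy signature matrix and all variable names are made up for illustration and do not reflect LSHR's internals.

```r
# Illustrative banding over a small signature matrix; not LSHR's internals.
set.seed(1)
bands_number <- 4
rows_per_band <- 3
n_docs <- 6

# Signature matrix: one row per hash function, one column per document.
# Sampling without replacement keeps all values distinct, so the only
# collisions are the ones planted below.
signatures <- matrix(sample.int(1000, bands_number * rows_per_band * n_docs),
                     ncol = n_docs)
signatures[, 2] <- signatures[, 1]        # doc 2 copies doc 1: agrees in every band
signatures[1:3, 4] <- signatures[1:3, 3]  # doc 4 agrees with doc 3 in band 1 only

band_of_row <- rep(seq_len(bands_number), each = rows_per_band)

candidate_pairs <- do.call(rbind, lapply(seq_len(bands_number), function(b) {
  # Collapse each document's rows in this band into a single bucket key.
  band <- signatures[band_of_row == b, , drop = FALSE]
  keys <- apply(band, 2, paste, collapse = "-")
  # Keep only buckets holding at least two documents.
  buckets <- Filter(function(ids) length(ids) > 1,
                    unname(split(seq_len(n_docs), keys)))
  # Every pair of documents sharing a bucket is a candidate in this band.
  do.call(rbind, lapply(buckets, function(ids) {
    p <- t(combn(ids, 2))
    data.frame(id1 = p[, 1], id2 = p[, 2])
  }))
}))

# N = number of bands in which the pair shared a bucket.
pairs <- aggregate(N ~ id1 + id2, data = cbind(candidate_pairs, N = 1L), FUN = sum)
pairs <- pairs[order(-pairs$N), ]
```

Under this reading, pairs with N close to bands_number are the strongest candidates, while pairs with N = 1 are the most likely false positives and are worth checking against an exact similarity.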

dselivanov/LSHR documentation built on May 15, 2019, 2:59 p.m.
