get_similar_pairs: Calculating candidate pairs using locality sensitive hashing.

Description

For a given matrix, the function generates the indices of similar rows.

Usage

get_similar_pairs(X, bands_number, rows_per_band, distance = c("cosine",
  "jaccard"), seed = 1L, verbose = FALSE, mc.cores = 1, ...)

get_similar_pairs_cosine(X, bands_number, rows_per_band, seed = 1L,
  verbose = FALSE, mc.cores = 1, n_band_join = bands_number, ...)

Arguments

X

input matrix - sparse or dense

bands_number

number of bands for LSH algorithm - tradeoff between precision and number of false positive candidates. See get_s_curve for details.

rows_per_band

number of rows in each band for LSH algorithm - tradeoff between precision and number of false positive candidates. See get_s_curve for details. For "cosine" distance, due to performance reasons (bit arithmetic), only values less than 32 are supported.

distance

one of "cosine" or "jaccard" - how to measure distance between rows of the input matrix

seed

random seed for reproducibility

verbose

logical - print LSH process information (such as expected false positive rate, false negative rate, timings, etc.)

mc.cores

number of cores to use for band processing - random projection and candidate selection (this is an embarrassingly parallel task that can be done independently for each band). The most expensive operation is random projection. It is itself parallelized with OpenMP, so when mc.cores > 1 random projection becomes single-threaded. We usually recommend using mc.cores = 1 and relying on the internal OpenMP parallelism. Candidate selection, which is not trivially parallelizable, is usually not a bottleneck.

...

other parameters passed to mclapply (used if mc.cores > 1)

n_band_join

controls how often to calculate the number of bands in which signatures coincide. Since each bucket is independent, the obvious way is to calculate this statistic once at the end (the default). Alternatively, it can be calculated every n_band_join bands to save some memory (if this becomes an issue). In most cases we recommend using the default value for this parameter.
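
The tradeoff controlled by bands_number and rows_per_band can be illustrated with the standard LSH banding formula. The sketch below is base R only and does not call the package; the helper names candidate_probability and approx_threshold are hypothetical. It assumes s is the probability that a single hash value agrees for two rows (for minhash this equals the Jaccard similarity; for random hyperplanes it is 1 - theta / pi, where theta is the angle between the rows). Whether get_s_curve uses exactly this parameterization is an assumption.

```r
# Probability that two rows agree on every hash in at least one band,
# i.e. become a candidate pair: 1 - (1 - s^r)^b.
# (Hypothetical helper, not part of the package.)
candidate_probability <- function(s, bands_number, rows_per_band) {
  1 - (1 - s^rows_per_band)^bands_number
}

# Rough location of the steep part of the S-curve: (1 / b)^(1 / r).
# (Hypothetical helper, not part of the package.)
approx_threshold <- function(bands_number, rows_per_band) {
  (1 / bands_number)^(1 / rows_per_band)
}

s <- seq(0, 1, by = 0.1)

# More rows per band pushes the curve to the right (fewer false positives),
# more bands pushes it to the left (fewer false negatives).
curve_few_rows  <- candidate_probability(s, bands_number = 20, rows_per_band = 3)
curve_many_rows <- candidate_probability(s, bands_number = 20, rows_per_band = 8)

print(round(rbind(s, curve_few_rows, curve_many_rows), 3))
print(approx_threshold(bands_number = 20, rows_per_band = 5))
```

Comparing the two printed curves shows how raising rows_per_band at a fixed bands_number makes the filter stricter.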

Value

pairs of candidate similar rows - data.table with 3 columns: id1, id2, N - index of the first candidate, index of the second candidate, and the number of buckets in which they share the same value. The latter is provided for information only. (The intuition is as follows: the bigger N, the stronger the similarity.)
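
To make the meaning of N concrete, here is a toy base R sketch of banded random-projection LSH for cosine distance. It is an illustration under stated assumptions, not the package's implementation: it uses string bucket keys instead of bit arithmetic, compares all pairs by brute force, and returns a data.frame rather than a data.table. All variable names are hypothetical.

```r
set.seed(1)

# Toy data: rows 1 and 2 are near-duplicates, row 3 is unrelated.
x1 <- rnorm(50)
X <- unname(rbind(x1, x1 + rnorm(50, sd = 0.05), rnorm(50)))

bands_number  <- 8
rows_per_band <- 4

# One random hyperplane per signature bit; the bit is the sign of the projection.
hyperplanes <- matrix(rnorm(ncol(X) * bands_number * rows_per_band),
                      nrow = ncol(X))
bits <- (X %*% hyperplanes) > 0

# Collapse the bits of each band into one bucket key per row.
band_of_bit <- rep(seq_len(bands_number), each = rows_per_band)
bucket_keys <- sapply(seq_len(bands_number), function(b) {
  apply(bits[, band_of_bit == b, drop = FALSE], 1,
        function(r) paste(as.integer(r), collapse = ""))
})

# For every pair of rows, count the bands in which the bucket keys match.
pairs <- t(combn(nrow(X), 2))
N <- apply(pairs, 1, function(p) {
  sum(bucket_keys[p[1], ] == bucket_keys[p[2], ])
})

# Keep only pairs that collide in at least one band, strongest first.
candidates <- data.frame(id1 = pairs[, 1], id2 = pairs[, 2], N = N)
candidates <- candidates[candidates$N > 0, ]
candidates <- candidates[order(-candidates$N), ]
print(candidates)
```

Pairs that never share a bucket are dropped, which mirrors the idea that only candidates are returned. A real implementation would group rows by bucket instead of comparing every pair.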

Examples

## Not run: 
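library(text2vec)
library(LSHR)
library(Matrix)

# Build a hashed document-term matrix from the movie reviews
data("movie_review")
it <- itoken(movie_review$review, preprocess_function = tolower,
             tokenizer = word_tokenizer)
dtm <- create_dtm(it, hash_vectorizer())
# Convert to a row-oriented sparse matrix
dtm <- as(dtm, "RsparseMatrix")

# Find candidate pairs of similar reviews
pairs <- get_similar_pairs(dtm, bands_number = 4, rows_per_band = 32,
                           distance = "cosine", verbose = TRUE)
# Strongest candidates (sharing the most buckets) first
pairs[order(-N)]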

## End(Not run)

dselivanov/LSHR documentation built on May 15, 2019, 2:59 p.m.