get_similar_pairs: Calculating candidate pairs using locality sensitive hashing.
In dselivanov/LSHR: Locality Sensitive Hashing in R


For a given matrix, the function generates indices of similar rows.

get_similar_pairs(X, bands_number, rows_per_band, distance = c("cosine",
  "jaccard"), seed = 1L, verbose = FALSE, mc.cores = 1, ...)

get_similar_pairs_cosine(X, bands_number, rows_per_band, seed = 1L,
  verbose = FALSE, mc.cores = 1, n_band_join = bands_number, ...)

`X`	input matrix - sparse or dense
`bands_number`	number of bands for the LSH algorithm; controls the trade-off between precision and the number of false-positive candidates. See get_s_curve for details.
`rows_per_band`	number of rows in each band for the LSH algorithm; controls the trade-off between precision and the number of false-positive candidates. See get_s_curve for details. For "cosine" distance, only values less than 32 are supported, for performance reasons (bit arithmetic).
`distance`	one of "cosine" or "jaccard": how to measure the distance between rows of the input matrix
`seed`	random seed for reproducibility
`verbose`	`logical`; print LSH process information (such as expected false positive rate, false negative rate, timings, etc.)
`mc.cores`	number of cores to use for band processing, i.e. random projection and candidate selection. This is an embarrassingly parallel task, since it can be done independently for each band. The most expensive operation is random projection. It is itself parallelized with OpenMP, so when `mc.cores > 1` random projection becomes single-threaded. We usually recommend using `mc.cores = 1` and relying on the internal OpenMP parallelism. Candidate selection, which is not trivially parallelizable, is usually not a bottleneck.
`...`	other parameters passed to `mclapply` (used if `mc.cores > 1`)
`n_band_join`	controls how often the number of bands in which signatures coincide is calculated. Since each bucket is independent, the obvious way is to calculate this statistic once at the end (the default). Alternatively, it can be calculated every `n_band_join` bands, which saves some memory if that becomes an issue. In most cases we recommend using the default value for this parameter.
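
The trade-off controlled by `bands_number` and `rows_per_band` can be illustrated with the standard LSH banding formula. The sketch below is base R only and does not call get_s_curve, whose signature is not shown on this page; `candidate_probability` and `cosine_to_bit_agreement` are hypothetical helper names, and the cosine conversion assumes that signatures are built from sign random projections.

```r
# Hypothetical helper (not part of LSHR): probability that two rows become
# a candidate pair, given that a single hash value agrees with probability `s`.
# A pair is a candidate if all rows agree in at least one band.
candidate_probability <- function(s, bands_number, rows_per_band) {
  1 - (1 - s^rows_per_band)^bands_number
}

# Hypothetical helper: for sign random projections, one signature bit agrees
# with probability 1 - theta / pi, where theta is the angle between the rows.
cosine_to_bit_agreement <- function(cosine_similarity) {
  1 - acos(cosine_similarity) / pi
}

cosine_similarity <- c(0.5, 0.8, 0.9, 0.95, 0.99)
s <- cosine_to_bit_agreement(cosine_similarity)

# More bands raise the chance of catching a pair (fewer false negatives),
# more rows per band lower it (fewer false positives).
data.frame(
  cosine       = cosine_similarity,
  b4_r16       = candidate_probability(s, bands_number = 4, rows_per_band = 16),
  b8_r16       = candidate_probability(s, bands_number = 8, rows_per_band = 16),
  b4_r31       = candidate_probability(s, bands_number = 4, rows_per_band = 31)
)
```

Plotting `candidate_probability` against similarity gives the familiar S-shaped curve; the band and row settings move the point where the curve rises.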

Candidate pairs of similar rows, returned as a data.table with 3 columns: id1, id2, N. These are the index of the first candidate, the index of the second candidate, and the number of buckets in which they share the same value. The latter is provided for information only. (The intuition is as follows: the bigger N, the stronger the similarity.)
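
To make the meaning of N concrete, here is a minimal base R sketch of banded LSH with sign random projections. It is an illustration of the general idea under that assumption, not the package's implementation; `band_candidates` and the toy matrix are hypothetical.

```r
set.seed(1L)

# Toy data: row 2 is an exact multiple of row 1 (cosine similarity 1).
X <- rbind(
  c(1, 2, 0, 1),
  c(2, 4, 0, 2),
  c(-3, 0, 4, -1),
  c(0, 5, -2, 3)
)
bands_number  <- 4L
rows_per_band <- 3L

# Hypothetical helper: candidate pairs produced by one band.
# `band` is only the band index; every call draws fresh random hyperplanes.
band_candidates <- function(band, X, rows_per_band) {
  planes <- matrix(rnorm(ncol(X) * rows_per_band), ncol = rows_per_band)
  bits <- (X %*% planes) > 0                       # one sign bit per hyperplane
  weights <- 2^(seq_len(rows_per_band) - 1)
  bucket <- as.vector((bits * 1) %*% weights)      # pack bits into a bucket id
  groups <- split(seq_len(nrow(X)), bucket)
  groups <- groups[lengths(groups) > 1]            # buckets holding >= 2 rows
  if (length(groups) == 0) return(matrix(integer(0), ncol = 2))
  do.call(rbind, lapply(groups, function(idx) t(combn(idx, 2))))
}

per_band <- lapply(seq_len(bands_number), band_candidates,
                   X = X, rows_per_band = rows_per_band)
all_pairs <- do.call(rbind, per_band)

# N = number of bands in which the two rows landed in the same bucket.
counts <- aggregate(
  N ~ id1 + id2,
  data = data.frame(id1 = all_pairs[, 1], id2 = all_pairs[, 2], N = 1L),
  FUN = sum
)
counts[order(-counts$N), ]
```

Rows 1 and 2 point in the same direction, so they share a bucket in every band and get the maximum N; less similar rows collide in fewer bands, if at all.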

## Not run: 
library(text2vec)
library(LSHR)
library(Matrix)
data("movie_review")
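# Tokenize the reviews and build a hashed document-term matrix
it <- itoken(movie_review$review, preprocess_function = tolower,
             tokenizer = word_tokenizer)
dtm <- create_dtm(it, hash_vectorizer())
# Convert to a row-oriented sparse matrix
dtm <- as(dtm, "RsparseMatrix")
pairs <- get_similar_pairs(dtm, bands_number = 4, rows_per_band = 32,
                           distance = "cosine", verbose = TRUE)
# Show the pairs that share the most buckets first
pairs[order(-N)]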

## End(Not run)
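
Because the result contains candidates only, a common follow-up is to compute the exact similarity for each returned pair and drop false positives. The sketch below uses base R with a small dense matrix and a made-up stand-in for the returned table; `exact_cosine`, the toy data, and the 0.9 threshold are assumptions for illustration.

```r
# Hypothetical helper: exact cosine similarity between rows i and j of a
# dense numeric matrix (i and j are index vectors of equal length).
exact_cosine <- function(X, i, j) {
  A <- X[i, , drop = FALSE]
  B <- X[j, , drop = FALSE]
  rowSums(A * B) / (sqrt(rowSums(A^2)) * sqrt(rowSums(B^2)))
}

# Toy matrix: rows 1 and 2 are nearly parallel, rows 1 and 3 are orthogonal.
X <- rbind(
  c(1, 2, 0),
  c(2, 4, 0.1),
  c(0, 0, 5)
)

# Made-up stand-in for the table returned by get_similar_pairs().
candidates <- data.frame(id1 = c(1L, 1L), id2 = c(2L, 3L), N = c(4L, 1L))

# Keep only the pairs whose exact similarity passes the threshold.
candidates$cosine <- exact_cosine(X, candidates$id1, candidates$id2)
verified <- candidates[candidates$cosine >= 0.9, ]
verified
```

Only the candidate pairs are checked here, so the exact computation stays cheap compared with comparing all rows against each other.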