For a given matrix, the function generates indices of similar rows.
get_similar_pairs(X, bands_number, rows_per_band, distance = c("cosine",
  "jaccard"), seed = 1L, verbose = FALSE, mc.cores = 1, ...)

get_similar_pairs_cosine(X, bands_number, rows_per_band, seed = 1L,
  verbose = FALSE, mc.cores = 1, n_band_join = bands_number, ...)
X |
input matrix - sparse or dense |
bands_number |
number of bands for LSH algorithm - tradeoff between precision and number of false positive candidates. See get_s_curve for details. |
rows_per_band |
number of rows in each band for LSH algorithm - tradeoff between precision and number of false positive candidates. See get_s_curve for details. For "cosine" distance, only values less than 32 are supported for performance reasons (bit arithmetic). |
distance |
one of "cosine" or "jaccard" - how to measure distance between rows of the input matrix |
seed |
random seed for reproducibility |
verbose |
|
mc.cores |
number of cores to use for band processing - random projection and candidate selection
(this is an embarrassingly parallel task - it can be done independently for each band).
The most expensive operation is random projection. It is itself parallelized with OpenMP, so when |
... |
other parameters to |
n_band_join |
calculate in how many bands the signatures became the same. Since each bucket is independent, the obvious way is
to calculate this statistic at the end (by default), so we do it only once. On the other hand, we can calculate it
each |
Candidate pairs of similar rows, returned as a data.table with 3 columns: id1, id2, N -
index of the first candidate, index of the second candidate,
and the number of buckets in which they share the same value. The latter is provided for information only.
(The intuition is as follows: the larger N, the stronger the similarity.)
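To make the meaning of N more concrete, here is a minimal base R sketch that counts, for every pair of rows, the number of bands in which both rows fall into the same bucket. The `buckets` matrix and its values are made up for illustration and are not produced by the package; the package itself returns a data.table, while this sketch uses a plain data.frame to stay dependency-free.

```r
# Hypothetical bucket ids: one row per input row, one column per band.
# The matrix is filled column-wise, so each group of three values is one band.
buckets <- matrix(c(1, 1, 2,   # band 1
                    5, 5, 5,   # band 2
                    7, 8, 7,   # band 3
                    3, 3, 4),  # band 4
                  nrow = 3)

# Enumerate all unordered pairs of row indices.
row_pairs <- t(combn(nrow(buckets), 2))

# N = number of bands in which the two rows share the same bucket id.
N <- apply(row_pairs, 1, function(p) sum(buckets[p[1], ] == buckets[p[2], ]))

result <- data.frame(id1 = row_pairs[, 1], id2 = row_pairs[, 2], N = N)

# Keep only pairs that collide in at least one band, strongest first.
result <- result[result$N > 0, ]
result <- result[order(-result$N), ]
print(result)
```

Enumerating all pairs is quadratic and only suitable for a toy example; the point of LSH is to avoid exactly this by comparing only rows that land in the same bucket.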
## Not run:
library(text2vec)
library(LSHR)
library(Matrix)
data("movie_review")
# Tokenize the lower-cased reviews
it <- itoken(movie_review$review, preprocess_function = tolower,
             tokenizer = word_tokenizer)
# Build a document-term matrix with the hashing vectorizer
dtm <- create_dtm(it, hash_vectorizer())
# Convert to a row-oriented sparse matrix
dtm = as(dtm, "RsparseMatrix")
pairs = get_similar_pairs(dtm, bands_number = 4, rows_per_band = 32,
                          distance = 'cosine', verbose = TRUE)
# Show the candidates that share the most buckets first
pairs[order(-N)]
## End(Not run)
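The tradeoff controlled by bands_number and rows_per_band can be explored with the standard LSH banding formula: a pair whose single signature rows agree with probability p becomes a candidate with probability 1 - (1 - p^r)^b. The sketch below is base R only and does not call get_s_curve; the helper names `candidate_probability` and `cosine_bit_agreement` are hypothetical, and the cosine mapping assumes sign random projections, where one bit agrees with probability 1 - theta / pi.

```r
# Probability that a pair becomes a candidate in at least one band,
# given per-row agreement probability p, b bands and r rows per band.
candidate_probability <- function(p, bands_number, rows_per_band) {
  1 - (1 - p^rows_per_band)^bands_number
}

# Assumption: signatures come from sign random projections, so a single
# bit agrees with probability 1 - theta / pi, where theta = acos(cosine).
cosine_bit_agreement <- function(cos_sim) {
  1 - acos(cos_sim) / pi
}

sims <- c(0.5, 0.8, 0.9, 0.95, 0.99)
probs <- candidate_probability(cosine_bit_agreement(sims),
                               bands_number = 4, rows_per_band = 16)

# One row per similarity level: how likely such a pair is to be retrieved.
print(data.frame(cosine = sims, p_candidate = round(probs, 4)))
```

Increasing rows_per_band makes each band stricter (fewer false positives), while increasing bands_number gives a pair more chances to collide (fewer false negatives).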