lsh_subset: List of all candidates in a corpus


Description

Lists the IDs of all documents in a corpus that appear in at least one candidate pair.

Usage

lsh_subset(candidates)

Arguments

candidates

A data frame of candidate pairs from lsh_candidates.

Value

A character vector of document IDs from the candidate pairs, to be used to subset the TextReuseCorpus.
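
The return value can be illustrated without running LSH. The following is a minimal sketch in base R, not the package's implementation. It assumes the candidates data frame stores the two document IDs of each pair in columns named a and b; the toy data and the helper name subset_ids are hypothetical and not part of textreuse.

```r
# Hypothetical stand-in for the output of lsh_candidates(): one row per pair.
# The column names a and b are an assumption made for this sketch.
candidates <- data.frame(
  a = c("doc1", "doc1", "doc3"),
  b = c("doc2", "doc3", "doc4"),
  score = NA_real_,
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of textreuse): collect every ID that occurs
# in either column and drop duplicates.
subset_ids <- function(pairs) {
  unique(c(pairs$a, pairs$b))
}

ids <- subset_ids(candidates)
print(ids)
```

Each document ID appears once in the result, however many pairs it belongs to, so the vector can be used directly as an index into the corpus.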

Examples

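library(textreuse)

# Build a corpus from the sample legal texts shipped with the package,
# computing 200 minhashes per document over 5-word n-grams.
dir <- system.file("extdata/legal", package = "textreuse")
minhash <- minhash_generator(200, seed = 234)
corpus <- TextReuseCorpus(dir = dir,
                          tokenizer = tokenize_ngrams, n = 5,
                          minhash_func = minhash)

# Hash the documents into LSH buckets and extract the candidate pairs.
buckets <- lsh(corpus, bands = 50)
candidates <- lsh_candidates(buckets)

# IDs of the documents in the candidate pairs, then the corpus subset to them.
lsh_subset(candidates)
corpus[lsh_subset(candidates)]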

