textreuse: Detect Text Reuse and Document Similarity

Tools for measuring similarity among documents and detecting passages which have been reused. Implements shingled n-gram, skip n-gram, and other tokenizers; similarity/dissimilarity functions; pairwise comparisons; minhash and locality sensitive hashing algorithms; and a version of the Smith-Waterman local alignment algorithm suitable for natural language.
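The two simplest ideas named in the description, shingled n-grams and Jaccard similarity, can be sketched in a few lines. This is a language-neutral illustration in Python, not the package's R implementation: the helper names `word_ngrams` and `jaccard` are hypothetical and the sample sentences are made up for the example.

```python
import re

def word_ngrams(text, n=3):
    """Lowercase the text, split it into words, and return the
    overlapping (shingled) word n-grams as strings."""
    words = re.findall(r"\w+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def jaccard(a, b):
    """Jaccard similarity of two token collections:
    size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # assumption: two empty documents count as dissimilar
    return len(a & b) / len(a | b)

# Two made-up sentences that differ in a single word.
a = word_ngrams("the quick brown fox jumps over the lazy dog", n=3)
b = word_ngrams("the quick brown fox leaps over the lazy dog", n=3)
score = jaccard(a, b)
print(score)
```

A one-word change affects every n-gram that overlaps it, so larger values of `n` make the score more sensitive to small edits.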

Author: Lincoln Mullen [aut, cre]
Date of publication: 2016-11-28 16:54:10
Maintainer: Lincoln Mullen <lincoln@lincolnmullen.com>
License: MIT + file LICENSE
Version: 0.1.4
https://github.com/ropensci/textreuse


Functions

align_local
as.matrix.textreuse_candidates
content
content<-
filenames
has_content
has_hashes
hashes
hashes<-
hash_string
has_minhashes
has_tokens
is.TextReuseCorpus
is.TextReuseTextDocument
jaccard_bag_similarity
jaccard_dissimilarity
jaccard_similarity
lsh
lsh_candidates
lsh_compare
lsh_probability
lsh_query
lsh_subset
lsh_threshold
meta
meta<-
minhashes
minhashes<-
minhash_generator
pairwise_candidates
pairwise_compare
ratio_of_matches
reexports
rehash
similarity-functions
skipped
textreuse
TextReuseCorpus
textreuse-package
TextReuseTextDocument
TextReuseTextDocument-accessors
tokenize
tokenize_ngrams
tokenizers
tokenize_sentences
tokenize_skip_ngrams
tokenize_words
tokens
tokens<-
wordcount

Files

textreuse
textreuse/inst
textreuse/inst/extdata
textreuse/inst/extdata/legal
textreuse/inst/extdata/legal/ny1850-match.txt
textreuse/inst/extdata/legal/ca1851-nomatch.txt
textreuse/inst/extdata/legal/ca1851-match.txt
textreuse/inst/extdata/ats
textreuse/inst/extdata/ats/lifeofrevrichard00baxt.txt
textreuse/inst/extdata/ats/gospeltruth00whit.txt
textreuse/inst/extdata/ats/remember00palm.txt
textreuse/inst/extdata/ats/practicalthought00nev.txt
textreuse/inst/extdata/ats/memoirjamesbrai00ricegoog.txt
textreuse/inst/extdata/ats/remembermeorholy00palm.txt
textreuse/inst/extdata/ats/thoughtsonpopery00nevi.txt
textreuse/inst/extdata/ats/calltounconv00baxt.txt
textreuse/inst/doc
textreuse/inst/doc/textreuse-introduction.html
textreuse/inst/doc/textreuse-introduction.R
textreuse/inst/doc/textreuse-pairwise.R
textreuse/inst/doc/textreuse-minhash.R
textreuse/inst/doc/textreuse-introduction.Rmd
textreuse/inst/doc/textreuse-alignment.Rmd
textreuse/inst/doc/textreuse-minhash.html
textreuse/inst/doc/textreuse-minhash.Rmd
textreuse/inst/doc/textreuse-pairwise.html
textreuse/inst/doc/textreuse-alignment.html
textreuse/inst/doc/textreuse-pairwise.Rmd
textreuse/inst/doc/textreuse-alignment.R
textreuse/tests
textreuse/tests/testthat.R
textreuse/tests/testthat
textreuse/tests/testthat/test-tokenizers.R
textreuse/tests/testthat/test-utils.R
textreuse/tests/testthat/test-jaccard.R
textreuse/tests/testthat/test-pairwise_cf.R
textreuse/tests/testthat/test-TextReuseTextDocument.R
textreuse/tests/testthat/newman.txt
textreuse/tests/testthat/test-candidate_pairs.R
textreuse/tests/testthat/test-minhash.R
textreuse/tests/testthat/test-alignment.R
textreuse/tests/testthat/test-lsh.R
textreuse/tests/testthat/test-wordcount.R
textreuse/tests/testthat/test-ratio_of_matches.R
textreuse/tests/testthat/test-hashing.R
textreuse/tests/testthat/test-TextReuseCorpus.R
textreuse/tests/testthat/test-filenames.R
textreuse/src
textreuse/src/sw_matrix.cpp
textreuse/src/skip_ngrams.cpp
textreuse/src/hash_string.cpp
textreuse/src/shingle_ngrams.cpp
textreuse/src/RcppExports.cpp
textreuse/NAMESPACE
textreuse/NEWS.md
textreuse/R
textreuse/R/utils.R
textreuse/R/align_local.R
textreuse/R/pairwise_compare.R
textreuse/R/lsh_probability.R
textreuse/R/TextReuseCorpus.R
textreuse/R/parallel.R
textreuse/R/textreuse-package.r
textreuse/R/tokenize.R
textreuse/R/TextReuseTextDocument.R
textreuse/R/lsh_candidates.R
textreuse/R/lsh_query.R
textreuse/R/filenames.R
textreuse/R/minhash.R
textreuse/R/conversion-functions.R
textreuse/R/lsh.R
textreuse/R/rehash.R
textreuse/R/RcppExports.R
textreuse/R/wordcount.R
textreuse/R/lsh_compare.R
textreuse/R/similarity.R
textreuse/R/pairwise_candidates.R
textreuse/R/tokenizers.R
textreuse/R/lsh_subset.R
textreuse/vignettes
textreuse/vignettes/textreuse-introduction.Rmd
textreuse/vignettes/textreuse-alignment.Rmd
textreuse/vignettes/textreuse-minhash.Rmd
textreuse/vignettes/textreuse-pairwise.Rmd
textreuse/README.md
textreuse/MD5
textreuse/build
textreuse/build/vignette.rds
textreuse/DESCRIPTION
textreuse/man
textreuse/man/lsh_candidates.Rd
textreuse/man/align_local.Rd
textreuse/man/wordcount.Rd
textreuse/man/lsh_query.Rd
textreuse/man/tokenizers.Rd
textreuse/man/lsh.Rd
textreuse/man/rehash.Rd
textreuse/man/textreuse-package.Rd
textreuse/man/pairwise_candidates.Rd
textreuse/man/pairwise_compare.Rd
textreuse/man/lsh_compare.Rd
textreuse/man/hash_string.Rd
textreuse/man/reexports.Rd
textreuse/man/filenames.Rd
textreuse/man/as.matrix.textreuse_candidates.Rd
textreuse/man/similarity-functions.Rd
textreuse/man/lsh_probability.Rd
textreuse/man/TextReuseTextDocument.Rd
textreuse/man/tokenize.Rd
textreuse/man/TextReuseTextDocument-accessors.Rd
textreuse/man/minhash_generator.Rd
textreuse/man/lsh_subset.Rd
textreuse/man/TextReuseCorpus.Rd
textreuse/LICENSE

