textreuse: Detect Text Reuse and Document Similarity

Tools for measuring similarity among documents and detecting passages which have been reused. Implements shingled n-gram, skip n-gram, and other tokenizers; similarity/dissimilarity functions; pairwise comparisons; minhash and locality sensitive hashing algorithms; and a version of the Smith-Waterman local alignment algorithm suitable for natural language.
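
As a rough illustration of the first two ideas in the description above (shingled n-gram tokenizing and a similarity function), here is a minimal Python sketch. It is not the package's code or API: the function names, the word-splitting rule, and the sample sentences are all assumptions made for the example, and the package itself is written in R and C++.

```python
import re

def shingle_ngrams(text, n=3):
    """Return the overlapping word n-grams (shingles) of `text`, lowercased.

    Hypothetical helper for illustration; words are runs of \\w characters.
    """
    words = re.findall(r"\w+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union of two token sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # assumption: two empty documents are treated as dissimilar
    return len(a & b) / len(a | b)

# Made-up sample sentences, not taken from the package's bundled data.
original = "the quick brown fox jumps over the lazy dog"
reused = "a quick brown fox jumps over a sleeping dog"
unrelated = "local alignment finds the best matching passage"

score_reused = jaccard_similarity(shingle_ngrams(original), shingle_ngrams(reused))
score_unrelated = jaccard_similarity(shingle_ngrams(original), shingle_ngrams(unrelated))
print(score_reused, score_unrelated)
```

The pair that shares a run of words scores higher than the unrelated pair because shingles overlap only where word order is preserved. Comparing every pair this way grows quadratically with corpus size, which is the usual motivation for minhash and locality sensitive hashing.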

Author: Lincoln Mullen [aut, cre]
Date of publication: 2016-11-28 16:54:10
Maintainer: Lincoln Mullen <lincoln@lincolnmullen.com>
License: MIT + file LICENSE
Version: 0.1.4
https://github.com/ropensci/textreuse

Files in this package

textreuse
textreuse/inst
textreuse/inst/extdata
textreuse/inst/extdata/legal
textreuse/inst/extdata/legal/ny1850-match.txt
textreuse/inst/extdata/legal/ca1851-nomatch.txt
textreuse/inst/extdata/legal/ca1851-match.txt
textreuse/inst/extdata/ats
textreuse/inst/extdata/ats/lifeofrevrichard00baxt.txt
textreuse/inst/extdata/ats/gospeltruth00whit.txt
textreuse/inst/extdata/ats/remember00palm.txt
textreuse/inst/extdata/ats/practicalthought00nev.txt
textreuse/inst/extdata/ats/memoirjamesbrai00ricegoog.txt
textreuse/inst/extdata/ats/remembermeorholy00palm.txt
textreuse/inst/extdata/ats/thoughtsonpopery00nevi.txt
textreuse/inst/extdata/ats/calltounconv00baxt.txt
textreuse/inst/doc
textreuse/inst/doc/textreuse-introduction.html
textreuse/inst/doc/textreuse-introduction.R
textreuse/inst/doc/textreuse-pairwise.R
textreuse/inst/doc/textreuse-minhash.R
textreuse/inst/doc/textreuse-introduction.Rmd
textreuse/inst/doc/textreuse-alignment.Rmd
textreuse/inst/doc/textreuse-minhash.html
textreuse/inst/doc/textreuse-minhash.Rmd
textreuse/inst/doc/textreuse-pairwise.html
textreuse/inst/doc/textreuse-alignment.html
textreuse/inst/doc/textreuse-pairwise.Rmd
textreuse/inst/doc/textreuse-alignment.R
textreuse/tests
textreuse/tests/testthat.R
textreuse/tests/testthat
textreuse/tests/testthat/test-tokenizers.R
textreuse/tests/testthat/test-utils.R
textreuse/tests/testthat/test-jaccard.R
textreuse/tests/testthat/test-pairwise_cf.R
textreuse/tests/testthat/test-TextReuseTextDocument.R
textreuse/tests/testthat/newman.txt
textreuse/tests/testthat/test-candidate_pairs.R
textreuse/tests/testthat/test-minhash.R
textreuse/tests/testthat/test-alignment.R
textreuse/tests/testthat/test-lsh.R
textreuse/tests/testthat/test-wordcount.R
textreuse/tests/testthat/test-ratio_of_matches.R
textreuse/tests/testthat/test-hashing.R
textreuse/tests/testthat/test-TextReuseCorpus.R
textreuse/tests/testthat/test-filenames.R
textreuse/src
textreuse/src/sw_matrix.cpp
textreuse/src/skip_ngrams.cpp
textreuse/src/hash_string.cpp
textreuse/src/shingle_ngrams.cpp
textreuse/src/RcppExports.cpp
textreuse/NAMESPACE
textreuse/NEWS.md
textreuse/R
textreuse/R/utils.R
textreuse/R/align_local.R
textreuse/R/pairwise_compare.R
textreuse/R/lsh_probability.R
textreuse/R/TextReuseCorpus.R
textreuse/R/parallel.R
textreuse/R/textreuse-package.r
textreuse/R/tokenize.R
textreuse/R/TextReuseTextDocument.R
textreuse/R/lsh_candidates.R
textreuse/R/lsh_query.R
textreuse/R/filenames.R
textreuse/R/minhash.R
textreuse/R/conversion-functions.R
textreuse/R/lsh.R
textreuse/R/rehash.R
textreuse/R/RcppExports.R
textreuse/R/wordcount.R
textreuse/R/lsh_compare.R
textreuse/R/similarity.R
textreuse/R/pairwise_candidates.R
textreuse/R/tokenizers.R
textreuse/R/lsh_subset.R
textreuse/vignettes
textreuse/vignettes/textreuse-introduction.Rmd
textreuse/vignettes/textreuse-alignment.Rmd
textreuse/vignettes/textreuse-minhash.Rmd
textreuse/vignettes/textreuse-pairwise.Rmd
textreuse/README.md
textreuse/MD5
textreuse/build
textreuse/build/vignette.rds
textreuse/DESCRIPTION
textreuse/man
textreuse/man/lsh_candidates.Rd
textreuse/man/align_local.Rd
textreuse/man/wordcount.Rd
textreuse/man/lsh_query.Rd
textreuse/man/tokenizers.Rd
textreuse/man/lsh.Rd
textreuse/man/rehash.Rd
textreuse/man/textreuse-package.Rd
textreuse/man/pairwise_candidates.Rd
textreuse/man/pairwise_compare.Rd
textreuse/man/lsh_compare.Rd
textreuse/man/hash_string.Rd
textreuse/man/reexports.Rd
textreuse/man/filenames.Rd
textreuse/man/as.matrix.textreuse_candidates.Rd
textreuse/man/similarity-functions.Rd
textreuse/man/lsh_probability.Rd
textreuse/man/TextReuseTextDocument.Rd
textreuse/man/tokenize.Rd
textreuse/man/TextReuseTextDocument-accessors.Rd
textreuse/man/minhash_generator.Rd
textreuse/man/lsh_subset.Rd
textreuse/man/TextReuseCorpus.Rd
textreuse/LICENSE
