textreuse: Detect Text Reuse and Document Similarity

Tools for measuring similarity among documents and detecting passages that have been reused. Implements shingled n-gram, skip n-gram, and other tokenizers; similarity/dissimilarity functions; pairwise comparisons; minhash and locality-sensitive hashing algorithms; and a version of the Smith-Waterman local alignment algorithm suitable for natural language.
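
To make the idea of shingled n-grams and a similarity function concrete, here is a minimal base-R sketch. It is not the package's implementation: the helpers word_shingles() and jaccard_sketch() are hypothetical names introduced only for illustration, and the tokenization rule (lowercase, split on non-alphanumeric characters) is an assumption.

```r
# Hypothetical helper (not part of textreuse): lowercase the text, split it
# into words, and return overlapping word n-grams ("shingles").
word_shingles <- function(text, n = 3) {
  words <- strsplit(tolower(text), "[^[:alnum:]]+")[[1]]
  words <- words[nzchar(words)]          # drop empty strings from the split
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Hypothetical helper: Jaccard similarity of two token sets,
# |intersection| / |union|.
jaccard_sketch <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  length(intersect(a, b)) / length(union(a, b))
}

a <- word_shingles("The quick brown fox jumps over the lazy dog")
b <- word_shingles("The quick brown fox leaps over the lazy dog")

# Each sentence yields 7 trigrams; 4 are shared, so the score is 4 / 10.
jaccard_sketch(a, b)  # 0.4
```

Changing a single word alters every n-gram that contains it, which is why one substitution in a nine-word sentence lowers the score to 0.4.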

Author: Lincoln Mullen [aut, cre]
Date of publication: 2016-11-28 16:54:10
Maintainer: Lincoln Mullen <lincoln@lincolnmullen.com>
License: MIT + file LICENSE
Version: 0.1.4

Man pages

align_local: Local alignment of natural language texts
as.matrix.textreuse_candidates: Convert candidates data frames to other formats
filenames: Filenames from paths
hash_string: Hash a string to an integer
lsh: Locality sensitive hashing for minhash
lsh_candidates: Candidate pairs from LSH comparisons
lsh_compare: Compare candidates identified by LSH
lsh_probability: Probability that a candidate pair will be detected with LSH
lsh_query: Query an LSH cache for matches to a single document
lsh_subset: List of all candidates in a corpus
minhash_generator: Generate a minhash function
pairwise_candidates: Candidate pairs from pairwise comparisons
pairwise_compare: Pairwise comparisons among documents in a corpus
reexports: Objects exported from other packages
rehash: Recompute the hashes for a document or corpus
similarity-functions: Measure similarity/dissimilarity in documents
TextReuseCorpus: TextReuseCorpus
textreuse-package: Detect Text Reuse and Document Similarity
TextReuseTextDocument: TextReuseTextDocument
TextReuseTextDocument-accessors: Accessors for TextReuse objects
tokenize: Recompute the tokens for a document or corpus
tokenizers: Split texts into tokens
wordcount: Count words
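
The man pages listed above fit together into a minhash/LSH workflow, which might be sketched as follows. This assumes the textreuse package is installed and that the function signatures match those documented for version 0.1.4. The parameter values (240 hashes, 80 bands, 5-word n-grams, the seed, and the 0.25 similarity) are illustrative choices, not recommendations from this page.

```r
library(textreuse)

# Build a minhash function. The number of hashes is chosen here so that it
# divides evenly into the number of LSH bands used below (240 / 80 = 3).
minhash <- minhash_generator(n = 240, seed = 3552)

# Load the sample documents bundled with the package under inst/extdata/ats.
dir <- system.file("extdata/ats", package = "textreuse")
corpus <- TextReuseCorpus(dir = dir,
                          tokenizer = tokenize_ngrams, n = 5,
                          minhash_func = minhash,
                          keep_tokens = TRUE)

# Estimated probability that a pair with Jaccard similarity 0.25 is
# detected with these settings.
lsh_probability(h = 240, b = 80, s = 0.25)

# Bucket the minhash signatures, extract candidate pairs, then score only
# those candidates with a full similarity function.
buckets    <- lsh(corpus, bands = 80)
candidates <- lsh_candidates(buckets)
scores     <- lsh_compare(candidates, corpus, jaccard_similarity)
scores
```

For a small corpus, pairwise_compare() followed by pairwise_candidates() compares every pair of documents directly. The LSH route is intended for corpora where comparing all pairs would be too slow.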

Files in this package

textreuse
textreuse/inst
textreuse/inst/extdata
textreuse/inst/extdata/legal
textreuse/inst/extdata/legal/ny1850-match.txt
textreuse/inst/extdata/legal/ca1851-nomatch.txt
textreuse/inst/extdata/legal/ca1851-match.txt
textreuse/inst/extdata/ats
textreuse/inst/extdata/ats/lifeofrevrichard00baxt.txt
textreuse/inst/extdata/ats/gospeltruth00whit.txt
textreuse/inst/extdata/ats/remember00palm.txt
textreuse/inst/extdata/ats/practicalthought00nev.txt
textreuse/inst/extdata/ats/memoirjamesbrai00ricegoog.txt
textreuse/inst/extdata/ats/remembermeorholy00palm.txt
textreuse/inst/extdata/ats/thoughtsonpopery00nevi.txt
textreuse/inst/extdata/ats/calltounconv00baxt.txt
textreuse/inst/doc
textreuse/inst/doc/textreuse-introduction.html
textreuse/inst/doc/textreuse-introduction.R
textreuse/inst/doc/textreuse-pairwise.R
textreuse/inst/doc/textreuse-minhash.R
textreuse/inst/doc/textreuse-introduction.Rmd
textreuse/inst/doc/textreuse-alignment.Rmd
textreuse/inst/doc/textreuse-minhash.html
textreuse/inst/doc/textreuse-minhash.Rmd
textreuse/inst/doc/textreuse-pairwise.html
textreuse/inst/doc/textreuse-alignment.html
textreuse/inst/doc/textreuse-pairwise.Rmd
textreuse/inst/doc/textreuse-alignment.R
textreuse/tests
textreuse/tests/testthat.R
textreuse/tests/testthat
textreuse/tests/testthat/test-tokenizers.R
textreuse/tests/testthat/test-utils.R
textreuse/tests/testthat/test-jaccard.R
textreuse/tests/testthat/test-pairwise_cf.R
textreuse/tests/testthat/test-TextReuseTextDocument.R
textreuse/tests/testthat/newman.txt
textreuse/tests/testthat/test-candidate_pairs.R
textreuse/tests/testthat/test-minhash.R
textreuse/tests/testthat/test-alignment.R
textreuse/tests/testthat/test-lsh.R
textreuse/tests/testthat/test-wordcount.R
textreuse/tests/testthat/test-ratio_of_matches.R
textreuse/tests/testthat/test-hashing.R
textreuse/tests/testthat/test-TextReuseCorpus.R
textreuse/tests/testthat/test-filenames.R
textreuse/src
textreuse/src/sw_matrix.cpp
textreuse/src/skip_ngrams.cpp
textreuse/src/hash_string.cpp
textreuse/src/shingle_ngrams.cpp
textreuse/src/RcppExports.cpp
textreuse/NAMESPACE
textreuse/NEWS.md
textreuse/R
textreuse/R/utils.R
textreuse/R/align_local.R
textreuse/R/pairwise_compare.R
textreuse/R/lsh_probability.R
textreuse/R/TextReuseCorpus.R
textreuse/R/parallel.R
textreuse/R/textreuse-package.r
textreuse/R/tokenize.R
textreuse/R/TextReuseTextDocument.R
textreuse/R/lsh_candidates.R
textreuse/R/lsh_query.R
textreuse/R/filenames.R
textreuse/R/minhash.R
textreuse/R/conversion-functions.R
textreuse/R/lsh.R
textreuse/R/rehash.R
textreuse/R/RcppExports.R
textreuse/R/wordcount.R
textreuse/R/lsh_compare.R
textreuse/R/similarity.R
textreuse/R/pairwise_candidates.R
textreuse/R/tokenizers.R
textreuse/R/lsh_subset.R
textreuse/vignettes
textreuse/vignettes/textreuse-introduction.Rmd
textreuse/vignettes/textreuse-alignment.Rmd
textreuse/vignettes/textreuse-minhash.Rmd
textreuse/vignettes/textreuse-pairwise.Rmd
textreuse/README.md
textreuse/MD5
textreuse/build
textreuse/build/vignette.rds
textreuse/DESCRIPTION
textreuse/man
textreuse/man/lsh_candidates.Rd
textreuse/man/align_local.Rd
textreuse/man/wordcount.Rd
textreuse/man/lsh_query.Rd
textreuse/man/tokenizers.Rd
textreuse/man/lsh.Rd
textreuse/man/rehash.Rd
textreuse/man/textreuse-package.Rd
textreuse/man/pairwise_candidates.Rd
textreuse/man/pairwise_compare.Rd
textreuse/man/lsh_compare.Rd
textreuse/man/hash_string.Rd
textreuse/man/reexports.Rd
textreuse/man/filenames.Rd
textreuse/man/as.matrix.textreuse_candidates.Rd
textreuse/man/similarity-functions.Rd
textreuse/man/lsh_probability.Rd
textreuse/man/TextReuseTextDocument.Rd
textreuse/man/tokenize.Rd
textreuse/man/TextReuseTextDocument-accessors.Rd
textreuse/man/minhash_generator.Rd
textreuse/man/lsh_subset.Rd
textreuse/man/TextReuseCorpus.Rd
textreuse/LICENSE