textreuse: Detect Text Reuse and Document Similarity
Version 0.1.4

Tools for measuring similarity among documents and detecting passages that have been reused. Implements shingled n-gram, skip n-gram, and other tokenizers; similarity and dissimilarity functions; pairwise comparisons; minhash and locality-sensitive hashing algorithms; and a version of the Smith-Waterman local alignment algorithm suitable for natural language.

Author: Lincoln Mullen [aut, cre]
Date of publication: 2016-11-28 16:54:10
Maintainer: Lincoln Mullen <lincoln@lincolnmullen.com>
License: MIT + file LICENSE
Version: 0.1.4
URL: https://github.com/ropensci/textreuse
Package repository: CRAN
Installation: Install the latest version of this package by entering the following in R:
install.packages("textreuse")
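
Once the package is installed, the shingled n-gram tokenizer and the similarity functions named in the description can be tried on plain strings. The following is a minimal sketch rather than an excerpt from the package documentation: the two sentences are made-up sample data, and `n = 3` is an arbitrary choice.

```r
library(textreuse)

# Made-up sample sentences (illustrative data, not shipped with the package)
a <- "How much wood would a woodchuck chuck if a woodchuck could chuck wood"
b <- "How much wood could a woodchuck chuck if a woodchuck would chuck wood"

# Shingled n-grams: every overlapping run of n consecutive words
tokens_a <- tokenize_ngrams(a, n = 3)
tokens_b <- tokenize_ngrams(b, n = 3)

# Jaccard similarity: shared tokens divided by all distinct tokens
sim <- jaccard_similarity(tokens_a, tokens_b)

# Jaccard dissimilarity is the complement of the similarity
dis <- jaccard_dissimilarity(tokens_a, tokens_b)

print(head(tokens_a))
print(sim)
```

Smaller values of `n` make the comparison more tolerant of rewording, while larger values demand longer exact matches, so the right setting likely depends on the corpus.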

Getting started

Package overview
README.md
Introduction to the textreuse package
Minhash and locality-sensitive hashing
Pairwise comparisons for document similarity
Text Alignment

Man pages

align_local: Local alignment of natural language texts
as.matrix.textreuse_candidates: Convert candidates data frames to other formats
filenames: Filenames from paths
hash_string: Hash a string to an integer
lsh: Locality sensitive hashing for minhash
lsh_candidates: Candidate pairs from LSH comparisons
lsh_compare: Compare candidates identified by LSH
lsh_probability: Probability that a candidate pair will be detected with LSH
lsh_query: Query a LSH cache for matches to a single document
lsh_subset: List of all candidates in a corpus
minhash_generator: Generate a minhash function
pairwise_candidates: Candidate pairs from pairwise comparisons
pairwise_compare: Pairwise comparisons among documents in a corpus
reexports: Objects exported from other packages
rehash: Recompute the hashes for a document or corpus
similarity-functions: Measure similarity/dissimilarity in documents
TextReuseCorpus: TextReuseCorpus
textreuse-package: Detect Text Reuse and Document Similarity
TextReuseTextDocument: TextReuseTextDocument
TextReuseTextDocument-accessors: Accessors for TextReuse objects
tokenize: Recompute the tokens for a document or corpus
tokenizers: Split texts into tokens
wordcount: Count words
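
Several of the entries above (`minhash_generator`, `lsh`, `lsh_probability`) concern tuning minhash and locality-sensitive hashing. As a rough sketch of how the parameters relate, the helpers can be called directly. The values of 240 minhashes and 80 bands are assumptions chosen only for illustration, the sample sentence is made up, and `lsh_threshold()` is the companion function documented alongside `lsh_probability()`.

```r
library(textreuse)

h <- 240  # number of minhashes per document (assumed value)
b <- 80   # number of LSH bands; must divide h evenly (assumed value)

# Approximate Jaccard similarity above which pairs are likely to become candidates
threshold <- lsh_threshold(h = h, b = b)

# Probability that a pair with true Jaccard similarity s is detected as a candidate
p_low  <- lsh_probability(h = h, b = b, s = 0.10)
p_high <- lsh_probability(h = h, b = b, s = 0.50)

# A minhash function that reduces a vector of tokens to h hash values
minhash <- minhash_generator(n = h, seed = 1)
tokens <- tokenize_ngrams("the quick brown fox jumps over the lazy dog", n = 3)
signature <- minhash(tokens)

print(threshold)
print(c(p_low, p_high))
print(length(signature))
```

With the number of minhashes held fixed, more bands lower the threshold and catch weaker matches at the cost of more candidate pairs to verify; fewer bands do the opposite.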

Files

inst
inst/extdata
inst/extdata/legal
inst/extdata/legal/ny1850-match.txt
inst/extdata/legal/ca1851-nomatch.txt
inst/extdata/legal/ca1851-match.txt
inst/extdata/ats
inst/extdata/ats/lifeofrevrichard00baxt.txt
inst/extdata/ats/gospeltruth00whit.txt
inst/extdata/ats/remember00palm.txt
inst/extdata/ats/practicalthought00nev.txt
inst/extdata/ats/memoirjamesbrai00ricegoog.txt
inst/extdata/ats/remembermeorholy00palm.txt
inst/extdata/ats/thoughtsonpopery00nevi.txt
inst/extdata/ats/calltounconv00baxt.txt
inst/doc
inst/doc/textreuse-introduction.html
inst/doc/textreuse-introduction.R
inst/doc/textreuse-pairwise.R
inst/doc/textreuse-minhash.R
inst/doc/textreuse-introduction.Rmd
inst/doc/textreuse-alignment.Rmd
inst/doc/textreuse-minhash.html
inst/doc/textreuse-minhash.Rmd
inst/doc/textreuse-pairwise.html
inst/doc/textreuse-alignment.html
inst/doc/textreuse-pairwise.Rmd
inst/doc/textreuse-alignment.R
tests
tests/testthat.R
tests/testthat
tests/testthat/test-tokenizers.R
tests/testthat/test-utils.R
tests/testthat/test-jaccard.R
tests/testthat/test-pairwise_cf.R
tests/testthat/test-TextReuseTextDocument.R
tests/testthat/newman.txt
tests/testthat/test-candidate_pairs.R
tests/testthat/test-minhash.R
tests/testthat/test-alignment.R
tests/testthat/test-lsh.R
tests/testthat/test-wordcount.R
tests/testthat/test-ratio_of_matches.R
tests/testthat/test-hashing.R
tests/testthat/test-TextReuseCorpus.R
tests/testthat/test-filenames.R
src
src/sw_matrix.cpp
src/skip_ngrams.cpp
src/hash_string.cpp
src/shingle_ngrams.cpp
src/RcppExports.cpp
NAMESPACE
NEWS.md
R
R/utils.R
R/align_local.R
R/pairwise_compare.R
R/lsh_probability.R
R/TextReuseCorpus.R
R/parallel.R
R/textreuse-package.r
R/tokenize.R
R/TextReuseTextDocument.R
R/lsh_candidates.R
R/lsh_query.R
R/filenames.R
R/minhash.R
R/conversion-functions.R
R/lsh.R
R/rehash.R
R/RcppExports.R
R/wordcount.R
R/lsh_compare.R
R/similarity.R
R/pairwise_candidates.R
R/tokenizers.R
R/lsh_subset.R
vignettes
vignettes/textreuse-introduction.Rmd
vignettes/textreuse-alignment.Rmd
vignettes/textreuse-minhash.Rmd
vignettes/textreuse-pairwise.Rmd
README.md
MD5
build
build/vignette.rds
DESCRIPTION
man
man/lsh_candidates.Rd
man/align_local.Rd
man/wordcount.Rd
man/lsh_query.Rd
man/tokenizers.Rd
man/lsh.Rd
man/rehash.Rd
man/textreuse-package.Rd
man/pairwise_candidates.Rd
man/pairwise_compare.Rd
man/lsh_compare.Rd
man/hash_string.Rd
man/reexports.Rd
man/filenames.Rd
man/as.matrix.textreuse_candidates.Rd
man/similarity-functions.Rd
man/lsh_probability.Rd
man/TextReuseTextDocument.Rd
man/tokenize.Rd
man/TextReuseTextDocument-accessors.Rd
man/minhash_generator.Rd
man/lsh_subset.Rd
man/TextReuseCorpus.Rd
LICENSE
textreuse documentation built on May 20, 2017, 1:13 a.m.
