textreuse-package | R Documentation |
Tools for measuring similarity among documents and detecting passages which have been reused. Implements shingled n-gram, skip n-gram, and other tokenizers; similarity/dissimilarity functions; pairwise comparisons; minhash and locality sensitive hashing algorithms; and a version of the Smith-Waterman local alignment algorithm suitable for natural language.
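As a quick illustration of the pieces listed above, the following minimal sketch tokenizes two short strings into shingled word n-grams, compares them with the Jaccard coefficient, and then runs the local alignment on a second pair. The example strings and variable names are made up for illustration; only `tokenize_ngrams()`, `jaccard_similarity()`, and `align_local()` come from the package.

```r
library(textreuse)

# Two made-up sentences that differ by a single word
a <- "The quick brown fox jumps over the lazy dog"
b <- "The quick brown fox leaps over the lazy dog"

# Shingled word n-grams; n = 3 gives overlapping three-word tokens
tokens_a <- tokenize_ngrams(a, n = 3)
tokens_b <- tokenize_ngrams(b, n = 3)

# Jaccard similarity: shared tokens divided by all distinct tokens
sim <- jaccard_similarity(tokens_a, tokens_b)
sim

# Local alignment of two strings (a Smith-Waterman variant for text);
# the result holds the aligned passages and an alignment score
aln <- align_local(
  "The answer is blowing in the wind, my friend.",
  "As the old song says, the answer is blowing in the wind."
)
aln$score
```

The Jaccard step treats each document as a set of shingles, so word order matters only within each n-gram window.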
The best place to begin with this package is the introductory vignette.
vignette("textreuse-introduction", package = "textreuse")
After reading that vignette, turn to the "pairwise", "minhash", and "alignment" vignettes, which introduce specific paths for working with the package.
vignette("textreuse-pairwise", package = "textreuse")
vignette("textreuse-minhash", package = "textreuse")
vignette("textreuse-alignment", package = "textreuse")
Another good place to begin with the package is the documentation for loading documents (TextReuseTextDocument and TextReuseCorpus), for tokenizers, similarity functions, and locality-sensitive hashing.
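A minimal sketch of the document-loading step may help orient a first read of those help pages. It builds a single document from an in-memory string; the text and the `id` value are invented for illustration, and the choice of `n = 3` is an assumption rather than a recommendation.

```r
library(textreuse)

# A made-up text; documents created from `text` need an id in `meta`
doc <- TextReuseTextDocument(
  text = paste(
    "Every document is split into overlapping word sequences",
    "and each sequence is then hashed to an integer value."
  ),
  meta = list(id = "example"),
  tokenizer = tokenize_ngrams, n = 3,  # extra arguments go to the tokenizer
  keep_tokens = TRUE                   # keep the tokens alongside the hashes
)

# Accessors for the stored pieces of the document
head(tokens(doc))   # the shingled n-grams
head(hashes(doc))   # one integer hash per token
wordcount(doc)
```

Keeping the tokens costs memory, so for a large corpus it may be preferable to keep only the hashes.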
Maintainer: Yaoxiang Li liyaoxiang@outlook.com (ORCID)
Authors:
Lincoln Mullen lincoln@lincolnmullen.com (ORCID)
The sample data provided in the extdata/ats directory is taken from a corpus of American Tract Society publications from the nineteenth century, gathered from the Internet Archive.
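The American Tract Society sample files can be used to try the minhash and locality-sensitive hashing path. The sketch below assumes the files are installed under `extdata/ats`; the number of minhashes, the seed, and the number of bands are illustrative values, not tuned settings.

```r
library(textreuse)

# Assumed location of the American Tract Society sample files
dir <- system.file("extdata/ats", package = "textreuse")

# 240 minhashes per document; the seed makes the hashes reproducible
minhash <- minhash_generator(n = 240, seed = 3552)

corpus <- TextReuseCorpus(
  dir = dir,
  tokenizer = tokenize_ngrams, n = 5,
  minhash_func = minhash,
  keep_tokens = TRUE,
  progress = FALSE
)

# 240 minhashes split into 80 bands of 3 rows each;
# the number of bands must divide the number of minhashes evenly
buckets <- lsh(corpus, bands = 80, progress = FALSE)

# Pairs of documents that share at least one bucket
candidates <- lsh_candidates(buckets)

# Score only the candidate pairs with the full Jaccard similarity
scores <- lsh_compare(candidates, corpus, jaccard_similarity,
                      progress = FALSE)
scores
```

Only candidate pairs are scored, which is what makes this path cheaper than comparing every pair of documents in a large corpus.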
The sample data provided in the extdata/legal directory is taken from the following nineteenth-century codes of civil procedure from California and New York.
Final Report of the Commissioners on Practice and Pleadings, in 2 Documents of the Assembly of New York, 73rd Sess., No. 16, (1850): 243-250, sections 597-613. Google Books.
An Act To Regulate Proceedings in Civil Cases, 1851 California Laws 51, 51-53 sections 4-17; 101, sections 313-316. Google Books.
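Because the legal sample is small, it suits the pairwise path, where every document is compared with every other. The following is a sketch under default settings; the variable names are arbitrary.

```r
library(textreuse)

# Sample sections from the New York and California codes
dir <- system.file("extdata/legal", package = "textreuse")
corpus <- TextReuseCorpus(dir = dir, progress = FALSE)

# Document ids are derived from the file names
names(corpus)

# Matrix of Jaccard similarities for every pair of documents
comparisons <- pairwise_compare(corpus, jaccard_similarity,
                                progress = FALSE)
comparisons

# The same scores as a data frame with one row per pair
pairwise_candidates(comparisons)
```

The number of comparisons grows quadratically with the number of documents, so this approach is practical only for small corpora.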