tokenize: Recompute the tokens for a document or corpus


View source: R/tokenize.R

Description

Given a TextReuseTextDocument or a TextReuseCorpus, this function recomputes the tokens and hashes with the functions specified. Optionally, it can also recompute the minhash signatures.

Usage

tokenize(x, tokenizer, ..., hash_func = hash_string, minhash_func = NULL,
  keep_tokens = FALSE, keep_text = TRUE)

Arguments

x

A TextReuseTextDocument or TextReuseCorpus.

tokenizer

A function to split the text into tokens. See tokenizers.

...

Arguments passed on to the tokenizer.

hash_func

A function to hash the tokens. See hash_string.

minhash_func

A function to create minhash signatures. See minhash_generator.

keep_tokens

Should the tokens be kept in the returned document, or discarded?

keep_text

Should the full text be kept in the returned document, or discarded?

Value

The modified TextReuseTextDocument or TextReuseCorpus.

Examples

dir <- system.file("extdata/legal", package = "textreuse")
corpus <- TextReuseCorpus(dir = dir, tokenizer = NULL)
corpus <- tokenize(corpus, tokenize_ngrams)
head(tokens(corpus[[1]]))

Example output

[1] "4 every action"      "every action shall"  "action shall be"    
[4] "shall be prosecuted" "be prosecuted in"    "prosecuted in the"  
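The minhash_func argument can be combined with minhash_generator to recompute minhash signatures in the same pass as the tokens. A sketch, assuming the textreuse package is installed; the signature size (n = 24), seed, and n-gram length used below are illustrative choices, not defaults:

```r
library(textreuse)

dir <- system.file("extdata/legal", package = "textreuse")
corpus <- TextReuseCorpus(dir = dir, tokenizer = NULL)

# A minhash generator producing 24 hashes per document; setting a seed
# makes the signatures reproducible across sessions.
minhash <- minhash_generator(n = 24, seed = 42)

# Recompute tokens, hashes, and minhash signatures in one pass.
# The n = 5 argument is passed through ... to tokenize_ngrams().
corpus <- tokenize(corpus, tokenize_ngrams, n = 5,
                   minhash_func = minhash, keep_tokens = TRUE)

head(minhashes(corpus[[1]]))
```

Because keep_tokens = TRUE, the retokenized documents retain their tokens, so tokens() can still be called on them afterwards.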

textreuse documentation built on May 30, 2017, 3:32 a.m.