tokenize: Recompute the tokens for a document or corpus


View source: R/tokenize.R

Description

Given a TextReuseTextDocument or a TextReuseCorpus, this function recomputes the tokens and hashes with the functions specified. Optionally, it can also recompute the minhash signatures.

Usage

tokenize(x, tokenizer, ..., hash_func = hash_string, minhash_func = NULL,
  keep_tokens = FALSE, keep_text = TRUE)

Arguments

x

A TextReuseTextDocument or TextReuseCorpus.

tokenizer

A function to split the text into tokens. See tokenizers.

...

Arguments passed on to the tokenizer.

hash_func

A function to hash the tokens. See hash_string.

minhash_func

A function to create minhash signatures. See minhash_generator.

keep_tokens

Should the tokens be kept in the returned document, or discarded?

keep_text

Should the full text be kept in the returned document, or discarded?

Value

The modified TextReuseTextDocument or TextReuseCorpus.

Examples

dir <- system.file("extdata/legal", package = "textreuse")
corpus <- TextReuseCorpus(dir = dir, tokenizer = NULL)
corpus <- tokenize(corpus, tokenize_ngrams)
head(tokens(corpus[[1]]))

Example output

[1] "4 every action"      "every action shall"  "action shall be"    
[4] "shall be prosecuted" "be prosecuted in"    "prosecuted in the"  
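The minhash_func argument can be combined with minhash_generator to recompute minhash signatures in the same pass as the tokens. A sketch, assuming the textreuse package is installed; the signature size (n = 24), seed, and n-gram length used below are illustrative choices, not defaults:

```r
library(textreuse)

dir <- system.file("extdata/legal", package = "textreuse")
corpus <- TextReuseCorpus(dir = dir, tokenizer = NULL)

# A minhash generator producing 24 hashes per document; setting a seed
# makes the signatures reproducible across sessions.
minhash <- minhash_generator(n = 24, seed = 42)

# Recompute tokens, hashes, and minhash signatures in one pass.
# The n = 5 argument is passed through ... to tokenize_ngrams().
corpus <- tokenize(corpus, tokenize_ngrams, n = 5,
                   minhash_func = minhash, keep_tokens = TRUE)

head(minhashes(corpus[[1]]))
```

Because keep_tokens = TRUE, the retokenized documents retain their tokens, so tokens() can still be called on them afterwards.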

textreuse documentation built on May 30, 2017, 3:32 a.m.