View source: R/preprocessing.r
preprocess_tokens | R Documentation |
Preprocess tokens in a character vector
preprocess_tokens(
x,
context = NULL,
language = "english",
use_stemming = F,
lowercase = T,
ngrams = 1,
replace_whitespace = F,
as_ascii = F,
remove_punctuation = T,
remove_stopwords = F,
remove_numbers = F,
min_freq = NULL,
min_docfreq = NULL,
max_freq = NULL,
max_docfreq = NULL,
min_char = NULL,
max_char = NULL,
ngram_skip_empty = T
)
x |
A character or factor vector in which each element is a token (i.e. a tokenized text) |
context |
Optionally, a character vector of the same length as x, specifying the context of each token (e.g., document, sentence). Has to be given if ngrams > 1 |
language |
The language used for stemming and removing stopwords |
use_stemming |
Logical, use stemming. (Make sure to specify the right language!) |
lowercase |
Logical, make tokens lowercase |
ngrams |
A number, specifying the number of tokens per ngram. Default is unigrams (1). |
replace_whitespace |
Logical. If TRUE, all whitespace is replaced by underscores |
as_ascii |
Logical. If TRUE, tokens will be forced to ascii |
remove_punctuation |
Logical. If TRUE, punctuation is removed |
remove_stopwords |
Logical. If TRUE, stopwords are removed (Make sure to specify the right language!) |
remove_numbers |
Logical. If TRUE, features that consist only of numbers are removed |
min_freq |
an integer, specifying minimum token frequency. |
min_docfreq |
an integer, specifying minimum document frequency. |
max_freq |
an integer, specifying maximum token frequency. |
max_docfreq |
an integer, specifying maximum document frequency. |
min_char |
an integer, specifying minimum number of characters in a term |
max_char |
an integer, specifying maximum number of characters in a term |
ngram_skip_empty |
if ngrams are used, determines whether empty (filtered out) terms are skipped (i.e. c("this", NA, "test"), becomes "this_test") or |
a factor vector
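The difference between token frequency (min_freq, max_freq) and document frequency (min_docfreq, max_docfreq) can be illustrated with a minimal base R sketch. This is not the package's implementation: the toy tokens, the context labels, and the choice to mark filtered tokens as NA are assumptions made for illustration (the NA convention follows the wording of the ngram_skip_empty description).

```r
# Hypothetical toy data: six tokens spread over two documents
tokens  <- c("the", "cat", "the", "dog", "the", "cat")
context <- c("d1",  "d1",  "d1",  "d2",  "d2",  "d2")

# Token frequency: total number of occurrences of each token
freq <- table(tokens)

# Document frequency: number of distinct contexts each token occurs in
docfreq <- tapply(context, tokens, function(ctx) length(unique(ctx)))

# Keep tokens that occur at least min_freq times; mark the rest as NA
# (assumption: filtered tokens become NA rather than being dropped)
min_freq <- 2
keep     <- as.integer(freq[tokens]) >= min_freq
filtered <- ifelse(keep, tokens, NA_character_)
```

In this sketch "dog" occurs once in total and in one document, so it falls below min_freq = 2, while "the" and "cat" are kept.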
tokens = c('I', 'am', 'a', 'SHORT', 'example', 'sentence', '!')
## default is lowercase without punctuation
preprocess_tokens(tokens)
## optionally, delete stopwords, perform stemming, and make ngrams
preprocess_tokens(tokens, remove_stopwords = TRUE, use_stemming = TRUE)
preprocess_tokens(tokens, context = NA, ngrams = 3)
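Why context matters for ngrams can be shown with a rough base R sketch of bigrams that do not cross context boundaries. This is only an illustration of the idea, not how preprocess_tokens builds ngrams: the toy data are hypothetical, and leaving the first token of each context as a unigram is an assumption, since the documentation above does not say how such positions are handled.

```r
# Hypothetical toy data: five tokens in two sentences
tokens  <- c("a", "short", "example", "another", "one")
context <- c("s1", "s1", "s1", "s2", "s2")

# Previous token for each position (none for the very first token)
prev <- c(NA, tokens[-length(tokens)])

# TRUE where the previous token belongs to the same context
same_context <- c(FALSE, context[-1] == context[-length(context)])

# Join with the previous token only within a context;
# the first token of each context stays a unigram (assumption)
bigrams <- ifelse(same_context, paste(prev, tokens, sep = "_"), tokens)
```

Without the context vector, "example" and "another" would be joined into one bigram even though they come from different sentences.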