preprocess_tokens: Preprocess tokens in a character vector
In corpustools: Managing, Querying and Analyzing Tokenized Text

preprocess_tokens

R Documentation

Preprocess tokens in a character vector

Description

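Applies common preprocessing steps to a vector of tokens: lowercasing, stemming, removing punctuation, stopwords and numbers, filtering on frequency and token length, and creating ngrams.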

Usage

preprocess_tokens(
  x,
  context = NULL,
  language = "english",
  use_stemming = F,
  lowercase = T,
  ngrams = 1,
  replace_whitespace = F,
  as_ascii = F,
  remove_punctuation = T,
  remove_stopwords = F,
  remove_numbers = F,
  min_freq = NULL,
  min_docfreq = NULL,
  max_freq = NULL,
  max_docfreq = NULL,
  min_char = NULL,
  max_char = NULL,
  ngram_skip_empty = T
)

Arguments

`x`	A character or factor vector in which each element is a token (i.e. a tokenized text)
`context`	Optionally, a character vector of the same length as x, specifying the context of each token (e.g., document, sentence). Has to be given if ngrams > 1
`language`	The language used for stemming and removing stopwords
`use_stemming`	Logical. If TRUE, tokens are stemmed (Make sure to specify the right language!)
`lowercase`	Logical. If TRUE, tokens are converted to lowercase
`ngrams`	A number, specifying the number of tokens per ngram. Default is unigrams (1).
`replace_whitespace`	Logical. If TRUE, all whitespace is replaced by underscores
`as_ascii`	Logical. If TRUE, tokens will be forced to ascii
`remove_punctuation`	Logical. If TRUE, punctuation is removed
`remove_stopwords`	Logical. If TRUE, stopwords are removed (Make sure to specify the right language!)
`remove_numbers`	Logical. If TRUE, tokens that consist only of numbers are removed
`min_freq`	An integer, specifying the minimum token frequency.
`min_docfreq`	An integer, specifying the minimum document frequency.
`max_freq`	An integer, specifying the maximum token frequency.
`max_docfreq`	An integer, specifying the maximum document frequency.
`min_char`	An integer, specifying the minimum number of characters in a term
`max_char`	An integer, specifying the maximum number of characters in a term
`ngram_skip_empty`	Logical. If ngrams are used, determines whether empty (filtered out) terms are skipped when building the ngrams (e.g., c("this", NA, "test") becomes "this_test")
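
The difference between token frequency (min_freq, max_freq) and document frequency (min_docfreq, max_docfreq) can be illustrated in base R. This is only a sketch of the two notions, not the package's implementation: the variable names are illustrative, and it assumes that document frequency counts the distinct context values in which a token occurs and that filtered-out tokens are represented as NA.

```r
# Six tokens spread over two contexts (documents)
tokens  <- c("a", "b", "a", "c", "a", "b")
context <- c("d1", "d1", "d1", "d2", "d2", "d2")

# Token frequency: total number of occurrences of each token
freq <- table(tokens)

# Document frequency: number of distinct contexts each token occurs in
# (assumption: one context value corresponds to one document)
docfreq <- tapply(context, tokens, function(d) length(unique(d)))

# Mimic a minimum document frequency of 2 by blanking out rarer tokens
filtered <- tokens
filtered[docfreq[tokens] < 2] <- NA

print(freq)
print(docfreq)
print(filtered)
```

Here "a" has a token frequency of 3 but a document frequency of 2, which is why the two kinds of thresholds can select different tokens.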

Value

a factor vector

Examples

tokens = c('I', 'am', 'a', 'SHORT', 'example', 'sentence', '!')

## default is lowercase without punctuation
preprocess_tokens(tokens)

## optionally, delete stopwords, perform stemming, and make ngrams
preprocess_tokens(tokens, remove_stopwords = TRUE, use_stemming = TRUE)
preprocess_tokens(tokens, context = NA, ngrams = 3)
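
The example above passes context = NA. When tokens come from several documents, a context vector of the same length as the tokens can be supplied instead. The following is a hedged sketch that uses only the arguments listed under Usage; the doc_id vector and the second document are made up for illustration, and no output is shown because it depends on the installed package version.

```r
library(corpustools)

# Tokens from two hypothetical documents
tokens <- c("I", "am", "a", "SHORT", "example", "sentence", "!",
            "Another", "short", "sentence", ".")

# One context label per token (illustrative document ids)
doc_id <- c(rep("doc1", 7), rep("doc2", 4))

# Bigrams, with the document id as the context of each token
bigrams <- preprocess_tokens(tokens, context = doc_id, ngrams = 2)

# Stopword removal and stemming, using the default language ("english")
stems <- preprocess_tokens(tokens, context = doc_id,
                           remove_stopwords = TRUE, use_stemming = TRUE)

print(bigrams)
print(stems)
```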

corpustools documentation built on May 31, 2023, 8:45 p.m.