tokens_wordstem: Stem the terms in an object
In quanteda: Quantitative Analysis of Textual Data

Description

Apply a stemmer to words. This is a wrapper around SnowballC's wordStem(), designed so that the function can be called without loading the entire SnowballC package. wordStem() uses Martin Porter's stemming algorithm and the C libstemmer library generated by Snowball.
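
As a rough illustration of the wrapper relationship described above, the sketch below compares char_wordstem() with a direct, namespace-qualified call to SnowballC::wordStem(). It assumes both packages are installed and that the default stemmer language is left unchanged; the variable names are only illustrative.

```r
library(quanteda)

words <- c("win", "winning", "wins", "winner")

# quanteda's wrapper, using the default stemmer language
via_quanteda <- char_wordstem(words)

# direct call to the underlying SnowballC function, without attaching SnowballC
via_snowball <- SnowballC::wordStem(
  words,
  language = quanteda_options("language_stemmer")
)

# for plain, non-missing word vectors the two results should agree
identical(via_quanteda, via_snowball)
```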

Usage

tokens_wordstem(
  x,
  language = quanteda_options("language_stemmer"),
  verbose = quanteda_options("verbose")
)

char_wordstem(
  x,
  language = quanteda_options("language_stemmer"),
  check_whitespace = TRUE
)

dfm_wordstem(
  x,
  language = quanteda_options("language_stemmer"),
  verbose = quanteda_options("verbose")
)

Arguments

`x`	a character, tokens, or dfm object whose words are to be stemmed. If tokenized texts, the tokenization must be word-based.
`language`	the name of a recognized language, as returned by getStemLanguages, or a two- or three-letter ISO-639 code corresponding to one of these languages (see references for the list of codes)
`verbose`	if `TRUE` print the number of tokens and documents before and after the function is applied. The number of tokens does not include paddings.
`check_whitespace`	logical; if `TRUE`, stop with an error when trying to stem inputs containing whitespace
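
A minimal sketch of how the `language` and `check_whitespace` arguments might be used with char_wordstem(). The input words are arbitrary, and `stopped` is a helper variable introduced only for this illustration; the sketch assumes "en" is accepted as the ISO-639 code for English.

```r
library(quanteda)

words <- c("running", "runs", "easily")

# a full language name and its ISO-639 code should select the same stemmer
by_name <- char_wordstem(words, language = "english")
by_code <- char_wordstem(words, language = "en")
identical(by_name, by_code)

# input containing whitespace is refused while check_whitespace = TRUE;
# `stopped` records whether the call halted
stopped <- tryCatch(
  {
    char_wordstem("winning games")
    FALSE
  },
  error = function(e) TRUE
)
stopped

# opting out of the check hands the whole string to the stemmer as one unit
char_wordstem("winning games", check_whitespace = FALSE)
```

In practice it is usually safer to tokenize first and call tokens_wordstem() than to disable the whitespace check.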

Value

tokens_wordstem() returns a tokens object whose word types have been stemmed.

char_wordstem() returns a character object whose word types have been stemmed.

dfm_wordstem() returns a dfm object whose word types (features) have been stemmed, and recombined to consolidate features made equivalent because of stemming.
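
The recombination step can be seen by comparing feature counts before and after stemming. This is a small sketch using a made-up one-document corpus; `d` and `ds` are illustrative names.

```r
library(quanteda)

toks <- tokens(c(d1 = "taxing taxes taxed tax"))
d <- dfm(toks)

# four distinct features before stemming
featnames(d)

# after stemming, features that share a stem are merged into one column
ds <- dfm_wordstem(d)
featnames(ds)

# the merged column carries the summed counts, so token totals are unchanged
ntoken(d)
ntoken(ds)
```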

References

https://snowballstem.org/

https://www.iso.org/iso-639-language-code for the ISO-639 language codes

Examples

# example applied to tokens
txt <- c(one = "eating eater eaters eats ate",
         two = "taxing taxes taxed my tax return")
th <- tokens(txt)
tokens_wordstem(th)

# simple example
char_wordstem(c("win", "winning", "wins", "won", "winner"))

# example applied to a dfm
(origdfm <- dfm(tokens(txt)))
dfm_wordstem(origdfm)

quanteda documentation built on April 7, 2026, 1:06 a.m.