tokens_trim: Trim tokens using frequency threshold-based feature selection

View source: R/tokens_trim.R


Description

Returns a tokens object reduced in size based on term and document frequency, usually by setting a minimum frequency, although maximum frequencies may also be set. Setting both a minimum and a maximum frequency selects features whose frequencies fall within that range.

Usage

tokens_trim(
  x,
  min_termfreq = NULL,
  max_termfreq = NULL,
  termfreq_type = c("count", "prop", "rank", "quantile"),
  min_docfreq = NULL,
  max_docfreq = NULL,
  docfreq_type = c("count", "prop", "rank", "quantile"),
  padding = FALSE,
  verbose = quanteda_options("verbose")
)

Arguments

x

a tokens object

min_termfreq, max_termfreq

minimum/maximum values of feature frequencies across all documents, below/above which features will be removed

termfreq_type

how min_termfreq and max_termfreq are interpreted. "count" uses raw term frequencies summed across all documents; "prop" divides the term frequencies by their total sum; "rank" matches the thresholds against the inverted ranking of features by overall frequency, so that 1 and 2 are the highest and second-highest frequency features, and so on; "quantile" sets the cutoffs according to the quantiles (see quantile()) of the term frequencies.

min_docfreq, max_docfreq

minimum/maximum values of a feature's document frequency, below/above which features will be removed

docfreq_type

how min_docfreq and max_docfreq are interpreted. "count" is the same as docfreq(x, scheme = "count"); "prop" divides the document frequencies by the total sum; "rank" matches the thresholds against the inverted ranking of document frequency, so that 1 and 2 are the features with the highest and second-highest document frequencies, and so on; "quantile" sets the cutoffs according to the quantiles (see quantile()) of the document frequencies.

padding

if TRUE, leave an empty string ("") where the removed tokens previously existed, preserving the positions of the remaining tokens (see Examples).

verbose

print messages

Value

A tokens object with reduced size.

See Also

dfm_trim()

Examples

toks <- tokens(data_corpus_inaugural)

# keep only words occurring >= 10 times and in >= 2 documents
tokens_trim(toks, min_termfreq = 10, min_docfreq = 2, padding = TRUE)

# keep only words occurring >= 10 times and in no more than 90% of the documents
tokens_trim(toks, min_termfreq = 10, max_docfreq = 0.9, docfreq_type = "prop",
            padding = TRUE)
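
# additional sketches of the other threshold types; the cutoffs of 20 and
# 0.95 below are arbitrary illustrative values, not from the original page

# keep only the 20 most frequent features overall, using a rank-based threshold
tokens_trim(toks, max_termfreq = 20, termfreq_type = "rank", padding = TRUE)

# remove features whose document frequency is above the 95th percentile
tokens_trim(toks, max_docfreq = 0.95, docfreq_type = "quantile", padding = TRUE)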
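
# a small sketch of how padding affects the result: with padding = FALSE
# removed tokens are dropped and positions shift; with padding = TRUE they
# are replaced by "" so the remaining tokens keep their original positions
toks_small <- tokens(c(d1 = "a a a b b c", d2 = "a b c d e"))
tokens_trim(toks_small, min_termfreq = 2)
tokens_trim(toks_small, min_termfreq = 2, padding = TRUE)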

