tokens_trim: Trim tokens using frequency threshold-based feature selection

View source: R/tokens_trim.R


Description

Returns a tokens object reduced in size based on term and document frequency, usually by setting a minimum frequency, although maximum frequencies may also be set. Setting both a minimum and a maximum frequency selects features whose frequencies fall within that range.

Usage

tokens_trim(
  x,
  min_termfreq = NULL,
  max_termfreq = NULL,
  termfreq_type = c("count", "prop", "rank", "quantile"),
  min_docfreq = NULL,
  max_docfreq = NULL,
  docfreq_type = c("count", "prop", "rank", "quantile"),
  padding = FALSE,
  verbose = quanteda_options("verbose")
)

Arguments

x

a tokens object

min_termfreq, max_termfreq

minimum/maximum values of feature frequencies across all documents, below/above which features will be removed

termfreq_type

how min_termfreq and max_termfreq are interpreted. "count" uses raw term frequencies summed across all documents; "prop" divides the term frequencies by their total sum; "rank" matches the thresholds against the inverted ranking of features by overall frequency, so that 1 and 2 are the highest and second-highest frequency features, and so on; "quantile" sets the cutoffs according to the quantiles (see quantile()) of the term frequencies.

min_docfreq, max_docfreq

minimum/maximum values of a feature's document frequency, below/above which features will be removed

docfreq_type

how min_docfreq and max_docfreq are interpreted. "count" is the same as docfreq(x, scheme = "count"); "prop" divides the document frequencies by the total sum; "rank" matches the thresholds against the inverted ranking of document frequency, so that 1 and 2 are the features with the highest and second-highest document frequencies, and so on; "quantile" sets the cutoffs according to the quantiles (see quantile()) of the document frequencies.

padding

if TRUE, leave an empty string ("") where the removed tokens previously existed, preserving the positions of the remaining tokens (see Examples).

verbose

print messages

Value

A tokens object with reduced size.

See Also

dfm_trim()

Examples

toks <- tokens(data_corpus_inaugural)

# keep only words occurring >= 10 times and in >= 2 documents
tokens_trim(toks, min_termfreq = 10, min_docfreq = 2, padding = TRUE)

# keep only words occurring >= 10 times and in no more than 90% of the documents
tokens_trim(toks, min_termfreq = 10, max_docfreq = 0.9, docfreq_type = "prop",
            padding = TRUE)
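
# additional sketches of the other threshold types; the cutoffs of 20 and
# 0.95 below are arbitrary illustrative values, not from the original page

# keep only the 20 most frequent features overall, using a rank-based threshold
tokens_trim(toks, max_termfreq = 20, termfreq_type = "rank", padding = TRUE)

# remove features whose document frequency is above the 95th percentile
tokens_trim(toks, max_docfreq = 0.95, docfreq_type = "quantile", padding = TRUE)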
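
# a small sketch of how padding affects the result: with padding = FALSE
# removed tokens are dropped and positions shift; with padding = TRUE they
# are replaced by "" so the remaining tokens keep their original positions
toks_small <- tokens(c(d1 = "a a a b b c", d2 = "a b c d e"))
tokens_trim(toks_small, min_termfreq = 2)
tokens_trim(toks_small, min_termfreq = 2, padding = TRUE)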

