| dtm_stopper | R Documentation | 
dtm_stopper will "stop" terms from the analysis by removing columns in a
DTM based on stop rules. Rules include matching terms in a precompiled or
custom list, terms meeting an upper or lower document frequency threshold,
or terms meeting an upper or lower term frequency threshold.
dtm_stopper(
  dtm,
  stop_list = NULL,
  stop_termfreq = NULL,
  stop_termrank = NULL,
  stop_termprop = NULL,
  stop_docfreq = NULL,
  stop_docprop = NULL,
  stop_hapax = FALSE,
  stop_null = FALSE,
  omit_empty = FALSE,
  dense = FALSE,
  ignore_case = TRUE
)
| dtm | Document-term matrix with terms as columns. Works with DTMs
produced by any popular text analysis package, or you can use the
 | 
| stop_list | Vector of terms, from a precompiled stoplist or
custom list such as  | 
| stop_termfreq | Vector of two numbers indicating the lower and upper
threshold for exclusion (see details). Use  | 
| stop_termrank | Single integer indicating upper term rank threshold for exclusion (see details). | 
| stop_termprop | Vector of two numbers indicating the lower and upper
threshold for exclusion (see details). Use  | 
| stop_docfreq | Vector of two numbers indicating the lower and upper
threshold for exclusion (see details). Use  | 
| stop_docprop | Vector of two numbers indicating the lower and upper
threshold for exclusion (see details). Use  | 
| stop_hapax | Logical (default = FALSE) indicating whether to remove terms occurring one time (or zero times), a.k.a. hapax legomena | 
| stop_null | Logical (default = FALSE) indicating whether to remove terms that occur zero times in the DTM. | 
| omit_empty | Logical (default = FALSE) indicating whether to omit rows that are empty after stopping any terms. | 
| dense | The default ( | 
| ignore_case | Logical (default = TRUE) indicating whether to ignore capitalization. | 
Stopping terms by removing their respective columns in the DTM is
significantly more efficient than searching raw text with string matching
and deletion rules. Behind the scenes, the function relies on
the fastmatch package to quickly match/not-match terms.
The stop_list argument takes a vector of terms which are matched and
removed from the DTM. If ignore_case = TRUE (the default) then word
case will be ignored.
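The effect of ignore_case can be illustrated with a minimal base R sketch. This is not the package's implementation (which, as noted above, relies on fastmatch); it uses base `%in%` instead, and the names `terms`, `stop_terms`, `kept_ci`, and `kept_cs` are hypothetical.

```r
# column names of a hypothetical DTM and a custom stop list
terms <- c("The", "cat", "Sat", "mat")
stop_terms <- c("the", "sat")

# case-insensitive matching: compare lowercased copies of both vectors
kept_ci <- terms[!(tolower(terms) %in% tolower(stop_terms))]

# case-sensitive matching: "The" and "Sat" do not match "the" and "sat"
kept_cs <- terms[!(terms %in% stop_terms)]
```

Here `kept_ci` holds "cat" and "mat", while `kept_cs` still holds all four terms.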
The stop_termfreq argument provides rules based on a term's occurrences
in the DTM as a whole, regardless of its within-document frequency. If
real numbers between 0 and 1 are provided then terms will be removed by
corpus proportion. For example, with c(0.01, 0.99), terms that make up
either less than 1% or more than 99% of the total tokens will be removed. If
integers are provided then terms will be removed by total count. For example,
with c(100, 9000), terms occurring fewer than 100 or more than 9000 times in
the corpus will be removed. This also means that if c(0, 1) is provided, then
the function will only keep terms occurring once.
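The count and proportion rules can be sketched in base R on a small dense matrix. This only illustrates the thresholds described above and is not the package's code; `toy_dtm` and the other object names are hypothetical, and treating both bounds as inclusive is an assumption of the sketch.

```r
# toy DTM: 3 documents (rows) by 4 terms (columns)
toy_dtm <- matrix(
  c(2, 0, 1, 0,
    3, 1, 0, 0,
    4, 0, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("doc", 1:3), c("the", "cat", "sat", "mat"))
)

# total occurrences of each term across the whole corpus
term_counts <- colSums(toy_dtm)

# count rule: keep terms occurring between 1 and 5 times
# (inclusive bounds are an assumption of this sketch)
keep_count <- term_counts >= 1 & term_counts <= 5
stopped_by_count <- toy_dtm[, keep_count, drop = FALSE]

# proportion rule: each term's share of all tokens in the corpus
term_props <- term_counts / sum(toy_dtm)
keep_prop <- term_props >= 0.05 & term_props <= 0.5
stopped_by_prop <- toy_dtm[, keep_prop, drop = FALSE]
```

In this toy corpus both rules keep only "cat" and "sat": "the" is too frequent and "mat" never occurs.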
The stop_termrank argument provides the upper threshold for a term's rank
in the corpus. For example, 5L will remove the five most frequent terms.
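A rank-based rule can be sketched in base R as follows. The object names are hypothetical, and how the function itself breaks ties between equally frequent terms is not stated here, so the sketch uses counts without ties.

```r
# hypothetical corpus-wide term counts (no ties)
term_counts <- c(the = 9, cat = 1, sat = 2, mat = 4)

# order terms from most to least frequent
ranked <- sort(term_counts, decreasing = TRUE)

# the two highest-ranked terms are the ones a rank threshold of 2 would stop
top_terms <- names(ranked)[seq_len(2)]
kept_terms <- setdiff(names(term_counts), top_terms)
```

With these counts, `top_terms` is "the" and "mat", leaving "cat" and "sat".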
The stop_docfreq argument provides rules based on a term's document
frequency, i.e. the number of documents within which it occurs, regardless
of how many times it occurs. If real numbers between 0 and 1 are provided
then terms will be removed by corpus proportion. For example, with
c(0.01, 0.99), terms that are in more than 99% of all documents or in less
than 1% of all documents will be removed. If integers are provided then terms
will be removed by document count. For example, with c(100, 9000), terms
occurring in fewer than 100 documents or more than 9000 documents will be
removed. This means that if c(0, 1) is provided, then the function will only
keep terms occurring in exactly one document, and remove terms in more than one.
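The difference between document frequency and term frequency can be seen in a short base R sketch. It is an illustration only, with hypothetical object names and bounds assumed to be inclusive.

```r
# toy DTM: 3 documents (rows) by 4 terms (columns)
toy_dtm <- matrix(
  c(2, 0, 1, 0,
    3, 1, 0, 0,
    4, 0, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("doc", 1:3), c("the", "cat", "sat", "mat"))
)

# document frequency: number of documents containing the term at least once
doc_freq <- colSums(toy_dtm > 0)

# count rule: keep terms found in 1 to 2 documents
keep_docs <- doc_freq >= 1 & doc_freq <= 2
stopped_by_docfreq <- toy_dtm[, keep_docs, drop = FALSE]

# proportion rule: share of documents containing the term
doc_props <- doc_freq / nrow(toy_dtm)
keep_docprop <- doc_props >= 0.1 & doc_props <= 0.9
stopped_by_docprop <- toy_dtm[, keep_docprop, drop = FALSE]
```

"the" occurs nine times in total but has a document frequency of 3, because only the number of documents containing it matters here.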
The stop_hapax argument is a shortcut for removing terms occurring just one
time in the corpus, called hapax legomena. Typically, a sizable portion
of the corpus tends to be hapax terms, and removing them is a quick way
to reduce the dimensions of a DTM. The DTM must be frequency counts (not
relative frequencies).
The stop_null argument removes terms that do not occur at all.
In other words, there is a column for the term, but the entire column
is zero. This can occur for a variety of reasons, such as starting with
a predefined vocabulary (e.g., using dtm_builder's vocab argument) or
through some cleaning processes.
The omit_empty argument will remove documents that are empty after terms
have been stopped.
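The stop_null, stop_hapax, and omit_empty rules can be sketched together in base R. This is a sketch with hypothetical object names, not the package's implementation. The hapax rule here also drops zero-count terms, following the argument description above.

```r
# toy DTM: 3 documents (rows) by 4 terms (columns)
toy_dtm <- matrix(
  c(2, 0, 1, 0,
    3, 1, 0, 0,
    4, 0, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("doc", 1:3), c("the", "cat", "sat", "mat"))
)
term_counts <- colSums(toy_dtm)

# null rule: drop columns that are entirely zero ("mat")
no_null <- toy_dtm[, term_counts > 0, drop = FALSE]

# hapax rule: drop terms occurring once or not at all ("cat", "sat", "mat")
no_hapax <- toy_dtm[, term_counts > 1, drop = FALSE]

# empty-document rule: after stopping "the", doc3 has no terms left
stopped <- toy_dtm[, colnames(toy_dtm) != "the", drop = FALSE]
non_empty <- stopped[rowSums(stopped) > 0, , drop = FALSE]
```

After these steps `no_null` keeps "the", "cat", and "sat", `no_hapax` keeps only "the", and `non_empty` keeps doc1 and doc2.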
Returns a document-term matrix of class "dgCMatrix".
Dustin Stoltz
# load the package that provides dtm_builder, dtm_stopper, and get_stoplist
library(text2map)

# create corpus and DTM
my_corpus <- data.frame(
  text = c(
    "I hear babies crying I watch them grow",
    "They'll learn much more than I'll ever know",
    "And I think to myself",
    "What a wonderful world",
    "Yes I think to myself",
    "What a wonderful world"
  ),
  line_id = paste0("line", seq_len(6))
)
## some text preprocessing
my_corpus$clean_text <- tolower(gsub("'", "", my_corpus$text))
dtm <- dtm_builder(
  data = my_corpus,
  text = clean_text,
  doc_id = line_id
)
## example 1 with R 4.1 pipe
dtm_st <- dtm |>
  dtm_stopper(stop_list = c("world", "babies"))
## example 2 without pipe
dtm_st <- dtm_stopper(
  dtm,
  stop_list = c("world", "babies")
)
## example 3 precompiled stoplist
dtm_st <- dtm_stopper(
  dtm,
  stop_list = get_stoplist("snowball2014")
)
## example 4, stop top 2
dtm_st <- dtm_stopper(
  dtm,
  stop_termrank = 2L
)
## example 5, stop docfreq
dtm_st <- dtm_stopper(
  dtm,
  stop_docfreq = c(2, 5)
)