View source: R/tokens_select.R
tokens_select: Select or remove tokens from a tokens object (R Documentation)
These functions select or discard tokens from a tokens object. For convenience, the functions tokens_remove and tokens_keep are defined as shortcuts for tokens_select(x, pattern, selection = "remove") and tokens_select(x, pattern, selection = "keep"), respectively. The most common usage of tokens_remove is to eliminate stop words from a text or text-based object, while the most common use of tokens_select is to select only the tokens that positively match a list of patterns, such as regular expressions or a dictionary. startpos and endpos determine the range of token positions searched for pattern, and the affected region is expanded by window.
tokens_select(
x,
pattern,
selection = c("keep", "remove"),
valuetype = c("glob", "regex", "fixed"),
case_insensitive = TRUE,
padding = FALSE,
window = 0,
min_nchar = NULL,
max_nchar = NULL,
startpos = 1L,
endpos = -1L,
apply_if = NULL,
verbose = quanteda_options("verbose")
)
tokens_remove(x, ...)
tokens_keep(x, ...)
Arguments:

x: tokens object whose token elements will be removed or kept

pattern: a character vector, list of character vectors, dictionary, or collocations object. See pattern for details.

selection: whether to "keep" or "remove" the tokens matching pattern

valuetype: the type of pattern matching: "glob" for glob-style wildcard expressions, "regex" for regular expressions, or "fixed" for exact matching

case_insensitive: logical; if TRUE, ignore case when matching a pattern or dictionary values

padding: if TRUE, leave an empty string where the removed tokens previously existed. This is useful if a positional match is needed between the pre- and post-selected tokens, for instance if a window of adjacent words needs to be computed.

window: integer of length 1 or 2; the size of the window of tokens adjacent to pattern that will also be selected or removed. The window can be asymmetric if two elements are passed, with the first element giving the window size before pattern and the second the window size after. Terms from overlapping windows are never double-counted, but are simply returned in the pattern match.

min_nchar, max_nchar: optional numerics specifying the minimum and maximum length in characters for tokens to be removed or kept; defaults are NULL for no limits

startpos, endpos: integer; position of tokens in documents where pattern matching starts and ends, where 1 is the first token in a document. For negative indexes, counting starts at the ending token of the document, so that -1 denotes the last token in the document, -2 the second to last, etc. When the length of the vector is equal to ndoc(x), the positions are applied to the corresponding documents.

apply_if: logical vector of length ndoc(x); the pattern is applied only to documents for which the corresponding value is TRUE, leaving the other documents unchanged

verbose: if TRUE, print messages reporting what was done

...: additional arguments passed by tokens_remove and tokens_keep to tokens_select
Value:

a tokens object with tokens selected or removed based on their match to pattern
## tokens_select with simple examples
toks <- as.tokens(list(letters, LETTERS))
tokens_select(toks, c("b", "e", "f"), selection = "keep", padding = FALSE)
tokens_select(toks, c("b", "e", "f"), selection = "keep", padding = TRUE)
tokens_select(toks, c("b", "e", "f"), selection = "remove", padding = FALSE)
tokens_select(toks, c("b", "e", "f"), selection = "remove", padding = TRUE)
# how case_insensitive works
tokens_select(toks, c("b", "e", "f"), selection = "remove", case_insensitive = TRUE)
tokens_select(toks, c("b", "e", "f"), selection = "remove", case_insensitive = FALSE)
# use window
tokens_select(toks, c("b", "f"), selection = "keep", window = 1)
tokens_select(toks, c("b", "f"), selection = "remove", window = 1)
tokens_remove(toks, c("b", "f"), window = c(0, 1))
tokens_select(toks, pattern = c("e", "g"), window = c(1, 2))
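The window examples above return only the matched tokens plus their neighbors; combining window with padding = TRUE instead preserves every token position, which matters when positional alignment with the original documents is needed. A minimal sketch, assuming the quanteda package is installed:

```r
library(quanteda)

toks <- as.tokens(list(letters, LETTERS))

# keep "b" plus one adjacent token on each side; padding = TRUE replaces
# every non-selected token with an empty string, so each document keeps
# its original length of 26 tokens
out <- tokens_select(toks, "b", selection = "keep", window = 1, padding = TRUE)
as.list(out)[[1]][1:4]  # "a" "b" "c" ""
```

Because case_insensitive defaults to TRUE, the second document (upper-case letters) is padded around "B" in the same way.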
# tokens_remove example: remove stopwords
txt <- c(wash1 <- "Fellow citizens, I am again called upon by the voice of my
country to execute the functions of its Chief Magistrate.",
wash2 <- "When the occasion proper for it shall arrive, I shall
endeavor to express the high sense I entertain of this
distinguished honor.")
tokens_remove(tokens(txt, remove_punct = TRUE), stopwords("english"))
# tokens_keep example: keep two-letter words
tokens_keep(tokens(txt, remove_punct = TRUE), "??")
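The min_nchar and startpos/endpos arguments have no examples above; the following short sketch (assuming quanteda is attached, and using a made-up sentence) illustrates both:

```r
library(quanteda)

toks <- tokens("The quick brown fox jumps over the lazy dog")

# keep only tokens at least four characters long
# ("*" is a glob pattern matching every token)
tokens_select(toks, "*", min_nchar = 4)

# remove "the" only when it occurs among the first three tokens;
# the later "the" (position 7) lies outside the search range and survives
tokens_remove(toks, "the", startpos = 1, endpos = 3)
```

Scalar startpos/endpos values apply to every document; per-document ranges can be given as vectors of length ndoc(x), as described under the startpos, endpos arguments above.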