tokens_segment: Segment tokens by splitting on a pattern match

View source: R/tokens_segment.R

Description
Segment tokens by splitting on a pattern match. This is useful for breaking
the tokenized texts into smaller document units, based on a regular pattern
or a user-supplied annotation. While it normally makes more sense to do this
at the corpus level (see corpus_segment()), tokens_segment provides the
option to perform this operation on tokens.
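
By way of comparison, here is a minimal sketch of the same punctuation split
performed at each level; the sample text is invented for illustration, and
the corpus_segment() arguments mirror those used in the examples below.

library("quanteda")

corp <- corpus("We begin today. We end tomorrow.")

# corpus level: split the raw text at each full stop, keeping it attached
corpus_segment(corp, pattern = ".", valuetype = "fixed",
               pattern_position = "after", extract_pattern = FALSE)

# token level: tokenize first, then split the tokens object the same way
tokens_segment(tokens(corp), pattern = ".", valuetype = "fixed",
               pattern_position = "after")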
Usage

tokens_segment(
  x,
  pattern,
  valuetype = c("glob", "regex", "fixed"),
  case_insensitive = TRUE,
  extract_pattern = FALSE,
  pattern_position = c("before", "after"),
  use_docvars = TRUE
)
Arguments

x                  tokens object whose token elements will be segmented

pattern            a character vector, list of character vectors, dictionary,
                   or collocations object. See pattern for details.

valuetype          the type of pattern matching: "glob" for "glob"-style
                   wildcard expressions; "regex" for regular expressions; or
                   "fixed" for exact matching

case_insensitive   logical; if TRUE, ignore case when matching a pattern or
                   dictionary values

extract_pattern    if TRUE, remove matched patterns from the texts and save
                   them in docvars

pattern_position   either "before" or "after", depending on whether the
                   pattern precedes the text (as with a tag) or follows the
                   text (as with punctuation delimiters); a tag-based sketch
                   follows this table

use_docvars        if TRUE, repeat the docvar values for each segmented text;
                   if FALSE, drop the docvars in the segmented object
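
As a sketch of the pattern_position = "before" case mentioned above: the tag
names here are hypothetical, and the example assumes the default tokenizer
keeps "#"-prefixed tags as single tokens.

library("quanteda")

toks <- tokens("#INTRO Fellow citizens . #BODY I am again called upon .")

# the tag precedes the text it labels, so pattern_position = "before";
# extract_pattern = TRUE removes each matched tag from the tokens and
# saves it in the docvars of the result
segs <- tokens_segment(toks, "#*", valuetype = "glob",
                       extract_pattern = TRUE,
                       pattern_position = "before")
docvars(segs)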
Value

tokens_segment returns a tokens object whose documents have been split by
patterns.
Examples

txts <- "Fellow citizens, I am again called upon by the voice of my country to
execute the functions of its Chief Magistrate. When the occasion proper for
it shall arrive, I shall endeavor to express the high sense I entertain of
this distinguished honor."
toks <- tokens(txts)

# split at sentence-ending punctuation (the Unicode Sterm class)
tokens_segment(toks, "^\\p{Sterm}$", valuetype = "regex",
               extract_pattern = TRUE,
               pattern_position = "after")

# the same split, matching the three delimiters literally
tokens_segment(toks, c(".", "?", "!"), valuetype = "fixed",
               extract_pattern = TRUE,
               pattern_position = "after")
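
To see what extract_pattern saves, the result of the fixed-pattern example
can be inspected with the standard quanteda accessors:

# store the segmented result and inspect it
segs <- tokens_segment(toks, c(".", "?", "!"), valuetype = "fixed",
                       extract_pattern = TRUE,
                       pattern_position = "after")
ndoc(segs)      # one document per matched delimiter
docvars(segs)   # the extracted delimiters, saved as a docvar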