tokens_split: Split tokens by a separator pattern

View source: R/tokens_split.R

Description

Splits each token into multiple tokens at a separator pattern, with the option of retaining the separator. This effectively reverses the operation of tokens_compound().

Usage

tokens_split(
  x,
  separator = " ",
  valuetype = c("fixed", "regex"),
  remove_separator = TRUE
)

Arguments

x

a tokens object

separator

a single-character pattern at which tokens are split

valuetype

the type of pattern matching: "fixed" for exact matching (the default) or "regex" for regular expressions. See valuetype for details.

remove_separator

if TRUE (the default), remove the separator from the resulting tokens
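
To make the interaction of separator, valuetype and remove_separator concrete, the following is a minimal base-R sketch that works on a plain character vector. It is not quanteda's implementation: split_tokens_sketch() is a hypothetical helper, its fixed flag stands in for valuetype ("fixed" vs. "regex"), and it assumes that a retained separator becomes a token of its own.

```r
# Hypothetical helper for illustration only; not part of quanteda.
# `fixed = TRUE` mimics valuetype = "fixed", `fixed = FALSE` mimics "regex".
split_tokens_sketch <- function(tokens, separator = " ", fixed = TRUE,
                                remove_separator = TRUE) {
  out <- lapply(tokens, function(tok) {
    # locate every occurrence of the separator in this token
    m <- gregexpr(separator, tok, fixed = fixed)
    # text between the matches (the whole token if nothing matched)
    parts <- regmatches(tok, m, invert = TRUE)[[1]]
    if (!remove_separator) {
      # assumption: a retained separator is emitted as a separate element
      seps <- regmatches(tok, m)[[1]]
      merged <- character(length(parts) + length(seps))
      merged[seq(1, length(merged), by = 2)] <- parts
      merged[seq_len(length(seps)) * 2] <- seps
      parts <- merged
    }
    # drop empty strings left by leading or trailing separators
    parts[nzchar(parts)]
  })
  unlist(out)
}

# exact matching, separator dropped
split_tokens_sketch(c("pork_barrel", "is"), separator = "_")

# exact matching, separator kept
split_tokens_sketch("UK-EU", separator = "-", remove_separator = FALSE)

# regular-expression matching on either "-" or "_"
split_tokens_sketch(c("UK-EU", "pork_barrel"), separator = "[-_]", fixed = FALSE)
```

The sketch returns one flat character vector, whereas tokens_split() operates on a tokens object and preserves its document structure.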

Examples

# undo tokens_compound()
toks1 <- tokens("pork barrel is an idiomatic multi-word expression")
tokens_compound(toks1, phrase("pork barrel"))
tokens_compound(toks1, phrase("pork barrel")) %>%
    tokens_split(separator = "_")
    
# similar to tokens(x, remove_hyphen = TRUE) but post-tokenization 
toks2 <- tokens("UK-EU negotiation is not going anywhere as of 2018-12-24.")
tokens_split(toks2, separator = "-", remove_separator = FALSE)

koheiw/quanteda.core documentation built on Sept. 21, 2020, 3:44 p.m.