tokenize_text: Tokenize text


Description

Split each element of a character vector at the boundary 'split_re' and return the resulting tokens as ngrams of size 'ngram'.
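
As a rough illustration, the following is a minimal sketch of the behavior described above, written against base R's strsplit(). It is not the package's actual implementation; in particular, the sliding-window ngram scheme and the use of a space to join tokens within an ngram are assumptions.

tokenize_text_sketch <- function(strings, ngram, split_re = " ", ...) {
  # split each document into tokens at the 'split_re' boundary
  tokenized <- lapply(strsplit(strings, split_re, ...), function(tokens) {
    n <- length(tokens)
    if (n < ngram) return(character(0))
    # slide a window of width 'ngram' over the tokens (assumed scheme)
    vapply(seq_len(n - ngram + 1), function(i) {
      paste(tokens[i:(i + ngram - 1)], collapse = " ")
    }, character(1))
  })
  # a single document yields a bare character vector, per the Value section
  if (length(strings) == 1) tokenized[[1]] else tokenized
}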

Usage

tokenize_text(strings, ngram, split_re = " ", ...)

Arguments

strings

character vector of text documents to be tokenized.

ngram

positive integer specifying size of ngram chunks.

split_re

regular expression denoting the token boundary to split strings by.

...

named arguments passed on to 'strsplit()' (e.g. 'fixed=TRUE').

Value

If 'length(strings) == 1', a character vector of 'ngram' tokens; if 'length(strings) > 1', a list whose elements are character vectors of 'ngram' tokens, one per input document.
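
In code, the documented return shapes look like this:

# one input document: a character vector of ngram tokens
out1 <- tokenize_text("hai mi name timi", 2)
is.character(out1)  # TRUE

# several input documents: a list with one character vector per document
out2 <- tokenize_text(c("hai mi name timi", "me lava me"), 2)
is.list(out2)  # TRUE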

Examples

{
  string <- "hai mi name timi + me girl nam dootza--tza"
  # unigrams and bigrams of a single document
  tokenize_text(string, 1)
  tokenize_text(string, 2)
  # ngram sizes 1 through 3
  lapply(1:3, function(x) tokenize_text(string, x))
  # split on spaces or hyphens via a custom regex
  tokenize_text(string, 2, "[ -]")
  # split on a literal period: escape it in the regex...
  tokenize_text("me.lava.me.dootzi", 3, "\\.")
  # ...or pass fixed=TRUE through to strsplit()
  tokenize_text("me.lava.me.dootzi", 3, ".", fixed=TRUE)
  # multiple documents: a list with one token vector per document
  tokenize_text(rep("me.lava.me.dootzi", 2), 3, ".", fixed=TRUE)
  tokenize_text(c(string, "waow me fillin heppi meby beby"), 3)
  # behavior with NA and empty-string documents
  tokenize_text(c(string, "waow me fillin heppi meby beby", NA), 3)
  tokenize_text(c(string, "waow me fillin heppi meby beby", ""), 3)
  tokenize_text(NA, 3)
}
