tokenize_text: Tokenize text


Description

Split each element of a character vector at the boundary 'split_re' and return the resulting tokens as ngrams of size 'ngram'.
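
As a rough illustration, the following is a minimal sketch of the behavior described above, written against base R's strsplit(). It is not the package's actual implementation; in particular, the sliding-window ngram scheme and the use of a space to join tokens within an ngram are assumptions.

tokenize_text_sketch <- function(strings, ngram, split_re = " ", ...) {
  # split each document into tokens at the 'split_re' boundary
  tokenized <- lapply(strsplit(strings, split_re, ...), function(tokens) {
    n <- length(tokens)
    if (n < ngram) return(character(0))
    # slide a window of width 'ngram' over the tokens (assumed scheme)
    vapply(seq_len(n - ngram + 1), function(i) {
      paste(tokens[i:(i + ngram - 1)], collapse = " ")
    }, character(1))
  })
  # a single document yields a bare character vector, per the Value section
  if (length(strings) == 1) tokenized[[1]] else tokenized
}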

Usage

tokenize_text(strings, ngram, split_re = " ", ...)

Arguments

strings

character vector of text documents to be tokenized.

ngram

positive integer specifying size of ngram chunks.

split_re

regular expression denoting the token boundary to split strings by.

...

named arguments passed on to 'strsplit()' (e.g. 'fixed=TRUE').

Value

If 'length(strings) == 1', a character vector of 'ngram' tokens; if 'length(strings) > 1', a list whose elements are character vectors of 'ngram' tokens, one per input document.
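
In code, the documented return shapes look like this:

# one input document: a character vector of ngram tokens
out1 <- tokenize_text("hai mi name timi", 2)
is.character(out1)  # TRUE

# several input documents: a list with one character vector per document
out2 <- tokenize_text(c("hai mi name timi", "me lava me"), 2)
is.list(out2)  # TRUE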

Examples

{
  string <- "hai mi name timi + me girl nam dootza--tza"
  # unigrams and bigrams of a single document
  tokenize_text(string, 1)
  tokenize_text(string, 2)
  # ngram sizes 1 through 3
  lapply(1:3, function(x) tokenize_text(string, x))
  # split on spaces or hyphens via a custom regex
  tokenize_text(string, 2, "[ -]")
  # split on a literal period: escape it in the regex...
  tokenize_text("me.lava.me.dootzi", 3, "\\.")
  # ...or pass fixed=TRUE through to strsplit()
  tokenize_text("me.lava.me.dootzi", 3, ".", fixed=TRUE)
  # multiple documents: a list with one token vector per document
  tokenize_text(rep("me.lava.me.dootzi", 2), 3, ".", fixed=TRUE)
  tokenize_text(c(string, "waow me fillin heppi meby beby"), 3)
  # behavior with NA and empty-string documents
  tokenize_text(c(string, "waow me fillin heppi meby beby", NA), 3)
  tokenize_text(c(string, "waow me fillin heppi meby beby", ""), 3)
  tokenize_text(NA, 3)
}
