tokenize | R Documentation

Description

Tokenize raw text for training word embeddings.

Usage
tokenize(
  text,
  tokenizer = text2vec::word_tokenizer,
  split = " ",
  remove = "_|'|<br/>|<br />|e\\.g\\.|i\\.e\\.",
  encoding = "UTF-8",
  simplify = TRUE,
  verbose = TRUE
)
Arguments

text
A character vector of text, or a file path on disk containing text.

tokenizer
Function used to tokenize the text. Defaults to text2vec::word_tokenizer.

split
Separator between tokens, only used when simplify=TRUE. Defaults to " ".

remove
Strings (in regular expression) to be removed from the text. Defaults to "_|'|<br/>|<br />|e\\.g\\.|i\\.e\\.".
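The remove argument amounts to a regular-expression substitution applied before tokenizing. A minimal sketch of the default pattern in base R (an assumption about the internals: the package may apply the pattern differently, e.g. per sentence or with additional options):

```r
# Default removal pattern, copied from the usage above
pattern <- "_|'|<br/>|<br />|e\\.g\\.|i\\.e\\."
x <- "e.g. I've seen the <br/> tag_here"
# gsub() deletes every match of any alternative in the pattern
gsub(pattern, "", x)
# -> " Ive seen the  taghere"
```

Note that matches are deleted, not replaced with spaces, so adjacent whitespace is left untouched.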
encoding
Text encoding (only used if text is a file path). Defaults to "UTF-8".

simplify
Return a character vector (TRUE) or a list of character vectors (FALSE). Defaults to TRUE.

verbose
Print information to the console? Defaults to TRUE.
Value

simplify=TRUE: A tokenized character vector, with each element as a sentence.

simplify=FALSE: A list of tokenized character vectors, with each element as a vector of tokens in a sentence.
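The two return shapes relate through the split separator: joining each token vector of the simplify=FALSE list with split yields the simplify=TRUE vector. A minimal illustration of the shapes using plain strsplit (an illustration only, not the package's actual tokenizer):

```r
sentences <- c("I love NLP", "R is fun")

# simplify = FALSE shape: a list with one character vector of tokens per sentence
token_list <- strsplit(sentences, " ", fixed = TRUE)

# simplify = TRUE shape: one string per sentence, tokens joined by `split`
token_vec <- vapply(token_list, paste, character(1), collapse = " ")
token_vec
# -> "I love NLP" "R is fun"
```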
See Also

train_wordvec

Examples
txt1 = c(
"I love natural language processing (NLP)!",
"I've been in this city for 10 years. I really like here!",
"However, my computer is not among the \"Top 10\" list."
)
tokenize(txt1, simplify=FALSE)
tokenize(txt1) %>% cat(sep="\n----\n")  # %>% requires magrittr to be attached
txt2 = text2vec::movie_review$review[1:5]
texts = tokenize(txt2)
txt2[1]
texts[1:20] # all sentences in txt2[1]