tokenize | R Documentation

Description

Tokenize raw text for training word embeddings.

Usage
tokenize(
  text,
  tokenizer = text2vec::word_tokenizer,
  split = " ",
  remove = "_|'|<br/>|<br />|e\\.g\\.|i\\.e\\.",
  encoding = "UTF-8",
  simplify = TRUE,
  verbose = TRUE
)
Arguments

text
A character vector of text, or a file path on disk containing text.

tokenizer
Function used to tokenize the text. Defaults to text2vec::word_tokenizer.

split
Separator between tokens, only used when simplify=TRUE. Defaults to " ".

remove
Strings (in regular expression) to be removed from the text. Defaults to "_|'|<br/>|<br />|e\\.g\\.|i\\.e\\.".
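The remove argument amounts to a regular-expression substitution applied before tokenizing. A minimal sketch of the default pattern in base R (an assumption about the internals: the package may apply the pattern differently, e.g. per sentence or with additional options):

```r
# Default removal pattern, copied from the usage above
pattern <- "_|'|<br/>|<br />|e\\.g\\.|i\\.e\\."
x <- "e.g. I've seen the <br/> tag_here"
# gsub() deletes every match of any alternative in the pattern
gsub(pattern, "", x)
# -> " Ive seen the  taghere"
```

Note that matches are deleted, not replaced with spaces, so adjacent whitespace is left untouched.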
encoding
Text encoding (only used if text is a file path). Defaults to "UTF-8".

simplify
Return a character vector (TRUE) or a list of character vectors (FALSE). Defaults to TRUE.

verbose
Print information to the console? Defaults to TRUE.
Value

simplify=TRUE: A tokenized character vector, with each element as a sentence.

simplify=FALSE: A list of tokenized character vectors, with each element as a vector of tokens in a sentence.
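The two return shapes relate through the split separator: joining each token vector of the simplify=FALSE list with split yields the simplify=TRUE vector. A minimal illustration of the shapes using plain strsplit (an illustration only, not the package's actual tokenizer):

```r
sentences <- c("I love NLP", "R is fun")

# simplify = FALSE shape: a list with one character vector of tokens per sentence
token_list <- strsplit(sentences, " ", fixed = TRUE)

# simplify = TRUE shape: one string per sentence, tokens joined by `split`
token_vec <- vapply(token_list, paste, character(1), collapse = " ")
token_vec
# -> "I love NLP" "R is fun"
```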
See Also

train_wordvec

Examples
txt1 = c(
"I love natural language processing (NLP)!",
"I've been in this city for 10 years. I really like here!",
"However, my computer is not among the \"Top 10\" list."
)
tokenize(txt1, simplify=FALSE)
tokenize(txt1) %>% cat(sep="\n----\n")  # %>% requires magrittr to be attached
txt2 = text2vec::movie_review$review[1:5]
texts = tokenize(txt2)
txt2[1]
texts[1:20] # all sentences in txt2[1]