View source: R/spacy_tokenize.R
Description

Efficient tokenization (without POS tagging, dependency parsing, lemmatization, or named entity recognition) of texts using spaCy.
Usage

spacy_tokenize(
  x,
  what = c("word", "sentence"),
  remove_punct = FALSE,
  remove_url = FALSE,
  remove_numbers = FALSE,
  remove_separators = TRUE,
  remove_symbols = FALSE,
  padding = FALSE,
  multithread = TRUE,
  output = c("list", "data.frame"),
  ...
)
Arguments

x: a character object, a quanteda corpus, or a TIF-compliant corpus data.frame (see https://github.com/ropensci/tif)

what: the unit for splitting the text; available alternatives are "word" (word segmentation) and "sentence" (sentence segmentation)

remove_punct: remove punctuation tokens

remove_url: remove tokens that look like a URL or email address

remove_numbers: remove tokens that look like a number (e.g. "334", "3.1415", "fifty")

remove_separators: remove spaces as separators; all other remove functionalities (e.g. remove_punct) have to be FALSE when output = "data.frame" is selected

remove_symbols: remove symbols; the symbols are either SYM in the pos field, or currency symbols

padding: if TRUE, leave an empty string where the removed tokens previously existed; this preserves a positional match between the original and the selected tokens

multithread: logical; if TRUE, the processing is parallelized using spaCy's multithreading

output: type of returned object, either "list" or "data.frame"

...: not used directly
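As an illustration of how the removal options and the output argument combine, here is a minimal sketch. It assumes spaCy and a language model are installed and reachable via spacy_initialize(); the input string is invented for illustration.

```r
library("spacyr")
spacy_initialize()

# Drop punctuation and number-like tokens, and return a data.frame
# (one row per token) instead of the default list of character vectors.
spacy_tokenize(
  "Version 3.1415 was released!",
  remove_punct = TRUE,
  remove_numbers = TRUE,
  output = "data.frame"
)

spacy_finalize()
```

Setting padding = TRUE instead would keep an empty string in place of each removed token, so token positions line up with the unfiltered text.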
Value

Either a list or a data.frame of tokens, depending on the output argument.
Examples

spacy_initialize()

txt <- "And now for something completely different."
spacy_tokenize(txt)

txt2 <- c(doc1 = "The fast cat catches mice.\nThe quick brown dog jumped.",
          doc2 = "This is the second document.",
          doc3 = "This is a \"quoted\" text.")
spacy_tokenize(txt2)