tokenize_internal {quanteda}    R Documentation

Internal methods for tokenization
Description

Internal methods for tokenization, providing the default and legacy methods for text segmentation.
Usage

tokenize_word2(
  x,
  split_hyphens = FALSE,
  verbose = quanteda_options("verbose"),
  ...
)

tokenize_word3(
  x,
  split_hyphens = FALSE,
  verbose = quanteda_options("verbose"),
  ...
)

tokenize_word4(
  x,
  split_hyphens = FALSE,
  split_tags = FALSE,
  split_elisions = FALSE,
  verbose = quanteda_options("verbose"),
  ...
)

tokenize_word1(
  x,
  split_hyphens = FALSE,
  verbose = quanteda_options("verbose"),
  ...
)

tokenize_character(x, ...)

tokenize_sentence(x, verbose = FALSE, ...)

tokenize_fasterword(x, ...)

tokenize_fastestword(x, ...)
Arguments

x               (named) character; input texts

split_hyphens   logical; if TRUE, split words connected by hyphens and
                hyphen-like characters, so that e.g. "self-aware" becomes
                c("self", "-", "aware"); if FALSE (the default), keep such
                words as single tokens

verbose         logical; if TRUE, print progress messages while tokenizing

...             used to pass arguments among the functions

split_tags      logical; if TRUE, split social-media tags such as #hashtags
                and @usernames into their component parts; if FALSE (the
                default), keep them as single tokens

split_elisions  logical (tokenize_word4() only); if TRUE, split elided forms
                such as the "Qu'" in "Qu'est-ce" into separate tokens; if
                FALSE (the default), keep them intact
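A minimal sketch of the two splitting switches, assuming the tokenizers are accessible as in the Examples section; the exact token boundaries may vary across quanteda versions:

tag_txt <- c(d1 = "Self-aware #rstats users")
tokenize_word4(tag_txt, split_hyphens = FALSE, split_tags = FALSE)
# "Self-aware" and "#rstats" should each remain a single token
tokenize_word4(tag_txt, split_hyphens = TRUE, split_tags = TRUE)
# "Self-aware" should split around the hyphen, and "#rstats" around the "#"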
Details

Each of the word tokenizers corresponds to a major version of quanteda and is kept here for backward compatibility and comparison. tokenize_word3() is identical to tokenize_word2().
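To compare the versions side by side, the legacy tokenizers can be run on the same Internet-style text; a sketch, with outputs depending on the installed quanteda version:

web_txt <- "Visit https://quanteda.io @quantedainit #rstats"
tokenize_word1(web_txt)  # v1: keeps URLs, #hashtags, and @usernames intact
tokenize_word2(web_txt)  # v2/v3 tokenizer
identical(tokenize_word2(web_txt), tokenize_word3(web_txt))  # TRUE, per the note above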
Value

a list of characters corresponding to the (most conservative) tokenization, including whitespace where applicable; except for tokenize_word1(), which is a special tokenizer for Internet language that includes URLs, #hashtags, @usernames, and email addresses.
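The documented return shape, a named list of character vectors, can be inspected with str(); the printed output below is illustrative only, since whether whitespace tokens appear depends on the tokenizer:

toks <- tokenize_word2(c(doc1 = "One two.", doc2 = "Three!"))
str(toks)
# illustrative shape:
# List of 2
#  $ doc1: chr [...] "One" "two" "."
#  $ doc2: chr [...] "Three" "!"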
Examples

## Not run:
txt <- c(doc1 = "Tweet https://quanteda.io using @quantedainit and #rstats.",
         doc2 = "The £1,000,000 question.",
         doc4 = "Line 1.\nLine2\n\nLine3.",
         doc5 = "?",
         doc6 = "Self-aware machines! \U0001f600",
         doc7 = "Qu'est-ce que c'est?")

tokenize_word2(txt)
tokenize_word2(txt, split_hyphens = TRUE)  # non-default: split hyphenated words
tokenize_word1(txt, split_hyphens = FALSE)
tokenize_word4(txt, split_hyphens = FALSE, split_elisions = TRUE)
tokenize_fasterword(txt)
tokenize_fastestword(txt)
tokenize_sentence(txt)
tokenize_character(txt[2])

## End(Not run)
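In ordinary use these internals are reached through tokens() rather than called directly; a sketch, assuming the "what" values offered by recent quanteda releases:

library(quanteda)
tokens("Self-aware machines!", what = "word")        # current default word tokenizer
tokens("Self-aware machines!", what = "fasterword")  # faster, whitespace-based splitting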