vocab() creates a vocabulary from a text corpus; update_vocab() and prune_vocab() update and prune an existing vocabulary, respectively.
vocab(corpus, ngram = c(1, 1), ngram_sep = "_",
  regex = "[[:space:]]+")
update_vocab(vocab, corpus)
prune_vocab(vocab, max_terms = Inf, term_count_min = 1L,
  term_count_max = Inf, doc_proportion_min = 0,
  doc_proportion_max = 1, doc_count_min = 1L, doc_count_max = Inf,
  nbuckets = attr(vocab, "nbuckets"))
corpus | A collection of ASCII or UTF-8 encoded documents. It can be a list of character vectors, a character vector, or a data.frame with at least two columns: id and documents. See details.
ngram | a vector of length 2 of the form
ngram_sep | separator to link terms within ngrams.
regex | a regexp to be used for segmentation of documents when
vocab |
max_terms | maximum number of terms to preserve
term_count_min | keep terms occurring at least this many times over all docs
term_count_max | keep terms occurring at most this many times over all docs
doc_count_min, doc_proportion_min | keep terms appearing in at least this many docs (given as a count or as a proportion of all docs, respectively)
doc_count_max, doc_proportion_max | keep terms appearing in at most this many docs (given as a count or as a proportion of all docs, respectively)
nbuckets | How many unknown buckets to create along the remaining terms of the pruned
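To make the pruning thresholds concrete, the following base R sketch computes per-term totals and per-term document counts for a toy corpus and applies cut-offs analogous to term_count_min and doc_proportion_min. It does not call prune_vocab() and only approximates the kind of filtering described above; the toy corpus and all variable names are illustrative assumptions.

```r
# Toy corpus (hypothetical): a named list of tokenized documents.
toy <- list(a = c("the", "quick", "fox"),
            b = c("the", "lazy", "dog", "the"))

# Total occurrences of each term over all documents.
term_count <- table(unlist(toy, use.names = FALSE))

# Number of documents each term appears in (each term counted once per document).
doc_count <- table(unlist(lapply(toy, unique), use.names = FALSE))

# The same quantity expressed as a proportion of all documents.
doc_proportion <- doc_count / length(toy)

# Analogue of term_count_min = 2: terms occurring at least twice overall.
keep_by_count <- names(term_count)[term_count >= 2]

# Analogue of doc_proportion_min = 0.75: terms present in at least 75% of documents.
keep_by_proportion <- names(doc_proportion)[doc_proportion >= 0.75]
```

In this toy corpus only "the" passes either threshold, since every other term occurs once and in a single document.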
When corpus is a character vector, each string is tokenized with regex using the internal tokenizer. When corpus has names, the names are used to name the output whenever appropriate.
When corpus is a data.frame, the documents must be in the last column, which can be either a list of strings or a character vector. All other columns are considered document ids. If the first column is a character vector, most functions will use it to name the output.
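The input shapes described above can be sketched in base R. Here strsplit() stands in for the package's internal tokenizer, which is an assumption made purely for illustration: the two use different regex engines, so results are not guaranteed to match for every pattern. The documents and variable names are made up.

```r
# Named character vector: one string per document (hypothetical data).
docs <- c(a = "The quick brown fox", b = "jumps over  the lazy dog")

# Split each string on runs of whitespace, mirroring the default regex argument.
# strsplit() is a stand-in for the internal tokenizer, not the same code.
tokens <- strsplit(docs, "[[:space:]]+")

# The names of the input carry over to the tokenized list.
names(tokens)

# Data.frame form: ids in the first column, documents in the last column.
docs_df <- data.frame(id = names(docs), documents = unname(docs),
                      stringsAsFactors = FALSE)
```

Both docs and docs_df have the shape that the corpus argument is described as accepting, and tokens corresponds to the list-of-character-vectors form.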
https://en.cppreference.com/w/cpp/regex/ecmascript
# A small corpus: a named list of tokenized documents
corpus <-
  list(a = c("The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"),
       b = c("the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog",
             "the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"))

# Build vocabularies with the default and with other ngram settings
vocab(corpus)
vocab(corpus, ngram = 3)
vocab(corpus, ngram = c(2, 3))

# Update an existing vocabulary with additional documents
v <- vocab(corpus)
extra_corpus <- list(extras = c("apples", "oranges"))
v <- update_vocab(v, extra_corpus)
v

# Prune by maximum number of terms, by minimum term count,
# and with unknown buckets
prune_vocab(v, max_terms = 7)
prune_vocab(v, term_count_min = 2)
prune_vocab(v, max_terms = 7, nbuckets = 2)