create_vocabulary | R Documentation |
This function collects unique terms and corresponding statistics. See below for details.
create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L),
stopwords = character(0), sep_ngram = "_", window_size = 0L, ...)
vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L),
stopwords = character(0), sep_ngram = "_", window_size = 0L, ...)
## S3 method for class 'character'
create_vocabulary(it, ngram = c(ngram_min = 1L,
ngram_max = 1L), stopwords = character(0), sep_ngram = "_",
window_size = 0L, ...)
## S3 method for class 'itoken'
create_vocabulary(it, ngram = c(ngram_min = 1L,
ngram_max = 1L), stopwords = character(0), sep_ngram = "_",
window_size = 0L, ...)
## S3 method for class 'itoken_parallel'
create_vocabulary(it, ngram = c(ngram_min = 1L,
ngram_max = 1L), stopwords = character(0), sep_ngram = "_",
window_size = 0L, ...)
it |
iterator over a |
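ngram |
integer vector, the lower and upper boundary of the range of n-gram values. |
stopwords |
character vector of stopwords. |
sep_ngram |
character separator for ngrams. |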
window_size |
|
... |
placeholder for additional arguments (not used at the moment). |
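As a rough sketch of how ngram, stopwords and sep_ngram work together, the call below builds a vocabulary of unigrams and bigrams from two toy documents. The sample texts and the variable names (txt, it, vocab) are invented for illustration and are not part of the package.

```r
library(text2vec)

# Two toy documents (illustrative data, not from the package)
txt = c("the quick brown fox", "the lazy dog")
it = itoken(txt, preprocessor = tolower, tokenizer = word_tokenizer,
            progressbar = FALSE)

# Collect unigrams and bigrams, skip the stopword "the",
# and join the parts of each bigram with "_"
vocab = create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 2L),
                          stopwords = "the", sep_ngram = "_")

# Bigrams show up as single terms such as "quick_brown"
print(vocab$term)
```

Choosing a separator that cannot occur inside a token keeps n-gram terms distinguishable from ordinary unigrams.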
A text2vec_vocabulary object, which is a data.frame with the following columns:
term |
|
term_count |
|
doc_count |
|
It also stores meta-information in attributes:
ngram: integer vector, the lower and upper boundary of the range of n-gram values.
document_count: integer, the number of documents the vocabulary was built from.
stopwords: character vector of stopwords.
sep_ngram: character separator for ngrams.
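The columns and attributes listed above can be inspected with ordinary data.frame accessors and attr(). The following is a small sketch on two made-up documents; the texts and variable names are assumptions chosen for illustration.

```r
library(text2vec)

# Two toy documents (illustrative data)
txt = c("first toy document", "second toy document")
it = itoken(txt, tolower, word_tokenizer, progressbar = FALSE)
vocab = create_vocabulary(it)

# Columns of the underlying data.frame
print(colnames(vocab))

# Meta-information stored as attributes
print(attr(vocab, "ngram"))
print(attr(vocab, "document_count"))
print(attr(vocab, "stopwords"))
print(attr(vocab, "sep_ngram"))
```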
character: creates a text2vec_vocabulary from a predefined character vector. Terms will be inserted as is, without any checks (ngrams number, ngram delimiters, etc.).
itoken: collects unique terms and corresponding statistics from the object.
itoken_parallel: collects unique terms and corresponding statistics from the iterator.
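To illustrate the character method described above, the sketch below builds a vocabulary directly from a predefined term list instead of an iterator. The term list is a hypothetical example; because the terms are inserted as is, no corpus statistics are collected here.

```r
library(text2vec)

# A predefined list of terms (illustrative)
terms = c("apple", "banana", "cherry")
vocab = create_vocabulary(terms)

# The result is still a text2vec_vocabulary holding the supplied terms
print(class(vocab))
print(vocab$term)
```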
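library(text2vec)
data("movie_review")
# use the first 100 reviews as a small corpus
txt = movie_review[['review']][1:100]
# lowercase, tokenize into words, and iterate over the corpus in 10 chunks
it = itoken(txt, tolower, word_tokenizer, n_chunks = 10)
vocab = create_vocabulary(it)
# drop rare and overly common terms, and cap the vocabulary size
pruned_vocab = prune_vocabulary(vocab, term_count_min = 10, doc_proportion_max = 0.8,
                                doc_proportion_min = 0.001, vocab_term_max = 20000)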