itoken | R Documentation |
This family of functions creates iterators over input objects in order to create vocabularies, or DTM and TCM matrices. These iterators are usually consumed by the following functions: create_vocabulary, create_dtm, vectorizers, create_tcm. See them for details.
itoken(iterable, ...)
## S3 method for class 'character'
itoken(iterable, preprocessor = identity,
tokenizer = space_tokenizer, n_chunks = 10,
progressbar = interactive(), ids = NULL, ...)
## S3 method for class 'list'
itoken(iterable, n_chunks = 10,
progressbar = interactive(), ids = names(iterable), ...)
## S3 method for class 'iterator'
itoken(iterable, preprocessor = identity,
tokenizer = space_tokenizer, progressbar = interactive(), ...)
itoken_parallel(iterable, ...)
## S3 method for class 'character'
itoken_parallel(iterable, preprocessor = identity,
tokenizer = space_tokenizer, n_chunks = 10, ids = NULL, ...)
## S3 method for class 'iterator'
itoken_parallel(iterable, preprocessor = identity,
tokenizer = space_tokenizer, n_chunks = 1L, ...)
## S3 method for class 'list'
itoken_parallel(iterable, n_chunks = 10, ids = NULL,
...)
iterable |
an object from which to generate an iterator |
... |
arguments passed to other methods |
preprocessor |
a function which takes a chunk of character vectors and performs all pre-processing; usually it should return a character vector of preprocessed/cleaned documents |
tokenizer |
a function which takes a character vector from the preprocessor, splits it into tokens, and returns a list of character vectors |
n_chunks |
integer, the number of pieces the input object should be divided into; each chunk is then processed independently (and, for itoken_parallel, possibly in parallel). More chunks mean a lower memory footprint but slower processing when preprocessor and tokenizer are efficiently vectorized |
progressbar |
logical, whether to display a progress bar |
ids |
vector of document ids; if not provided, names(iterable) will be used, and if those are NULL, incremental ids will be assigned |
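To illustrate the preprocessor/tokenizer contract described above, here is a minimal sketch (assuming the text2vec package is attached); `my_tokenizer` is a hypothetical custom tokenizer, not part of the package:

```r
library(text2vec)

# hypothetical custom tokenizer: split on runs of non-word characters;
# it takes a character vector and returns a list of character vectors,
# as the tokenizer argument requires
my_tokenizer = function(x) strsplit(x, "\\W+")

it = itoken(c("Hello world!", "Another doc."),
            preprocessor = tolower,   # applied before tokenization
            tokenizer = my_tokenizer,
            n_chunks = 1,
            progressbar = FALSE)
```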
S3 methods for creating an itoken iterator from a list of tokens, raw character vectors, files, or a directory
list
: all elements of the input list should be
character vectors containing tokens
character
: a raw text source; the user must provide a tokenizer function
ifiles
: from files; the user must provide a function to read in the file
(to ifiles) and a function to tokenize it (to itoken)
idir
: from a directory; the user must provide a function to
read in the files (to idir) and a function to tokenize them (to itoken)
ifiles_parallel
: from files in parallel
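For the list method, a short sketch (assuming text2vec is attached) of building a vocabulary from pre-tokenized documents; no preprocessor or tokenizer is needed because the list elements are already token vectors:

```r
library(text2vec)

# a named list of token vectors; names(tokens) become the document ids
tokens = list(doc1 = c("hello", "world"),
              doc2 = c("foo", "bar", "baz"))

it = itoken(tokens, progressbar = FALSE)
v = create_vocabulary(it)
```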
ifiles, idir, create_vocabulary, create_dtm, vectorizers, create_tcm
data("movie_review")
txt = movie_review$review[1:100]
ids = movie_review$id[1:100]
it = itoken(txt, tolower, word_tokenizer, n_chunks = 10)
it = itoken(txt, tolower, word_tokenizer, n_chunks = 10, ids = ids)
# Example of stemming tokenizer
# stem_tokenizer = function(x) {
#   lapply(word_tokenizer(x), SnowballC::wordStem, language = "en")
# }
it = itoken_parallel(movie_review$review[1:100], n_chunks = 4)
system.time(dtm <- create_dtm(it, hash_vectorizer(2**16), type = 'TsparseMatrix'))
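As a follow-up sketch (assuming text2vec is attached), the same iterator style can feed a vocabulary-based DTM instead of the hashing one; since iterators are reinitializable, the same `it` can be passed to both create_vocabulary and create_dtm:

```r
library(text2vec)
data("movie_review")
txt = movie_review$review[1:100]
ids = movie_review$id[1:100]

it = itoken(txt, tolower, word_tokenizer, ids = ids, progressbar = FALSE)
v = create_vocabulary(it)          # collect terms over all chunks
vectorizer = vocab_vectorizer(v)   # map terms to columns
dtm = create_dtm(it, vectorizer)   # 100 documents x nrow(v) terms
```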