atom_dtm | R Documentation |
atom_dtm
Takes a corpus, tokenized or not, and creates the corresponding
DocumentTermMatrix (DTM), stored as a sparse simple_triplet_matrix
(see Details).
atom_dtm(corpus, step = 500L, parallel = FALSE, ...,
  ncores = parallel::detectCores() - 1)

## S3 method for class 'list'
atom_dtm(corpus, step = 500L, parallel = FALSE, ...,
  ncores = parallel::detectCores() - 1)

## S3 method for class 'VCorpus'
atom_dtm(corpus, step = 500L, parallel = FALSE, ...,
  ncores = parallel::detectCores() - 1)

## S3 method for class 'character'
atom_dtm(corpus, step = 500L, parallel = FALSE, ...,
  ncores = parallel::detectCores() - 1,
  docs_or_tokens = c("docs", "tokens"))

## Default S3 method:
atom_dtm(corpus, step = 500L, parallel = FALSE, ...,
  ncores = parallel::detectCores() - 1)
corpus |
(list) of documents, or a list of character vectors, each element reporting tokens from a document |
step |
(num) integer value (default is 500L) used to break the procedure into parts of bounded maximum size |
parallel |
(lgl) if |
... |
further options passed to the function |
ncores |
(int) number of cores to use in the parallel computation (default is the number of machine cores minus one) |
docs_or_tokens |
(chr) if |
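The role of step can be illustrated with a minimal base R sketch. It only shows one plausible way to split a sequence of documents into parts of bounded size; the helper chunk_indices is hypothetical and is not part of the package, and the actual chunking done by atom_dtm may differ.

```r
# Hypothetical helper (an assumption, not the package's code):
# split the indices 1..n into consecutive chunks of at most `step` elements.
chunk_indices <- function(n, step = 500L) {
  idx <- seq_len(n)
  split(idx, ceiling(idx / step))
}

# 1200 hypothetical documents processed in parts of at most 500.
chunks <- chunk_indices(1200L, step = 500L)
lengths(chunks)
```

Each chunk could then be processed on its own (sequentially, or by separate workers when a parallel backend is used) and the partial results combined.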
The simple triplet matrix representation considers three vectors i, j, v,
in which the indices i, j represent respectively the row (document) and the
column (term/token) coordinates of an entry, and v represents its weight
(commonly the frequency).
Moreover, for compatibility reasons (some R implementations of machine
learning algorithms use a different convention for the representation of
sparse matrices), the indices are ordered with priority i, j.
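As an illustration of the triplet representation described above, the following base R sketch builds the vectors i, j, v for a toy tokenized corpus and orders them with priority i, then j. The objects docs, terms, occ and triplets are names chosen for this example only; this is not the code used internally by atom_dtm.

```r
# Hypothetical toy corpus: each element holds the tokens of one document.
docs <- list(c("one"), c("two"), c("one", "two", "one"))

# Vocabulary (the columns), sorted alphabetically for this sketch.
terms <- sort(unique(unlist(docs)))

# One row per token occurrence: i = document index, j = term index.
occ <- data.frame(
  i = rep(seq_along(docs), lengths(docs)),
  j = match(unlist(docs), terms),
  v = 1L
)

# Sum the occurrences of each (i, j) pair to obtain the weights v.
triplets <- aggregate(v ~ i + j, data = occ, FUN = sum)

# Order the entries with priority i, then j.
triplets <- triplets[order(triplets$i, triplets$j), ]
rownames(triplets) <- NULL
triplets
```

Only the non-zero entries are stored, which is what makes the representation sparse: here four triplets describe a 3 x 2 matrix.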
An object of classes DocumentTermMatrix and simple_triplet_matrix,
weighted with simple term frequencies, representing a document-term
matrix in which each row represents a document, each column a term
(or token), and each entry the simple frequency of the term in the
document.
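To make the row/column/entry reading concrete, this base R sketch expands a set of hand-written triplets into a dense matrix. The values of i, j, v and terms are hypothetical and are not produced by atom_dtm; the sketch only shows how such triplets map onto documents (rows) and terms (columns).

```r
# Hypothetical triplets for three documents and two terms.
i <- c(1L, 2L, 3L, 3L)      # document (row) indices
j <- c(1L, 2L, 1L, 2L)      # term (column) indices
v <- c(1L, 1L, 2L, 1L)      # term frequencies
terms <- c("one", "two")

# Start from an all-zero matrix and fill in the stored entries.
dense <- matrix(
  0L,
  nrow = max(i), ncol = length(terms),
  dimnames = list(docs = as.character(seq_len(max(i))), terms = terms)
)
dense[cbind(i, j)] <- v
dense
```

A dense view like this is practical only for small examples; for real corpora the sparse form avoids storing the zeros.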
data(liu_4h28)
corpus <- data2corpus(liu_4h28)
atom_dtm(corpus)

atom_dtm(c('one', 'two', 'one two'))  # three documents, two tokens
atom_dtm(c('one', 'two', 'one two'),
  docs_or_tokens = 'tokens')          # one document

## Not run:
atom_dtm(corpus, parallel = TRUE)     # parallel computation
atom_dtm(c(1, 2, 3))                  # error
## End(Not run)