atom_dtm: Create a dtm from a corpus (tf weights)

View source: R/atom_dtm.R

atom_dtm    R Documentation

Create a dtm from a corpus (tf weights)

Description

atom_dtm takes a corpus, tokenized or not, and creates the corresponding DocumentTermMatrix (DTM), stored as a sparse simple_triplet_matrix (see Details).

Usage

atom_dtm(corpus, step = 500L, parallel = FALSE, ...,
  ncores = parallel::detectCores() - 1)

## S3 method for class 'list'
atom_dtm(corpus, step = 500L, parallel = FALSE, ...,
  ncores = parallel::detectCores() - 1)

## S3 method for class 'VCorpus'
atom_dtm(corpus, step = 500L, parallel = FALSE, ...,
  ncores = parallel::detectCores() - 1)

## S3 method for class 'character'
atom_dtm(corpus, step = 500L, parallel = FALSE, ...,
  ncores = parallel::detectCores() - 1, docs_or_tokens = c("docs",
  "tokens"))

## Default S3 method:
atom_dtm(corpus, step = 500L, parallel = FALSE, ...,
  ncores = parallel::detectCores() - 1)

Arguments

corpus

(list) a list of documents, or a list of character vectors in which each element holds the tokens of one document

step

(num) integer value (default is 500L) used to split the procedure into chunks of at most step documents each. This helps to avoid exhausting the RAM.

parallel

(lgl) if TRUE (default is FALSE), run the computation in parallel using a makePSOCKcluster backend with max - 1 cores.

...

further options passed to the function

ncores

(int) number of cores to use in the parallel computation (default is the number of machine cores minus one)

docs_or_tokens

(chr) if "docs" (default), each element of the character vector represents one document; if "tokens", the elements together represent the sequence of tokens of a single document
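The chunking controlled by step can be pictured with a minimal base R sketch. The helper chunk_indices below is hypothetical and is not part of the package; it only shows one way to split document indices into consecutive groups of at most step elements, and the package's internal logic may differ.

```r
# Hypothetical helper (not part of the package): split the indices of
# n_docs documents into consecutive chunks of at most `step` elements.
chunk_indices <- function(n_docs, step = 500L) {
  idx <- seq_len(n_docs)
  # ceiling(idx / step) assigns chunk 1 to the first `step` indices,
  # chunk 2 to the next `step`, and so on.
  unname(split(idx, ceiling(idx / step)))
}

chunks <- chunk_indices(1200L, step = 500L)
lengths(chunks)  # 500 500 200
```

Each chunk could then be processed on its own and the partial results combined, so that only step documents need to be held in memory at a time.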

Details

The simple triplet matrix representation uses three vectors i, j, v, in which the indices i and j give respectively the row (document) and the column (term/token) coordinates of an entry, and v gives its weight (commonly the frequency).

Moreover, for compatibility reasons (some R implementations of machine learning algorithms use a different convention for representing sparse matrices), the indices are ordered with priority i, j, that is, by row first and then by column.
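As an illustration of the triplet layout described above, the following base R sketch derives i, j and v from a small dense matrix and orders them with priority i, j. The matrix and the object names are made up for the example; this is not the package's implementation.

```r
# A small dense term-frequency matrix: 3 documents (rows), 2 terms (columns).
m <- matrix(c(1, 0,
              0, 1,
              1, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("doc1", "doc2", "doc3"), c("one", "two")))

# Coordinates of the non-zero entries. which() scans column by column,
# so the raw result is ordered by column (j) first.
nz <- which(m != 0, arr.ind = TRUE)

# Reorder with priority i (row), then j (column).
ord <- order(nz[, "row"], nz[, "col"])
i <- unname(nz[ord, "row"])
j <- unname(nz[ord, "col"])
v <- m[cbind(i, j)]

# Plain-list stand-in for a triplet matrix. Assumption: with the slam
# package installed, slam::simple_triplet_matrix(i, j, v, nrow = 3, ncol = 2)
# would build an object of the class mentioned above.
triplet <- list(i = i, j = j, v = v, nrow = nrow(m), ncol = ncol(m))
```

Only the four non-zero cells are stored, which is what makes the representation economical for sparse document-term matrices.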

Value

an object of both classes DocumentTermMatrix and simple_triplet_matrix, weighted with simple term frequencies, representing a document-term matrix in which each row represents a document, each column a term (or token), and each entry the simple frequency of the term in the document.
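To make the shape of the result concrete, here is a rough base R sketch of a dense term-frequency matrix under the two docs_or_tokens interpretations. The helper tf_matrix is hypothetical; it assumes whitespace tokenization in "docs" mode and takes the elements as ready-made tokens in "tokens" mode, which may not match what atom_dtm does internally.

```r
# Hypothetical helper (not part of the package): dense documents x terms
# matrix of simple term frequencies.
tf_matrix <- function(x, docs_or_tokens = c("docs", "tokens")) {
  docs_or_tokens <- match.arg(docs_or_tokens)
  tokens <- if (docs_or_tokens == "docs") {
    strsplit(x, "\\s+")  # assumption: one document per element, split on whitespace
  } else {
    list(x)              # assumption: every element is a token of one document
  }
  terms <- sort(unique(unlist(tokens)))
  # Count each term per document; factor() keeps zero counts for absent terms.
  counts <- lapply(tokens, function(doc) {
    as.integer(table(factor(doc, levels = terms)))
  })
  matrix(unlist(counts), nrow = length(tokens), byrow = TRUE,
         dimnames = list(as.character(seq_along(tokens)), terms))
}

tf_docs <- tf_matrix(c("one", "two", "one two"))             # 3 rows, 2 columns
tf_tok  <- tf_matrix(c("one", "two", "one"), "tokens")       # 1 row, 2 columns
```

In the first call every element is a document, so the matrix has three rows; in the second call the three elements are tokens of a single document, so there is one row with counts 2 and 1.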

Examples

data(liu_4h28)
corpus <- data2corpus(liu_4h28)
atom_dtm(corpus)
atom_dtm(c('one', 'two', 'one two'))             # three documents, two tokens
atom_dtm(c('one', 'two', 'one two'), docs_or_tokens = 'tokens')    # one document

## Not run: 
  atom_dtm(corpus, parallel = TRUE)                    # parallel computation
  atom_dtm(c(1, 2, 3))                                 # error

## End(Not run)

UBESP-DCTV/costumer documentation built on Feb. 1, 2023, 4:52 a.m.