build_dtm: build_dtm


View source: R/matrix.R

Description

Compute a document-term matrix from a corpus.

Usage

build_dtm(
  corpus,
  sparsity = 1,
  dictionary = NULL,
  remove_stopwords = FALSE,
  tolower = TRUE,
  remove_punctuation = TRUE,
  remove_numbers = TRUE,
  min_length = 2
)

Arguments

corpus

A Corpus object.

sparsity

Value between 0 and 1 giving the maximum proportion of documents in which a term may be absent: a term that does not occur in more than this proportion of documents is dropped. By default all terms are kept (sparsity=1).
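To make the threshold concrete, here is a minimal base-R sketch of this kind of sparsity filter on a toy count matrix. It is an illustration only, not the package's implementation: filter_sparse is a hypothetical helper, and keeping terms whose sparsity is exactly equal to the threshold is an assumption.

```r
# Toy document-term count matrix: 4 documents (rows), 3 terms (columns).
counts <- matrix(
  c(1, 0, 0,
    2, 1, 0,
    1, 0, 0,
    3, 1, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("oil", "price", "opec"))
)

# Proportion of documents in which each term never occurs.
term_sparsity <- colMeans(counts == 0)

# Hypothetical helper (not part of R.temis): keep the terms whose
# sparsity does not exceed the threshold. Treating the boundary as
# inclusive is an assumption made for this sketch.
filter_sparse <- function(m, sparsity) {
  m[, colMeans(m == 0) <= sparsity, drop = FALSE]
}

kept_all  <- colnames(filter_sparse(counts, 1))    # nothing is dropped
kept_half <- colnames(filter_sparse(counts, 0.5))  # "opec" is dropped
```

Here "opec" is absent from three of the four documents (sparsity 0.75), so it is removed at a threshold of 0.5 but kept at the default of 1.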

dictionary

A vector of terms to which the matrix should be restricted. By default, all words with at least min_length characters are considered.

remove_stopwords

Whether to remove stopwords appearing in a language-specific list (see tm::stopwords).

tolower

Whether to convert all text to lower case.

remove_punctuation

Whether to remove all punctuation from text before tokenizing terms.

remove_numbers

Whether to remove all numbers from text before tokenizing terms.

min_length

The minimum number of characters for a word to be retained.
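The text-cleaning arguments above (tolower, remove_punctuation, remove_numbers, min_length) can be pictured with a rough base-R sketch on a single string. This only illustrates what each option means; the regular expressions and the whitespace tokenization are assumptions, and the package's actual tokenizer may behave differently.

```r
# Illustrative input; not taken from the package's example corpus.
text <- "Oil prices rose 3% in 1987, OPEC said."

x <- tolower(text)                   # tolower = TRUE
x <- gsub("[[:punct:]]+", " ", x)    # remove_punctuation = TRUE
x <- gsub("[[:digit:]]+", " ", x)    # remove_numbers = TRUE

# Split on whitespace (an assumed, simplified tokenization).
tokens <- strsplit(trimws(x), "[[:space:]]+")[[1]]

# min_length = 3: keep only words with at least 3 characters.
min_length <- 3
tokens <- tokens[nchar(tokens) >= min_length]
```

With these settings the two-letter word "in" is discarded; with the default min_length = 2 it would be kept.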

Value

A DocumentTermMatrix object.

Examples

file <- system.file("texts", "reut21578-factiva.xml", package="tm.plugin.factiva")
corpus <- import_corpus(file, "factiva", language="en")
build_dtm(corpus)

R.temis documentation built on May 13, 2021, 1:08 a.m.
