asDocumentTermMatrix: Document-Term Matrix

asDocumentTermMatrix    R Documentation

Document-Term Matrix

Description

Constructs a sparse document-term matrix from a character vector, optionally removing stopwords and stemming the terms.

Usage

asDocumentTermMatrix(
  input,
  vect.vocab = NULL,
  stopwords = character(0),
  stemming = NULL,
  type = c("dgCMatrix", "dgTMatrix", "lda_c")
)

Arguments

input

a character vector.

vect.vocab

a vocabulary created with vocab_vectorizer. If NULL, the vocabulary is created from the input. See the last example for a typical use case.

stopwords

character vector of stopwords to exclude when creating the vocabulary. tm::stopwords("de") provides German stopwords.

stemming

NULL for no stemming, or "de" for stemming with the German Porter stemmer.

type

character, one of c("dgCMatrix", "dgTMatrix", "lda_c"), passed on to create_dtm. A dgCMatrix is useful for glmnet; a dgTMatrix is a sparse matrix in triplet form, i.e. the positions of all non-zero values are stored (easier to work with, but the representation is not unique).
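The difference between the two sparse formats can be illustrated with the Matrix package alone. This is a minimal sketch, not part of this package: the toy counts and the object names (counts, triplet, dup) are made up for illustration.

```r
library(Matrix)

# Toy document-term counts: 3 documents (rows) x 3 terms (columns).
counts <- sparseMatrix(
  i = c(1, 1, 1, 2, 3),
  j = c(1, 2, 3, 1, 3),
  x = c(1, 1, 1, 1, 1),
  dims = c(3, 3),
  dimnames = list(NULL, c("verkauf", "von", "schreibwaren"))
)
class(counts)   # compressed column format, as expected by glmnet

# Convert to triplet form: one (row, column, value) entry per stored value.
triplet <- as(counts, "TsparseMatrix")
class(triplet)

# The i and j slots are zero-based, so add 1 to get R indices.
data.frame(
  doc   = triplet@i + 1L,
  term  = colnames(triplet)[triplet@j + 1L],
  count = triplet@x
)

# "Non-unique": the same cell may be stored several times in triplet form;
# duplicate entries are summed when the matrix is used.
dup <- new("dgTMatrix", i = c(0L, 0L), j = c(0L, 0L), x = c(1, 2),
           Dim = c(1L, 1L))
as.matrix(dup)
```

Both objects hold the same counts; only the storage layout differs, which is why the triplet form is convenient for inspecting individual entries.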

Value

A list with two elements:

dtm

a sparse document-term matrix whose class depends on the type argument.

vect.vocab

a vocabulary that can be passed as vect.vocab to build a document-term matrix on new data with the same vocabulary.
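Why reusing the vocabulary matters can be sketched directly with text2vec, the library referenced under See Also. This is an illustration under assumptions, not this function's actual source: the variable names and the lower-cased toy texts are hypothetical, and no stopword removal or stemming is applied.

```r
library(text2vec)

train    <- c("verkauf von schreibwaren", "verkauf", "schreibwaren")
new_docs <- c("will not show up", "verkauf von buechern schreibwaren")

# Build the vocabulary and the vectorizer once, on the training texts.
it_train   <- itoken(train, tokenizer = word_tokenizer, progressbar = FALSE)
vocab      <- create_vocabulary(it_train)
vectorizer <- vocab_vectorizer(vocab)
dtm_train  <- create_dtm(it_train, vectorizer, type = "dgCMatrix")

# Apply the same vectorizer to new texts: the columns stay identical,
# and terms that are not in the vocabulary are silently dropped.
it_new  <- itoken(new_docs, tokenizer = word_tokenizer, progressbar = FALSE)
dtm_new <- create_dtm(it_new, vectorizer, type = "dgCMatrix")

colnames(dtm_train)
colnames(dtm_new)
as.matrix(dtm_new)
```

Because both matrices share the same columns, a model fitted on the first matrix can be applied to the second without realigning features.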

See Also

http://text2vec.org/vectorization.html for details on the implementation used here. Another implementation, TermDocumentMatrix, is slower.

Examples

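# Basic usage: the vocabulary is built from the input itself
x <- c("Verkauf von Schreibwaren", "Verkauf", "Schreibwaren")
asDocumentTermMatrix(x)
# Return the matrix in triplet form
asDocumentTermMatrix(x, type = "dgTMatrix")
# Exclude German stopwords when creating the vocabulary
asDocumentTermMatrix(x, stopwords = tm::stopwords("de"))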

# Preprocess the strings, then build a stemmed document-term matrix
(x <- c(
  "Verkauf von B\xfcchern, Schreibwaren",
  "Fach\xe4rzin f\xfcr Kinder- und Jugendmedizin im \xf6ffentlichen Gesundheitswesen",
  "Industriemechaniker",
  "Dipl.-Ing. - Agrarwirtschaft (Landwirtschaft)"
))
x <- stringPreprocessing(x)
dtm <- asDocumentTermMatrix(x, stemming = "de")
print(dtm$dtm)
# The column names are the (stemmed) terms of the vocabulary
dimnames(dtm$dtm)[[2]]

# Reuse the vocabulary created above on new data;
# terms that are not in it do not appear in the resulting matrix
(x <- stringPreprocessing(c(
  "WILL NOT SHOW UP",
  "Verkauf von B\xfcchern, Schreibwaren",
  "Fach\xe4rzin f\xfcr Kinder- und Jugendmedizin"
)))
asDocumentTermMatrix(x, vect.vocab = dtm$vect.vocab, stopwords = character(0), stemming = "de")$dtm

malsch/occupationCoding documentation built on March 14, 2024, 8:09 a.m.