View source: R/asDocumentTermMatrix.R
asDocumentTermMatrix | R Documentation |
Constructs a document-term matrix.
asDocumentTermMatrix(
  input,
  vect.vocab = NULL,
  stopwords = character(0),
  stemming = NULL,
  type = c("dgCMatrix", "dgTMatrix", "lda_c")
)
input | a character vector. |
vect.vocab | a vocabulary created with |
stopwords | character vector of stopwords to exclude when creating the vocabulary. |
stemming | |
type | character, one of c("dgCMatrix", "dgTMatrix", "lda_c") taken from |
A list with two elements:
dtm: a sparse document-term matrix whose format depends on the type parameter
vect.vocab: a vocabulary that can be passed as vect.vocab
to build a document-term matrix on new data with the same vocabulary.
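To illustrate what the two returned elements represent, here is a minimal base R sketch. It is not the implementation of asDocumentTermMatrix: count_terms is a hypothetical helper written only for this illustration, it lowercases and splits on whitespace (an assumption), and it returns a dense matrix rather than a sparse one.

```r
# Hypothetical helper for illustration only; not part of the documented package.
count_terms <- function(docs, vocab = NULL) {
  # Assumption: lowercase and split on whitespace.
  tokens <- strsplit(trimws(tolower(docs)), "[[:space:]]+")
  # Without a given vocabulary, learn it from the documents.
  if (is.null(vocab)) vocab <- sort(unique(unlist(tokens)))
  # One row per document, one column per vocabulary term.
  m <- matrix(0L, nrow = length(docs), ncol = length(vocab),
              dimnames = list(NULL, vocab))
  for (i in seq_along(tokens)) {
    # Terms outside the vocabulary are dropped.
    known <- tokens[[i]][tokens[[i]] %in% vocab]
    if (length(known) > 0) {
      counts <- table(known)
      m[i, names(counts)] <- as.integer(counts)
    }
  }
  list(dtm = m, vocab = vocab)
}

x <- c("Verkauf von Schreibwaren", "Verkauf", "Schreibwaren")
res <- count_terms(x)
res$dtm

# Reusing the vocabulary on new data keeps the same columns.
new <- count_terms(c("WILL NOT SHOW UP", "Verkauf"), vocab = res$vocab)
new$dtm
```

Because the vocabulary is fixed in the second call, the new matrix has the same columns as the first, and documents made up only of unknown terms produce an all-zero row.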
See http://text2vec.org/vectorization.html for details on the implementation used here.
Another implementation, TermDocumentMatrix, is slower.
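The text2vec workflow described at the linked page can be sketched roughly as follows. This is an illustration only, not the source of asDocumentTermMatrix; it assumes the text2vec package is installed, and the choice of tolower, word_tokenizer and the single stopword "von" are assumptions made for the example.

```r
library(text2vec)

x <- c("Verkauf von Schreibwaren", "Verkauf", "Schreibwaren")

# Iterator over tokens (assumption: lowercase, split into words).
it <- itoken(x, preprocessor = tolower, tokenizer = word_tokenizer,
             progressbar = FALSE)

# Vocabulary without the stopword, then a vectorizer built from it.
vocab <- create_vocabulary(it, stopwords = "von")
vectorizer <- vocab_vectorizer(vocab)

# Sparse document-term matrix.
dtm <- create_dtm(it, vectorizer, type = "dgCMatrix")
dim(dtm)

# The same vectorizer applied to new data yields the same columns.
new_it <- itoken(c("will not show up", "verkauf"),
                 tokenizer = word_tokenizer, progressbar = FALSE)
new_dtm <- create_dtm(new_it, vectorizer)
dim(new_dtm)
```

Keeping the vectorizer around is what allows new documents to be mapped into the same column space later, which is the role vect.vocab plays here.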
x <- c("Verkauf von Schreibwaren", "Verkauf", "Schreibwaren")
asDocumentTermMatrix(x)
asDocumentTermMatrix(x, type = "dgTMatrix")
asDocumentTermMatrix(x, stopwords = tm::stopwords("de"))
(x <- c(
  "Verkauf von B\xfcchern, Schreibwaren",
  "Fach\xe4rzin f\xfcr Kinder- und Jugendmedizin im \xf6ffentlichen Gesundheitswesen",
  "Industriemechaniker",
  "Dipl.-Ing. - Agrarwirtschaft (Landwirtschaft)"
))
x <- stringPreprocessing(x)
dtm <- asDocumentTermMatrix(x, stemming = "de")
print(dtm$dtm)
dimnames(dtm$dtm)[[2]]
# use the newly created vocab_vectorizer
(x <- stringPreprocessing(c(
  "WILL NOT SHOW UP",
  "Verkauf von B\xfcchern, Schreibwaren",
  "Fach\xe4rzin f\xfcr Kinder- und Jugendmedizin"
)))
asDocumentTermMatrix(x, vect.vocab = dtm$vect.vocab, stopwords = character(0), stemming = "de")$dtm