asSTMCorpus: STM Corpus Coercion

asSTMCorpus    R Documentation

STM Corpus Coercion

Description

Convert a set of document-term counts and associated metadata to the form required for processing by the stm function.

Usage

asSTMCorpus(documents, vocab, data = NULL, ...)

Arguments

documents

A document-by-term matrix of counts, or a set of counts in the format returned by prepDocuments. Supported matrix formats include quanteda dfm objects and Matrix sparse matrices of class "dgCMatrix" or "dgTMatrix".

vocab

Character vector specifying the words in the corpus, in the order of the vocab indices in documents. Each term in the vocabulary must appear at least once in the documents. See prepDocuments for dropping unused items from the vocabulary. If documents is a sparse matrix or quanteda dfm object, vocab must not be supplied; the vocabulary is already contained in the column names of the matrix.

data

An optional data frame containing the prevalence and/or content covariates. If unspecified, the variables are taken from the active environment.

...

Additional arguments passed to or from other methods.

Value

A list with components "documents", "vocab", and "data" in the form needed for further processing by the stm function.
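
As an illustration of the sparse-matrix input path and the returned list, here is a minimal sketch. The toy counts, the document and term names, and the treatment covariate are all hypothetical; only asSTMCorpus and the Matrix constructor are taken from the packages themselves.

```r
library(stm)
library(Matrix)

# Hypothetical toy counts: 3 documents (rows) by 4 terms (columns).
counts <- Matrix(
  c(2, 0, 1, 0,
    0, 3, 0, 1,
    1, 1, 0, 2),
  nrow = 3, byrow = TRUE, sparse = TRUE,
  dimnames = list(paste0("doc", 1:3),
                  c("apple", "banana", "cherry", "date"))
)

# Hypothetical covariate, one row per document.
meta <- data.frame(treatment = c(0, 1, 0))

# vocab is not supplied: it is read from the column names of the matrix.
out <- asSTMCorpus(counts, data = meta)

names(out)   # component names of the returned list
out$vocab    # vocabulary taken from colnames(counts)
```

The components can then be handed to stm by name, for example as documents = out$documents, vocab = out$vocab, and data = out$data, together with a choice of K.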

See Also

prepDocuments, stm

Examples


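library(stm)       # provides the gadarian data and asSTMCorpus()
library(quanteda)
gadarian_corpus <- corpus(gadarian, text_field = "open.ended.response")
# Recent quanteda versions build a dfm from tokens, so stopword removal
# and stemming are applied to the tokens rather than passed to dfm()
gadarian_toks <- tokens(gadarian_corpus)
gadarian_toks <- tokens_remove(gadarian_toks, stopwords("english"))
gadarian_toks <- tokens_wordstem(gadarian_toks)
gadarian_dfm <- dfm(gadarian_toks)
asSTMCorpus(gadarian_dfm)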


stm documentation built on June 24, 2024, 5:18 p.m.