convertCorpus: Convert 'stm' formatted documents to another format

convertCorpus R Documentation

Convert stm formatted documents to another format

Description

Takes stm formatted documents and vocab objects and returns them in a format usable by other packages.

Usage

convertCorpus(documents, vocab, type = c("slam", "lda", "Matrix"))

Arguments

documents

the documents object in stm format

vocab

the vocab object in stm format

type

the desired output type: one of "slam", "lda" or "Matrix". See Details.

Details

We also recommend the quanteda and tm packages for text preparation. The convertCorpus function is provided as a convenience utility for converting between formats, but if you intend to do text processing with a variety of output formats, you will likely want to start with quanteda or tm.

The various type conversions are described below:

type = "slam"

Converts to the simple triplet matrix representation used by the slam package. This is the format used internally by tm.
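
As a rough illustration of what this conversion involves, the sketch below builds a simple triplet matrix by hand from a tiny stm-style corpus. The toy docs and vocab objects are invented for this example, and the code is an assumption about the general approach rather than the package's own implementation. It relies only on slam::simple_triplet_matrix().

```r
library(slam)

# Toy corpus in stm format (hypothetical data): one two-row integer matrix
# per document; row 1 holds 1-based vocab indices, row 2 holds counts.
vocab <- c("apple", "banana", "cherry")
docs <- list(
  matrix(c(1L, 2L, 3L, 1L), nrow = 2),  # apple x2, cherry x1
  matrix(c(2L, 4L), nrow = 2)           # banana x4
)

# One (document, term, count) triplet per column of each document matrix.
i <- rep(seq_along(docs), times = vapply(docs, ncol, integer(1)))
j <- unlist(lapply(docs, function(d) d[1, ]))
v <- unlist(lapply(docs, function(d) d[2, ]))

dtm <- simple_triplet_matrix(
  i, j, v,
  nrow = length(docs), ncol = length(vocab),
  dimnames = list(docs = as.character(seq_along(docs)), terms = vocab)
)

# Dense view, for inspection only; avoid this on a real corpus.
as.matrix(dtm)
```

Rows correspond to documents and columns to vocabulary terms, which is the usual document-term layout.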

type = "lda"

Converts to the format used by the lda package. This is a very minor change, as the stm format is based on lda's data representation. The only difference is the indexing: stm numbers vocabulary entries starting from one, whereas lda numbers them starting from zero. Accordingly, this type returns a list containing the re-indexed documents object and the unchanged vocab object.
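
The index shift can be sketched in a few lines of base R. The toy docs list and the helper to_zero_indexed() are hypothetical names used only for illustration; this is a minimal sketch of the idea, not the code convertCorpus runs.

```r
# Toy stm-format documents (hypothetical data): row 1 = 1-based vocab index,
# row 2 = count.
docs <- list(
  matrix(c(1L, 2L, 3L, 1L), nrow = 2),
  matrix(c(2L, 4L), nrow = 2)
)

# Shift only the index row down by one; the counts stay untouched.
to_zero_indexed <- function(d) {
  d[1, ] <- d[1, ] - 1L
  d
}
lda_docs <- lapply(docs, to_zero_indexed)

lda_docs[[1]]
```

Because only the first row changes, the vocab vector can be passed along as-is.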

type = "Matrix"

Converts to the sparse matrix representation used by the Matrix package. This is the format used internally by numerous other text analysis packages.
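
One practical reason to want this format is that ordinary sparse matrix operations then apply directly. The sketch below assembles a small document-term matrix with Matrix::sparseMatrix() from invented toy data and computes term totals and document lengths. The docs and vocab objects are assumptions for illustration, and the construction is not necessarily how convertCorpus builds its result.

```r
library(Matrix)

# Toy corpus in stm format (hypothetical data).
vocab <- c("apple", "banana", "cherry")
docs <- list(
  matrix(c(1L, 2L, 3L, 1L), nrow = 2),  # apple x2, cherry x1
  matrix(c(2L, 4L), nrow = 2)           # banana x4
)

# Row index = document, column index = vocab entry, value = count.
i <- rep(seq_along(docs), times = vapply(docs, ncol, integer(1)))
j <- unlist(lapply(docs, function(d) d[1, ]))
x <- as.numeric(unlist(lapply(docs, function(d) d[2, ])))

dtm <- sparseMatrix(i = i, j = j, x = x,
                    dims = c(length(docs), length(vocab)),
                    dimnames = list(NULL, vocab))

term_totals <- colSums(dtm)  # corpus-wide count of each term
doc_lengths <- rowSums(dtm)  # number of tokens in each document
```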

If you want to write out a file containing the sparse matrix representation popularized by David Blei's C code ldac, see the function writeLdac.
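
For orientation, the LDA-C layout stores one document per line: the number of unique terms, followed by index:count pairs. The base R sketch below formats toy documents that way. The helper to_ldac_line() is hypothetical, the zero-based indices are an assumption about the target format, and writeLdac remains the function to use for actual output.

```r
# Toy stm-format documents (hypothetical data): row 1 = 1-based vocab index,
# row 2 = count.
docs <- list(
  matrix(c(1L, 2L, 3L, 1L), nrow = 2),
  matrix(c(2L, 4L), nrow = 2)
)

# Format one document as "<n unique terms> <index>:<count> ...",
# shifting indices to start at zero (assumed here).
to_ldac_line <- function(d) {
  pairs <- paste0(d[1, ] - 1L, ":", d[2, ])
  paste(c(ncol(d), pairs), collapse = " ")
}

ldac_lines <- vapply(docs, to_ldac_line, character(1))
ldac_lines
```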

See Also

writeLdac, readCorpus, poliblog5k

Examples

library(stm)

# convert the poliblog5k data to slam package format
poliSlam <- convertCorpus(poliblog5k.docs, poliblog5k.voc, type="slam")
class(poliSlam)
# convert to the sparse matrix format of the Matrix package
poliMatrix <- convertCorpus(poliblog5k.docs, poliblog5k.voc, type="Matrix")
class(poliMatrix)
# convert to lda package format; the result is a list holding documents and vocab
poliLDA <- convertCorpus(poliblog5k.docs, poliblog5k.voc, type="lda")
str(poliLDA)

stm documentation built on June 24, 2024, 5:18 p.m.