View source: R/convertCorpus.R
convertCorpus | R Documentation |
Takes stm-formatted documents and vocab objects and returns them in formats usable by other packages.
convertCorpus(documents, vocab, type = c("slam", "lda", "Matrix"))
documents | the documents object in stm format |
vocab | the vocab object in stm format |
type | the output type desired. See Details. |
We also recommend the quanteda and tm packages for text preparation. The convertCorpus function is provided as a helpful utility for converting between formats, but if you intend to do text processing with a variety of output formats, you likely want to start with quanteda or tm.
The various type conversions are described below:
type = "slam"
Converts to the simple triplet matrix representation used by the slam package. This is the format used internally by tm.
type = "lda"
Converts to the format used by the lda package. This is a very minor change, as the stm format is based on lda's data representation. The difference, as noted in the stm documentation, is in how the vocabulary indices are numbered: stm indexes from one, whereas lda indexes from zero. Accordingly, this type returns a list containing the re-indexed documents object and the unchanged vocab object.
type = "Matrix"
Converts to the sparse matrix representation used by Matrix. This is the format used internally by numerous other text analysis packages.
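The conversions above can be illustrated without any extra packages. The following is a minimal base R sketch on a made-up toy corpus, not the code convertCorpus itself runs: it shows the index shift assumed for type = "lda" and the (row, column, value) triplets that a sparse document-term matrix stores. The toy data and all variable names are assumptions for illustration only.

```r
# Toy corpus in stm format (hypothetical data): one two-row integer matrix
# per document. Row 1 holds 1-based vocab indices, row 2 holds term counts.
vocab <- c("economy", "senate", "vote")
documents <- list(
  rbind(c(1L, 3L), c(2L, 1L)),  # "economy" x2, "vote" x1
  rbind(c(2L, 3L), c(1L, 4L))   # "senate" x1, "vote" x4
)

# lda-style documents: same layout, but vocab indices start at zero.
lda_docs <- lapply(documents, function(d) {
  d[1, ] <- d[1, ] - 1L
  d
})

# Triplet form: document index (i), term index (j), count (v).
i <- rep(seq_along(documents), times = vapply(documents, ncol, integer(1)))
j <- unlist(lapply(documents, function(d) d[1, ]))
v <- unlist(lapply(documents, function(d) d[2, ]))

# Dense document-term matrix built from the triplets, for inspection only.
dtm <- matrix(0L, nrow = length(documents), ncol = length(vocab),
              dimnames = list(NULL, vocab))
dtm[cbind(i, j)] <- v
print(dtm)
```

Triplets like i, j and v are the kind of input that sparse constructors such as slam::simple_triplet_matrix and Matrix::sparseMatrix accept, which is why both conversions preserve the same information as the stm documents list.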
If you want to write out a file containing the sparse matrix representation popularized by David Blei's C code ldac, see the function writeLdac.
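For orientation, the LDA-C text format stores one document per line: the number of unique terms, followed by term:count pairs. The sketch below builds such lines in base R from a toy stm-format documents list. The helper name to_ldac_lines is hypothetical, the data are made up, and zero-based term indices are assumed (as in Blei's lda-c); this is not the implementation of writeLdac.

```r
# Hypothetical helper: format stm-style documents as LDA-C text lines.
# Assumes row 1 of each matrix holds 1-based vocab indices and row 2 counts,
# and shifts the indices to zero-based for the output.
to_ldac_lines <- function(documents) {
  vapply(documents, function(d) {
    pairs <- paste0(d[1, ] - 1L, ":", d[2, ])
    paste(c(ncol(d), pairs), collapse = " ")
  }, character(1))
}

# Toy corpus (made-up data) with two documents.
documents <- list(
  rbind(c(1L, 3L), c(2L, 1L)),
  rbind(c(2L, 3L), c(1L, 4L))
)

lines <- to_ldac_lines(documents)
print(lines)
```

The resulting character vector could be written to disk with writeLines; for real corpora, writeLdac is the supported route.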
writeLdac
readCorpus
poliblog5k
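library(stm)  # provides convertCorpus and the poliblog5k data

# Convert the poliblog5k data to the slam package format
poliSlam <- convertCorpus(poliblog5k.docs, poliblog5k.voc, type="slam")
class(poliSlam)
# Convert to the sparse matrix format of the Matrix package
poliMatrix <- convertCorpus(poliblog5k.docs, poliblog5k.voc, type="Matrix")
class(poliMatrix)
# Convert to the lda package format (a list of documents and vocab)
poliLDA <- convertCorpus(poliblog5k.docs, poliblog5k.voc, type="lda")
str(poliLDA)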