| readCorpus | R Documentation |
Converts pre-processed document matrices stored in popular formats to stm format.
readCorpus(corpus, type = c("dtm", "slam", "Matrix"))
corpus |
An input file or filepath to be processed |
type |
The type of input file. We offer several sources, see details. |
This function provides a simple utility for converting other document
formats to the stm format. Briefly: dtm takes a standard matrix as input
and converts it to our format. slam converts from the
simple_triplet_matrix representation used by the slam package.
This is also the representation of corpora in the popular tm package,
so it should work in those cases.
dtm expects a matrix object where each row represents a document and
each column represents a word in the dictionary.
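To illustrate the row-per-document, column-per-word layout that dtm expects, here is a minimal base R sketch of this kind of conversion. It assumes that a documents object is a list holding one two-row integer matrix per document, with 1-based vocabulary indices in the first row and counts in the second. The helper name dtm_to_documents_sketch is hypothetical and is not part of stm; readCorpus itself may differ in details such as how it treats empty documents.

```r
# Hypothetical helper for illustration only; NOT the stm implementation.
# Input: dense count matrix, rows = documents, columns = vocabulary terms.
dtm_to_documents_sketch <- function(dtm) {
  stopifnot(is.matrix(dtm))
  documents <- lapply(seq_len(nrow(dtm)), function(d) {
    counts <- dtm[d, ]
    present <- which(counts > 0)
    # Row 1: 1-based column (vocabulary) index; row 2: count of that term
    rbind(as.integer(present), as.integer(counts[present]))
  })
  names(documents) <- rownames(dtm)
  # Assumption: the vocabulary is taken from the column names, if any
  list(documents = documents, vocab = colnames(dtm))
}

# Toy matrix: two documents over a three-word dictionary
toy_dtm <- matrix(
  c(2, 0, 1,
    0, 3, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("doc1", "doc2"), c("apple", "banana", "cherry"))
)

sketch <- dtm_to_documents_sketch(toy_dtm)
sketch$documents$doc1  # columns pair a vocabulary index with its count
sketch$vocab
```

In practice, passing a matrix like toy_dtm to readCorpus with type = "dtm" is the supported route; the sketch only shows the shape of the data involved.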
slam expects a simple_triplet_matrix from the slam
package.
Matrix attempts to coerce the matrix to a
simple_triplet_matrix and convert using the
functionality built for the slam package. This will work for most
applicable classes in the Matrix package such as dgCMatrix.
If you are trying to read a .ldac file, see readLdac.
documents |
A documents object in our format |
vocab |
A vocab object if information is available to construct one |
textProcessor, prepDocuments, readLdac
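## Not run:
library(stm)     # provides readCorpus
library(textir)  # provides the congress109 example data
data(congress109)
# Convert the Matrix-package count matrix to stm format
out <- readCorpus(congress109Counts, type="Matrix")
# Extract the converted documents and the vocabulary
documents <- out$documents
vocab <- out$vocab
## End(Not run)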