readCorpus: Read in a corpus file.


readCorpus    R Documentation

Read in a corpus file.

Description

Converts pre-processed document matrices stored in popular formats to stm format.

Usage

readCorpus(corpus, type = c("dtm", "slam", "Matrix"))

Arguments

corpus

The input object to be converted. Its expected class depends on type; see Details.

type

The format of the input: one of "dtm", "slam", or "Matrix". See Details.

Details

This function is a simple utility for converting other document formats to the stm format. Briefly: dtm takes a standard matrix as input and converts it to our format; slam converts from the simple_triplet_matrix representation used by the slam package. Because the popular tm package represents corpora the same way, slam should also work for tm objects.

dtm expects a matrix object where each row represents a document and each column represents a word in the dictionary.
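
To make the dtm layout concrete, here is a minimal base-R sketch of the kind of conversion involved. It assumes the list format described in the stm package documentation, where each document is a two-row integer matrix (row 1: 1-based vocabulary indices, row 2: counts). The helper name dtm_to_list_sketch and the toy data are hypothetical, and this is not the package's own implementation.

```r
# Hypothetical helper: illustrates the target layout only; not stm's own code.
dtm_to_list_sketch <- function(dtm) {
  stopifnot(is.matrix(dtm))
  documents <- lapply(seq_len(nrow(dtm)), function(i) {
    counts <- dtm[i, ]
    present <- which(counts > 0)          # columns (terms) that occur in document i
    rbind(as.integer(present),            # row 1: 1-based term indices
          as.integer(counts[present]))    # row 2: how often each term occurs
  })
  list(documents = documents, vocab = colnames(dtm))
}

# Toy data: rows are documents, columns are vocabulary terms.
toy_dtm <- matrix(c(2, 0, 1,
                    0, 3, 1),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(NULL, c("apple", "banana", "cherry")))

sketch <- dtm_to_list_sketch(toy_dtm)
sketch$documents[[1]]  # terms 1 and 3, with counts 2 and 1
sketch$vocab
```

With stm installed, readCorpus(toy_dtm, type = "dtm") is the call to use in practice; the sketch only shows the shape of the data being produced.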

slam expects a simple_triplet_matrix from that package.

Matrix attempts to coerce the matrix to a simple_triplet_matrix and convert using the functionality built for the slam package. This will work for most applicable classes in the Matrix package such as dgCMatrix.
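
The three type values can be compared on a single toy matrix. The sketch below assumes the stm, slam, and Matrix packages are installed and skips the corresponding calls otherwise. The object names and the helper have() are illustrative only; slam::as.simple_triplet_matrix() and Matrix::Matrix() are existing functions in those packages.

```r
# Toy document-term matrix: 2 documents (rows) x 3 terms (columns).
dtm <- matrix(c(2, 0, 1,
                0, 3, 1),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("doc1", "doc2"), c("apple", "banana", "cherry")))

# Illustrative helper: TRUE if the package can be loaded.
have <- function(pkg) requireNamespace(pkg, quietly = TRUE)

if (have("stm")) {
  # type = "dtm": a plain base-R matrix.
  from_dtm <- stm::readCorpus(dtm, type = "dtm")

  if (have("slam")) {
    # type = "slam": the same counts as a simple_triplet_matrix.
    triplet <- slam::as.simple_triplet_matrix(dtm)
    from_slam <- stm::readCorpus(triplet, type = "slam")
  }

  if (have("Matrix")) {
    # type = "Matrix": the same counts as a sparse Matrix-package object.
    sparse <- Matrix::Matrix(dtm, sparse = TRUE)
    from_matrix <- stm::readCorpus(sparse, type = "Matrix")
  }
}
```

Each call returns a list with a documents element; whether vocab is populated depends on the naming information the input object carries.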

If you are trying to read a .ldac file see readLdac.

Value

documents

A documents object in our format

vocab

A vocab object if information is available to construct one

See Also

textProcessor, prepDocuments, readLdac

Examples


## Not run: 

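# The congress109 example data ships with the textir package.
library(textir)
data(congress109)
# congress109Counts is a sparse count matrix, so it is read with type = "Matrix".
out <- readCorpus(congress109Counts, type = "Matrix")
# Extract the converted documents and the vocabulary.
documents <- out$documents
vocab <- out$vocab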

## End(Not run)

stm documentation built on June 24, 2024, 5:18 p.m.