Converts pre-processed document matrices stored in popular formats to stm format.
An input file or filepath to be processed
The type of input file. Several sources are supported; see details.
This function provides a simple utility for converting other document
formats to our own. Briefly:
dtm takes as input a standard matrix
and converts it to our format.
ldac takes a file path and reads in a
document in the sparse format popularized by David Blei's C code
implementation of lda.
slam converts from the
simple_triplet_matrix representation used by the slam package.
This is also the representation of corpora in the popular tm package
and should work in those cases.
dtm expects a matrix object where each row represents a document and
each column represents a word in the dictionary.
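The layout that dtm expects can be illustrated with a small base R sketch. The helper dtm_to_docs is hypothetical and not part of any package, and the output layout (one two-row matrix of term indices and counts per document) is an assumption made for illustration rather than something stated above.

```r
# Toy document-term matrix: rows are documents, columns are vocabulary terms.
dtm <- matrix(
  c(2, 0, 1,
    0, 3, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("doc1", "doc2"), c("apple", "banana", "cherry"))
)

# Hypothetical helper: keep only the nonzero entries of each row as a
# two-row integer matrix (row 1 = term index, row 2 = count).
dtm_to_docs <- function(m) {
  lapply(seq_len(nrow(m)), function(i) {
    idx <- which(m[i, ] > 0)
    rbind(as.integer(idx), as.integer(m[i, idx]))
  })
}

documents <- dtm_to_docs(dtm)
vocab <- colnames(dtm)  # the column names double as the vocabulary
```

Here the column names serve as the vocabulary, which is why a matrix without column names would leave no information from which to build a vocab object.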
ldac expects a file name or path that contains a file in Blei's LDA-C
format. From his ReadMe: "The data is a file where each line is of the form:
[M] [term_1]:[count] [term_2]:[count] ... [term_N]:[count]
where [M] is the number of unique terms in the document, and the [count] associated with each term is how many times that term appeared in the document. Note that [term_1] is an integer which indexes the term; it is not a string."
Because R indexes from one, the values of the term indices are incremented by one on import.
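As a rough illustration of the line format and the index shift, the following base R sketch parses a single LDA-C line. The function parse_ldac_line and the sample line are hypothetical; this is not the package's own reader.

```r
# Hypothetical helper: parse one LDA-C line such as "3 0:2 4:1 7:5".
parse_ldac_line <- function(line) {
  fields <- strsplit(trimws(line), "[[:space:]]+")[[1]]
  n_terms <- as.integer(fields[1])                 # [M], the number of unique terms
  pairs <- strsplit(fields[-1], ":", fixed = TRUE) # each "[term]:[count]" pair
  # Term indices are zero-based in the file, so shift them to one-based for R.
  term <- as.integer(vapply(pairs, `[`, character(1), 1)) + 1L
  count <- as.integer(vapply(pairs, `[`, character(1), 2))
  stopifnot(length(term) == n_terms)               # sanity check against [M]
  rbind(term, count)
}

doc <- parse_ldac_line("3 0:2 4:1 7:5")
```

In this sketch the file's term 0 becomes index 1 in R, so a vocabulary vector read alongside the file can be indexed directly with the term row.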
slam expects a
simple_triplet_matrix from that package.
Matrix attempts to coerce the matrix to a
simple_triplet_matrix and convert using the
functionality built for the
slam package. This will work for most
applicable classes in the
Matrix package.
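The triplet idea behind both of these options can be sketched in base R without loading either package. The list below is only a stand-in for a simple_triplet_matrix (parallel vectors of row index, column index and value for the nonzero cells), and the grouping step is an assumption about how per-document data might be derived, not the package's implementation.

```r
# Small dense matrix: rows are documents, columns are terms.
dense <- matrix(c(2, 0, 1,
                  0, 3, 0), nrow = 2, byrow = TRUE)

# Positions and values of the nonzero cells, stored as parallel vectors.
nz <- which(dense != 0, arr.ind = TRUE)
triplet <- list(i = nz[, "row"], j = nz[, "col"], v = dense[nz],
                nrow = nrow(dense), ncol = ncol(dense))

# Group the triplets by document (row) to get term/count pairs per document.
by_doc <- split(data.frame(term = triplet$j, count = triplet$v), triplet$i)
```

Because only nonzero cells are stored, the same grouping step applies to any sparse class that can be coerced to this form.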
Finally, the option
txtorgvocab allows the user to easily read in a
vocab file generated by the software
txtorg. When working in English,
it is straightforward to read in files created by txtorg. However, when
working in other languages, particularly Chinese and Arabic, there can often
be difficulty reading in the files using
read.csv. This function should work well in those cases.
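One way to sidestep such encoding problems is to read the file line by line with an explicit encoding. The sketch below assumes a vocab file with one term per line, saved as UTF-8; both the file layout and the sample terms are assumptions for illustration, not a description of what txtorg writes.

```r
# Write a small stand-in vocab file (hypothetical layout: one term per line).
path <- tempfile(fileext = ".txt")
terms <- c("\u4e2d\u6587", "\u0639\u0631\u0628\u064a", "topic")
writeLines(terms, path, useBytes = TRUE)

# Read the lines back, declaring the encoding rather than relying on the locale.
vocab <- readLines(path, encoding = "UTF-8", warn = FALSE)
```

Declaring the encoding marks the non-ASCII strings as UTF-8, so they are not reinterpreted in the session's native encoding.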
A documents object in our format
A vocab object if information is available to construct one