read.documents (R Documentation)
These functions read in the document and vocabulary files associated with a corpus. The format of the files is the same as that used by LDA-C (see below for details). The return value of these functions can be used by the inference procedures defined in the lda package.
read.documents(filename = "mult.dat")
read.vocab(filename = "vocab.dat")
filename: A length-1 character vector specifying the path to the document or vocabulary file. The default is ‘mult.dat’ for read.documents and ‘vocab.dat’ for read.vocab.
The details of the format are also described in the readme for LDA-C.
The format of the documents file is appropriate for typical text data as it sparsely encodes observed features. A single file encodes a corpus (a collection of documents). Each line of the file encodes a single document (a feature vector).
The line encoding a document begins with an integer followed by a number of feature-count pairs, all separated by spaces. A feature-count pair consists of two integers separated by a colon. The first integer indicates the feature (note that this is zero-indexed!) and the second integer indicates the count (i.e., value) of that feature. The initial integer of a line indicates how many feature-count pairs are to be expected on that line.
Note that we permit a feature to appear more than once on a line, in which case the value for that feature will be the sum of all instances (the behavior for such files is undefined for LDA-C). For example, a line reading ‘4 7:1 0:2 7:3 1:1’ will yield a document with feature 0 occurring twice, feature 1 occurring once, and feature 7 occurring four times, with all other features occurring zero times.
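The line format and the summing of repeated features can be illustrated with a minimal base R sketch. The helper `parse_ldac_line` is hypothetical and is not part of the lda package; it is not the package's implementation. The two-row result (feature ids over counts, sorted by feature) is a choice made for this sketch.

```r
# Hypothetical helper (not part of the lda package): parse one line of a
# documents file and sum the counts of repeated features.
parse_ldac_line <- function(line) {
  tokens <- strsplit(trimws(line), "[[:space:]]+")[[1]]
  n_pairs <- as.integer(tokens[1])   # leading integer: expected number of pairs
  pairs <- tokens[-1]
  if (length(pairs) != n_pairs) {
    stop("expected ", n_pairs, " feature-count pairs, found ", length(pairs))
  }
  parts <- strsplit(pairs, ":", fixed = TRUE)
  features <- as.integer(vapply(parts, `[`, character(1), 1))  # zero-indexed ids
  counts <- as.integer(vapply(parts, `[`, character(1), 2))
  # Sum counts for features that appear more than once on the line.
  totals <- tapply(counts, features, sum)
  rbind(feature = as.integer(names(totals)),
        count = as.integer(totals))
}

doc <- parse_ldac_line("4 7:1 0:2 7:3 1:1")
print(doc)
```

For the example line above, the sketch reports features 0, 1 and 7 with counts 2, 1 and 4, matching the description in the text. It also stops with an error when the leading integer disagrees with the number of pairs found.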
The format of the vocabulary is a set of newline separated strings corresponding to features. That is, the first line of the vocabulary file will correspond to the label for feature 0, the second for feature 1, etc.
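Because feature ids in the documents file are zero-indexed while R vectors are one-indexed, looking up a label needs an offset of one. The following sketch uses a made-up three-word vocabulary and plain `readLines` as a stand-in for reading the vocabulary file; the words and variable names are illustrative assumptions.

```r
# Write an illustrative vocabulary file: one label per line.
vocab_file <- tempfile(fileext = ".dat")
writeLines(c("apple", "banana", "cherry"), vocab_file)

# Line 1 labels feature 0, line 2 labels feature 1, and so on.
vocab <- readLines(vocab_file)

# Zero-indexed feature ids, as they would appear in a documents file.
feature_ids <- c(0L, 2L)

# Shift by one before subsetting, since R vectors start at index 1.
labels <- vocab[feature_ids + 1L]
print(labels)
```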
read.documents returns a list of matrices suitable as input for the inference routines in lda. See lda.collapsed.gibbs.sampler for details.
read.vocab returns a character vector of strings corresponding to features.
Jonathan Chang (slycoder@gmail.com)
Blei, David M. Latent Dirichlet Allocation in C. http://www.cs.columbia.edu/~blei/topicmodeling_software.html
lda.collapsed.gibbs.sampler for the format of the return value of read.documents.
lexicalize to generate the same output from raw text data.
word.counts to compute statistics associated with a corpus.
concatenate.documents for operations on a collection of documents.
## Read files using default values.
## Not run: setwd("corpus directory")
## Not run: documents <- read.documents()
## Not run: vocab <- read.vocab()
## Read files from another location.
## Not run: documents <- read.documents("corpus directory/features")
## Not run: vocab <- read.vocab("corpus directory/labels")
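For trying the functions without an existing corpus, a small pair of files can be written by hand. This is a sketch under stated assumptions: the two documents, the three vocabulary words, and the temporary directory are all made up for illustration, and the calls into lda are only attempted if the package is installed.

```r
# Create a temporary directory holding a tiny, made-up corpus.
corpus_dir <- tempfile("corpus")
dir.create(corpus_dir)
doc_file <- file.path(corpus_dir, "mult.dat")
vocab_file <- file.path(corpus_dir, "vocab.dat")

# Two documents: the first has two feature-count pairs, the second has one.
writeLines(c("2 0:1 2:3", "1 1:2"), doc_file)
# Three labels, for features 0, 1 and 2 respectively.
writeLines(c("apple", "banana", "cherry"), vocab_file)

# Read the files back with lda, if it is available.
if (requireNamespace("lda", quietly = TRUE)) {
  documents <- lda::read.documents(doc_file)
  vocab <- lda::read.vocab(vocab_file)
  str(documents)
  print(vocab)
}
```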