word.counts (R Documentation)
These functions compute summary statistics of a corpus. word.counts computes the word counts for a set of documents, while document.lengths computes the length of each document in a corpus.
word.counts(docs, vocab = NULL)
document.lengths(docs)
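docs
A list of matrices specifying the corpus. See lda.collapsed.gibbs.sampler for details on the format.
vocab
An optional character vector specifying the levels (i.e., labels) of the vocabulary words. If unspecified (or NULL), the levels are inferred automatically from the corpus.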
word.counts returns an object of class ‘table’ containing the number of times each word appears in the input corpus. If vocab is specified, the levels of the table are set to vocab. Otherwise, the levels are inferred automatically from the corpus (typically integers 0:(V-1), where V is the number of unique words in the corpus).
document.lengths returns an integer vector of length length(docs), each entry of which is the length (the sum of the counts of all features) of the corresponding document in the corpus.
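The two return values described above can be illustrated with a minimal base-R sketch. It assumes the document format described under lda.collapsed.gibbs.sampler: each document is a two-row integer matrix whose first row holds 0-indexed word indices and whose second row holds the corresponding counts. The helpers sketch_word_counts and sketch_document_lengths are hypothetical names introduced only for illustration; they are not the package's implementation, and the first returns a named array rather than a ‘table’ object.

```r
# Toy corpus in the assumed format: row 1 = 0-indexed word indices,
# row 2 = counts. matrix() fills column by column.
docs <- list(
  matrix(c(0L, 2L, 1L, 1L), nrow = 2),  # word 0 twice, word 1 once
  matrix(c(1L, 3L, 2L, 1L), nrow = 2)   # word 1 three times, word 2 once
)

# Hypothetical helper: total count per word index across the corpus.
sketch_word_counts <- function(docs) {
  indices <- unlist(lapply(docs, function(d) d[1, ]))
  counts  <- unlist(lapply(docs, function(d) d[2, ]))
  tapply(counts, indices, sum)
}

# Hypothetical helper: sum of the count row of each document.
sketch_document_lengths <- function(docs) {
  vapply(docs, function(d) sum(d[2, ]), integer(1))
}

sketch_word_counts(docs)
sketch_document_lengths(docs)
```

On this toy corpus, word 1 appears in both documents, so its counts are added across them, while each document length is simply the total of that document's count row.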
Jonathan Chang (slycoder@gmail.com)
lda.collapsed.gibbs.sampler for the input format of these functions.
read.documents and lexicalize for ways of generating the input to these functions.
concatenate.documents for operations on a corpus.
## Load the cora dataset.
data(cora.vocab)
data(cora.documents)
## Compute word counts using raw feature indices.
wc <- word.counts(cora.documents)
head(wc)
##   0   1   2   3   4   5
## 136 876  14 111  19  29
## Recompute them using the levels defined by the vocab file.
wc <- word.counts(cora.documents, cora.vocab)
head(wc)
##    computer  algorithms discovering    patterns      groups     protein
##         136         876          14         111          19          29
head(document.lengths(cora.documents))
## [1] 64 39 76 84 52 24
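Because word.counts returns a ‘table’, the usual base-R tools for tables apply to it. As a small follow-on sketch, the labelled counts can be sorted to list the most frequent vocabulary words. The variable name top_words is introduced here for illustration, and the block is guarded so that it only runs when the lda package is installed.

```r
## List the ten most frequent words in the cora corpus.
if (requireNamespace("lda", quietly = TRUE)) {
  data(cora.vocab, package = "lda")
  data(cora.documents, package = "lda")
  wc <- lda::word.counts(cora.documents, cora.vocab)
  ## sort() keeps the vocabulary labels attached to the counts.
  top_words <- head(sort(wc, decreasing = TRUE), 10)
  print(top_words)
}
```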