Compute Summary Statistics of a Corpus

Description

These functions compute summary statistics of a corpus. word.counts computes the number of times each word appears across a set of documents, while document.lengths computes the length of each document in a corpus.

Usage

word.counts(docs, vocab = NULL)

document.lengths(docs)

Arguments

docs

A list of matrices specifying the corpus. See lda.collapsed.gibbs.sampler for details on the format of this variable.

vocab

An optional character vector specifying the levels (i.e., labels) of the vocabulary words. If unspecified (or NULL), the levels will be automatically inferred from the corpus.

Value

word.counts returns an object of class table which contains counts for the number of times each word appears in the input corpus. If vocab is specified, then the levels of the table will be set to vocab. Otherwise, the levels are automatically inferred from the corpus (typically integers 0:(V-1), where V indicates the number of unique words in the corpus).

document.lengths returns an integer vector of length length(docs); each entry is the length (the sum of the counts of all features) of the corresponding document in the corpus.
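The two return values can be illustrated with a minimal base R sketch on a toy corpus. This is not the package's implementation. It assumes the two-row matrix format described under lda.collapsed.gibbs.sampler (row 1 holds 0-indexed word ids, row 2 holds the count of each word), and the names toy.docs, toy.vocab, toy.lengths, toy.counts and toy.named are hypothetical, chosen only for illustration.

```r
## Toy corpus (hypothetical): each document is a two-row integer matrix.
## Assumed format: row 1 = 0-indexed word ids, row 2 = counts.
toy.docs <- list(
  matrix(as.integer(c(0, 2,  1, 1,  3, 1)), nrow = 2),  # word 0 x2, word 1 x1, word 3 x1
  matrix(as.integer(c(1, 3,  2, 2)), nrow = 2)          # word 1 x3, word 2 x2
)
toy.vocab <- c("apple", "banana", "cherry", "date")

## Document lengths: sum the count row of each document.
toy.lengths <- vapply(toy.docs, function(d) sum(d[2, ]), integer(1))

## Word counts: pool all documents, then sum the counts per word id.
all.docs <- do.call(cbind, toy.docs)
toy.counts <- tapply(all.docs[2, ], all.docs[1, ], sum)

## Relabel with the vocabulary; add 1 because word ids are 0-indexed.
toy.named <- toy.counts
names(toy.named) <- toy.vocab[as.integer(names(toy.counts)) + 1]

print(toy.lengths)
print(toy.counts)
print(toy.named)
```

Under these assumptions, the toy documents have lengths 4 and 5, and word id 1 ("banana" in the hypothetical vocabulary) is the most frequent, with a total count of 4.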

Author(s)

Jonathan Chang (slycoder@gmail.com)

See Also

lda.collapsed.gibbs.sampler for the input format of these functions.

read.documents and lexicalize for ways of generating the input to these functions.

concatenate.documents for operations on a corpus.

Examples

## Load the cora dataset.
data(cora.vocab)
data(cora.documents)

## Compute word counts using raw feature indices.
wc <- word.counts(cora.documents)
head(wc)
##   0   1   2   3   4   5 
## 136 876  14 111  19  29 

## Recompute them using the levels defined by the vocab file.
wc <- word.counts(cora.documents, cora.vocab)
head(wc)
##   computer  algorithms discovering    patterns      groups     protein 
##        136         876          14         111          19          29 

## Compute the length of each document.
head(document.lengths(cora.documents))
## [1] 64 39 76 84 52 24