Functions to manipulate text corpora in LDA format.

Description

concatenate.documents concatenates a set of documents. filter.words removes references to certain words from a collection of documents. shift.word.indices adjusts references to words by a fixed amount.

Usage

concatenate.documents(...)
filter.words(documents, to.remove)
shift.word.indices(documents, amount)

Arguments

...

For concatenate.documents, the set of corpora to be merged. All arguments to ... must be corpora of the same length. The documents in the same position in each of the arguments will be concatenated, i.e., the new document 1 will be the concatenation of document 1 from argument 1, document 1 from argument 2, etc.

documents

For filter.words and shift.word.indices, the corpus to be operated on.

to.remove

For filter.words, an integer vector of word indices to filter. Every word in a document whose index also appears in to.remove will be removed from that document.

amount

For shift.word.indices, an integer scalar by which to shift the vocabulary in the corpus. amount will be added to each entry of the word field (the word indices) in the corpus; the counts are left unchanged.
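The three operations described above can be sketched in base R on a toy corpus. This is only an illustration of the documented behavior, not the package's own code: the helper names filter_sketch, shift_sketch and concat_sketch are hypothetical, the toy values are made up, and each document is assumed to be a two-row integer matrix (word indices in row 1, counts in row 2), as in the example output on this page.

```r
## Toy corpus: two documents, each a 2-row integer matrix.
## Row 1 holds word indices, row 2 holds counts (values are made up).
doc_a <- matrix(c(1L, 4L,  23L, 1L,  34L, 3L), nrow = 2)
doc_b <- matrix(c(34L, 1L,  94L, 1L), nrow = 2)
toy <- list(doc_a, doc_b)

## Hypothetical stand-in for filter.words: drop every column whose
## word index appears in to.remove.
filter_sketch <- function(documents, to.remove) {
  lapply(documents, function(d) d[, !(d[1, ] %in% to.remove), drop = FALSE])
}

## Hypothetical stand-in for shift.word.indices: add amount to the
## word indices and leave the counts alone.
shift_sketch <- function(documents, amount) {
  lapply(documents, function(d) {
    d[1, ] <- d[1, ] + as.integer(amount)
    d
  })
}

## Hypothetical stand-in for concatenate.documents: bind the documents
## found at the same position in each corpus, column-wise.
concat_sketch <- function(...) {
  mapply(cbind, ..., SIMPLIFY = FALSE)
}

filtered <- filter_sketch(toy, c(23L, 94L))
shifted  <- shift_sketch(toy, 100L)
combined <- concat_sketch(toy, shifted)
```

Shifting before concatenating keeps the two vocabularies apart, so a word index from the first corpus cannot collide with one from the second.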

Value

A corpus with the documents merged/words filtered/words shifted. The format of the input and output corpora is described in lda.collapsed.gibbs.sampler.
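For readers without the other help page at hand, the corpus format can be illustrated with a small base R sketch. It assumes the layout documented for lda.collapsed.gibbs.sampler: a list with one two-row integer matrix per document, where row 1 holds 0-indexed word identifiers and row 2 holds how often each word occurs. The helper to_lda_doc is hypothetical and exists only for this illustration.

```r
## Hypothetical helper: turn a vector of 0-based word indices (one entry
## per token) into a 2-row integer matrix of (word index, count) columns.
to_lda_doc <- function(token_indices) {
  counts <- table(token_indices)
  rbind(as.integer(names(counts)), as.integer(counts))
}

## A document with five tokens: word 0 once, word 2 three times, word 5 once.
doc <- to_lda_doc(c(0L, 2L, 2L, 5L, 2L))

## A corpus is simply a list of such matrices.
toy_corpus <- list(doc, to_lda_doc(c(1L, 1L)))
```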

Author(s)

Jonathan Chang (slycoder@gmail.com)

See Also

lda.collapsed.gibbs.sampler for the format of the return value.

word.counts to compute statistics associated with a corpus.

Examples

data(cora.documents)

## Just use a small subset for the example.
corpus <- cora.documents[1:6]
## Get the word counts.
wc <- word.counts(corpus)

## Only keep the words which occur more than 4 times.
filtered <- filter.words(corpus,
                         as.numeric(names(wc)[wc <= 4]))
## [[1]]
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1   23   34   37   44
## [2,]    4    1    3    4    1
##
## [[2]]
##      [,1] [,2]
## [1,]   34   94
## [2,]    1    1
## ... long output omitted ...

## Shift the second half of the corpus.
shifted <- shift.word.indices(filtered[4:6], 100)
## [[1]]
##      [,1] [,2] [,3]
## [1,]  134  281  307
## [2,]    2    5    7
##
## [[2]]
##      [,1] [,2]
## [1,]  101  123
## [2,]    1    4
##
## [[3]]
##      [,1] [,2]
## [1,]  101  194
## [2,]    6    3

## Combine the unshifted documents and the shifted documents.
concatenate.documents(filtered[1:3], shifted)
## [[1]]
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,]    1   23   34   37   44  134  281  307
## [2,]    4    1    3    4    1    2    5    7
##
## [[2]]
##      [,1] [,2] [,3] [,4]
## [1,]   34   94  101  123
## [2,]    1    1    1    4
##
## [[3]]
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]   34   37   44   94  101  194
## [2,]    4    1    7    1    6    3