Generate LDA Documents from Raw Text

Description

This function reads raw text in doclines format and returns a corpus and vocabulary suitable for the inference procedures defined in the lda package.

Usage

lexicalize(doclines, sep = " ", lower = TRUE, count = 1L, vocab = NULL)

Arguments

doclines

A character vector of document lines to be used to construct a corpus. See details for a description of the format of these lines.

sep

Separator string used to tokenize the input strings (default " ", a single space).

lower

Logical indicating whether or not to convert all tokens to lowercase (default TRUE).

count

An integer scaling factor to be applied to feature counts. A single observation of a feature will be rendered as count observations in the return value (the default value, 1, is appropriate in most cases).

vocab

If left unspecified (or NULL), the vocabulary for the corpus will be automatically inferred from the observed tokens. Otherwise, this parameter should be a character vector specifying acceptable tokens. Tokens not appearing in this list will be filtered from the documents.

Details

This function first tokenizes a character vector by splitting each entry of the vector on sep (note that this is currently a fixed separator, not a regular expression). If lower is TRUE, the tokens are all converted to lowercase.
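
The fixed-separator behaviour can be illustrated with base R alone. The snippet below is a sketch that assumes the splitting behaves like strsplit with fixed = TRUE; that is an assumption about the implementation, not something this page states. It shows that a separator such as "." is taken literally, and that repeated separators leave empty tokens behind.

```r
# Base R only; the lda package is not needed for this illustration.
line <- "model.of.a  modern"

# Literal split on ".": the dot is not treated as a regex wildcard.
dot_tokens <- strsplit(line, ".", fixed = TRUE)[[1]]

# Literal split on a single space: the double space produces an empty token.
space_tokens <- strsplit(line, " ", fixed = TRUE)[[1]]

print(dot_tokens)
print(space_tokens)
```

If the real function splits the same way, input with irregular spacing would contribute an empty-string token to the vocabulary unless it is cleaned beforehand.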

At this point, if vocab is NULL, a vocabulary is constructed from the set of unique tokens appearing across all entries of the character vector. Otherwise, the tokens are filtered so that only those appearing in vocab are retained.

Finally, token instances within each document (i.e., original character string) are tabulated in the format described in lda.collapsed.gibbs.sampler.
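
The steps above can be sketched in base R. This is a minimal illustration, not the package's implementation: sketch_lexicalize is a hypothetical name, and the output layout (one column per token instance, 0-based vocabulary index in row 1, count in row 2) is inferred from the output shown in Examples.

```r
# Hypothetical helper, not part of the lda package: a base-R sketch of the
# tokenize / lowercase / filter / index steps described in Details.
sketch_lexicalize <- function(doclines, sep = " ", lower = TRUE,
                              count = 1L, vocab = NULL) {
  if (lower) doclines <- tolower(doclines)
  # fixed = TRUE treats sep as a literal string, not a regular expression.
  tokens <- strsplit(doclines, sep, fixed = TRUE)
  infer <- is.null(vocab)
  # With no vocabulary supplied, use the unique tokens in order of appearance.
  if (infer) vocab <- unique(unlist(tokens))
  documents <- lapply(tokens, function(tok) {
    idx <- match(tok, vocab)
    idx <- idx[!is.na(idx)]  # drop tokens that are not in the vocabulary
    # Row 1: 0-based vocabulary index; row 2: count for each token instance.
    rbind(as.integer(idx - 1L), rep(as.integer(count), length(idx)))
  })
  if (infer) list(documents = documents, vocab = vocab) else documents
}

example <- c("I am the very model of a modern major general",
             "I have a major headache")
res <- sketch_lexicalize(example)
print(res$documents[[2]])
```

Passing count = 2L to this sketch would put 2 in every entry of the second row, mirroring the scaling described for the count argument.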

Value

If vocab is unspecified or NULL, a list with two components:

documents

A list of document matrices in the format described in lda.collapsed.gibbs.sampler.

vocab

A character vector of unique tokens occurring in the corpus.

If vocab is supplied, only the list of document matrices is returned, with no vocab component (see the second call in Examples).

Note

Because of the limited tokenization and filtering capabilities of this function, it may not be useful in many cases. This may be resolved in a future release.
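
One possible workaround is to clean the text before it is passed in as doclines. The sketch below uses base R only; clean_lines is a hypothetical helper, not part of the lda package, and the choice to replace all punctuation with spaces is an assumption that will not suit every corpus.

```r
# Hypothetical helper (not part of the lda package): normalise raw text so
# that splitting on a single space gives clean tokens.
clean_lines <- function(x) {
  x <- gsub("[[:punct:]]+", " ", x)  # replace punctuation with spaces
  x <- gsub("[[:space:]]+", " ", x)  # collapse runs of whitespace
  trimws(x)                          # drop leading and trailing spaces
}

raw <- c("I am the very model,  of a modern Major-General!",
         "  I have a major headache. ")
cleaned <- clean_lines(raw)
print(cleaned)
```

The cleaned vector could then be used as the doclines argument, leaving case folding to lower = TRUE.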

Author(s)

Jonathan Chang (slycoder@gmail.com)

See Also

lda.collapsed.gibbs.sampler for the format of the return value.

read.documents to generate the same output from a file encoded in LDA-C format.

word.counts to compute statistics associated with a corpus.

concatenate.documents for operations on a collection of documents.

Examples

## Generate an example.
example <- c("I am the very model of a modern major general",
             "I have a major headache")

corpus <- lexicalize(example, lower=TRUE)

## corpus$documents:
## $documents[[1]]
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    0    1    2    3    4    5    6    7    8     9
## [2,]    1    1    1    1    1    1    1    1    1     1
## 
## $documents[[2]]
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0   10    6    8   11
## [2,]    1    1    1    1    1

## corpus$vocab:
## $vocab
## [1] "i"        "am"       "the"      "very"     "model"    "of"      
## [7] "a"        "modern"   "major"    "general"  "have"     "headache"

## Only keep words that appear at least twice:
to.keep <- corpus$vocab[word.counts(corpus$documents, corpus$vocab) >= 2]

## Re-lexicalize, using this subsetted vocabulary
documents <- lexicalize(example, lower=TRUE, vocab=to.keep)

## documents:
## [[1]]
##      [,1] [,2] [,3]
## [1,]    0    1    2
## [2,]    1    1    1
## 
## [[2]]
##      [,1] [,2] [,3]
## [1,]    0    1    2
## [2,]    1    1    1