lexicalize (R Documentation)
This function reads raw text in doclines format and returns a corpus and vocabulary suitable for the inference procedures defined in the lda package.
lexicalize(doclines, sep = " ", lower = TRUE, count = 1L, vocab = NULL)
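For instance, the corpus and vocabulary produced by this function can be handed directly to the collapsed Gibbs sampler in the lda package. A minimal sketch, assuming two tiny made-up documents and arbitrary illustrative values for K, num.iterations, alpha, and eta:

library(lda)
## Two tiny raw-text documents (hypothetical input).
raw <- c("some raw text", "more raw text")
corpus <- lexicalize(raw)
## Pass the documents and vocabulary to the sampler; the hyperparameter
## values below are illustrative only.
fit <- lda.collapsed.gibbs.sampler(corpus$documents, K = 2, corpus$vocab,
                                   num.iterations = 25, alpha = 0.1, eta = 0.1)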
doclines: A character vector of document lines to be used to construct a corpus. See details for a description of the format of these lines.

sep: Separator string used to tokenize the input strings (default ‘ ’).

lower: Logical indicating whether or not to convert all tokens to lowercase (default ‘TRUE’).

count: An integer scaling factor to be applied to feature counts. A single observation of a feature will be rendered as count observations in the return value (the default value, ‘1’, is appropriate in most cases).

vocab: If left unspecified (or NULL), a vocabulary is constructed from the set of unique tokens in the input. Otherwise, a character vector giving the tokens to retain; tokens not appearing in vocab are filtered out (see details).
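As a quick illustration of the sep and count arguments, here is a small hypothetical sketch (the inputs are made up; as noted in the details below, sep is treated as a literal string rather than a regular expression, and count scales every recorded feature count):

## sep is a literal separator, so a period splits the string into three
## tokens instead of being interpreted as a regular expression.
lexicalize("red.green.blue", sep = ".")

## count = 2L records each observed token with a count of 2 instead of 1.
lexicalize("apple banana", count = 2L)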
This function first tokenizes the character vector by splitting each entry on sep (note that this is currently a fixed separator, not a regular expression). If lower is ‘TRUE’, the tokens are then converted to lowercase.
At this point, if vocab is NULL, a vocabulary is constructed from the set of unique tokens appearing across all character vectors. Otherwise, the tokens derived from the character vectors are filtered so that only those appearing in vocab are retained.
Finally, token instances within each document (i.e., original character string) are tabulated in the format described in lda.collapsed.gibbs.sampler.
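To make the two code paths concrete, here is a small hypothetical sketch: with vocab left NULL the vocabulary is built from the unique (lowercased) tokens, while with an explicit vocab only matching tokens survive.

docs <- c("The cat sat", "the dog sat")
## vocab = NULL: tokens are lowercased and the vocabulary is built from the
## unique tokens ("the", "cat", "sat", "dog").
lexicalize(docs)
## Explicit vocab: "cat" and "dog" are not in the supplied vocabulary, so
## they are dropped from the tabulated documents.
lexicalize(docs, vocab = c("the", "sat"))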
If vocab is unspecified or NULL, a list with two components:

documents: A list of document matrices in the format described in lda.collapsed.gibbs.sampler.

vocab: A character vector of unique tokens occurring in the corpus.

If vocab is specified, only the list of document matrices is returned (as in the second example below).
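A short sketch of working with the return value, following the example output below: row 1 of each document matrix holds 0-based indices into vocab and row 2 holds the counts, so the tokens of a document can be recovered by indexing into the vocabulary.

corpus <- lexicalize(c("a tiny example"))
## Row 1 indexes into corpus$vocab (0-based), row 2 gives the counts, so the
## tokens of document 1 can be recovered like this:
corpus$vocab[corpus$documents[[1]][1, ] + 1]
## When vocab is supplied, only the list of document matrices comes back.
docs.only <- lexicalize(c("a tiny example"), vocab = c("tiny", "example"))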
Because of the limited tokenization and filtering capabilities of this function, it may not be useful in many cases. This may be resolved in a future release.
Jonathan Chang (slycoder@gmail.com)
lda.collapsed.gibbs.sampler for the format of the return value.

read.documents to generate the same output from a file encoded in LDA-C format.

word.counts to compute statistics associated with a corpus.

concatenate.documents for operations on a collection of documents.
## Generate an example.
example <- c("I am the very model of a modern major general",
"I have a major headache")
corpus <- lexicalize(example, lower=TRUE)
## corpus$documents:
## $documents[[1]]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 0 1 2 3 4 5 6 7 8 9
## [2,] 1 1 1 1 1 1 1 1 1 1
##
## $documents[[2]]
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 10 6 8 11
## [2,] 1 1 1 1 1
## corpus$vocab:
## $vocab
## [1] "i" "am" "the" "very" "model" "of"
## [7] "a" "modern" "major" "general" "have" "headache"
## Only keep words that appear at least twice:
to.keep <- corpus$vocab[word.counts(corpus$documents, corpus$vocab) >= 2]
## Re-lexicalize, using this subsetted vocabulary
documents <- lexicalize(example, lower=TRUE, vocab=to.keep)
## documents:
## [[1]]
## [,1] [,2] [,3]
## [1,] 0 1 2
## [2,] 1 1 1
##
## [[2]]
## [,1] [,2] [,3]
## [1,] 0 1 2
## [2,] 1 1 1