SimpleCorpus | R Documentation |
Create simple corpora.
SimpleCorpus(x, control = list(language = "en"))
x | a DataframeSource, DirSource, or VectorSource. |
control | a named list of control parameters. |
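As a minimal sketch of the two arguments, the following builds a simple corpus from a character vector and from a data frame. The object names (texts, df, vcorp, dcorp) and the sample strings are illustrative assumptions; the data frame layout follows the DataframeSource convention of a doc_id column followed by a text column.

```r
library(tm)

# From a character vector: each element becomes one document.
texts <- c("First document.", "Second document.")
vcorp <- SimpleCorpus(VectorSource(texts), control = list(language = "en"))

# From a data frame: DataframeSource expects the first two columns to be
# named doc_id and text (illustrative data).
df <- data.frame(doc_id = c("d1", "d2"), text = texts,
                 stringsAsFactors = FALSE)
dcorp <- SimpleCorpus(DataframeSource(df), control = list(language = "en"))

# Number of documents in each corpus.
length(vcorp)
length(dcorp)
```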
A simple corpus is fully kept in memory. Compared to a VCorpus, it is
optimized for the most common usage scenario: importing plain texts from
files in a directory or directly from a vector in R, preprocessing and
transforming the texts, and finally exporting them to a term-document matrix.
It adheres to the Corpus API. However, it internally takes various shortcuts
to boost performance and minimize memory pressure; consequently it operates
only under the following constraints:
only DataframeSource, DirSource and VectorSource are supported,
no custom readers, i.e., each document is read in and stored as plain text (as a string, i.e., a character vector of length one),
transformations applied via tm_map must be able to process character vectors and return character vectors (of the same length),
no lazy transformations in tm_map,
no meta data for individual documents (i.e., no "local" in meta).
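The workflow and constraints above can be sketched as follows, assuming a small in-memory vector of made-up texts. Each transformation passed to tm_map maps a character vector to a character vector of the same length, and the result is exported with TermDocumentMatrix. The variable names are illustrative, and some tm versions may emit a warning from tm_map when the input documents are unnamed.

```r
library(tm)

# Illustrative input: two short plain-text documents.
docs <- c("Hello, World!", "Simple   corpora are kept in memory.")
corp <- SimpleCorpus(VectorSource(docs), control = list(language = "en"))

# Each step takes a character vector and returns one of the same length.
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, stripWhitespace)

# The content of a simple corpus is a plain character vector.
cleaned <- unname(content(corp))
cleaned

# Export the preprocessed texts to a term-document matrix.
tdm <- TermDocumentMatrix(corp)
nDocs(tdm)
```

A function that changes the number of elements (for example, one that splits each text into sentences) would not satisfy the same-length requirement and is therefore not a suitable transformation here.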
An object inheriting from SimpleCorpus and Corpus.
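A quick way to confirm the class of the returned object is sketched below; the object name corp and the sample text are assumptions made for illustration.

```r
library(tm)

corp <- SimpleCorpus(VectorSource("A single document."))

# Check the classes the returned object inherits from.
is_simple <- inherits(corp, "SimpleCorpus")
is_corpus <- inherits(corp, "Corpus")
is_simple
is_corpus
```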
Corpus for basic information on the corpus infrastructure employed by
package tm.
VCorpus provides an implementation with volatile storage semantics, and
PCorpus provides an implementation with permanent storage semantics.
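library(tm)
# Locate the plain-text example files shipped with tm.
txt <- system.file("texts", "txt", package = "tm")
# Build the corpus from that directory; the outer parentheses print it.
(ovid <- SimpleCorpus(DirSource(txt, encoding = "UTF-8"),
                      control = list(language = "lat")))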