Create simple corpora.
SimpleCorpus(x, control = list(language = "en"))
x: a Source object.
control: a named list of control parameters.
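As a minimal sketch of the call signature, the following builds a simple corpus from an in-memory character vector. The `docs` vector is a hypothetical input made up for illustration; `VectorSource`, `content`, and `length` are used as provided by package tm.

```r
library(tm)

# Hypothetical example texts (not part of the package).
docs <- c("The quick brown fox.", "Jumps over the lazy dog.")

# Wrap the vector in a source and build the corpus, setting the language.
corp <- SimpleCorpus(VectorSource(docs), control = list(language = "en"))

length(corp)         # number of documents in the corpus
content(corp[[1]])   # text of the first document
```

Omitting `control` should fall back to the default shown in the usage, `list(language = "en")`.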
A simple corpus is kept fully in memory. Compared to a VCorpus, it is optimized for the most common usage scenario: importing plain texts from files in a directory or directly from a vector in R, preprocessing and transforming the texts, and finally exporting them to a term-document matrix.
It adheres to the Corpus API. However, internally it takes various shortcuts to boost performance and minimize memory pressure; consequently it operates only under the following constraints:
only DataframeSource, DirSource and VectorSource are supported,
no custom readers, i.e., each document is read in and stored as plain text (as a string, i.e., a character vector of length one),
transformations applied via tm_map must be able to process character vectors and return character vectors (of the same length),
no lazy transformations in tm_map,
no meta data for individual documents (i.e., no "local" in meta).
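The transformation constraint can be illustrated with a short sketch. It assumes that `tolower`, wrapped in `content_transformer`, is an acceptable transformation because it maps a character vector to a character vector of the same length. The input strings are hypothetical, and `type = "corpus"` is passed explicitly to `meta` to request corpus-level metadata.

```r
library(tm)

# Hypothetical input texts, for illustration only.
corp <- SimpleCorpus(VectorSource(c("Hello World", "Text MINING")))

# tolower() takes a character vector and returns one of the same length,
# so it satisfies the constraint on transformations.
lowered <- tm_map(corp, content_transformer(tolower))
unname(content(lowered))
# [1] "hello world" "text mining"

# Metadata is available at the corpus level, not per document.
meta(corp, "language", type = "corpus")
```

A transformation that returns fewer elements than it receives would drop documents, which is why the same-length requirement matters here.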
An object inheriting from SimpleCorpus and Corpus.
See Corpus for basic information on the corpus infrastructure employed by package tm.
VCorpus provides an implementation with volatile storage semantics, and PCorpus provides an implementation with permanent storage semantics.
library(tm)
txt <- system.file("texts", "txt", package = "tm")
(ovid <- SimpleCorpus(DirSource(txt, encoding = "UTF-8"),
                      control = list(language = "lat")))
Loading required package: NLP
<<SimpleCorpus>>
Metadata: corpus specific: 1, document level (indexed): 0
Content: documents: 5
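The usage scenario named in the details (import, preprocess, export to a term-document matrix) might look like the following sketch. The input texts are hypothetical, and the choice of preprocessing steps (`tolower`, `removePunctuation`, `stripWhitespace`) is an assumption made for illustration rather than a recommended pipeline.

```r
library(tm)

# Hypothetical input texts, used only for illustration.
texts <- c("Corpora are collections of  documents.",
           "Documents contain terms; terms are counted.")

# Import directly from a character vector.
corp <- SimpleCorpus(VectorSource(texts), control = list(language = "en"))

# Preprocess with character-to-character transformations.
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, stripWhitespace)

# Export to a term-document matrix (terms in rows, documents in columns).
tdm <- TermDocumentMatrix(corp)
nDocs(tdm)    # number of documents
inspect(tdm)  # print a summary and a sample of the matrix
```

Each step reassigns `corp`, since simple corpora do not support lazy transformations and every `tm_map` call is applied immediately.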