SimpleCorpus: Simple Corpora

Description

Create simple corpora.

Usage

SimpleCorpus(x, control = list(language = "en"))

Arguments

x

a DataframeSource, DirSource or VectorSource.

control

a named list of control parameters.

language

a character giving the language (preferably as IETF language tags, see language in package NLP). The default language is assumed to be English ("en").

Details

A simple corpus is kept fully in memory. Compared to a VCorpus, it is optimized for the most common usage scenario: importing plain texts from files in a directory or directly from a vector in R, preprocessing and transforming the texts, and finally exporting them to a term-document matrix. It adheres to the Corpus API. However, it takes various shortcuts internally to boost performance and minimize memory pressure; consequently, it operates only under the following constraints:

Value

An object inheriting from SimpleCorpus and Corpus.

See Also

Corpus for basic information on the corpus infrastructure employed by package tm.

VCorpus provides an implementation with volatile storage semantics, and PCorpus provides an implementation with permanent storage semantics.

Examples

txt <- system.file("texts", "txt", package = "tm")
(ovid <- SimpleCorpus(DirSource(txt, encoding = "UTF-8"),
                      control = list(language = "lat")))

Example output

Loading required package: NLP
<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 5
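
The usage scenario described under Details (import from a vector, preprocess, export to a term-document matrix) might look roughly like the following. This is a minimal sketch, not part of the original manual page: the sample texts are made up for illustration, and the choice of transformations (lower-casing, punctuation removal, whitespace stripping) is an assumption about a typical pipeline.

```r
library(tm)

# Hypothetical input: two short texts held in an ordinary character vector.
texts <- c("The quick brown fox jumps over the lazy dog.",
           "A lazy dog sleeps; the quick fox runs away!")

# Import directly from the vector.
corpus <- SimpleCorpus(VectorSource(texts),
                       control = list(language = "en"))

# Preprocess: each step maps the document texts to new document texts.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)

# Export to a term-document matrix (terms in rows, documents in columns).
tdm <- TermDocumentMatrix(corpus)
inspect(tdm)
```

The same steps should also work with a DirSource in place of the VectorSource, as in the example above.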

tm documentation built on May 2, 2019, 2:43 a.m.