SimpleCorpus: Simple Corpora
In tm: Text Mining Package

SimpleCorpus

R Documentation

Simple Corpora

Description

Create simple corpora.

Usage

SimpleCorpus(x, control = list(language = "en"))

Arguments

x

a DataframeSource, DirSource or VectorSource.

control

a named list of control parameters.

language: a character giving the language (preferably as IETF language tags, see language in package NLP). The default language is assumed to be English ("en").
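As a minimal sketch of how the two arguments fit together, the following builds a simple corpus from an in-memory character vector and from a data frame. The sample texts and the object names (docs, df, corp_vec, corp_df) are made up for illustration; the data frame is assumed to follow the doc_id/text column layout that DataframeSource expects.

```r
library(tm)

# Hypothetical sample texts (named, so each document carries a label).
docs <- c(doc1 = "The quick brown fox.", doc2 = "Jumps over the lazy dog.")

# Corpus from a character vector, with the language set explicitly.
corp_vec <- SimpleCorpus(VectorSource(docs), control = list(language = "en"))

# Corpus from a data frame whose first two columns are doc_id and text.
df <- data.frame(doc_id = names(docs), text = unname(docs),
                 stringsAsFactors = FALSE)
corp_df <- SimpleCorpus(DataframeSource(df), control = list(language = "en"))

# Both corpora hold one document per input text.
length(corp_vec)
length(corp_df)
```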

Details

A simple corpus is fully kept in memory. Compared to a VCorpus, it is optimized for the most common usage scenario: importing plain texts from files in a directory or directly from a vector in R, preprocessing and transforming the texts, and finally exporting them to a term-document matrix. It adheres to the Corpus API. However, internally it takes various shortcuts to boost performance and minimize memory pressure; consequently, it operates only under the following constraints:

only DataframeSource, DirSource and VectorSource are supported,
no custom readers, i.e., each document is read in and stored as plain text (as a string, i.e., a character vector of length one),
transformations applied via tm_map must be able to process character vectors and return character vectors (of the same length),
no lazy transformations in tm_map,
no meta data for individual documents (i.e., no "local" in meta).
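The constraints on tm_map can be illustrated with a short sketch of the usage scenario described above: import from a vector, transform, and export to a term-document matrix. The sample texts and the helper drop_digits are hypothetical; the point is only that every transformation takes a character vector and returns a character vector of the same length.

```r
library(tm)

# Hypothetical sample texts.
docs <- c(doc1 = "The quick brown fox.", doc2 = "Jumps over the lazy dog.")
corp <- SimpleCorpus(VectorSource(docs))

# Each step maps a character vector to a character vector of equal length.
clean <- tm_map(corp, content_transformer(tolower))
clean <- tm_map(clean, removePunctuation)

# A hypothetical custom step that strips digits; a plain function on
# character vectors is enough here.
drop_digits <- function(x) gsub("[[:digit:]]+", "", x)
clean <- tm_map(clean, drop_digits)

# Inspect the transformed texts, then export to a term-document matrix.
content(clean)
tdm <- TermDocumentMatrix(clean)
dim(tdm)
```

A transformation that changes the number of elements (for example, one that splits each text into sentences) would not fit this model and is better suited to a VCorpus.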

Value

An object inheriting from SimpleCorpus and Corpus.
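A quick way to check the returned class, sketched with a made-up two-document corpus (the texts and the name corp are illustrative only):

```r
library(tm)

# Hypothetical two-document corpus.
corp <- SimpleCorpus(VectorSource(c(a = "one text", b = "another text")))

# The object inherits from both SimpleCorpus and Corpus, but not from VCorpus.
inherits(corp, "SimpleCorpus")
inherits(corp, "Corpus")
inherits(corp, "VCorpus")
```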

Examples

library(tm)

# Directory with the sample plain-text files shipped with tm
txt <- system.file("texts", "txt", package = "tm")
(ovid <- SimpleCorpus(DirSource(txt, encoding = "UTF-8"),
                      control = list(language = "lat")))

tm documentation built on Sept. 11, 2024, 6:47 p.m.