DistributedCorpus: Distributed Corpus


Description

Data structures and operators for distributed corpora.

Usage

DCorpus( x,
         readerControl = list(reader   = reader(x),
                              language = "en"),
         storage = NULL, keep = TRUE, ... )
## S3 method for class 'DCorpus'
as.VCorpus(x)
as.DCorpus( x, storage = NULL, ... )

Arguments

x

For DCorpus(), a Source object. At the moment only DirSource is supported. For as.VCorpus() and as.DCorpus(), an object to be coerced to a VCorpus or DCorpus, respectively. Currently only coercion from and to classic tm corpora (VCorpus) is implemented.

readerControl

A list with the named components reader, a reading function capable of handling the file format found in x, and language, the text's language (preferably as an IETF language tag, see language in package NLP).

storage

The storage subsystem to use with the DCorpus. Two storage types are currently supported: local disk storage using the Local File System (LFS) and the Hadoop Distributed File System (HDFS). Default: 'LFS'.

keep

Should revisions be used when operating on the DCorpus? Default: TRUE

...

Optional arguments for the reader.
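
The following is a minimal sketch of how the storage and keep arguments might be supplied explicitly. It assumes that the storage object is created with DStorage() from package DSL (which tm.plugin.dc loads) and that DCorpus() accepts such an object; the base directory is simply R's temporary directory and is chosen for illustration only.

```r
library(tm)
library(tm.plugin.dc)

## Assumption: a local-disk (LFS) storage created via DSL::DStorage().
## The base directory is illustrative; any writable directory should do.
store <- DSL::DStorage(type = "LFS", base_dir = tempdir())

## Same sample data as in the Examples section of this page.
crude_dir <- system.file("texts", "crude", package = "tm")

## Build the corpus on the explicit storage and do not keep revisions.
dc <- DCorpus(DirSource(crude_dir),
              readerControl = list(reader   = readReut21578XMLasPlain,
                                   language = "en"),
              storage = store,
              keep    = FALSE)
dc
```

Leaving storage at its default avoids the DSL call entirely; passing an explicit storage object is mainly of interest when the corpus should live in a particular directory or on HDFS.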

Details

When a distributed corpus is constructed, the input source is read via the supplied reader and stored on the given file system (argument storage). While the data set resides on the corresponding storage (e.g., HDFS), only a symbolic representation (a so-called DList) is held in R, and the corpus is accessed via the corresponding DList methods. Since the memory available to a distributed corpus is limited only by the disk space of the given storage (and not by main memory, as for a standard tm corpus), a set of so-called revisions, i.e., stages of the (processed) corpus, is also stored by default. Revisions can be turned off later using the keepRevisions() replacement function.
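
As an illustration of the last point, the sketch below coerces the in-memory crude corpus from package tm to a DCorpus on the default storage and then switches revisions off. Only the replacement form of keepRevisions() mentioned above is used; how previously stored revisions are treated afterwards is not covered here.

```r
library(tm)
library(tm.plugin.dc)

## 'crude' is the sample VCorpus shipped with package tm.
data("crude")

## Coerce to a distributed corpus on the default storage.
dc <- as.DCorpus(crude)

## Turn off revisions for subsequent operations on the corpus.
keepRevisions(dc) <- FALSE
dc
```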

The constructed corpus object inherits from a tm Corpus and has several slots containing meta information:

meta

Corpus metadata: corpus-specific metadata in the form of tag-value pairs.

dmeta

Document metadata: a data.frame containing document-specific metadata for the corpus. It is provided mainly for compatibility with standard tm corpus definitions and is not yet actually used in the distributed scenario.

keep

A logical indicating whether revisions representing stages of the corpus (e.g., in a preprocessing chain) should be kept.

Value

An object inheriting from DCorpus and Corpus.

Author(s)

Ingo Feinerer and Stefan Theussl

See Also

Corpus for basic information on the corpus infrastructure employed by package tm.

Examples

## Similar to example in package 'tm'
reut21578 <- system.file("texts", "crude", package = "tm")
dc <- DistributedCorpus(DirSource(reut21578),
                        readerControl = list(reader = readReut21578XMLasPlain))
dc

## Coercion
data("crude")
as.DistributedCorpus(crude)
as.VCorpus(dc)

Example output

Loading required package: DSL
Loading required package: tm
Loading required package: NLP
<<DCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 20
<<DCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 20
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 20

tm.plugin.dc documentation built on Nov. 29, 2020, 5:07 p.m.