Data structures and operators for distributed corpora.
x
for
readerControl
A list with the named components
storage
The storage subsystem to use with the DCorpus. Two types of storage are currently supported: local disk storage using the Local File System (LFS) and the Hadoop Distributed File System (HDFS). Default: 'LFS'.
keep
Should revisions be used when operating on the
...
Optional arguments for the
When constructing a distributed corpus, the input source is
extracted via the supplied reader and stored on the given file
system (argument storage). While the data set resides on the
corresponding storage (e.g., HDFS), only a symbolic representation is
held in R (a so-called DList), which allows the corpus to be
accessed via the corresponding DList methods. Since the memory
available to the distributed corpus is restricted only by the
available disk space in the given storage (and not by main memory, as in
a standard tm corpus), a set of so-called revisions, i.e., stages of
the (processed) corpus, is also stored by default. Revisions
can be turned off later using the keepRevisions()
replacement function.
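As a rough illustration of the last point, the sketch below builds a small distributed corpus from the sample files shipped with tm and then switches revisions off. It assumes the package providing DistributedCorpus() is tm.plugin.dc and that keepRevisions() can also be called as a plain getter; both are assumptions to check against the installed version.

```r
## Minimal sketch, assuming the tm.plugin.dc package is installed.
library("tm.plugin.dc")

## Sample Reuters files shipped with package 'tm'
crude_dir <- system.file("texts", "crude", package = "tm")

## Build the distributed corpus with the default storage settings
dc <- DistributedCorpus(DirSource(crude_dir),
                        readerControl = list(reader = readReut21578XMLasPlain))

## Turn revisions off via the replacement function
keepRevisions(dc) <- FALSE

## Assumed getter form: report the current setting
keepRevisions(dc)
```

With revisions off, later processing stages are not kept as separate snapshots, which saves space on the storage at the cost of not being able to go back to an earlier stage.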
The constructed corpus object inherits from a tm Corpus and has several slots containing meta
information:
meta
Corpus Meta Data: corpus-specific metadata in the form of tag-value pairs.
dmeta
Document Meta Data of class data.frame: document-specific metadata for the
corpus. This is mainly available for compatibility with standard
tm corpus definitions and is not yet actually used in the
distributed scenario.
keep
A logical indicating whether revisions representing stages, e.g., in a preprocessing chain, should be kept.
An object inheriting from DCorpus and Corpus.
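The class relationship stated above can be checked directly with inherits(). This is only a sketch: it assumes the tm.plugin.dc package is installed and uses the crude data set from tm as input.

```r
## Minimal sketch, assuming the tm.plugin.dc package is installed.
library("tm.plugin.dc")

## 'crude' is a small corpus shipped with package 'tm'
data("crude")
dc <- as.DistributedCorpus(crude)

## Check the classes named in the Value description
inherits(dc, "DCorpus")
inherits(dc, "Corpus")
```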
Ingo Feinerer and Stefan Theussl
Corpus for basic information on the corpus infrastructure employed by package tm.
## Similar to example in package 'tm'
library("tm.plugin.dc")  # also loads DSL, tm and NLP
reut21578 <- system.file("texts", "crude", package = "tm")
dc <- DistributedCorpus(DirSource(reut21578),
                        readerControl = list(reader = readReut21578XMLasPlain))
dc
## Coercion
data("crude")
as.DistributedCorpus(crude)
as.VCorpus(dc)
Loading required package: DSL
Loading required package: tm
Loading required package: NLP
<<DCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 20
<<DCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 20
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 20