Representing and computing on corpora.
Corpora are collections of documents containing (natural language)
text. In packages which employ the infrastructure provided by package
tm, such corpora are represented via the virtual S3 class
Corpus: such packages then provide S3 corpus classes extending the
virtual base class (such as VCorpus, provided by package tm).
All extension classes must provide accessors to extract subsets
([), individual documents ([[), and metadata (meta). The function
length must return the number of documents, and as.list must
construct a list holding the documents.
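The accessor contract described above can be illustrated with a minimal sketch. It assumes package tm is installed and uses VCorpus with VectorSource as the concrete class; the sample texts and variable names are hypothetical and serve only as illustration.

```r
library(tm)

# Hypothetical sample texts, used only for illustration.
texts <- c("The first document.",
           "A second, slightly longer document.",
           "The third one.")

# Build a volatile corpus from a character vector.
corp <- VCorpus(VectorSource(texts))

is_corpus <- inherits(corp, "Corpus")  # VCorpus extends the virtual class Corpus
n_docs    <- length(corp)              # number of documents in the corpus

sub      <- corp[1:2]                  # `[` extracts a subset, itself a corpus
doc      <- corp[[2]]                  # `[[` extracts an individual document
doc_text <- as.character(doc)          # text content of that document

docs <- as.list(corp)                  # list holding the documents
```

Note that `[` returns a corpus while `[[` returns a single document, mirroring the usual list semantics in R.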
A corpus can have two types of metadata (accessible via meta).
Corpus metadata contains corpus-specific metadata in the form of tag-value
pairs. Document-level metadata contains document-specific metadata but
is stored in the corpus as a data frame. Document-level metadata is typically
used for semantic reasons (e.g., classifications of documents form an
entity of their own due to some high-level information like the range of
possible values) or for performance reasons (single access instead of
extracting metadata of each document).
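As a rough sketch of the two metadata types, the following assumes package tm is installed and uses the type argument of meta as implemented for VCorpus ("corpus" for corpus metadata, "indexed" for document-level metadata). The tags "creator" and "topic" and their values are hypothetical.

```r
library(tm)

# Hypothetical two-document corpus.
corp <- VCorpus(VectorSource(c("Stocks rallied today.",
                               "The match ended 2-1.")))

# Corpus metadata: tag-value pairs describing the corpus as a whole.
meta(corp, tag = "creator", type = "corpus") <- "editorial team"
creator <- meta(corp, tag = "creator", type = "corpus")

# Document-level metadata: one value per document, stored in the corpus
# as a data frame (type = "indexed" is the default for a VCorpus).
meta(corp, tag = "topic") <- c("finance", "sports")
doc_meta <- meta(corp)        # the data frame of document-level metadata
topics   <- doc_meta$topic    # single access for all documents at once
```

Reading the topic column once from the data frame avoids looping over the documents and querying each one, which is the performance argument made above.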
The function Corpus is a convenience alias to a concrete corpus constructor
such as VCorpus, depending on the arguments provided.
See also DCorpus for a distributed corpus class provided by a separate package.