kRp.corpus: S4 Class kRp.corpus
In unDocUMeantIt/tm.plugin.koRpus: Full Corpus Support for the 'koRpus' Package


Objects of this class can contain full text corpora in a hierarchical structure. The class supports both the tm package's Corpus class and koRpus' own object classes and stores them in separate slots.

Objects should be created using the readCorpus function.

lang

A character string, naming the language that is assumed for the tokenized texts in this object.

desc

A named list of descriptive statistics of the tagged texts.

meta

A named list. Can be used to store meta information. Currently, no particular format is defined.

raw

A list of objects of class Corpus.

tokens

A data frame as used for the tokens slot in objects of class kRp.text. In addition to the columns usually found in those objects, this data frame also has a factor column for each hierarchical category defined (if any).

features

A named logical vector, indicating which features are available in this object's feat_list slot. Common features are listed in the description of the feat_list slot.

feat_list

A named list with optional analysis results or other content as used by the defined features:

hierarchy A named list of named character vectors describing the directory hierarchy level by level.
hyphen A named list of objects of class kRp.hyphen.
readability A named list of objects of class kRp.readability.
lex_div A named list of objects of class kRp.TTR.
freq The freq.analysis slot of a kRp.txt.freq class object after freq.analysis was called.
corp_freq An object of class kRp.corp.freq, e.g., results of a call to read.corp.custom.
diff A named list of diff features of a kRp.text object after a method like textTransform was called.
summary A summary data frame for the full corpus, including descriptive statistics on all texts, as well as results of analyses like readability and lexical diversity, if available.
doc_term_matrix A sparse document-term matrix, as produced by docTermMatrix.
stopwords A numeric vector with the total number of stopwords in each text, if stopwords were analyzed during tokenizing or POS tagging.

See the getter and setter methods for easy access to these sub-slots. Any number of additional features can be stored; the list above covers only those already defined by this package.
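To illustrate the "named list of named character vectors" format of the hierarchy feature, here is a minimal base R sketch that expands such a list into the relative subdirectory paths it implies. It assumes that the vector names (e.g., Winner) are directory names and the values (e.g., "Reality Winner") are descriptive labels; that reading is an assumption taken from the readCorpus example on this page, and the variable names dir_names, combos, and paths are made up for illustration. This is not how the package itself resolves the hierarchy.

```r
# hierarchy definition as used in the readCorpus() example on this page
hierarchy <- list(
  Topic = c(
    Winner = "Reality Winner",
    Edwards = "Natalie Edwards"
  ),
  Source = c(
    Wikipedia_prev = "Wikipedia (old)",
    Wikipedia_new = "Wikipedia (new)"
  )
)

# assumption: the names of each vector are the directory names on that level
dir_names <- lapply(hierarchy, names)

# all combinations of directory names, top level first
combos <- expand.grid(dir_names, stringsAsFactors = FALSE)

# relative paths below the corpus root, one per combination
paths <- do.call(file.path, unname(as.list(combos)))

# pair each path with the descriptive labels of its categories
overview <- data.frame(
  path = paths,
  Topic = unname(hierarchy[["Topic"]][combos[["Topic"]]]),
  Source = unname(hierarchy[["Source"]][combos[["Source"]]]),
  stringsAsFactors = FALSE
)
print(overview)
```

With two levels of two categories each, the sketch yields four paths, one for each combination of Topic and Source.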

Should you need to manually generate objects of this class (which should rarely be the case), the constructor function kRp.corpus(...) can be used instead of new("kRp.corpus", ...). Whenever possible, stick to readCorpus.

There are also getter and setter methods for objects of this class.

# use readCorpus() to create an object of class kRp.corpus
# code is only run when the English language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
  myCorpus <- readCorpus(
    dir=file.path(path.package("tm.plugin.koRpus"), "examples", "corpus"),
    hierarchy=list(
      Topic=c(
        Winner="Reality Winner",
        Edwards="Natalie Edwards"
      ),
      Source=c(
        Wikipedia_prev="Wikipedia (old)",
        Wikipedia_new="Wikipedia (new)"
      )
    ),
    # use tokenize() so examples run without a TreeTagger installation
    tagger="tokenize",
    lang="en"
  )
} else {}

# manual creation
emptyCorpus <- kRp.corpus()
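
As a rough sketch of what the constructor returns, the slots documented above can be inspected with generic functions from R's methods package. The snippet assumes tm.plugin.koRpus is installed and does nothing otherwise. Direct slot access is used here only for illustration; the package's getter and setter methods remain the recommended route.

```r
library(methods)

# only run if the package can be loaded
if (require("tm.plugin.koRpus", quietly = TRUE)) {
  emptyCorpus <- kRp.corpus()

  # confirm the class of the object and list its slots
  print(is(emptyCorpus, "kRp.corpus"))
  print(slotNames(emptyCorpus))

  # the features slot indicates which entries of feat_list are available
  print(slot(emptyCorpus, "features"))
  print(names(slot(emptyCorpus, "feat_list")))
}
```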