CorpusStudio: CorpusStudio

Description

CorpusStudio creates a Corpus object and prepares it for downstream cross-validation.

Usage

Arguments

x

A series of character vectors, each containing the text for a single document; a FileSet object containing .txt files; a character string naming the directory that holds the .txt files; a quanteda corpus object; or a tm VCorpus or SimpleCorpus object.
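To illustrate the "directory holding .txt files" input, here is a minimal base R sketch that reads every .txt file in a directory into one character vector per document. The helper name read_txt_dir and the demo files are hypothetical and not part of NLPStudio; the package's own reader may behave differently.

```r
# Hypothetical helper (not an NLPStudio function): read each .txt file in
# a directory into a character vector, one list element per document.
read_txt_dir <- function(dir) {
  paths <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
  docs <- lapply(paths, readLines, warn = FALSE)
  names(docs) <- basename(paths)
  docs
}

# Demo on a throwaway directory with two small documents.
demo_dir <- tempfile("corpus_")
dir.create(demo_dir)
writeLines(c("First line.", "Second line."), file.path(demo_dir, "a.txt"))
writeLines("Only line.", file.path(demo_dir, "b.txt"))
docs <- read_txt_dir(demo_dir)
```

Each element of docs is a character vector holding the lines of one document, which matches the "series of character vectors" form described for x.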

name

Character string containing the name to assign to the final CVSet or CVSetKFold object.

cv

The type of cross-validation product to deliver. Valid values are c('standard', 'kFold'). The default is 'standard'; one-letter abbreviations are accepted.
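One common way to accept abbreviated values for an argument like cv in R is base match.arg, which does partial matching against the allowed choices. The sketch below assumes that approach; the function name resolve_cv is hypothetical, and the source does not state how CorpusStudio actually resolves abbreviations.

```r
# Hypothetical helper: resolve a full or abbreviated 'cv' value.
# match.arg() returns the first choice when the argument is left at its
# default, and partially matches abbreviations otherwise.
resolve_cv <- function(cv = c("standard", "kFold")) {
  match.arg(cv)
}

default_cv <- resolve_cv()     # falls back to the first choice
short_s    <- resolve_cv("s")  # one-letter abbreviation of 'standard'
short_k    <- resolve_cv("k")  # one-letter abbreviation of 'kFold'
```

Matching in match.arg is case-sensitive, so under this assumption "K" would be rejected with an error.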

textConfig

A TextConfig object that encapsulates the text cleaning configuration.

n

Numeric parameter used by the sample method. It gives either the number of samples to obtain from the Corpus or the proportion of the Corpus to sample, and is applied before the Corpus is split into cross-validation set(s).
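Because n can be either a count or a proportion, a caller needs some rule to tell the two apart. The sketch below assumes that values strictly between 0 and 1 are proportions and values of 1 or more are counts; that rule and the name resolve_sample_size are assumptions, not documented NLPStudio behaviour.

```r
# Hypothetical helper: turn 'n' into a number of documents to sample.
# Assumption: 0 < n < 1 is a proportion of the corpus, n >= 1 is a count.
resolve_sample_size <- function(n, corpus_size) {
  stopifnot(is.numeric(n), length(n) == 1, n > 0)
  if (n < 1) {
    round(n * corpus_size)
  } else {
    n
  }
}

from_proportion <- resolve_sample_size(0.25, corpus_size = 200)
from_count      <- resolve_sample_size(30, corpus_size = 200)
```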

k

Numeric. If 'cv' is 'kFold', the number of folds to produce.

stratify

Logical. If TRUE (default), splits and sampling will be stratified.

replace

Logical. If TRUE, sampling is conducted with replacement. The default is FALSE.
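As a rough illustration of how stratify and replace interact, the base R sketch below draws the same proportion from each stratum, with or without replacement. The source does not say which variable defines the strata; the labels used here (blogs, news, twitter) and the function stratified_sample are hypothetical.

```r
# Hypothetical helper: sample the same proportion from every stratum.
# 'strata' holds one label per document; returns sampled document indices.
stratified_sample <- function(strata, prop, replace = FALSE, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  idx_by_stratum <- split(seq_along(strata), strata)
  picked <- lapply(idx_by_stratum, function(idx) {
    size <- max(1L, round(prop * length(idx)))
    # Index through sample.int() so a one-element stratum is handled safely.
    idx[sample.int(length(idx), size = size, replace = replace)]
  })
  sort(unlist(picked, use.names = FALSE))
}

# Assumed example labels: 10 blog, 20 news and 30 twitter documents.
strata  <- rep(c("blogs", "news", "twitter"), times = c(10, 20, 30))
sampled <- stratified_sample(strata, prop = 0.5, seed = 1)
counts  <- table(strata[sampled])
```

With prop = 0.5 each stratum contributes half of its documents, so the sample keeps the 1:2:3 ratio of the full corpus. A non-stratified sample of the same size would only preserve that ratio on average.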

train

Numeric indicating the proportion of the Corpus to allocate to the training set. Acceptable values are between 0 and 1. The total of the values for the train, validation and test parameters must equal 1.

validation

Numeric indicating the proportion of the Corpus to allocate to the validation set. Acceptable values are between 0 and 1. The total of the values for the train, validation and test parameters must equal 1.

test

Numeric indicating the proportion of the Corpus to allocate to the test set. Acceptable values are between 0 and 1. The total of the values for the train, validation and test parameters must equal 1.
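The constraint that train, validation and test must total 1 can be checked with a floating-point tolerance before any documents are assigned. The following is a minimal base R sketch of such a check followed by a seeded random split; split_indices is a hypothetical name, and the rounding rule is an assumption rather than the package's documented behaviour.

```r
# Hypothetical helper: validate the three proportions, then split the
# document indices 1..n_docs into train, validation and test sets.
split_indices <- function(n_docs, train = 0.7, validation = 0.15,
                          test = 0.15, seed = NULL) {
  props <- c(train = train, validation = validation, test = test)
  stopifnot(all(props >= 0), all(props <= 1))
  # all.equal() tolerates floating-point error in the sum.
  if (!isTRUE(all.equal(sum(props), 1))) {
    stop("train, validation and test must sum to 1")
  }
  if (!is.null(seed)) set.seed(seed)
  shuffled <- sample.int(n_docs)
  n_train <- round(train * n_docs)
  n_val   <- min(round(validation * n_docs), n_docs - n_train)
  n_test  <- n_docs - n_train - n_val  # test takes whatever remains
  list(
    train      = shuffled[seq_len(n_train)],
    validation = shuffled[n_train + seq_len(n_val)],
    test       = shuffled[n_train + n_val + seq_len(n_test)]
  )
}

parts <- split_indices(100, train = 0.7, validation = 0.15, test = 0.15,
                       seed = 123)
```

Setting validation = 0 in this sketch yields an empty validation set, which corresponds to the case where the validation set is optional.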

seed

Numeric used to initialize a pseudorandom number generator.

Format

An object of class R6ClassGenerator of length 24.

Details

Class responsible for creating, cleaning, sampling, splitting and constructing the cross-validation object that will be used by downstream modeling classes. This is performed in five stages.

The first stage builds the Corpus object from one of several sources: a directory source, a FileSet object, a tm corpus object, or a quanteda corpus object. The second stage is optional and reshapes the Corpus object into word, sentence or paragraph units. The third stage, the sampling stage, takes a stratified or non-stratified sample from the Corpus object. The fourth stage, the cross-validation stage, produces one of two cross-validation objects: a CVSet object, which comprises a training set, a test set and an optional validation set, or a CVSetKFold object, which contains k folds, each comprising a training set and a test set. The cross-validation product is the final product and forms the data basis for the modeling phase.
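To make the k-fold product concrete, the sketch below builds k folds in base R, each holding a training and a test set of document indices. It illustrates the general technique only; make_folds is a hypothetical function and the internal layout of a CVSetKFold object is not described in this page.

```r
# Hypothetical helper: assign each of n_docs documents to one of k folds,
# then build a train/test pair of indices for every fold.
make_folds <- function(n_docs, k, seed = NULL) {
  stopifnot(k >= 2, k <= n_docs)
  if (!is.null(seed)) set.seed(seed)
  # Balanced fold labels (1..k repeated), shuffled across documents.
  fold_id <- sample(rep(seq_len(k), length.out = n_docs))
  lapply(seq_len(k), function(i) {
    list(train = which(fold_id != i), test = which(fold_id == i))
  })
}

folds <- make_folds(n_docs = 10, k = 5, seed = 42)
```

Every document appears in exactly one test set and in the training sets of the remaining k - 1 folds.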

Author(s)

John James, jjames@dataScienceSalon.org

See Also

Other CorpusStudio Family of Classes: KFold, Sample0, Sample, Segment, Split, TokenizerNLP, TokenizerQ, Tokenizer, Token


DecisionScients/NLPStudio documentation built on May 15, 2019, 12:51 p.m.