CorpusStudio
Creates a Corpus object then prepares it for cross-validation downstream.
x | A series of character vectors (each containing the text of a single document), a FileSet object containing .txt files, a character string giving the directory that holds .txt files, a quanteda corpus object, or a tm VCorpus or SimpleCorpus object. |
name | Character string containing the name to assign to the final CVSet or CVSetKFold object. |
cv | The type of cross-validation product to deliver. Valid values are c('standard', 'kFold'). The default is 'standard'; one-letter abbreviations are accepted. |
textConfig | A TextConfig object that encapsulates the text-cleaning configuration. |
n | Numeric parameter used by the sample method: either the number of samples to draw from the Corpus or the proportion of the Corpus to sample before splitting into cross-validation set(s). |
k | Numeric. If cv is 'kFold', the number of folds to produce. |
stratify | Logical. If TRUE (default), splits and sampling are stratified. |
replace | Logical. If TRUE, sampling is conducted with replacement. The default is FALSE. |
train | Numeric indicating the proportion of the Corpus to allocate to the training set. Acceptable values are between 0 and 1. The values of train, validation and test must sum to 1. |
validation | Numeric indicating the proportion of the Corpus to allocate to the validation set. Acceptable values are between 0 and 1. The values of train, validation and test must sum to 1. |
test | Numeric indicating the proportion of the Corpus to allocate to the test set. Acceptable values are between 0 and 1. The values of train, validation and test must sum to 1. |
seed | Numeric used to initialize the pseudorandom number generator. |
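The constraint on train, validation and test, together with the role of seed, can be illustrated with a small base R sketch. It is not the package's implementation: split_indices is a hypothetical helper invented for this illustration, and it works on document indices rather than on a Corpus object.

```r
# Hypothetical helper (not part of the package): partitions document indices
# according to train/validation/test proportions.
split_indices <- function(n_docs, train = 0.7, validation = 0.15, test = 0.15,
                          seed = NULL) {
  props <- c(train = train, validation = validation, test = test)
  # Each proportion must lie in [0, 1] and the three must sum to 1
  # (compared with a floating-point tolerance).
  stopifnot(all(props >= 0), all(props <= 1),
            isTRUE(all.equal(sum(props), 1)))
  # Seeding makes the shuffle, and therefore the split, reproducible.
  if (!is.null(seed)) set.seed(seed)
  shuffled <- sample.int(n_docs)
  n_train <- floor(train * n_docs)
  n_val <- floor(validation * n_docs)
  n_test <- n_docs - n_train - n_val  # remainder goes to the test set
  list(
    train = shuffled[seq_len(n_train)],
    validation = shuffled[n_train + seq_len(n_val)],
    test = shuffled[n_train + n_val + seq_len(n_test)]
  )
}

sets <- split_indices(100, train = 0.6, validation = 0.2, test = 0.2, seed = 42)
lengths(sets)
```

Setting validation = 0 yields an empty validation set, which corresponds to the optional validation set described for the CVSet object.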
An object of class R6ClassGenerator of length 24.
Class responsible for creating, cleaning, sampling, splitting and constructing the cross-validation object that will be used by downstream modeling classes. This is performed in five stages.
The first stage builds the Corpus object from one of several sources: a directory source, a FileSet object, a tm corpus object, or a quanteda corpus object. The second stage is optional and reshapes the Corpus object into word, sentence or paragraph units. The third stage, the sampling stage, draws a stratified or non-stratified sample from the Corpus object. The fourth stage, the cross-validation stage, produces one of two cross-validation objects: a CVSet object, which comprises a training set, a test set and an optional validation set, or a CVSetKFold object, which contains k folds, each comprising a training and a test set. The cross-validation product is the final product and forms the data basis for the modeling phase.
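As a rough illustration of the stratified k-fold idea behind the CVSetKFold product, the base R sketch below deals documents into folds stratum by stratum. It assumes each document carries a stratum label; stratified_folds and the labels "blogs", "news" and "twitter" are hypothetical, and the package's own procedure may differ.

```r
# Hypothetical helper (not part of the package): assigns each document to one
# of k folds so that every stratum is spread evenly across the folds.
stratified_folds <- function(strata, k = 5, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  folds <- integer(length(strata))
  for (idx in split(seq_along(strata), strata)) {
    # Shuffle the stratum, then deal its members round-robin into the folds.
    idx <- idx[sample.int(length(idx))]
    folds[idx] <- rep_len(seq_len(k), length(idx))
  }
  folds
}

# Made-up stratum labels, one per document.
strata <- rep(c("blogs", "news", "twitter"), times = c(50, 30, 20))
folds <- stratified_folds(strata, k = 5, seed = 1)
table(strata, folds)

# Fold i serves as the test set; the remaining folds form the training set.
cv <- lapply(seq_len(5), function(i) {
  list(train = which(folds != i), test = which(folds == i))
})
```

Each element of cv pairs a training set with a test set, mirroring the description of a k-fold product in which every fold comprises both.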
John James, jjames@dataScienceSalon.org
Other CorpusStudio Family of Classes: KFold, Sample0, Sample, Segment, Split, TokenizerNLP, TokenizerQ, Tokenizer, Token