preprocess_split: Split Data

preprocess_split R Documentation

Description

A utility to split data into a training set and a test set. Labels can optionally be split in the same way.

Usage

preprocess_split(
  input,
  input_labels = NA,
  no_shuffle = FALSE,
  seed = NA,
  stratify_data = FALSE,
  test_ratio = NA,
  verbose = FALSE
)

Arguments

input

Matrix containing data (numeric matrix).

input_labels

Matrix containing labels (integer matrix).

no_shuffle

Avoid shuffling the data before splitting. Default value "FALSE" (logical).

seed

Random seed; if 0, std::time(NULL) is used as the seed. Default value "0" (integer).

stratify_data

Stratify the data according to label. Default value "FALSE" (logical).

test_ratio

Ratio of test set; if not set, the ratio defaults to 0.2. Default value "0.2" (numeric).

verbose

Display informational messages and the full list of parameters and timers at the end of execution. Default value "FALSE" (logical).

Details

This utility takes a dataset and optionally labels and splits them into a training set and a test set. Before the split, the points in the dataset are randomly reordered. The percentage of the dataset to be used as the test set can be specified with the "test_ratio" parameter; the default is 0.2 (20%).
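To make the shuffle-then-split idea concrete, here is a minimal base-R sketch. It is not mlpack's implementation: the helper name shuffled_split, the assumption that each row of the matrix is one data point, and the use of floor() to compute the test-set size are all illustrative choices.

```r
# Illustrative only: a base-R shuffled split, not mlpack's implementation.
# Assumes each row of X is one data point (hypothetical helper).
shuffled_split <- function(X, test_ratio = 0.2, seed = 1) {
  set.seed(seed)
  n <- nrow(X)
  n_test <- floor(n * test_ratio)        # assumed rounding rule
  idx <- sample.int(n)                   # random permutation of row indices
  test_idx <- idx[seq_len(n_test)]
  train_idx <- idx[seq_len(n - n_test) + n_test]
  list(training = X[train_idx, , drop = FALSE],
       test = X[test_idx, , drop = FALSE])
}

X <- matrix(rnorm(50), nrow = 10, ncol = 5)
out <- shuffled_split(X, test_ratio = 0.2)
dim(out$training)  # 8 rows, 5 columns
dim(out$test)      # 2 rows, 5 columns
```

Every row of X ends up in exactly one of the two outputs, and the split sizes follow directly from the ratio.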

The output training and test matrices may be saved with the "training" and "test" output parameters.

Optionally, labels can also be split along with the data by specifying the "input_labels" parameter. Splitting labels works the same way as splitting the data. The output training and test labels may be saved with the "training_labels" and "test_labels" output parameters, respectively.
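"Works the same way" can be read as applying the same row permutation to the data and to the labels, so that each point keeps its label. The base-R sketch below illustrates that idea under the assumption that rows are data points; it does not show mlpack's internals, and all variable names are illustrative.

```r
# Illustrative only: apply one shared permutation to data and labels.
set.seed(42)
X <- matrix(rnorm(40), nrow = 10, ncol = 4)
y <- matrix(rep(c(0L, 1L), each = 5), ncol = 1)  # one label per row of X

n <- nrow(X)
n_test <- floor(n * 0.3)                 # assumed rounding rule
idx <- sample.int(n)                     # the single shared permutation
test_idx <- idx[seq_len(n_test)]
train_idx <- idx[seq_len(n - n_test) + n_test]

X_train <- X[train_idx, , drop = FALSE]
y_train <- y[train_idx, , drop = FALSE]
X_test  <- X[test_idx, , drop = FALSE]
y_test  <- y[test_idx, , drop = FALSE]

# Row i of X_train still corresponds to row i of y_train.
stopifnot(nrow(X_train) == nrow(y_train), nrow(X_test) == nrow(y_test))
```

Indexing both matrices with the same index vectors is what keeps data and labels aligned after the shuffle.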

Value

A list with several components:

test

Matrix to save test data to (numeric matrix).

test_labels

Matrix to save test labels to (integer matrix).

training

Matrix to save training data to (numeric matrix).

training_labels

Matrix to save training labels to (integer matrix).

Author(s)

mlpack developers

Examples

# As a simple example, to split the dataset "X" into "X_train" and "X_test"
# with 60% of the data in the training set and 40% of the data in the test
# set, we could run

## Not run: 
output <- preprocess_split(input=X, test_ratio=0.4)
X_train <- output$training
X_test <- output$test

## End(Not run)

# By default, the dataset is shuffled before it is split; the "no_shuffle"
# option disables the shuffling. For example:

## Not run: 
output <- preprocess_split(input=X, test_ratio=0.4, no_shuffle=TRUE)
X_train <- output$training
X_test <- output$test

## End(Not run)

# If we had a dataset "X" and associated labels "y", and we wanted to split
# these into "X_train", "y_train", "X_test", and "y_test", with 30% of the
# data in the test set, we could run

## Not run: 
output <- preprocess_split(input=X, input_labels=y, test_ratio=0.3)
X_train <- output$training
y_train <- output$training_labels
X_test <- output$test
y_test <- output$test_labels

## End(Not run)
# To maintain the ratio of each class in the training and test sets, the
# "stratify_data" option can be used.

## Not run: 
output <- preprocess_split(input=X, test_ratio=0.4, stratify_data=TRUE)
X_train <- output$training
X_test <- output$test

## End(Not run)
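As a rough illustration of what stratification means, the base-R sketch below splits each class separately so that the class proportions are about the same in both outputs. It is not mlpack's implementation: the helper stratified_split and the floor() rounding are assumptions. It takes labels explicitly, since class ratios can only be computed from them.

```r
# Illustrative only: per-class shuffled split, not mlpack's implementation.
# Assumes each row of X is one data point and y holds one label per row.
stratified_split <- function(X, y, test_ratio = 0.2, seed = 1) {
  set.seed(seed)
  labels <- as.vector(y)
  test_idx <- integer(0)
  for (cls in unique(labels)) {
    members <- which(labels == cls)
    n_test <- floor(length(members) * test_ratio)  # assumed rounding rule
    # Permute positions, not values, so single-member classes behave.
    picked <- members[sample.int(length(members))[seq_len(n_test)]]
    test_idx <- c(test_idx, picked)
  }
  # Training rows keep their original order in this sketch.
  train_idx <- setdiff(seq_along(labels), test_idx)
  list(training = X[train_idx, , drop = FALSE],
       training_labels = labels[train_idx],
       test = X[test_idx, , drop = FALSE],
       test_labels = labels[test_idx])
}

X <- matrix(rnorm(40), nrow = 20, ncol = 2)
y <- rep(c(0L, 1L), times = c(15, 5))    # 75% class 0, 25% class 1
out <- stratified_split(X, y, test_ratio = 0.4)
table(out$test_labels)      # 6 of class 0, 2 of class 1
table(out$training_labels)  # 9 of class 0, 3 of class 1
```

Both outputs keep the 3:1 class ratio of the full dataset, which a plain shuffled split only achieves on average.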

mlpack documentation built on Sept. 27, 2023, 1:07 a.m.