partition: Partition Data into Training and Test Model Frames

View source: R/partition.R

partition    R Documentation

Description

partition randomly splits a data frame into two model frames, train and test, which are returned as a "data_partition" structure.

Usage

partition(
  data,
  y,
  frac = 0.5,
  x = NULL,
  offset = NULL,
  weights = NULL,
  na_action = na.omit,
  seed = 42
)

Arguments

data

Data frame to be partitioned.

y

Character string giving the name of the column containing the response variable to be predicted.

frac

The fraction of data that should be included in the training set. Default is 0.5.

x

(Optional) character vector giving the names of the columns containing the predictor variables. If omitted, defaults to all columns other than those named as y, offset, or weights.

offset

(Optional) character string giving the name of the column containing a model offset. An offset is a known predictor that is added to a linear model as is (with a beta coefficient of 1) rather than having its beta coefficient optimized. If given, an offset must be included for both the train and test data frames.

weights

(Optional) character string giving the name of the column containing observation weights. Use these if you want some rows of the data frame to exert more or less influence than others on a model fit. If given, the weights column is only applied during model training; a weights column in the test data will be ignored.

na_action

Function defining how NAs should be treated. Options include na.omit (default), na.fail, na.exclude, and na.pass.

seed

Integer value for seeding the random number generator. See set.seed.
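
To make the offset argument described above more concrete, the following sketch uses base R's glm() rather than partition itself. The simulated data and the column names (counts, x, log_exposure) are assumptions made for illustration only. It contrasts a term entered as an offset, whose coefficient is fixed at 1, with the same term entered as an ordinary predictor, whose coefficient is estimated.

```r
# Simulated count data in which events scale with a known exposure
# (illustrative only; not taken from the beset package).
set.seed(1)
n <- 200
exposure <- runif(n, 1, 10)
x <- rnorm(n)
counts <- rpois(n, lambda = exposure * exp(0.3 * x))
dat <- data.frame(counts, x, log_exposure = log(exposure))

# As an offset: log_exposure enters the linear predictor with its
# coefficient fixed at 1, so no coefficient is estimated for it.
fit_offset <- glm(counts ~ x + offset(log_exposure),
                  family = poisson, data = dat)

# As an ordinary predictor: the coefficient of log_exposure is estimated.
fit_free <- glm(counts ~ x + log_exposure,
                family = poisson, data = dat)

names(coef(fit_offset))  # intercept and x only
names(coef(fit_free))    # intercept, x, and log_exposure
```

Because the offset is part of the linear predictor, the same column is needed when predicting on new data, which is consistent with the requirement that it be present in both the train and test frames.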

Details

partition creates a train/test split among the rows of a data frame based on stratified random sampling within the factor levels of a classification outcome or the quartiles of a numeric outcome. This ensures that the training and test samples will be closely matched in terms of class incidence or frequency distribution of the outcome measure. partition includes a seed argument so that the randomized partitioning is reproducible.

The train and test data frames are returned bound together in a data_partition structure so that their common ancestry is maintained and self-documented. For example, if you name your data_partition "data", you can access the training set with data$train and its corresponding test set with data$test.
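
The stratified sampling described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the implementation used by partition: stratified_split is a hypothetical helper, it returns row indices rather than model frames, and it assumes the outcome has no missing values and that its quartile break points are distinct.

```r
# Hypothetical helper (not part of beset): stratified train/test row indices.
stratified_split <- function(y, frac = 0.5, seed = 42) {
  set.seed(seed)  # note: this resets the global RNG state
  # Strata: factor levels for a classification outcome,
  # quartile bins for a numeric outcome.
  strata <- if (is.factor(y)) {
    y
  } else {
    cut(y, breaks = quantile(y, probs = seq(0, 1, 0.25)),
        include.lowest = TRUE)
  }
  # Sample a fraction `frac` of the row indices within each stratum.
  idx_by_stratum <- split(seq_along(y), strata, drop = TRUE)
  train_idx <- unlist(lapply(idx_by_stratum, function(idx) {
    idx[sample.int(length(idx), size = round(frac * length(idx)))]
  }), use.names = FALSE)
  list(train = sort(train_idx),
       test = setdiff(seq_along(y), train_idx))
}

split_a <- stratified_split(mtcars$mpg, frac = 0.5, seed = 42)
split_b <- stratified_split(mtcars$mpg, frac = 0.5, seed = 42)
identical(split_a, split_b)  # the same seed reproduces the same split
```

Sampling within each stratum, rather than across all rows at once, is what keeps the outcome distribution similar in the two sets.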

Value

An object of class "data_partition": a list of two model frames, train and test, holding the training and test sets, respectively.

See Also

set.seed, data_partition

Examples

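# Convert the categorical columns of mtcars to factors
data <- mtcars
factor_names <- c("cyl", "vs", "am", "gear", "carb")
data[factor_names] <- purrr::map_dfc(data[factor_names], factor)

# Split into training and test sets, stratified on the quartiles of mpg
data <- partition(data, y = "mpg")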


jashu/beset documentation built on April 20, 2023, 5:28 a.m.