partition: Partition Data into Training and Test Model Frames
In jashu/beset: Best Subset Predictive Modeling

Description

partition randomly splits a data frame into two model frames, train and test, which are returned as a "data_partition" structure.

Usage

partition(
  data,
  y,
  frac = 0.5,
  x = NULL,
  offset = NULL,
  weights = NULL,
  na_action = na.omit,
  seed = 42
)

Arguments

`data`	Data frame to be partitioned.
`y`	`Character` string giving the name of the column containing the response variable to be predicted.
`frac`	The fraction of data that should be included in the training set. Default is `0.5`.
`x`	(Optional) `character` vector giving the names of the columns containing the predictor variables. If omitted, defaults to all columns other than those named as `y`, `offset`, or `weights`.
`offset`	(Optional) `character` string giving the name of the column containing a model offset. An offset is a known predictor that is added to a linear model as is (with a beta coefficient of 1) rather than having its beta coefficient optimized. If given, an offset must be included for both the `train` and `test` data frames.
`weights`	(Optional) `character` string giving the name of the column containing observation weights. Use these if you want some rows of the data frame to exert more or less influence than others on a model fit. If given, the `weights` column is only applied during model training; a `weights` column in the `test` data will be ignored.
`na_action`	`Function` defining how `NA`s should be treated. Options include `na.omit` (default), `na.fail`, `na.exclude`, and `na.pass`.
`seed`	`Integer` value for seeding the random number generator. See `set.seed`.
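To make the meaning of an offset concrete, here is a minimal base R sketch. It does not use `partition` or any other beset function; the simulated variables `x`, `y`, and `z` are illustrative assumptions. It shows that a term passed through `offset()` enters the linear predictor with its coefficient fixed at 1, so no coefficient is estimated for it.

```r
# Simulated data; all names here are illustrative, not part of beset.
set.seed(1)
n <- 100
x <- rnorm(n)
z <- rnorm(n)                            # known component used as an offset
y <- 2 + 3 * x + z + rnorm(n, sd = 0.1)

# z is added to the linear predictor as is (coefficient fixed at 1).
fit_offset <- lm(y ~ x + offset(z))

# Equivalent fit for a linear model: subtract the offset from the response.
fit_shifted <- lm(I(y - z) ~ x)

# Only the intercept and the slope for x are estimated.
coef(fit_offset)
```

For an ordinary linear model the two fits give the same estimates, which is one way to check that the offset is not being optimized.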

Details

partition creates a train/test split among the rows of a data frame using stratified random sampling: rows are sampled within the factor levels of a classification outcome or within the quartiles of a numeric outcome. This ensures that the training and test samples are closely matched in class incidence or in the frequency distribution of the outcome measure. partition includes a seed argument so that the randomized partitioning is reproducible.

The train and test data frames are returned bound together in a data_partition structure so that their common ancestry is maintained and self-documented. For example, if you name your data_partition "data", you can access the training set with data$train and its corresponding test set with data$test.
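The stratified sampling described above can be sketched in base R as follows. This is not the beset implementation: `stratified_split` is a hypothetical helper written for illustration, it returns plain data frames rather than model frames, and it assumes the outcome column has no missing values and, if numeric, more than one distinct value.

```r
# Hypothetical helper, not part of beset: stratified train/test split.
stratified_split <- function(data, y, frac = 0.5, seed = 42) {
  set.seed(seed)
  outcome <- data[[y]]
  # Strata: quartile bins for a numeric outcome, factor levels otherwise.
  strata <- if (is.numeric(outcome)) {
    breaks <- unique(quantile(outcome, probs = seq(0, 1, 0.25)))
    cut(outcome, breaks = breaks, include.lowest = TRUE)
  } else {
    factor(outcome)
  }
  # Group row indices by stratum, dropping empty strata.
  rows_by_stratum <- split(seq_len(nrow(data)), strata, drop = TRUE)
  # Within each stratum, sample the requested fraction of rows for training.
  train_rows <- unlist(lapply(rows_by_stratum, function(idx) {
    idx[sample.int(length(idx), size = round(frac * length(idx)))]
  }), use.names = FALSE)
  list(
    train = data[train_rows, , drop = FALSE],
    test = data[setdiff(seq_len(nrow(data)), train_rows), , drop = FALSE]
  )
}

parts <- stratified_split(mtcars, y = "mpg", frac = 0.5)
# Compare the outcome distribution in the two halves.
summary(parts$train$mpg)
summary(parts$test$mpg)
```

Because sampling happens within each quartile bin, the two summaries should be broadly similar, and rerunning with the same `seed` reproduces the same split.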

Value

An object of class "data_partition": a list of two model frames, train and test, holding the training and test sets, respectively.

Examples

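# Convert the categorical columns of mtcars to factors
data <- mtcars
factor_names <- c("cyl", "vs", "am", "gear", "carb")
data[factor_names] <- purrr::map_dfc(data[factor_names], factor)
# Split into training and test sets, stratified on the numeric outcome mpg
data <- partition(data, y = "mpg")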

jashu/beset documentation built on April 20, 2023, 5:28 a.m.