Variable Specifications and Preprocessing

Variable specification defines the relationship between response and predictor variables as well as the data used to estimate the relationship. Four main types of specifications are supported by the fit and resample functions: traditional formula, design matrix, model frame, and preprocessing recipe.

Analysis Dataset

Different variable specifications will be illustrated with the Iowa City home prices dataset from the package. To start, a formula is defined relating sale amount to home characteristics. Two equivalent definitions are given. One in which predictors on the right hand side of the equation are explicitly included, and another in which . notation is used to indicate that all remaining variable not already in the model be included on the right hand side.

## Analysis library
library(MachineShop)

## Iowa City home prices dataset
str(ICHomes)

## Formula: explicit inclusion of predictor variables
fo <- sale_amount ~ sale_year + sale_month + built + style + construction +
  base_size + add_size + garage1_size + garage2_size + lot_size +
  bedrooms + basement + ac + attic + lon + lat

## Formula: implicit inclusion of predictor (. = remaining) variables
fo <- sale_amount ~ .

Traditional Formula

Traditional formula calls to the fitting functions consist of a formula and dataset pair. This specification additionally allows for crossing (*), interaction (:), and removal (-) of predictors in the formula; in-line functions of response variables; and some in-line functions of predictors.

model_fit <- fit(fo, ICHomes, model = TunedModel(SVMRadialModel))
tuned_model <- as.MLModel(model_fit)
model_res <- resample(fo, ICHomes, model = tuned_model)
summary(model_res)

Design Matrix

Support is provided for calls with a numeric design matrix and response object pair. The design matrix approach has lower computational overhead than the others and can thus enable a larger number of predictors to be included in an analysis.

x <- as.matrix(ICHomes[c("built", "base_size", "lot_size", "bedrooms")])
y <- ICHomes$sale_amount

model_fit <- fit(x, y, model = TunedModel(SVMRadialModel))
tuned_model <- as.MLModel(model_fit)
model_res <- resample(x, y, model = tuned_model)
summary(model_res)

Model Frame

Model frames are created with the traditional formula or design matrix syntax described above and then passed to the fitting functions. They allow for the specification of variables for stratified resampling or for weighting of cases in model fitting.

mf <- ModelFrame(fo, data = ICHomes, strata = ICHomes$sale_amount)

model_fit <- fit(mf, model = TunedModel(SVMRadialModel))
tuned_model <- as.MLModel(model_fit)
model_res <- resample(mf, model = tuned_model)
summary(model_res)

Preprocessing Recipe

Preprocessing recipes provide a flexible framework for defining predictor and response variables as well as preprocessing steps to be applied to them prior to model fitting. Using recipes helps ensure that estimation of predictive performance accounts for all modeling step. As with model frames, the recipes approach allows for specification of case strata and weights.

library(recipes)

rec <- recipe(fo, data = ICHomes) %>%
  role_case(stratum = sale_amount) %>%
  step_center(base_size, add_size, garage1_size, garage2_size, lot_size) %>%
  step_pca(base_size, add_size, garage1_size, garage2_size, lot_size, num_comp = 2) %>%
  step_dummy(all_nominal_predictors())

model_Fit <- fit(rec, model = TunedModel(SVMRadialModel))
tuned_model <- as.MLModel(model_fit)
model_res <- resample(rec, model = tuned_model)
summary(model_res)


brian-j-smith/MachineShop documentation built on Sept. 22, 2023, 10:01 p.m.