Response Variable Types

The R class types of response variables play a central role in their analysis with the package. They determine, for example, the specific models that can be fit, fitting algorithms employed, predicted values produced, and applicable performance metrics and analyses. As described in the following sections, factors, ordered factors, numeric vectors and matrices, and survival objects are supported by the package.

Factor

Categorical responses with two or more levels should be coded as a factor variable for analysis. Prediction is of factor levels by default and of level-specific probabilities if type = "prob".

## Iris flowers species (3-level factor)
model_fit <- fit(Species ~ ., data = iris, model = GBMModel)
predict(model_fit) %>% head
predict(model_fit, type = "prob") %>% head

In the case of a binary factor, the second factor level is treated as the event and the level for which predicted probabilities are computed.

## Pima Indians diabetes statuses (binary factor)
data(Pima.te, package = "MASS")
data(Pima.tr, package = "MASS")

model_fit <- fit(type ~ ., data = Pima.tr, model = GBMModel)
predict(model_fit, newdata = Pima.te) %>% head
predict(model_fit, newdata = Pima.te, type = "prob") %>% head

Ordered Factor

Categorical responses can be designated as having ordered levels by storing them as an ordered factor variable. For categorical vectors, this can be accomplished with the factor() function and its argument ordered = TRUE or more simply with the ordered() function. Numeric vectors can be converted to ordered factors with the cut() function.

## Iowa City housing prices (ordered factor)
df <- within(ICHomes, {
  sale_amount <- cut(sale_amount, breaks = 3,
                     labels = c("Low", "Medium", "High"),
                     ordered_result = TRUE)
})

model_fit <- fit(sale_amount ~ ., data = df, model = GBMModel)
predict(model_fit) %>% head
predict(model_fit, type = "prob") %>% head

Numeric Vector

Code univariate numerical responses as a numeric variable. Predicted numeric values are of the original type (integer or double) by default, and doubles if type = "numeric".

## Iowa City housing prices
model_fit <- fit(sale_amount ~ ., data = ICHomes, model = GBMModel)
predict(model_fit) %>% head
predict(model_fit, type = "numeric") %>% head

Numeric Matrix

Store multivariate numerical responses as a numeric matrix variable for model fitting with traditional formulas and model frames.

## Anscombe's multiple regression models dataset

## Numeric matrix response formula
model_fit <- fit(cbind(y1, y2, y3) ~ x1, data = anscombe, model = LMModel)
predict(model_fit) %>% head

For recipes, the multiple response may be defined on the left hand side of a recipe formula or as a single variable within a data frame.

## Numeric matrix response recipe
## Defined in a recipe formula
rec <- recipe(y1 + y2 + y3 ~ x1, data = anscombe)

## Defined within a data frame
df <- within(anscombe, {
  y <- cbind(y1, y2, y3)
  remove(y1, y2, y3)
})
rec <- recipe(y ~ x1, data = df)
model_fit <- fit(rec, model = LMModel)
predict(model_fit) %>% head

Survival Objects

Censored time-to-event survival responses should be stored as a Surv variable for model fitting with traditional formulas and model frames.

## Survival response formula
library(survival)

fit(Surv(time, status) ~ ., data = veteran, model = GBMModel)

For recipes, survival responses may be defined with the individual survival time and event variables given on the left hand side of a recipe formula and their roles designated with the r rdoc_url("role_surv()", "recipe_roles") function or as a single Surv variable within a data frame.

## Survival response recipe
## Defined in a recipe formula
rec <- recipe(time + status ~ ., data = veteran) %>%
  role_surv(time = time, event = status)

## Defined within a data frame
df <- within(veteran, {
  y <- Surv(time, status)
  remove(time, status)
})
rec <- recipe(y ~ ., data = df)
fit(rec, model = GBMModel)


brian-j-smith/MachineShop documentation built on Aug. 20, 2024, 10:59 p.m.