Model Fit and Prediction

Model Information

Model fitting requires user specification of a MachineShop compatible model. A named list of package-supplied models can be obtained interactively with the r rdoc_url("modelinfo()") function, and includes the following components for each.

label : character descriptor for the model.

packages : character vector of source packages required to use the model. These need only be installed with the install.packages() function or by equivalent means; but need not be loaded with, for example, the library() function.

response_types : character vector of response variable types supported by the model.

weights : logical value or vector of the same length as response_types indicating whether case weights are supported for the responses.

na.rm : character sting specifying removal of "all" cases with missing values from model fitting, "none", or only those whose missing values are in the "response" variable.

arguments : closure with the argument names and corresponding default values of the model function.

grid : logical indicating whether automatic generation of tuning parameter grids is implemented for the model.

varimp : logical indicating whether model-specific variable importance is defined.

Function r rdoc_url("modelinfo()") may be called without arguments, with one or more model functions, observed response variables, or vectors representing response variable types; and will return information on all matching models.

## All available models
modelinfo() %>% names

Information is displayed below for the r rdoc_url("GBMModel()") function corresponding to a generalized boosted regression model, which is applicable to survival outcomes.

## Model-specific information
modelinfo(GBMModel)

Submitting the model function at the console will result in similar information being displayed as formatted text.

GBMModel

Type-Specific Models

When data objects are supplied as arguments to r rdoc_url("modelinfo()"), information is returned on all models applicable to response variables of the same data types. If model functions are additionally supplied as arguments, information on the subset matching the data types is returned.

## All survival response-specific models
modelinfo(Surv(0)) %>% names

## Identify survival response-specific models
modelinfo(Surv(0), CoxModel, GBMModel, SVMModel) %>% names

Response Variable-Specific Models

As a special case of type-specific arguments, existing response variables to be used in analyses may be given as arguments to identify applicable models.

## Models for a responses variable
modelinfo(surv_df$y) %>% names

Fit Function

Package models, such as r rdoc_url("GBMModel"), can be specified in the model argument of the r rdoc_url("fit()") function to estimate a relationship (surv_fo) between predictors and an outcome based on a set of data (surv_train). Argument specifications may be in terms of a model function, function name, or object.

## Generalized boosted regression fit

## Model function
surv_fit <- fit(surv_fo, data = surv_train, model = GBMModel)

## Model function name
fit(surv_fo, data = surv_train, model = "GBMModel")

## Model object
fit(surv_fo, data = surv_train, model = GBMModel(n.trees = 100, interaction.depth = 1))

Model function arguments will assume their default values unless otherwise changed in a function call.

Dynamic Model Parameters

Dynamic model parameters are model function arguments defined as expressions to be evaluated at the time of model fitting. As such, their values can change based on characteristics of the analytic dataset, including the number of observations or predictor variables. Expressions to dynamic parameters are specified within the package-supplied quote operator r rdoc_url(".()", "quote") and can include the following objects:

nobs : number of observations in data.

nvars : number of predictor variables in data.

y : response variable.

In the example below, Bayesian information criterion (BIC) based stepwise variable selection is performed by creating a r rdoc_url("CoxStepAICModel", "CoxModel") with dynamic parameter k to be calculated as the log number of observations in the fitted dataset.

## Dynamic model parameter k = log number of observations

## Number of observations: nobs
fit(surv_fo, data = surv_train, model = CoxStepAICModel(k = .(log(nobs))))

## Response variable: y
fit(surv_fo, data = surv_train, model = CoxStepAICModel(k = .(log(length(y)))))

Predict Function

A r rdoc_url("predict()") function is supplied for application to model fit results to obtain predicted values on a dataset specified with its newdata argument or on the original dataset if not specified. Survival means are predicted for survival outcomes by default. Estimates of the associated survival distributions are needed to calculate the means. For models, like r rdoc_url("GBMModel"), that perform semi- or non-parametric survival analysis, Weibull approximations to the survival distributions are the default for mean estimation. Other choices of distributional approximations are exponential, Rayleigh, and empirical. Empirical distributions are applicable to Cox proportional hazards-based models and can be calculated with the method of Breslow [-@breslow:1972:DPC] or Efron [-@efron:1977:ECL, default]. Note, however, that empirical survival means are undefined mathematically if an event does not occur at the longest follow-up time. In such situations, a restricted survival mean is calculated by changing the longest follow-up time to an event, as suggested by Efron [-@efron:1967:PFB], which will be negatively biased.

## Predicted survival means (default: Weibull distribution)
predict(surv_fit, newdata = surv_test)

## Predicted survival means (empirical distribution)
predict(surv_fit, newdata = surv_test, distr = "empirical")

In addition to survival means, predicted survival probabilities (type = "prob") or 0-1 survival events (default: type = "response") can be obtained with the follow-up times argument. The cutoff probability for classification of survival events (or other binary responses) can be set optionally with the cutoff argument (default: cutoff = 0.5). As with mean estimation, distributional approximations to the survival functions may be specified for the predictions, with the default for survival probabilities being the empirical distribution.

## Predict survival probabilities and events at specified follow-up times
surv_times <- 365 * c(5, 10)

predict(surv_fit, newdata = surv_test, times = surv_times, type = "prob")

predict(surv_fit, newdata = surv_test, times = surv_times, cutoff = 0.7)

Prediction of other outcome types is more straightforward. Predicted numeric and factor responses are of the same class as the observed values at the default type = "response"; whereas, double (decimal) numeric values and factor level probabilities result when type = "prob".



brian-j-smith/MachineShop documentation built on Aug. 20, 2024, 10:59 p.m.