In brian-j-smith/MachineShop: Machine Learning Models and Tools

Package Extensions

Custom models and metrics can be defined with MLModel() and MLMetric() for use with the model fitting, prediction, and performance assessment tools provided by the package.

Custom Models

The MLModel() function creates a model object that can be used with the previously described fitting functions. It takes the following arguments.

name : character name of the object to which the model is assigned.

label : optional character descriptor for the model (default: name).

packages : character vector of packages upon which the model depends. Each name may be optionally followed by a comment in parentheses specifying a version requirement. The comment should contain a comparison operator, whitespace and a valid version number, e.g. "xgboost (>= 1.3.0)".

response_types : character vector of response variable types to which the model can be fit. Supported types are "binary", "BinomialVariate", "DiscreteVariate", "factor", "matrix", "NegBinomialVariate", "numeric", "ordered", "PoissonVariate", and "Surv".

weights : logical indicating whether the model supports case weights.

fit : model fitting function whose arguments are a formula, a ModelFrame named data, case weights, and an ellipsis. Argument data may be converted to a data frame with as.data.frame(), as is commonly needed. The function should return the object resulting from the model fit.

predict : prediction function whose arguments are the object returned by fit(), a ModelFrame named newdata of predictor variables, an optional vector of times at which to predict survival, and an ellipsis. Argument newdata may be converted to a data frame with as.data.frame() as needed. Values returned by the function should be formatted according to the response variable types below.

factor : matrix whose columns contain the probabilities for multi-level factors or vector of probabilities for the second level of binary factors.
matrix : matrix of predicted responses.
numeric : vector or column matrix of predicted responses.
Surv : matrix whose columns contain survival probabilities at times if supplied or vector of predicted survival means otherwise.

varimp : optional variable importance function whose arguments are the object returned by fit(), optional arguments passed from calls to varimp(), and an ellipsis. The function should return a vector of importance values named after the predictor variables, or a matrix or data frame whose rows are named after the predictors.
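
To make the factor return format concrete, the following base R sketch shows the kind of value a custom predict function is expected to return for a binary factor: a vector of probabilities for the second level. It is an illustration only and uses no package functions; the mtcars data and the variable names are assumptions chosen for the example.

```r
## Illustration only: base R, no MachineShop functions involved.
## Build a binary factor response; "manual" is the second level.
cars <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))
levels(cars$am)

## glm() treats the first factor level as failure and the second as success
glm_fit <- glm(am ~ wt, data = cars, family = binomial)

## Vector of predicted probabilities of the second level ("manual")
prob <- unname(predict(glm_fit, newdata = cars, type = "response"))
head(prob)

## Average predicted probability within each observed class
class_means <- tapply(prob, cars$am, mean)
class_means
```

A multi-level factor would instead call for a matrix with one column of probabilities per level, as stated in the list above.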

## Logistic regression model extension
library(MachineShop)

LogisticModel <- MLModel(
  name = "LogisticModel",
  label = "Logistic Model",
  response_types = "binary",
  weights = TRUE,
  fit = function(formula, data, weights, ...) {
    glm(formula, data = as.data.frame(data), weights = weights,
        family = binomial, ...)
  },
  predict = function(object, newdata, ...) {
    ## Probabilities for the second level of the binary response
    predict(object, newdata = as.data.frame(newdata), type = "response")
  },
  varimp = function(object, ...) {
    ## Chi-squared CDF (1 df) of the squared Wald statistic per coefficient
    pchisq(coef(object)^2 / diag(vcov(object)), 1)
  }
)
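
The varimp expression in the example above can be examined on its own. The base R sketch below applies the same expression to a glm fit on the built-in mtcars data (an assumption made for illustration) and shows that it equals one minus the two-sided Wald p-value of each coefficient, so larger values indicate stronger evidence of an effect.

```r
## Illustration only: base R on the built-in mtcars data.
glm_fit <- glm(am ~ hp + wt, data = mtcars, family = binomial)

## Same expression as in the varimp function of the example:
## chi-squared CDF (1 df) evaluated at the squared Wald z-statistics
imp <- pchisq(coef(glm_fit)^2 / diag(vcov(glm_fit)), 1)

## Equivalent to one minus the two-sided Wald p-values reported by summary()
p_wald <- summary(glm_fit)$coefficients[, "Pr(>|z|)"]
all.equal(imp, 1 - p_wald)

## Named vector; note that it includes an "(Intercept)" entry
imp
```

The result is a named vector with one entry per model coefficient, including the intercept.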

Custom Metrics

The MLMetric() function creates a metric object that can be used as previously described for the model performance metrics. Its first argument is a function that computes the metric; the function should accept observed and predicted as its first two arguments, followed by an ellipsis to accommodate others. Its remaining arguments are as follows.

name : character name of the object to which the metric is assigned.

label : optional character descriptor for the metric (default: name).

maximize : logical indicating whether higher values of the metric correspond to better predictive performance.

## F2 score metric extension
f2_score <- MLMetric(
  function(observed, predicted, ...) {
    f_score(observed, predicted, beta = 2, ...)
  },
  name = "f2_score",
  label = "F2 Score",
  maximize = TRUE
)
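
For readers unfamiliar with the role of beta, the sketch below computes the standard F-beta score from confusion-matrix counts in base R. The helper f_beta() is hypothetical and written only for illustration; it is not the package's f_score(), which is assumed here to follow the same standard definition. With beta = 2, recall is weighted more heavily than precision.

```r
## Hypothetical helper for illustration; not the package's f_score().
f_beta <- function(tp, fp, fn, beta = 1) {
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  (1 + beta^2) * precision * recall / (beta^2 * precision + recall)
}

## Example counts: 6 true positives, 2 false positives, 4 false negatives
## (precision = 0.75, recall = 0.6)
f1 <- f_beta(tp = 6, fp = 2, fn = 4, beta = 1)
f2 <- f_beta(tp = 6, fp = 2, fn = 4, beta = 2)
c(F1 = f1, F2 = f2)
```

Because recall is lower than precision in this example, the F2 score falls below the F1 score.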

Usage

Once created, model and metric extensions can be used with the package-supplied fitting and performance functions.

## Logistic regression analysis
data(Pima.tr, package = "MASS")
res <- resample(type ~ ., data = Pima.tr, model = LogisticModel)
summary(performance(res, metric = f2_score))
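
Beyond resampling, the same extensions would be expected to work with a single model fit. The sketch below rests on several assumptions about the package interface that are not stated in the text above: that MachineShop and MASS are installed, that fit() and predict() accept the arguments shown (with type = "prob" returning predicted probabilities), and that an object created by MLMetric() can be called directly as a function. The model and metric definitions are repeated so that the block runs on its own.

```r
library(MachineShop)
data(Pima.tr, Pima.te, package = "MASS")

## Model and metric extensions, repeated from the sections above
LogisticModel <- MLModel(
  name = "LogisticModel",
  label = "Logistic Model",
  response_types = "binary",
  weights = TRUE,
  fit = function(formula, data, weights, ...) {
    glm(formula, data = as.data.frame(data), weights = weights,
        family = binomial, ...)
  },
  predict = function(object, newdata, ...) {
    predict(object, newdata = as.data.frame(newdata), type = "response")
  }
)
f2_score <- MLMetric(
  function(observed, predicted, ...) {
    f_score(observed, predicted, beta = 2, ...)
  },
  name = "f2_score",
  label = "F2 Score",
  maximize = TRUE
)

## Fit on the training set and predict probabilities on the test set
## (assumes predict() supports type = "prob")
model_fit <- fit(type ~ ., data = Pima.tr, model = LogisticModel)
prob <- predict(model_fit, newdata = Pima.te, type = "prob")

## Apply the custom metric directly to observed and predicted values
## (assumes MLMetric objects are callable as functions)
f2_score(Pima.te$type, prob)
```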