autotune | R Documentation |
Automated tuning process for the penalty parameter lambda, with built-in feature selection. Lambda directly influences the granularity of the segmentation, with low/high values resulting in a fine/coarse segmentation.
autotune(
mfit,
data,
vars,
target,
max_ngrps = 15,
hcut = 0.75,
ignr_intr = NULL,
pred_fun = NULL,
lambdas = as.vector(outer(seq(1, 10, 0.1), 10^(-7:3))),
nfolds = 5,
strat_vars = NULL,
glm_par = alist(),
err_fun = mse,
ncores = -1,
out_pds = FALSE
)
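To get a feel for the size and span of the default `lambdas` grid shown in the usage above, the expression can be evaluated on its own in base R. This is only an illustrative sketch; the variable names `mantissa`, `powers`, `default_grid` and `coarse_grid` are introduced here and are not part of the package.

```r
# Default grid from the usage section: every mantissa 1.0, 1.1, ..., 10.0
# multiplied by every power of ten from 1e-7 up to 1e3.
mantissa <- seq(1, 10, 0.1)    # 91 values
powers   <- 10^(-7:3)          # 11 values
default_grid <- as.vector(outer(mantissa, powers))

length(default_grid)           # 91 * 11 = 1001 candidate lambdas
range(default_grid)            # smallest and largest candidate

# A coarser, narrower grid (the one used in the example at the end of this page)
# trades resolution for fewer cross-validation fits.
coarse_grid <- as.vector(outer(seq(1, 10, 1), 10^(-6:-2)))
length(coarse_grid)            # 10 * 5 = 50 candidate lambdas
```

Because adjacent decades share an endpoint (for instance 10 * 1e-7 and 1 * 1e-6), the grid contains near-duplicate values at the decade boundaries.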
mfit |
Fitted model object (e.g., a "gbm" or "randomForest" object). |
data |
Data frame containing the original training data. |
vars |
Character vector specifying the features in |
target |
String specifying the target (or response) variable to model. |
max_ngrps |
Integer specifying the maximum number of groups that each feature's values/levels are allowed to be grouped into. |
hcut |
Numeric in the range [0,1] specifying the cut-off value for the normalized cumulative H-statistic over all two-way interactions, ordered from most to least important, between the features in |
ignr_intr |
Optional character string specifying features to ignore when searching for meaningful interactions to incorporate in the GLM. |
pred_fun |
Optional prediction function to calculate feature effects for the model in |
lambdas |
Numeric vector with the possible lambda values to explore. The search grid is created automatically via |
nfolds |
Integer for the number of folds in K-fold cross-validation. |
strat_vars |
Character (vector) specifying the feature(s) to use for stratified sampling. The default NULL implies no stratification is applied. |
glm_par |
Named list, constructed via |
err_fun |
Error function to calculate the prediction errors on the
validation folds. This must be an R function which outputs a single number
and takes two vectors
See |
ncores |
Integer specifying the number of cores to use. The default
|
out_pds |
Boolean to indicate whether to add the calculated PD effects for the selected features to the output list. |
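As described for `err_fun` above, a custom error function must accept two vectors and return a single number. A minimal sketch of such a function, here a mean absolute error, could look as follows. The function name `mae` and the argument names `y_pred` and `y_true` are assumptions made for illustration; check the signature of the built-in error functions (such as `mse` or `poi_dev`) before relying on these names or on their order.

```r
# Hypothetical custom error function: mean absolute error.
# The argument names are assumptions, not taken from the package documentation.
mae <- function(y_pred, y_true) {
  # Both inputs must describe the same observations.
  stopifnot(length(y_pred) == length(y_true))
  # Collapse the element-wise absolute errors into a single number.
  mean(abs(y_pred - y_true))
}

mae(c(1, 2, 3), c(1, 3, 5))  # (0 + 1 + 2) / 3 = 1
```

If the signature matches what the package expects, such a function would be supplied as `err_fun = mae`.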
List with the following elements:
named vector containing the selected features (names) and the optimal number of groups for each feature (values).
the optimal GLM surrogate, which is fit to all observations in data. The segmented data can be obtained via the $data attribute of the GLM fit.
the cross-validation results for the main effects as a tidy data frame. The column cv_err contains the cross-validated error, while the columns 1:nfolds contain the error on the validation folds.
cross-validation results for the interaction effects.
List with the PD effects for the selected features (only present if out_pds = TRUE).
## Not run:
data('mtpl_be')
features <- setdiff(names(mtpl_be), c('id', 'nclaims', 'expo', 'long', 'lat'))
set.seed(12345)
gbm_fit <- gbm::gbm(as.formula(paste('nclaims ~',
                                     paste(features, collapse = ' + '))),
                    distribution = 'poisson',
                    data = mtpl_be,
                    n.trees = 50,
                    interaction.depth = 3,
                    shrinkage = 0.1)
gbm_fun <- function(object, newdata) {
  mean(predict(object, newdata, n.trees = object$n.trees, type = 'response'))
}
gbm_fit %>% autotune(data = mtpl_be,
                     vars = c('ageph', 'bm', 'coverage', 'fuel', 'sex', 'fleet', 'use'),
                     target = 'nclaims',
                     hcut = 0.75,
                     pred_fun = gbm_fun,
                     lambdas = as.vector(outer(seq(1, 10, 1), 10^(-6:-2))),
                     nfolds = 5,
                     strat_vars = c('nclaims', 'expo'),
                     glm_par = alist(family = poisson(link = 'log'),
                                     offset = log(expo)),
                     err_fun = poi_dev,
                     ncores = -1)
## End(Not run)
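The `gbm_fun` helper in the example above takes a fitted model and a data frame and returns one number, the average prediction. The same shape can be tried out without the `gbm` package or the `mtpl_be` data. The sketch below uses a base R Poisson GLM on the built-in `mtcars` data purely as a stand-in model; `toy_fit` and `glm_fun` are hypothetical names, and nothing here implies that `autotune` is intended for GLM inputs.

```r
# Stand-in model: a small Poisson GLM on the built-in mtcars data.
toy_fit <- glm(carb ~ wt + hp, family = poisson(link = 'log'), data = mtcars)

# Same shape as gbm_fun: (object, newdata) -> single average prediction
# on the response scale.
glm_fun <- function(object, newdata) {
  mean(predict(object, newdata, type = 'response'))
}

avg_pred <- glm_fun(toy_fit, mtcars)
avg_pred  # one number: the mean predicted count over the rows of newdata
```

Writing the prediction function this way keeps model-specific details (such as `n.trees` for a GBM) in one place, so the rest of the workflow only needs a function of `object` and `newdata`.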