beset_glm | R Documentation |
beset_glm performs best subset selection using repeated
cross-validation to find the optimal number of predictors for several
families of generalized linear models.
beset_glm(
  form,
  data,
  family = "gaussian",
  link = NULL,
  p_max = 10,
  force_in = NULL,
  nest_cv = FALSE,
  n_folds = 10,
  n_reps = 10,
  seed = 42,
  contrasts = NULL,
  offset = NULL,
  weights = NULL,
  start = NULL,
  etastart = NULL,
  mustart = NULL,
  epsilon = 1e-08,
  maxit = 25,
  skinny = FALSE,
  n_cores = NULL,
  parallel_type = NULL,
  cl = NULL
)
beset_lm(
  form,
  data,
  p_max = 10,
  force_in = NULL,
  weights = NULL,
  contrasts = NULL,
  offset = NULL,
  nest_cv = FALSE,
  n_folds = 10,
  n_reps = 10,
  seed = 42,
  n_cores = NULL,
  parallel_type = NULL,
  cl = NULL
)
form |
A model |
data |
Either a |
family |
Character string naming the error distribution to be used in the model. Available families are listed under 'List of available families and link functions'. |
link |
(Optional) character string naming the link function to be used in
the model. Available links and their defaults differ by |
p_max |
Maximum number of predictors to attempt to fit. Default is 10. |
force_in |
(Optional) character vector containing the names of any predictor variables that should be included in every model. (Note that if there is an intercept, it is forced into every model by default.) |
nest_cv |
|
n_folds |
|
n_reps |
|
seed |
|
contrasts |
(Optional) |
offset |
(Optional) |
weights |
(Optional) |
start |
(Optional) starting values for the parameters in the linear predictor. |
etastart |
(Optional) starting values for the linear predictor. |
mustart |
(Optional) starting values for the vector of means. |
epsilon |
|
maxit |
|
skinny |
|
n_cores |
Integer value indicating the number of workers to run in
parallel during subset search and cross-validation. By default, this will
be set to one fewer than the maximum number of physical cores you have
available, as indicated by |
parallel_type |
(Optional) character string indicating the type of
parallel operation to be used, either |
cl |
(Optional) |
beset_glm performs best subset selection for generalized linear
models, fitting a separate model for each possible combination of predictors
(all models that contain exactly 1 predictor, all models that contain
exactly 2 predictors, and so forth). For each number of predictors,
beset_glm first picks the model with the best fit and then
estimates how well this model predicts new data using k-fold
cross-validation (how well, on average, a model trained using k - 1
folds predicts the left-out fold).
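As a rough illustration of the procedure described above, the following base-R sketch fits every subset of each size, keeps the best-fitting model per size, and cross-validates it. It is not beset_glm's own implementation: it assumes a Gaussian model fit with lm, uses AIC as the fit criterion, draws a single unstratified 5-fold split, and the names folds, best_by_size, and cv_mse are made up for the example.

```r
# Sketch only: base-R best subset selection with k-fold cross-validation.
# Not beset_glm's internals; k, folds, best_by_size, cv_mse are hypothetical names.
set.seed(42)
preds <- setdiff(names(swiss), "Fertility")
k <- 5
folds <- sample(rep(seq_len(k), length.out = nrow(swiss)))

best_by_size <- lapply(seq_along(preds), function(p) {
  # Fit every model that contains exactly p predictors
  subsets <- combn(preds, p, simplify = FALSE)
  fits <- lapply(subsets, function(s) lm(reformulate(s, "Fertility"), data = swiss))
  # Keep the best-fitting model of this size (lowest AIC)
  best <- subsets[[which.min(vapply(fits, AIC, numeric(1)))]]
  form <- reformulate(best, "Fertility")
  # Estimate its prediction error: train on k - 1 folds, predict the left-out fold
  cv_mse <- mean(vapply(seq_len(k), function(i) {
    fit <- lm(form, data = swiss[folds != i, ])
    held_out <- swiss[folds == i, ]
    mean((held_out$Fertility - predict(fit, held_out))^2)
  }, numeric(1)))
  data.frame(n_pred = p, predictors = paste(best, collapse = " + "), cv_mse = cv_mse)
})
result <- do.call(rbind, best_by_size)
print(result)
```

The model size with the lowest cv_mse would be the candidate for the "optimal number of predictors" in this simplified setting.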
A "beset_glm" object with the following components:
a list with three data frames:
statistics for every possible combination of predictors:
the total number of predictors in model; note that the number of predictors for a factor variable corresponds to the number of factor levels minus 1
formula for model
-2*log-likelihood + k*npar, where npar represents the number of parameters in the fitted model, and k = 2
twice the difference between the log-likelihoods of the saturated and fitted models, multiplied by the scale parameter
mean absolute error
mean cross entropy, estimated as -log-likelihood/N, where N is the number of observations
mean squared error
R-squared, calculated as 1 - deviance/null deviance
a data frame containing cross-validation statistics
for the best model for each n_pred listed in fit_stats.
Each metric is computed using predict_metrics, with
models fit to k - 1 folds and predictions made on the left-out fold.
Each metric is followed by its standard error. The data frame
is otherwise the same as that documented for fit, except
AIC is omitted.
if test_data is provided, a data frame
containing prediction metrics for the best model for each n_pred
listed in fit as applied to the test_data.
list giving the row indices for the holdout observations for each fold and/or repetition of cross-validation
number of folds used in cross-validation
number of repetitions used in cross-validation
name of error distribution used in the model
name of link function used in the model
the terms object used
the data argument
the offset vector used
(where relevant) the contrasts used
(where relevant) a record of the levels of the factors used in fitting
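The fit statistics described above can be reproduced by hand for a single model. The sketch below uses only base R and a Gaussian glm on the swiss data; it assumes that npar is counted the way stats::logLik counts it (coefficients plus the dispersion parameter for the Gaussian family), which this page does not confirm for beset_glm.

```r
# Sketch: the listed fit statistics computed with base R for one model.
# Assumption: npar follows the df attribute of logLik().
fit <- glm(Fertility ~ Education + Catholic, data = swiss)
n <- nobs(fit)
ll <- as.numeric(logLik(fit))
npar <- attr(logLik(fit), "df")

aic <- -2 * ll + 2 * npar                          # -2*log-likelihood + k*npar, k = 2
dev <- deviance(fit)                               # deviance of the fitted model
mae <- mean(abs(residuals(fit, type = "response")))  # mean absolute error
mce <- -ll / n                                     # mean cross entropy
mse <- mean(residuals(fit, type = "response")^2)   # mean squared error
r2  <- 1 - dev / fit$null.deviance                 # R-squared

round(c(aic = aic, dev = dev, mae = mae, mce = mce, mse = mse, r2 = r2), 3)
```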
beset_glm randomly partitions the data set into n_folds * n_reps
folds within strata (factor levels for factor outcomes,
percentile-based groups for numeric outcomes). This ensures that the folds
will be matched in terms of the outcome's frequency distribution.
beset_glm also ensures the reproducibility of your analysis by
requiring a seed to the random number generator as one of its
arguments.
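One way to picture stratified partitioning for a numeric outcome is sketched below in base R. The choice of quartiles as the percentile-based groups, and the names strata and fold_id, are assumptions made for the illustration; the grouping beset_glm actually uses is not specified on this page.

```r
# Sketch: stratified fold assignment for a numeric outcome (one repetition).
# Quartile strata and the names strata/fold_id are illustrative assumptions.
set.seed(42)  # fixing the seed makes the partition reproducible
n_folds <- 5
y <- swiss$Fertility

# Group the outcome into percentile-based strata (quartiles here)
strata <- cut(y, breaks = quantile(y, probs = seq(0, 1, 0.25)),
              include.lowest = TRUE)

fold_id <- integer(length(y))
for (s in levels(strata)) {
  idx <- which(strata == s)
  # Shuffle the fold labels within each stratum
  fold_id[idx] <- sample(rep(seq_len(n_folds), length.out = length(idx)))
}

# Every fold draws observations from every stratum
table(strata, fold_id)
```

Repeating the loop with fresh random draws would give the additional partitions needed for repeated cross-validation.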
"gaussian"
The Gaussian family accepts the links "identity" (default), "log", and "inverse".
"binomial"
The binomial family accepts the links "logit" (default), "probit", "cauchit", "log", and "cloglog" (complementary log-log).
"poisson"
The Poisson family accepts the links "log" (default), "sqrt", and "identity".
"negbin"
The negative binomial family accepts the links "log" (default), "sqrt", and "identity".
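For orientation, the sketch below fits base-R glm models with a few of the family and link combinations listed above. It assumes that the family and link strings correspond to the usual stats family objects, which this page does not state explicitly; the data sets are standard R examples chosen only for illustration.

```r
# Sketch: base-R analogues of some family/link combinations listed above.
# How beset_glm maps its strings to family objects is an assumption here.
fit_gaus <- glm(Fertility ~ Education, data = swiss,
                family = gaussian(link = "identity"))
fit_bin  <- glm(vs ~ mpg, data = mtcars,
                family = binomial(link = "probit"))
fit_pois <- glm(count ~ spray, data = InsectSprays,
                family = poisson(link = "sqrt"))

# Confirm which link each fit actually used
vapply(list(fit_gaus, fit_bin, fit_pois),
       function(f) f$family$link, character(1))
```

The negative binomial case has no family object in stats; the glm.nb function referenced at the end of this page covers it.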
beset_glm handles missing data by performing listwise deletion.
No other options for handling missing data are provided at this time. The
user is encouraged to deal with missing values prior to running this
function.
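A minimal base-R sketch of handling missing values up front, so that the number of dropped rows is explicit before any model search. The airquality data set and the names vars, complete, and n_dropped are used only for illustration.

```r
# Sketch: apply listwise deletion yourself so the loss of rows is visible.
vars <- c("Ozone", "Solar.R", "Wind", "Temp")
complete <- airquality[complete.cases(airquality[, vars]), vars]
n_dropped <- nrow(airquality) - nrow(complete)
n_dropped

# For comparison, glm() under R's default na.action (na.omit) drops the same rows
fit <- glm(Ozone ~ Solar.R + Wind + Temp, data = airquality)
nobs(fit) == nrow(complete)
```

If n_dropped is large, imputation or a different choice of predictors may be worth considering before running the search.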
beset_glm is intended for use with additive models only. An
exhaustive search over the space of possible interactions and/or non-linear
effects is computationally prohibitive, but I hope to offer a greedy search
option in the future. In the meantime and in general, I would recommend the
MARS technique if you require this feature.
beset_glm is best suited for searching over a small number of
predictors (fewer than 10). For a large number of predictors (more than 20),
beset_elnet is recommended instead. However, note that
beset_elnet only works with a more restricted set of
distributions.
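The reason for this guidance is combinatorial: the number of candidate models grows very quickly with the number of predictors. The helper below, n_models, is a hypothetical function written for this illustration; it counts the subsets containing between 1 and p_max of p predictors and ignores the intercept-only model and any force_in terms.

```r
# Sketch: count candidate models in an exhaustive subset search.
# n_models is a hypothetical helper, not part of the package.
n_models <- function(p, p_max = p) {
  sum(choose(p, seq_len(min(p, p_max))))
}

n_models(5)       # 31
n_models(10)      # 1023
n_models(20, 10)  # 616665
```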
glm, set.seed, glm.nb
subset1 <- beset_glm(Fertility ~ ., data = swiss)
summary(subset1)
# Force variables to be included in model
subset2 <- beset_glm(Fertility ~ ., data = swiss,
                     force_in = c("Agriculture", "Examination"))
summary(subset2)
# Use nested cross-validation to evaluate error in selection
subset3 <- beset_glm(Fertility ~ ., data = swiss, nest_cv = TRUE)
summary(subset3)