```r
library(beset)
suppressPackageStartupMessages(library(ggplot2))
```

The `beset` package includes an S3 object system that enables you to easily obtain cross-validated prediction metrics from common model objects. Currently supported classes are "lm", "negbin", and "glm"/"glmnet" (for gaussian, binomial, and poisson families only). To access this feature, first fit a model as you normally would.

For example, consider the Swiss Fertility and Socioeconomic Indicators (1888) data. Simply fit a linear model as you normally would:

```r
lin_mod <- lm(Fertility ~ ., data = swiss)
summary(lin_mod)
```

To obtain cross-validation metrics using default settings, simply pass the model object to the S3 generic function `validate`. If you do not specify the number of folds or repetitions, a 10 × 10 cross-validation will be performed by default (10-fold cross-validation repeated 10 times).

```r
cv_results <- validate(lin_mod)
cv_results
```

# Prediction Metrics

The print method for objects returned by `validate` reports four summary statistics of prediction error on the hold-out data and, provided more than one repetition was performed, the range of each statistic observed across the different repetitions.

  1. Mean absolute error. This corresponds to the mean of the absolute value of the prediction residuals. On average, our model predictions tend to miss the true fertility percentage by 6 points.

  2. Mean cross entropy. This is an information-theory metric that corresponds to the mean negative log-likelihood of the observations. Smaller is better, but the magnitude of this value is not particularly informative by itself. It is mainly useful for evaluating models that optimize a loss function other than the residual sum of squares, especially binary classification models that minimize logistic loss. (Cross entropy in this case is identical to "log loss".) To obtain a benchmark, you can always compare this metric to the expected negative log-likelihood of a null (intercept-only) model:

```r
null_model <- lm(Fertility ~ 1, data = swiss)
mce <- as.numeric(-logLik(null_model) / nobs(null_model))
```

In this case, we can see that our model reduces the mean cross entropy from `r signif(mce, 3)` to `r signif(cv_results$stats$mce$mean, 3)`.

  3. Mean squared error. The traditional performance metric for regression models of a continuous response.

  4. Variance explained. It is well known that, unless the sample size $n$ greatly exceeds the number of predictors $p$, a model's $R^2$ will be inflated. The adjusted $R^2$ that appears in the summary output of an `lm` is an attempt to correct for this bias by penalizing the $R^2$ in proportion to larger $p$ and smaller $n$. The cross-validated $R^2$ can also be thought of as an adjusted $R^2$, except it is empirically, rather than formulaically, derived. It is the fraction of variance that the model is expected to predict in a new sample. Note for these data that the cross-validated $R^2$ is somewhat more pessimistic than the adjusted $R^2$ statistic.
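To make the definitions above concrete, here is a minimal sketch (not `beset`'s internal code) that computes the same four statistics by hand. For simplicity it uses the in-sample predictions of `lin_mod`, whereas `validate` computes them on hold-out predictions, and the Gaussian form of the cross entropy is an assumption made purely for illustration.

```r
# Hand-computed versions of the four statistics (illustration only).
obs   <- swiss$Fertility
pred  <- predict(lin_mod)
resid <- obs - pred

mae <- mean(abs(resid))                      # mean absolute error
mse <- mean(resid^2)                         # mean squared error

# Mean cross entropy: mean negative log-likelihood of the observations,
# here under a Gaussian likelihood with sd estimated from the residuals.
sigma_hat <- sqrt(mean(resid^2))
mce <- -mean(dnorm(obs, mean = pred, sd = sigma_hat, log = TRUE))

# Variance explained (R^2): 1 - residual SS / total SS
r2 <- 1 - sum(resid^2) / sum((obs - mean(obs))^2)

c(mae = mae, mce = mce, mse = mse, r2 = r2)
```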

```{block2, type = 'rmd-details'}
**How `S.E.`, `Min`, and `Max` are calculated**. These columns report different ways of quantifying the variability of the cross-validation estimates. For example, 10-fold cross-validation results in 10 different train-test splits: 10 models and 10 different estimates of each prediction metric. The `S.E.` column reports the standard error of these different estimates, calculated as the standard deviation of the within-fold estimates divided by the square root of the number of folds. Provided that repetitions are performed (`n_reps > 1`), an additional metric to evaluate is how sensitive the cross-validation estimates are to the randomized fold assignments. The `Min` and `Max` columns report the range of estimates that were obtained between repetitions (after aggregating all folds within each repetition).
```
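As a rough illustration of the arithmetic described in the note above (using made-up numbers, not output from `validate`):

```r
# Hypothetical per-fold MSE estimates from a single repetition of 10-fold CV
fold_mse <- c(45, 52, 61, 48, 55, 50, 58, 47, 62, 53)

# S.E.: standard deviation of the per-fold estimates / sqrt(number of folds)
se_mse <- sd(fold_mse) / sqrt(length(fold_mse))

# With n_reps > 1, each repetition yields one aggregate estimate;
# Min and Max report the range of these per-repetition estimates.
rep_mse <- c(52.1, 53.4, 51.8, 52.9, 53.0)   # hypothetical per-repetition means
range(rep_mse)
```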

If you `validate` a binomial GLM, "Mean Absolute Error" will be replaced by "Area Under Curve", i.e., the area under the receiver operating characteristic (ROC) curve, and "Variance Explained" will be replaced by "Deviance Explained". An example of logistic regression will be shown here using baseline exam results on prostate cancer patients from Dr. Donn Young at The Ohio State University Comprehensive Cancer Center. See `?prostate` for details. 

```r
log_mod <- glm(tumor ~ ., data = prostate, family = "binomial")
summary(log_mod)
validate(log_mod)
```

```{block2, type = 'rmd-details'}
**Use of deviance-based $R_D^2$**. Ordinary $R^2$ can show undesirable properties when applied to GLMs with non-normal error distributions, such as the training $R^2$ not uniformly increasing as more predictors are added. For this reason, `beset` calculates the deviance-based $R_D^2$ instead, which can be used reliably with all of the model families it supports to indicate the fraction of uncertainty in the outcome that the model explains.
```
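For the logistic model above, the training-sample version of this statistic can be computed directly from the model's deviance components. The sketch below illustrates the definition, $R_D^2 = 1 - D_{\text{residual}} / D_{\text{null}}$, rather than the cross-validated value that `validate` reports:

```r
# Training-sample deviance explained for log_mod:
# R^2_D = 1 - residual deviance / null deviance
1 - log_mod$deviance / log_mod$null.deviance
```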

# Optional Arguments

## `n_folds` and `n_reps`

Use the `n_folds` arg to specify the number of cross-validation folds, and the `n_reps` arg to specify the number of times that cross-validation should be repeated (with different random draws to assign cases to folds). 

For example, the following will result in 5 repetitions of 5-fold cross-validation:

```r
validate(log_mod, n_folds = 5, n_reps = 5)
```

```{block2, type = 'rmd-details'}
**How fold assignments are made**. The `validate` methods use stratified random sampling to make fold assignments. For factor responses, random fold assignment is made within each factor level, ensuring that the ratio of class examples remains the same between any given train-test split. For numeric responses, the random assignment is made within quartiles of the response, ensuring that the distribution of response values is similar for all train-test splits. Special consideration is given if `validate` detects a count distribution with a floor of 0; in this case, the 0 values are treated as a separate stratum to ensure that the magnitude of zero-inflation is matched for all folds.
```
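The following is a rough sketch of the general idea for a numeric response (not `beset`'s internal implementation): cases are binned by quartile and fold numbers are shuffled within each bin, so every fold receives a similar spread of response values.

```r
# Rough illustration of stratified fold assignment for a numeric response
set.seed(42)
y <- swiss$Fertility
n_folds <- 10
strata <- cut(y, breaks = quantile(y, probs = seq(0, 1, 0.25)),
              include.lowest = TRUE)          # quartile bins
fold <- integer(length(y))
for (s in levels(strata)) {
  idx <- which(strata == s)
  # cycle the fold numbers through each stratum, then shuffle within it
  fold[idx] <- sample(rep_len(seq_len(n_folds), length(idx)))
}
table(strata, fold)   # each fold draws from all quartiles of the response
```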

To perform leave-one-out cross-validation (LOOCV), set `n_folds` to a value equal to the number of observations in your data set. (Technically, any value greater than half the number of observations will also result in LOOCV because `validate` will autocorrect any such value to equal the number of observations.) Note that under LOOCV, repetitions are pointless and will not be performed, so ranges will not be reported in the output. Note also that for the AUC statistic and variance/deviance explained, per-fold error estimates cannot be calculated under LOOCV because these statistics are undefined for individual predictions; they can only be calculated on the aggregate predictions.

```r
validate(log_mod, n_folds = nobs(log_mod), n_reps = 1)
```

```{block2, type = 'rmd-details'}
**Advice on choosing `n_folds` and `n_reps`**. To understand the impact of the choice of how many folds to use in cross-validation, it is helpful to consider two extremes.

The smallest number of folds that one can choose is 2, which is equivalent to training your model on half your data and testing it on the other half (and then doing the reverse). Unless your sample is so large that all meaningful variation in the population is captured after splitting it in half, you will incur greater prediction error because of the reduced sample size; on the other hand, the magnitude of the error is likely to be highly consistent between the two halves, and you can be reasonably confident that the model trained to the entire sample will perform at least as well on new data as the model trained on half (and it will likely perform better).

The largest number of folds that one can use is one for every observation in the data set, which is also known as leave-one-out cross-validation (LOOCV). Here, you are training models on data sets that are almost the same size as the full data set (n-1). So the cross-validation estimator is the least biased for the true prediction error, but it is also the most variable; that is, if you were to repeat the LOOCV procedure on multiple independent samples, you would find that the average LOOCV error would approximate the true error, but the individual LOOCV errors would vary greatly. So you can no longer be reasonably confident that the model will perform as well on new data as the LOOCV estimate would suggest; it is equally likely to perform much better or much worse.

So the smaller the number of folds, the more pessimistic your estimates of prediction error will be. (Note above how the predicted deviance explained is a bit lower when 5 folds are used as opposed to 10.) But the more confident you can be that your model will meet or exceed this estimate. Overall, 10-fold cross-validation is recognized as an optimal compromise for this bias-variance tradeoff (Kohavi, 1995), which is why it is used by default. However, LOOCV may be preferable when the data distributions are characterized by wide dispersion or extreme values (Japkowicz & Shah, 2011).

As for the number of repetitions, it is a good idea to evaluate how sensitive your cross-validation estimates are to different random splits of the data. You want to perform enough repetitions that your mean estimate is stable, i.e., you get roughly the same result no matter which random seed you use. Ten repetitions have been recommended for 10-fold cross-validation (Bouckaert, 2003), and this is used as the default value.
```
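One quick way to check the stability of your estimates across repetitions, assuming `validate` draws its fold assignments from R's global random-number stream (so that `set.seed` controls them), is to run it under a couple of different seeds and compare the summaries:

```r
# If these two summaries are close, n_reps is probably large enough
set.seed(1); validate(lin_mod, n_reps = 10)
set.seed(2); validate(lin_mod, n_reps = 10)
```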

## Obtaining hold-out predictions and fold assignments

The hold-out predictions of the cross-validation procedure and the corresponding fold-assignments can be obtained by assigning a variable name to the object returned by `validate`:

```r
cv_results <- validate(lin_mod)
```

The predictions are stored as a data frame named "predictions". The rows correspond to the equivalent rows of the data frame used to fit the model, and there is a column for each repetition of the cross-validation procedure. Going across columns within a row, you can see the variability of the prediction for that individual as a function of fitting the model to different subsamples of the data (none of which included that individual):

```r
cv_results$predictions
```
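For example, one way to visualize this spread (a sketch, assuming the column layout described above and using the ggplot2 package loaded at the top) is to plot each observation's observed value against the mean and range of its hold-out predictions across repetitions:

```r
# Observed fertility vs. the spread of hold-out predictions across repetitions
pred_df <- cv_results$predictions
plot_df <- data.frame(
  observed = swiss$Fertility,
  cv_mean  = rowMeans(pred_df),
  cv_min   = apply(pred_df, 1, min),
  cv_max   = apply(pred_df, 1, max)
)
ggplot(plot_df, aes(x = observed, y = cv_mean)) +
  geom_pointrange(aes(ymin = cv_min, ymax = cv_max)) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(y = "hold-out prediction (mean and range across repetitions)")
```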

The fold assignments are stored as a data frame named "fold_assignments". The values indicate the number of the fold that the case was assigned to; within a repetition, cases with the same fold number make up one of the hold-out data sets (and all cases not having that number were used to train a model to predict that hold-out set):

```r
cv_results$fold_assignments
```
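For instance, to pull out the rows that formed the first hold-out set in the first repetition (a sketch, taking the first column of `fold_assignments` as the first repetition):

```r
# Rows assigned to fold 1 in the first repetition; the model evaluated on
# this hold-out set was trained on all the other rows
holdout_1 <- which(cv_results$fold_assignments[[1]] == 1)
swiss[holdout_1, ]
```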

```{block2, type = 'rmd-caution'}
**The wrong and right way to do cross-validation**. A common blunder is to first perform variable selection on the basis of *all of the samples*, such as computing the pairwise correlation between each predictor and the response and retaining only those that meet a "significance" threshold. If `validate` is used on a model whose variables have been selected in this way, it will underestimate the true prediction error, in proportion to the number of predictors that were pre-screened; the more predictors that were eliminated from the original data set, the more erroneous the cross-validation estimate will be.

Why is this the case? Suppose we have two independent samples, A and B, and we want to train a model on Sample A and test it on Sample B. Now suppose we use information from Sample B to make decisions about how Sample A is trained. The samples will no longer be independent, and the more degrees of freedom we have to choose the model parameters for Sample A, the more opportunities we have to stumble upon and exploit spurious similarities between these two particular samples that will not generalize to other samples. Similarly, cross-validation estimates based on hold-out folds will be corrupted if the entire data set is used beforehand because the procedure no longer mimics the prediction of a brand new sample.

In summary, the intended use case for the `validate` function is to estimate the true prediction error of a model based on theory or prior work that has not been influenced by "peeking" at the relationships between the predictors and the response in the sample data. (It is perfectly okay to look at just the predictors by themselves and eliminate variables for reasons unrelated to the response, such as low variance.) Accurate cross-validation estimates of prediction error can still be obtained for models based on data-driven variable selection, but the variable selection must take place during the cross-validation procedure, not before.
```
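To make the blunder concrete, here is a sketch of the *wrong* workflow described above, applied to the `swiss` data. This is illustration only; the `screened_mod` object is hypothetical, created just to demonstrate the leak.

```r
# WRONG: screening predictors against the response on the full data set
# before cross-validating leaks information from the future hold-out folds
p_vals <- sapply(swiss[, -1], function(x) cor.test(x, swiss$Fertility)$p.value)
keep   <- names(p_vals)[p_vals < 0.05]
screened_mod <- lm(reformulate(keep, response = "Fertility"), data = swiss)
validate(screened_mod)  # cross-validation after this kind of screening is optimistically biased
```

The correct approach is to fold the screening step into the cross-validation loop itself, so that each training subsample performs its own selection.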


