This vignette contains some examples of cross-validation and bootstrapping with resampler.
library("tidyverse") library("broom") library("resamplr") library("boot")
These examples are taken from the examples in cv.glm
in the boot package.
Calculate leave-one-out and 5-fold cross-validation prediction error for a regression of mammal brain size on body size using the mammals
data set. This
will use RMSE as the cost function.
data(mammals, package = "MASS") mammals_cvloo <- crossv_loo(mammals) mammals_cv6 <- crossv_kfold(mammals, k = 5) mammals_err <- function(train, test) { mod <- glm(log(brain) ~ log(body), data = as.data.frame(train)) modelr::rmse(mod, as.data.frame(test)) } map2_dbl(mammals_cvloo$train, mammals_cvloo$test, mammals_err) %>% mean() map2_dbl(mammals_cv6$train, mammals_cv6$test, mammals_err) %>% mean()
Consider a regression of whether prostate cancer had spread to lymph nodes (r
) in prostate cancer patients on several predictors.
We will calculate the out-of-sample fit using leave-one-out cross validation and and 11-fold cross-validation prediction error for the nodal
data set.
Since it a binary outcome, we'll use the proportion classified correctly,
$$
\frac{1}{n} \sum I(\vert y_i - \hat{p}_i \vert > 0),
$$
as a measure of fit.
nodal_err <- function(train, test) { mod <- glm(r ~ stage + xray + acid, binomial, data = as.data.frame(train)) test <- as.data.frame(test) p <- predict(mod, newdata = test, type = "response") mean(abs(test$r - p) > 0.5) }
The expected fit from LOO-CV,
nodal_cvloo <- crossv_loo(nodal) map2_dbl(nodal_cvloo$train, nodal_cvloo$test, nodal_err) %>% mean()
and from 11-fold cross-validation,
nodal_cv11 <- crossv_kfold(nodal, 11) map2_dbl(nodal_cv11$train, nodal_cv11$test, nodal_err) %>% mean()
The examples are taken from the examples in the boot
function of the boot package.
This is an ordinary bootstrap of the ratio of means using the city
data
bootstrap(city, k = 999) %>% mutate(ratio_means = map_dbl(sample, function(d) { d <- as.data.frame(d) mean(d$x) / mean(d$u) })) %>% summarise_at(vars(ratio_means), funs(mean, sd))
A stratified bootstrap uses the bootstrap
function, but requires a grouped_df
object as input.
The method bootstrap.grouped_df
can resample groups (groups = TRUE
), within groups (stratify = TRUE
) or both. The default is a clustered bootstrap, resampling groups, but not rows within groups (groups = TRUE, stratify = FALSE
).
In this example we will look at the difference of means between the final two series in the gravity
data.
gravity %>% filter(series %in% c("7", "8")) %>% group_by(series) %>% bootstrap(k = 999, stratify = TRUE, groups = FALSE) %>% mutate(diff_means = map_dbl(sample, function(d) { as.data.frame(d) %>% group_by(series) %>% summarise(g = mean(g)) %>% summarise(diff_means = base::diff(g)) %>% `[[`("diff_means") })) %>% summarise_at(vars(diff_means), funs(mean, sd))
Example from car::Boot
:
Run a regression of fertility on several socioeconomic measures for the 47 French-speaking provinces of Switzerland in around 1888. Use the ordinary non-parametric bootstrap, and calculate the standard errors of $\hat{\beta}$ and $\hat{\sigma}$.
swiss_bs <- bootstrap(swiss, k = 200) %>% mutate(mod = map(sample, ~ lm(Fertility ~ ., data = .x)))
Find the standard deviation for $\hat{\beta}$,
map_df(swiss_bs$mod, broom::tidy) %>% group_by(term) %>% summarise(mean = mean(estimate), se = sd(estimate))
and for $\hat{\sigma}$,
swiss_bs %>% mutate(sigma = map_dbl(mod, ~ summary(.x)$sigma)) %>% summarise_at(vars(sigma), funs(mean, sd))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.