knitr::opts_chunk$set( collapse = TRUE, comment = "#" ) knitr::opts_chunk$set(cache=TRUE) knitr::opts_knit$set(cache.extra = 234) # seed options(warnPartialMatchArgs = FALSE, warnPartialMatchDollar = FALSE, warnPartialMatchAttr = FALSE)
For even quicker usage instructions, see the README.
The general idea behind this package is for you to be able to throw
in nearly any model built in the typical R fashion and get back
a non-parametric bootstrap analysis. For now, we just need
a somewhat standard way of extracting fixed effects, and that
update
works, which is nearly always the case.
Assuming your model is called base_model
, if
coef(summary(base_model))
gives the estimates and standard errors,
the method will work. The result of coef
is often a list; this
package will then produce estimates for each element of the
list. Common examples where there are multiple sets of coefficients
are zero-inflated models.
The data should be supplied manually, although it's currently the case
that if the dataframe that creates the model is accessible from
base_model$model
, base_model$frame
or base_model@frame
, you
won't have to send the data in to the function separately (but you
will get a warning). Be careful here - some models supply this but the
method introduces bugs. Much safer to supply the data explicitly.
Let's work through an example with glmmTMB
. We'll use the data
that comes with glmmboot
.
library(glmmboot) data(test_data) head(test_data)
We're assuming the x_var
variables are fixed, and subj
is to be treated as a random effect.
Thus our base analysis is:
## we'll skip this if glmmTMB not available library(glmmTMB) model_formula <- as.formula(y ~ x_var1 + x_var2 + x_var2 + (1 | subj)) base_run <- glmmTMB(formula = model_formula, data = test_data, family = binomial)
We get a warning because the outcome data is proportional. Not to worry.
Now we'll use the bootstrap. By default it'll perform block bootstrapping over the highest
entropy random effect - but there's only one, so of course the winner is subj
.
bootstrap_over_subj <- bootstrap_model(base_model = base_run, base_data = test_data, resamples = 99)
For publications etc, better to run about ten thousand resamples to avoid noise having much of an effect. Of course 99 is far too small, only for an example.
And comparing results:
print(bootstrap_over_subj)
The above might take a long time in a real setting. If it takes far too long on your machine,
you can ideally run it on a bunch of computers. We don't want each computer to
output the fully processed output, only the intermediate outcome. To do this,
we set return_coefs_instead = TRUE
for each run:
b_list1 <- bootstrap_model(base_model = base_run, base_data = test_data, resamples = 29, return_coefs_instead = TRUE) b_list2 <- bootstrap_model(base_model = base_run, base_data = test_data, resamples = 30, return_coefs_instead = TRUE) b_list3 <- bootstrap_model(base_model = base_run, base_data = test_data, resamples = 30, return_coefs_instead = TRUE)
Combining this is simple enough. If we've used a few, we don't want to mess around with even more lists, so we can enter them into the relevant function:
print(combine_resampled_lists(b_list1, b_list2, b_list3))
If we've run a huge number of such runs, ideally we'll combine all output to a list of lists, like so:
list_of_lists_output <- list(b_list1, b_list2, b_list3)
And we'll get the same result:
print(combine_resampled_lists(list_of_lists_output))
You MUST set return_coefs_instead = TRUE
for methods like these that combine output.
There are two primary ways to run models in parallel:
parallel
: this will use parallel::mclapply
to run the models.future
: this will use future.apply::future_lapply
to run the modelsUse he parameter parallelism
in the function bootstrap_model()
to choose how to parallelise.
parallelism = "none"
This is the default, and will run each model sequentially, using lapply()
.
If you set num_cores
to a number greater than 1, you'll get an error with this option.
This isn't the default if you set num_cores
to a value greater than 1.
parallelism = "parallel"
This will use parallel::mclapply
to run the models. If you set num_cores
,
then that many cores will be used. If you don't set num_cores
,
the function will use num_cores = parallel::detectCores() - 1
.
Example:
## will use `parallel::detectCores() - 1` cores model_results <- bootstrap_model(base_model = some_base_run, base_data = some_interesting_data, resamples = 9999, parallelism = "parallel") ## will use 4 cores: model_results <- bootstrap_model(base_model = some_base_run, base_data = some_interesting_data, resamples = 9999, parallelism = "parallel", num_cores = 4)
I've heard that parallel::mclapply
doesn't play well on Windows. Instead
of implementing parallel::parLapply
(which I will do if there's interest), I
would recommend using parallelism = "future"
, see below.
This becomes the default if you don't set parallelism
but just num_cores
.
parallelism = "future"
This uses the very nice future.apply::future_lapply
function to
run the models in parallel. Note that you MUST set the future::plan
(see the docs for the future
and future.apply
packages) to actually
make use of multiple cores etc, or else you'll just get a sequential run.
Using num_cores
is NOT the right
way to set the backend, and will cause an error.
Further, in many cases
you should supply any packages required for the model to run with the
argument future_packages
, because the S3 generic update()
is used
to update the model, which won't ship the required globals. You
don't need this if using plan(multicore)
, since all futures will
have access to the same shared memory (and you also don't need it if
your model is a base model, e.g. glm
).
This should work well with Windows.
Example:
library(future) plan("multiprocess") # "multiprocess" should work across Windows / Mac / Linux model_results <- bootstrap_model(base_model = some_base_run, base_data = some_interesting_data, resamples = 9999, parallelism = "future", future_packages = "glmmTMB")
num_cores
If you don't touch the parallelism
argument, but just set num_cores
, e.g.:
model_results <- bootstrap_model(base_model = some_base_run, base_data = some_interesting_data, resamples = 9999, num_cores = 8)
It'll be as if you set parallelism = "parallel"
, and will use
the parallel::mclapply
backend (unless you set num_cores = 1
!).
Unfortunately currently none of these methods have any progress bars setup. There used to be a call to a parallel package that did this, but it seemed to have some bugs. Hopefully this will be possible in the future.
The (not great) options for now: run a small set to estimate timing before running a large set of models using the split approach (see the Combining Runs approach above), print out some intermediate output between runs.
Let's look at an example of the second approach, using the future
argument. First, we'll
write a somewhat rough logging function:
log_remaining <- function(start_time, cur_time, j, total_iters, time_units = "hours"){ cur_time <- Sys.time() total_time <- difftime(cur_time, start_time, units = time_units) est_remaining <- (total_iters - j) * (total_time / j) paste0("[", cur_time, "] [iteration ", j, "] ", "total ", time_units, ": ", round(total_time, 3), " // ", "est remaining ", time_units, ": ", round(est_remaining, 3)) }
This looks like:
total_iterations <- 5 start_time <- Sys.time() for (j in 1:total_iterations) { ## simulate expensive operation... Sys.sleep(2) print(log_remaining(start_time, Sys.time(), j, total_iterations, "secs")) }
As before, don't forget return_coefs_instead = TRUE
. We'd run:
library(future) plan("multiprocess") results_list <- list() num_blocks <- 50 runs_per_block <- 100 start_time <- Sys.time() for (j in 1:num_blocks) { results_list[[j]] <- bootstrap_model(base_model = base_run, base_data = test_data, resamples = runs_per_block, parallelism = "future", return_coefs_instead = TRUE, suppress_sampling_message = TRUE) print(log_remaining(start_time, Sys.time(), j, num_blocks, "secs")) } combined_results <- combine_resampled_lists(results_list)
Of course with 50 blocks of size 100, we'd get the equivalent of 5000 samples.
Let's first reuse the model from the glmmTMB vignettes:
owls <- transform(Owls, nest = reorder(Nest, NegPerChick), ncalls = SiblingNegotiation, ft = FoodTreatment) fit_zipoisson <- glmmTMB( ncalls ~ (ft + ArrivalTime) * SexParent + offset(log(BroodSize)) + (1 | nest), data = owls, ziformula = ~1, family = poisson)
We run this more or less the same way as the other models:
zero_boot <- bootstrap_model(base_model = fit_zipoisson, base_data = owls, resamples = 9999)
In this case the output would be a list with two entries,
one matrix in zero_boot$cond
, the typical conditional
model,
and another matrix (in this case a single row, because ziformula = ~1
)
in zero_cond$zi
, each giving the same shape output as we've seen above.
In this particular model, the bootstrap model is quite conservative. Assumptions are often useful!
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.