knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
First things first. multitool
leverages the tidyverse
package so lets load both:
library(tidyverse) library(multitool)
Image we have some data with several predictor variables, moderators, covariates, and dependent measures. We want to know if our predictors (ivs
) interact with our moderators (mods
) to predict the outcome (dvs
).
But we have three versions of our predictor that (supposedly) measure the same thing, albeit in slightly different ways.
In addition, because we collected messy data from the real world (not really but let's pretend), we have some idea of which observations to include and which we might exclude (e.g., include1
, include2
, include3
).
the_data <- data.frame( id = 1:500, iv1 = rnorm(500), iv2 = rnorm(500), iv3 = rnorm(500), mod = rnorm(500), dv1 = rnorm(500), dv2 = rnorm(500), include1 = rbinom(500, size = 1, prob = .1), include2 = sample(1:3, size = 500, replace = TRUE), include3 = rnorm(500) )
Say we don't know much about this new and exciting area of research.
We want to maximize our knowledge but we also want to be systematic. One approach would be to specify a reasonable analysis pipeline. Something that looks like the following:
# Filter out exclusions filtered_data <- the_data |> filter( include1 == 0, # -- include2 != 3, # Exclusion criteria include3 > -2.5 # -- ) # Model the data my_model <- lm(dv1 ~ iv1 * mod, data = filtered_data) # Check the results my_results <- parameters::parameters(my_model)
But what if there are valid alternative alternatives to this pipeline?
For example, using iv2
instead of iv1
or only using two exclusion criteria instead of three? A sensible approach would be to copy the code above, paste it, and edit with different decisions.
This quickly become tedious. It adds many lines of code, many new objects, and is difficult to keep track of in a systematic way.
Enter multitool
.
With multitool
, the above analysis pipeline can be transformed into a specification blueprint for exploring all combinations of sensible data decisions in a pipeline. It was designed to leverage already written code (e.g., the filter
statement above) to create a all possible combinations of data analysis pipelines.
Our example above has three exclusion criteria. If we don't know which are important, for example, because they are based on arbitrary 'rules of thumb' (that may or may not have inherent wisdom) or we don't know if including/excluding these cases is valid, we can generate all combinations:
the_data |> add_filters(include1 == 0, include2 != 3, include3 > -2.5)
The output above is a simple tibble
(i.e., data.frame
) containing three columns.
Each row is a possible filter: the type
column refers to the type of blueprint specification (see below for types other than filters), the group
refers to the variable in the base data frame (in our case the_data
) for which the filter applies, and the code
column contains the code needed to execute the filter.
For filtering decisions (e.g., exclusion criteria), a 'do nothing' alternative is always generated.
For example, perhaps some observations belong to a subgroup, include1 == 1
. We may or may not have good reason to exclude these cases (this depends on the specific situation).
But imagine that we don't know if we should include them or not. When include1 == 1
is added to add_filters()
, the 'do nothing' alternative include1 %in% unique(include1)
is automatically generated so you can compare including versus excluding cases based on a criterion.
Most multiverse-style analyses explore a range of exclusion criteria and their alternatives. However, sometimes alternative versions of a variable are also included.
In the social sciences, it is fairly common to have many measures of roughly the same construct (i.e., measured variable). For example, a happiness researcher might measure positive mood, life satisfaction, and/or a single item measuring happiness (e.g., 'how happy do your feel?').
If you want to explore the output of your pipeline with differing versions of a variable, you can use add_variables()
.
the_data |> add_variables(var_group = "ivs", iv1, iv2, iv3)
The output above generates the same tibble
as add_filters()
. Each row is a particular decision to use a particular variable in your pipeline.
In contrast to filter, however, you need to tell add_variables()
what to call each set of variables with the var_group
argument. This is how multitool
knows that each variable name in the code
column is a different alternative of a larger set.
Here, var_group = "ivs"
indicates that iv1, iv2, iv3
are all different versions of ivs
. I used "ivs" as way of indicating to myself that these are alternative versions of my main independent variable.
You can add as many variable sets as you want. For example, we might also want to analyze our two versions of the outcome, dv1
and dv2
.
the_data |> add_variables(var_group = "ivs", iv1, iv2, iv3) |> add_variables(var_group = "dvs", dv1, dv2)
You can harness the real power of multitool
by piping specification statements.
For example, perhaps we want to explore our exclusion criteria alternatives across different versions of our predictor and outcome variables. We can simply pipe new blueprint specifications into each other like so:
the_data |> add_filters(include1 == 0, include2 != 3, include3 > -2.5) |> add_variables(var_group = "ivs", iv1, iv2, iv3) |> add_variables(var_group = "dvs", dv1, dv2)
Notice that we now have a specification blueprint with both exclusion alternatives and variable alternatives.
The whole point of building a specification blueprint is to eventually feed it to a model and examine the results.
You can add a model to your blueprint by using add_model()
. I designed add_model()
so the user can simply paste a model function. For example, our call to lm()
can be simply pasted into add_model()
. Make sure to give your model a label with the model_desc
argument.
the_data |> add_filters(include1 == 0, include2 != 3, include3 > -2.5) |> add_variables(var_group = "ivs", iv1, iv2, iv3) |> add_variables(var_group = "dvs", dv1, dv2) |> add_model("linear model", lm(dv1 ~ iv1 * mod))
Above, the model is completely unquoted. It also has no data
argument. This is intentional; multitool
is tracking the base dataset along the way (so you don't have to). Note that you can still quote the model formula, if that is more your style.
the_data |> add_filters(include1 == 0, include2 != 3, include3 > -2.5) |> add_variables(var_group = "ivs", iv1, iv2, iv3) |> add_variables(var_group = "dvs", dv1, dv2) |> add_model("linear model", "lm(dv1 ~ iv1 * mod)")
To make sure your add_variables()
works properly, add_model()
was designed to interpret glue::glue()
syntax. For example:
the_data |> # add_filters(include1 == 0, include2 != 3, include3 > -2.5) |> add_variables(var_group = "ivs", iv1, iv2, iv3) |> add_variables(var_group = "dvs", dv1, dv2) |> add_model("linear model", lm({dvs} ~ {ivs} * mod)) # see the {} here
This allows multitool
to insert the correct version of each variable specified in a add_variables()
step. Make sure to embrace the variable with the var_group
argument from add_variables()
, for example add_model(lm({dvs} ~ {ivs} * mod))
.
Here, dvs
and ivs
tells multitool
to insert the current version of the ivs
and dvs
into the model.
There are two steps in finalizing your blueprint. The first is to visualize your pipeline with a graph. This is optional, but I think it is helpful.
You can automate making a chart with create_blueprint_graph()
. Feed your pipeline to create_blueprint_graph()
to see a chart of your multiverse pipeline plan:
full_pipeline <- the_data |> add_filters(include1 == 0, include2 != 3, include3 > -2.5) |> add_variables(var_group = "ivs", iv1, iv2, iv3) |> add_variables(var_group = "dvs", dv1, dv2) |> add_model("linear model", lm({dvs} ~ {ivs} * mod)) create_blueprint_graph(full_pipeline)
The final step in making your blueprint is expanding all your specifications into all possible combinations. You can do this by calling expand_decisions()
at the end of your blueprint pipeline:
expanded_pipeline <- expand_decisions(full_pipeline) expanded_pipeline
The result is an expanded tibble
with 1 row per unique decision and columns for each major blueprint category. In our example, we have alternative variables (predictors and outcomes), filters (three exclusion alternatives), and a model to run.
Note that we have 3 exclusions (each with two combinations), 3 versions of our predictor, and 2 versions of our outcome. This means our blueprint should have 2*2*2*3*2
or r 2*2*2*3*2
rows, which corresponds with our expanded pipeline:
2*2*2*3*2 == nrow(expanded_pipeline)
Our blueprint uses list columns to organize information. You can view each list column by using tidyr::unnest(<column name>)
. For example, we can look at the filters:
expanded_pipeline |> unnest(filters)
Or we could look at the models:
expanded_pipeline |> unnest(models)
Notice that, with the glue::glue()
syntax, different versions of our predictors and outcomes were inserted appropriately. You can check their correspondence by using unnest()
on both the models and variable list columns:
expanded_pipeline |> unnest(c(variables, models))
This example uses relatively simple pipeline steps. You can also add more sophisticated steps to your pipeline, such as preprocessing data, post-processing model results, or calculating descriptive statistics alongside your model.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.