library(learnr)
library(gradethis)
gradethis_setup()
knitr::opts_chunk$set(echo = FALSE)
Today, we are going to return to the topic of reproducibility that we have talked about all semester and discuss several important best practices you can follow to help ensure that:
The principle of reproducibility has been central to the way we've approached programming in this course. Your analyses should be able to run without any manual input from you. Reproducibility has two major components.
Example: Generating a report with a plot, a table, and some results in text.
Key concepts to live by:
What versions of packages are you using?
Software packages are often updated, and their inputs or outputs might change across versions. It's important to tell your readers (and future-you) what versions of each package you are using. The best way to do this is to include one of these functions at the end of your report or in an appendix:
sessionInfo()
devtools::session_info()
The `devtools` version has some additional useful info and is somewhat more nicely formatted.
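One way to keep that record alongside your output is to write the session information to a text file whenever the report runs. The following is a minimal sketch using base R only. The file name `session_info.txt` is an assumption made for illustration, and the file is written to a temporary directory here so the example does not touch your project.

```r
# Capture the printed session information as a character vector
session_lines <- capture.output(sessionInfo())

# Hypothetical output location; in a real project you might save this
# next to your report instead of in a temporary directory
session_file <- file.path(tempdir(), "session_info.txt")
writeLines(session_lines, session_file)

# Confirm that the file was written
file.exists(session_file)
```

Saving the output to a file means the package versions are recorded even if the appendix is later cut from the report.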
Why in tarnation did they do that?!?
Beyond just being able to reproduce the same numbers, tables, figures, etc., reproducibility is also concerned with you and others being able to reproduce the thinking that led to the code, analyses, and results.
What do the data and variables mean? Why did you choose these analyses? Why did you write your code this way versus another? Being able to answer these questions is critical to being able to trust the results. By clicking the "Knit" button, you might be able to reproduce the numbers in the table, but are those numbers right? Is there a bug in the code, does a function not work the way that the author thought, etc.?
The way to make sure that readers or future-you can figure this out is by providing extensive and clear documentation, both within files (comments) and between files (codebooks, READMEs).
Think about your documentation at three levels:
For resources on documenting code, see the tidyverse style guide.
For resources on documenting data files, check out the codebook package. It can automatically produce codebooks for a data file, saving you a lot of time. Its vignettes (e.g., for SPSS, formr, and Qualtrics) are a great resource for thinking about the type of information that a codebook should contain.
Your variables, functions, and other objects should have clear, concise, and unambiguous names.
- Avoid vague names like `foo`, `dat2`, `model_3`, etc.
- Use `%>%` to avoid creating unnecessary intermediate objects
- Avoid unexplained "magic numbers" in your code
Bad:
x <- rnorm(100)
y <- x + rnorm(100)
Good:
n <- 100
x <- rnorm(n)
y <- x + rnorm(n)
- Take advantage of autocomplete by pressing `Tab` on your keyboard
- Choose names that work well with autocomplete: compare `canada_gdp` and `china_gdp` with `gdp_canada` and `gdp_china`.
Write your code for humans. Someone reading your code should be able to figure out what it does. This includes both writing explanatory comments and also writing the code itself in a way that is clear about what it does.
Code that is clear and speaks for itself is called "self-documenting code". Some
ideas that make code more self-documenting include using clear variable names
and doing things one step at a time, rather than combining multiple operations
into one line. In my experience, using the tidyverse functions (e.g., `dplyr`) can also help new R users figure out what your code is doing.
Base R:
mtcars[mtcars$cyl < 8, c("cyl", "mpg")]
tidyverse:
mtcars %>% filter(cyl < 8) %>% select(cyl, mpg)
Think carefully about how detailed your comments need to be! Overly detailed comments could be more confusing than no comments at all. They also can be hard to keep accurate as you revise your code or if you move things around. Focus on the high-level decisions about the programming/analysis.
# Lag the negative affect variable twice.
# Create lagged predictors for modeling.
Also think carefully about how you need to explain your code to someone who isn't already familiar with it. Don't just say what your code does, but also why it is written the way it is. Don't use comments to describe what your code is doing at a low level.
# make data frame of cylinders less than 8, with variables 'cyl' and 'mpg'
# Select relevant cases and variables for analyses
DRY = Don't repeat yourself
Avoid repeating or copy-pasting the same lines of code over and over, then making minor changes. This is prone to typos, errors, and breakage down the line.
If you are going to do something more than once, then use functions or write functions to do the repetition for you.
Example: Running an analysis by subgroup
mod_wt_4cyl <-
mtcars %>%
filter(cyl == 4) %>%
lm(mpg ~ wt, data = .)
mod_wt_6cyl <-
mtcars %>%
filter(cyl == 6) %>%
lm(mpg ~ wt, data = .)
mod_wt_8cyl <-
mtcars %>%
filter(cyl == 8) %>%
lm(mpg ~ wt, data = .)
# Requires dplyr >= 1.0.0 (for nest_by())
# Install the current release from CRAN if you have an older version:
# install.packages("dplyr")
mods_wt <-
mtcars %>%
nest_by(cyl) %>%
summarize(mods_wt = list(lm(mpg ~ wt, data = data)))
mtcars %>%
split(.$cyl) %>%
map(~ lm(mpg ~ wt, data = .x))
model_cyl_subgroup <- function(data, n_cyl, formula) {
  data %>%
    # The argument is named n_cyl so that it does not clash with the cyl column
    filter(cyl == n_cyl) %>%
    lm(formula, data = .)
}
mods_wt <- map(c(4, 6, 8),
               ~ model_cyl_subgroup(data = mtcars,
                                    n_cyl = .x,
                                    formula = mpg ~ wt))
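For comparison, the same subgroup analysis can be sketched in base R with no extra packages. This is only one possible approach, not a replacement for the versions above. The object names `cars_by_cyl`, `mods_wt_base`, and `coefs_wt` are made up for this example.

```r
# Split mtcars into one data frame per number of cylinders
cars_by_cyl <- split(mtcars, mtcars$cyl)

# Fit the same model to each subgroup; the model is written only once
mods_wt_base <- lapply(cars_by_cyl, function(d) lm(mpg ~ wt, data = d))

# Collect the coefficients into one matrix (one column per subgroup)
coefs_wt <- sapply(mods_wt_base, coef)
coefs_wt
```

Because the model formula appears in a single place, changing it (for example, adding a covariate) updates every subgroup model at once.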
Example: Running an analysis for each predictor
Bad:
```
# Make some data
dat_big_five <-
  psych::bfi %>%
  select(age, O = O1, C = C1, E = E1, A = A1, N = N1) %>%
  slice(sample(1:nrow(.), 150)) %>%
  na.omit()
mod_age_O <- lm(age ~ O, data = dat_big_five)
mod_age_C <- lm(age ~ C, data = dat_big_five)
mod_age_E <- lm(age ~ E, data = dat_big_five)
mod_age_A <- lm(age ~ A, data = dat_big_five)
mod_age_N <- lm(age ~ N, data = dat_big_five)
# Good:
vars_big_five <- c("O", "C", "E", "A", "N")
mods_age <-
dat_big_five %>%
summarize(across(all_of(vars_big_five),
~ list(lm(age ~ .x)),
.names = "mod_age_{.col}"))
# Good:
model_age <- function(data, predictor) {
data %>%
select(age, all_of(predictor)) %>%
lm(age ~ . , data = .)
}
mods_age <- map(vars_big_five,
~ model_age(data = dat_big_five,
predictor = .x))
```
These are obviously very simple toy functions, but imagine a case where you have much more complex models or series of analyses that you will need to repeat over and over.
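The same "one function, many predictors" pattern can also be sketched in base R, using `reformulate()` to build each formula from a predictor name. This sketch uses `mtcars` rather than the Big Five data so that it runs without the psych package. The predictors chosen and the names `fit_one` and `mods_mpg` are assumptions for illustration.

```r
# Predictors to loop over (chosen arbitrarily from mtcars for illustration)
predictors <- c("wt", "hp", "disp")

# Build the formula mpg ~ <predictor> and fit the model
fit_one <- function(predictor, data) {
  lm(reformulate(predictor, response = "mpg"), data = data)
}

# Naming the input vector gives the resulting list matching names
mods_mpg <- lapply(setNames(predictors, predictors), fit_one, data = mtcars)

# Slope for each predictor
sapply(mods_mpg, function(m) coef(m)[[2]])
```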
Make sure that your code produces the correct results. This is best done by writing
a "unit test"—give a function/bit of code some input with a known expected output
and make sure they are the same. You should write automatic tests for your code
to be sure it produces the right result.
- Check out the `testthat` package.
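As a rough sketch of what such a test can look like, the example below checks a small helper against an answer worked out by hand. The `standardize()` function is hypothetical and written only for this illustration. The checks use base R's `stopifnot()` so the example runs without additional packages.

```r
# Hypothetical helper: standardize a numeric vector to mean 0 and SD 1
standardize <- function(x) {
  (x - mean(x)) / sd(x)
}

# Known input with an answer we can work out by hand:
# mean(c(1, 2, 3)) is 2 and sd(c(1, 2, 3)) is 1, so the result is -1, 0, 1
result <- standardize(c(1, 2, 3))

# stopifnot() throws an error if any check is not TRUE
stopifnot(
  isTRUE(all.equal(result, c(-1, 0, 1))),
  isTRUE(all.equal(mean(result), 0)),
  isTRUE(all.equal(sd(result), 1))
)
```

With `testthat`, the same checks would typically be written inside `test_that()` using `expect_equal()`, which gives more informative failure messages.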
Format your code so that it is easy to read. For example:
- Include spaces between object names
- Line up parallel lines of code (see the arguments in `map()` above)
- Use indentation to guide the reader through how to read your code
Following a style guide, such as the tidyverse style guide, is a good practice for making your code readable.
Be sure that your final project follows these coding practice guidelines!
Let's check our understanding
Review the following code:
# Make some data
dat_big_five <-
  psych::bfi %>%
  select(age, O = O1, C = C1, E = E1, A = A1, N = N1) %>%
  slice(sample(1:nrow(.), 150)) %>%
  na.omit()
o_average <- mean(dat_big_five$O)
c_average <- mean(dat_big_five$C)
e_average <- mean(dat_big_five$E)
a_average <- mean(dat_big_five$A)
n_average <- mean(dat_big_five$N)
quiz(
  question("Does the above code chunk follow the DRY principle?",
    answer("Yes"),
    answer("No", correct = TRUE),
    random_answer_order = TRUE,
    allow_retry = TRUE
  ),
  question("Should you constantly be restarting your R session when creating a script?",
    answer("Yes, when I run my script restarting will ensure that everything is reproducible.", correct = TRUE),
    answer("Yes, then I can start working on other things"),
    answer("No, restarting will delete all my work"),
    answer("No, restarting is bad."),
    random_answer_order = TRUE,
    allow_retry = TRUE
  ),
  question("What is a 'magic number'?",
    answer("Numbers in your code that are present without any explanation as to what they are or where they came from", correct = TRUE),
    answer("Numbers that will create code"),
    answer("Numbers that have an explanation and whose origin is known"),
    random_answer_order = TRUE,
    allow_retry = TRUE
  ),
  question("Is it a good idea to separate the words in my variable names with periods? (Example: `cool.data.set`)",
    answer("No, it will mess with some of the R functionality", correct = TRUE),
    answer("Yes, it looks good."),
    random_answer_order = TRUE,
    allow_retry = TRUE
  ),
  question("When creating variable names, in which direction should our names disambiguate?",
    answer("Left-to-Right", correct = TRUE),
    answer("Right-to-Left"),
    random_answer_order = TRUE,
    allow_retry = TRUE
  )
)
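After answering, it may help to see one way the repeated `mean()` calls in the reviewed code could be collapsed. This is a sketch under stated assumptions: it uses a small simulated data frame (`dat_sim`, with arbitrary random values) in place of `psych::bfi` so that it runs without additional packages, and the name `big_five_averages` is made up for this example.

```r
set.seed(1)  # make the simulated values reproducible

# Simulated stand-in for dat_big_five (values are arbitrary, for illustration only)
vars_big_five <- c("O", "C", "E", "A", "N")
n_people <- 150
dat_sim <- as.data.frame(
  matrix(sample(1:6, n_people * length(vars_big_five), replace = TRUE),
         ncol = length(vars_big_five),
         dimnames = list(NULL, vars_big_five))
)

# One call computes every mean, instead of one copy-pasted line per variable
big_five_averages <- sapply(dat_sim[vars_big_five], mean)
big_five_averages
```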