In bwiernik/progdata: Tutorials for USF Programming with Data

library(learnr)
library(gradethis)
gradethis_setup()
knitr::opts_chunk$set(echo = FALSE)

Exercise

Here's a simple exercise with an empty code chunk provided for entering the answer.

Reproducibility and Coding Best Practices

Today, we are going to return to the topic of reproducibility that we have talked about all semester and discuss several important best practices you can follow to help ensure that:

Your analyses are reproducible: They always return the same results, in the same way, every time.
You can easily and painlessly update your analyses and output (e.g., if you get new data or fix a bug or error).
- Jenny Bryan: "If the thought of re-running your analysis makes you ill, you're not doing it right."
People can easily read your code, understand what it is doing, and follow your thinking as a data analyst.
Future-you can easily easily read your code, understand what it is doing, and follow past-you's thinking as a data analyst.
Someone else (e.g., a future RA) could easily take over and keep working on the project or analysis.

Resources

Wilson, G., et al. (2014). Best practices for scientific computing. PLoS Biology
- A great guide for both coding and project management practices to help make your analyses more reproducible.
The tidyverse style guide for R.

Reproducibility

The principle of reproducibility has been central to the way we've approached programming in this course. Your analyses should be be able to run without any manual input from you. Reproducibility has two major components.

1. How easily someone can reproduce output

Example: Generating a report with a plot, a table, and some results in text.

Worst case scenario: No source code available; have to redo everything from scratch.
Common scenario: Source is available, but it's incomplete (e.g.,)
Better scenario: Full source is available, but output is arranged in the report manually.
Best case scenario: Complete report output is regenerated with the click of a button.

Key concepts to live by:

Source is real. Output is transient.
- Source >> Output
- Think of your source files (your .R files, .Rmd files, etc.) as "real". Think about output files (figures, CSV tables, HTML, PDF, or Word documents, etc.) as "transient"—things that are temporary and likely to be overwritten by the next run of your code.
- Don't make changes in your output files that matter. Make those changes in the source.
When you are building a script…
- (i.e., when you are writing code, then immediately running it and looking at the results)
- This is called "working interactively"
- Frequently kill your R session and re-run your script from the top. It should still work.
- In RStudio, Session → Restart R
- Don't save your workspace or history when exiting from R
- Disable this in the RStudio settings.

What versions of packages are you using?

Software packages are often updated, and their inputs or outputs might change across versions. It's important to tell your readers (and future-you) what versions of each package you are using. The best way to do this is to include one of these functions at the end of your report or in an appendix:

sessionInfo()
devtools::session_info()

The devtools version has some additional useful info and is somewhat more nicely formatted.

2. How easily someone can reproduce your frame of mind or thought proccess

Why in tarnation did they do that?!?

Beyond just being able to reproduce the same numbers, tables, figures, etc., reproducibility is also concerned with you others being able to reproduce the thinking that lead to code, analyses, and results.

What do the data and variables mean? Why did you choose these analyses? Why did you write your code this way versus another? Being able to answer these questions is critical to being able to trust the results. By clicking the "Knit" button, you might be able to reproduce the numbers in the table, but are those numbers right? Is there a bug in the code, does a function not work the way that the author thought, etc.?

The way that you make sure that readers or future-you can figure out is by providing extensive and clear documentation, both within files (comments) and between files (codebooks, READMEs).

Think about your documentation at three levels:

The big picture
- What is this script or function doing overall?
- What is the broad organization of your files and folders?
The walkthrough
- For each section of code, what is it broadly doing?
- Think of this like the headings or summaries for the block.
- Don't get too detailed or overdescribe here. "Compute predictor composites" is fine.
The nitty gritty
- What exactly is a specific line of code doing?
- Reserve this level of detail for when something is unusual or would look odd to someone else/future-you.
- If your code needs a lot of comments to explain, consider re-writing it to be clearer.

For resources on documenting code, see the tidyverse style guide.

For resources on documenting data files, check out the codebook package. It can automatically produce codebooks for a data file, saving you a lot of time. Its vignettes (e.g., for SPSS, formr, and Qualtrics) are a great resource for thinking about the type of information that a codebook should contain.

Good Coding Practices

Naming

Your variables, functions, and other objects should have clear, concise, and unambiguous names.

Pick a style for naming variables and stick with it consistently:
Some people use camelCase, snake_case, or period.case
Don't use period.case! (it can mess with some R functionality)
Follow consistent rules for different types of objects
Functions should be verbs (based on the one main thing that it does)
Objects (data, models, results, figures, etc.) should be nouns
Functions that return functions should be adverbs
Always use descriptive names
Not foo, dat2, model_3, etc.
Don't over-create
The more objects there are in your global environment, the more confusing it will be to try to keep track of them
Especially if you don't follow consistent rules for when you make objects and how you name them
Use the pipe %>% to avoid creating unnecessary intermediate objects
Don't under-create
If you will re-use an object more than a few times, make it once and save it
Avoid "magic numbers"
- Numbers in your code that are present without any explanation as to what they are or where they came from

Bad:

x <- rnorm(100)
y <- x + rnorm(100)

Good:

n <- 100
x <- rnorm(n)
y <- x + rnorm(n)

Disambiguate from left-to-right, not right-to-left
This makes it easier to figure out what an object is
It also makes it easier to complete typing a name by typing Tab on your keyboard
Bad: canada_gdp and china_gdp.
Good: gdp_canada and gdp_china.

Documenting code

Write your code for humans. Someone reading your code should be able to figure out what it does. This includes both writing explanatory comments and also writing the code itself in a way that is clear about what it does.

Code that is clear and speaks for itself is called "self-documenting code". Some ideas that make code more self-documenting include using clear variable names and doing things one step at a time, rather than combining multiple operations into one line. Using the tidyverse functions (e.g., dplyr) can also help new R users to figure out what your code is doing in my experience.

Base R:

mtcars[mtcars$cyl < 8, c("cyl", "mpg")]

tidyverse:

mtcars %>% 
   filter(cyl < 8) %>% 
   select(cyl, mpg)

Think carefully about how detailed your comments need to be! Overly detailed comments could be more confusing than no comments at all. They also can be hard to keep accurate as you revise your code or if you move things around. Focus on the high-level decisions about the programming/analysis.

Bad: # Lag the negative affect variable twice.
Good: # Create lagged predictors for modeling. Also think carefully about how you need to explain your code to someone who isn't already familiar with it—don't use say what your code does, but also why it is written the way it is.

Don't use comments to describe what your code is doing on a low-level.

Bad: # make data frame of cylinders less than 8, with variables 'cyl' and 'mpg'
Maybe okay: No comment (if the code is clear what it is doing)
- Using the tidyverse will often be helpful here.
Maybe okay: # Select relevant cases and variables for analyses

The DRY principle

DRY = Don't repeat yourself

Avoid repeating or copy-pasting the same lines of code over and over, then making minor changes. This is prone to typos, errors, and breakage down the line.

If you are going to do something more than once, then use functions or write functions to do the repetition for you.

Example: Running an analysis by subgroup

Bad: mod_wt_4cyl <- mtcars %>% filter(cyl == 4) %>% lm(mpg ~ wt, data = .) mod_wt_6cyl <- mtcars %>% filter(cyl == 6) %>% lm(mpg ~ wt, data = .) mod_wt_8cyl <- mtcars %>% filter(cyl == 6) %>% lm(mpg ~ wt, data = .)
Good: # Requires dplyr >0.8.99 or >1.0.0 # Install from GitHub if you don't have this version: # devtools::install_github("tidyverse/dplyr") mods_wt <- mtcars %>% nest_by(cyl) %>% summarize(mods_wt = list(lm(mpg ~ wt, data = data)))
Good: mtcars %>% split(.$cyl) %>% map(~ lm(mpg ~ wt, data = .x))
Good: model_cyl_subgroup <- function(data, cyl, formula) { data %>% select(cyl == cyl) %>% lm(formula, data = .) } mods_wt <- map(c(4, 6, 8), ~ model_cyl_subgroup(data = mtcars, cyl = .x, formula = mpg ~ wt)

Example: Running an analysis for each predictor

Bad: ``` # Make some data dat_big_five <- psych::bfi %>% select(age, O = O1, C = C1, E = E1, A = A1, N = N1) %>% slice(sample(1:nrow(.), 150)) %>% na.omit()

mod_age_O <- lm(age ~ O, data = dat_big_five) mod_age_C <- lm(age ~ C, data = dat_big_five) mod_age_E <- lm(age ~ E, data = dat_big_five) mod_age_A <- lm(age ~ A, data = dat_big_five) mod_age_N <- lm(age ~ N, data = dat_big_five) - Good: vars_big_five <- c("O", "C", "E", "A", "N") mods_age <- dat_big_five %>% summarize(across(all_of(vars_big_five), ~ list(lm(age ~ .x)), .names = "mod_age_{col}")) - Good: model_age <- function(data, predictor) { data %>% select(age, predictor) %>% lm(age ~ . , data = .) } mods_age <- map(vars_big_five, ~ model_age(data = dat_big_five, predictor = .x)) ```

These are obviously very simple toy functions, but imagine a case where you have much more complex models or series of analyses that you will need to repeat over and over.

Write Tests

Make sure that your code produces the correct results. This is best done by writing a "unit test"—give a function/bit of code some input with a known expected output and make sure they are the same. You should write automatic tests for your code to be sure it produces the write result. - Check out the testthat package.

Code Styling

Format your code so that it is easy to read. For example: - Include spaces between object names - Line up parallel lines of code (see the arguments in map() above - Use indentation to guide the reader through how to read your code

Following a style guide, such as the tidyverse style guide is a good practice for making your code readable.

Activity

Be sure that your final project follows these coding practice guidelines!

Let's check our understanding

Review the following code:

    # Make some data
    dat_big_five <- psych::bfi %>% 
      select(age, O = O1, C = C1, E = E1, A = A1, N = N1) %>% 
      slice(sample(1:nrow(.), 150)) %>% 
      na.omit()

    o_average <- mean(dat_big_five$O)
    c_average <- mean(dat_big_five$C)
    e_average <- mean(dat_big_five$E)
    a_average <- mean(dat_big_five$A)
    n_average <- mean(dat_big_five$N)

quiz(
  question("Does the above code chunk follow the DRY principle?",
           answer("Yes"),
           answer("No", correct = TRUE),
           random_answer_order = TRUE,
           allow_retry = TRUE
  ),
  question("Should you constantly be restarting your R session when creating a script?",
           answer("Yes, when I run my script restarting will ensure that everything is reproducible.", correct = TRUE),
           answer("Yes, then I can start working on other things"),
           answer("No, restarting will delete all my work"),
           answer("No, restarting is bad."),
           random_answer_order = TRUE,
           allow_retry = TRUE
  ),
  question("What is a 'magic number'?",
           answer("Numbers in your code that are present without any explanation as to what they are or where they came from", correct = TRUE),
           answer("Numbers that will create code"),
           answer("Numbers that have an explanation and which the origin is known"),
           random_answer_order = TRUE,
           allow_retry = TRUE
  ),
  question("Is it a good idea to seperate my variable names? (Example: `cool.data.set`)",
           answer("No, it will mess with some of the R functionality", correct = TRUE),
           answer("Yes, it looks good."),
           random_answer_order = TRUE,
           allow_retry = TRUE
  ),
  question("When creating variable names, which direction should our names disamiguate?",
           answer("Left-to-Right", correct = TRUE),
           answer("Right-to-Left"),
           random_answer_order = TRUE,
           allow_retry = TRUE
  )
)

bwiernik/progdata documentation built on Feb. 1, 2021, 2:33 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

bwiernik/progdata
Tutorials for USF Programming with Data

In bwiernik/progdata: Tutorials for USF Programming with Data

Exercise

Reproducibility and Coding Best Practices

Resources

Reproducibility

1. How easily someone can reproduce output

2. How easily someone can reproduce your frame of mind or thought proccess

Good Coding Practices

Naming

Documenting code

The DRY principle

Write Tests

Code Styling

Activity

R Package Documentation

Browse R Packages

We want your feedback!

bwiernik/progdata Tutorials for USF Programming with Data

In bwiernik/progdata: Tutorials for USF Programming with Data

Exercise

Reproducibility and Coding Best Practices

Resources

Reproducibility

1. How easily someone can reproduce output

2. How easily someone can reproduce your frame of mind or thought proccess

Good Coding Practices

Naming

Documenting code

The DRY principle

Write Tests

Code Styling

Activity

R Package Documentation

Browse R Packages

We want your feedback!

bwiernik/progdata
Tutorials for USF Programming with Data