library(learnr)
library(gradethis)
gradethis_setup()
knitr::opts_chunk$set(echo = FALSE)

Reproducibility and Coding Best Practices

Today, we are going to return to the topic of reproducibility that we have talked about all semester and discuss several important best practices you can follow to help ensure that:

  1. Your analyses are reproducible: They always return the same results, in the same way, every time.
  2. You can easily and painlessly update your analyses and output (e.g., if you get new data or fix a bug or error).
    • Jenny Bryan: "If the thought of re-running your analysis makes you ill, you're not doing it right."
  3. People can easily read your code, understand what it is doing, and follow your thinking as a data analyst.
  4. Future-you can easily read your code, understand what it is doing, and follow past-you's thinking as a data analyst.
  5. Someone else (e.g., a future RA) could easily take over and keep working on the project or analysis.

Resources

Reproducibility

The principle of reproducibility has been central to the way we've approached programming in this course. Your analyses should be able to run without any manual input from you. Reproducibility has two major components.

1. How easily someone can reproduce output

Example: Generating a report with a plot, a table, and some results in text.

Key concepts to live by:

What versions of packages are you using?

Software packages are often updated, and their inputs or outputs might change across versions. It's important to tell your readers (and future-you) what versions of each package you are using. The best way to do this is to include one of these functions at the end of your report or in an appendix:

sessionInfo()
devtools::session_info()

The devtools version has some additional useful info and is somewhat more nicely formatted.
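As a minimal sketch, the final chunk of a report might look like the following. The object names are illustrative, and only base R functions are used so that the sketch runs without devtools installed.

```r
# Record the R version, platform, and attached/loaded package versions
session_details <- sessionInfo()
print(session_details)

# Look up the version of one specific package
# ("stats" ships with R, so this call works on any installation)
stats_version <- packageVersion("stats")
print(stats_version)
```

Printing this at the end of the report, rather than the start, means it captures every package that was actually loaded while the report ran.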

2. How easily someone can reproduce your frame of mind or thought process

Why in tarnation did they do that?!?

Beyond just being able to reproduce the same numbers, tables, figures, etc., reproducibility is also concerned with you and others being able to reproduce the thinking that led to the code, analyses, and results.

What do the data and variables mean? Why did you choose these analyses? Why did you write your code this way versus another? Being able to answer these questions is critical to being able to trust the results. By clicking the "Knit" button, you might be able to reproduce the numbers in the table, but are those numbers right? Is there a bug in the code, does a function not work the way that the author thought, etc.?

The way to make sure that readers or future-you can figure this out is to provide extensive and clear documentation, both within files (comments) and between files (codebooks, READMEs).

Think about your documentation at three levels:

  1. The big picture
    • What is this script or function doing overall?
    • What is the broad organization of your files and folders?
  2. The walkthrough
    • For each section of code, what is it broadly doing?
    • Think of this like the headings or summaries for the block.
    • Don't get too detailed or overdescribe here. "Compute predictor composites" is fine.
  3. The nitty gritty
    • What exactly is a specific line of code doing?
    • Reserve this level of detail for when something is unusual or would look odd to someone else/future-you.
    • If your code needs a lot of comments to explain, consider re-writing it to be clearer.
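The three levels above could look like this in a short script. This is only a sketch: the analysis is hypothetical, it uses the built-in mtcars data, and the object names are made up for illustration.

```r
# ---- Big picture ------------------------------------------------------------
# Hypothetical script: estimates how car weight relates to fuel economy
# in the built-in mtcars data and stores the model summary.

# ---- Prepare data -----------------------------------------------------------
kg_per_1000_lb <- 453.592
cars <- mtcars
# Nitty gritty: mtcars records wt in units of 1000 lb, which is easy to miss
cars$wt_kg <- cars$wt * kg_per_1000_lb

# ---- Fit model --------------------------------------------------------------
model_mpg <- lm(mpg ~ wt_kg, data = cars)
summary_mpg <- summary(model_mpg)
```

The header comment is the big picture, the section dividers are the walkthrough, and the only line-level comment is attached to the one step that would otherwise look odd.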

For resources on documenting code, see the tidyverse style guide.

For resources on documenting data files, check out the codebook package. It can automatically produce codebooks for a data file, saving you a lot of time. Its vignettes (e.g., for SPSS, formr, and Qualtrics) are a great resource for thinking about the type of information that a codebook should contain.

Good Coding Practices

Naming

Your variables, functions, and other objects should have clear, concise, and unambiguous names.

  1. Pick a style for naming variables and stick with it consistently
    • Some people use camelCase, snake_case, or period.case
    • Don't use period.case! (it can mess with some R functionality)
  2. Follow consistent rules for different types of objects
    • Functions should be verbs (based on the one main thing that it does)
    • Objects (data, models, results, figures, etc.) should be nouns
    • Functions that return functions should be adverbs
  3. Always use descriptive names
    • Not foo, dat2, model_3, etc.
  4. Don't over-create
    • The more objects there are in your global environment, the more confusing it will be to try to keep track of them
    • Especially if you don't follow consistent rules for when you make objects and how you name them
    • Use the pipe %>% to avoid creating unnecessary intermediate objects
  5. Don't under-create
    • If you will re-use an object more than a few times, make it once and save it
  6. Avoid "magic numbers"
    • Numbers in your code that are present without any explanation as to what they are or where they came from

Bad:

x <- rnorm(100)
y <- x + rnorm(100)

Good:

n <- 100
x <- rnorm(n)
y <- x + rnorm(n)

  7. Disambiguate from left-to-right, not right-to-left
    • This makes it easier to figure out what an object is
    • It also makes it easier to complete typing a name by typing Tab on your keyboard
    • Bad: canada_gdp and china_gdp.
    • Good: gdp_canada and gdp_china.
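To illustrate the "Don't over-create" advice above, here is a minimal sketch that compares intermediate objects with a piped version. It assumes R 4.1 or later and uses the native pipe `|>` so that it runs without extra packages; the `%>%` pipe from magrittr chains steps in the same way. The object names are illustrative.

```r
# Over-creating: every step leaves another object in the global environment
cars_small <- subset(mtcars, cyl < 8)
cars_small_2 <- cars_small[, c("cyl", "mpg")]
cars_small_3 <- head(cars_small_2, 3)

# Piped: one descriptively named result and no leftover intermediates
mpg_small_engines <- mtcars |>
  subset(cyl < 8, select = c(cyl, mpg)) |>
  head(3)
```

Both versions produce the same three rows, but the piped version adds only one name to keep track of.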

Documenting code

Write your code for humans. Someone reading your code should be able to figure out what it does. This includes both writing explanatory comments and also writing the code itself in a way that is clear about what it does.

Code that is clear and speaks for itself is called "self-documenting code". Some ideas that make code more self-documenting include using clear variable names and doing things one step at a time, rather than combining multiple operations into one line. In my experience, using tidyverse functions (e.g., dplyr) can also help new R users figure out what your code is doing.

Base R:

mtcars[mtcars$cyl < 8, c("cyl", "mpg")]

tidyverse:

mtcars %>% 
   filter(cyl < 8) %>% 
   select(cyl, mpg)

Think carefully about how detailed your comments need to be! Overly detailed comments could be more confusing than no comments at all. They also can be hard to keep accurate as you revise your code or if you move things around. Focus on the high-level decisions about the programming/analysis.

Don't use comments to describe what your code is doing on a low-level.
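A small sketch of the difference, using the built-in mtcars data. The stated reason for the exclusion is hypothetical and stands in for whatever analytic decision you actually made.

```r
# Low-level comment that only restates the code (adds nothing):
# keep rows of mtcars where cyl is less than 8
cars_restated <- mtcars[mtcars$cyl < 8, ]

# High-level comment that records the decision (worth keeping):
# Exclude 8-cylinder cars; this hypothetical analysis is about small engines
cars_small_engines <- mtcars[mtcars$cyl < 8, ]
```

The code is identical in both cases. Only the second comment tells a reader something that the code itself cannot.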

The DRY principle

DRY = Don't repeat yourself

Avoid repeating or copy-pasting the same lines of code over and over, then making minor changes. This is prone to typos, errors, and breakage down the line.

If you are going to do something more than once, then use functions or write functions to do the repetition for you.

Example: Running an analysis by subgroup
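A minimal sketch of this idea, assuming the built-in mtcars data and a hypothetical helper named `fit_mpg_model()`. It uses base R's `split()` and `lapply()`; `purrr::map()` could be used in place of `lapply()`.

```r
# Hypothetical helper: fit the same regression to whatever data it is given
fit_mpg_model <- function(data) {
  lm(mpg ~ wt, data = data)
}

# Split the data by number of cylinders, then fit the model to each piece
models_by_cyl <- lapply(split(mtcars, mtcars$cyl), fit_mpg_model)

# Pull the slope for wt out of every fitted model
slopes_by_cyl <- sapply(models_by_cyl, function(model) coef(model)[["wt"]])
```

The model formula lives in exactly one place, so changing it updates every subgroup analysis at once.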

Example: Running an analysis for each predictor
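A minimal sketch, again assuming mtcars, with an illustrative choice of predictors and a hypothetical helper named `fit_one_predictor()`:

```r
# Predictors to analyze one at a time (illustrative choice)
predictors <- c("wt", "hp", "disp")

# Hypothetical helper: regress mpg on a single named predictor
fit_one_predictor <- function(predictor, data = mtcars) {
  lm(reformulate(predictor, response = "mpg"), data = data)
}

# Fit one model per predictor and label each model with its predictor
models_by_predictor <- lapply(predictors, fit_one_predictor)
names(models_by_predictor) <- predictors

# Collect R-squared from every model into one named vector
r_squared <- sapply(models_by_predictor, function(model) summary(model)$r.squared)
```

Adding another predictor means adding one string to `predictors` rather than copying and editing a block of code.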

These are obviously very simple toy functions, but imagine a case where you have much more complex models or series of analyses that you will need to repeat over and over.

Write Tests

Make sure that your code produces the correct results. This is best done by writing a "unit test": give a function or bit of code some input with a known expected output and check that the actual output matches. Write automated tests like this so you can be sure your code produces the right result. Check out the testthat package.
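As a sketch of what a unit test can look like, the example below checks a hypothetical helper named `standardize()` against a hand-computed answer. The base R check runs anywhere; the testthat version is guarded so that it is skipped when testthat is not installed.

```r
# Hypothetical helper: rescale a numeric vector to mean 0 and SD 1
standardize <- function(x) {
  (x - mean(x)) / sd(x)
}

# Known input with a known expected output: c(1, 2, 3) has mean 2 and SD 1
z_scores <- standardize(c(1, 2, 3))
stopifnot(isTRUE(all.equal(z_scores, c(-1, 0, 1))))

# The same check written as a testthat unit test
if (requireNamespace("testthat", quietly = TRUE)) {
  testthat::test_that("standardize() returns z-scores", {
    testthat::expect_equal(standardize(c(1, 2, 3)), c(-1, 0, 1))
  })
}
```

If a later edit to `standardize()` breaks it, the check fails immediately instead of silently changing your results.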

Code Styling

Format your code so that it is easy to read. For example:

  • Include spaces between object names
  • Line up parallel lines of code (see the arguments in map() above)
  • Use indentation to guide the reader through how to read your code

Following a style guide, such as the tidyverse style guide, is a good practice for making your code readable.
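A small before-and-after sketch of these styling points, using a made-up model on the built-in mtcars data:

```r
# Hard to read: no spaces, no line breaks, no indentation
model_cramped<-lm(mpg~wt+hp,data=mtcars[mtcars$cyl<8,])

# Easier to read: spaces around operators, one argument per line,
# and indentation showing which arguments belong to lm()
model_styled <- lm(
  mpg ~ wt + hp,
  data = mtcars[mtcars$cyl < 8, ]
)
```

R treats the two calls identically, so the only difference is how quickly a human can read them.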

Activity

Be sure that your final project follows these coding practice guidelines!

Let's check our understanding

Review the following code:

    # Make some data
    dat_big_five <- psych::bfi %>% 
      select(age, O = O1, C = C1, E = E1, A = A1, N = N1) %>% 
      slice(sample(1:nrow(.), 150)) %>% 
      na.omit()

    o_average <- mean(dat_big_five$O)
    c_average <- mean(dat_big_five$C)
    e_average <- mean(dat_big_five$E)
    a_average <- mean(dat_big_five$A)
    n_average <- mean(dat_big_five$N)
quiz(
  question("Does the above code chunk follow the DRY principle?",
           answer("Yes"),
           answer("No", correct = TRUE),
           random_answer_order = TRUE,
           allow_retry = TRUE
  ),
  question("Should you constantly be restarting your R session when creating a script?",
           answer("Yes, when I run my script restarting will ensure that everything is reproducible.", correct = TRUE),
           answer("Yes, then I can start working on other things"),
           answer("No, restarting will delete all my work"),
           answer("No, restarting is bad."),
           random_answer_order = TRUE,
           allow_retry = TRUE
  ),
  question("What is a 'magic number'?",
           answer("Numbers in your code that are present without any explanation as to what they are or where they came from", correct = TRUE),
           answer("Numbers that will create code"),
           answer("Numbers that have an explanation and which the origin is known"),
           random_answer_order = TRUE,
           allow_retry = TRUE
  ),
  question("Is it a good idea to separate the words in my variable names with periods? (Example: `cool.data.set`)",
           answer("No, it will mess with some of the R functionality", correct = TRUE),
           answer("Yes, it looks good."),
           random_answer_order = TRUE,
           allow_retry = TRUE
  ),
  question("When creating variable names, which direction should our names disambiguate?",
           answer("Left-to-Right", correct = TRUE),
           answer("Right-to-Left"),
           random_answer_order = TRUE,
           allow_retry = TRUE
  )
)
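For reference, one possible way to remove the repetition from the five `mean()` calls reviewed above is sketched below. To stay self-contained, it uses simulated stand-in ratings rather than `psych::bfi`; the simulated values, the seed, and the object names are all assumptions for illustration.

```r
# Simulated stand-in for the Big Five data (not the real psych::bfi values)
set.seed(1)
n_people <- 150
trait_names <- c("O", "C", "E", "A", "N")
dat_big_five <- as.data.frame(
  matrix(
    sample(1:6, n_people * length(trait_names), replace = TRUE),
    ncol = length(trait_names),
    dimnames = list(NULL, trait_names)
  )
)

# One call computes every average and returns a single named vector
trait_averages <- sapply(dat_big_five[trait_names], mean)
```

This replaces five nearly identical lines and five separate objects with one line and one named vector, and it names the sample size instead of leaving it as a magic number.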


bwiernik/progdata documentation built on Feb. 1, 2021, 2:33 a.m.