In the previous chapters, we learned to share our materials using the Open Science Framework (OSF) and to organize all our files in a well structured and documented repository. Moreover we learned recommended practices to organize and share our data.
However, if we want our analysis to be reproducible, we need to write code, actually, we need to write good code. In this chapter, we provide some main guidelines about coding. In particular, in Section \@ref(coding-style), we describe general coding good practices and introduce the functional style approach. Note that, although examples are in R, these general recommendations are valid for all programming languages. In Section \@ref(R-coding), we further discuss specific elements related to R and we introduce more advanced R packages that can be useful when developing code in R.
Often, researchers' first experience with programming occurs during some statistical courses where they are used as a tool to run statistical analyses. In these scenarios, all our attention is usually directed to the statistical burden and limited details are provided about coding per sé. Therefore, we rarely receive any training about programming or coding good practices and we end up learning and writing code in a quite anarchic way.
Usually, the most common approach is to create a very long single script where, by trials-and-errors and copy-and-paste, we collect all our lines of code in a chaotic way hoping to obtain some reasonable result. This is normal during the first stages of an analysis, when we are just exploring our data, trying different statistical approaches, and coming up with new ideas. As the analyses get more complex, however, we will easily lose control of what we are doing introducing several errors. At this point, reviewing and debugging the code will be much more difficult. Moreover, it would be really hard, if not impossible, to replicate the results. We need to follow a more structured approach to avoid these issues.
In Chapter \@ref(workflow-analysis-chapter), we discuss how to organize the scripts and manage the analysis workflow to enhance results reproducibility and code maintainability. In this Chapter, instead, we focus on how to write good code.
But, what does it mean to write “good code”? We can think of at least three important characteristics that define a good code:
In Section \@ref(good-practices), we describe the general good practices to write readable code. In Section \@ref(functional-style), we introduce the functional style approach to allow us to develop and maintain the code required for the analysis more efficiently. Finally, in Section \@ref(advanced), we briefly discuss some more advanced topics that are important in programming.
“Any fool can write code that a computer can understand. Good programmers write code that humans can understand.” -- Martin Fowler, “Refactoring: Improving the Design of Existing Code”
“Good coding style is like using correct punctuation. You can manage without it, but it sure makes things easier to read.” -- Hadley Wickham
If you do not agree with the above quotations, try to read the two following chunks of code.
:::{.code-tex-bad data-latex=""}
y<-c(1,0,1,1,0,1,1);p<-sum(y)/length(y);if(p>=.6){"Passed"}else{"Failed"}
:::
:::{.code-tex-good data-latex=""}
# Load subj test answers exam_answers <- c(1,0,1,1,0,1,1) # 0 = wrong; 1 = correct # Get exam score as proportion of correct answers exam_score <- sum(exam_answers) / length(exam_answers) # Set exam pass threshold [0,1] threshold <- .6 # Check if subj passed the exam if (exam_score >= threshold) { "Passed" } else { "Failed" }
:::
Which one have you found easier to read and understand? Unless you are a Terminator sent from the future to assassinate John Connor, we are sure you had barely a clue of what was going on in the first chuck. On the contrary, you could easily read and understand the second chunk, as if you were reading a plain English text. This is a simple example showing how machines do not need pretty well-formatted and documented code, but the programmers do.
Some programming languages have specific syntax rules that we need to strictly abide by to avoid errors (e.g., the indentation in Python is required to mark code blocks). Other programming languages are more flexible and do not follow strict rules (e.g., the indentation in R is for readability only). However, there are some general good practices common to all programming languages that facilitate code readability and understanding. In the next sections, we discuss some of the main guidelines.
“There are only two hard things in Computer Science: cache invalidation and naming things.” -- Phil Karlton
Choosing appropriate object and variable names is important to facilitate code readability. Names should be:
:::{.code-tex-bad data-latex=""}
r
y # generic name without useful information
:::
:::{.code-tex-good data-latex=""}
r
exam_answers # clear descriptive name
:::
:::{.code-tex-bad data-latex=""}
r
average_outcome_score_of_the_control_gorup # too long
avg_scr_ctr # difficult to guess the abbreviation meaning
:::
:::{.code-tex-good data-latex=""}
r
avg_score_control # clear descriptive name
:::
Moreover, we should not be scared of using longer names if these are required to properly describe an object or a variable. Most IDEs (i.e., integrated development environments such as RStudio or Visual Studio Code) have auto-complete features to help us easily type longer names. At the same time, this is not a good excuse to create 12-word-long names. Remember, names should not be longer than what is strictly necessary.
Objects' and variables' names can not include spaces. Therefore, to combine multiple words, we need to adopt one of the following naming styles:
:::{.code-tex-good data-latex=""}
r
myObjectName # camelCase
:::
:::{.code-tex-good data-latex=""}
r
MyObjectName # PascalCase
:::
"_"
. For example,:::{.code-tex-good data-latex=""}
r
my_object_name # snake_case
:::
"."
instead of "_"
. However, this is deprecated as in many programming languages the character "."
is a reserved special character (see Section \@ref(classes-methods)). For example,:::{.code-tex-warn data-latex=""}
r
my.object.name # snake.case
:::
Usually, every programming language has its specific preferred style, but it does not really matter which style we choose. The important thing is to choose one style and stick with it naming consistently all objects and variables.
Finally, we should avoid any name that could lead to possible errors or conflicts. For example, we should
data
, list
, or sum
).lm_fit
, lm_Fit
).:::{.design title="Temporary Variables" data-latex="[Temporary Variables]"}
Some extra tips concern the names of temporary variables that are commonly used in for
loops or other data manipulation processes.
i
. To refer to some position indexes, the variable i
is commonly used. Alternatively, we could use other custom names to indicate explicitly the specific selection operation. For example, we could use row_i
or col_i
to indicate that we refer to row index or column index respectively.:::{.code-tex-good data-latex=""} ```r # Cities names cities <- c("Amsterdam", "Berlin", "Cardiff", "Dublin")
# Loop by position index for (i in seq_len(length(cities))) { cat(cities[i], " ") } ```
:::
k
. To refer to an element inside an object we can use the variable k
. Alternatively, a common approach is to use the plural and the singular names to refer to the collection of all elements and the single element respectively (or an indicative name that represents a part of a unit, for example, slice
for pizza
).:::{.code-tex-good data-latex=""}
r
# Loop by element
for (city in cities) {
cat(city, " ")
}
:::
These are not mandatory rules but just general recommendations. Using consistently distinct variables to refer to position indexes or single elements of a collection (e.g., i
and k
, respectively) facilitates the review of the code and allows us to identify possible errors more easily.
:::
Again, some programming languages have specific syntax rules about spacing and indentation (e.g., Python), whereas other programming languages are more flexible (e.g., R). However, it is always recommended to use appropriate and consistent spacing and indentation to facilitate readability. As general guidelines:
:::{.code-tex-bad data-latex=""} ```r x<-sum(c(1:10,99),rnorm(5,mean=3,1))
if(test>=5&test<=10)print("...") ```
:::
:::{.code-tex-good data-latex=""} ```r x <- sum(c(1:10, 99), rnorm(n = 5, mean = 3, sd = 1))
if (test >= 5 & test <= 10) print("...") ```
:::
:::{.code-tex-bad data-latex=""}
r
my_very_long_list<-list(first_argument="something-very-long",
second_argument=c("many","objects"),third_argument=c(1,2,3,4,5))
:::
:::{.code-tex-good data-latex=""}
r
my_very_long_list <- list(
first_argument = "something-very-long",
second_argument = c("many", "objects"),
third_argument = c(1, 2, 3, 4, 5)
)
:::
:::{.code-tex-bad data-latex=""}
r
for (...) { # Outer loop
...
for (...) { # Inner loop
...
if (...) { # Conditional
...
}}}
:::
:::{.code-tex-good data-latex=""} ```r for (...) { # Outer loop ...
for (...) { # Inner loop ... if (...) { # Conditional ... } } } ```
:::
To indent the code, we can use spaces or Tabs. For a nice debate about this choice see https://thenewstack.io/spaces-vs-tabs-a-20-year-debate-and-now-this-what-the-hell-is-wrong-with-go (do not forget to watch the linked video as well). However, if we mix together Tabs and spaces this will lead to errors in programming languages that require precise indentation. This issue is very difficult to debug as Tabs and spaces look invisible. To avoid this problem, most editors allow the user to automatically substitute Tabs with a fixed number of spaces.
Comments are ignored by the program, but they are extremely valuable for colleagues reading our code. Thus, we should always include appropriate comments in our code. The future us will be very grateful.
Comments are used to provide useful information about the code in plain language. For example, we could describe the aim and the logic behind the next block of code, explain the reasons for specific choices, clarify the meaning of some particular uncommon code syntax and functions used, or provide links to external documentation.
Note that comments should not simply replicate the code in plain language, but they should rather explain the meaning of the code by providing additional information.
:::{.code-tex-bad data-latex=""}
# Set x to 10 x <- 10
:::
:::{.code-tex-good data-latex=""}
# Define maximum answer number x <- 10
:::
Remember, good comments should explain the why and not the what. If we can not understand what the code is doing by simply reading it, we should probably consider re-writing it.
Finally, comments can also be used to divide and organize the code scripts into sections. We further discuss how to organize scripts used to run the analysis in Chapter \@ref(workflow-analysis-chapter).
Here we list other general recommendations to facilitate code readability and maintainability.
:::{.code-tex-bad data-latex=""}
r
x <- seq(0, 10, 2)
:::
:::{.code-tex-good data-latex=""}
r
x <- seq(from = 0, to = 10, by = 2)
:::
:::{.code-tex-bad data-latex=""} ```r check_value <- function(x){
if (x > 0) { if (x > 100) { return("x is a positive large value") } else { return("x is a positive value") } } else { if (x < - 100) { return("x is a negative small value") } else { return("x is a negative value") } } } ```
:::
:::{.code-tex-good data-latex=""} ```r check_value <- function(x){
if(x < - 100) return("x is a negative small value") if(x < 0) return("x is a negative value") if(x < 100) return("x is a positive value") return("x is a positive large value") } ```
:::
We should always aim to write elegant and readable code. This is different from trying to write code as short as possible. This is not a competition where we need to show off our coding skills. If we are not required to deal with specific constraints (e.g., time or memory efficiency), it is better to write a few more lines of simple code rather than squeezing everything into a single obscure line of code.
In particular, we should not rely on weird language-specific behaviours or unclear tricks, but rather we should try to make everything as explicit as possible. Simple and clear code is always easier to read and maintain.
Remember that writing good code requires time and experience. We can only get better by$\ldots$ writing code.
When writing code, it is very likely that in many occurrences we need to apply the same set of commands multiple times. For example, suppose we need to standardize our variables. We would write the required commands to standardize the first variables. Next, each time we need to standardize a new variable, we will need to rewrite the same code all over again or we copy and pasted the previous code making the required changes. We would end up with something similar to the following lines of code.
:::{.code-tex-bad data-latex=""}
# Standardize variables x1_std <- (x1 - mean(x1)) / sd(x1) x2_std <- (x2 - mean(x2)) / sd(x2) x3_std <- (x3 - mean(x3)) / sd(x3)
:::
Rewriting the same code over and over again or, even worse, copying and pasting the same chunk of code are very inefficient and error-prone practices. In particular, suppose we need to modify the code to solve a problem or to fix a typo. Any change would require us to revise the entire script and to modify each instance of the code. Again, this is a very inefficient and error-prone practice.
To overcome this issue, we can follow a completely different approach by creating our custom functions. Considering the previous example, we can define, possibly in a separate script, the function std_var()
that allows us to standardize a variable. Next, after we have loaded our newly created function, we can call it every time we need it. Following this approach, we would obtain something similar to the code below.
:::{.code-tex-good data-latex=""}
#---- my-functions.R ----# # Define custom function std_var <- function(x){ res <- (x - mean(x)) / sd(x) return(res) } #---- my-analysis-script.R ----# # Apply custom function x1_std <- std_var(x1) x2_std <- std_var(x2) x3_std <- std_var(x3)
:::
Now, if we need to make some change to our custom function, we can simply modify its definition and any change will be automatically applied to each instance of the function in our code. This allows us to easily develop the code efficiently and limit the possibility of introducing errors (really common when copying and pasting).
Following this approach, we obey the DRY (Don't Repeat Yourself) principle that aims at reducing repetitions in the code. Each time we find ourselves repeating the same code logic, we should not rewrite (or copy and paste) the same lines of code, but instead, we should create a new function and use it. By defining custom functions in a single place and then using them, we enhance code:
Now it should be clear that writing functions each time we find ourselves repeating some code logic has many advantages. However, we do not have to necessarily wait for code repetitions before writing a function. Even if a specific code logic is present only once, we can always define a wrap function to execute it, improving code readability.
For example, in most analyses, we need to execute some data cleaning or preprocessing. This step usually requires several lines of code and operations that make our analysis script messy and difficult to read.
:::{.code-tex-bad data-latex=""}
# Data cleaning my_data <- read_csv("path-to/my-data.csv") %>% select(...) %>% mutate(...) %>% gorup_by(...) %>% summarize(...)
:::
To avoid this problem, we could define a wrap function in a separate script with all the operations required to clean the data and give it a meaningful name (e.g,. clean_my_data
). Next, after we have loaded our custom function, we can use it in the analysis script to clean the data, improving readability.
:::{.code-tex-good data-latex=""}
#---- my-functions.R ----# # Define data cleaning function clean_my_data <- function(file){ read_csv(file) %>% select(...) %>% mutate(...) %>% gorup_by(...) %>% summarize(...) } #---- my-analysis-script.R ----# # Data cleaning my_data <- clean_my_data("path-to/my-data.csv")
:::
We followed a Functional Style: break down large problems into smaller pieces and define functions or combinations of functions to solve each piece.
Functional style and DRY principle allow us to develop readable and maintainable code very efficiently. The idea is simple. Instead of having a unique long script with all the analysis code, we define our custom functions to run each step of the analysis in separate scripts. Next, we use these functions in another script to run the analysis. In this way, we keep all the code organized and easy to read and maintain. In the short term, this approach requires more time and may seem overwhelming. In the long term, however, we will be rewarded with all the advantages.
In Chapter \@ref(workflow-analysis-chapter), we describe possible methods to manage the analysis workflow. In the following sections, we provide general recommendations about writing functions, documentation, and testing.
Here we list some of the main recommendations and aspects to take into account when writing functions:
:::{.code-tex-bad data-latex=""}
r
x <- sqrt(2)
x^2 == 2 # WTF (Why is This False?)
:::
:::{.code-tex-good data-latex=""}
r
all.equal(x^2, 2)
:::
:::{.code-tex-warn data-latex=""}
r
# Not intuitive behaviour
round(1.5)
round(2.5)
:::
snake_case()
is usually preferred). In addition, function names should be verbs summarizing the function goal.:::{.code-tex-bad data-latex=""}
r
f()
my_data()
:::
:::{.code-tex-good data-latex=""}
r
get_my_data()
:::
:::{.code-tex-bad data-latex=""} ```r solve_condition <- function(x){
# Initial code ... if (is_condition_A){ ... } else { ... } # Middle code ... if (is_condition_B){ ... } else { ... } # Final code ... return(res) } ```
:::
:::{.code-tex-good data-latex=""} ```r solve_condition_A <- function(x){
# All code related to condition A ... return(res) } solve_condition_B <- function(x){ # All code related to condition B ... return(res) } ```
:::
:::{.code-tex-bad data-latex=""} ```r format_perc <- function(x){
perc_values <- round(x * 100, digits = 2) res <- paste0(perc_values, "%") return(res) } ```
:::
:::{.code-tex-good data-latex=""} ```r format_perc <- function(x, digits = 2){
perc_values <- round(x * 100, digits = digits) res <- paste0(perc_values, "%") return(res) } ```
:::
:::{.code-tex-bad data-latex=""} ```r safe_division <- function(x, y){
res <- x / y return(res) } safe_division(x = 1, y = 0) ```
:::
:::{.code-tex-good data-latex=""} ```r safe_division <- function(x, y){
if(y == 0) stop("you can not divide by zero") res <- x / y return(res) } safe_division(x = 1, y = 0) ```
:::
Of course, writing checks is time-consuming and therefore we need to decide when it is worth spending some extra effort to ensure that the code is stable. - Be Explicit. How functions return the resulting value depends on the specific programming language. However, a good tip is to always make the return statement explicit to avoid possible misunderstandings.
:::{.code-tex-bad data-latex=""}
r
get_mean <- function(x){
sum(x) / length(x)
}
:::
:::{.code-tex-good data-latex=""} ```r get_mean <- function(x){
res <- sum(x) / length(x) return(res) } ```
:::
data-preprocessing.R
, models.R
, plots.R
, utils.R
). Alternatively, each function can be saved in a separate script named by the function name (this is usually done for complex long functions). Finally, we can collect all our scripts in a single folder within our project.[TODO: find better examples?]
Remember that writing good functions requires time and experience. We can only get better by$\ldots$ writing functions.
Writing the code is only a small part of the work in creating a new function. Every time we define a new function, we should also provide appropriate documentation and create unit tests (see Section \@ref(unit-tests)).
Function documentation is used to describe what the function is supposed to do, provides details about the function arguments and outputs, presents function special features, and provides some examples. We can document a function by writing multiple lines of comments right before the function definition or at the beginning of the function body.
Ideally, the documentation of each function should include:
Thus, for example, we could create the following documentation.
#---- format_perc ---- # Format Values as Percentages # # Given a numeric vector, return a string vector with the values formatted # as percentage (e.g., "12.5%"). The argument `digits` allows specifying # the rounding number of decimal places. # # Arguments: # - x : Numeric vector of values to format. # - digits: Integer indicating the rounding number of decimal places # (default 2) # # Output: # A string vector with values formatted as percentages (e.g., "12.5%"). # # Examples: # format_perc(c(.749, .251)) # format_perc(c(.749, .251), digits = 0) format_perc <- function(x, digits = 2){ perc_values <- round(x * 100, digits = digits) res <- paste0(perc_values, "%") return(res) }
Let's discuss some general aspects of writing documentation:
Moreover, in the case of open-source projects, documentation should not be limited to the exported functions (i.e., functions directly accessed by the users), but it should include internal functions as well (i.e., utility functions used inside the app or package not directly accessed by the users). In fact, documenting all functions is required to facilitate the project maintenance and development by multiple contributors. - Documenting analyses. In the case of projects where the code is only used to run the analyses, documentation may seem less relevant. This is not true. Even if we are the only ones working on the code, documentation is always recommended as it facilitates code maintainability. Although we do not need the same level of detail, spending a few extra hours documenting our code is always worth it. The future us will be very grateful for this.
We may think that after writing the functions and documenting them we are done. Well$\ldots$ no. We still miss unit tests. Unit Tests are automated tests used to check whether our code works as expected.
For example, consider the following custom function to compute the mean.
#---- get_mean ---- get_mean <- function(x){ res <- sum(x) / length(x) return(res) }
We can write some tests to evaluate whether the function works correctly.
#---- Unit Tests ---- # Test 1 stopifnot( get_mean(1:10) == 5.5 ) # Test 2 stopifnot( get_mean(c(2,4,6,8)) == mean(c(2,4,6,8)) )
Let's discuss some general aspects of unit tests:
tests/
), ready to be run. Now, we have understood the importance of documenting and testing our functions to enhance code maintainability. In an ideal world, each line of code would be documented and tested. But, of course, this happens only in the ideal world and the reality is very far from this. Most of the time, documentation is limited and tests are only a dream. When choosing what to do, we should evaluate the trade-off between short-term effort and long-term advantages. In small projects, all this may be recommended but not necessary. In long term projects when maintainability is a real issue, however, we should put some real effort into documenting and testing. Again, the future us will be very grateful.
In this section, we introduce some more advanced programming aspects that we may have to deal with when defining functions. These topics are complex and highly dependent on the specific programming language. Therefore we do not aim to provide a detailed description of each argument. Instead, we want to offer a general introduction to these topics providing simple definitions that can help us begin to familiarize ourselves with these advanced concepts.
In some projects or analyses, we may need to run some computational heavy tasks (e.g., simulations). In these cases, performance becomes a fundamental aspect and our code needs not only to be readable and maintainable but also efficient. Here we discuss some general aspects to take into account when we need efficient code in terms of speed.
Let's consider a case where we need to execute a function (add_one()
) over each element of our vector. A common but very bad practice is to grow objects inside the loop. For example, in the function below, we are saving the newly obtained value by combining it with the previously obtained results. This is an extremely inefficient operation as it requires copying the whole vector of results at each iteration. As the length of the vector increases, the program will be slower and slower.
```r add_one <- function(x){ x + 1 } ```
:::{.code-tex-bad data-latex=""} ```r bad_loop <- function(x){
res <- NULL for (i in seq_along(x)){ value <- add_one(x[i]) res <- c(res, value) # copy entire vector at each iteration } return(res) } ```
:::
Some programming languages provide specific functions to allow "adding" an element to an object without copying all its content. In these cases, we should take care in choosing the right functions. However, a commonly recommended approach is to pre-allocate objects used inside the loop. This simply means we need to create objects of the required size before we start the loop.
For example, in our case, first, we initialize the vector res
of length equal to the number of iterations outside of the loop. Next, we store the obtained values at each iteration inside the vector.
:::{.code-tex-good data-latex=""} ```r good_loop <- function(x){
# Initialize vector of the required length res <- vector(mode = "numeric", length = length(x)) for (i in seq_along(x)){ value <- add_one(x[i]) res[i] <- value # assign single value } return(res) } ```
:::
Differences in performance will be greater as the number of iterations increases. Let's compare the two loops over 1000 iterations. The difference is incredible. ```r x <- 1:1e4 # vector with 1000 elements
# Bad loop microbenchmark::microbenchmark(bad_loop(x)) # Good loop microbenchmark::microbenchmark(good_loop(x)) ```
Another important tip to improve loop performance is to limit computations at each iteration to what is strictly necessary. All elements that are constant between iterations should be defined outside the loop.
So, do not be afraid of using loops. They are not slow if we write them correctly.
Let's consider the simple case of adding two vectors of the same length. Without vectorized operators, we would need to write a for loop as in the below function.
:::{.code-tex-warn data-latex=""} ```r add_vectors <- function(x1, x2){
res <- vector(mode = "numeric", length = length(x1)) # Add element by element for (i in seq_along(x1)){ res[i] <- x1[i] + x2[i] } return(res) } ```
:::
Let's see how this for loop compares to the analogue vectorized operator. ```r # vectors with 1000 elements x1 <- 1:1e4 x2 <- 1:1e4
# Element by element operation microbenchmark::microbenchmark(add_vectors(x1, x2)) # Vectorized operation # - In R the `+` operator is vectorized microbenchmark::microbenchmark(x1 + x2) ```
The difference is incredible. Note that this is not because for loops are slow, but rather because vectorized operators are super fast. In fact, vectorized operations are based on really efficient code usually written in compiled languages and run in parallel (see next point). This is what makes vectorized operators so fast and efficient.
So, if we want to improve performance, we should always use vectorized operators when available.
In Compiled Languages (e.g., C or C++), the source code is translated using a compiler before we can execute it. The compilation process is slow and it is required each time we make changes to the source code. Once compiled, however, the code can be simply loaded and executed in a very fast and efficient way.
In Interpreted Languages (e.g., R or Python), the source code is translated at execution time by an interpreter. This allows us to modify the source code at any time and immediately run it. However, the resulting code is slower and less efficient.
So, interpreted languages are much more flexible and ideal when we write and execute code iteratively, but they are slower. On the contrary, compiled languages are very fast and efficient but they require to be compiled first. Therefore, when performance is important, we should use compiled code. However, this does not mean that we necessarily have to write code in compiled languages, we can simply check if there are available libraries that implement compiled code for our needs. In fact, many interpreted programming languages provide libraries based on compiled code to execute specific tasks very efficiently.
Parallel and Serial Processing. Depending on the specific task, we can improve performance by running it in parallel. In Serial Processing, a single task is executed at a time. On the contrary, in Parallel Processing, multiple tasks are executed at the same time.
r
knitr::include_graphics("images/coding/coding.png")
Parallel processing, allows us to take advantage of the multiple processors available on our machine to execute multiple tasks simultaneously. If our program involves the execution or repetitive independent computations, parallel processing can help us to step up in terms of performance. However, parallelization is an advanced topic that needs to be applied appropriately. In fact, there are many aspects to take into account. For example, not all tasks can be parallelized and the overall costs of parallelizing processes may be higher than the benefits.
So, parallelization is a wonderful world that can help us to reach incredible levels of performance but we need to use it with care.
To summarize, when performance is an issue, we should check that our code is written efficiently. In particular, we should always use vectorized operators if available and follow best practices when writing for loops. Next, if we really need to push the limits, we can consider compiled code and parallelization. These are very advanced topics that require specific knowledge. Fortunately, however, many dedicated libraries allow us to implement these solutions more easily. Get ready to break the benchmark!
Another important aspect that we need to master when writing functions, is how a specific programming language evaluates and accesses variables from within a function. Again, we do not aim to provide a detailed discussion of this argument. Instead, we just want to introduce these concepts allowing us to familiarize ourselves with these relevant aspects that need to be understood more in-depth by studying programming language dedicated resources.
Environments. An environment is a virtual space where all objects, variables, and functions we created during our session are available. More technically, an environment is a collection that associates a set of names with a set of values. Usually, the main environment of our session is called “Global Environment”. Note that inside an environment names must be unique. So, for example, we can represent the global environment in the following way.
r
knitr::include_graphics("images/coding/glob-env-2.png")
When we execute a function, commands are executed from inside a new environment. In this way, we avoid conflicts between objects with the same name in the Global Environment and the function environment. For example, in the following case, we have a variable X
pointing to a three-element vector in the global environment and another variable named X
in the function environment pointing to a string.
r
knitr::include_graphics("images/coding/fun-env.png")
So each time we run a function, commands are executed inside a newly created environment with its own set of objects. Note that the function environments are actually inside the global environment and therefore they are also referred to as child-environment and parent-environment respectively. If we call a function, within another function, we would obtain a function environment inside another function environment. We can think about it like a Russian doll.
Global Variables. Global variables are objects defined in the parent environment. Global variables can be accessed from within the child-environment. For example, consider the following case. ```r global_var <- "I am Global!"
my_fun <- function(){ return(global_var) }
my_fun()
``
We created
global_varin the global environment. Next, we defined the
my_fun()that simply prints the object
global_var. Note that, although there is no object
global_var` defined in the function, we do not get an error. Instead, the function looks in the parent environment for the variable and we obtain its value.
Note that we can not modify global variables from within a function as any attempt will simply create a local variable. ```r global_var <- "I am Global!"
my_fun <- function(){ global_var <- "I am local" return(global_var) } my_fun() global_var ```
There are specific commands used to modify global variables from within a function, but this is usually a deprecated practice because we may affect other functions that depend on those variables.
Global variables can be used to specify constants and settings that affect the whole analysis. A common practice is to capitalize global variables to distinguish them from the local variables. However, global variables should be used with care preferring to explicitly pass function arguments instead.
Aliasing. Occasionally, we create an object as a copy of another object, for example, x = y
. This apparent simple command can actually lead to very different behaviours depending on the specific programming language. In fact, some programming languages distinguish between copying (i.e., creating a new variable that points to an independent copy of the same object) and aliasing (i.e., creating a new variable that points to the same object ). In the first case, we would obtain two independent objects, whereas, in the second case, we obtain two variables pointing to the same object as presented in the figure below.
r
knitr::include_graphics("images/coding/glob-env.png")
If we are not aware of this difference, we could easily end up with serious problems. Let's consider the following example in R (aliasing is not allowed) and in Python (aliasing is allowed). ```r #---- R Code ----
# Create objects x = c(1, 2, 3) y = c(1, 2, 3) w = y w[3] <- 999 # change a value # Check values x y w ``` In R, changes to an element of `w` do not affect y. ```{python, echo = TRUE} #---- Python Code ---- # Create objects x = [1,2,3] y = [1,2,3] w = y w[2] = 999 # change a value # Check values x y w ``` In Python, changes to an element of `w` do also affect y. This example is not intended to scare anyone but simply to highlight the importance of having in-depth knowledge and understanding of the programming language we are using.
At some point in programming, we will need to deal with classes and methods. But what do these two strange words mean? Let's try to clarify these concepts.
Class. Each object we create belongs to a specific family of objects depending on their characteristics. We call this family a “class”. More precisely, a class is a template that defines which are the specific characteristics of the objects belonging to that class. This template is used to create objects of a given class and we say that an object is an instance of that class. Of course, we can create multiple instances (i.e., multiple objects) of the same class.
Methods. Each class has their own methods, that is, a set of actions objects of a specific class can execute or functions we can apply to manipulate the object itself.
So, why are classes and methods so important? Classes and methods allow us to organize our code efficiently and enhance reusability. For example, if we find ourselves relying on some specific data structure in our program, we can create a dedicated class. In this way, we can improve the control over the program by breaking down the code into small units and by specifying different methods depending on the object class.
Now, classes and methods are typical of the Object-Oriented Programming approach rather than the Functional Programming approach. Let's briefly introduce these two approaches.
Less extreme applications of functional programming allow object classes. In this case, methods are not characteristics of the object itself but are functions defined separately from the object. To clarify this difference, suppose we have an object todo_list
with a list of tasks we need to complete today and we have a method whats_next()
that returns which is the next task we need to complete. In an object-oriented programming approach, the method is directly invoked from the object, whereas, in a functional programming approach, we would apply the method as a function to the object.
# Object Oriented Programming todo_list.whats_next() ## Write the paper # Functional Programming whats_next(todo_list) ## Have a break
Note that in most object-oriented programming languages, methods are accessed using the dot character ("."
). For this reason, we should always separate words in object names using snake_case
and not snake.case
.
Finally, the two approaches are not mutually exclusive and actually, most programming languages support both approaches, usually leading to a mixed flavour of object classes and functions working together. However, different programming languages could favour one of the two approaches. For example, Python is commonly considered an object-oriented programming language and R a functional programming language, although both support both approaches.
In this section, we discuss further recommendations regarding writing code in R. Now, learning how to program in R is a huge topic that would require an entire new book on its own (or probably more than one book). Therefore, we prefer to provide references to useful resources and packages.
There are many books available on-line covering all the aspects of the R programming language. In particular, we highlight the following books (freely available on-line) ordered from beginner to advanced topics:
\begin{center} \includegraphics[width=0.23\textwidth]{images/coding/hopr.png} \includegraphics[width=0.23\textwidth]{images/coding/r4ds.png} \includegraphics[width=0.23\textwidth]{images/coding/r-pkgs.png} \includegraphics[width=0.23\textwidth]{images/coding/adv-r.png} \end{center}
In the next sections, we briefly discuss coding good practices in R and how we can develop projects according to a functional style approach.
:::{.design title="Tidyverse VS Base R" data-latex="[Tidyverse VS Base R]"}
Regarding the tidyverse vs Base R discussion, we want to share our simple opinion. We love tidyverse. This new ecosystem of packages allows us to write readable code in a very efficient way. However, tidyverse develops very quickly and many functions or arguments may become deprecated or even removed in the future. This is not an issue per se, but it can make it hard to maintain projects in the long term. So what should we use tidyverse or Base R? Our answer is$\ldots$depends on the specific project aims.
Analyses Projects. In the case of projects related to specific analyses, we recommend using tidyverse. Wrangling, visualising, and exploring data in the tidyverse is very easy and efficient (and fun!). To overcome the issues related to the frequent changes in the tidyverse, we recommend using the renv
R package to manage the specific package versions (see Chapter \@ref(renv-section)). In this way, we can have all the fun of tidyverse without worrying about maintainability in the long term.
Apps and Packages. In the case of projects that aim to create apps or packages, we recommend using Base R. In this case, we may have less control over the specific package versions installed by the users. Therefore, we prefer to build our code with as few dependencies as possible, relying only on stable packages that rarely change. In this way, we limit issues of maintainability in the long term. :::
The same general good practices described in Section \@ref(good-practices) apply also to R. In addition, there are many “unofficial” coding style recommendations specific to R. We should always stick to the language-specific style guidelines. In some cases, however, there are no strict rules and thus we can create our style according to our needs and personal preferences. When creating our personal style, remember that we want a consistent styling that enhances code readability.
For more details about R coding style, consider the two following resources:
Here we review only a few aspects:
snake_case
.<-
or =
. We agree with the tidyverse preference of using <-
for assignment, as it explicitly indicates the direction of the assignment (x = y
, are we assigning y
to x
or the contrary? With x <- y
there are no doubts). However, both are fine, just pick one and stick with it.r
knitr::include_graphics("images/coding/rstudio-line.png")
r
knitr::include_graphics("images/coding/rstudio-indent.png")
Logical Values. Always write TRUE
and FALSE
logical values instead of the respective abbreviations T
and F
. TRUE
and FALSE
are reserved words, whereas T
and F
are not. This means that we can overwrite their values, leading to possible issues in the code.
```r
# Check value
TRUE == T
TRUE <- "Hello"
T <- "World" T
TRUE == T
T <- FALSE FALSE == T ``` - RStudio Keyboard Shortcuts. RStudio provides many useful keyboard shortcuts to execute specific actions https://support.rstudio.com/hc/en-us/articles/200711853-Keyboard-Shortcuts-in-the-RStudio-IDE. Familiarizing with them can facilitate our life while coding. For example:
Ctrl+L
Ctrl+D
(macOS Cmd+D
)Alt+-
(macOS Option+-
)Ctrl+Shift+C
(macOS Cmd+Shift+C
)Ctrl+I
(macOS Cmd+I
)Ctrl+Shift+A
(macOS Cmd+Shift+A
)Ctrl+Shift+/
(macOS Cmd+Shift+/
)F1
F2
([TODO: check on my mac is command + click
function]).Ctrl+Shift+F
(macOS Cmd+Shift+F
)Adopting a functional style approach, we will create lots of functions and use them in the analyses. To organize our project files, we can save all the scripts containing the function definitions in a separate directory (e.g., R/
) and source all them at the beginning of the main script used to run the analyses. This approach is absolutely fine. However, if we want to optimize our workflow, we should consider organizing our project as it was an R package.
Using the structure of R Packages, we can take advantage of specific features that facilitate our lives during the development process. Note that creating (and publishing) an actual R package requires dealing with many advanced aspects of R and this whole process may be overwhelming for our projects. However, we do not need to create a real R package to take advantage of the development tools. Simply by organizing our project according to the R Package Project template, we can already use all the features that help us manage, document, and test our functions.
In the next sections, we introduce the basic aspects of the R Package Project template. Our aim is to simply provide an introduction highlighting all the advantages to encourage learning more. For a detailed discussion of all aspects, we highly recommend the R Packages book (https://r-pkgs.org).
To create a project using the structure of R Packages, we simply need to select “R Package” as Project Type when creating a new project. Alternatively, we can use the function devtools::create()
indicating the path.
knitr::include_graphics("images/coding/create-pkg.png")
The basic structure of an R package project is presented below.
- <pkg-name>/ |-- .Rbuildignore |-- DESCRIPTION |-- NAMESPACE |-- <pkg-name>.Rproj |-- man/ |-- R/ |-- tests/
In particular, we have:
.Rbuildignore
, a special file used to list the files we do not want to include in the package. If we are not interested in creating a proper package, we can ignore it.DESCRIPTION
, this file provides metadata about our package (see Section \@ref(description)). This is a very important file used to recognize our project as an R package, we should never delete it.NAMESPACE
, a special file used to declare the functions that are exported and imported by our package the files we do not want to include in the package. If we are not interested in creating a proper package, we can ignore it.<pkg-name>.Rproj
, the usual .Rproj
file of each RStudio project.man/
, a directory containing the functions documentation (see Section \@ref(documentation-roxygen)).R/
, a directory containing all the function scripts (must be capital "R"
).tests/
, a directory containing all the unit tests (see Section \@ref(testthat)).devtools
Workflow {#devtools-workflow}package_logo("images/coding/devtools.png", format = output_format, tex_width = .25)
So, what is special about the R Package Project template? Well, thanks to this structure we can take advantage of the workflow introduced by the devtools
R package. The devtools
R package [@R-devtools] provides many functions to automatically manage common tasks during the development. In particular, the main functions are:
devtools::load_all()
. Automatically load all the functions found in the R/
directory.devtools::document()
. Automatically generate the function documentation in the man/
directory.devtools::test()
. Automatically run all unit tests in the tests/
directory.These functions allow us to automatically execute all the most common actions during the development. In particular all these operations have dedicated keyboard shortcuts in RStudio:
Ctrl+Shift+L
(macOS Cmd+Shift+L
)Ctrl+Shift+D
(macOS Cmd+Shift+D
)Ctrl+Shift+T
(macOS Cmd+Shift+T
)Using these keyboard shortcuts, the whole development process becomes very easy and smooth. We define our new functions and immediately load them so we can keep on working on our project. Moreover, whenever it is required, we can create the function documentation and check that everything is fine by running unit tests.
The R Packages book (https://r-pkgs.org) describes all the details about this workflow. It could take some time and effort to familiarize ourselves with this process but the advantages are enormous.
The DESCRIPTION
is a special file with all the metadata about our package and it is used to recognize our project as an R package. Thus, we should never delete it.
A DESCRIPTION
looks like this,
#----- DESCRIPTION ----# Package: <pkg-name> Title: One line description of the package Version: the package version number Authors@R: # authors list c(person(given = "name", family = "surname", role = "aut", # cre = creator and maintainer; aut = other authors; email = "name@email.com"), ...) Description: A detailed description of the package Depends: R (>= 3.5) # Specify required R version License: GPL-3 # Our prefered license Encoding: UTF-8 Imports: # list of required packages Suggests: # list of suggested packages Config/testthat/edition: 3 RoxygenNote: 7.1.2 VignetteBuilder: knitr URL: # add useful links to online resources or documentation
The DESCRIPTION
file is particularly important if we are creating an R package. However, it can be used for any project to collect metadata, list project dependencies, or add other useful information.
Note that the DESCRIPTION
file follows specific syntax rules. To know more about the DESCRIPTION
file, see https://r-pkgs.org/description.html.
roxygen2
{#documentation-roxygen}package_logo("images/coding/roxygen2.png", format = output_format, tex_width = .25)
The roxygen2
R package [@R-roxygen2] allows us to create functions documentation simply by adding comments with all the required information right before the function source code definition. roxygen2
will process our source code and comments to produce the function documentation files in the man/
directory.
roxygen2
assumes a specific structure and uses specific tags to correctly produce the different parts of the documentation. Below is a simple example of documenting a function using roxygen2
. Note the use of #'
instead of #
to create the comments and the special tags (@<tag-name>
) used to specify the different documentation components.
#---- format_perc ---- #' Format Values as Percentages #' #' Given a numeric vector, return a string vector with the values #' formatted as percentage (e.g.,"12.5%"). The argument `digits` allows #' to specify the rounding number of decimal places. #' #' @param x Numeric vector of values to format. #' @param digits Integer indicating the rounding number of decimal places #' (default 2) #' #' @return A string vector with values formatted as percentages #' (e.g.,"12.5%"). #' #' @examples #' format_perc(c(.749, .251)) #' format_perc(c(.749, .251), digits = 0) #' format_perc <- function(x, digits = 2){ perc_values <- round(x * 100, digits = digits) res <- paste0(perc_values, "%") return(res) }
We can use the keyboard shortcut Ctrl+Alt+Shift+R
(macOS Cmd+Option+Shift+R
) to automatically insert the Roxygen skeleton for the documentation.
To create the function documentation using roxygen2
, we need to select the “Generate documentation with Roxygen” box from the “Project Options” > “Build Tools” (remember Project Options and not Global Options).
knitr::include_graphics("images/coding/settings-doc.png")
Next, we can run devtools::document()
(or Ctrl+Shift+D
/ macOS Cmd+Shift+D
) to automatically create the function documentations. Now, we can use the common help function ?<function-name>
(or help(<function-name>)
) to navigate the help page of newly created functions.
To learn all the details about documenting functions using roxygen2
, consider:
roxygen2
official documentation (https://roxygen2.r-lib.org).testthat
{#testthat}package_logo("images/coding/test-that.png", format = output_format, tex_width = .25)
The testthat
R package [@R-testthat] allows us to create and run unit tests for our functions. In particular, testthat
provides dedicated functions to easily describe what we expect a function to do, including catching errors, warnings, and messages.
To create the unit tests using testthat
, we need to use dedicated functions following specific folders and file structures. Below is a simple example of a unit test using testthat
.
#---- testing format_perc ---- test_that("check format_perc returns the correct values", { # numbers expect_match(format_perc(.12), "12%") expect_match(format_perc(.1234, digits = 1), "12.3%") # string expect_error(format_perc("hello")) })
Once the tests are ready, we can automatically run all the unit tests using the function devtools::test()
(or Ctrl+Shift+T
/ macOS Cmd+Shift+T
).
To learn all the details about unit tests using testthat
, consider:
testthat
official documentation (https://testthat.r-lib.org/).\newpage
:::{.doclinks data-latex=""}
r break_line(format = output_format)
https://rstudio-education.github.io/hoprr break_line(format = output_format)
https://r4ds.had.co.nzr break_line(format = output_format)
https://r-pkgs.orgr break_line(format = output_format)
https://adv-r.hadley.nzr break_line(format = output_format)
https://code.tutsplus.com/tutorials/top-15-best-practices-for-writing-super-readable-code--net-8118r break_line(format = output_format)
https://irudnyts.github.io/r-coding-style-guider break_line(format = output_format)
https://style.tidyverse.orgr break_line(format = output_format)
https://support.rstudio.com/hc/en-us/articles/200711853-Keyboard-Shortcuts-in-the-RStudio-IDEr break_line(format = output_format)
https://r-pkgs.org/description.htmlroxygen2
{-}roxygen2
official documentationr break_line(format = output_format)
https://roxygen2.r-lib.orgr break_line(format = output_format)
https://r-pkgs.org/man.htmltestthat
{-}testthat
official documentationr break_line(format = output_format)
https://testthat.r-lib.org/r break_line(format = output_format)
https://r-pkgs.org/tests.html:::
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.