In Kiwi-Random-House/R-Projects: Building Analytic Apps with R

Principles {#principles}

Introduction {-}

Encapsulation and Abstractions

Take a look at the following two main.R versions:

Main application with low-level details

# main.R
## Load house prices data
temp_env <- new.env()
load(file = usethis::proj_path("data", "train_set", ext = "rda"), envir = temp_env)
data <- temp_env$train_set
rm(temp_env)

## Plot important amenities
par(mfrow = c(1, 2))
plot(data$mpg, data$cyl, type = "p")
boxplot(mpg ~ cyl, data = data)

Main application with high-level abstractions

# main.R
data <- load_house_prices_data()
plot_important_amenities(data)

Both code snippets have the same intent: they load the house prices dataset and provide plots for data exploration. Notice how much cognitive load the first snippet demands, because the reader has to compile the code mentally. The situation worsens if the reader is not familiar with R syntax. In contrast, the second snippet hides the implementation details by wrapping them in functions. The high-level abstractions communicate that two things happen in main.R: loading the data and plotting it. As a result, the code is simpler to read and understand.

load_mtcars_data <- function(){
    mtcars <- datasets::mtcars
    return(mtcars)
}
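The bodies of load_house_prices_data and plot_important_amenities are not shown in the text. As a minimal sketch, assuming they simply wrap the low-level snippet, they might look like the code below. The `path` argument and its default are assumptions introduced here for illustration (the low-level snippet resolves the file with usethis::proj_path), and so is the step that restores the graphical parameters on exit.

```r
# Hypothetical helper: hides the environment juggling needed to pull a single
# object out of an .rda file. The `path` argument is an assumption made for
# this sketch; its default mirrors the location used in the low-level snippet.
load_house_prices_data <- function(path = file.path("data", "train_set.rda")) {
    temp_env <- new.env()
    load(file = path, envir = temp_env)
    temp_env$train_set
}

# Hypothetical helper: wraps the two exploratory plots of the low-level snippet.
# It assumes `data` has the columns `mpg` and `cyl`, as that snippet does.
plot_important_amenities <- function(data) {
    # Remember the current layout and restore it when the function exits,
    # so the caller's graphics settings are left untouched (an addition here).
    old_par <- par(mfrow = c(1, 2))
    on.exit(par(old_par), add = TRUE)

    plot(data$mpg, data$cyl, type = "p")
    boxplot(mpg ~ cyl, data = data)
    invisible(data)
}
```

With these two definitions in place, the high-level main.R reads as two steps, and the reader only opens the function bodies when the details matter.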

Furthermore, the second snippet is easier to maintain and develop. These qualities are desirable in any software application, because software systems evolve as programmers acquire new knowledge and understanding of the problem the software is meant to solve. This matters especially for analytic applications, which are the result of scattershot and serendipitous explorations. As data scientists discover new findings and signals, they incorporate them into the analytic application. For example, the original implementation of plot_important_attributes is:

plot_important_attributes <- function(data){
    par(mfrow = c(1, 2))
    plot(data$mpg, data$cyl, type = "p")
    boxplot(mpg ~ cyl, data = data)
}

Imagine a data scientist discovers, whether through client feedback or other means, that there is another important attribute to include in the data analysis. Moreover, to reduce confusion, the data scientist decides to add titles to the plots. Then, plot_important_attributes changes to:

plot_important_attributes <- function(data){
    par(mfrow = c(1, 3))
    plot(data$mpg, data$hp, type = "p", main = "MPG ~ Horsepower")
    plot(data$mpg, data$cyl, type = "p", main = "MPG ~ Cylinders")
    boxplot(mpg ~ cyl, data = data, main = "MPG ~ Cylinders")
}
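One consequence of this encapsulation is that the calling code does not need to change when the function body does. The sketch below puts the pieces together to illustrate this. It reuses load_mtcars_data and the revised plot_important_attributes from the text. Restoring the graphical parameters with on.exit is an addition made here, not part of the original function.

```r
# Loader as given in the text.
load_mtcars_data <- function() {
    mtcars <- datasets::mtcars
    return(mtcars)
}

# Revised plotting function from the text: three titled panels.
# Restoring par() on exit is an assumption added for this sketch.
plot_important_attributes <- function(data) {
    old_par <- par(mfrow = c(1, 3))
    on.exit(par(old_par), add = TRUE)

    plot(data$mpg, data$hp, type = "p", main = "MPG ~ Horsepower")
    plot(data$mpg, data$cyl, type = "p", main = "MPG ~ Cylinders")
    boxplot(mpg ~ cyl, data = data, main = "MPG ~ Cylinders")
    invisible(NULL)
}

# main.R: these two lines are the same before and after the revision.
data <- load_mtcars_data()
plot_important_attributes(data)
```

The extra panel and the titles are confined to the function body, so any script that calls plot_important_attributes picks up the change without being edited.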