README.md

strictlyr

The goal of strictlyr is to provide functions that are stricter about violations of some common assumptions during data manipulation.

The key issues which this package will initially focus on handling:

To address these issues, strictlyr will:

  1. include drop-in replacement functions which check for issues internally and raise errors, warnings or messages.
  2. include alternate versions of dplyr functions which enforce particular assumptions.

Installation + Usage

You can install strictlyr from GitHub with:

remotes::install_github("coolbutuseless/strictlyr")
# Always load `dplyr` first
library(dplyr    , warn.conflicts = FALSE)
library(strictlyr, warn.conflicts = FALSE)

Assumption: data in the environment should not have groups.

In 100% of the code that I write, I do not want data.frames with groups in the global environment.

Every group_by() I write is paired with an immediate ungroup() after I’ve done what needs doing. If I ever forget to ungroup() then this is a mistake that will lead to data issues later in the script.

strictlyr includes a drop-in replacement for the pipe operator which checks if the input or output data is grouped.

res1 <- mtcars %>%
  group_by(cyl) %>%
  mutate(mpg = max(mpg))
#> Error: The end result of this operation still has groups - did you mean to call `ungroup()` as well?

This error may be configured by setting either of the following options:

Possible values for STRICTLYR_PIPE are “stop”, “warning”, “message”, and “quiet”.

options(STRICTLYR_PIPE = 'quiet') # Suppress `strictlyr` output for the pipe

res1 <- mtcars %>%
  group_by(cyl) %>%
  mutate(mpg = max(mpg))
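As a rough illustration of how an option such as STRICTLYR_PIPE could drive the check, here is a minimal base R sketch. The helper name `check_ungrouped()` is hypothetical and is not part of strictlyr; the default of "stop" is an assumption based on the error shown above. It relies only on the fact that dplyr marks grouped data with the `grouped_df` class.

```r
# Hypothetical helper (not strictlyr's actual implementation): report a
# grouped data.frame in the way requested by the STRICTLYR_PIPE option.
check_ungrouped <- function(df) {
  # dplyr::group_by() returns an object carrying the "grouped_df" class
  if (!inherits(df, "grouped_df")) {
    return(invisible(df))
  }

  msg    <- "Result still has groups - did you mean to call `ungroup()` as well?"
  action <- getOption("STRICTLYR_PIPE", default = "stop")  # assumed default

  switch(
    action,
    stop    = stop(msg, call. = FALSE),
    warning = warning(msg, call. = FALSE),
    message = message(msg),
    quiet   = NULL,
    stop("Unknown STRICTLYR_PIPE value: ", action, call. = FALSE)
  )

  invisible(df)
}
```

With `options(STRICTLYR_PIPE = 'quiet')` this sketch returns a grouped input unchanged; with "stop" it raises an error instead.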

Assumption: Predicates in a filter() operation should never produce NAs

An NA as a result of a predicate in a filter() statement is almost always an indication that I have made a mistake, e.g. I don’t understand my data, I’ve made an earlier data handling error, or new data has violated earlier assumptions.

To be clear: having NA values in the actual dataset is fine, but having NA as the result of a filter predicate is not.

Below is an example of the kind of error that can occur when an unexpected NA appears in your dataset. In this scenario, test_df$x previously never contained NA values, but a data update violated this assumption. Code that previously worked now silently drops any row where x is NA!

# Dataset with 3 rows
test_df <- data.frame(x = c(1, NA, 3), y = c(4, 5, 6))

# split the data
low_df  <- test_df %>% filter(x <  2)
high_df <- test_df %>% filter(x >= 2)

# calculate something on the separate datasets and then re-combine.
# Now there are only 2 rows in the data!
dplyr::bind_rows(low_df, high_df)
#>   x y
#> 1 1 4
#> 2 3 6
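One way a stricter filter could behave is sketched below in base R. The function name `strict_filter_rows()` and its error message are hypothetical, chosen only for illustration; this is not strictlyr's implementation. The idea is simply to refuse to subset when the predicate contains any NA.

```r
# Hypothetical sketch: subset rows, but raise an error if the predicate
# evaluates to NA for any row instead of silently dropping those rows.
strict_filter_rows <- function(df, keep) {
  stopifnot(is.logical(keep), length(keep) == nrow(df))
  if (anyNA(keep)) {
    stop(sum(is.na(keep)), " NA value(s) in filter predicate", call. = FALSE)
  }
  df[keep, , drop = FALSE]
}

test_df <- data.frame(x = c(1, NA, 3), y = c(4, 5, 6))

# Capture the error message rather than halting the script
result <- tryCatch(
  strict_filter_rows(test_df, test_df$x < 2),
  error = function(e) conditionMessage(e)
)
result
#> [1] "1 NA value(s) in filter predicate"
```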

Assumption: The condition in an if_else() statement should never produce NAs

An NA as a result of the condition in an if_else() statement is almost always an indication that I have made a mistake, e.g. I don’t understand my data, I’ve made an earlier data handling error, or new data has violated earlier assumptions.

Below is an example of the kind of error that can occur when an unexpected NA appears in your dataset. In this scenario, x previously never contained NA values, but a data update violated this assumption. Code that previously worked now changes the total count of elements.

# A rogue 'NA' has appeared in the data where there never was before.
x <- c(1, 2, NA)

size <- if_else(x < 2, 'small', 'large')

N_small <- length(size[size == 'small'])
N_large <- length(size[size == 'large'])

# Now have an erroneous count (x has only 3 elements)
N_small + N_large
#> [1] 4
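A stricter conditional could reject NA conditions up front. The following base R sketch uses a hypothetical `strict_if_else()` built on `ifelse()`; the name, signature, and message are assumptions for illustration and are not taken from strictlyr.

```r
# Hypothetical sketch: like ifelse(), but an NA in the condition is an error.
strict_if_else <- function(condition, true, false) {
  stopifnot(is.logical(condition))
  if (anyNA(condition)) {
    stop("condition contains NA at position(s): ",
         paste(which(is.na(condition)), collapse = ", "),
         call. = FALSE)
  }
  ifelse(condition, true, false)
}

x <- c(1, 2, NA)

# Capture the error message rather than halting the script
size <- tryCatch(
  strict_if_else(x < 2, 'small', 'large'),
  error = function(e) conditionMessage(e)
)
size
#> [1] "condition contains NA at position(s): 3"
```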

Assumption: For case_when(), each input element should match only 1 rule.

In the following case_when() code the output is pretty awful due to a combination of typos, rule misspecification, and NA values.

I want a case_when() which avoids some easy errors i.e. it should:

animal <- c('cat', 'dog', 'dogs', 'snake', NA)

case_when(
  animal == 'catt'          ~ 'mammal',
  animal == 'dog'           ~ 'mammal', 
  startsWith(animal, 'dog') ~ "best friend",
  TRUE                      ~ "reptile"
)
#> [1] "reptile"     "mammal"      "best friend" "reptile"     "reptile"

case_when() applies the first matching rule that it finds, and this is often very useful. So to match the desired strict behaviour, there would need to be an alternate function called strict_case_when(). See this post for more discussion: https://coolbutuseless.github.io/2018/09/06/strict-case_when/
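To make the "one rule per element" assumption concrete, the base R sketch below counts how many rules match each element of the example input. It is only a diagnostic under stated assumptions: the column names in `rules` are made up for illustration, and nothing here reflects how strict_case_when() is actually implemented.

```r
animal <- c('cat', 'dog', 'dogs', 'snake', NA)

# Evaluate every rule against every element (one column per rule).
# The catch-all TRUE rule is left out so unmatched elements stand out.
rules <- cbind(
  cat_typo   = animal == 'catt',
  dog        = animal == 'dog',
  dog_prefix = startsWith(animal, 'dog')
)

n_matches <- rowSums(rules, na.rm = TRUE)  # number of rules matched per element
has_na    <- rowSums(is.na(rules)) > 0     # any rule evaluated to NA?

# Elements matching more than one rule
animal[n_matches > 1]
#> [1] "dog"

# Elements for which a rule evaluated to NA
animal[has_na]
#> [1] NA

# Elements matching no rule at all (these fall through to the catch-all)
animal[n_matches == 0 & !has_na]
#> [1] "cat"   "snake"
```

A strict variant could raise an error whenever any of these three sets is non-empty.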

Assumption: In a left_join() operation, the RHS should have (at most) one row matching each row in the LHS

In the majority of left_join() calls, I expect (at most) one match in the RHS dataset. In these types of left_join() calls, where there are multiple matching rows in the RHS, I would prefer an error rather than the propagation of duplicate rows.

# Expecting one measurement of weight and height per subject
# There is an erroneous duplicate height recorded for subject 2
weight <- data.frame(ID = 1:2, wt = c(10, 20))
height <- data.frame(ID = c(1, 2, 2, 3), ht = c(20, 21, 21, 22))

# Now the total measurements data has a duplicate row too!
measurements <- weight %>% left_join(height, by = 'ID')
measurements
#>   ID wt ht
#> 1  1 10 20
#> 2  2 20 21
#> 3  2 20 21

left_join() is quite a powerful operation, and restricting the RHS to one matching row would cripple its usefulness in general. So I think there should be an alternate function: strict_left_join()
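The check such a function needs can be sketched in base R. The helper below is hypothetical (its name, single-column `by` argument, and error message are assumptions) and uses `merge()` rather than dplyr, so it is an illustration of the idea rather than the package's implementation. Note that `merge()` sorts the result by the key column.

```r
# Hypothetical sketch: left join that errors if any LHS key matches more
# than one RHS row. Only a single key column is supported here.
strict_left_join_sketch <- function(lhs, rhs, by) {
  rhs_keys <- rhs[[by]]
  dup_keys <- unique(rhs_keys[duplicated(rhs_keys)])

  # Only duplicates that actually match a row in the LHS matter
  bad <- intersect(lhs[[by]], dup_keys)
  if (length(bad) > 0) {
    stop("Multiple rows in RHS match ", by, ": ",
         paste(bad, collapse = ", "), call. = FALSE)
  }

  merge(lhs, rhs, by = by, all.x = TRUE)
}

weight <- data.frame(ID = 1:2, wt = c(10, 20))
height <- data.frame(ID = c(1, 2, 2, 3), ht = c(20, 21, 21, 22))

# The duplicate height for subject 2 now raises an error
res <- tryCatch(
  strict_left_join_sketch(weight, height, by = 'ID'),
  error = function(e) conditionMessage(e)
)

# After removing the duplicate, the join yields one row per subject
height_fixed <- height[!duplicated(height$ID), ]
measurements <- strict_left_join_sketch(weight, height_fixed, by = 'ID')
```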

See other discussion about left_join() and multiple matching rows in:

Guidelines for function design within strictlyr

Drop-in replacement functions should

New/alternate functions should



coolbutuseless/strictlyr documentation built on Dec. 9, 2019, 12:15 a.m.