The goal of strictlyr is to provide functions that are stricter about violations of some common assumptions during data manipulation.
The key issues which this package will initially focus on handling:
- Predicates (e.g. in filter() and if_else()) should never evaluate to an NA value.
- In case_when(), the "looseness" of the matching can mean that it is easy to make mistakes.
- A left_join() which enforces some conditions on the RHS.

To address these issues, strictlyr will:

You can install strictlyr from GitHub with:
remotes::install_github("coolbutuseless/strictlyr")
# Always load `dplyr` first
library(dplyr    , warn.conflicts = FALSE)
library(strictlyr, warn.conflicts = FALSE)

In 100% of the code that I write, I do not want data.frames with groups in the global environment.
Every group_by() I write is paired with an immediate ungroup() after I've done what needs doing. If I ever forget to ungroup(), this is a mistake that will lead to data issues later in the script.
strictlyr includes a drop-in replacement for the pipe operator which checks if the input or output data is grouped.
res1 <- mtcars %>% group_by(cyl) %>% mutate(mpg = max(mpg))
This error may be configured by setting either of the following options:
- options(STRICTLYR_LOG = 'quiet') - make all strictlyr functions quiet
- options(STRICTLYR_PIPE = 'quiet') - make only the pipe quiet

Possible values for STRICTLYR_PIPE are "stop", "warning", "message", and "quiet".
options(STRICTLYR_PIPE = 'quiet') # Suppress `strictlyr` output for the pipe
res1 <- mtcars %>% group_by(cyl) %>% mutate(mpg = max(mpg))
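
To make the option handling above concrete, here is a minimal sketch of a grouped-data check that honours STRICTLYR_PIPE. It is not strictlyr's actual implementation: the helper name check_ungrouped() and the default of "stop" when the option is unset are assumptions made for illustration.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical helper (not part of strictlyr): report grouped data according
# to the STRICTLYR_PIPE option. Defaulting to "stop" is an assumption.
check_ungrouped <- function(df, label = "data") {
  action <- getOption("STRICTLYR_PIPE", default = "stop")
  if (is.data.frame(df) && dplyr::is_grouped_df(df) && action != "quiet") {
    msg <- paste0("strict check: ", label, " is grouped")
    switch(action,
      stop    = stop(msg, call. = FALSE),
      warning = warning(msg, call. = FALSE),
      message = message(msg)
    )
  }
  # Return the data unchanged so the check can sit inside a pipeline
  invisible(df)
}

grouped <- group_by(mtcars, cyl)

# Capture the error instead of halting the script
outcome <- tryCatch(
  check_ungrouped(grouped, label = "output"),
  error = function(e) conditionMessage(e)
)
outcome
```

A drop-in pipe could run a check like this on both its left-hand side and its result; ungrouped data passes through untouched.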

The predicate in a filter() operation should never produce NAs

An NA as a result of a predicate in a filter() statement is almost always an indication that I have made a mistake, e.g. I don't understand my data, I've made an earlier data handling error, or new data has violated earlier assumptions.
To be clear: having NA values in the actual dataset is fine, but having NA as the result of a filter predicate is not.
An example of the type of error that can occur if a wild and unexpected NA appears in your dataset is included below. In this scenario, test_df$x previously never contained NA values, but a data update violated this assumption. Code that previously worked now silently drops any row where x is NA!
# Dataset with 3 rows
test_df <- data.frame(x = c(1, NA, 3), y = c(4, 5, 6))

# Split the data
low_df  <- test_df %>% filter(x < 2)
high_df <- test_df %>% filter(x >= 2)

# Calculate something on the separate datasets and then re-combine.
# Now there are only 2 rows in the data!
dplyr::bind_rows(low_df, high_df)
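
For comparison, a stricter filter could refuse to continue when the predicate evaluates to NA. The following is only a sketch in base R, not strictlyr's implementation: filter_no_na() is a hypothetical name, and it takes an already-evaluated logical vector rather than an unquoted expression.

```r
# Hypothetical helper (not part of strictlyr): subset rows, but stop if the
# predicate is NA for any row instead of silently dropping that row.
filter_no_na <- function(df, keep) {
  stopifnot(is.logical(keep), length(keep) == nrow(df))
  if (anyNA(keep)) {
    stop("filter predicate is NA for row(s): ",
         paste(which(is.na(keep)), collapse = ", "), call. = FALSE)
  }
  df[keep, , drop = FALSE]
}

test_df <- data.frame(x = c(1, NA, 3), y = c(4, 5, 6))

# The rogue NA in row 2 is now reported rather than lost
result <- tryCatch(
  filter_no_na(test_df, test_df$x < 2),
  error = function(e) conditionMessage(e)
)
result
```

Filtering on a column without NA values, such as y, behaves like an ordinary row subset.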

The condition in an if_else() statement should never produce NAs

An NA as a result of the condition in an if_else() statement is almost always an indication that I have made a mistake, e.g. I don't understand my data, I've made an earlier data handling error, or new data has violated earlier assumptions.
An example of the type of error that can occur if a wild and unexpected NA appears in your dataset is included below. In this scenario, x previously never contained NA values, but a data update violated this assumption. Code that previously worked now produces a wrong total count.
# A rogue 'NA' has appeared in the data where there never was before.
x <- c(1, 2, NA)
size <- if_else(x < 2, 'small', 'large')
N_small <- length(size[size == 'small'])
N_large <- length(size[size == 'large'])

# Now have an erroneous count
N_small + N_large
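
The count goes wrong because indexing with an NA yields an NA element, so the single NA is counted once in each subset. A stricter if_else() could stop as soon as the condition contains an NA. The wrapper below is a sketch under that assumption; if_else_no_na() is a hypothetical name and not strictlyr's implementation.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical wrapper (not part of strictlyr): refuse NA conditions, then
# defer to dplyr::if_else() for the actual work.
if_else_no_na <- function(condition, true, false) {
  if (anyNA(condition)) {
    stop("if_else condition is NA at position(s): ",
         paste(which(is.na(condition)), collapse = ", "), call. = FALSE)
  }
  dplyr::if_else(condition, true, false)
}

x <- c(1, 2, NA)

# The NA in position 3 is reported before any counting happens
outcome <- tryCatch(
  if_else_no_na(x < 2, "small", "large"),
  error = function(e) conditionMessage(e)
)
outcome
```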

In a case_when(), each input element should match only 1 rule.

In the following case_when() code the output is pretty awful due to a combination of typos, rule misspecification, and NA values.
I want a case_when() which avoids some easy errors, i.e. it should:
- disallow a bare TRUE rule so that catt would be picked up as a typo rather than classified as a reptile.
- avoid an NA value being classed as a reptile. An easy solution would again be to disallow the bare TRUE rule.

animal <- c('cat', 'dog', 'dogs', 'snake', NA)
case_when(
  animal == 'catt'          ~ 'mammal',
  animal == 'dog'           ~ 'mammal',
  startsWith(animal, 'dog') ~ "best friend",
  TRUE                      ~ "reptile"
)
case_when() applies the first matching rule that it finds, and this is often very useful. So to match the desired strict behaviour, there would need to be an alternate function called strict_case_when(). See this post for more discussion: https://coolbutuseless.github.io/2018/09/06/strict-case_when/
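
One way to picture the strict behaviour is a check that every input element matches exactly one rule. The sketch below is written in base R and is not the strict_case_when() discussed in the post: the function name case_when_exactly_one() and its interface (a vector of labels plus a list of logical conditions) are assumptions chosen to keep the example short.

```r
# Hypothetical sketch (not strictlyr's implementation).
# `labels` is a character vector; `conditions` is a list of logical vectors,
# one per rule, each the same length as the input.
case_when_exactly_one <- function(labels, conditions) {
  stopifnot(length(labels) == length(conditions))
  # One column per rule; an NA condition counts as "no match"
  hits <- do.call(
    cbind,
    lapply(unname(conditions), function(cond) !is.na(cond) & cond)
  )
  n_hits <- rowSums(hits)
  bad <- which(n_hits != 1)
  if (length(bad) > 0) {
    stop("elements not matching exactly one rule: ",
         paste(bad, collapse = ", "), call. = FALSE)
  }
  # Every row has exactly one TRUE, so each element is assigned once
  out <- character(nrow(hits))
  for (i in seq_along(labels)) {
    out[hits[, i]] <- labels[i]
  }
  out
}

animal <- c("cat", "dog", "dogs", "snake", NA)

# No bare TRUE rule: unmatched and doubly matched elements are reported
outcome <- tryCatch(
  case_when_exactly_one(
    labels     = c("mammal", "mammal", "best friend"),
    conditions = list(animal == "catt",
                      animal == "dog",
                      startsWith(animal, "dog"))
  ),
  error = function(e) conditionMessage(e)
)
outcome
```

Here 'cat' (because of the catt typo), 'snake' and the NA match no rule, while 'dog' matches two, so all four positions are flagged.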

In a left_join() operation, the RHS should have (at most) one row matching each row in the LHS

In the majority of left_join() calls, I expect (at most) one match in the RHS dataset. In these types of left_join() calls, where there are multiple matching rows in the RHS, I would prefer an error rather than the propagation of duplicate rows.
# Expecting one measurement of weight and height per subject
# There is an erroneous duplicate height recorded for subject 2
weight <- data.frame(ID = 1:2, wt = c(10, 20))
height <- data.frame(ID = c(1, 2, 2, 3), ht = c(20, 21, 21, 22))

# Now the total measurements data has a duplicate row too!
measurements <- weight %>% left_join(height, by = 'ID')
measurements
The left_join() is quite a powerful operator, and restricting the RHS to one matching row would cripple its usefulness in general. So I think there should be an alternate function: strict_left_join().
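
As a rough illustration of what such a function might check, the sketch below rejects a join when any LHS key has more than one matching RHS row. It is not strictlyr's strict_left_join(): the name left_join_at_most_one() is hypothetical, and it assumes a single key column shared by both data.frames.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical sketch (not strictlyr's implementation), single key column only
left_join_at_most_one <- function(lhs, rhs, by) {
  # Only RHS keys that actually occur in the LHS can create duplicate rows
  rhs_keys <- rhs[[by]][rhs[[by]] %in% lhs[[by]]]
  dup_keys <- unique(rhs_keys[duplicated(rhs_keys)])
  if (length(dup_keys) > 0) {
    stop("multiple RHS rows match key(s): ",
         paste(dup_keys, collapse = ", "), call. = FALSE)
  }
  dplyr::left_join(lhs, rhs, by = by)
}

weight <- data.frame(ID = 1:2, wt = c(10, 20))
height <- data.frame(ID = c(1, 2, 2, 3), ht = c(20, 21, 21, 22))

# The duplicate height for subject 2 is reported instead of propagated
outcome <- tryCatch(
  left_join_at_most_one(weight, height, by = "ID"),
  error = function(e) conditionMessage(e)
)
outcome
```

Recent dplyr versions also offer a relationship argument on joins (for example relationship = "many-to-one") that raises an error in this situation; check the documentation of your installed version before relying on it.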
See other discussion about left_join() and multiple matching rows in:

strictlyr

Drop-in replacement functions should
- use options() to configure output behaviour when assumptions are violated, i.e. 'error', 'warn', 'message' or 'quiet'

New/alternate functions should
- options()
- have a strict_ prefix, e.g. strict_filter() would be an alternative to filter()