knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
The package safejoin features wrappers around packages dplyr and fuzzyjoin's functions to join safely using various checks. It also comes packed with features to select columns, rename them, operate on conflicting ones (coalesce for example), or aggregate the rhs on the joining columns before joining.
Install package with:
# install.packages(devtools) devtools::install_github("moodymudskipper/safejoin")
Joining operations often come with tests, one might want to check that:
by
columns are given explicitly (dplyr displays a message if
they're not)by
columns
(dplyr keeps them both and suffixes them silently)This package provides the possibility to ignore, inform, warn or abort for any of combination of these cases.
These checks are handled by a single string parameter, i.e. a sequence of
characters where uppercase letters trigger failures, lower case letters trigger
warnings, and letters prefixed with ~
trigger messages, the codes are as follow:
"c"
to check conflicts of columns"b"
like "by" checks if by
parameter was given explicitly"u"
like unique to check that the join columns form an unique key on x
"v"
to check that the join columns form an unique key on y
"m"
like match to check that all rows of x
have a match"n"
to check that all rows of y
have a match"e"
like expand to check that all combinations of joining columns are
present in x
"f"
to check that all combinations of joining columns are present in y
"l"
like levels to check that join columns are consistent in term of
factor levels"t"
like type to check that joining columns have same class and typeFor example, check = "MN"
will ensure that all rows of both tables are matched.
Additionally when identically named columns are present on both
sides, we can aggregate them into one in flexible ways (including coalesce or
just keeping one of them). This is done through the conflict
parameter.
The package features functions safe_left_join
, safe_right_join
,
safe_inner_join
, safe_full_join
, safe_nest_join
, safe_semi_join
,
safe_anti_join
, and eat
.
The additional function, eat
is designed to be an improved join in the cases where one is growing a
data frame. In addition to the features above :
...
argument to select columns from .y
and leverages the select helpers from dplyr, allowing also things like renaming, negative selection, quasi-quotation....y
on the fly along joining columns for more concise and
readable codeThe support of fuzzyjoin
functions is done in two ways, fuzzyjoin
functions
will be used instead of dplyr
's functions if :
match_fun
is filled. Then the standard fuzzyjoin
interface
is leveraged, except that safejoin
supports formula notation for this argument.by
argument. It should use a notation
like ~ X("var1") > Y("var2") & X("var3") < Y("var4")
. This was introduced to
avoid using the arguments multi_by
and multi_match_fun
from
fuzzyjoin::fuzzy_join
which I felt were confusing, and have a single readable
argument instead.safejoin offers the same features for all safe_*_join
functions so we'll
only review safe_left_join
here, we also limit ourselves to checks of the form
~*
We'll use dplyr's data sets band_members
and band_instruments
along with
extended versions.
library(safejoin) library(dplyr,quietly = TRUE,warn.conflicts = FALSE) band_members_extended <- band_members %>% mutate(cooks = factor(c("pasta","pizza","spaghetti"), levels = c("pasta","pizza","spaghetti"))) %>% add_row(name = "John",band = "The Who", cooks = "pizza") band_instruments_extended <- band_instruments %>% mutate(cooks = factor(c("pizza","pasta","pizza"))) band_members band_instruments band_members_extended band_instruments_extended
Not applying any check :
safe_left_join(band_members, band_instruments, check = "")
Displaying "Joining, by..." like in default dplyr behavior:
safe_left_join(band_members, band_instruments, check = "~b")
Check column conflict when joining extended datasets by name:
try(safe_left_join(band_members_extended, band_instruments_extended, by = "name", check = "~c"))
Check if x
has unmatched combinations:
safe_left_join(band_members_extended, band_instruments_extended, by = c("name","cooks"), check = "~m")
Check if y
has unmatched combinations:
safe_left_join(band_members_extended, band_instruments_extended, by = c("name","cooks"), check = "~n")
Check if x
has absent combinations:
safe_left_join(band_members_extended, band_instruments_extended, by = c("name","cooks"), check = "~e")
Check if y
has absent combinations:
safe_left_join(band_members_extended, band_instruments_extended, by = c("name","cooks"), check = "~f")
Check if x
is unique on joining columns:
safe_left_join(band_members_extended, band_instruments_extended, by = c("name","cooks"), check = "~u")
Check if y
is unique on joining columns (it is):
safe_left_join(band_members_extended, band_instruments_extended, by = c("name","cooks"), check = "~v")
Check if levels are compatible betweeb joining columns:
safe_left_join(band_members_extended, band_instruments_extended, by = c("name","cooks"), check = "~l")
In case of confict, choose either the column from x
or from y
:
safe_left_join(band_members_extended, band_instruments_extended, by = "name", conflict = ~.x) safe_left_join(band_members_extended, band_instruments_extended, by = "name", conflict = ~.y)
Or coalesce them :
safe_left_join(band_members_extended, band_instruments_extended, by = "name", conflict = coalesce) safe_left_join(band_members_extended, band_instruments_extended, by = "name", conflict = ~coalesce(.y,.x))
Or do any custom transformation :
safe_left_join(band_members_extended, band_instruments_extended, by = "name", conflict = paste)
Some common use cases for numerics would be confict = `+`
, confict = pmin
,
, confict = pmax
, confict = ~(.x+.y)/2
.
conflict = "patch"
is a special value where matches found in y
overwrite the values in x
, and other values are kept. It's different from
conflict = ~coalesce(.y,.x)
because some values in x
might be overwritten
by NA
.
safe_left_join(band_members_extended, band_instruments_extended, by = "name", conflict = "patch")
All the checks above are still relevant for eat
, we'll silence them below
with check=""
to focus on the additional features.
Same as safe_left_join
:
band_members_extended %>% eat(band_instruments_extended) band_members_extended %>% eat(band_instruments_extended, .by = "name", .check = "")
The names of eat
's parameters start with a dot to minimize the risk of conflict
when naming the arguments fed to the ...
. The ...
are usually used to pass
columns to be eaten, but they are passed to select
so more features are
available.
Select which column to eat:
band_members_extended %>% eat(band_instruments_extended, plays, .by = "name", .check = "") band_members_extended %>% eat(band_instruments_extended, -cooks, .by = "name", .check = "") band_members_extended %>% eat(band_instruments_extended, starts_with("p"), .by = "name", .check = "")
Rename eaten columns :
band_members_extended %>% eat(band_instruments_extended, .prefix = "NEW", .check = "") band_members_extended %>% eat(band_instruments_extended, PLAYS = plays, .check = "")
We can check if the dot argument was used by using the character "d"
in the check string:
band_members_extended %>% eat(band_instruments_extended, .check = "~d")
In cases of matching to many (i.e. the join columns don't form a unique key for
y
), we can use the parameter .agg
to aggregate them on the fly:
band_instruments_extended %>% eat(band_members_extended, .check = "") band_instruments_extended %>% eat(band_members_extended, .agg = ~paste(.,collapse="/"), .check = "")
Finally we can eat a list of data frames at once, and optionally override
the .prefix
argument by providing names to the elements.
X <- data.frame(a = 1:2,b = 1:2) Y1 <- list(data.frame(a = 1:2,c = 3:4), data.frame(a = 1:2,d = 5:6)) eat(X, Y1) Y2 <- list(data.frame(a = 1:2,c = c(3,NA)), data.frame(a = 1:2,c = c(NA,4))) eat(X, Y2, .by = "a", .conflict = coalesce) Y3 <- list(FOO = data.frame(a = 1:2,c = 3:4), BAR = data.frame(a = 1:2,d = 5:6)) eat(X, Y3) Y4 <- list(FOO = data.frame(a = 1:2, c = 3:4, d = 5:6), BAR = data.frame(a = 1:2, c = 3:4, e = 7:8)) eat(X, Y4) eat(X, Y4, c)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.