View source: R/generics_and_methods.R
power_left_join | R Documentation |
Power joins
power_left_join(
x,
y = NULL,
by = NULL,
copy = FALSE,
suffix = c(".x", ".y"),
keep = NULL,
na_matches = c("na", "never"),
check = check_specs(),
conflict = NULL,
fill = NULL
)
power_right_join(
x,
y = NULL,
by = NULL,
copy = FALSE,
suffix = c(".x", ".y"),
keep = NULL,
na_matches = c("na", "never"),
check = check_specs(),
conflict = NULL,
fill = NULL
)
power_inner_join(
x,
y = NULL,
by = NULL,
copy = FALSE,
suffix = c(".x", ".y"),
keep = NULL,
na_matches = c("na", "never"),
check = check_specs(),
conflict = NULL,
fill = NULL
)
power_full_join(
x,
y = NULL,
by = NULL,
copy = FALSE,
suffix = c(".x", ".y"),
keep = NULL,
na_matches = c("na", "never"),
check = check_specs(),
conflict = NULL,
fill = NULL
)
x , y |
A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
by |
As in dplyr, but extended so user can supply a formula or a list of character and formulas. Formulas are used for fuzzy joins, see dedicated section below. |
copy |
Ignored at the moment because powerjoin doesn't support databases |
suffix |
If there are non-joined duplicate variables in |
keep |
A boolean for compatibility with dplyr, or a value among "left", "right", "both", "none" or "default". See dedicated section below. |
na_matches |
Should two
|
check |
A list created with |
conflict |
A function, formula, the special value amongst |
fill |
Values used to replace missing values originating in unmatched keys, or a named list of such items. |
A data frame
keep
argument valuesNULL
(default) : merge keys and name them as the left table's keys, and
keep columns used for fuzzy joins from both tables
left
: keep only key columns for left table
right
: keep only key columns for right table
both
or TRUE
: keep key columns from both tables, adding suffix if relevant
none
: drop all key columns from the output
FALSE
: merge keys and name them as the left table's keys, maps to none
for fuzzy joins
To specify fuzzy matching conditions we use formulas in which the we refer to
the columns from the left side data frame using .x
and the right side data frame
using .y
, for instance by = ~ .x$col1 > .y$col2
.
We can specify several condition and even mix equi condition with fuzzy condition,
for instance by = c(col1 = "col2", ~ .x$col3 > .y$col4)
To fuzzy match strings we can leverage the functions from the stringr
package since they are vectorized on all main arguments,
for instance to match observations where col1
contains col1
we can attach
stringr and do by = ~ str_detect(.x$col1, fixed(.y$col2))
.
Another useful function is stringdist
from the stringdist package to match
strings that are close enough, for instance by = ~ stringdist::stringdist(.x$a,.y$a) < 2
We can also define a new column computed during the fuzzy matching, using the
arrow assignment operator, for instance : by = ~ .x$col1 > (mysum <- .y$col2 + .y$col3)
When the by
condition evaluates to NA
the observation is dismissed. This makes
by = c(a = "b")
slightly different from by = ~ .x$a == .y$b
when na_matches
is "na"
(the default). To be able to match NA
with NA
in fuzzy matching condition
we can use the %==%
operator (bone operator), defined in this package.
# See README for a more verbose version
library(tibble)
male_penguins <- tribble(
~name, ~species, ~island, ~flipper_length_mm, ~body_mass_g,
"Giordan", "Gentoo", "Biscoe", 222L, 5250L,
"Lynden", "Adelie", "Torgersen", 190L, 3900L,
"Reiner", "Adelie", "Dream", 185L, 3650L
)
female_penguins <- tribble(
~name, ~species, ~island, ~flipper_length_mm, ~body_mass_g,
"Alonda", "Gentoo", "Biscoe", 211, 4500L,
"Ola", "Adelie", "Dream", 190, 3600L,
"Mishayla", "Gentoo", "Biscoe", 215, 4750L,
)
# apply different checks
power_inner_join(
male_penguins[c("species", "island")],
female_penguins[c("species", "island")],
check = check_specs(implicit_keys = "ignore", duplicate_keys_right = "inform")
)
df1 <- tibble(id = 1:3, value = c(10, NA, 30))
df2 <- tibble(id = 2:4, value = c(22, 32, 42))
# handle conflicted columns when joining
power_left_join(df1, df2, by = "id", conflict = `+`)
# the most frequent use case is to coalesce
power_left_join(df1, df2, by = "id", conflict = coalesce_xy)
power_left_join(df1, df2, by = "id", conflict = coalesce_yx)
# the conflict function is applied colwise by default!
power_left_join(df1, df2, by = "id", conflict = ~ sum(.x, .y, na.rm = TRUE))
# apply conflict function rowwise
power_left_join(df1, df2, by = "id", conflict = rw ~ sum(.x, .y, na.rm = TRUE))
# subset columns without repeating keys
power_inner_join(
male_penguins %>% select_keys_and(name),
female_penguins %>% select_keys_and(female_name = name),
by = c("species", "island")
)
# semi join
power_inner_join(
male_penguins,
female_penguins %>% select_keys_and(),
by = c("species", "island")
)
# agregate without repeating keys
power_left_join(
male_penguins %>% summarize_by_keys(male_weight = mean(body_mass_g)),
female_penguins %>% summarize_by_keys(female_weight = mean(body_mass_g)),
by = c("species", "island")
)
# pack auxiliary colums without repeating keys
power_left_join(
male_penguins %>% pack_along_keys(name = "m"),
female_penguins %>% pack_along_keys(name = "f"),
by = c("species", "island")
)
# fuzzy join
power_inner_join(
male_penguins %>% select_keys_and(male_name = name),
female_penguins %>% select_keys_and(female_name = name),
by = c(~.x$flipper_length_mm < .y$flipper_length_mm, ~.x$body_mass_g > .y$body_mass_g)
)
# fuzzy + equi join
power_inner_join(
male_penguins %>% select_keys_and(male_name = name),
female_penguins %>% select_keys_and(female_name = name),
by = c("island", ~.x$flipper_length_mm > .y$flipper_length_mm)
)
# define new column without repeating computation
power_inner_join(
male_penguins %>% select_keys_and(male_name = name),
female_penguins %>% select_keys_and(female_name = name),
by = ~ (mass_ratio <- .y$body_mass_g / .x$body_mass_g) > 1.2
)
power_inner_join(
male_penguins %>% select_keys_and(male_name = name),
female_penguins %>% select_keys_and(female_name = name),
by = ~ (mass_ratio <- .y$body_mass_g / .x$body_mass_g) > 1.2,
keep = "none"
)
# fill unmatched values
df1 <- tibble(id = 1:3)
df2 <- tibble(id = 1:2, value2 = c(2, NA), value3 = c(NA, 3))
power_left_join(df1, df2, by = "id", fill = 0)
power_left_join(df1, df2, by = "id", fill = list(value2 = 0))
# join recursively
df1 <- tibble(id = 1, a = "foo")
df2 <- tibble(id = 1, b = "bar")
df3 <- tibble(id = 1, c = "baz")
power_left_join(list(df1, df2, df3), by = "id")
power_left_join(df1, list(df2, df3), by = "id")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.