join_by | R Documentation |
join_by()
constructs a specification that describes how to join two tables
using a small domain specific language. The result can be supplied as the
by
argument to any of the join functions (such as left_join()
).
join_by(...)
... |
Expressions specifying the join. Each expression should consist of one of the following:
Other expressions are not supported. If you need to perform a join on
a computed variable, e.g. Column names should be specified as quoted or unquoted names. By default,
the name on the left-hand side of a join condition refers to the left-hand
table, unless overridden by explicitly prefixing the column name with
either If a single column name is provided without any join conditions, it is
interpreted as if that column name was duplicated on each side of |
The following types of joins are supported by dplyr:
Equality joins
Inequality joins
Rolling joins
Overlap joins
Cross joins
Equality, inequality, rolling, and overlap joins are discussed in more detail
below. Cross joins are implemented through cross_join()
.
Equality joins require keys to be equal between one or more pairs of columns,
and are the most common type of join. To construct an equality join using
join_by()
, supply two column names to join with separated by ==
.
Alternatively, supplying a single name will be interpreted as an equality
join between two columns of the same name. For example, join_by(x)
is
equivalent to join_by(x == x)
.
Inequality joins match on an inequality, such as >
, >=
, <
, or <=
, and
are common in time series analysis and genomics. To construct an inequality
join using join_by()
, supply two column names separated by one of the above
mentioned inequalities.
Note that inequality joins will match a single row in x
to a potentially
large number of rows in y
. Be extra careful when constructing inequality
join specifications!
Rolling joins are a variant of inequality joins that limit the results
returned from an inequality join condition. They are useful for "rolling" the
closest match forward/backwards when there isn't an exact match. To construct
a rolling join, wrap an inequality with closest()
.
closest(expr)
expr
must be an inequality involving one of: >
, >=
, <
, or <=
.
For example, closest(x >= y)
is interpreted as: For each value in x
,
find the closest value in y
that is less than or equal to that x
value.
closest()
will always use the left-hand table (x
) as the primary table,
and the right-hand table (y
) as the one to find the closest match in,
regardless of how the inequality is specified. For example,
closest(y$a >= x$b)
will always be interpreted as closest(x$b <= y$a)
.
Overlap joins are a special case of inequality joins involving one or two
columns from the left-hand table overlapping a range defined by two columns
from the right-hand table. There are three helpers that join_by()
recognizes to assist with constructing overlap joins, all of which can be
constructed from simpler inequalities.
between(x, y_lower, y_upper, ..., bounds = "[]")
For each value in x
, this finds everywhere that value falls between
[y_lower, y_upper]
. Equivalent to x >= y_lower, x <= y_upper
by
default.
bounds
can be one of "[]"
, "[)"
, "(]"
, or
"()"
to alter the inclusiveness of the lower and upper bounds. This
changes whether >=
or >
and <=
or <
are used to build the
inequalities shown above.
Dots are for future extensions and must be empty.
within(x_lower, x_upper, y_lower, y_upper)
For each range in [x_lower, x_upper]
, this finds everywhere that range
falls completely within [y_lower, y_upper]
. Equivalent to x_lower >= y_lower, x_upper <= y_upper
.
The inequalities used to build within()
are the same regardless of the
inclusiveness of the supplied ranges.
overlaps(x_lower, x_upper, y_lower, y_upper, ..., bounds = "[]")
For each range in [x_lower, x_upper]
, this finds everywhere that range
overlaps [y_lower, y_upper]
in any capacity. Equivalent to x_lower <= y_upper, x_upper >= y_lower
by default.
bounds
can be one of "[]"
, "[)"
, "(]"
, or
"()"
to alter the inclusiveness of the lower and upper bounds.
"[]"
uses <=
and >=
, but the 3 other options use <
and >
and generate the exact same inequalities.
Dots are for future extensions and must be empty.
These conditions assume that the ranges are well-formed and non-empty, i.e.
x_lower <= x_upper
when bounds are treated as "[]"
, and
x_lower < x_upper
otherwise.
When specifying join conditions, join_by()
assumes that column names on the
left-hand side of the condition refer to the left-hand table (x
), and names
on the right-hand side of the condition refer to the right-hand table (y
).
Occasionally, it is clearer to be able to specify a right-hand table name on
the left-hand side of the condition, and vice versa. To support this, column
names can be prefixed by x$
or y$
to explicitly specify which table they
come from.
sales <- tibble(
id = c(1L, 1L, 1L, 2L, 2L),
sale_date = as.Date(c("2018-12-31", "2019-01-02", "2019-01-05", "2019-01-04", "2019-01-01"))
)
sales
promos <- tibble(
id = c(1L, 1L, 2L),
promo_date = as.Date(c("2019-01-01", "2019-01-05", "2019-01-02"))
)
promos
# Match `id` to `id`, and `sale_date` to `promo_date`
by <- join_by(id, sale_date == promo_date)
left_join(sales, promos, by)
# For each `sale_date` within a particular `id`,
# find all `promo_date`s that occurred before that particular sale
by <- join_by(id, sale_date >= promo_date)
left_join(sales, promos, by)
# For each `sale_date` within a particular `id`,
# find only the closest `promo_date` that occurred before that sale
by <- join_by(id, closest(sale_date >= promo_date))
left_join(sales, promos, by)
# If you want to disallow exact matching in rolling joins, use `>` rather
# than `>=`. Note that the promo on `2019-01-05` is no longer considered the
# closest match for the sale on the same date.
by <- join_by(id, closest(sale_date > promo_date))
left_join(sales, promos, by)
# Same as before, but also require that the promo had to occur at most 1
# day before the sale was made. We'll use a full join to see that id 2's
# promo on `2019-01-02` is no longer matched to the sale on `2019-01-04`.
sales <- mutate(sales, sale_date_lower = sale_date - 1)
by <- join_by(id, closest(sale_date >= promo_date), sale_date_lower <= promo_date)
full_join(sales, promos, by)
# ---------------------------------------------------------------------------
segments <- tibble(
segment_id = 1:4,
chromosome = c("chr1", "chr2", "chr2", "chr1"),
start = c(140, 210, 380, 230),
end = c(150, 240, 415, 280)
)
segments
reference <- tibble(
reference_id = 1:4,
chromosome = c("chr1", "chr1", "chr2", "chr2"),
start = c(100, 200, 300, 415),
end = c(150, 250, 399, 450)
)
reference
# Find every time a segment `start` falls between the reference
# `[start, end]` range.
by <- join_by(chromosome, between(start, start, end))
full_join(segments, reference, by)
# If you wanted the reference columns first, supply `reference` as `x`
# and `segments` as `y`, then explicitly refer to their columns using `x$`
# and `y$`.
by <- join_by(chromosome, between(y$start, x$start, x$end))
full_join(reference, segments, by)
# Find every time a segment falls completely within a reference.
# Sometimes using `x$` and `y$` makes your intentions clearer, even if they
# match the default behavior.
by <- join_by(chromosome, within(x$start, x$end, y$start, y$end))
inner_join(segments, reference, by)
# Find every time a segment overlaps a reference in any way.
by <- join_by(chromosome, overlaps(x$start, x$end, y$start, y$end))
full_join(segments, reference, by)
# It is common to have right-open ranges with bounds like `[)`, which would
# mean an end value of `415` would no longer overlap a start value of `415`.
# Setting `bounds` allows you to compute overlaps with those kinds of ranges.
by <- join_by(chromosome, overlaps(x$start, x$end, y$start, y$end, bounds = "[)"))
full_join(segments, reference, by)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.