reviewr: Tools to quality review data wrangling and analysis code

join_qc

R Documentation

Report number of matched and/or extra cases when performing a dplyr join

Description

full_join_qc, inner_join, left_join, right_join, anti_join, and semi_join return by default near identical objects as their dplyr equivalents.

Usage

full_join_qc(x, y, by = NULL, copy = FALSE, .merge = NULL, .extra = NULL)

inner_join_qc(
  x,
  y,
  by = NULL,
  copy = FALSE,
  suffix = c(".x", ".y"),
  .merge = NULL,
  .extra = NULL,
  ...
)

left_join_qc(
  x,
  y,
  by = NULL,
  copy = FALSE,
  suffix = c(".x", ".y"),
  .merge = NULL,
  .extra = NULL,
  ...
)

right_join_qc(
  x,
  y,
  by = NULL,
  copy = FALSE,
  suffix = c(".x", ".y"),
  .merge = NULL,
  .extra = NULL,
  ...
)

anti_join_qc(x, y, by = NULL, copy = FALSE, ...)

semi_join_qc(x, y, by = NULL, copy = FALSE, ...)

Arguments

`x, y`	A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details.
`by`	A join specification created with `join_by()`, or a character vector of variables to join by. If `NULL`, the default, `⁠*_join()⁠` will perform a natural join, using all variables in common across `x` and `y`. A message lists the variables so that you can check they're correct; suppress the message by supplying `by` explicitly. To join on different variables between `x` and `y`, use a `join_by()` specification. For example, `join_by(a == b)` will match `x$a` to `y$b`. To join by multiple variables, use a `join_by()` specification with multiple expressions. For example, `join_by(a == b, c == d)` will match `x$a` to `y$b` and `x$c` to `y$d`. If the column names are the same between `x` and `y`, you can shorten this by listing only the variable names, like `join_by(a, c)`. `join_by()` can also be used to perform inequality, rolling, and overlap joins. See the documentation at ?join_by for details on these types of joins. For simple equality joins, you can alternatively specify a character vector of variable names to join by. For example, `by = c("a", "b")` joins `x$a` to `y$a` and `x$b` to `y$b`. If variable names differ between `x` and `y`, use a named character vector like `by = c("x_a" = "y_a", "x_b" = "y_b")`. To perform a cross-join, generating all combinations of `x` and `y`, see `cross_join()`.
`copy`	If `x` and `y` are not from the same data source, and `copy` is `TRUE`, then `y` will be copied into the same src as `x`. This allows you to join tables across srcs, but it is a potentially expensive operation so you must opt into it.
`.merge`	a character value used to name a new character variable, which tracks the source of each row of the new, joined data. If `NULL`, the default, no new merge-tracking variable will be created. An error will occur if a variable is already named the value specified in`.merge`, so make sure to choose different names for different joins.
`.extra`	a character value used to name a new character variable, which identifies any row of the new joined data that represents a combination of the `by` identifiers that has more rows than the original left and/or right data frames. If `NULL`, the default, no new extra row tracking variable will be created. An error will occur if a variable is already named the value specified in`.extra`, so make sure to choose different names for different joins.
`suffix`	If there are non-joined duplicate variables in `x` and `y`, these suffixes will be added to the output to disambiguate them. Should be a character vector of length 2.
`...`	Other parameters passed onto methods.

Details

Each _qc version of the join functions is identical to its dplyr equivalent except that it automatically prints the number of rows that were matched, the number of rows that were not matched, and the number of additional rows compared to the initial data frame(s) - for example when there is more than one match on the by identifier(s). There are also options to create new variables identifying and classifying rows based on how/if they matched.

Join Descriptions

All joins except anti_join and semi_join are classified as one to one, one to many, many to one, or many to many. These definitions describe the extent to which there are duplicated rows of unique combinations of the by variable(s). In one to one merges, there is only one unique row of identifiers in each data set. Extra rows are never created in one to one joins. One to many and many to one joins occur when one of the data sets has a duplicated id row. One to many implies the right data has the duplicated id; many to one implies the left data has the duplicated id. Additional rows may be created in these types of joins. Many to many joins imply that both data sets have duplicated rows on the id variable(s). Additional rows may be created in this types of join. The join description is printed when performing any join except anti_join and semi_join.

Optional New Variables

For full_join_qc, left_join_qc, and right_join_qc, there is an added option of creating a new variable that indicates whether the row in the joined data was from the "left_only", "right_only" or "matched". This variable can be helpful when diagnosing why the join did or did not match as desired. Whatever character value that is supplied to .merge becomes the name of this new variable.

left_join_qc, and right_join_qc also have the option of creating a new variable indicating whether the row in the joined data is an additional row with the given combination of by. For example, if there were only 2 rows with an ID equal to "A" in the original left data but 3 rows with this ID in the right data set, then the left joined data will have more rows with this ID than the original left. .extra is a chracter value that when supplied becomes the name of this new variable flagging a row that has additional rows than the original left or right data frame.

Grouping

Groups in the data frames are ignored for the purpose of joining, but the result preserves the grouping of x.

Examples

data_A <- 
  data.frame(
    id = 1:10, 
    var_A = 11:20
   )
data_B <- 
  data.frame(
    id = c(5, 5, 5, 5, 6, 7, 7, 9, 10, 11), 
    id_A = c(1:10), 
    var_B = 21:30
  )

# Full join with new .merge variable
full_join_qc(data_A, data_B, .merge = "merge_ab")

# Left join with new .extra variable
left_join_qc(data_A, data_B, .extra = "extra_ab")

# Right join with both new variables
right_join_qc(data_A, data_B, .merge = "merge_ab", .extra = "extra_ab")

adamMaier/reviewr documentation built on Nov. 5, 2023, 7:21 a.m.