reprores-v2: Data Skills for Reproducible Research

Data Relations {#joins}

3-panel meme. 1: dog looks at a baby in a chair, text reads 'An SQL query goes into a bar, walks up to two tables and asks...'; 2: baby looks at dog, text reads 'Can I join you?'; 3: dog and baby look at camera, no text

Be able to use the 4 mutating join verbs: (video){class="video"}
Be able to use the 2 filtering join verbs: (video){class="video"}
- semi_join()
- anti_join()
Be able to use the 2 binding join verbs: (video){class="video"}
- bind_rows()
- bind_cols()
Be able to use the 3 set operations: (video){class="video"}

# libraries needed
library(tidyverse)
library(reprores)

First, we'll create two small data tables.

subject has id, gender and age for subjects 1-5. Age and gender are missing for subject 3.

subject <- tibble(
  id = 1:5,
  gender = c("m", "m", NA, "nb", "f"),
  age = c(19, 22, NA, 19, 18)
)

id gender age 1 m 19 2 m 22 3 NA NA 4 nb 19 5 f 18

exp has subject id and the score from an experiment. Some subjects are missing, some completed twice, and some are not in the subject table.

exp <- tibble(
  id = c(2, 3, 4, 4, 5, 5, 6, 6, 7),
  score = c(10, 18, 21, 23, 9, 11, 11, 12, 3)
)

id score 2 10 3 18 4 21 4 23 5 9 5 11 6 11 6 12 7 3

Mutating joins act like the mutate() function in that they add new columns to one table based on values in another table.

All the mutating joins have this basic syntax:

****_join(x, y, by = NULL, suffix = c(".x", ".y")

x = the first (left) table
y = the second (right) table
by = what columns to match on. If you leave this blank, it will match on all columns with the same names in the two tables.
suffix = if columns have the same name in the two tables, but you aren't joining by them, they get a suffix to make them unambiguous. This defaults to ".x" and ".y", but you can change it to something more meaningful.

::: {.info data-latex=""} You can leave out the by argument if you're matching on all of the columns with the same name, but it's good practice to always specify it so your code is robust to changes in the loaded data. :::

(\#fig:img-left-join)Left Join

A left_join keeps all the data from the first (left) table and joins anything that matches from the second (right) table. If the right table has more than one match for a row in the right table, there will be more than one row in the joined table (see ids 4 and 5).

left_join(subject, exp, by = "id")

id gender age score 1 m 19 NA 2 m 22 10 3 NA NA 18 4 nb 19 21 4 nb 19 23 5 f 18 9 5 f 18 11

(\#fig:img-left-join-rev)Left Join (reversed)

The order of tables is swapped here, so the result is all rows from the exp table joined to any matching rows from the subject table.

left_join(exp, subject, by = "id")

id score gender age 2 10 m 22 3 18 NA NA 4 21 nb 19 4 23 nb 19 5 9 f 18 5 11 f 18 6 11 NA NA 6 12 NA NA 7 3 NA NA

(\#fig:img-right-join)Right Join

A right_join keeps all the data from the second (right) table and joins anything that matches from the first (left) table.

right_join(subject, exp, by = "id")

id gender age score 2 m 22 10 3 NA NA 18 4 nb 19 21 4 nb 19 23 5 f 18 9 5 f 18 11 6 NA NA 11 6 NA NA 12 7 NA NA 3

::: {.info data-latex=""} This table has the same information as left_join(exp, subject, by = "id"), but the columns are in a different order (left table, then right table). :::

(\#fig:img-inner-join)Inner Join

An inner_join returns all the rows that have a match in the other table.

inner_join(subject, exp, by = "id")

id gender age score 2 m 22 10 3 NA NA 18 4 nb 19 21 4 nb 19 23 5 f 18 9 5 f 18 11

(\#fig:img-full-join)Full Join

A full_join lets you join up rows in two tables while keeping all of the information from both tables. If a row doesn't have a match in the other table, the other table's column values are set to NA.

full_join(subject, exp, by = "id")

id gender age score 1 m 19 NA 2 m 22 10 3 NA NA 18 4 nb 19 21 4 nb 19 23 5 f 18 9 5 f 18 11 6 NA NA 11 6 NA NA 12 7 NA NA 3

Filtering joins act like the filter() function in that they remove rows from the data in one table based on the values in another table. The result of a filtering join will only contain rows from the left table and have the same number or fewer rows than the left table.

(\#fig:img-semi-join)Semi Join

A semi_join returns all rows from the left table where there are matching values in the right table, keeping just columns from the left table.

semi_join(subject, exp, by = "id")

id gender age 2 m 22 3 NA NA 4 nb 19 5 f 18

::: {.info data-latex=""} Unlike an inner join, a semi join will never duplicate the rows in the left table if there is more than one matching row in the right table. :::

(\#fig:img-semi-join-rev)Semi Join (Reversed)

Order matters in a semi join.

semi_join(exp, subject, by = "id")

id score 2 10 3 18 4 21 4 23 5 9 5 11

(\#fig:img-anti-join)Anti Join

An anti_join return all rows from the left table where there are not matching values in the right table, keeping just columns from the left table.

anti_join(subject, exp, by = "id")

id gender age 1 m 19

(\#fig:img-anti-join-rev)Anti Join (Reversed)

Order matters in an anti join.

anti_join(exp, subject, by = "id")

id score 6 11 6 12 7 3

Binding joins bind one table to another by adding their rows or columns together.

You can combine the rows of two tables with bind_rows.

Here we'll add subject data for subjects 6-9 and bind that to the original subject table.

new_subjects <- tibble(
  id = 6:9,
  gender = c("nb", "m", "f", "f"),
  age = c(19, 16, 20, 19)
)

bind_rows(subject, new_subjects)

id gender age 1 m 19 2 m 22 3 NA NA 4 nb 19 5 f 18 6 nb 19 7 m 16 8 f 20 9 f 19

The columns just have to have the same names, they don't have to be in the same order. Any columns that differ between the two tables will just have NA values for entries from the other table.

If a row is duplicated between the two tables (like id 5 below), the row will also be duplicated in the resulting table. If your tables have the exact same columns, you can use union() (see below) to avoid duplicates.

new_subjects <- tibble(
  id = 5:9,
  age = c(18, 19, 16, 20, 19),
  gender = c("f", "nb", "m", "f", "f"),
  new = c(1,2,3,4,5)
)

bind_rows(subject, new_subjects)

id gender age new 1 m 19 NA 2 m 22 NA 3 NA NA NA 4 nb 19 NA 5 f 18 NA 5 f 18 1 6 nb 19 2 7 m 16 3 8 f 20 4 9 f 19 5

You can merge two tables with the same number of rows using bind_cols. This is only useful if the two tables have their rows in the exact same order. The only advantage over a left join is when the tables don't have any IDs to join by and you have to rely solely on their order.

new_info <- tibble(
  colour = c("red", "orange", "yellow", "green", "blue")
)

bind_cols(subject, new_info)

id gender age colour 1 m 19 red 2 m 22 orange 3 NA NA yellow 4 nb 19 green 5 f 18 blue

Set operations compare two tables and return rows that match (intersect), are in either table (union), or are in one table but not the other (setdiff).

intersect() returns all rows in two tables that match exactly. The columns don't have to be in the same order.

new_subjects <- tibble(
  id = seq(4, 9),
  age = c(19, 18, 19, 16, 20, 19),
  gender = c("f", "f", "m", "m", "f", "f")
)

intersect(subject, new_subjects)

id gender age 5 f 18

::: {.warning data-latex=""} If you've forgotten to load dplyr or the tidyverse, base R also has an intersect() function. The error message can be confusing and looks something like this:

base::intersect(subject, new_subjects)

## list()

:::

union() returns all the rows from both tables, removing duplicate rows.

union(subject, new_subjects)

id gender age 1 m 19 2 m 22 3 NA NA 4 nb 19 5 f 18 4 f 19 6 m 19 7 m 16 8 f 20 9 f 19

::: {.warning data-latex=""} If you've forgotten to load dplyr or the tidyverse, base R also has a union() function. You usually won't get an error message, but the output won't be what you expect.

base::union(subject, new_subjects)

## [[1]]
## [1] 1 2 3 4 5
## 
## [[2]]
## [1] "m"  "m"  NA   "nb" "f" 
## 
## [[3]]
## [1] 19 22 NA 19 18
## 
## [[4]]
## [1] 4 5 6 7 8 9
## 
## [[5]]
## [1] 19 18 19 16 20 19
## 
## [[6]]
## [1] "f" "f" "m" "m" "f" "f"

:::

setdiff returns rows that are in the first table, but not in the second table.

setdiff(subject, new_subjects)

id gender age 1 m 19 2 m 22 3 NA NA 4 nb 19

Order matters for setdiff.

setdiff(new_subjects, subject)

id age gender 4 19 f 6 19 m 7 16 m 8 20 f 9 19 f

::: {.warning data-latex=""} If you've forgotten to load dplyr or the tidyverse, base R also has a setdiff() function. You usually won't get an error message, but the output might not be what you expect because the base R setdiff() expects columns to be in the same order, so id 5 here registers as different between the two tables.

base::setdiff(subject, new_subjects)

## $id
## [1] 1 2 3 4 5
## 
## $gender
## [1] "m"  "m"  NA   "nb" "f" 
## 
## $age
## [1] 19 22 NA 19 18

:::

term definition [base r](https://psyteachr.github.io/glossary/b.html#base-r){class="glossary" target="_blank"} The set of R functions that come with a basic installation of R, before you add external packages. [binding joins](https://psyteachr.github.io/glossary/b.html#binding-joins){class="glossary" target="_blank"} Joins that bind one table to another by adding their rows or columns together. [filtering joins](https://psyteachr.github.io/glossary/f.html#filtering-joins){class="glossary" target="_blank"} Joins that act like the dplyr::filter() function in that they remove rows from the data in one table based on the values in another table. [mutating joins](https://psyteachr.github.io/glossary/m.html#mutating-joins){class="glossary" target="_blank"} Joins that act like the dplyr::mutate() function in that they add new columns to one table based on values in another table. [set operations](https://psyteachr.github.io/glossary/s.html#set-operations){class="glossary" target="_blank"} Functions that compare two tables and return rows that match (intersect), are in either table (union), or are in one table but not the other (setdiff).

Chapter 13: Relational Data in R for Data Science
Cheatsheet for dplyr join functions
Lecture slides on dplyr two-table verbs

PsyTeachR/reprores-v2 documentation built on Sept. 26, 2022, 10:06 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

PsyTeachR/reprores-v2
Data Skills for Reproducible Research

docs/05-joins.md
In PsyTeachR/reprores-v2: Data Skills for Reproducible Research

Data Relations {#joins}

Learning Objectives {#ilo-joins}

Setup {#setup-joins}

Data

Mutating Joins

left_join() {#left_join}

right_join() {#right_join}

inner_join() {#inner_join}

full_join() {#full_join}

Filtering Joins

semi_join() {#semi_join}

anti_join() {#anti_join}

Binding Joins

bind_rows() {#bind_rows}

bind_cols() {#bind_cols}

Set Operations

intersect() {#intersect}

union() {#union}

setdiff() {#setdiff}

Glossary {#glossary-joins}

Further Resources {#resources-joins}

R Package Documentation

Browse R Packages

We want your feedback!

PsyTeachR/reprores-v2 Data Skills for Reproducible Research

docs/05-joins.md In PsyTeachR/reprores-v2: Data Skills for Reproducible Research

Data Relations {#joins}

Learning Objectives {#ilo-joins}

Setup {#setup-joins}

Data

Mutating Joins

left_join() {#left_join}

right_join() {#right_join}

inner_join() {#inner_join}

full_join() {#full_join}

Filtering Joins

semi_join() {#semi_join}

anti_join() {#anti_join}

Binding Joins

bind_rows() {#bind_rows}

bind_cols() {#bind_cols}

Set Operations

intersect() {#intersect}

union() {#union}

setdiff() {#setdiff}

Glossary {#glossary-joins}

Further Resources {#resources-joins}

R Package Documentation

Browse R Packages

We want your feedback!

PsyTeachR/reprores-v2
Data Skills for Reproducible Research

docs/05-joins.md
In PsyTeachR/reprores-v2: Data Skills for Reproducible Research