library(tidyverse)
Data analysis rarely involves only a single table of data. Typically you have multiple tables of data, which combined, answer the questions at hand. Collectively, multiple tables of data are called relational data because it is the relations, not just the individual data sets, that are important. Relations are always defined between a pair of tables. All other relations are built up from this simple idea. A key is a variable, or set of variables, that uniquely identifies an observation and links tables together.
To work with relational data you need verbs that work with pairs of tables. There are three families of verbs designed to work with relational data:
Mutating joins, which add new variables to one data frame from matching observations in another.
Filtering joins, which filter observations from one data frame based on whether or not they match an observation in the other table.
Set operations, which treat observations as if they were set elements.
The most common place to find relational data is in a relational database management system (or RDBMS), a term that encompasses almost all modern databases. If you've used a database before, you've almost certainly used SQL. A key is a variable, or set of variables, that uniquely identifies an observation.
knitr::include_graphics("images/relational/join-setup.png")
x <- tribble( ~key, ~val_x, 1, "x1", 2, "x2", 3, "x3" ) y <- tribble( ~key, ~val_y, 1, "y1", 2, "y2", 4, "y3" )
x y
A mutating join allows you to combine variables from two tables. It first matches observations by their keys, then copies across variables from one table to the other. Like mutate(), the join functions add variables to the right, so if you have a lot of variables already, the new variables won’t get printed out.
Inner Joins
knitr::include_graphics("images/relational/join-inner.png")
inner_join(x, y)
``{block2, type='rmdtip'}
By default the **join** functions will use all variables in common as the key. Best practice is to explicitly state the variables you wish to join by using the
by` parameter.
**Outer Joins** - **left join** keeps all observations in x append columns from y - **right join** keeps all observations in y append columns from x - **full join** keeps all observations in x and y ```r knitr::include_graphics("images/relational/join-outer.png")
left_join(x, y, by = "key") right_join(x, y, by = "key") full_join(x, y, by = "key")
```{block type='rmdnote'} left_join and right_join are interchangeable with the arguments reversed. The only difference is the order of the columns in the resulting data frame.
```r left_join(x, y, by = "key") right_join(y, x, by = "key")
There are many cases where a single key does not uniquely identify an observation.
One-to-Many
knitr::include_graphics("images/relational/join-one-to-many.png")
Many-to-Many
knitr::include_graphics("images/relational/join-many-to-many.png")
By default all variables that are in both tables. To specify a subset of variables for the key use the by
parameter.
x <- tribble( ~id, ~yq, ~question, 1, "2018Q1", "Yes", 1, "2018Q2", "No", 2, "2018Q1", "Yes", 3, "2018Q1", "Yes", 3, "2018Q2", "Yes" ) y <- tribble( ~id, ~yq, ~question, 1, "2018Q2", "Method1", 2, "2018Q1", "Method2", 2, "2018Q2", "Method2", 3, "2018Q2", "Method2", 4, "2018Q1", "Method1" ) # full full_join(x, y) # probably not what we want full_join(x, y, by = c("id", "yq"))
By default if there are variables in both data frames that are not used as a key the common variables names will be appended with ".x"
and ".y"
corresponding the the first and second argument. If you want to change these use the suffix
parameter.
full_join(x, y, by = c("id", "yq"), suffix = c("_1", "_2"))
A common occurrence is for the tables you wish to join to have different variable names. You can either change the variable names so the do match with rename()
or you can use the by
parameter to indicate which variable name in x matches the variable name in y.
x <- tribble( ~id, ~yq, ~question, 1, "2018Q1", "Yes", 1, "2018Q2", "No", 2, "2018Q1", "Yes", 3, "2018Q1", "Yes", 3, "2018Q2", "Yes" ) y2 <- tribble( ~id, ~yearquarter, ~question, 1, "2018Q2", "Method1", 2, "2018Q1", "Method2", 2, "2018Q2", "Method2", 3, "2018Q2", "Method2", 4, "2018Q1", "Method1" ) # inner full_join(x, y2, by = c("id", "yq" = "yearquarter"), suffix = c("_1", "_2"))
Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables. There are two types:
knitr::include_graphics("images/relational/join-semi.png")
knitr::include_graphics("images/relational/join-anti.png")
Semi-joins are useful for matching filtered summary tables back to the original rows. The inverse of a semi-join is an anti-join. An anti-join keeps the rows that don’t have a match. Anti-joins are useful for diagnosing join mismatches.
orig <- tribble( ~id, ~yq, ~question, 1, "2018Q1", "Yes", 1, "2018Q2", "No", 2, "2018Q1", "Yes", 3, "2018Q1", "Yes", 3, "2018Q2", "Yes" ) val <- tribble( ~id, ~yq, ~question, 1, "2018Q1", "No", 1, "2018Q2", "No", 2, "2018Q1", "Yes", 3, "2018Q3", "No", 3, "2018Q2", "Yes" ) semi_join(orig, val, by = c("id", "yq", "question")) anti_join(orig, val, by = c("id", "yq", "question"))
The final type of two-table verb are the set operations. These operations work with a complete row, comparing the values of every variable. These expect the x
and y
inputs to have the same variables, in the same order, and treat the observations like sets:
intersect(x, y)
: return only observations in both x
and y
.union(x, y)
: return unique observations in x
and y
.setdiff(x, y)
: return observations in x
, but not in y
.Given this simple data:
df1 <- tribble( ~x, ~y, 1, 1, 2, 1 ) df2 <- tribble( ~x, ~y, 1, 1, 1, 2 )
The four possibilities are:
intersect(df1, df2) # Note that we get 3 rows, not 4 union(df1, df2) setdiff(df1, df2) setdiff(df2, df1)
subjects <- tribble( ~id, ~gender, ~age, 101L, "M", 25L, 102L, "F", 40L, 103L, "F", 18L, 104L, "M", 57L ) answers <- tribble( ~id, ~qid, ~answer, 101L, 1L, "Yes", 101L, 2L, "No", 102L, 1L, "No", 102L, 2L, "Yes", 102L, 4L, "Yes", 103L, 1L, "No", 103L, 2L, "No", 103L, 3L, "No" ) survey <- tribble( ~qid, ~question, 1L , "Have you ever used Product A?", 2L , "Have you ever used Product B?", 3L , "Have you ever used Product C?", 4L , "Have you ever used Product D?" )
The resulting data set should look like
crossing(id = subjects$id, qid = survey$qid) %>% left_join(., survey, by = "qid") %>% left_join(., answers, by = c("id", "qid")) %>% mutate(answer = replace_na(answer, "No Response")) %>% inner_join(., subjects, by = "id")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.