Relational Data

library(tidyverse)

Data analysis rarely involves only a single table of data. Typically you have multiple tables of data, which combined, answer the questions at hand. Collectively, multiple tables of data are called relational data because it is the relations, not just the individual data sets, that are important. Relations are always defined between a pair of tables. All other relations are built up from this simple idea. A key is a variable, or set of variables, that uniquely identifies an observation and links tables together.

To work with relational data you need verbs that work with pairs of tables. There are three families of verbs designed to work with relational data:

The most common place to find relational data is in a relational database management system (or RDBMS), a term that encompasses almost all modern databases. If you've used a database before, you've almost certainly used SQL. A key is a variable, or set of variables, that uniquely identifies an observation.

knitr::include_graphics("images/relational/join-setup.png")
x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     3, "x3"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     4, "y3"
)
x
y

Mutating Joins

A mutating join allows you to combine variables from two tables. It first matches observations by their keys, then copies across variables from one table to the other. Like mutate(), the join functions add variables to the right, so if you have a lot of variables already, the new variables won’t get printed out.

Inner Joins

knitr::include_graphics("images/relational/join-inner.png")
inner_join(x, y)

``{block2, type='rmdtip'} By default the **join** functions will use all variables in common as the key. Best practice is to explicitly state the variables you wish to join by using theby` parameter.

**Outer Joins**

-  **left join** keeps all observations in x append columns from y
-  **right join** keeps all observations in y append columns from x
-  **full join** keeps all observations in x and y

```r
knitr::include_graphics("images/relational/join-outer.png")
left_join(x, y, by = "key")
right_join(x, y, by = "key")
full_join(x, y, by = "key")

```{block type='rmdnote'} left_join and right_join are interchangeable with the arguments reversed. The only difference is the order of the columns in the resulting data frame.

```r
left_join(x, y, by = "key")
right_join(y, x, by = "key")

Duplicate keys

There are many cases where a single key does not uniquely identify an observation.

One-to-Many

knitr::include_graphics("images/relational/join-one-to-many.png")

Many-to-Many

knitr::include_graphics("images/relational/join-many-to-many.png")

Multi-Column Keys

By default all variables that are in both tables. To specify a subset of variables for the key use the by parameter.

x <- tribble(
  ~id, ~yq, ~question,
     1, "2018Q1", "Yes",
     1, "2018Q2", "No",
     2, "2018Q1", "Yes",
     3, "2018Q1", "Yes",
     3, "2018Q2", "Yes"
)

y <- tribble(
  ~id, ~yq, ~question,
     1, "2018Q2", "Method1",
     2, "2018Q1", "Method2",
     2, "2018Q2", "Method2",
     3, "2018Q2", "Method2",
     4, "2018Q1", "Method1"
)

# full
full_join(x, y)                          # probably not what we want
full_join(x, y, by = c("id", "yq"))

Duplicate Non-Key Columns

By default if there are variables in both data frames that are not used as a key the common variables names will be appended with ".x" and ".y" corresponding the the first and second argument. If you want to change these use the suffix parameter.

full_join(x, y, by = c("id", "yq"), suffix = c("_1", "_2"))

Different Key Column Names

A common occurrence is for the tables you wish to join to have different variable names. You can either change the variable names so the do match with rename() or you can use the by parameter to indicate which variable name in x matches the variable name in y.

x <- tribble(
  ~id, ~yq, ~question,
     1, "2018Q1", "Yes",
     1, "2018Q2", "No",
     2, "2018Q1", "Yes",
     3, "2018Q1", "Yes",
     3, "2018Q2", "Yes"
)

y2 <- tribble(
  ~id, ~yearquarter, ~question,
     1, "2018Q2", "Method1",
     2, "2018Q1", "Method2",
     2, "2018Q2", "Method2",
     3, "2018Q2", "Method2",
     4, "2018Q1", "Method1"
)

# inner
full_join(x, y2, by = c("id", "yq" = "yearquarter"), suffix = c("_1", "_2"))

Filtering Joins

Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables. There are two types:

knitr::include_graphics("images/relational/join-semi.png")
knitr::include_graphics("images/relational/join-anti.png")

Semi-joins are useful for matching filtered summary tables back to the original rows. The inverse of a semi-join is an anti-join. An anti-join keeps the rows that don’t have a match. Anti-joins are useful for diagnosing join mismatches.

orig <- tribble(
  ~id, ~yq, ~question,
     1, "2018Q1", "Yes",
     1, "2018Q2", "No",
     2, "2018Q1", "Yes",
     3, "2018Q1", "Yes",
     3, "2018Q2", "Yes"
)

val <- tribble(
  ~id, ~yq, ~question,
     1, "2018Q1", "No",
     1, "2018Q2", "No",
     2, "2018Q1", "Yes",
     3, "2018Q3", "No",
     3, "2018Q2", "Yes"
)

semi_join(orig, val, by = c("id", "yq", "question"))
anti_join(orig, val, by = c("id", "yq", "question"))

Set Operations

The final type of two-table verb are the set operations. These operations work with a complete row, comparing the values of every variable. These expect the x and y inputs to have the same variables, in the same order, and treat the observations like sets:

Given this simple data:

df1 <- tribble(
  ~x, ~y,
   1,  1,
   2,  1
)
df2 <- tribble(
  ~x, ~y,
   1,  1,
   1,  2
)

The four possibilities are:

intersect(df1, df2)

# Note that we get 3 rows, not 4
union(df1, df2)

setdiff(df1, df2)

setdiff(df2, df1)

Exercises

  1. Continue the exercises from last week. (See email)
  2. We have a survey in which the survey, questions, answers, and respondents demographics are stored in separate data sets (see below code). Generate a single analysis data set which contains every persons, with their demographic information, and their response to all questions. If they do not have a response to a question mark it as "No Response".
subjects <- tribble(
  ~id, ~gender, ~age,
  101L, "M", 25L,
  102L, "F", 40L,
  103L, "F", 18L,
  104L, "M", 57L
)

answers <- tribble(
  ~id, ~qid, ~answer,
  101L,  1L, "Yes",
  101L,  2L, "No",
  102L,  1L, "No",
  102L,  2L, "Yes",
  102L,  4L, "Yes",
  103L,  1L, "No",
  103L,  2L, "No",
  103L,  3L, "No"
)

survey <- tribble(
  ~qid, ~question,
  1L  , "Have you ever used Product A?",
  2L  , "Have you ever used Product B?",
  3L  , "Have you ever used Product C?",
  4L  , "Have you ever used Product D?"
)

The resulting data set should look like

crossing(id = subjects$id, qid = survey$qid) %>%
  left_join(., survey, by = "qid") %>%
  left_join(., answers, by = c("id", "qid")) %>%
  mutate(answer = replace_na(answer, "No Response")) %>%
  inner_join(., subjects, by = "id")
  1. From the above data set what is the mean age of those answering "Yes", "No", "No Response", for each equstion? Repeat by gender


DavisBrian/rclassnotes documentation built on May 17, 2019, 8:19 a.m.