In altaf-ali/tidydata_tutorial: Data Manipulation in R

Merging Datasets

Let's start by loading the tidyverse package

library(tidyverse)

library(gridExtra)
library(RColorBrewer)

Clear everything to make sure there's nothing leftover in our environment

rm(list = ls())

source("R/draw_table.R")

Next, we load three datasets of universities, cities, and states.

universities <- read_csv("https://raw.githubusercontent.com/altaf-ali/tidydata_tutorial/master/data/universities.csv")
cities <- read_csv("https://raw.githubusercontent.com/altaf-ali/tidydata_tutorial/master/data/cities.csv")
states <- read_csv("https://raw.githubusercontent.com/altaf-ali/tidydata_tutorial/master/data/states.csv")

Let's see how we can merge the universities dataset with the cities dataset.

grid.arrange(draw_table(universities, "universities"), draw_table(cities, "cities"), ncol = 2)

Left Join

universities %>%
  left_join(cities, by = "city")

universities %>%
  left_join(cities, by = "city") %>% 
  draw_table()

Right Join

universities %>%
  right_join(cities, by = "city")

universities %>%
  right_join(cities, by = "city") %>% 
  draw_table()

Inner Join

universities %>%
  inner_join(cities, by = "city")

universities %>%
  inner_join(cities, by = "city") %>% 
  draw_table()

Full Join

universities %>%
  full_join(cities, by = "city")

universities %>%
  full_join(cities, by = "city") %>% 
  draw_table()

Different Column Names

In the previous example both our datasets included a column named city. But what if the names of the columns in the two datasets were not the same? For example, let's take a look at the states table:

states %>% 
  draw_table("states")

What if we were to merge the cities dataset with states?

grid.arrange(draw_table(cities, "cities"), draw_table(states, "states"), ncol = 2)

One option would be to rename the columns so their names would match, but you don't really need to do that. You can simply tell the join functions the mapping between the different names.

cities %>%
  left_join(states, by = c("state" = "statename"))

In the above example, we're telling left_join() to merge using the state column from the cities data frame and statename column from the states data frame.

left_join(cities, states, by = c("state" = "statename")) %>%
  draw_table()

Exercise

Load the following datasets:

r presidents <- read_csv("https://raw.githubusercontent.com/altaf-ali/tidydata_tutorial/master/data/presidents.csv") presidents_home <- read_csv("https://raw.githubusercontent.com/altaf-ali/tidydata_tutorial/master/data/presidents_home.csv")

The datasets include names of U.S. presidents:

r presidents %>% draw_table("presidents")

r presidents_home %>% draw_table("presidents_home")
Merge the two datasets so that it ONLY includes observations that exist in BOTH the datasets. There should be no missing values or NA in the merged table. The results should match the following:

r inner_join(presidents, presidents_home, by = c("First" = "GivenName", "Middle", "Last" = "Surname")) %>% draw_table()
Merge the two datasets so that it includes ALL the observations from both the datasets. Some TookOffice, LeftOffice and HomeState values will be NA and that's ok. The results should match the following:

r full_join(presidents, presidents_home, by = c("First" = "GivenName", "Middle", "Last" = "Surname")) %>% draw_table()
Merge the two datasets so that ALL observations from the presidents datasets are included. Some HomeState values will be NA and that's ok. The results should match the following:

r left_join(presidents, presidents_home, by = c("First" = "GivenName", "Middle", "Last" = "Surname")) %>% draw_table()
Merge the two datasets so that ALL observations from the presidents_home datasets are included. Some TookOffice and LeftOffice values will be NA and that's ok. The results should match the following:

r right_join(presidents, presidents_home, by = c("First" = "GivenName", "Middle", "Last" = "Surname")) %>% draw_table()