$\$

SDS230::download_data("x_y_join.rda")

SDS230::download_data("metal_bands.rda")

SDS230::download_data("IPED_salaries_2016.rda")
# install.packages("latex2exp")

library(latex2exp)
library(dplyr)
library(ggplot2)
library(tidyr)
library(plotly)

#options(scipen=999)


knitr::opts_chunk$set(echo = TRUE)

set.seed(123)

$\$

Overview

$\$

Part 1: tidyr for pivoting data

We can use the tidyr package to pivot data between "long" and "wide" formats.

Having data in different formats can be useful for calculating particular statistics and for visualizing data using ggplot.
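As a minimal sketch of what "long" and "wide" mean, the toy example below pivots a small table longer and then back wider. The data frame `toy_wide` and its values are made up for illustration and are not part of the class data.

```r
library(tidyr)

# hypothetical toy data: one row per school, one salary column per group
toy_wide <- tibble::tibble(
  school       = c("A", "B"),
  salary_men   = c(100, 120),
  salary_women = c(95, 118)
)

# wide -> long: one row per school/group combination
toy_long <- toy_wide |>
  pivot_longer(c(salary_men, salary_women),
               names_to = "gender",
               values_to = "salary")

# long -> wide: back to one row per school
toy_back <- toy_long |>
  pivot_wider(names_from = "gender", values_from = "salary")

dim(toy_wide)   # 2 rows, 3 columns
dim(toy_long)   # 4 rows, 3 columns
```

The long version has one row for every school and group pair, which is the shape ggplot needs in order to map the group to an aesthetic such as color.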

$\$

Part 1.1: tidyr for pivoting data longer

Let's see if we can compare men's and women's salaries on the same plot using ggplot by first pivoting our data longer.

load("IPED_salaries_2016.rda")


library(tidyr)


# get salaries for men and women
men_women <- IPED_salaries |>
  filter(rank_name == "Full") |>
  select(school, endowment, salary_men, salary_women) |>
  na.omit()

# how can we plot men's and women's salaries on the same plot using ggplot?

# let's pivot the data longer
men_women_long <- men_women |>
  pivot_longer(c("salary_men", "salary_women"),
               names_to = "gender",
               values_to = "salary")


# visualize as a boxplot
men_women_long |>
  ggplot(aes(gender, salary)) + 
  geom_boxplot()

# visualize as a density plot
men_women_long |>
  ggplot(aes(salary, col = gender)) + 
  geom_density()

Does it appear that men and women are being paid differently?

$\$

Part 1.2: tidyr for pivoting data wider

Let's pivot back wider to see if we can come up with more informative plots using ggplot.

# pivot the data wider again and add a column with the salary difference
men_women_wider <- men_women_long |>
  pivot_wider(names_from = "gender", values_from = "salary") |>
  mutate(salary_diff = salary_men - salary_women)


# visualize as a boxplot
men_women_wider |>
  ggplot(aes(salary_diff)) + 
  geom_boxplot()


# visualize as a density
men_women_wider |>
  ggplot(aes(salary_diff)) + 
  geom_density()

Does it appear that men and women are being paid differently?

$\$

Part 2: Joining data frames

Often data of interest is spread across multiple data frames that need to be joined together into a single data frame for further analyses. We will explore how to do this using dplyr.

Let's look at a very simple data set to explore joining data frames.

library(dplyr)


load('x_y_join.rda')

x
y

$\$

Part 2.1: Left join

Left joins keep all rows in the left table.

Data from the right table is added when the keys match; otherwise NA is added.

Try to do a left join of the data frames x and y using their keys.

left_join(x, y, by = c("key_x" = "key_y"))

$\$

Part 2.2: Right join

Right joins keep all rows in the right table.

Data from the left table is added when the keys match; otherwise NA is added.

Try to do a right join of the data frames x and y using their keys.

right_join(x, y, by = c("key_x" = "key_y"))

$\$

Part 2.3: Inner join

Inner joins only keep rows whose keys match in both tables.

Try to do an inner join of the data frames x and y using their keys.

inner_join(x, y, by = c("key_x" = "key_y"))

$\$

Part 2.4: Full join

Full joins keep all rows in both tables.

NAs are added where there are no matches.

full_join(x, y, by = c("key_x" = "key_y"))
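To compare the four join types side by side, here is a small sketch on made-up tables. The names `left_tbl` and `right_tbl` and their contents are assumptions for illustration; they are not the x and y data frames loaded above.

```r
library(dplyr)

# hypothetical toy tables: key 3 is only in the left table, key 4 only in the right
left_tbl  <- tibble(key_x = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
right_tbl <- tibble(key_y = c(1, 2, 4), val_y = c("y1", "y2", "y4"))

# match key_x in the left table to key_y in the right table
keys <- c("key_x" = "key_y")

nrow(left_join(left_tbl, right_tbl, by = keys))   # 3 rows: keys 1, 2, 3
nrow(right_join(left_tbl, right_tbl, by = keys))  # 3 rows: keys 1, 2, 4
nrow(inner_join(left_tbl, right_tbl, by = keys))  # 2 rows: keys 1, 2
nrow(full_join(left_tbl, right_tbl, by = keys))   # 4 rows: keys 1, 2, 3, 4
```

Because the keys are unique in both toy tables, the row counts depend only on which keys are kept.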

$\$

Part 2.5a: Duplicate keys

Duplicate keys are useful if there is a one-to-many relationship (duplicates are usually in the left table).

Let's look at two other tables that have duplicate keys:

x2
y2

nrow(x2)
nrow(y2)

$\$

Part 2.5b: Duplicate keys

If both tables have duplicate keys, you get all possible combinations of the matching rows (a Cartesian product). This is almost always an error! Always check the dimensions of the output after you join tables, because even if there is no syntax error you might not get the table you are expecting!

Try doing a left join on the data frames x2 and y2 using only their first keys (i.e., key1_x and key1_y). Save the joined data frame to an object called x2_joined. Note that x2_joined has more rows than the original x2 data frame despite the fact that you did a left join! This is due to duplicate keys in both x2 and y2.

When a data frame ends up with more rows after a left join, it usually means a mistake was made. It is good to check how many rows a data frame has before and after a join to catch any possible errors.

# initial left data frame only has 3 rows
nrow(x2)


# left join when both the left and right tables have duplicate keys
(x2_joined <- left_join(x2, y2, by = c("key1_x" = "key1_y")))


# output now has more rows than the initial table
nrow(x2_joined)
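One way to make this check routine is sketched below: count how often each key occurs before joining, then compare row counts afterwards. The tables `orders` and `lookups` are made up for illustration. Recent versions of dplyr may also print a warning about a many-to-many relationship for a join like this one.

```r
library(dplyr)

# hypothetical toy tables with key 1 duplicated on both sides
orders  <- tibble(key = c(1, 1, 2), order_id = c("a", "b", "c"))
lookups <- tibble(key = c(1, 1, 2), label = c("p", "q", "r"))

# list the keys that occur more than once in the right table
dup_keys <- lookups |>
  count(key) |>
  filter(n > 1)

dup_keys

# the join runs, but each key-1 row on the left matches two rows on the right
joined <- left_join(orders, lookups, by = "key")

nrow(orders)   # 3
nrow(joined)   # 5: 2 x 2 rows for key 1, plus 1 row for key 2

# a simple check that fails when the left join added rows
nrow(joined) == nrow(orders)
```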

$\$

Part 2.5c: Duplicate keys

To deal with duplicate keys in both tables, we can join the tables using multiple keys in order to make sure that each row is uniquely specified.

Try doing a left join on the data frames x2 and y2 using both keys. Save the joined data frame to an object called x2_joined_mult_keys. Note that x2_joined_mult_keys has the same number of rows as the original x2 data frame, which is usually what we want when we do a left join.

# initial left data frame only has 3 rows
nrow(x2)


# join the data frame using multiple keys
x2_joined_mult_keys <- left_join(x2, y2, by = c("key1_x" = "key1_y", "key2_x" = "key2_y"))

# output now only has 3 rows
nrow(x2_joined_mult_keys)
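The same idea on a self-contained toy example, assuming made-up tables in which `key1` repeats but the pair (`key1`, `key2`) identifies each row uniquely:

```r
library(dplyr)

# hypothetical toy tables: key1 repeats, but each (key1, key2) pair is unique
left_tbl  <- tibble(key1 = c(1, 1, 2), key2 = c("a", "b", "a"), val_x = c(10, 20, 30))
right_tbl <- tibble(key1 = c(1, 1, 2), key2 = c("a", "b", "a"), val_y = c(100, 200, 300))

# join on both keys so each left row matches exactly one right row
joined_both <- left_join(left_tbl, right_tbl, by = c("key1", "key2"))

# the row count is preserved
nrow(joined_both) == nrow(left_tbl)
```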

$\$

Part 2.6: Exploring the flight delays data

Let's look at three data frames from the NYC flights delays data set:

library(nycflights13)

data(flights)
data(airlines)


names(flights)
names(airlines)


# join airlines on to the flights data frame
flights_airline <- flights |>
  left_join(airlines)

names(flights_airline)



# delays for each airline
flights_airline |> 
  group_by(name) |> 
  summarize(mean_delay = mean(arr_delay, na.rm = TRUE))




# let's look at the weather too
data(weather)

dim(flights)
dim(weather)


# join the flights and the weather selecting only arrival delay and time 
flights_weather <- flights |>
  select(arr_delay, time_hour) |>    # ambiguous because did not include the airport
  left_join(weather)

dim(flights_weather)



# join also including the airport location 
flights_weather <- flights |>
  select(arr_delay, origin, time_hour) |>   
  left_join(weather)

dim(flights_weather)



# visualize the regression line predicting arrival delay from wind speed
flights_weather |>
  ggplot(aes(wind_speed, arr_delay)) +
  geom_smooth(method = "lm")

$\$

Part 3: Interactive applications using shiny

Shiny is an R package that makes it easy to build interactive web applications. These applications allow users to change analysis/visualization parameters through a graphical user interface.

To learn more about shiny see:

  1. This tutorial: https://shiny.posit.co/r/getstarted/shiny-basics/lesson1/index.html
  2. Search on the web for more tutorials!

Below is an example of a simple shiny app to show how the code works.

# include the shiny package
library(shiny)


# generate 1000 random points to display in a histogram
random_data <- rnorm(1000)


# 1. The function to create the user interface
ui <- fluidPage(


  # create a slider input to select the number of bins in a histogram 
  sliderInput(inputId = "num",  
              label = "Choose a number",  
              value = 25, min = 1, max = 100), 


  # create a plot to show the histogram 
  plotOutput("my_plot") 


) # closing parenthesis for the UI  






# 2. The function to create the server
server <- function(input, output) {

  # The code that updates the plot of the histogram when the user input changes
  output$my_plot <- renderPlot({

    hist(random_data, breaks = input$num) 

  })

}   # closing brace for the server function 






# 3. Putting UI and the server together to run
shinyApp(ui = ui, server = server) 


emeyers/SDS230 documentation built on Jan. 18, 2024, 1:01 a.m.