In emeyers/SDS230: Tools for the class Data Exploration and Analysis

$\$

SDS230::download_data("x_y_join.rda")

SDS230::download_data("freshman-15.txt")

SDS230::download_data("IPED_salaries_2016.rda")

# install.packages("latex2exp")

library(latex2exp)
library(dplyr)
library(ggplot2)
library(tidyr)
library(plotly)

#options(scipen=999)


knitr::opts_chunk$set(echo = TRUE)

set.seed(123)

$\$

Overview

A few additional topics on the ANOVA
Joining data frames
Scrollytelling with Closeread

$\$

Part 1: A few additional topics on the ANOVA

Let's briefly discuss a few last topics on the analysis of variance.

Part 1.1: Examening unbalanced data

In an unbalanced data, there are different numbers of measured responses at the different variable levels. When running an ANOVA on unbalanced data, one needs to be careful because there are different ways to calculate the sum of squares for the different factors, and this can lead to different results about which factors are statistically significant. Let's examine this using the IPED faculty salary data.

load("IPED_salaries_2016.rda")

# Factor A: lecturer, assistant, associate, full professor
# Factor B: liberal arts vs research university 

IPED_3 <- IPED_salaries |>
  filter(rank_name %in% c("Lecturer", "Assistant", "Associate", "Full")) |>
  mutate(rank_name  = droplevels(rank_name)) |>
  filter(CARNEGIE %in% c(15, 31)) |>
 # na.omit() |>
  mutate(Inst_type = dplyr::recode(CARNEGIE, "31" = "Liberal arts", "15" = "Research extensive"))


# examine properties of the data 
table(IPED_3$Inst_type, IPED_3$rank_name)

$\$

Part 1.2: Examening unbalanced data - Type I sum of squares

In type I sum of squares, the sum of squares are calculated sequentially, where first SSA is taken into account, and then SSB is consider. In particular:

Factor A sum of squares is: SS(A) from fitting lm(y ~ A)
Factor B sum of squares is: SS(B|A) from fitting lm(y ~ A + B) and subtracting this from SS(A)
The interaction AB sum of squares is: SS(A, B, AB) - SS(A, B); i.e., the model is fit will lm(y ~ A*B) and then SS(A, B) is subtracted.

# Create a main effects and interaction model
fit_profs1 <- lm(salary_tot ~ Inst_type * rank_name, data = IPED_3)
fit_profs2 <- lm(salary_tot ~ rank_name * Inst_type, data = IPED_3)

anova(fit_profs1)
anova(fit_profs2)

$\$

Part 1.3: Examening unbalanced data - Type III sum of squares

In type III sum of squares, the sum of squares the full model is fit SS(A, B, AB) and then the sum of squares for each factor is determined by taking the full model SS(A, B, AB) and subtracting out the fit when a given factor is missing.

Factor A sum of squares is: SS(A, B, AB) - SS(B, AB)
Factor B sum of squares is: SS(A, B, AB) - SS(A, AB)
The interaction AB sum of squares is: SS(A, B, AB) - SS(A, B)

# type III sum of squares the order that variables are added does not matter
car::Anova(fit_profs1, type = "III")
car::Anova(fit_profs2, type = "III")

$\$

Part 1.4: Repeated measures ANOVA

In a repeated measures ANOVA, the same case/observational units are measured at each factor level. For example, we might want to understand if people prefer chocolate, butterscotch or caramel sauce on their ice cream. Rather than doing a "between subjects" experiment, where we would have different people taste ice cream with chocolate, butterscotch or caramel sauce, instead we can use a "within subjects" design where the each person in the experiment tastes and gives ratings for all three toppings.

To run a repeated measures ANOVA, one gives each observational unit a unique ID, and then one treats this ID as another factor in the analysis; i.e., one runs a factorial analysis where one of the factors is the observational unit ID.

An advantage of repeated measures ANOVA is similar to the advantage to running a paired samples t-test, namely it can reduce a lot of the between observational unit variability making it easier to see effects that are present. In fact, running a repeated measures ANOVA with a factor that only has two levels is equivalent to running a paired samples t-test. Let's explore this using the example of Freshman gaining weight from homework 4.

# load the data
freshman <- read.table("freshman-15.txt", header = TRUE) |>
  mutate(Subject = as.factor(Subject))


# run a paired t-test testing H0: mu_diff = 0 vs. HA: mu_diff > 0, 
#   where mu_diff = mu_end_i - mu_start_i 

t.test(freshman$Terminal.Weight, freshman$Initial.Weight, paired = TRUE)



# let's transform to put it in a long format
freshman_long <- tidyr::pivot_longer(freshman, 
                                     cols = c("Initial.Weight", "Terminal.Weight"),
                                     names_to = "time_period", 
                                     values_to = "weight")


# let's run run a repeated measures ANOVA 
# we have the same p-value, and F = t^2
summary(aov(weight ~ time_period + Subject, data = freshman_long))

If you want more practice running a repeated measures ANOVA, you can analyze the popout attention data from homework 10, where you treat the participants in the study as a factor in the analysis.

$\$

Part 2: Joining data frames

Often data of interest is spread across multiple data frames that need to be joined together into a single data frame for further analyses. We will explore how to do this using dplyr.

Let's look at a very simple data set to explore joining data frames.

library(dplyr)


load('x_y_join.rda')

x
y

$\$

Part 2.1: Left join

Left joins keep all rows in the left table.

Data from right table added when there is the key matches, otherwise NA as added.

Try to do a left join of the data frames x and y using their keys.

left_join(x, y, by = c("key_x" = "key_y"))

$\$

Part 2.2: Right join

Right joins keep all rows in the right table.

Data from left table added when there is the key matches, otherwise NA as added.

Try to do a right join of the data frames x and y using their keys.

right_join(x, y, by = c("key_x" = "key_y"))

$\$

Part 2.3: Inner join

Inner joins only keep rows in which there are matches between the keys in both tables

Try to do an inner join of the data frames x and y using their keys.

inner_join(x, y, by = c("key_x" = "key_y"))

$\$

Part 2.4: Full join

Full joins keep all rows in both table.

NAs are added where there are no matches.

full_join(x, y, by = c("key_x" = "key_y"))

$\$

Part 2.5a: Duplicate keys

Duplicate keys are useful if there is a one-to-many relationship (duplicates are usually in the left table).

Let's look at two other tables that have duplicate keys

x2
y2

nrow(x2)
nrow(y2)

$\$

Part 2.5b: Duplicate keys

If both tables have duplicate keys you get all possible combinations (Cartesian product). This is almost always an error! Always check the output dimension after you join a table because even if there is not a syntax error you might not get the table you are expecting!

Try doing a left join on the data frames x2 and y2 using only their first keys (i.e., key1_x and key1_y). Save the joined data frame to an object called x2_joined. Note that x2_joined has more rows than the original x2 data frame despite the fact that you did a left join! This is due to duplicate keys in both x2 and y2.

Usually a mistake was made when a data frame ends up having more rows after a left join. It is good to check how many rows a data frame has before and after a join to catch any possible errors.

# initial left data frame only has 3 rows
nrow(x2)


# left join when both the left and right tables have duplicate keys
(x2_joined <- left_join(x2, y2, by = c("key1_x" = "key1_y")))


# output now has more rows than the initial table
nrow(x2_joined)

$\$

Part 2.5c: Duplicate keys

To deal with duplicate keys in both tables, we can join the tables using multiple keys in order to make sure that each row is uniquely specified.

Try doing a left join on the data frames x2 and y2 using both the keys. Save the joined data frame to an object called x2_joined_mult_keys. Note that x2_joined_mult_keys has the same number of rows as the original x2 data frame which is usually what we want when we do a left join.

# initial left data frame only has 3 rows
nrow(x2)


# join the data frame using multiple keys
x2_joined_mult_keys <- left_join(x2, y2, c("key1_x" = "key1_y", "key2_x" = "key2_y"))

# output now only has 3 rows
nrow(x2_joined_mult_keys)

$\$

Part 2.6: Exploring the flight delays data

Let's look at three data frames from the NYC flights delays data set:

flights: information on flights
airlines: information on the airlines
weather: information about the weather

library(nycflights13)

data(flights)
data(airlines)


names(flights)
names(airlines)


# join airlines on to the flights data frame
flights_airline <- flights |>
  left_join(airlines)

names(flights_airline)



# delays for each airline
flights_airline |> 
  group_by(name) |> 
  summarize(mean_delay = mean(arr_delay, na.rm = TRUE))




# let's look at the weather too
data(weather)

dim(flights)
dim(weather)


# join the flights and the weather selecting only arrival delay and time 
flights_weather <- flights |>
  select(arr_delay, time_hour) |>    # ambiguous because did not include the airport
  left_join(weather)

dim(flights_weather)



# join also including the airport location 
flights_weather <- flights |>
  select(arr_delay, origin, time_hour) |>   
  left_join(weather)

dim(flights_weather)



# visualize the regression line to the data predicting delay from wind speed
flights_weather |>
  ggplot(aes(wind_speed, arr_delay)) +
  geom_smooth(method = "lm")

$\$

Part 3: Scrollytelling with Closeread and Quarto

Let's now explore creating scrollytelling documents using Closeread. It could be interesting to turn your final project into one of these documents to show off what you have learned in the class to others.

To install close read, please run the following command from the Terminal tab (which is next to the Console table at the bottom left of RStudio): quarto add qmd-lab/closeread

Please run the code below which download a Quarto document that you can use to practice creating a Closeread document.

SDS230::download_any_file("class_code/closeread.qmd")


# The version with the answers is here:
# SDS230::download_any_file("class_code/closeread_answers.qmd")


# If you can't get Closeread to install by running quarto add qmd-lab/closeread
# from the terminal, you can download the files you need by uncommenting and 
# running the following code:
#
# SDS230::download_any_file("class_code/_extensions.zip")
# unzip("_extensions.zip")

# SDS230::download_image("waldo.jpg")

emeyers/SDS230 documentation built on Feb. 6, 2025, 4:55 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

emeyers/SDS230
Tools for the class Data Exploration and Analysis

In emeyers/SDS230: Tools for the class Data Exploration and Analysis

Overview

Part 1: A few additional topics on the ANOVA

Part 1.1: Examening unbalanced data

Part 1.2: Examening unbalanced data - Type I sum of squares

Part 1.3: Examening unbalanced data - Type III sum of squares

Part 1.4: Repeated measures ANOVA

Part 2: Joining data frames

Part 2.1: Left join

Part 2.2: Right join

Part 2.3: Inner join

Part 2.4: Full join

Part 2.5a: Duplicate keys

Part 2.5b: Duplicate keys

Part 2.5c: Duplicate keys

Part 2.6: Exploring the flight delays data

Part 3: Scrollytelling with Closeread and Quarto

R Package Documentation

Browse R Packages

We want your feedback!

emeyers/SDS230 Tools for the class Data Exploration and Analysis

In emeyers/SDS230: Tools for the class Data Exploration and Analysis

Overview

Part 1: A few additional topics on the ANOVA

Part 1.1: Examening unbalanced data

Part 1.2: Examening unbalanced data - Type I sum of squares

Part 1.3: Examening unbalanced data - Type III sum of squares

Part 1.4: Repeated measures ANOVA

Part 2: Joining data frames

Part 2.1: Left join

Part 2.2: Right join

Part 2.3: Inner join

Part 2.4: Full join

Part 2.5a: Duplicate keys

Part 2.5b: Duplicate keys

Part 2.5c: Duplicate keys

Part 2.6: Exploring the flight delays data

Part 3: Scrollytelling with Closeread and Quarto

R Package Documentation

Browse R Packages

We want your feedback!

emeyers/SDS230
Tools for the class Data Exploration and Analysis