$\$
SDS230::download_data("x_y_join.rda")
SDS230::download_data("freshman-15.txt")
SDS230::download_data("metal_bands.rda")
SDS230::download_data("IPED_salaries_2016.rda")
# install.packages("latex2exp")
library(latex2exp)
library(dplyr)
library(ggplot2)
library(tidyr)
library(plotly)

#options(scipen=999)
knitr::opts_chunk$set(echo = TRUE)
set.seed(123)
$\$
$\$
Let's briefly discuss a few last topics on the analysis of variance.
In unbalanced data, there are different numbers of measured responses at the different factor levels. When running an ANOVA on unbalanced data, one needs to be careful because there are different ways to calculate the sum of squares for the different factors, and this can lead to different conclusions about which factors are statistically significant. Let's examine this using the IPED faculty salary data.
load("IPED_salaries_2016.rda")

# Factor A: lecturer, assistant, associate, full professor
# Factor B: liberal arts vs research university
IPED_3 <- IPED_salaries |>
  filter(rank_name %in% c("Lecturer", "Assistant", "Associate", "Full")) |>
  mutate(rank_name = droplevels(rank_name)) |>
  filter(CARNEGIE %in% c(15, 31)) |>
  # na.omit() |>
  mutate(Inst_type = dplyr::recode(CARNEGIE, "31" = "Liberal arts", "15" = "Research extensive"))

# examine properties of the data
table(IPED_3$Inst_type, IPED_3$rank_name)
$\$
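In type I sum of squares, the sums of squares are calculated sequentially: first SS(A) is taken into account, and then SS(B) is considered. In particular:

- SS(A) is computed by fitting lm(y ~ A).
- SS(B | A) is computed by fitting lm(y ~ A + B) to get SS(A, B) and then subtracting SS(A).
- SS(AB | A, B) is computed by fitting lm(y ~ A*B) to get SS(A, B, AB) and then subtracting SS(A, B).

Because the terms are added sequentially, the order in which the factors appear in the model formula can change the results when the data are unbalanced.

# Create a main effects and interaction model
fit_profs1 <- lm(salary_tot ~ Inst_type * rank_name, data = IPED_3)
fit_profs2 <- lm(salary_tot ~ rank_name * Inst_type, data = IPED_3)

# anova() uses type I (sequential) sum of squares
anova(fit_profs1)
anova(fit_profs2)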
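To see the order dependence of type I sum of squares without the salary data, here is a minimal sketch using the built-in mtcars data set. The choice of mtcars, and of cyl and am as the two factors, is an assumption made purely for illustration; the cell counts are unequal, so the design is unbalanced.

```r
# Illustrative only: mtcars is a built-in data set, not the IPED salary data.
# cyl and am are treated as two factors with unequal cell counts (unbalanced).
cars_df <- transform(mtcars, cyl = factor(cyl), am = factor(am))
table(cars_df$cyl, cars_df$am)

# Type I (sequential) sums of squares depend on the order of the terms
aov_cyl_first <- anova(lm(mpg ~ cyl * am, data = cars_df))
aov_am_first  <- anova(lm(mpg ~ am * cyl, data = cars_df))

ss_cyl_first  <- aov_cyl_first["cyl", "Sum Sq"]  # SS(cyl)
ss_cyl_second <- aov_am_first["cyl", "Sum Sq"]   # SS(cyl | am)
c(cyl_first = ss_cyl_first, cyl_second = ss_cyl_second)

# SS(cyl | am) reproduced by hand as the drop in residual sum of squares
# when cyl is added to a model that already contains am
rss_am     <- deviance(lm(mpg ~ am, data = cars_df))
rss_am_cyl <- deviance(lm(mpg ~ am + cyl, data = cars_df))
rss_am - rss_am_cyl
```

The two values printed for cyl differ, while the hand-computed drop in residual sum of squares matches the value from the model where cyl is entered second.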
$\$
In type III sum of squares, the full model is fit to get SS(A, B, AB), and then the sum of squares for each factor is determined by taking the full model SS(A, B, AB) and subtracting the sum of squares of the model fit when that factor is left out.
# with type III sum of squares, the order that variables are added does not matter
car::Anova(fit_profs1, type = "III")
car::Anova(fit_profs2, type = "III")
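Written out for a two-factor model, the type III description above amounts to three differences between the full model and a model with one term removed. The conditional-bar notation is an assumption added here for compactness; SS(·) denotes the sum of squares explained by a model containing the listed terms.

```latex
\begin{aligned}
SS(A \mid B, AB) &= SS(A, B, AB) - SS(B, AB) \\
SS(B \mid A, AB) &= SS(A, B, AB) - SS(A, AB) \\
SS(AB \mid A, B) &= SS(A, B, AB) - SS(A, B)
\end{aligned}
```

None of these differences depends on the order in which the factors are listed, which is why the two calls to car::Anova() give the same table.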
$\$
In a repeated measures ANOVA, the same cases (observational units) are measured at each factor level. For example, we might want to understand if people prefer chocolate, butterscotch or caramel sauce on their ice cream. Rather than doing a "between subjects" experiment, where we would have different people taste ice cream with chocolate, butterscotch or caramel sauce, we can instead use a "within subjects" design where each person in the experiment tastes and gives ratings for all three toppings.
To run a repeated measures ANOVA, one gives each observational unit a unique ID, and then one treats this ID as another factor in the analysis; i.e., one runs a factorial analysis where one of the factors is the observational unit ID.
The advantage of a repeated measures ANOVA is similar to the advantage of running a paired samples t-test: it can remove a lot of the variability between observational units, making it easier to see effects that are present. In fact, running a repeated measures ANOVA with a factor that only has two levels is equivalent to running a paired samples t-test. Let's explore this using the example of freshmen gaining weight from homework 4.
# load the data
freshman <- read.table("freshman-15.txt", header = TRUE) |>
  mutate(Subject = as.factor(Subject))

# run a paired t-test testing H0: mu_diff = 0 vs. HA: mu_diff != 0
# (two-sided, the t.test() default), where mu_diff = mu_end - mu_start
t.test(freshman$Terminal.Weight, freshman$Initial.Weight, paired = TRUE)

# let's transform the data to put it in a long format
freshman_long <- tidyr::pivot_longer(freshman,
                                     cols = c("Initial.Weight", "Terminal.Weight"),
                                     names_to = "time_period",
                                     values_to = "weight")

# let's run a repeated measures ANOVA
# we have the same p-value, and F = t^2
summary(aov(weight ~ time_period + Subject, data = freshman_long))
If you want more practice running a repeated measures ANOVA, you can analyze the popout attention data from homework 10, where you treat the participants in the study as a factor in the analysis.
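The equivalence between a paired t-test and a two-level repeated measures ANOVA can also be checked on a data set that ships with R. The sketch below uses the built-in sleep data (10 subjects, each measured under two conditions); using this data set is an assumption made for illustration, and anova(lm(...)) is used in place of summary(aov(...)) only because its table is easier to index.

```r
# Illustrative only: `sleep` is a built-in data set with 10 subjects (ID),
# each measured under two drug conditions (group). It is not the freshman data.
drug1 <- sleep$extra[sleep$group == 1]
drug2 <- sleep$extra[sleep$group == 2]

# paired t-test (two-sided by default); rows are ordered by ID within each group
t_res <- t.test(drug2, drug1, paired = TRUE)

# repeated measures ANOVA: the subject ID enters as an additional factor
rm_tab <- anova(lm(extra ~ group + ID, data = sleep))

t_stat <- unname(t_res$statistic)
f_stat <- rm_tab["group", "F value"]

# the squared t statistic equals the F statistic, and the p-values agree
c(t_squared = t_stat^2, F = f_stat)
c(p_t_test = t_res$p.value, p_anova = rm_tab["group", "Pr(>F)"])
```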
$\$
The package stringr allows you to manipulate character strings. Many of the functions in the stringr package have equivalents in base R, but the naming conventions and arguments in stringr are more consistent, which makes them easier to use.
All stringr functions start with str_.
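As a rough illustration of the consistency point, the sketch below puts a few base R calls next to their stringr counterparts. The example strings are hypothetical and are not taken from the class data sets.

```r
library(stringr)

# hypothetical example strings, not from the class data sets
x <- c("  Metal from the USA ", "Sweden", "USA, Sweden")

# trimming whitespace
trimws(x)
str_trim(x)

# detecting a substring: base R takes the pattern first, stringr the string first
grepl("USA", x)
str_detect(x, "USA")

# replacing the first match and all matches
sub("e", "E", x)
str_replace(x, "e", "E")
gsub("e", "E", x)
str_replace_all(x, "e", "E")
```

In stringr the string is always the first argument, which also makes the functions convenient to use with the pipe.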
$\$
Let's start by putting a string in lower case using both base R and stringr.
library(stringr)

# base R
tolower("Hey")

# stringr
str_to_lower("STOP YELLING")
$\$
We can trim and pad strings using str_trim() and str_pad().
# trim a string
str_trim(" What a mess ")

# pad a string
str_pad("Let's make it messier", 50, "right")
str_pad(1:11, 3, pad = "0")  # useful for adding leading 0's
$\$
We can concatenate strings using str_c(). The base R functions paste() and paste0() are also useful for this.
str_c("What", "a", "mess", sep = " ")

vec_words <- c("What", "a", "mess")
str_c(vec_words, collapse = " ")
$\$
We can detect whether a substring exists in a longer string using str_detect().

load("metal_bands.rda")

# How many bands come from the USA?
# On homework 1 we ignored bands that came from the USA and another country
metal_counts <- table(metal$origin)
sort(metal_counts, decreasing = TRUE)[1]

# Let's find all the bands that have some origin in the USA
sum(str_detect(metal$origin, "USA"))
$\$
We can replace the first occurrence of a string using str_replace("String", "old", "new").

We can replace all occurrences of a string using str_replace_all().

# replace an occurrence of a substring
#str_replace("String", "old", "new")
str_replace_all("One fish, two fish, red fish, blue fish", "fish", "cat")

# Example: let's download an article from the internet
#base_name <- "https://www.wsj.com/politics/elections/why-biden-touts-jobs-when-americans-care-about-prices-37dc7013?mod=Searchresults_pos14&page=1"
base_name <- "https://www.nytimes.com/2023/11/27/climate/biden-cop28-climate-dubai.html?searchResultPosition=3"
article_name <- "politics.html"
download.file(base_name, article_name)

# viewer <- getOption("viewer")
# viewer(article_name)

# read the whole article as a single string
the_article <- readChar(article_name, file.info(article_name)$size)

# replace all occurrences of a string
article2 <- str_replace_all(the_article, "Biden", "Sleepy Joe")
write(article2, "sleepy_article.html")

#viewer("sleepy_article.html")
$\$
We can use regular expressions to do a lot more complex string matching!
This topic is a bit beyond the scope of the class, but if you need to do more complex string manipulation for your final project, I recommend you look more into this topic by using Google to find tutorials, using ChatGPT to help with the code, and/or talking to me or the TAs to get additional help.
As one example of using regular expressions, let's examine how we can match strings that start or end with particular letters. In particular, we can use ^ to match the start of a string and $ to match the end of a string.
fruits <- c("apple", "pineapple", "Pear", "orange", "peach", "banana")

# detect all fruits that end with e
str_detect(fruits, "e$")

# detect all fruits that start with lower or upper case P
str_detect(fruits, "^[Pp]")
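Building on the anchors used above, here is a short sketch of a few related stringr calls. It uses str_subset() to return the matching strings themselves and regex(..., ignore_case = TRUE) as an alternative to writing [Pp]; the variable names are chosen only for this illustration.

```r
library(stringr)

fruits <- c("apple", "pineapple", "Pear", "orange", "peach", "banana")

# keep only the matching strings rather than a TRUE/FALSE vector
ends_with_e <- str_subset(fruits, "e$")
ends_with_e

# case-insensitive match on the first letter, an alternative to "^[Pp]"
starts_with_p <- str_subset(fruits, regex("^p", ignore_case = TRUE))
starts_with_p

# combine both anchors: starts with p or P and ends with e
both_anchors <- str_subset(fruits, "^[Pp].*e$")
both_anchors
```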
$\$
We can use the tidyr package to pivot data between "long" and "wide" formats.
Having data in different formats can be useful for calculating particular statistics and for visualizing data using ggplot.
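To make the two formats concrete, here is a minimal sketch on a tiny made-up data frame. The school names and salary values are hypothetical and exist only to show the shape of the data before and after pivoting.

```r
library(tidyr)

# hypothetical toy data: one row per school, one column per group
wide <- data.frame(school       = c("School A", "School B"),
                   salary_men   = c(100, 120),
                   salary_women = c(95, 118))

# long format: one row per school and group
long <- pivot_longer(wide,
                     cols = c("salary_men", "salary_women"),
                     names_to = "gender",
                     values_to = "salary")
long

# pivoting wider again recovers the original layout
wide_again <- pivot_wider(long, names_from = "gender", values_from = "salary")
wide_again
```

The long format has one row per school and group, which is the shape ggplot needs in order to map the group to an aesthetic such as color.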
$\$
Let's see if we can compare men's and women's salaries on the same plot using ggplot by first pivoting our data longer.
library(tidyr)

# get salaries for men and women
men_women <- IPED_salaries |>
  filter(rank_name == "Full") |>
  select(school, endowment, salary_men, salary_women) |>
  na.omit()

# how can we plot men and women salaries on the same plot using ggplot?
# let's pivot the data longer
men_women_long <- men_women |>
  pivot_longer(c("salary_men", "salary_women"),
               names_to = "gender",
               values_to = "salary")

# visualize as a boxplot
men_women_long |>
  ggplot(aes(gender, salary)) +
  geom_boxplot()

# visualize as a density plot
men_women_long |>
  ggplot(aes(salary, col = gender)) +
  geom_density()
Does it appear that men and women are being paid differently?
$\$
Let's pivot back wider to see if we can come up with more informative plots using ggplot.
# pivot the data wider again and mutate on the salary difference
men_women_wider <- men_women_long |>
  pivot_wider(names_from = "gender", values_from = "salary") |>
  mutate(salary_diff = salary_men - salary_women)

# visualize as a boxplot
men_women_wider |>
  ggplot(aes(salary_diff)) +
  geom_boxplot()

# visualize as a density
men_women_wider |>
  ggplot(aes(salary_diff)) +
  geom_density()
Does it appear that men and women are being paid differently?
$\$
Often data of interest is spread across multiple data frames that need to be joined together into a single data frame for further analyses. We will explore how to do this using dplyr.
Let's look at a very simple data set to explore joining data frames.
library(dplyr)

load('x_y_join.rda')

x
y
$\$
Left joins keep all rows in the left table.
Data from the right table are added when the keys match; otherwise NA is added.
Try to do a left join of the data frames x and y using their keys.
left_join(x, y, by = c("key_x" = "key_y"))
$\$
Right joins keep all rows in the right table.
Data from the left table are added when the keys match; otherwise NA is added.
Try to do a right join of the data frames x and y using their keys.
right_join(x, y, by = c("key_x" = "key_y"))
$\$
Inner joins only keep rows in which there are matches between the keys in both tables.
Try to do an inner join of the data frames x and y using their keys.
inner_join(x, y, by = c("key_x" = "key_y"))
$\$
Full joins keep all rows in both tables.
NAs are added where there are no matches.
full_join(x, y, by = c("key_x" = "key_y"))
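The four join types can be compared side by side by counting the rows each one returns. The sketch below uses two small hypothetical tables (x_toy and y_toy are made up for this illustration and are not the x and y data frames loaded above); only the key column names mirror the ones used in these notes.

```r
library(dplyr)

# hypothetical toy tables; the key column names mirror those used in the notes
x_toy <- data.frame(key_x = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y_toy <- data.frame(key_y = c(1, 2, 4), val_y = c("y1", "y2", "y4"))

keys <- c("key_x" = "key_y")

# number of rows returned by each type of join
n_rows <- c(
  left  = nrow(left_join(x_toy, y_toy, by = keys)),
  right = nrow(right_join(x_toy, y_toy, by = keys)),
  inner = nrow(inner_join(x_toy, y_toy, by = keys)),
  full  = nrow(full_join(x_toy, y_toy, by = keys))
)
n_rows

# unmatched rows are filled with NA, e.g. key 3 has no partner in y_toy
left_toy <- left_join(x_toy, y_toy, by = keys)
left_toy
```

Keys 1 and 2 appear in both tables, key 3 only in the left table, and key 4 only in the right table, so the inner join is the smallest result and the full join the largest.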
$\$
Duplicate keys are useful if there is a one-to-many relationship (duplicates are usually in the left table).
Let's look at two other tables that have duplicate keys.

x2
y2

nrow(x2)
nrow(y2)
$\$
If both tables have duplicate keys you get all possible combinations (Cartesian product). This is almost always an error! Always check the output dimension after you join a table because even if there is not a syntax error you might not get the table you are expecting!
Try doing a left join on the data frames x2 and y2 using only their first keys (i.e., key1_x and key1_y). Save the joined data frame to an object called x2_joined. Note that x2_joined has more rows than the original x2 data frame despite the fact that you did a left join! This is due to duplicate keys in both x2 and y2.

Usually a mistake was made when a data frame ends up having more rows after a left join. It is good to check how many rows a data frame has before and after a join to catch any possible errors.
# initial left data frame only has 3 rows
nrow(x2)

# left join when both the left and right tables have duplicate keys
(x2_joined <- left_join(x2, y2, by = c("key1_x" = "key1_y")))

# output now has more rows than the initial table
nrow(x2_joined)
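One way to make the before-and-after row check routine is to compute it right after the join. The sketch below uses two hypothetical tables (left_tbl and right_tbl are made up for this illustration, not x2 and y2) that both contain a duplicated key.

```r
library(dplyr)

# hypothetical toy tables with duplicate keys on both sides
left_tbl  <- data.frame(key = c(1, 1, 2), val_left  = c("a", "b", "c"))
right_tbl <- data.frame(key = c(1, 1, 2), val_right = c("u", "v", "w"))

# dplyr may warn about a many-to-many relationship here, depending on the version
joined <- suppressWarnings(left_join(left_tbl, right_tbl, by = "key"))

# compare row counts before and after the join
c(before = nrow(left_tbl), after = nrow(joined))

# a simple check that flags joins that changed the number of rows
rows_preserved <- nrow(joined) == nrow(left_tbl)
rows_preserved
```

Each of the two left rows with key 1 is paired with both right rows with key 1, so the joined table has more rows than the left table and the check returns FALSE.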
$\$
To deal with duplicate keys in both tables, we can join the tables using multiple keys in order to make sure that each row is uniquely specified.
Try doing a left join on the data frames x2 and y2 using both the keys. Save the joined data frame to an object called x2_joined_mult_keys. Note that x2_joined_mult_keys has the same number of rows as the original x2 data frame, which is usually what we want when we do a left join.
# initial left data frame only has 3 rows
nrow(x2)

# join the data frame using multiple keys
x2_joined_mult_keys <- left_join(x2, y2,
                                 by = c("key1_x" = "key1_y", "key2_x" = "key2_y"))

# output now only has 3 rows
nrow(x2_joined_mult_keys)
$\$
Let's look at three data frames from the NYC flights delays data set:

- flights: information on flights
- airlines: information on the airlines
- weather: information about the weather

library(nycflights13)

data(flights)
data(airlines)

names(flights)
names(airlines)

# join airlines on to the flights data frame
flights_airline <- flights |>
  left_join(airlines)

names(flights_airline)

# delays for each airline
flights_airline |>
  group_by(name) |>
  summarize(mean_delay = mean(arr_delay, na.rm = TRUE))

# let's look at the weather too
data(weather)

dim(flights)
dim(weather)

# join the flights and the weather selecting only arrival delay and time
flights_weather <- flights |>
  select(arr_delay, time_hour) |>  # ambiguous because we did not include the airport
  left_join(weather)

dim(flights_weather)

# join also including the airport location
flights_weather <- flights |>
  select(arr_delay, origin, time_hour) |>
  left_join(weather)

dim(flights_weather)

# visualize the regression line fit to the data predicting delay from wind speed
flights_weather |>
  ggplot(aes(wind_speed, arr_delay)) +
  geom_smooth(method = "lm")