library(learnr) knitr::opts_chunk$set(echo = FALSE)
Practice data manipulation, including: - summarize() - mutate() - select() - filter() - arrange()
Run the code below to load the packages and data for these exercises.
library(ds) library(tidyverse) data("hsls_clean")
The summarize function is used to make a new table which reports a summary statistic (like the average) for your data.
What is the average GPA for students in the HSLS data?
#The code should follow this logic: Name of Data Frame %>% Manipulation Function (Sub-Manipulatin Function(variable,na.rm=TRUE))
#Fill in the blanks with the "summarize" and "mean" functions to make a new variable that is the mean of GPA hsls_clean%>% _____(_____(gpa,na.rm=TRUE))
#Solution hsls_clean%>% summarize(mean(gpa,na.rm=TRUE))
You can use summarize to obtain other summary statistics by swapping out the sub-function. For example, switch the code from mean to median to obtain the median.
What is the median GPA?
#Fill in the blanks with the "summarize" and "median" functions to make a new variable that is the mean of GPA hsls_clean%>% _____(_____(gpa,na.rm=TRUE))
#Solution hsls_clean%>% summarize(median(gpa,na.rm=TRUE))
If we need multiple summary statistics for one variable, we can combine them into one command.
What is the average, minimum value, maximum value, and standard deviation of SES?
#The sub-function for the average is: mean #The sub-function for the minimum value is: min #The sub-function for the maximum value is: max #The sub-function for the standard deviation is: sd
#Fill in the blanks with mean, min, max, and sd hsls_clean%>% summarize(____(ses,na.rm=TRUE), ____(ses,na.rm=TRUE), ____(ses,na.rm=TRUE), ____(ses,na.rm=TRUE))
#Solution hsls_clean%>% summarize(mean(ses,na.rm=TRUE), min(ses,na.rm=TRUE), max(ses,na.rm=TRUE), sd(ses,na.rm=TRUE))
We can find a summary statistic for a specific group of students by combining the summarize command with the group_by() command.
We already found the average GPA for all students. What is the average GPA for student athletes and non-athletes (indicated by the sports variable)?
#The code should follow this logic: Name of Data Frame %>% group_by(Groups variable) %>% Manipulation Function (Sub-Manipulatin Function(variable,na.rm=TRUE))
#Fill in the blanks with the "group_by" function hsls_clean%>% _____(sports)%>% summarize(mean(gpa,na.rm=TRUE))
#Solution hsls_clean%>% group_by(sports)%>% summarize(mean(gpa,na.rm=TRUE))
The mutate function can give you the same information as summarize, but mutate adds a new variable to your data instead of giving you a table. Why would you need a new variable? Sometimes, you will want to run an analysis only on groups that have a specific characteristic. For example, you might make a new variable for the average GPA for students within a school so you could group students by if they attend a school with a higher or lower average GPA.
The coding logic is the same for mutate as it is for summarize. However, mutate will not give you the table like summarize did. To see that it works, we will have to check our data.
In the Summarize excersize, we calculated the average GPA. This code produces a nice table:
hsls_clean%>% summarize(mean(gpa,na.rm=TRUE))
To make a variable with this value, change summarize to mutate, and name the new variable:
#Make a new variable equal to the average GPA hsls_clean<-hsls_clean%>% mutate(mean_gpa=mean(gpa,na.rm=TRUE)) #List the variables so we can see our new variable, "mean_gpa" ls(hsls_clean) #Then, we can use summarize commands our new variable: #Note: "first" lists the first value of a variable. Since we made a variable for the average of all observations, every observation has the same value in the mean_gpa column. hsls_clean%>% summarize(first(mean_gpa))
Make a new variable for the median ses of all observations.
#Fill in the blanks with the "mutate" and "median" functions hsls_clean<-hsls_clean%>% _____(med_ses=_____(ses,na.rm=TRUE)) #add the ls function to see that the new variable has been created ls(hsls_clean)
#Solution hsls_clean<-hsls_clean%>% mutate(med_ses=median(ses,na.rm=TRUE)) #add the ls function to see that the new variable has been created ls(hsls_clean)
Just like with summarize, we can calculate a summary statistic within groups. The resulting new variable will have a different value for each group.
Make a new variable for the maximum GPA by family income category.
#Fill in the blanks with the "group_by" function hsls_clean<-hsls_clean%>% _____(family_income)%>% mutate(max_gpa_inc=max(gpa,na.rm=TRUE)) #add the ls function to see that the new variable has been created ls(hsls_clean)
#Solution hsls_clean<-hsls_clean%>% group_by(family_income)%>% mutate(max_gpa_inc=max(gpa,na.rm=TRUE)) #add the ls function to see that the new variable has been created ls(hsls_clean)
We can also use mutate to perform mathematical calculations on a variable and make a new variable. For example, we might need to re-scale a variable, square a variable, or multiply a variable by another number.
Make a new variable equal to GPA times 2.
#Fill in the blanks with the "mutate" function hsls_clean<-hsls_clean%>% ____(gpa2=gpa*2) #add the ls function to see that the new variable has been created ls(hsls_clean)
#Solution hsls_clean<-hsls_clean%>% mutate(gpa2=gpa*2) #add the ls function to see that the new variable has been created ls(hsls_clean)
Select allows you to subset a command or data based on the variable names.
In the most basic form, select will print out the values of the selected variables:
hsls_clean %>% select(student_id, music, art, drama)
This can be useful if you want to make a new data frame containing only a subset of data. Make a new data frame called arts that includes the variables student_id, music, art, and drama.
#Fill in the blanks with the new data frame name and the "select" function ____<-hsls_clean %>% ____(student_id, music, art, drama) #add the ls function to see that the new data frame has been created ls(arts)
#Solution arts<-hsls_clean %>% select(student_id, music, art, drama) #add the ls function to see that the new data frame has been created ls(arts)
A useful operator within the select command is the exclamation point. The ! operator selects everything except for the listed variables.
You have decided that you won't be selecting students based on other_club, so you can remove that variable from the data. Make a new data frame called df with every variable except other_club.
#The ! operator negates a selection df<-hsls_clean %>% ____(!(other_club)) #add the ls function to see that the new data frame has been created ls(df)
#Solution df<-hsls_clean %>% select(!(other_club)) #add the ls function to see that the new data frame has been created ls(df)
Select can be combined with other functions.
Make a new data frame called df that includes student id, gpa, and gpa squared.
#Select student_id and gpa #Then, mutate gpa to equal gpa*gpa
#Select student_id and gpa #Then, mutate gpa to equal gpa*gpa df<-hsls_clean %>% ____(student_id, gpa) %>% ____(gpa_sq = gpa*gpa) #add the ls function to see that the new data frame has been created ls(df)
#Solution df<-hsls_clean %>% select(student_id, gpa) %>% mutate(gpa_sq = gpa*gpa) #add the ls function to see that the new data frame has been created ls(df)
The filter function selects observations based on their values. Whereas select picks columns, filter picks rows.
The filter command alone will print the data for which the filter condition is met.
Filter the data to just view students who took a college tour.
#Use the filter function hsls_clean %>% _____(college_tour=="Yes")
#Solution hsls_clean %>% filter(college_tour=="Yes")
Filter the data to just view students who have a GPA above 3.5.
#Use the filter function hsls_clean %>% _____(gpa>3.5)
#Solution hsls_clean %>% filter(gpa>3.5)
Just like with select, we can make a new data frame with only the observations that meet the filter condition.
Make a new data frame called hispanic that only includes students who identified as Hispanic.
#Use the filter function hispanic<-hsls_clean %>% _____(hispanic=="Yes") #add the ls function to see that the new data frame has been created ls(hispanic)
#Solution hispanic<-hsls_clean %>% filter(hispanic=="Yes") #add the ls function to see that the new data frame has been created ls(hispanic)
Show observations where the student participated in music, art, and drama, but not sports. For sports, use the != operator to indicate not equal to.
#Use the filter function hsls_clean %>% ____(music=="Yes",art=="Yes",drama=="Yes",sports!="Yes")
#Solution hsls_clean %>% filter(music=="Yes",art=="Yes",drama=="Yes",sports!="Yes")
Filter out students who have a below-average GPA.
#Use the "filter" function and the "mean" operator hsls_clean %>% ____(gpa>____(gpa, na.rm = TRUE))
#Solution hsls_clean %>% filter(gpa>mean(gpa, na.rm = TRUE))
The example above compared students to the average GPA for all students. We might want to compare students within another grouping, like income. This will compare a student's GPA with other students who are in the same income category
Filter out students who have a below-average GPA for their family income level.
#Use the "group_by" and "filter" functions hsls_clean %>% _____(family_income)%>% ____(gpa>mean(gpa, na.rm = TRUE))
#Solution hsls_clean %>% group_by(family_income)%>% filter(gpa>mean(gpa, na.rm = TRUE))
Arrange changes the order of the rows. This can be helpful if you want to sort the observations so by a value of one of the variables.
Use arrange to order the observations from highest to lowest GPA.
#Use the "arrange" function hsls_clean %>% _____(gpa)
#Solution hsls_clean %>% arrange(gpa)
You can also arrange by multiple variables. The data will be arranged by the variable listed first first, and then by the second variable within categories of the first variable. Arrange by sports participation and income.
#Use the "arrange" function hsls_clean %>% _____(sports, family_income)
#Solution hsls_clean %>% arrange(sports, family_income)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.