library(learnr) library(tidyverse) library(tutorialExtras) library(gradethis) library(nycflights13) library(tutorial.helpers) library(ggcheck) gradethis_setup() knitr::opts_chunk$set(echo = FALSE) options( tutorial.exercise.timelimit = 60 #tutorial.storage = "local" ) freq_dest <- flights %>% group_by(dest) %>% summarize(num_flights = n())
grade_server("grade")
question_text("Name:", answer_fn(function(value){ if(length(value) >= 1 ) { return(mark_as(TRUE)) } return(mark_as(FALSE) ) }), correct = "submitted", allow_retry = FALSE )
Complete this tutorial while reading Sections 3.4 - 3.9 of the textbook. Each question allows 3 'free' attempts. After the third attempt a 10% deduction occurs per attempt.
You can check your current grade and the number of attempts you are on in the "View grade" section. You can click this button as often and as many times as you would like as you progress through the tutorial. Before submitting, make sure your grade is as expected.
group_by
, mutate
,arrange
, and a few other functions.group_by()
rowsIf you would like to compute summary statistics based on a categorical variable instead of for the entire data set you can use the group_by()
function.
In the previous tutorial we calculated the average and standard deviation of temp
erature in the weather
data frame with the following code:
summary_temp <- weather %>% summarize(mean_temp = mean(temp, na.rm = TRUE), std_dev = sd(temp, na.rm = TRUE)) summary_temp
Let's say instead we wanted the average temp
erature for each month
.
Copy the above code and before summarize
add in group_by(month) %>%
.
Change the name of this dataset to be summary_monthly_temp
instead of summary_temp
.
summary_monthly_temp <- weather %>% ...(...) %>% summarize(mean_temp = mean(temp, na.rm = TRUE), std_dev = sd(temp, na.rm = TRUE)) summary_monthly_temp
summary_monthly_temp <- weather %>% group_by(month) %>% summarize(mean_temp = mean(temp, na.rm = TRUE), std_dev = sd(temp, na.rm = TRUE)) summary_monthly_temp
grade_this_code()
Grouping the weather
dataset by month
and then applying the summarize()
function yields a data frame that displays the mean and standard deviation temperature split by the 12 months of the year.
Let's consider another example using the diamonds
data frame included in the ggplot2
package.
Type diamonds
in the code chunk and click "Submit Answer" to print the data frame.
...
diamonds
grade_this_code()
Observe that the first line of the output reads # A tibble: 53,940 x 10
. This is an example of meta-data, in this case the number of observations/rows and variables/columns in diamonds. The actual data itself are the subsequent table of values.
Now let’s pipe the diamonds data frame into group_by(cut)
.
diamonds %>% group_by(...)
diamonds %>% group_by(cut)
grade_this_code()
Observe that now there is additional meta-data: # Groups: cut [5]
indicating that the grouping structure meta-data has been set based on the 5 possible values AKA levels of the categorical variable cut: "Fair"
, "Good"
, "Very Good"
, "Premium"
, "Ideal"
.
Copy the previous code and pipe on summarize(avg_price = mean(price))
at the end. Don't forget to use the pipe operator %>%
to link functions together.
diamonds %>% group_by(cut) %>% summarize(...)
diamonds %>% group_by(cut) %>% summarize(avg_price = mean(price))
grade_this_code()
Only by combining a group_by()
with another data wrangling operation, in this case summarize()
will the actual data be transformed.
If we would like to remove this group structure meta-data, we can pipe the resulting data frame into the ungroup()
function.
You are not limited to grouping by one variable! You can group by multiple variables within the same group_by function.
For example:
new_data <- data %>% group_by(var1, var2) %>% ...
mutate
existing variablesAnother common transformation of data is to create/compute new variables based on existing ones.
Using the weather
data frame from the nycflights13
package, let's convert the temp
erature variable from degrees Fahrenheit to degrees Celsius.
Start with the weather
data frame and then pipe on mutate(temp_in_C = (temp-32)/1.8)
... %>% mutate(...)
weather %>% mutate(temp_in_C = (temp-32)/1.8)
grade_this_code()
If you scroll across the output variables you will see the new variable called temp_in_C
at the end.
Notice that the data is being directly printed because we did not assign it to an object.
Copy the previous code and type weather <-
before weather
.
... <- weather %>% mutate(temp_in_C = (temp-32)/1.8)
weather <- weather %>% mutate(temp_in_C = (temp-32)/1.8)
grade_this_code()
Note that we have overwritten the original weather data frame with a new version that now includes the additional variable temp_in_C
.
It is very important that you ONLY overwrite existing data frames if you are not losing original information that you might need later.
If you are making modifications that lose original information then you should call this data frame something different, such as weather_new
or weather_celsius
.
Now print the modified data frame by copying the previous code and typing weather
on the next line.
weather <- weather %>% mutate(temp_in_C = (temp-32)/1.8) ...
weather <- weather %>% mutate(temp_in_C = (temp-32)/1.8) weather
grade_this_code()
Notice the weather
data frame now has our new variable at the end.
arrange()
and sort rowsThe dplyr
package has a function called arrange()
that we will use to sort/reorder a data frame’s rows according to the values of the specified variable.
Let’s suppose we were interested in determining the most frequent destination airports for all domestic flights departing from New York City in 2013.
We start with the flights
data set and then group_by()
dest
ination. Then we count the number of flights in each group with the n()
function within summarize()
. We called this new variable num_flights
.
Notice the n()
function never has any arguments within it. It simply counts the number of observations/rows of the data.
This dataset has been created for you and is called freq_dest
.
freq_dest <- flights %>% group_by(dest) %>% summarize(num_flights = n()) freq_dest
Instead of having freq_dest
ordered based on the destination, let's say we want to arrange based on the number of flights.
Start with freq_dest
and then pipe on arrange(num_flights)
.
... %>% arrange(...)
freq_dest %>% arrange(num_flights)
grade_this_code()
By default, the rows are sorted with the least frequent destination airports displayed first.
To switch the ordering to be descending instead of ascending we use the desc()
function, which is short for “descending”.
Copy the previous code and update the arrange()
function to instead contain desc(num_flights)
.
freq_dest %>% arrange(...(num_flights))
freq_dest %>% arrange(desc(num_flights))
grade_this_code()
ORD
(O'hare) has the most number of flights from NYC in 2013.
join
data framesAnother common data transformation task is “joining” or “merging” two different datasets.
Due to the limited time in this course we will not cover this section directly.
However, I strongly encourage you to read through this section as it is a very important task in data science.
The select()
function keeps only a subset of variables/columns. This is especially useful for organization/display.
Say you only need the carrier
and flight
variables from the flights
dataset.
Start with the flights
data and pipe on the select()
function. Within select()
include the variables carrier
and flight
.
flights %>% select(..., ...)
flights %>% select(carrier, flight)
grade_this_code()
This function makes exploring data frames with a very large number of variables easier for humans to process by restricting consideration to only those we care about.
If instead you want to remove a variable from the data frame; we can deselect is by using the -
sign.
Remove the variable year
from the flights
data frame by piping on select(-year)
.
flights %>% select(-...)
flights %>% select(-year)
grade_this_code()
Another way of selecting columns/variables is by specifying a range of columns.
Start with the flights data frame and pipe on select(month:day, arr_time:arr_delay)
.
flights %>% select(...)
flights %>% select(month:day, arr_time:arr_delay)
grade_this_code()
This new data frame kept all variables between month
and day
and all variables between arr_time
and arr_delay
, inclusive.
Recall that if you would like to use this data frame later we need to store it as a new object.
Copy the previous code, and store the data frame as flight_arr_times
. Then print the data frame by typing flight_arr_times
on the next line.
... <- flights %>% select(month:day, arr_time:arr_delay) ...
flight_arr_times <- flights %>% select(month:day, arr_time:arr_delay) flight_arr_times
grade_this_code()
The select()
function can also be used to reorder columns in combination with the everything()
helper function.
Lastly, the helper functions starts_with()
, ends_with()
, and contains()
can be used to select variables/column that match those conditions.
Another useful function is rename()
, which as you may have guessed renames one column to another name.
Start with the flights
data frame and pipe on select(contains("time"))
.
flights %>% ...(...)
flights %>% select(contains("time"))
grade_this_code()
This subsets the data frame to only include variables that contain the word "time".
Now rename dep_time
to departure_time
and arr_time
to arrival_time
by piping rename(departure_time = dep_time, arrival_time = arr_time)
onto the previous code.
flights %>% select(contains("time")) %>% rename(..., ...)
flights %>% select(contains("time")) %>% rename(departure_time = dep_time, arrival_time = arr_time)
grade_this_code()
Note that in this case we used a single =
sign within rename()
, because we are assigning a new variable name and not testing for equality.
It's a good habit to run your code before adding on additional functions as we have done because it helps you catch errors and see what your new data frame looks like.
Copy your code and store this new data frame as flights_time
. No need to print anything.
... <- flights %>% select(contains("time")) %>% rename(departure_time = dep_time, arrival_time = arr_time)
flights_time <- flights %>% select(contains("time")) %>% rename(departure_time = dep_time, arrival_time = arr_time)
grade_this_code()
There is no output because we did not print anything. In RStudio we could look at this new data frame in the Environment pane.
We can return observations with maximum or minimum values of a variable using the slice_max()
or slice_min()
.
Consider this example:
freq_dest %>% slice_max(n = 5, order_by = num_flights)
slice_max
means that we are keeping the largest values of a specific variable, in this case num_flights
. By specifying n = 5
we are keeping the largest 5
flights.
In your Console, run ?slice
(If that produces an error try ?dplyr::slice
)
question("Which of the following are valid slice_*() functions? Select all that apply.", answer("slice_rand()"), answer("slice_n()"), answer("slice_tail()", correct=TRUE), answer("slice_sample()", correct=TRUE), answer("slice_head()", correct = TRUE), allow_retry = TRUE, random_answer_order = TRUE)
grade_button_ui(id = "grade")
Once you are finished:
grade_print_ui("grade")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.