Create new variables

library(learnr)
library(tidyverse)
library(nycflights13)
library(Lahman)

tutorial_options(exercise.timelimit = 60)
knitr::opts_chunk$set(error = TRUE)

Welcome

In this tutorial, you will learn how to derive new variables from a data frame with mutate() and transmute(), and which kinds of functions are useful inside them.

The readings in this tutorial follow R for Data Science, section 5.5.

Setup

To practice these skills, we will use the flights data set from the nycflights13 package, which you met in Data Basics. This data frame comes from the US Bureau of Transportation Statistics and contains all 336,776 flights that departed from New York City in 2013. It is documented in ?flights.

To visualize the data, we will use the ggplot2 package that you met in Data Visualization Basics.

I've preloaded the packages for this tutorial with:

library(tidyverse) # loads dplyr, ggplot2, and others
library(nycflights13)

Add new variables with mutate()

A data set often contains information that you can use to compute new variables. mutate() helps you compute those variables. Since mutate() always adds new columns to the end of a data set, we'll start by creating a narrow data set that lets us see the new variables. (If we added new variables to flights, the new columns would run off the side of your screen, which would make them hard to see.)

select()

You can select a subset of variables by name with the select() function in dplyr. Run the code below to see the narrow data set that select() creates.

flights_sml <- select(flights, 
  arr_delay, 
  dep_delay,
  distance, 
  air_time
)
flights_sml  # print the narrow data set

mutate()

The code below creates two new variables with dplyr's mutate() function. mutate() returns a new data frame that contains the new variables appended to a copy of the original data set. Take a moment to imagine what this will look like, and then click "Run Code" to find out.

flights_sml <- select(flights, 
  arr_delay, 
  dep_delay,
  distance, 
  air_time
)
mutate(flights_sml,
  gain = arr_delay - dep_delay,
  speed = distance / air_time * 60
)

Note that when you use mutate() you can create multiple variables at once, and you can even refer to variables that are created earlier in the call to create other variables later in the call:

flights_sml <- select(flights, 
  arr_delay, 
  dep_delay,
  distance, 
  air_time
)
mutate(flights_sml,
  gain = arr_delay - dep_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)

transmute()

mutate() will always return the new variables appended to a copy of the original data. If you want to return only the new variables, use transmute(). In the code below, replace mutate() with transmute() and then spot the difference in the results.

mutate(flights,
  gain = arr_delay - dep_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)
transmute(flights,
  gain = arr_delay - dep_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)
"Excellent job! `transmute()` and `mutate()` do the same thing, but `transmute()` only returns the new variables. `mutate()` returns a copy of the original data set with the new variables appended."

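The difference is easiest to see in the column names. Below is a minimal sketch on a small made-up tibble (the data frame `toy` and its values are illustrative assumptions, not rows from flights):

```r
library(dplyr)

# Made-up data that borrows three column names from flights
toy <- tibble(
  arr_delay = c(11, 20, -18),
  dep_delay = c(2, 4, -6),
  air_time  = c(227, 160, 116)
)

with_mutate <- mutate(toy, gain = arr_delay - dep_delay, hours = air_time / 60)
with_transmute <- transmute(toy, gain = arr_delay - dep_delay, hours = air_time / 60)

names(with_mutate)      # original columns plus gain and hours
names(with_transmute)   # only gain and hours
```
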
Useful mutate functions

You can use any function inside of mutate() so long as the function is vectorised. A vectorised function takes a vector of values as input and returns a vector with the same number of values as output.
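
To make the definition concrete, here is a minimal sketch in base R (the vector `air_time` holds made-up values, not data from flights). A vectorised function such as `/` returns one value per input value, whereas a summary function such as `mean()` collapses its input to a single value.

```r
# Made-up air times in minutes (illustrative values only)
air_time <- c(227, 160, 183, 116)

# Vectorised: four values in, four values out
hours <- air_time / 60
length(hours) == length(air_time)

# A summary function: four values in, one value out
avg <- mean(air_time)
length(avg)
```

Inside mutate(), a length-one result like `mean(air_time)` is recycled to fill every row of the new column.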

Over time, I've found that several families of vectorised functions are particularly useful with mutate():
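
As an illustration, the sketch below applies a few vectorised functions of the kind used in the exercises that follow (arithmetic, modular arithmetic, logarithms, cumulative sums, and ranking) to small made-up vectors. It uses base R only, and the selection is illustrative rather than exhaustive.

```r
x <- c(517, 533, 542, 554)   # made-up clock times in HHMM format

x + 1                        # arithmetic: 518 534 543 555
x %/% 100                    # integer division: 5 5 5 5
x %% 100                     # remainder: 17 33 42 54
log2(c(1, 2, 4, 8))          # logarithms: 0 1 2 3
cumsum(c(1, 2, 3, 4))        # cumulative sum: 1 3 6 10
rank(c(10, 30, 20, 30), ties.method = "min")  # ranking: 1 3 2 3
```

Each call returns a vector as long as its input, which is what makes these functions usable inside mutate().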

Exercises

flights <- flights %>% mutate(
  dep_time = hour * 60 + minute,
  arr_time = (arr_time %/% 100) * 60 + (arr_time %% 100),
  airtime2 = arr_time - dep_time,
  dep_sched = dep_time + dep_delay
)

ggplot(flights, aes(dep_sched)) + geom_histogram(binwidth = 60)
ggplot(flights, aes(dep_sched %% 60)) + geom_histogram(binwidth = 1)
ggplot(flights, aes(air_time - airtime2)) + geom_histogram()

Exercise 1

Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they're not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.


mutate(flights, dep_time = dep_time %/% 100 * 60 + dep_time %% 100,
       sched_dep_time = sched_dep_time %/% 100 * 60 + sched_dep_time %% 100)
**Hint:** `423 %% 100` returns `23`, `423 %/% 100` returns `4`.
"Good Job!"

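The conversion can be checked by hand on a few values. This is a minimal sketch in base R; the helper name `hhmm_to_minutes` and the clock times are hypothetical, chosen only for illustration.

```r
# Hypothetical helper: convert HHMM clock times to minutes since midnight
hhmm_to_minutes <- function(hhmm) {
  (hhmm %/% 100) * 60 + hhmm %% 100
}

clock <- c(1, 517, 1230, 2359)   # made-up times: 00:01, 05:17, 12:30, 23:59
minutes <- hhmm_to_minutes(clock)
minutes                          # 1 317 750 1439
```

If a data set records midnight as 2400, this formula gives 1440; taking the result `%% 1440` maps it back to 0.
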
Exercise 2

Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? How do you explain this?

# flights <- mutate(flights, total_time = _____________)
# flight_times <- select(flights, air_time, total_time)
# filter(flight_times, air_time != total_time)
flights <- mutate(flights, total_time = arr_time - dep_time)
flight_times <- select(flights, air_time, total_time)
filter(flight_times, air_time != total_time)
"Good Job! It doesn't make sense to do math with `arr_time` and `dep_time` until you convert the values to minutes past midnight (as you did with `dep_time` and `sched_dep_time` in the previous exercise)."

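A single made-up flight shows why subtracting raw HHMM values misleads. The sketch below uses base R; the times and the helper `to_min` are assumptions for illustration.

```r
dep <- 923    # made-up departure, 09:23
arr <- 1004   # made-up arrival, 10:04

# Subtracting the raw HHMM values treats them as ordinary numbers
naive <- arr - dep

# Converting to minutes since midnight first gives the elapsed time
to_min <- function(hhmm) (hhmm %/% 100) * 60 + hhmm %% 100
elapsed <- to_min(arr) - to_min(dep)

c(naive = naive, elapsed = elapsed)   # naive = 81, elapsed = 41
```

Even after conversion, the difference may not match air_time exactly, for example when a flight crosses time zones or lands after midnight.
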
Exercise 3

Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?
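
One way to test an expected relationship is to try it on a few values first. The sketch below uses base R and made-up flights (the numbers and the helper `to_min` are assumptions, not rows from flights) to check whether the difference between actual and scheduled departure, in minutes, equals the delay.

```r
to_min <- function(hhmm) (hhmm %/% 100) * 60 + hhmm %% 100

# Made-up flights: scheduled and actual departure in HHMM, delay in minutes
sched_dep_time <- c(515, 600, 1255)
dep_time       <- c(517, 554, 1310)
dep_delay      <- c(2, -6, 15)

matches <- to_min(dep_time) - to_min(sched_dep_time) == dep_delay
matches   # TRUE TRUE TRUE
```

A flight delayed past midnight would break this simple check, because its converted dep_time is smaller than its converted sched_dep_time.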


Exercise 4

Find the 10 most delayed flights (dep_delay) using a ranking function. How do you want to handle ties? Carefully read the documentation for min_rank().


?min_rank
flights <- mutate(flights, delay_rank = min_rank(desc(dep_delay)))
filter(flights, delay_rank <= 10)
**Hint:** Once you compute a rank, you can filter the data set based on the ranks.
"Excellent! It's not possible to choose exactly 10 flights unless you pick an arbitrary method to choose between ties."

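The effect of ties can be seen on a short made-up vector. This sketch uses base R's `rank()` with `ties.method = "min"`, which the dplyr documentation describes as the equivalent of min_rank(); it assumes there are no missing values, and the delays are invented for illustration.

```r
# Made-up departure delays in minutes; 90 appears twice
dep_delay <- c(5, 120, 90, 45, 90, 200)

# Rank from largest to smallest; tied values share the smallest rank
delay_rank <- rank(-dep_delay, ties.method = "min")
delay_rank                       # 6 2 3 5 3 1

# Asking for the top 3 returns four flights because of the tie at rank 3
top3 <- dep_delay[delay_rank <= 3]
top3                             # 120 90 90 200
```
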
Exercise 5

What does 1:3 + 1:10 return? Why?


1:3 + 1:10
**Hint:** Remember R's recycling rules.
"Nice! R recycles 1:3 to match the length of 1:10: three full copies plus the first element of a fourth. Since 10 is not a multiple of 3, R also returns a warning message."

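Recycling can be made explicit with `rep_len()`. The sketch below, in base R, builds the stretched vector by hand and confirms it gives the same sums as the implicit version.

```r
short <- 1:3
long  <- 1:10

# Stretch the shorter vector to length 10, as recycling does implicitly
recycled <- rep_len(short, length(long))
recycled                         # 1 2 3 1 2 3 1 2 3 1

# The explicit version gives the same sums, without the warning
explicit <- recycled + long
implicit <- suppressWarnings(short + long)
identical(explicit, implicit)    # TRUE
```
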
Exercise 6

What trigonometric functions does R provide? Hint: look up the help page for Trig.
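
As a starting point, here is a short sketch of a few functions from the Trig help page in base R. They work in radians, so the example converts degrees explicitly; the angle values are arbitrary.

```r
angle <- pi / 6                # 30 degrees, expressed in radians

s <- sin(angle)                # about 0.5
cos(angle)
tan(angle)

asin(0.5)                      # inverse functions return radians
a <- atan2(1, 1)               # two-argument arctangent, about pi / 4

# To work in degrees, convert to radians first
degrees <- 30
sin(degrees * pi / 180)
```

All of these are vectorised, so they can be used inside mutate().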




learnr documentation built on March 26, 2020, 7:45 p.m.