library(learnr) library(tutorial.helpers) library(tidyverse) library(nycflights13) knitr::opts_chunk$set(echo = FALSE) options(tutorial.exercise.timelimit = 600, tutorial.storage = "local") df <- tribble( ~x, ~y, 1, 3, 5, 2, 7, NA, ) x <- c(1, 2, 3, 4, NA) ranktypes <- tibble(x = x) numbers <- tibble(id = 1:10) times_visited <- tibble( time = c(0, 1, 2, 3, 5, 10, 12, 15, 17, 19, 20, 27, 28, 30) ) repetition <- tibble( x = c("a", "a", "a", "b", "c", "c", "d", "e", "a", "a", "b", "b"), y = c(1, 2, 3, 2, 4, 1, 3, 9, 4, 8, 10, 199) )
This tutorial covers Chapter 13: Numbers from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. We will use two core Tidyverse packages, readr and dplyr. Key functions in this tutorial include parse_double()
for parsing numbers directly from strings, parse_number()
for removing useless characters and parsing numbers from strings, count()
which counts the unique values of one or more variables, pmin() and pmax()
which take one or more vectors and return the element-wise minima or maxima of these vectors,
round()
which rounds values in its first argument to the specified number of decimal places, and min_rank()
which ranks the values of a vector, giving tied values the same rank.
In most cases, you’ll get numbers already recorded in one of R’s numeric types: integer or double. In some cases, however, you’ll encounter them as strings, possibly because you’ve created them by pivoting from column headers or because something has gone wrong in your data import process.
This chapter mostly uses functions from base R, which are available without loading any packages. But we still need tidyverse functions like mutate()
and filter()
.
Use library()
to load in the tidyverse package.
library(...)
library(tidyverse)
The readr package is one of the nine core packages in the Tidyverse. It provides two useful functions for parsing strings into numbers: parse_double()
and parse_number()
.
In many of the exercises in this tutorial, we will provide the code for creating an example object. You will then just add code for working with that object. Don't delete the object creation code! Below, we create an object x
.
On the next line, run parse_double()
on x
.
x <- c("1.2", "5.6", "1e3")
x <- c("1.2", "5.6", "1e3") parse_double(...)
x <- c("1.2", "5.6", "1e3") parse_double(x)
This should return an output of: #> [1] 1.2 5.6 1000.0
. parse_double()
works well on regular numbers. You can use parse_integer()
if all the inputs are integers.
Use parse_number()
when the string contains non-numeric text that you want to ignore. This is particularly useful for currency data and percentages.
Use parse_number()
with y
as the argument to ignore this non-numeric text.
y <- c("$1,234", "USD 3,513", "59%")
y <- c("$1,234", "USD 3,513", "59%") parse_number(y)
y <- c("$1,234", "USD 3,513", "59%") parse_number(y)
The result is #> [1] 1234 3513 59
. Note how parse_number()
returns only the underlying number, ignoring all the other characters in the inputs.
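As a quick side-by-side, the sketch below runs both parsers on the two vectors used in the exercises above. The variable names plain and messy are only illustrative.

```r
library(readr)

# Strings that are already clean numbers: parse_double() is enough.
plain <- c("1.2", "5.6", "1e3")
parsed_plain <- parse_double(plain)

# Strings with currency symbols, grouping commas, and a percent sign:
# parse_number() drops the surrounding characters and keeps the number.
messy <- c("$1,234", "USD 3,513", "59%")
parsed_messy <- parse_number(messy)

parsed_plain
parsed_messy
```

Both results are ordinary double vectors, so they can go straight into arithmetic or into mutate().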
It’s surprising how much data science you can do with just counts and a little basic arithmetic, so the dplyr package strives to make counting as easy as possible with count()
. This function is great for quick exploration and checks during analysis.
Use library()
to load the nycflights13 package.
library(...)
library(nycflights13)
This package contains information about all flights that departed from the three NYC airports (EWR, JFK, and LGA) to destinations in the United States, Puerto Rico, and the American Virgin Islands in 2013.
Type in flights
and hit "Run Code."
flights
flights
Run ?flights
to pull up the help page in order to learn more about the data.
Let's count the number of flights to each destination. Pipe flights
to the count()
function. Within the call to count()
, put dest
.
flights |> count(dest)
flights |> count(dest)
We usually put count()
on a single line because it’s usually used at the console for a quick check that a calculation is working as expected.
If you want to see the most common values, reuse the previous pipe and add the argument sort = TRUE
to the function count()
.
flights |> count(dest, sort = ...)
flights |> count(dest, sort = TRUE)
To see all the values, you can use |> View()
or |> print(n = Inf)
.
Pipe flights
to summarize()
, within which you set the argument n
equal to n()
.
flights |> ...( n = n())
flights |> summarize( n = n())
We are using two totally different n
's here. The first n
is the name of a variable which we are creating. We could use any variable name we want. The second n
--- in n()
--- is the name of a function which returns the number of rows, the same total you would get from count()
.
Using the same code, add another argument to summarize()
: delay = mean(arr_delay, na.rm = TRUE)
.
flights |> summarize( n = n(), delay = ...(arr_delay, ... = TRUE))
flights |> summarize( n = n(), delay = mean(arr_delay, na.rm = TRUE))
n()
is a special summary function that doesn’t take any arguments and instead accesses information about the “current” group. This means that it only works inside dplyr verbs like summarize()
.
Using the same code, add another argument to summarize()
: carriers = n_distinct(carrier)
. Don't forget to separate out each of the arguments in summarize()
using a comma.
flights |> summarize( n = n(), delay = mean(arr_delay, na.rm = TRUE), carriers = ...(carrier))
flights |> summarize( n = n(), delay = mean(arr_delay, na.rm = TRUE), carriers = n_distinct(carrier))
n_distinct(x)
counts the number of distinct (unique) values of one or more variables. It is part of the same family of functions as count()
and n()
.
Using the same code, add another argument to summarize()
: .by = dest
.
flights |> summarize( n = n(), delay = mean(arr_delay, na.rm = TRUE), carriers = n_distinct(carrier), ... = dest)
flights |> summarize( n = n(), delay = mean(arr_delay, na.rm = TRUE), carriers = n_distinct(carrier), .by = dest)
The .by
argument modifies summarize()
so that it does the same calculation for each value of dest
. Until now, it has been showing us the results for the entire data set.
Continue the pipe by adding arrange(desc(carriers))
.
flights |> summarize( n = n(), delay = mean(arr_delay, na.rm = TRUE), carriers = n_distinct(carrier), .by = dest) |> ...(desc(...))
flights |> summarize( n = n(), delay = mean(arr_delay, na.rm = TRUE), carriers = n_distinct(carrier), .by = dest) |> arrange(desc(carriers))
Keep track of variable names. carrier
is the variable in flights
which tells us the individual carrier
for a specific flight. carriers
is a variable we created within summarize()
. If we did not create it, it would not be available for use by desc()
and arrange()
.
Start a new pipe by starting with flights
again and then adding summarize(miles = sum(distance), .by = tailnum)
.
flights |> summarize(miles = sum(distance), .by = tailnum)
This is the distance that each plane flew. In a sense, we have count()
'd those miles by using sum()
.
We can accomplish the same goal by using the wt
argument to count()
. Pipe flights
to count(tailnum, wt = distance)
.
flights |> count(tailnum, wt = distance)
The answers are the same, although to confirm that claim you would need to compare the resulting tibbles since the two commands output the data in different orders.
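One way to check that claim is sketched below on a small made-up tibble (trips, plane, and distance are illustrative names, not part of flights). It sorts the summarize() result so that both outputs are in the same order, then compares them column by column.

```r
library(dplyr)

# Toy data (made up for illustration): three planes, five trips.
trips <- tibble(
  plane    = c("B", "A", "B", "C", "A"),
  distance = c(100, 200, 300, 400, 500)
)

# Total distance per plane with summarize(); with .by the rows come back
# in order of first appearance (B, A, C), so sort them by plane.
by_sum <- trips |>
  summarize(miles = sum(distance), .by = plane) |>
  arrange(plane)

# The same totals with a weighted count(); the output is sorted by plane.
by_count <- trips |>
  count(plane, wt = distance, name = "miles")

# Compare column by column once both are in the same order.
same_planes <- identical(by_sum$plane, by_count$plane)
same_miles  <- isTRUE(all.equal(by_sum$miles, by_count$miles))
c(same_planes, same_miles)
```

The same pattern should carry over to flights by replacing plane with tailnum.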
Pipe flights
to summarize(n_cancelled = sum(is.na(dep_time)), .by = dest)
.
flights |> summarize(n_cancelled = sum(is.na(dep_time)), .by = dest)
This code counts the missing values by combining sum()
and is.na()
. In the flights dataset, a missing dep_time marks a flight that was cancelled, so this is the number of cancelled flights to each destination.
This section will go over transforming numeric vectors with the function mutate()
and using various mathematical methods to make new columns.
Operations like flights |> mutate(air_time = air_time / 60)
have vectors of different lengths on the left and right hand sides. There are 336,776 individual numbers on the left of the /
but only one number on the right. This would normally be a problem but R handles these mismatched lengths by recycling, or repeating, the short vector.
By the end of this section we will be making a plot that looks like this:
plots <- flights |> mutate(hour = sched_dep_time %/% 100, prop_cancelled = mean(is.na(dep_time)), n = n(), .by = hour) |> filter(hour > 1) |> select(hour, prop_cancelled, n) |> distinct() |> ggplot(aes(x = hour, y = prop_cancelled)) + geom_line() + geom_point(aes(size = n)) plots
On a new line, divide the vector x
by 5.
x <- c(1, 2, 10, 20)
x <- c(1, 2, 10, 20) x / ...
x <- c(1, 2, 10, 20) x / 5
This operation is shorthand for x / c(5, 5, 5, 5)
. Generally, you only want to recycle single numbers (i.e. vectors of length 1), but R will recycle any shorter length vector.
Multiply the vector x
with the values c(1, 2, 3)
x <- c(1, 2, 10, 20)
x <- c(1, 2, 10, 20) x * c(...)
x <- c(1, 2, 10, 20) x * c(1,2,3)
R usually (but not always) gives you a warning if the length of the longer vector isn’t a multiple of the length of the shorter.
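The two recycling cases from the exercises can be put side by side. This is a minimal base R sketch; the warned flag and the handler around the multiplication are only there to record that R issued a warning, and are not part of the tutorial.

```r
x <- c(1, 2, 10, 20)

# Recycling a length-1 vector: every element is divided by 5.
same_as_long_form <- identical(x / 5, x / c(5, 5, 5, 5))

# Recycling a length-3 vector against a length-4 vector: R reuses the
# first element of the shorter vector for the fourth position and warns
# because 4 is not a multiple of 3. The handler records the warning.
warned <- FALSE
product <- withCallingHandlers(
  x * c(1, 2, 3),
  warning = function(w) {
    warned <<- TRUE
    invokeRestart("muffleWarning")
  }
)

same_as_long_form
product
warned
```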
Create a new pipeline. Pipe flights
to the filter()
function. And add an argument that checks if month
equals c(1, 2)
.
flights |> filter(month == c(...))
flights |> filter(month == c(1, 2))
The code runs without error, but it doesn’t return what you want. Because of the recycling rules it finds flights in odd numbered rows that departed in January and flights in even numbered rows that departed in February. And unfortunately there’s no warning because flights
has an even number of rows.
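The sketch below reproduces the problem on a tiny vector and contrasts it with %in%, assuming the intent was to keep rows from either month. The month values are made up for illustration.

```r
# Toy month column (illustrative values, not taken from flights).
month <- c(1, 1, 2, 2, 3, 3)

# `==` recycles c(1, 2) to 1, 2, 1, 2, 1, 2 and compares position by position.
recycled <- month == c(1, 2)

# `%in%` asks, for each element, whether it appears anywhere in c(1, 2).
membership <- month %in% c(1, 2)

recycled
membership
```

Only the first and fourth elements pass the recycled comparison, while %in% keeps all four January and February values.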
Most arithmetic functions work with pairs of variables. Two closely related functions are pmin()
and pmax()
, which when given two or more variables will return the smallest or largest value in each row.
Let's utilize a premade dataframe df
. Type in df
and hit "Run Code."
df
df
Create a new pipeline. Pipe df
to mutate()
. Within the call to mutate()
, create a new variable called min
and set it to pmin(x, y)
.
df |> mutate( min = ...)
df |> mutate( min = pmin(x, y) )
Because y is missing in the third row, pmin() returns NA
for that row.
Using the previous pipe, add the argument na.rm
to the pmin()
function call within the mutate()
function and set it to TRUE
.
df |> mutate( min = pmin(x, y, na.rm = ...) )
df |> mutate( min = pmin(x, y, na.rm = TRUE) )
If na.rm
is FALSE
an NA
value in any of the arguments will cause a value of NA
to be returned, otherwise NA
values are ignored.
Using the previous pipeline, within the call to mutate()
, create a new variable max
and set it to pmax(x, y)
.
df |> mutate( min = pmin(x, y, na.rm = TRUE), max = ...)
df |> mutate( min = pmin(x, y, na.rm = TRUE), max = pmax(x, y) )
Because y is missing in the third row and we have not yet set na.rm in pmax(), max is NA
for that row.
Using the previous pipe, add the argument na.rm
to the pmax()
function call within the mutate()
function and set it to TRUE
.
df |> mutate( min = pmin(x, y, na.rm = TRUE), max = pmax(x, y, na.rm = ...) )
df |> mutate( min = pmin(x, y, na.rm = TRUE), max = pmax(x, y, na.rm = TRUE) )
Note that these are different from the summary functions min()
and max()
which take multiple observations and return a single value. You can tell that you’ve used the wrong form when all the minimums and all the maximums have the same value.
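To make the difference concrete, here is a base R sketch using the same values as the df tibble, held in plain vectors. The result names are illustrative.

```r
x <- c(1, 5, 7)
y <- c(3, 2, NA)

# Parallel versions: one result per position (per "row").
row_min      <- pmin(x, y)                # NA in the third position
row_min_narm <- pmin(x, y, na.rm = TRUE)  # the NA is ignored
row_max_narm <- pmax(x, y, na.rm = TRUE)

# Summary versions: one result for everything that was passed in.
overall_min <- min(x, y, na.rm = TRUE)
overall_max <- max(x, y, na.rm = TRUE)

row_min
row_min_narm
row_max_narm
c(overall_min, overall_max)
```

If min() were used inside mutate() by mistake, the single value 1 would be recycled into every row, which is the symptom described above.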
Divide the vector z
by 3 using integer division (%/%
).
z <- 1:10
z <- 1:10 z ... 3
z <- 1:10 z %/% 3
We can also find the remainder using %%
.
Compute the remainder of the vector z
divided by 3
by typing in z %% 3
and hitting "Run Code."
z <- 1:10
z <- 1:10 z ... 3
z <- 1:10 z %% 3
Create a new pipeline. Pipe flights
to the function mutate()
. Within the mutate()
function, create a new variable called hour
and set it to sched_dep_time %/% 100
.
flights |> mutate( hour = ...)
flights |> mutate( hour = sched_dep_time %/% 100)
You can read more about other arithmetic operators by running ?Arithmetic in the Console.
Using the previous code, within the mutate()
function, create a new variable called minute
and set it to sched_dep_time %% 100
.
flights |> mutate( hour = sched_dep_time %/% 100, minute = ...)
flights |> mutate( hour = sched_dep_time %/% 100, minute = sched_dep_time %% 100)
Modular arithmetic is handy for the flights dataset, because we can use it to unpack the sched_dep_time
variable into hour
and minute
.
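The same unpacking works on a plain vector. In this sketch the four times are made-up values in the HHMM format that sched_dep_time uses.

```r
# Scheduled departure times in HHMM format (illustrative values).
sched_dep_time <- c(5, 515, 1345, 2359)

# Integer division by 100 drops the last two digits: the hour.
hour <- sched_dep_time %/% 100

# The remainder after dividing by 100 keeps the last two digits: the minute.
minute <- sched_dep_time %% 100

hour
minute

# Putting the pieces back together recovers the original values.
round_trip <- identical(hour * 100 + minute, sched_dep_time)
round_trip
```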
Using your previous code, within the mutate()
function, add an argument called .keep
and set it to "used"
.
flights |> mutate( hour = sched_dep_time %/% 100, minute = sched_dep_time %% 100, .keep = ... )
flights |> mutate( hour = sched_dep_time %/% 100, minute = sched_dep_time %% 100, .keep = "used" )
Read more about the .keep
argument in ?mutate
.
Let's use all of the stuff we've learnt to create a graph. This is what the graph is supposed to look like.
plots
Create a new pipeline. Pipe flights
to the function mutate()
. Within the call to mutate()
, create a new variable called hour
and set it to sched_dep_time %/% 100
.
flights |> mutate(... = sched_dep_time %/% ...)
flights |> mutate(hour = sched_dep_time %/% 100)
We can combine that with a trick using mean(is.na(x))
to see how the proportion of cancelled flights varies over the course of the day.
Copy your previous code. Within the call to mutate()
, create a new variable called prop_cancelled
and set it to mean(is.na(dep_time))
.
flights |> mutate(hour = sched_dep_time %/% 100, prop_cancelled = ...(is.na(...)))
flights |> mutate(hour = sched_dep_time %/% 100, prop_cancelled = mean(is.na(dep_time)))
is.na(x)
works with any type of vector and returns TRUE for missing values and FALSE for everything else, which we can use to find all the rows with a missing dep_time
.
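The trick works because R treats TRUE as 1 and FALSE as 0 in arithmetic. A small base R sketch, with made-up departure times:

```r
# Five departure times, two of them missing (illustrative values).
dep_time <- c(517, NA, 542, NA, 554)

missing <- is.na(dep_time)

# sum() of a logical vector counts the TRUEs.
n_missing <- sum(missing)

# mean() of a logical vector gives the proportion of TRUEs.
prop_missing <- mean(missing)

missing
n_missing
prop_missing
```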
Copy your previous code. Within the call to mutate()
, create another variable called n
and set it to n()
.
flights |> mutate(hour = sched_dep_time %/% 100, prop_cancelled = mean(is.na(dep_time)), n = ...())
flights |> mutate(hour = sched_dep_time %/% 100, prop_cancelled = mean(is.na(dep_time)), n = n())
n()
is a special summary function that doesn’t take any arguments and instead accesses information about the “current” group. This means that it only works inside dplyr verbs.
Copy your previous code. Within the call to mutate()
, add an argument called .by
and set it to hour
, so that we can group by hour.
flights |> mutate(hour = sched_dep_time %/% 100, prop_cancelled = mean(is.na(dep_time)), n = n(), .by = ...)
flights |> mutate(hour = sched_dep_time %/% 100, prop_cancelled = mean(is.na(dep_time)), n = n(), .by = hour)
Copy your previous code. Add the function filter()
to the pipeline. Within the call to filter()
create a new argument that checks if the variable hour
is > 1
.
... |> filter(... > 1)
flights |> mutate(hour = sched_dep_time %/% 100, prop_cancelled = mean(is.na(dep_time)), n = n(), .by = hour) |> filter(hour > 1)
Continue the pipe to select()
, with the columns hour
, prop_cancelled
, and n
as arguments.
... |> ...(hour, ..., n)
flights |> mutate(hour = sched_dep_time %/% 100, prop_cancelled = mean(is.na(dep_time)), n = n(), .by = hour) |> filter(hour > 1) |> select(hour, prop_cancelled, n)
Now continue the pipe to distinct()
.
... |> ...()
flights |> mutate(hour = sched_dep_time %/% 100, prop_cancelled = mean(is.na(dep_time)), n = n(), .by = hour) |> filter(hour > 1) |> select(hour, prop_cancelled, n) |> distinct()
Recall that the distinct()
function removes any duplicate rows.
Copy your previous code. Add the function ggplot()
to the pipeline. Within the function ggplot()
, map x
to hour
and y
to prop_cancelled
using aes()
.
... |> ggplot(aes(x = ..., y = ...))
flights |> mutate(hour = sched_dep_time %/% 100, prop_cancelled = mean(is.na(dep_time)), n = n(), .by = hour) |> filter(hour > 1) |> select(hour, prop_cancelled, n) |> distinct() |> ggplot(aes(x = hour, y = prop_cancelled))
The plot has no data because we have not yet provided a geom.
Using your previous code, add the function geom_line()
to the pipeline.
... + ...()
flights |> mutate(hour = sched_dep_time %/% 100, prop_cancelled = mean(is.na(dep_time)), n = n(), .by = hour) |> filter(hour > 1) |> select(hour, prop_cancelled, n) |> distinct() |> ggplot(aes(x = hour, y = prop_cancelled)) + geom_line()
Don't forget to use +
instead of |>
to separate out the component parts of your ggplot()
object.
Using your previous code, add the function geom_point()
to the pipeline.
... + geom_...()
flights |> mutate(hour = sched_dep_time %/% 100, prop_cancelled = mean(is.na(dep_time)), n = n(), .by = hour) |> filter(hour > 1) |> select(hour, prop_cancelled, n) |> distinct() |> ggplot(aes(x = hour, y = prop_cancelled)) + geom_line() + geom_point()
Copy your previous code. Within the function geom_point()
map size
to n
.
... + geom_point(aes(size = ...))
flights |> mutate(hour = sched_dep_time %/% 100, prop_cancelled = mean(is.na(dep_time)), n = n(), .by = hour) |> filter(hour > 1) |> select(hour, prop_cancelled, n) |> distinct() |> ggplot(aes(x = hour, y = prop_cancelled)) + geom_line() + geom_point(aes(size = n))
What we have here is a line plot with scheduled departure hour on the x-axis and the proportion of cancelled flights on the y-axis. Cancellations seem to accumulate over the course of the day until 8pm; very late flights are much less likely to be cancelled.
Reminder: Your plot should look somewhat like this.
plots
This section will allow you to practice with many other number functions with many different uses.
Run ?log
in the Console and copy paste the Description below. (Don't worry about formatting)
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 6)
Logarithms are an incredibly useful transformation for dealing with data that ranges across multiple orders of magnitude and converting exponential growth to linear growth.
Use the function round()
with the argument 123.456
and hit "Run Code."
round(...)
round(123.456)
The function round(x) rounds a number to the nearest integer. You can control the precision of the rounding with the second argument, digits. round(x, digits)
rounds to the nearest 10^-digits so digits = 2
will round to the nearest 0.01.
Copy your previous code. Add the digits
argument, with a value of 2
, to the round()
function.
round(123.456, digits = ...)
round(123.456, digits = 2)
Here digits = 2 rounds 123.456 to the nearest 0.01 (that is, 10^-2), which gives 123.46.
Copy your previous code and change the digits
argument to -2
.
round(123.456 , digits = ...)
round(123.456, digits = -2)
The -2 argument in the round()
function will round the number 123.456
to the nearest hundred which would be 100 in this case.
Type in round(c(1.5, 2.5))
and hit "Run Code."
round(...)
round(c(1.5, 2.5))
round() uses what’s known as “round half to even” or Banker’s rounding: if a number is half way between two integers, it will be rounded to the even integer. This is a good strategy because it keeps the rounding unbiased: half of all 0.5s are rounded up, and half are rounded down.
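A short base R sketch of the rounding behaviour covered so far. The halves vector extends the exercise with two more values so that the alternating pattern is visible.

```r
# Values exactly half way between two integers.
halves <- c(0.5, 1.5, 2.5, 3.5)

# Each one goes to the nearest even integer, so the results alternate
# between rounding down and rounding up.
rounded_halves <- round(halves)

# Positive digits round to the right of the decimal point,
# negative digits round to the left of it.
to_hundredths <- round(123.456, digits = 2)
to_hundreds   <- round(123.456, digits = -2)

rounded_halves
to_hundredths
to_hundreds
```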
The functions floor()
and ceiling()
are paired with round()
. Use the function floor()
with the numerical argument of 123.456
.
floor(...)
floor(123.456)
The function floor()
will always round down to the nearest integer.
Type in ceiling() with the numerical argument of 123.456
.
ceiling(...)
ceiling(123.456)
The function ceiling()
always rounds up to the nearest integer.
Make a new variable x
and set it to the number 123.456.
x <- ...
x <- 123.456
The floor()
and ceiling()
functions don’t have a digits argument, so you can instead scale down, round, and then scale back up.
Let's round down to two decimal places (the nearest 0.01). Type in floor(x / 0.01) * 0.01
and hit "Run Code."
x <- 123.456
x <- 123.456 floor(...) * ...
x <- 123.456 floor(x / 0.01) * 0.01
Let's round up to two decimal places (the nearest 0.01). Type in ceiling(x / 0.01) * 0.01
and hit "Run Code."
x <- 123.456
x <- 123.456 ceiling(...) * ...
x <- 123.456 ceiling(x / 0.01) * 0.01
You can use the same technique if you want to round() to a multiple of some other number.
Let's round to the nearest multiple of 4. Type in round(x / 4) * 4
and hit "Run Code."
x <- 123.456
x <- 123.456 round(...) * ...
x <- 123.456 round(x/4) * 4
This time let's round to the nearest 0.25. Type in round(x / 0.25) * 0.25
and hit "Run Code."
x <- 123.456
x <- 123.456 round(...) * ...
x <- 123.456 round(x / 0.25) * 0.25
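The scale, round, and rescale pattern from the last few exercises can be wrapped in a small helper. round_to() below is a hypothetical function written for this sketch, not part of base R or the tidyverse.

```r
# Hypothetical helper: round `x` to a multiple of `unit`, using `f`
# (round, floor, or ceiling) for the rounding step.
round_to <- function(x, unit, f = round) {
  f(x / unit) * unit
}

x <- 123.456

down_to_cent   <- round_to(x, 0.01, floor)    # same as floor(x / 0.01) * 0.01
up_to_cent     <- round_to(x, 0.01, ceiling)  # same as ceiling(x / 0.01) * 0.01
nearest_four   <- round_to(x, 4)              # same as round(x / 4) * 4
nearest_quarter <- round_to(x, 0.25)          # same as round(x / 0.25) * 0.25

c(down_to_cent, up_to_cent, nearest_four, nearest_quarter)
```

Because of floating point representation, results such as 123.45 may differ from the printed value in the last few bits, so compare them with all.equal() rather than ==.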
Use cut()
on the numeric vector x
and add in the breaks
argument and set breaks
to c(0, 5, 10, 15, 20)
.
x <- c(1, 2, 5, 10, 15, 20)
x <- c(1, 2, 5, 10, 15, 20) cut(x, breaks = ...)
x <- c(1, 2, 5, 10, 15, 20) cut(x, breaks = c(0,5,10,15,20))
Use cut()
to break up (aka bin) a numeric vector into discrete buckets.
The breaks don't need to be evenly spaced. Copy the previous code and change the breaks
argument and set it to c(0, 5, 10, 100)
.
x <- c(1, 2, 5, 10, 15, 20)
x <- c(1, 2, 5, 10, 15, 20) cut(x, breaks = ...)
x <- c(1, 2, 5, 10, 15, 20) cut(x, breaks = c(0, 5, 10, 100))
You can optionally supply your own labels
. Note that there should be one fewer element in labels
than breaks
.
Copy your previous code and add a new argument called labels
and set it to c("sm", "md", "lg")
.
x <- c(1, 2, 5, 10, 15, 20)
x <- c(1, 2, 5, 10, 15, 20) cut(x, breaks = ..., labels = c(...))
x <- c(1, 2, 5, 10, 15, 20) cut(x, breaks = c(0, 5, 10, 100), labels = c("sm", "md", "lg"))
Any values outside of the range of the breaks will become NA
. Run ?cut
in the Console to read about other useful arguments like right
and include.lowest
, which control whether the intervals are [a, b) or (a, b] and whether the lowest interval should be [a, b].
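The effect of right, and of values that fall outside the breaks, can be seen in a short base R sketch. It reuses the x vector from the exercises; the result names are illustrative.

```r
x <- c(1, 2, 5, 10, 15, 20)
breaks <- c(0, 5, 10, 15, 20)

# Default (right = TRUE): intervals are closed on the right, (a, b].
right_closed <- cut(x, breaks = breaks)

# right = FALSE: intervals are closed on the left, [a, b).
# 20 now falls outside the last interval [15, 20) and becomes NA.
left_closed <- cut(x, breaks = breaks, right = FALSE)

# Values below the first break or above the last break become NA.
out_of_range <- cut(c(-1, 3, 25), breaks = c(0, 5, 10))

right_closed
left_closed
out_of_range
```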
Make a new vector x
and set it to 1:10
.
x <- ...
x <- 1:10
The function cumsum()
returns a running total: each element of the result is the sum of all elements of the vector up to and including that position.
Use the function cumsum()
on x
.
x <- 1:10
x <- 1:10 cumsum(...)
x <- 1:10 cumsum(x)
Base R provides cumsum(), cumprod(), cummin(), cummax() for running, or cumulative, sums, products, mins and maxes. dplyr provides cummean() for cumulative means. Cumulative sums tend to come up the most in practice.
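The four base R cumulative functions side by side, as a minimal sketch. The input vectors other than 1:10 are made up for illustration.

```r
# Running sum of 1 to 10.
running_sum <- cumsum(1:10)

# Running product: 1, 1*2, 1*2*3, ...
running_prod <- cumprod(1:5)

# Running minimum: the smallest value seen so far.
running_min <- cummin(c(5, 3, 4, 1, 2))

# Running maximum: the largest value seen so far.
running_max <- cummax(c(1, 3, 2, 5, 4))

running_sum
running_prod
running_min
running_max
```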
This section will utilize the package dplyr and allow you to make use of transformations on an actual data frame.
We provide a new x
vector. On the next line, run min_rank()
on x
.
x <- c(1, 2, 3, 4, NA)
x <- c(1, 2, 3, 4, NA) min_rank(x)
x <- c(1, 2, 3, 4, NA) min_rank(x)
dplyr provides a number of ranking functions inspired by SQL, but you should always start with dplyr::min_rank()
. It uses the typical method for dealing with ties, e.g., 1st, 2nd, 2nd, 4th.
Note that the smallest values get the lowest ranks; use desc(x) to give the largest values the smallest ranks.
Run min_rank()
on desc(x)
.
x <- c(1, 2, 3, 4, NA)
x <- c(1, 2, 3, 4, NA) min_rank(...)
x <- c(1, 2, 3, 4, NA) min_rank(desc(x))
If min_rank()
doesn’t do what you need, look at the variants dplyr::row_number()
, dplyr::dense_rank()
, dplyr::percent_rank()
, and dplyr::cume_dist()
. See the documentation for details.
Type in ranktypes
and hit "Run Code."
ranktypes
ranktypes
Right now, ranktypes
is just a tibble which includes the vector x
.
Pipe ranktypes
to the mutate()
function. Within the call to mutate()
, create a variable row_number
which equals row_number(x)
.
ranktypes |> mutate(row_number = ...)
ranktypes |> mutate(row_number = row_number(x))
Using the same pipe as above, create another variable, within the call to mutate()
, called dense_rank
equal to dense_rank(x)
.
ranktypes |> mutate(row_number = row_number(x), dense_rank = ...)
ranktypes |> mutate(row_number = row_number(x), dense_rank = dense_rank(x))
Using the same pipe as above, create another variable, within the call to mutate()
, called percent_rank
equal to percent_rank(x)
.
ranktypes |> mutate(row_number = row_number(x), dense_rank = dense_rank(x), percent_rank = ...)
ranktypes |> mutate(row_number = row_number(x), dense_rank = dense_rank(x), percent_rank = percent_rank(x))
Using the same pipe as above, create another variable, within the call to mutate()
, called cume_dist
equal to cume_dist(x)
.
ranktypes |> mutate(row_number = row_number(x), dense_rank = dense_rank(x), percent_rank = percent_rank(x), cume_dist = ...)
ranktypes |> mutate(row_number = row_number(x), dense_rank = dense_rank(x), percent_rank = percent_rank(x), cume_dist = cume_dist(x))
You can achieve many of the same results by picking the appropriate ties.method argument to base R’s rank()
function; you’ll probably also want to set na.last = "keep"
to keep NAs as NA.
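A base R sketch of that idea follows. The x vector here adds a tie to the one used in the exercises, and the pairing of each ties.method with a dplyr function is how the results should line up on this input, not a statement about how dplyr is implemented.

```r
# A vector with one tie and one missing value (illustrative).
x <- c(1, 2, 2, 3, 4, NA)

# ties.method = "min": tied values share the smallest rank,
# like min_rank(x). na.last = "keep" leaves the NA as NA.
like_min_rank <- rank(x, ties.method = "min", na.last = "keep")

# Negating x ranks the largest value first, like min_rank(desc(x)).
like_min_rank_desc <- rank(-x, ties.method = "min", na.last = "keep")

# ties.method = "first": ties are broken by position, like row_number(x).
like_row_number <- rank(x, ties.method = "first", na.last = "keep")

like_min_rank
like_min_rank_desc
like_row_number
```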
Type in numbers
and hit "Run Code."
numbers
numbers
The numbers
tibble just has one variable, id
, with values from 1 through 10.
Pipe numbers
to the mutate()
function. Within the call to mutate()
, create a variable row0
which equals row_number()
minus 1
.
numbers |> mutate( row0 = ... - ... )
numbers |> mutate( row0 = row_number() - 1)
row_number()
can also be used without any arguments when inside a dplyr verb. In this case, it’ll give the number of the “current” row.
Using the same pipe as above, create another variable, within the call to mutate()
, called three_groups
equal to row0 %% 3
.
numbers |> mutate( row0 = row_number() - 1, three_groups = ... %% ... )
numbers |> mutate( row0 = row_number() - 1, three_groups = row0 %% 3)
When combined with %% or %/%, row_number() can be a useful tool for dividing data into similarly sized groups.
Using the same pipe as above, create another variable, within the call to mutate()
, called three_in_each_group
set to row0 %/% 3
.
numbers |> mutate( row0 = row_number() - 1, three_groups = row0 %% 3, three_in_each_group = ... %/% ... )
numbers |> mutate( row0 = row_number() - 1, three_groups = row0 %% 3, three_in_each_group = row0 %/% 3)
dplyr::lead()
and dplyr::lag()
allow you to refer to the values just after or just before the “current” value, respectively. They return a vector of the same length as the input, padded with NAs at the end or start.
We provide a new vector x
. On the next line, use the function lag()
with x
as its argument.
x <- c(2, 5, 11, 11, 19, 35)
x <- c(2, 5, 11, 11, 19, 35) lag(...)
x <- c(2, 5, 11, 11, 19, 35) lag(x)
Use the function lead()
taking in x
as its argument.
x <- c(2, 5, 11, 11, 19, 35)
x <- c(2, 5, 11, 11, 19, 35) lead(...)
x <- c(2, 5, 11, 11, 19, 35) lead(x)
Subtract lag(x)
from x
.
x <- c(2, 5, 11, 11, 19, 35)
x <- c(2, 5, 11, 11, 19, 35) x - lag(...)
x <- c(2, 5, 11, 11, 19, 35) x - lag(x)
z - lag(z)
will give you the difference between the current and previous value for all the elements of the vector z
.
Note that z
represents any variable, and must be changed in every situation to match the variable being used (i.e. apples - lag(apples)
, or with any other variable).
Type in x == lag(x)
and hit "Run Code."
x <- c(2, 5, 11, 11, 19, 35)
x <- c(2, 5, 11, 11, 19, 35) x == ...
x <- c(2, 5, 11, 11, 19, 35) x == lag(x)
x == lag(x)
is TRUE when the current value is the same as the previous one, so a FALSE marks a change. You can lead or lag by more than one position by using the second argument, n
.
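The offset operations from this group of exercises, collected in one sketch on the same x vector. The functions are called as dplyr::lag() and dplyr::lead() so that they cannot be confused with stats::lag(); the result names are illustrative.

```r
library(dplyr)

x <- c(2, 5, 11, 11, 19, 35)

previous  <- dplyr::lag(x)         # shifted right, NA at the start
following <- dplyr::lead(x)        # shifted left, NA at the end
change    <- x - previous          # difference from the previous value
repeated  <- x == previous         # TRUE where the value did not change
two_back  <- dplyr::lag(x, n = 2)  # shifted right by two positions

previous
following
change
repeated
two_back
```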
When you’re looking at website data, it’s common to want to break up events into sessions, where you begin a new session after a gap of more than x
minutes since the last activity. For example, the times_visited
dataset has the times when someone visited a website.
Type in times_visited
and hit "Run Code."
times_visited
times_visited
Sometimes you want to start a new group every time some event occurs. Here, we will compute the time between consecutive visits and flag every gap that is big enough to start a new session.
Pipe times_visited
to the mutate()
function. Within the call to mutate()
, create a variable diff
which is set to time
minus lag(time, default = first(time))
.
times_visited |> mutate(diff = ...)
times_visited |> mutate(diff = time - lag(time, default = first(time)))
Using the same pipe as above, create another variable, within the call to mutate()
, called has_gap
set to diff >= 5
.
times_visited |> mutate(diff = time - lag(time, default = first(time)), has_gap = ...)
times_visited |> mutate(diff = time - lag(time, default = first(time)), has_gap = diff >= 5)
But how do we go from that logical vector to something that we can use .by
with? cumsum()
comes to the rescue here.
Using the same pipe as above, create another variable, within the call to mutate()
, called group
set to cumsum(has_gap)
.
times_visited |> mutate(diff = time - lag(time, default = first(time)), has_gap = diff >= 5, group = ...)
times_visited |> mutate(diff = time - lag(time, default = first(time)), has_gap = diff >= 5, group = cumsum(has_gap))
When has_gap
is TRUE
, cumsum()
will increment group
by 1
.
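Once group exists it can be passed to .by like any other column. The sketch below rebuilds times_visited with the values used in this tutorial and then summarizes each session; the sessions and session_summary names, and the choice of summary columns, are illustrative.

```r
library(dplyr)

times_visited <- tibble(
  time = c(0, 1, 2, 3, 5, 10, 12, 15, 17, 19, 20, 27, 28, 30)
)

sessions <- times_visited |>
  mutate(
    # Minutes since the previous visit (0 for the first visit).
    diff = time - lag(time, default = first(time)),
    # A gap of 5 or more starts a new session.
    has_gap = diff >= 5,
    # Each TRUE bumps the running total, giving a session number.
    group = cumsum(has_gap)
  )

# One row per session: when it started, when it ended, how many visits.
session_summary <- sessions |>
  summarize(start = min(time), end = max(time), visits = n(), .by = group)

session_summary
```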
Imagine you have a dataframe with a bunch of repeated values. Type in repetition
and hit "Run Code."
repetition
repetition
Create a new pipeline and pipe repetition
to the mutate()
function. Within the call to mutate()
, create a variable id
and set it to consecutive_id(x)
.
repetition |> mutate(id = ...)
repetition |> mutate(id = consecutive_id(x))
Another approach for creating grouping variables is consecutive_id()
, which starts a new group every time one of its arguments changes.
Using your previous code, within your call to the mutate()
function, add x
and y
as additional arguments. Both columns already exist in repetition, so they pass through unchanged.
repetition |> mutate(id = consecutive_id(x), x, ...)
repetition |> mutate(id = consecutive_id(x), x, y)
Using the same pipe as above, add the function slice_head()
to the pipeline. Within this function add the argument n
and set it to 1
.
repetition |> mutate(id = consecutive_id(x), x, y) |> slice_head(n = ...)
repetition |> mutate(id = consecutive_id(x), x, y) |> slice_head(n = 1)
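Without a grouping argument, slice_head(n = 1) returns only the first row of the whole tibble. To keep the first row from each run of repeated x
values, we need to group by id.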
Using your previous code, within the slice_head()
function, add the by
argument and set it to id
.
repetition |> mutate(id = consecutive_id(x), x, y) |> slice_head(n = 1, by = ...)
repetition |> mutate(id = consecutive_id(x), x, y) |> slice_head(n = 1, by = id)
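To see what the finished pipe produces, the sketch below rebuilds repetition with the values used in this tutorial. It assumes dplyr 1.1.0 or later, where consecutive_id() and the by argument are available, and it omits the x and y arguments to mutate() because they do not change the result.

```r
library(dplyr)

repetition <- tibble(
  x = c("a", "a", "a", "b", "c", "c", "d", "e", "a", "a", "b", "b"),
  y = c(1, 2, 3, 2, 4, 1, 3, 9, 4, 8, 10, 199)
)

# A new id starts every time x changes, so the later runs of "a" and "b"
# get their own ids instead of being merged with the earlier ones.
ids <- consecutive_id(repetition$x)

# Keep the first row of each run.
first_rows <- repetition |>
  mutate(id = consecutive_id(x)) |>
  slice_head(n = 1, by = id)

ids
first_rows
```

Grouping by x instead of id would merge the separate runs of "a" and "b", leaving five rows rather than seven.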
This section introduces more summary functions for describing your data.
Create a new pipeline. Pipe flights
to the mutate()
function. Within the mutate()
function, create a new variable mean
and set it to mean(dep_delay, na.rm = TRUE)
.
flights |> mutate( mean = ... )
flights |> mutate( mean = mean(dep_delay, na.rm = TRUE) )
An alternative to mean()
is median()
, which finds a value that lies in the “middle” of the vector. Depending on the shape of the distribution of the variable you’re interested in, the mean or the median might be a better measure of center.
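A small base R sketch shows how a single extreme value pulls the two measures apart. The delays vector is made up for illustration.

```r
# Five delays, one of them very large, plus a missing value.
delays <- c(1, 2, 3, 4, 100, NA)

# The mean is dragged upward by the single large delay.
delay_mean <- mean(delays, na.rm = TRUE)

# The median is the middle value and ignores how extreme the largest is.
delay_median <- median(delays, na.rm = TRUE)

# Without na.rm = TRUE the missing value makes the result NA.
delay_mean_with_na <- mean(delays)

c(delay_mean, delay_median, delay_mean_with_na)
```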
Using the same pipe as above, create another variable, within the call to mutate()
, called median
and set it to median(dep_delay, na.rm = TRUE)
.
flights |> mutate( mean = mean(dep_delay, na.rm = TRUE), median = ... )
flights |> mutate( mean = mean(dep_delay, na.rm = TRUE), median = median(dep_delay, na.rm = TRUE) )
Using the same pipe as above, create another variable, within the call to mutate()
, called n
and set it to n()
.
flights |> mutate( mean = mean(dep_delay, na.rm = TRUE), median = median(dep_delay, na.rm = TRUE), n = ... )
flights |> mutate( mean = mean(dep_delay, na.rm = TRUE), median = median(dep_delay, na.rm = TRUE), n = n() )
Using your previous code, within your call to the mutate()
function add an argument called .by
and set it to c(year, month, day)
.
flights |> mutate( mean = mean(dep_delay, na.rm = TRUE), median = median(dep_delay, na.rm = TRUE), n = n(), .by = c(...) )
flights |> mutate( mean = mean(dep_delay, na.rm = TRUE), median = median(dep_delay, na.rm = TRUE), n = n(), .by = c(year, month, day) )
Now continue the pipe to select()
, using the columns year
, month
, day
, mean
, median
, and n
as the arguments.
... |> select(year, ..., day, ..., median, ...)
flights |> mutate( mean = mean(dep_delay, na.rm = TRUE), median = median(dep_delay, na.rm = TRUE), n = n(), .by = c(year, month, day)) |> select(year, month, day, mean, median, n)
Now continue the pipe to distinct()
.
... |> ...()
flights |> mutate( mean = mean(dep_delay, na.rm = TRUE), median = median(dep_delay, na.rm = TRUE), n = n(), .by = c(year, month, day)) |> select(year, month, day, mean, median, n) |> distinct()
Using the same pipe as above, add the ggplot()
function to the pipeline. Within the ggplot()
function, map x
to mean
and y
to median
.
... |> ggplot(aes(x = ..., y = ...))
flights |> mutate( mean = mean(dep_delay, na.rm = TRUE), median = median(dep_delay, na.rm = TRUE), n = n(), .by = c(year, month, day)) |> select(year, month, day, mean, median, n) |> distinct() |> ggplot(aes(x = mean, y = median))
Using the same pipe as above, add the geom_abline()
function to the pipeline. Within the geom_abline()
function, add the argument slope
and set it to 1
.
... + geom_abline(slope = ...)
flights |> mutate( mean = mean(dep_delay, na.rm = TRUE), median = median(dep_delay, na.rm = TRUE), n = n(), .by = c(year, month, day)) |> select(year, month, day, mean, median, n) |> distinct() |> ggplot(aes(x = mean, y = median)) + geom_abline(slope = 1)
Using the same pipe as above, within the geom_abline()
function, add another argument called intercept
and set it to 0
.
... + geom_abline(slope = 1, intercept = ...)
flights |> mutate( mean = mean(dep_delay, na.rm = TRUE), median = median(dep_delay, na.rm = TRUE), n = n(), .by = c(year, month, day)) |> select(year, month, day, mean, median, n) |> distinct() |> ggplot(aes(x = mean, y = median)) + geom_abline(slope = 1, intercept = 0)
Using the same pipe as above, within the geom_abline()
function, add another argument called color
and set it to "white"
.
... + geom_abline(slope = 1, intercept = 0, color = ...)
flights |> mutate( mean = mean(dep_delay, na.rm = TRUE), median = median(dep_delay, na.rm = TRUE), n = n(), .by = c(year, month, day)) |> select(year, month, day, mean, median, n) |> distinct() |> ggplot(aes(x = mean, y = median)) + geom_abline(slope = 1, intercept = 0, color = "white")
Using the same pipe as above, within the geom_abline()
function, add another argument called linewidth
and set it to 2
.
... + geom_abline(slope = 1, intercept = 0, color = "white", linewidth = ...)
flights |> mutate( mean = mean(dep_delay, na.rm = TRUE), median = median(dep_delay, na.rm = TRUE), n = n(), .by = c(year, month, day)) |> select(year, month, day, mean, median, n) |> distinct() |> ggplot(aes(x = mean, y = median)) + geom_abline(slope = 1, intercept = 0, color = "white", linewidth = 2)
Depending on the shape of the distribution of the variable you’re interested in, mean or median might be a better measure of center. For example, for symmetric distributions we generally report the mean while for skewed distributions we usually report the median.
Using the same pipe as above, add the geom_point()
function to the pipeline.
... + geom_...()
flights |> mutate( mean = mean(dep_delay, na.rm = TRUE), median = median(dep_delay, na.rm = TRUE), n = n(), .by = c(year, month, day)) |> select(year, month, day, mean, median, n) |> distinct() |> ggplot(aes(x = mean, y = median)) + geom_abline(slope = 1, intercept = 0, color = "white", linewidth = 2) + geom_point()
This plot compares the mean vs. the median departure delay (in minutes) for each day. The median delay is always smaller than the mean delay because flights sometimes leave multiple hours late, but never leave multiple hours early. The scatterplot therefore shows how summarizing the daily departure delay with the median instead of the mean changes the result.
Create a new pipeline. Pipe flights
to the mutate()
function. Within the mutate()
function, create a new variable max
and set it to max(dep_delay, na.rm = TRUE)
.
flights |> mutate( max = ... )
flights |> mutate( max = max(dep_delay, na.rm = TRUE) )
min() and max() will give you the smallest and largest values, respectively.
Using the same pipe as above, create another variable, within the call to mutate()
, called q95
and set it to quantile(dep_delay, 0.95, na.rm = TRUE)
.
flights |> mutate( max = max(dep_delay, na.rm = TRUE), q95 = ... )
flights |> mutate( max = max(dep_delay, na.rm = TRUE), q95 = quantile(dep_delay, 0.95, na.rm = TRUE) )
Another powerful tool is quantile() which is a generalization of the median: quantile(x, 0.25) will find the value of x that is greater than 25% of the values, quantile(x, 0.5) is equivalent to the median, and quantile(x, 0.95) will find the value that’s greater than 95% of the values.
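As a quick check of that description, the sketch below computes three quantiles of the integers 0 to 100, a vector chosen only because the answers are easy to verify by eye. It relies on the default quantile algorithm in base R; `names = FALSE` just drops the percentage labels from the result.

```r
# The integers 0 to 100, so each quantile can be checked by eye.
x <- 0:100

q25 <- quantile(x, 0.25, names = FALSE)  # 25
q50 <- quantile(x, 0.50, names = FALSE)  # 50
q95 <- quantile(x, 0.95, names = FALSE)  # 95

# The 0.5 quantile agrees with the median.
q50 == median(x)
```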
Using your previous code, within your call to the mutate()
function, add the .by
argument and set it to c(year, month, day)
.
flights |> mutate( max = max(dep_delay, na.rm = TRUE), q95 = quantile(dep_delay, 0.95, na.rm = TRUE), .by = ... )
flights |> mutate( max = max(dep_delay, na.rm = TRUE), q95 = quantile(dep_delay, 0.95, na.rm = TRUE), .by = c(year, month, day) )
Continue the pipe to select()
, using the year
, month
, day
, max
, and q95
columns as the arguments.
... |> select(year, ..., day, ..., q95)
flights |> mutate( max = max(dep_delay, na.rm = TRUE), q95 = quantile(dep_delay, 0.95, na.rm = TRUE), .by = c(year, month, day)) |> select(year, month, day, max, q95)
Now finish the pipeline by piping it to distinct()
.
... |> distinct()
flights |> mutate( max = max(dep_delay, na.rm = TRUE), q95 = quantile(dep_delay, 0.95, na.rm = TRUE), .by = c(year, month, day)) |> select(year, month, day, max, q95) |> distinct()
For the flights data, we look at the 95% quantile of delays rather than the maximum because it ignores the 5% of most delayed flights, which can be quite extreme.
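The effect is easy to reproduce on invented data. In this sketch, assuming 99 ordinary delays and one extreme delay, the maximum is determined entirely by the outlier while the 95% quantile is not affected by it.

```r
# 99 hypothetical ten-minute delays plus one extreme 1000-minute delay.
delays <- c(rep(10, 99), 1000)

delay_max <- max(delays)                             # 1000, set by the outlier alone
delay_q95 <- quantile(delays, 0.95, names = FALSE)   # 10, the outlier is ignored

delay_max
delay_q95
```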
Create a new pipeline. Pipe flights
to the mutate()
function. Within the mutate()
function, create a new variable distance_sd
and set it to IQR(distance)
.
flights |> mutate( distance_sd = ... )
flights |> mutate( distance_sd = IQR(distance) )
Two commonly used summaries of spread are the standard deviation, sd(x), and the inter-quartile range, IQR()
. IQR()
gives us the range that contains the middle 50% of the data and is calculated as quantile(x, 0.75) - quantile(x, 0.25)
. Note that, despite its name, distance_sd here holds the inter-quartile range, not the standard deviation.
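The relationship between IQR() and quantile() can be verified on a small vector. The numbers below are made up for illustration; the second half of the sketch adds one extreme value to show that the standard deviation reacts to it far more strongly than the inter-quartile range does.

```r
x <- c(2, 4, 4, 5, 7, 9, 10, 12)

# IQR() matches the difference between the 75% and 25% quantiles.
iqr_builtin <- IQR(x)
iqr_manual <- quantile(x, 0.75, names = FALSE) - quantile(x, 0.25, names = FALSE)
all.equal(iqr_builtin, iqr_manual)

# Append one extreme value and compare how much each summary grows.
x_outlier <- c(x, 1000)
sd_growth <- sd(x_outlier) / sd(x)
iqr_growth <- IQR(x_outlier) / IQR(x)

sd_growth
iqr_growth
```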
Using the same pipe as above, create another variable, within the call to mutate()
, called n
and set it to n()
.
flights |> mutate( distance_sd = IQR(distance), n = ... )
flights |> mutate( distance_sd = IQR(distance), n = n() )
Using your previous code, within your call to mutate()
add the .by
argument and set it to c(origin, dest)
.
flights |> mutate( distance_sd = IQR(distance), n=n(), .by = ... )
flights |> mutate( distance_sd = IQR(distance), n=n(), .by = c(origin, dest) )
Using the same pipe as above, add the filter()
function to the pipeline. Within the filter()
function add the argument distance_sd > 0
.
... |> filter(... > 0)
flights |> mutate( distance_sd = IQR(distance), n=n(), .by = c(origin, dest)) |> filter(distance_sd > 0)
Continue the pipe to select()
, with the columns origin
, dest
, distance_sd
, and n
as the arguments.
... |> select(origin, ..., distance_sd, ...)
flights |> mutate( distance_sd = IQR(distance), n=n(), .by = c(origin, dest)) |> filter(distance_sd > 0) |> select(origin, dest, distance_sd, n)
Finally, continue the pipe to distinct()
.
... |> ...()
flights |> mutate( distance_sd = IQR(distance), n=n(), .by = c(origin, dest)) |> filter(distance_sd > 0) |> select(origin, dest, distance_sd, n) |> distinct()
We can use this to reveal a small oddity in the flights data. You might expect the spread of the distance between origin and destination to be zero, since airports are always in the same place. But the code above reveals a data oddity for airport EGE (Eagle County Regional Airport).
Describe the oddity.
question_text(NULL, message = "The distance between any two airports should be constant, obviously. Airports don't move! For some reason, the distances between EGE, on one hand, and JFK/EWR on the other hand, are not always the same. That seems odd!", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
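One way to look at the oddity directly is to list every distinct distance recorded for routes into EGE. This is only a sketch of a possible follow-up check, not part of the original exercise; it assumes the dplyr and nycflights13 packages are installed. Any origin that appears on more than one row has inconsistent distances.

```r
library(dplyr)
library(nycflights13)

# List the distinct recorded distances for each route into EGE.
# An origin that appears more than once has inconsistent distances.
ege_distances <- flights |>
  filter(dest == "EGE") |>
  distinct(origin, dest, distance) |>
  arrange(origin, distance)

ege_distances
```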
Create a new pipeline. Pipe flights
to the filter()
function. Within the call to filter()
, have the argument be dep_delay < 120
, which keeps only flights that departed less than 2 hours late.
flights |> filter(...)
flights |> filter(dep_delay < 120)
Using the same pipe as above, add the ggplot()
function to the pipeline. Within the ggplot()
function map x
to dep_delay
and group
to interaction(day, month)
.
flights |> filter(dep_delay < 120) |> ggplot(aes(x = ..., group = ...))
flights |> filter(dep_delay < 120) |> ggplot(aes(x = dep_delay, group = interaction(day, month)))
Using the same pipe as above, add the geom_freqpoly()
function to the pipeline. Within the geom_freqpoly()
function set the argument binwidth
to 5
and alpha
to 1/5
.
flights |> filter(dep_delay < 120) |> ggplot(aes(x = dep_delay, group = interaction(day, month))) + geom_freqpoly(...)
flights |> filter(dep_delay < 120) |> ggplot(aes(x=dep_delay, group = interaction(day, month))) + geom_freqpoly(binwidth = 5, alpha = 1/5)
In this plot, 365 frequency polygons of dep_delay, one for each day, are overlaid. The distributions seem to follow a common pattern, suggesting it’s fine to use the same summary for each day.
Create a new pipeline. Pipe flights
to the mutate()
function. Within the mutate()
function, create a new variable first_dep
and set it to first(dep_time, na_rm = TRUE)
.
flights |> mutate( first_dep = ... )
flights |> mutate( first_dep = first(dep_time, na_rm = TRUE))
Using the same pipe as above, create another variable, within the call to mutate()
, called fifth_dep
and set it to nth(dep_time, 5, na_rm = TRUE)
.
flights |> mutate( first_dep = first(dep_time, na_rm = TRUE), fifth_dep = ... )
flights |> mutate( first_dep = first(dep_time, na_rm = TRUE), fifth_dep = nth(dep_time, 5, na_rm = TRUE))
Using the same pipe as above, create another variable, within the call to mutate()
, called last_dep
and set it to last(dep_time, na_rm = TRUE)
.
flights |> mutate( first_dep = first(dep_time, na_rm = TRUE), fifth_dep = nth(dep_time, 5, na_rm = TRUE), last_dep = ... )
flights |> mutate( first_dep = first(dep_time, na_rm = TRUE), fifth_dep = nth(dep_time, 5, na_rm = TRUE), last_dep = last(dep_time, na_rm = TRUE))
Using your previous code, within your call to mutate()
add the .by
argument and set it to c(year, month, day)
.
flights |> mutate( first_dep = first(dep_time, na_rm = TRUE), fifth_dep = nth(dep_time, 5, na_rm = TRUE), last_dep = last(dep_time, na_rm = TRUE), .by = ... )
flights |> mutate( first_dep = first(dep_time, na_rm = TRUE), fifth_dep = nth(dep_time, 5, na_rm = TRUE), last_dep = last(dep_time, na_rm = TRUE), .by = c(year, month, day))
Continue the pipe to select()
, with the columns year
, month
, day
, first_dep
, fifth_dep
, and last_dep
as the arguments.
... |> select(year, ..., day, ..., fifth_dep, ...)
flights |> mutate( first_dep = first(dep_time, na_rm = TRUE), fifth_dep = nth(dep_time, 5, na_rm = TRUE), last_dep = last(dep_time, na_rm = TRUE), .by = c(year, month, day)) |> select(year, month, day, first_dep, fifth_dep, last_dep)
Finally, continue the pipe to distinct()
.
... |> distinct()
flights |> mutate( first_dep = first(dep_time, na_rm = TRUE), fifth_dep = nth(dep_time, 5, na_rm = TRUE), last_dep = last(dep_time, na_rm = TRUE), .by = c(year, month, day)) |> select(year, month, day, first_dep, fifth_dep, last_dep) |> distinct()
The functions first(x)
, last(x)
, and nth(x, n)
extract a value at a specific position.
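A minimal sketch of these three functions on a made-up vector of departure times follows. It assumes dplyr 1.1.0 or later, which is where the `na_rm` argument used in the exercises above is available.

```r
library(dplyr)

# Hypothetical departure times in HHMM format, already in order.
dep_times <- c(517, 533, 542, 544, 554, 2356)

first(dep_times)    # 517
nth(dep_times, 5)   # 554
last(dep_times)     # 2356
nth(dep_times, -2)  # 554, negative positions count from the end

# na_rm = TRUE drops missing values before extracting a position.
first_non_missing <- first(c(NA, 517, 533), na_rm = TRUE)  # 517
```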
Create a new pipeline. Pipe flights
to the mutate()
function. Within the mutate()
function, create a new variable y
and set it to min_rank(sched_dep_time)
.
flights |> mutate(y = min_rank(...))
flights |> mutate(y = min_rank(sched_dep_time))
Using your previous code, within your call to the mutate()
function, add another argument called .by
and set it to c(year, month, day)
.
flights |> mutate(y = min_rank(sched_dep_time), .by = ...)
flights |> mutate(y = min_rank(sched_dep_time), .by = c(year, month, day))
As the names suggest, the summary functions are typically paired with summarize()
. However, because of the recycling rules, they can also be usefully paired with mutate()
, as we have done throughout this tutorial.
At the moment, the only two functions that work in these cases are mutate()
and reframe()
, because an update to the dplyr package removed the ability to use summarize()
in these cases. At this stage, you shouldn't use reframe()
, as it is a complicated function, so ideally the only function you should use is mutate()
.
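The recycling behavior behind the mutate() pattern used in this tutorial can be sketched on a tiny invented table. The names `toy`, `recycled`, and `per_day` are hypothetical, and the `.by` argument assumes dplyr 1.1.0 or later.

```r
library(dplyr)

# A made-up table: two days with a few delays each.
toy <- tibble(
  day   = c(1, 1, 1, 2, 2),
  delay = c(5, 10, 15, 0, 40)
)

# mean() returns one value per group; mutate() recycles it to every row
# of that group, so the table keeps all five rows.
recycled <- toy |>
  mutate(mean_delay = mean(delay), .by = day)

# Dropping the row-level column and calling distinct() leaves one row per day.
per_day <- recycled |>
  select(day, mean_delay) |>
  distinct()

per_day
```

This is why the exercises above follow mutate() with select() and distinct(): the summary value is repeated on every row until the duplicates are removed.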
Using the same pipe as above, add the filter()
function to the pipeline. Within the filter()
function, add the argument y %in% c(1, max(y))
.
flights |> mutate(y = min_rank(sched_dep_time), .by = c(year, month, day)) |> filter(...)
flights |> mutate(y=min_rank(sched_dep_time), .by = c(year, month, day)) |> filter(y %in% c(1, max(y)))
Extracting values at positions is complementary to filtering on ranks. Filtering gives you all variables, with each observation in a separate row.
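To make the contrast concrete, here is a small sketch of filtering on ranks with an invented four-row table (the `flight` labels and times are hypothetical). Ties share the lowest rank, and the filtered result keeps every column of the matching rows.

```r
library(dplyr)

toy <- tibble(
  flight = c("A", "B", "C", "D"),
  sched_dep_time = c(600, 615, 615, 2359)
)

# min_rank() gives tied values the same rank and then skips ahead: 1, 2, 2, 4.
ranked <- toy |>
  mutate(y = min_rank(sched_dep_time))

# Keep the rows with the lowest and highest rank; all columns are retained.
extremes <- ranked |>
  filter(y %in% c(1, max(y)))

extremes
```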
This tutorial covered Chapter 13: Numbers from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. We used two core packages from the Tidyverse: readr and dplyr. Key commands that you learned include parse_double()
, which parses numbers directly from strings, parse_number()
, which ignores non-numeric characters while parsing numbers from strings, count()
, which counts the unique values of one or more variables, pmin()
, which, along with pmax(), takes one or more vectors and returns their element-wise minima or maxima,
round()
, which rounds the values in its first argument to the specified number of decimal places, and min_rank()
, which ranks the values of a vector and gives every tie the same rank.