library(learnr)
library(tutorial.helpers)
library(tidyverse)
library(nycflights13)
knitr::opts_chunk$set(echo = FALSE)
options(tutorial.exercise.timelimit = 60, tutorial.storage = "local")
This tutorial covers Chapter 12: Logical vectors from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. You will learn how to create logical vectors with >
, <
, <=
, >=
, ==
, !=
, and is.na()
, how to combine them with !
, &
, and |
, and how to summarize them with any()
, all()
, sum()
, and mean()
. You will also learn about the powerful if_else()
and case_when()
functions that allow you to return values depending on the value of a logical vector.
Logical vectors are the simplest type of vector because each element can only be one of three possible values: TRUE
, FALSE
, and NA
. It's relatively rare to find logical vectors in your raw data, but you'll create and manipulate them in almost every analysis.
Load the tidyverse package
library(...)
library(tidyverse)
Most of the functions you'll learn about in this chapter are provided by base R, so we don't need the tidyverse, but we'll still load it so we can use mutate()
, filter()
, and others to work with data frames.
Load the nycflights13 package.
library(...)
library(nycflights13)
You can find more information about the package with help(package = "nycflights13")
.
Type flights
and hit "Run Code".
A very common way to create a logical vector is via a numeric comparison with <
, <=
, >
, >=
, !=
, and ==
.
Pipe flights
into the filter()
function. Within filter()
, add dep_time > 600
to only look at flights that departed after 6:00 AM.
flights |> filter(...)
flights |> filter(dep_time > 600)
In addition to &
and |
, R also has &&
and ||
. Don't use them in dplyr functions! These are called short-circuiting operators and only ever return a single TRUE
or FALSE
. They're mainly used for programming, not data science.
Use the same code as above and add & dep_time < 2000
to the call to filter()
. This will look at flights that departed after 6:00 AM and before 8:00 PM.
flights |> filter(dep_time > 600 & ...)
flights |> filter(dep_time > 600 & dep_time < 2000)
An easy way to avoid the problem of getting your ==
's and |
's in the right order is to use %in%
. x %in% y
returns a logical vector the same length as x
that is TRUE
whenever a value in x
is anywhere in y
.
Use the same code as above and add & abs(arr_delay) < 20
to the call to filter()
. This will keep only the flights whose arrival delay has an absolute value of less than 20 minutes.
flights |> filter(dep_time > 600 & dep_time < 2000 & abs(arr_delay) < ...)
flights |> filter(dep_time > 600 & dep_time < 2000 & abs(arr_delay) < 20)
You can always explicitly create the underlying logical variables with mutate()
Start a new pipe with flights
. Pipe flights
into the mutate()
function. Within mutate
, use daytime = dep_time > 600 & dep_time < 2000
. This will create a new column called daytime
which is TRUE
when the dep_time
is between 6:00 AM and 8:00 PM.
flights |> mutate(daytime = dep_time > ... & dep_time < ...)
flights |> mutate(daytime = dep_time > 600 & dep_time < 2000)
Just remember that any manipulation we do to a free-floating vector, you can do to a variable inside a data frame with mutate()
and friends.
Use the same code as above and add , approx_ontime = abs(arr_delay) < 20
to the mutate()
function. This will create a new column named approx_ontime
which is TRUE when the absolute value of arr_delay
is less than 20 minutes.
flights |> mutate(daytime = dep_time > 600 & dep_time < 2000, approx_ontime = abs(arr_delay) < ...)
flights |> mutate(daytime = dep_time > 600 & dep_time < 2000, approx_ontime = abs(arr_delay) < 20)
Copy previous code and add ,.keep = "used"
to the mutate()
function. This keeps only the new columns and the columns used to create them in the tibble and drops all the unused columns.
flights |> mutate(daytime = dep_time > 600 & dep_time < 2000, approx_ontime = abs(arr_delay) < 20, .keep = ...)
flights |> mutate(daytime = dep_time > 600 & dep_time < 2000, approx_ontime = abs(arr_delay) < 20, .keep = "used")
Copy previous code and pipe it into filter()
. Within filter()
add daytime & approx_ontime
. This will keep only the rows that meet the specified conditions.
flights |> mutate(daytime = dep_time > 600 & dep_time < 2000, approx_ontime = abs(arr_delay) < 20, .keep = "used")|> filter(daytime & ...)
flights |> mutate(daytime = dep_time > 600 & dep_time < 2000, approx_ontime = abs(arr_delay) < 20, .keep = "used") |> filter(daytime & approx_ontime)
So far, we’ve mostly created logical variables transiently within filter()
— they are computed, used, and then thrown away. For example, the previous filter()
finds all daytime departures that arrive roughly on time.
Set x
to c(1 / 49 * 49, sqrt(2) ^ 2)
using <-
. Then, click "Run Code".
x <- c(..., ...)
x <- c(1 / 49 * 49, sqrt(2) ^ 2)
Beware of using ==
with numbers. For example, it looks like this vector contains the numbers 1
and 2
. But if you test them for equality, you get FALSE
.
Copy previous code and run x == c(1,2)
. The expected output is FALSE FALSE
.
x <- c(1 / 49 * 49, sqrt(2) ^ 2)
x == c(..., ...)

x <- c(1 / 49 * 49, sqrt(2) ^ 2)
x == c(1, 2)
What's going on? Computers store numbers with a fixed number of decimal places so there's no way to exactly represent 1/49
or sqrt(2)
and subsequent computations will be very slightly off.
We can see the exact values by calling print()
with the digits argument. Copy previous code and run print(x, digits = 16)
. This will print the values of x
with 16 digits.
x <- c(1 / 49 * 49, sqrt(2) ^ 2)
x == c(1, 2)
print(x, digits = ...)

x <- c(1 / 49 * 49, sqrt(2) ^ 2)
x == c(1, 2)
print(x, digits = 16)
You can see why R defaults to rounding these numbers; they really are very close to what you expect. Now that you've seen why ==
is failing, what can you do about it? One option is to use dplyr::near()
which ignores small differences:
Copy the previous code and type near()
on the next line. Within near()
, add x, c(1,2)
. This should come out as TRUE
for both.
x <- c(1 / 49 * 49, sqrt(2) ^ 2)
x == c(1, 2)
print(x, digits = 16)
near(x, c(..., ...))

x <- c(1 / 49 * 49, sqrt(2) ^ 2)
x == c(1, 2)
print(x, digits = 16)
near(x, c(1, 2))
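To make "ignores small differences" concrete, here is a minimal base R sketch of a tolerance comparison. The helper name approx_equal and its default tolerance are assumptions chosen for illustration; they are not part of dplyr, and the help page for near() is the place to check what it actually does.

```r
# Hypothetical helper (not part of dplyr): TRUE when two numbers differ
# by less than a small tolerance.
approx_equal <- function(a, b, tol = sqrt(.Machine$double.eps)) {
  abs(a - b) < tol
}

x <- c(1 / 49 * 49, sqrt(2) ^ 2)

exact  <- x == c(1, 2)               # exact comparison is tripped up by rounding error
approx <- approx_equal(x, c(1, 2))   # tolerance comparison treats the values as equal

print(exact)
print(approx)
```

The point of the sketch is only that comparing the absolute difference against a small threshold is enough to absorb floating-point rounding error.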
Missing values represent the unknown so they are “contagious”: almost any operation involving an unknown value will also be unknown.
Run NA > 5
and 10 == NA
. They should both come out as NA
NA > ...
... == NA

NA > 5
10 == NA
R normally calls print for you (i.e. x
is a shortcut for print(x)
), but calling it explicitly is useful if you want to provide other arguments.
Now, run NA == NA
. This should also come out as NA.
NA == NA
That is the most confusing result. It's easiest to understand why this is true if we artificially supply a little more context.
Set age_mary <- NA
and age_john <- NA
on the next line. On a new line, type age_mary == age_john
and click "Run Code".
age_mary <- ...
... <- NA
age_mary == age_john

age_mary <- NA
age_john <- NA
age_mary == age_john
This should come out as NA
because if both of their ages are unknown, then we can't know if they are the same age.
Start a new pipe with flights
. Pipe that into the filter()
function. Within filter()
, add dep_time == NA
. This will attempt to find all the rows where dep_time
is missing.
flights|> filter(dep_time == ...)
flights|> filter(dep_time == NA)
The code above doesn’t work because dep_time == NA
will yield NA
for every single row, and filter()
automatically drops missing values. Instead we'll need a new tool: is.na()
.
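As a small illustration of the difference between comparing with NA and testing for NA, the sketch below uses a made-up vector of departure times (not taken from flights) and base R only.

```r
dep_time <- c(517, NA, 533, NA)   # made-up departure times for illustration

eq_na   <- dep_time == NA     # every comparison with NA is itself NA
missing <- is.na(dep_time)    # TRUE exactly where the value is missing

print(eq_na)
print(missing)
print(sum(missing))           # number of missing values
```

Because eq_na contains only NA, a filter based on it has nothing to keep, whereas missing marks precisely the rows of interest.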
Type is.na()
into the code chunk. Within is.na()
, add c(TRUE, NA, FALSE)
.
is.na(c(TRUE, NA, ...))
is.na(c(TRUE, NA, FALSE))
You will see the output FALSE
for TRUE
and FALSE
, but TRUE
for NA
. What happens when there are characters or numbers in the input vector?
Type is.na()
into the code chunk. Within is.na()
, add c(1, NA, 'b')
.
is.na(c(1, NA, ...))
is.na(c(1, NA, 'b'))
You will see the output TRUE
for NA
and FALSE
for 1
and b
. This is because is.na()
can work with any type of vector and returns TRUE
for missing values and FALSE
for everything else.
Start a new pipe with flights
. Pipe flights
into filter()
. Within filter()
, add is.na(dep_time)
. This will find all the rows with a missing dep_time
.
flights |> filter(is.na(...))
flights |> filter(is.na(dep_time))
is.na()
can also be useful in arrange()
. arrange()
usually puts all the missing values at the end but you can override this default by first sorting with is.na()
.
Start a new pipe with flights
. Pipe it into the filter()
function. Within filter()
, add month == 1, day == 1
.
flights|> filter(month == ..., day == ...)
flights |> filter(month == 1, day == 1)
The ==
operator is a comparison operator in R that checks if two values are equal. It returns TRUE
if the values are equal and FALSE
otherwise.
Continue the pipe with arrange()
. Within arrange()
, add dep_time
. This should arrange dep_time
from least to greatest.
flights|> filter(month == 1, day == 1) |> arrange(...)
flights |> filter(month == 1, day == 1) |> arrange(dep_time)
The arrange()
function is used to order rows in a data set based on a column. It allows you to sort the data set in either ascending or descending order.
Within arrange()
, put is.na(dep_time)
.
flights|> filter(month == 1, day == 1)|> arrange(is.na(...))
flights |> filter(month == 1, day == 1) |> arrange(is.na(dep_time))
This checks if the dep_time
column has missing values (NA
). It returns a logical vector with TRUE
for rows where dep_time
is NA
and FALSE
otherwise.
Within arrange()
, put desc(is.na(dep_time)), dep_time
flights|> filter(month == 1, day == 1)|> arrange(desc(is.na(...)), ...)
flights |> filter(month == 1, day == 1) |> arrange(desc(is.na(dep_time)), dep_time)
The desc()
function is used to create a descending order of the logical vector obtained from is.na(dep_time)
. This means that rows with missing dep_time
values (NA
) will appear first in the data frame. We will discuss missing values further in the Missing Values tutorial.
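For readers who want to see the same idea outside of dplyr, here is a rough base R analogue, with order() standing in for arrange() and a made-up vector standing in for the dep_time column. Since FALSE sorts before TRUE, sorting on !is.na() first has the same effect as desc(is.na()).

```r
dep_time <- c(542, NA, 517, NA, 533)   # made-up values for illustration

# Default behaviour: missing values go last.
na_last <- sort(dep_time, na.last = TRUE)

# Sort on "is not missing" first (FALSE before TRUE), then on the value itself,
# so the missing values come first and the rest stay in ascending order.
na_first <- dep_time[order(!is.na(dep_time), dep_time)]

print(na_last)
print(na_first)
```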
Once you have multiple logical vectors, you can combine them together using Boolean algebra. In R, &
is “and”, |
is “or”, !
is “not”, and xor()
is exclusive or. For example, df |> filter(!is.na(x))
finds all rows where x
is not missing and df |> filter(x < -10 | x > 0)
finds all rows where x
is smaller than -10
or bigger than 0
. Figure 12.1 in the book shows the complete set of Boolean operations and how they work.
The rules for missing values in Boolean algebra are a little tricky to explain because they seem inconsistent at first glance. Click "Run Code".
df <- tibble(x = c(TRUE, FALSE, NA))
df |>
  mutate(
    and = x & NA,
    or = x | NA
  )
To understand what’s going on, think about NA | TRUE
(NA
or TRUE
). A missing value in a logical vector means that the value could either be TRUE
or FALSE
. TRUE | TRUE
and FALSE | TRUE
are both TRUE
because at least one of them is TRUE
. NA | TRUE
must also be TRUE
because NA
can either be TRUE
or FALSE
. However, NA | FALSE
is NA
because we don’t know if NA
is TRUE
or FALSE
. Similar reasoning applies with NA & FALSE
.
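The reasoning above can be checked directly in base R. This sketch combines a vector holding all three logical values with a known TRUE and a known FALSE; the column names are just labels chosen for the printout.

```r
x <- c(TRUE, FALSE, NA)

or_true   <- x | TRUE    # a known TRUE decides the result, even for NA
or_false  <- x | FALSE   # the result depends on x, so NA stays NA
and_true  <- x & TRUE    # the result depends on x, so NA stays NA
and_false <- x & FALSE   # a known FALSE decides the result, even for NA

print(data.frame(x, or_true, or_false, and_true, and_false))
```

NA survives only in the columns where the unknown value could still change the answer.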
Start a new pipe with flights
and pipe it into filter()
. Within filter()
, add month == 11 | month == 12
.
flights |> filter(month == ... | month == ...)
flights |> filter(month == 11 | month == 12)
Note that the order of operations doesn’t work like English. To prove this, we can take the previous code that finds all flights that departed in November or December. You might be tempted to write it like you’d say in English: “Find all flights that departed in November or December.”
Now, try replacing month == 11 | month == 12
with month == 11 | 12
. This should do the same thing as before. (Spoiler Alert: It doesn't!)
flights |> filter(month == ... | ...)
flights |> filter(month == 11 | 12)
The code does not return an error, but it doesn’t seem to have worked either. What happened? Here, R first evaluates month == 11
creating a logical vector, which we call nov
. It now computes nov | 12
. When you use a number with a logical operator it converts everything apart from 0
to TRUE
, so this is equivalent to nov | TRUE
which will always be TRUE
, so every row will be selected.
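The same effect is easy to reproduce on a tiny vector. The values below are made up for illustration and stand in for the month column.

```r
month <- c(1, 11, 12)   # made-up values for illustration

wrong <- month == 11 | 12            # evaluated as (month == 11) | 12
right <- month == 11 | month == 12   # each side is a full comparison

print(wrong)
print(right)
```

In wrong, the number 12 is treated as TRUE, so even January is selected.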
Start a new pipe with flights
. Pipe flights
into the mutate()
function. Within mutate()
, set nov = month == 11
.
flights|> mutate(nov = month == ...)
flights|> mutate(nov = month == 11)
nov = month == 11
creates a new variable called nov
and assigns it to the Boolean TRUE
for rows where the month is equal to 11
, and FALSE
otherwise. The expression month == 11
checks the equality between the month column and the value 11
, resulting in a logical vector.
Copy previous code and add final = nov | 12, .keep = "used"
to the mutate()
function. final = nov | 12
results in the variable final
being TRUE
if nov
is TRUE
or if the value 12
is considered as TRUE
. The .keep = "used"
argument keeps only the new variables, and the variables that were used to make them in the tibble.
flights|> mutate(nov = month == 11, final = nov | ..., .keep = ...)
flights|> mutate(nov = month == 11, final = nov | 12, .keep = "used")
As well as & and |, R also has && and ||. Don’t use them in dplyr functions! These are called short-circuiting operators and only ever return a single TRUE or FALSE. They’re important for programming, not data science.
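A brief base R sketch of the difference, using made-up values. It shows & working element by element and && stopping early on single values. Recent versions of R may raise an error if && is given vectors longer than one, which is one more reason to keep it out of dplyr verbs.

```r
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)

elementwise <- a & b   # one result per element

# && works on single values and stops as soon as the answer is known,
# so the stop() call on the right is never evaluated here.
short_circuit <- FALSE && stop("never reached")

# A typical programming use: guard a check that only makes sense
# when the first condition holds.
value <- "abc"
is_big_number <- is.numeric(value) && value > 10

print(elementwise)
print(short_circuit)
print(is_big_number)
```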
An easy way to avoid the problem of getting your ==
s and |
s in the right order is to use %in%
. x %in% y
returns a logical vector the same length as x
that is TRUE
whenever a value in x
is anywhere in y
. For example, type 1:12 %in% c(1, 5, 11)
and click "Run Code".
1:12 %in% c(..., ..., ...)
1:12 %in% c(1, 5, 11)
This code shows what would happen if we assign all the numbers from 1
through 12
to x
, and 1
,5
, and 11
to y
, and then run x %in% y
.
To find all of the flights in November and December, start a new pipe with flights
. Pipe it into filter()
. Within filter()
, add month %in% c(11,12)
.
flights |> filter(month %in% c(..., ...))
flights |> filter(month %in% c(11, 12))
Fun Fact: %in%
obeys different rules for NA
than ==
, as NA %in% NA
is TRUE
while NA == NA
is NA
.
Type c(1, 2, NA) == NA
and click "Run code".
c(1, 2, NA) == ...
c(1, 2, NA) == NA
This should output #> [1] NA NA NA
because when comparing any value (including NA
) with NA
using the ==
operator, the result will be NA
. This is because the value of NA
represents unknown information, so the result of most comparisons involving NA
is also unknown.
Copy previous code but instead of using ==
, use %in%
. This checks whether each value on the left side (x
) is also on the right side (y
).
c(1, 2, NA) %in% ...
c(1, 2, NA) %in% NA
This should output #> [1] FALSE FALSE TRUE
because NA
is the only value that appears on both sides.
Start a new pipe with flights
. Pipe it into filter()
. Within filter()
, add dep_time %in% c(NA, 0800)
to find all rows where dep_time
is NA
or 0800
.
flights |> filter(dep_time %in% c(..., ...))
flights |> filter(dep_time %in% c(NA, 0800))
This should result in a tibble where the only values in the dep_time
column are NA
and 800 (R reads the literal 0800 as the number 800)
.
Next, we will describe some useful techniques for summarizing logical vectors. In addition to functions that only work with logical vectors, you can also use functions that work with numeric vectors.
Start a pipe with flights
. Pipe it into summarize()
. Within summarize()
, add all_delayed = all(dep_delay <= 60)
. This will make a new column named all_delayed
that is TRUE
only if every value of dep_delay
is less than or equal to 60
minutes.
flights |> summarize(all_delayed = all(dep_delay <= ...))
flights |> summarize(all_delayed = all(dep_delay <= 60))
There are two main logical summary functions: any()
and all()
. any(x)
is the equivalent of |
; it’ll return TRUE
if there are any TRUE
’s in x
. all(x)
is the equivalent of &
; it’ll return TRUE
only if all values of x
are TRUE
’s.
Copy the previous code. Within all()
, add na.rm = TRUE
separated with a ,
. When na.rm
is set to TRUE
, it removes all NA
values. na.rm
is short for "NA remove"
.
flights |> summarize(all_delayed = all(dep_delay <= 60, na.rm = ...))
flights |> summarize(all_delayed = all(dep_delay <= 60, na.rm = TRUE))
Next, we will use all()
and any()
to find out if every flight was delayed on departure by at most an hour or if any flights were delayed on arrival by five hours or more. Using the .by argument
will allow us to do that by day.
Copy the previous code. Within summarize()
, add any_long_delay = any(arr_delay >= 300)
. This will make a new column named any_long_delay
that is TRUE
if any value of arr_delay
is greater than or equal to 300
minutes.
flights |> summarize( all_delayed = all(dep_delay <= 60, na.rm = TRUE), any_long_delay = any(arr_delay >= ...))
flights |> summarize( all_delayed = all(dep_delay <= 60, na.rm = TRUE), any_long_delay = any(arr_delay >= 300))
Like all summary functions, any()
and all()
will return NA
if there are any missing values present. As usual, you can make the missing values go away with na.rm = TRUE
.
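A small base R check, using made-up vectors, suggests a slightly finer rule: any() and all() return NA only when the missing value could still change the answer. The sketch also shows the effect of na.rm = TRUE.

```r
any_known   <- any(c(TRUE, NA))    # one TRUE already settles the answer
any_unknown <- any(c(FALSE, NA))   # the NA might be TRUE, so the answer is unknown
all_known   <- all(c(FALSE, NA))   # one FALSE already settles the answer
all_unknown <- all(c(TRUE, NA))    # the NA might be FALSE, so the answer is unknown

# Dropping the missing values first gives a definite answer in both unknown cases.
any_rm <- any(c(FALSE, NA), na.rm = TRUE)
all_rm <- all(c(TRUE, NA), na.rm = TRUE)

print(c(any_known, any_unknown, all_known, all_unknown, any_rm, all_rm))
```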
Copy the previous code. Add the na.rm = TRUE
argument in any()
.
flights |> summarize( all_delayed = all(dep_delay <= 60, na.rm = TRUE), any_long_delay = any(arr_delay >= 300, ...))
flights |> summarize( all_delayed = all(dep_delay <= 60, na.rm = TRUE), any_long_delay = any(arr_delay >= 300, na.rm = TRUE))
>=
is used when you want to find values that are greater than or equal to a number. Conversely, you could use <=
to find values that are less than or equal to a number. Finally, ==
can be used to find values that are exactly equal to a number.
Copy the previous code. Add .by = c(year, month)
to the summarize()
function. Make sure to separate the arguments with commas. The added code will make the tibble have one row for each combination of month
and year
.
flights |> summarize( all_delayed = all(dep_delay <= 60, na.rm = TRUE), any_long_delay = any(arr_delay >= 300, na.rm = TRUE), .by = c(..., ...))
flights |> summarize( all_delayed = all(dep_delay <= 60, na.rm = TRUE), any_long_delay = any(arr_delay >= 300, na.rm = TRUE), .by = c(year, month))
However, in most cases, any()
and all()
are a little too crude, and it would be helpful to get a little more detail about how many values are TRUE
and FALSE
. This is what numeric summaries are for.
Add day
to the columns listed in .by
.
flights |> summarize( all_delayed = all(dep_delay <= 60, na.rm = TRUE), any_long_delay = any(arr_delay >= 300, na.rm = TRUE), .by = c(year, month, ...))
flights |> summarize( all_delayed = all(dep_delay <= 60, na.rm = TRUE), any_long_delay = any(arr_delay >= 300, na.rm = TRUE), .by = c(year, month, day))
This allows us to find out, for each day, if every flight was delayed on departure by at most an hour and if any flights were delayed on arrival by five hours or more. This will make the resulting tibble have one row for each combination of year
, month
, and day
. Essentially, we will have one row for each day of the year.
Start a new pipe with flights
. Pipe it into summarize()
. Within summarize()
, add all_delayed = mean(dep_delay <= 60)
.
flights |> summarize( all_delayed = ...(dep_delay <= 60))
flights |> summarize( all_delayed = mean(dep_delay <= 60))
This should create a column called all_delayed
that is the proportion of rows where dep_delay
is less than or equal to 60
.
Copy the previous code. Add na.rm = TRUE
within mean()
.
flights|> summarize(all_delayed = mean(dep_delay <= 60,...))
flights |> summarize( all_delayed = mean(dep_delay <= 60, na.rm = TRUE))
Setting na.rm
equal to TRUE
removes all NA
values.
Copy the previous code and add any_long_delay = sum(arr_delay >= 300)
as a new argument in summarize()
. Again, make sure to separate the arguments with commas.
flights |> summarize( all_delayed = mean(dep_delay <= 60, na.rm = TRUE), any_long_delay = ...(arr_delay >= 300))
flights |> summarize( all_delayed = mean(dep_delay <= 60, na.rm = TRUE), any_long_delay = sum(arr_delay >= 300))
The sum()
function gives the number of TRUE
's of whatever is inside it.
Copy the previous code and add na.rm = TRUE
within sum()
.
flights|> summarize(all_delayed = mean(dep_delay <= 60, na.rm = TRUE), any_long_delay = sum(arr_delay >= 300, ...))
flights |> summarize( all_delayed = mean(dep_delay <= 60, na.rm = TRUE), any_long_delay = sum(arr_delay >= 300, na.rm = TRUE))
When you use a logical vector in a numeric context, TRUE
becomes 1
and FALSE
becomes 0
. This makes sum()
and mean()
very useful with logical vectors because sum(x)
gives the number of TRUE
s and mean(x)
gives the proportion of TRUE
s (mean()
is sum()
divided by length()
).
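As a quick illustration on a made-up vector of departure delays (not taken from flights), the sketch below counts and averages a logical vector directly in base R.

```r
dep_delay <- c(5, 75, -3, 120, 30, NA)   # made-up delays in minutes

within_hour <- dep_delay <= 60           # logical vector; NA where the delay is unknown

n_within    <- sum(within_hour, na.rm = TRUE)    # number of TRUEs
prop_within <- mean(within_hour, na.rm = TRUE)   # proportion of TRUEs among known values

print(within_hour)
print(n_within)
print(prop_within)
```

Without na.rm = TRUE both summaries would be NA here, because one delay is missing.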
Copy the previous code and add .by = c(year, month)
as a new argument within summarize()
.
flights|> summarize(all_delayed = mean(dep_delay <= 60, na.rm = TRUE), any_long_delay = sum(arr_delay >= 300, na.rm = TRUE), .by = c(..., ...))
flights |> summarize( all_delayed = mean(dep_delay <= 60, na.rm = TRUE), any_long_delay = sum(arr_delay >= 300, na.rm = TRUE), .by = c(year, month))
Always put lists of values within c()
; otherwise, R will not treat them as a single argument, and the code will typically fail or give an unexpected result.
Copy previous code and add day
to the .by
argument within summarize()
to get rows for every single day of the year. Make sure to separate the arguments with commas!
flights |> summarize( all_delayed = mean(dep_delay <= 60, na.rm = TRUE), any_long_delay = sum(arr_delay >= 300, na.rm = TRUE), .by = c(year, month, ...))
flights |> summarize( all_delayed = mean(dep_delay <= 60, na.rm = TRUE), any_long_delay = sum(arr_delay >= 300, na.rm = TRUE), .by = c(year, month, day))
This allows us to see the proportion of flights that were delayed on departure by at most 1 hour (60
minutes) and the number of flights that were delayed on arrival by 5 hours (300
minutes) or more.
Start a new pipe with flights
. Pipe it into filter
. Within filter()
, add arr_delay > 0
as the argument.
flights |> filter(arr_delay > ...)
flights |> filter(arr_delay > 0)
There’s one final use for logical vectors in summaries: you can use a logical vector to filter a single variable to a subset of interest. This makes use of the base [
(pronounced subset) operator.
Continue the pipe with summarize()
. Within summarize()
, add the argument behind = mean(arr_delay)
.
flights |> filter(arr_delay > 0) |> summarize(behind = mean(...))
flights |> filter(arr_delay > 0) |> summarize(behind = mean(arr_delay))
This will look at the average delay only for flights that were actually delayed. We did this by first filtering the flights, then calculating the average delay.
Copy the previous code and add the argument n = n()
within summarize()
. Make sure to separate arguments within a function with commas.
flights |> filter(arr_delay > 0)|> summarize(behind = mean(arr_delay), n = ...())
flights |> filter(arr_delay > 0) |> summarize(behind = mean(arr_delay), n = n())
n = n()
is used to calculate the count of observations where arr_delay
is greater than 0
.
Add .by = c(year, month, day)
to have one row for each day of the year.
flights |> filter(arr_delay > 0) |> summarize(behind = mean(arr_delay), n = n(), .by = c(..., ..., ...))
flights |> filter(arr_delay > 0) |> summarize(behind = mean(arr_delay), n = n(), .by = c(year, month, day))
This looks at the average delay just for flights that were actually delayed. We did this by first filtering the flights and then calculating the average delay.
This works, but what if we wanted to also compute the average delay for flights that arrived early? We’d need to perform a separate filter step, and then figure out how to combine the two data frames together. Alternatively, you could use [
to perform an inline filtering: arr_delay[arr_delay > 0]
will yield only the positive arrival delays.
Press "Run Code".
flights |>
  summarize(
    behind = mean(arr_delay[arr_delay > 0], na.rm = TRUE),
    ahead = mean(arr_delay[arr_delay < 0], na.rm = TRUE),
    n = n(),
    .by = c(year, month, day)
  )
Also note the difference in the group size: in the first chunk n()
gives the number of delayed flights per day; in the second, n()
gives the total number of flights.
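To see what inline subsetting does on its own, here is a base R sketch on a made-up vector of arrival delays. One thing to notice is that a missing value in the logical index produces a missing element in the result, which is why na.rm = TRUE is still needed.

```r
arr_delay <- c(12, -8, 45, -20, NA, 0)   # made-up delays in minutes

late_only <- arr_delay[arr_delay > 0]    # keeps 12 and 45, plus an NA for the unknown delay

behind <- mean(arr_delay[arr_delay > 0], na.rm = TRUE)   # average of the positive delays
ahead  <- mean(arr_delay[arr_delay < 0], na.rm = TRUE)   # average of the negative delays
n      <- length(arr_delay)                              # all flights, not just the late ones

print(late_only)
print(c(behind = behind, ahead = ahead, n = n))
```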
Let’s begin with a simple example of labeling a numeric vector as either “+ve”
(positive) or “-ve”
(negative). Assign x
to the vector c(-3:3, NA)
and press "Run Code".
x <- c(-3:3, ...)
x <- c(-3:3, NA)
Copy the previous code. Then, type if_else(x > 0, "+ve", "-ve")
. This should output #> [1] "-ve" "-ve" "-ve" "-ve" "+ve" "+ve" "+ve" NA
.
x <- c(-3:3, NA)
if_else(x > ..., "+ve", "-ve")

x <- c(-3:3, NA)
if_else(x > 0, "+ve", "-ve")
There’s an optional fourth argument, missing
, which will be used if the input is NA
. In this example, we can add the string "???"
as an argument to if_else()
.
x <- c(-3:3, NA)
if_else(x > 0, "+ve", "-ve", ...)

x <- c(-3:3, NA)
if_else(x > 0, "+ve", "-ve", "???")
The output should be #> [1] "-ve" "-ve" "-ve" "-ve" "+ve" "+ve" "+ve" "???"
You can also use vectors for the true
and false
arguments. Let us attempt to create a minimal implementation of abs()
. Below the code assigning a vector to x
, type if_else(x < 0, -x, x)
. Click "Run Code".
x <- c(-3:3, NA)
x <- c(-3:3, NA)
if_else(x < ..., -x, x)

x <- c(-3:3, NA)
if_else(x < 0, -x, x)
This should output #> [1] 3 2 1 0 1 2 3 NA
. So far, all the arguments have used the same vectors, but you can also mix and match!
For example, you could implement a simple version of coalesce()
. Below the code assigning vectors to x1
and y1
, type if_else(is.na(x1), y1, x1)
and click "Run Code".
x1 <- c(NA, 1, 2, NA)
y1 <- c(3, NA, 4, 6)

x1 <- c(NA, 1, 2, NA)
y1 <- c(3, NA, 4, 6)
if_else(is.na(...), y1, x1)

x1 <- c(NA, 1, 2, NA)
y1 <- c(3, NA, 4, 6)
if_else(is.na(x1), y1, x1)
The output should be #> [1] 3 1 2 6
. You might have noticed a small infelicity in our labeling example above: zero is neither positive nor negative. We could resolve this by adding an additional if_else()
.
Below the provided code, type if_else(x == 0, "0", if_else(x < 0, "-ve", "+ve"), "???")
and click "Run Code".
x <- c(-3:3, NA)
x <- c(-3:3, NA)
if_else(x == 0, "0", ...(x < 0, "-ve", "+ve"), "???")

x <- c(-3:3, NA)
if_else(x == 0, "0", if_else(x < 0, "-ve", "+ve"), "???")
This is already a little hard to read, and you can imagine it would only get harder if you have more conditions. One solution is to switch to dplyr::case_when()
.
The code below should have the same output as above, but instead of using if_else
, we are using case_when()
. Click "Run Code".
x <- c(-3:3, NA)
case_when(
  x == 0 ~ "0",
  x < 0 ~ "-ve",
  x > 0 ~ "+ve",
  is.na(x) ~ "???"
)
dplyr’s case_when()
is inspired by SQL’s CASE
statement and provides a flexible way of performing different computations for different conditions. It has a special syntax that unfortunately looks like nothing else you’ll use in the tidyverse. It takes pairs that look like condition ~ output
. condition
must be a logical vector; when it’s TRUE
, output
will be used.
To explain how case_when()
works, let’s explore some simpler cases. If none of the cases match, the output gets an NA
. Click "Run Code".
x <- c(-3:3, NA)
case_when(
  x < 0 ~ "-ve",
  x > 0 ~ "+ve"
)
The output should be #> [1] "-ve" "-ve" "-ve" NA "+ve" "+ve" "+ve" NA
.
If you want to create a “default”/catch all value, use TRUE on the left hand side. Copy the previous code and add TRUE ~ "???"
as a new argument. Make sure to separate the arguments with a comma.
x <- c(-3:3, NA)
case_when(
  x < 0 ~ "-ve",
  x > 0 ~ "+ve",
  ... ~ "???"
)

x <- c(-3:3, NA)
case_when(
  x < 0 ~ "-ve",
  x > 0 ~ "+ve",
  TRUE ~ "???"
)
This should output #> [1] "-ve" "-ve" "-ve" "???" "+ve" "+ve" "+ve" "???"
.
Also, note that if multiple conditions match, only the first will be used. Click "Run Code".
x <- c(-3:3, NA)
case_when(
  x > 0 ~ "+ve",
  x > 2 ~ "big"
)
This should output #> [1] NA NA NA NA "+ve" "+ve" "+ve" NA
because only the first matching condition is used, so x > 2 ~ "big" never applies. Just like with if_else()
you can use variables on both sides of the ~
and you can mix and match variables as needed for your problem.
For example, we could use case_when()
to provide some human readable labels for the arrival delay. Your gratitude for having the code written for you is much appreciated.
flights |>
  mutate(
    status = case_when(
      is.na(arr_delay)      ~ "cancelled",
      arr_delay < -30       ~ "very early",
      arr_delay < -15       ~ "early",
      abs(arr_delay) <= 15  ~ "on time",
      arr_delay < 60        ~ "late",
      arr_delay < Inf       ~ "very late",
    ),
    .keep = "used"
  )
Be wary when writing this sort of complex case_when()
statement.
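One way to see why care is needed is to run the same conditions on a handful of made-up delays and then change their order. The vector delay and the reordered version below are illustrative assumptions; only case_when() itself comes from dplyr.

```r
library(dplyr)

delay <- c(NA, -45, -20, 0, 15, 30, 200)   # made-up arrival delays in minutes

# Conditions ordered from most specific to most general, as in the flights example.
status <- case_when(
  is.na(delay)     ~ "cancelled",
  delay < -30      ~ "very early",
  delay < -15      ~ "early",
  abs(delay) <= 15 ~ "on time",
  delay < 60       ~ "late",
  delay < Inf      ~ "very late"
)

# Putting a broad condition first swallows the cases meant for later conditions.
reordered <- case_when(
  delay < 60  ~ "late",
  delay < -30 ~ "very early"
)

print(status)
print(reordered)
```

In reordered, the flight that is 45 minutes early is labelled "late" because delay < 60 matches first, and the second condition is never reached.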
Note that both if_else()
and case_when()
require compatible types in the output. If they’re not compatible, you’ll see errors. To demonstrate, click "Run Code".
if_else(TRUE, "a", 1)
This should output #> Error in if_else(): #> ! Can't combine true <character> and false <double>.
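One possible fix, sketched below, is to convert both outputs to a common type yourself before calling if_else(). Converting the number to a string with as.character() is just one option, chosen here for illustration.

```r
library(dplyr)

# The mismatched call fails; try() captures the error so the script keeps running.
result <- try(if_else(TRUE, "a", 1), silent = TRUE)
failed <- inherits(result, "try-error")

# Converting the number to a string gives both branches the same type.
fixed <- if_else(c(TRUE, FALSE), "a", as.character(1))

print(failed)
print(fixed)
```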
This tutorial covered Chapter 12: Logical vectors from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. You learned how to create logical vectors with >
, <
, <=
, >=
, ==
, !=
, and is.na()
, how to combine them with !
, &
, and |
, and how to summarize them with any()
, all()
, sum()
, and mean()
. You also learned about the powerful if_else()
and case_when()
functions that allow you to return values depending on the value of a logical vector.