Logical vectors

library(learnr)
library(tutorial.helpers)
library(tidyverse)
library(nycflights13)

knitr::opts_chunk$set(echo = FALSE)
options(tutorial.exercise.timelimit = 60, 
        tutorial.storage = "local") 


Introduction

This tutorial covers Chapter 12: Logical vectors from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. You will learn how to create logical vectors with >, <, <=, >=, ==, !=, and is.na(), how to combine them with !, &, and |, and how to summarize them with any(), all(), sum(), and mean(). You will also learn about the powerful if_else() and case_when() functions that allow you to return values depending on the value of a logical vector.

Comparisons

Logical vectors are the simplest type of vector because each element can only be one of three possible values: TRUE, FALSE, and NA. It's relatively rare to find logical vectors in your raw data, but you'll create and manipulate them in almost every analysis.

Exercise 1

Load the tidyverse package.


library(...)
library(tidyverse)

Most of the functions you'll learn about in this chapter are provided by base R, so we don't need the tidyverse, but we'll still load it so we can use mutate(), filter(), and others to work with data frames.

Exercise 2

Load the nycflights13 package.


library(...)
library(nycflights13)

You can find more information about the package with help(package = "nycflights13").

Exercise 3

Type flights and hit "Run Code".


A very common way to create a logical vector is via a numeric comparison with <, <=, >, >=, !=, and ==.
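As a quick illustration on a small vector (the vector `delays` is invented for this sketch and is not part of flights), a comparison returns one TRUE or FALSE per element:

```r
# Hypothetical delays in minutes, invented for illustration
delays <- c(-5, 0, 12, 75)

late <- delays > 10    # element-wise comparison
late                   # FALSE FALSE  TRUE  TRUE
typeof(late)           # "logical"
```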

Exercise 4

Pipe flights into the filter() function. Within filter(), add dep_time > 600 to only look at flights that departed after 6:00 AM.


flights |> 
  filter(...)
flights |> 
  filter(dep_time > 600)

In addition to & and |, R also has && and ||. Don't use them in dplyr functions! These are called short-circuiting operators and only ever return a single TRUE or FALSE. They're mainly used for programming, not data science.

Exercise 5

Use the same code as above and add & dep_time < 2000 to the call to filter(). This will look at flights that departed after 6:00 AM and before 8:00 PM.


flights |> 
  filter(dep_time > 600 & ...)
flights |> 
  filter(dep_time > 600 & dep_time < 2000)

An easy way to avoid the problem of getting your =='s and |'s in the right order is to use %in%. x %in% y returns a logical vector the same length as x that is TRUE whenever a value in x is anywhere in y.

Exercise 6

Use the same code as above and add & abs(arr_delay) < 20 to the call to filter(). This keeps only the flights whose arrival delay is less than 20 minutes in absolute value, that is, flights that arrived within 20 minutes of schedule.


flights |> 
  filter(dep_time > 600 & dep_time < 2000 & abs(arr_delay) < ...)
flights |> 
  filter(dep_time > 600 & dep_time < 2000 & abs(arr_delay) < 20)

You can always explicitly create the underlying logical variables with mutate().

Exercise 7

Start a new pipe with flights. Pipe flights into the mutate() function. Within mutate, use daytime = dep_time > 600 & dep_time < 2000. This will create a new column called daytime which is TRUE when the dep_time is between 6:00 AM and 8:00 PM.


flights |> 
  mutate(daytime = dep_time > ... & dep_time < ...)
flights |> 
  mutate(daytime = dep_time > 600 & dep_time < 2000)

Just remember that any manipulation we do to a free-floating vector, you can do to a variable inside a data frame with mutate() and friends.

Exercise 8

Use the same code as above and add , approx_ontime = abs(arr_delay) < 20 to the mutate() function. This will create a new column named approx_ontime which is TRUE when the absolute value of arr_delay is less than 20 minutes.


flights |> 
  mutate(daytime = dep_time > 600 & dep_time < 2000,
         approx_ontime = abs(arr_delay) < ...)
flights |> 
  mutate(daytime = dep_time > 600 & dep_time < 2000,
         approx_ontime = abs(arr_delay) < 20)

Exercise 9

Copy the previous code and add , .keep = "used" to the mutate() function. This keeps only the new columns and the columns used to create them, and drops all the unused columns.


flights |> 
  mutate(daytime = dep_time > 600 & dep_time < 2000,
         approx_ontime = abs(arr_delay) < 20,
         .keep = ...)
flights |> 
  mutate(daytime = dep_time > 600 & dep_time < 2000,
         approx_ontime = abs(arr_delay) < 20,
         .keep = "used")

Exercise 10

Copy previous code and pipe it into filter(). Within filter() add daytime & approx_ontime. This will keep only the rows that meet the specified conditions.


flights |> 
  mutate(daytime = dep_time > 600 & dep_time < 2000,
         approx_ontime = abs(arr_delay) < 20,
         .keep = "used")|>
  filter(daytime & ...)
flights |> 
  mutate(daytime = dep_time > 600 & dep_time < 2000,
         approx_ontime = abs(arr_delay) < 20,
         .keep = "used") |>
  filter(daytime & approx_ontime)

So far, we’ve mostly created logical variables transiently within filter() — they are computed, used, and then thrown away. For example, the previous filter() finds all daytime departures that arrive roughly on time.

Exercise 11

Set x to c(1 / 49 * 49, sqrt(2) ^ 2) using <-. Then, click "Run Code".


x <- c(..., ...)
x <- c(1 / 49 * 49, sqrt(2) ^ 2)

Beware of using == with numbers. For example, it looks like this vector contains the numbers 1 and 2. But if you test them for equality, you get FALSE.

Exercise 12

Copy the previous code and run x == c(1, 2). The expected output is FALSE for both elements.


x <- c(1 / 49 * 49, sqrt(2) ^ 2)
x == c(..., ...)
x <- c(1 / 49 * 49, sqrt(2) ^ 2)
x == c(1, 2)

What's going on? Computers store numbers with a fixed number of decimal places so there's no way to exactly represent 1/49 or sqrt(2) and subsequent computations will be very slightly off.

Exercise 13

We can see the exact values by calling print() with the digits argument. Copy the previous code and run print(x, digits = 16). This will print the values of x with 16 significant digits.


x <- c(1 / 49 * 49, sqrt(2) ^ 2)
x == c(1, 2)
print(x, digits = ...)
x <- c(1 / 49 * 49, sqrt(2) ^ 2)
x == c(1, 2)
print(x, digits = 16)

You can see why R defaults to rounding these numbers; they really are very close to what you expect. Now that you've seen why == is failing, what can you do about it? One option is to use dplyr::near() which ignores small differences:

Exercise 14

Copy the previous code and type near() on the next line. Within near(), add x, c(1,2). This should come out as TRUE for both.


x <- c(1 / 49 * 49, sqrt(2) ^ 2)
x == c(1, 2)
print(x, digits = 16)
near(x, c(..., ...))
x <- c(1 / 49 * 49, sqrt(2) ^ 2)
x == c(1, 2)
print(x, digits = 16)
near(x, c(1, 2))
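The idea behind this kind of approximate comparison can be sketched in base R: instead of asking whether two numbers are exactly equal, ask whether their difference is smaller than a small tolerance. The names `target`, `tol`, and `close_enough` and the tolerance value are assumptions chosen for this sketch.

```r
x <- c(1 / 49 * 49, sqrt(2) ^ 2)
target <- c(1, 2)

x - target                        # tiny, non-zero differences
tol <- sqrt(.Machine$double.eps)  # assumed tolerance, roughly 1.5e-8
close_enough <- abs(x - target) < tol
close_enough                      # TRUE TRUE
```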

Missing values represent the unknown so they are “contagious”: almost any operation involving an unknown value will also be unknown.

Exercise 15

Run NA > 5 and 10 == NA. They should both come out as NA.


NA > ...
... == NA
NA > 5
10 == NA

R normally calls print for you (i.e. x is a shortcut for print(x)), but calling it explicitly is useful if you want to provide other arguments.

Exercise 16

Now, run NA == NA. This should also come out as NA.


NA == NA

That is the most confusing result. It's easiest to understand why this is true if we artificially supply a little more context.

Exercise 17

Set age_mary <- NA and age_john <- NA on the next line. On a new line, type age_mary == age_john and click run code.


age_mary <- ...
... <- NA
age_mary == age_john
age_mary <- NA
age_john <- NA
age_mary == age_john

This should come out as NA because if both of their ages are unknown, then we can't know if they are the same age.

Exercise 18

Start a new pipe with flights. Pipe that into the filter() function. Within filter(), add dep_time == NA. This will attempt to find all the rows where dep_time is missing.


flights|>
  filter(dep_time == ...)
flights|>
  filter(dep_time == NA)

The code above doesn’t work because dep_time == NA will yield NA for every single row, and filter() automatically drops missing values. Instead we'll need a new tool: is.na().

Exercise 19

Type is.na() into the code chunk. Within is.na(), add c(TRUE, NA, FALSE).


is.na(c(TRUE, NA, ...))
is.na(c(TRUE, NA, FALSE))

You will see the output FALSE for TRUE and FALSE, but TRUE for NA. What happens when there are characters or numbers in the input vector?

Exercise 20

Type is.na() into the code chunk. Within is.na(), add c(1, NA, 'b').


is.na(c(1, NA, ...))
is.na(c(1, NA, 'b'))

You will see the output TRUE for NA and FALSE for 1 and b. This is because is.na() works with any type of vector and returns TRUE for missing values and FALSE for everything else.
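One detail worth checking for yourself: a vector can only hold one type, so mixing a number with a string converts everything to character before is.na() ever sees it. A minimal sketch (the name `v` is just for illustration):

```r
v <- c(1, NA, 'b')
class(v)    # "character": the number 1 was converted to the string "1"
is.na(v)    # FALSE  TRUE FALSE
```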

Exercise 21

Start a new pipe with flights. Pipe flights into filter(). Within filter(), add is.na(dep_time). This will find all the rows with a missing dep_time.


flights |> 
  filter(is.na(...))
flights |> 
  filter(is.na(dep_time))

is.na() can also be useful in arrange(). arrange() usually puts all the missing values at the end but you can override this default by first sorting with is.na().

Exercise 22

Start a new pipe with flights. Pipe it into the filter() function. Within filter(), add month == 1, day == 1.


flights|>
  filter(month == ..., day == ...)
flights |> 
  filter(month == 1, day == 1)

The == operator is a comparison operator in R that checks if two values are equal. It returns TRUE if the values are equal and FALSE otherwise.

Exercise 23

Continue the pipe with arrange(). Within arrange(), add dep_time. This should arrange dep_time from least to greatest.


flights|>
  filter(month == 1, day == 1) |>
  arrange(...)
flights |> 
  filter(month == 1, day == 1) |>
  arrange(dep_time)

The arrange() function is used to order rows in a data set based on a column. It allows you to sort the data set in either ascending or descending order.

Exercise 24

Within arrange(), put is.na(dep_time).


flights|>
  filter(month == 1, day == 1)|>
  arrange(is.na(...))
flights |> 
  filter(month == 1, day == 1) |>
  arrange(is.na(dep_time))

This checks if the dep_time column has missing values (NA). It returns a logical vector with TRUE for rows where dep_time is NA and FALSE otherwise.

Exercise 25

Within arrange(), put desc(is.na(dep_time)), dep_time.


flights|>
  filter(month == 1, day == 1)|>
  arrange(desc(is.na(...)), ...)
flights |> 
  filter(month == 1, day == 1) |>
  arrange(desc(is.na(dep_time)), dep_time)

The desc() function is used to create a descending order of the logical vector obtained from is.na(dep_time). This means that rows with missing dep_time values (NA) will appear first in the data frame. We will discuss missing values further in the Missing Values tutorial.
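The same sorting idea can be tried on a plain vector in base R. This is only an analogue of the arrange() call, not what dplyr does internally; the vector `dep` is invented for illustration, and sorting on !is.na(dep) plays the role of desc(is.na(dep_time)).

```r
# Hypothetical departure times, invented for illustration
dep <- c(800, NA, 517, NA, 1230)

sort_default  <- dep[order(dep)]                # NA values sort last by default
sort_na_first <- dep[order(!is.na(dep), dep)]   # missing values first, then ascending

sort_default    # 517  800 1230   NA   NA
sort_na_first   #  NA   NA  517  800 1230
```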

Boolean algebra

Once you have multiple logical vectors, you can combine them together using Boolean algebra. In R, & is “and”, | is “or”, ! is “not”, and xor() is exclusive or. For example, df |> filter(!is.na(x)) finds all rows where x is not missing and df |> filter(x < -10 | x > 0) finds all rows where x is smaller than -10 or bigger than 0. Figure 13.1 shows the complete set of Boolean operations and how they work.

Exercise 1

The rules for missing values in Boolean algebra are a little tricky to explain because they seem inconsistent at first glance. Click "Run Code".

df <- tibble(x = c(TRUE, FALSE, NA))

df |> 
  mutate(
    and = x & NA,
    or = x | NA
  )

To understand what’s going on, think about NA | TRUE (NA or TRUE). A missing value in a logical vector means that the value could either be TRUE or FALSE. TRUE | TRUE and FALSE | TRUE are both TRUE because at least one of them is TRUE. NA | TRUE must also be TRUE because NA can either be TRUE or FALSE. However, NA | FALSE is NA because we don’t know if NA is TRUE or FALSE. Similar reasoning applies with NA & FALSE.
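The four cases described above can be checked directly at the console:

```r
NA | TRUE    # TRUE: the answer is TRUE whichever value NA stands for
NA | FALSE   # NA: the answer depends on the unknown value
NA & FALSE   # FALSE: the answer is FALSE whichever value NA stands for
NA & TRUE    # NA: the answer depends on the unknown value
```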

Exercise 2

Start a new pipe with flights and pipe it into filter(). Within filter(), add month == 11 | month == 12.


flights |>
  filter(month == ... | month == ...)
flights |>
  filter(month == 11 | month == 12)

Note that the order of operations doesn’t work like English. To prove this, we can take the previous code that finds all flights that departed in November or December. You might be tempted to write it like you’d say in English: “Find all flights that departed in November or December.”

Exercise 3

Now, try replacing month == 11 | month == 12 with month == 11 | 12. This should do the same thing as before. (Spoiler Alert: It doesn't!)


flights |>
  filter(month == ... | ...)
flights |>
  filter(month == 11 | 12)

The code does not return an error, but it doesn’t seem to have worked either. What happened? Here, R first evaluates month == 11 creating a logical vector, which we call nov. It now computes nov | 12. When you use a number with a logical operator it converts everything apart from 0 to TRUE, so this is equivalent to nov | TRUE which will always be TRUE, so every row will be selected.
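A small sketch makes the problem easier to see. Here `month` is a plain three-element vector invented for this illustration, standing in for the month column of flights:

```r
month <- c(1, 11, 12)       # small stand-in for the month column

as.logical(12)              # TRUE: any non-zero number counts as TRUE
wrong <- month == 11 | 12   # parsed as (month == 11) | 12
right <- month == 11 | month == 12

wrong                       # TRUE TRUE TRUE
right                       # FALSE  TRUE  TRUE
```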

Exercise 4

Start a new pipe with flights. Pipe flights into the mutate() function. Within mutate(), set nov = month == 11.


flights|>
  mutate(nov = month == ...)
flights|>
  mutate(nov = month == 11)

nov = month == 11 creates a new variable called nov that is TRUE for rows where the month is equal to 11, and FALSE otherwise. The expression month == 11 compares each value of the month column with 11, resulting in a logical vector.

Exercise 5

Copy previous code and add final = nov | 12, .keep = "used" to the mutate() function. final = nov | 12 results in the variable final being TRUE if nov is TRUE or if the value 12 is considered as TRUE. The .keep = "used" argument keeps only the new variables, and the variables that were used to make them in the tibble.


flights|>
 mutate(nov = month == 11,
        final = nov | ...,
        .keep = ...)
flights|>
 mutate(nov = month == 11,
        final = nov | 12,
        .keep = "used")

As well as & and |, R also has && and ||. Don’t use them in dplyr functions! These are called short-circuiting operators and only ever return a single TRUE or FALSE. They’re important for programming, not data science.
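To see the difference, compare the element-wise operators with the short-circuiting ones on made-up values (`a` and `b` are invented for this sketch):

```r
a <- c(TRUE, FALSE, NA)
b <- c(TRUE, TRUE, FALSE)

a & b     # TRUE FALSE FALSE
a | b     # TRUE  TRUE    NA

# && and || work on single values and stop as soon as the answer is known,
# so the stop() call on the right-hand side is never evaluated here.
short_and <- FALSE && stop("not evaluated")
short_or  <- TRUE || stop("not evaluated")
short_and # FALSE
short_or  # TRUE
```

Because && and || return a single value, they cannot produce the one-result-per-row vector that filter() and mutate() expect.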

Exercise 6

An easy way to avoid the problem of getting your ==s and |s in the right order is to use %in%. x %in% y returns a logical vector the same length as x that is TRUE whenever a value in x is anywhere in y. For example, type 1:12 %in% c(1,5,11) and click "Run Code".


1:12 %in% c(..., ..., ...)
1:12 %in% c(1, 5, 11)

This code shows what would happen if we assign all the numbers from 1 through 12 to x, and 1, 5, and 11 to y, and then run x %in% y.
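As a quick check, %in% gives the same result as the longer chain of == joined by | (the names `with_in` and `with_or` are just for this sketch):

```r
x <- 1:12

with_in <- x %in% c(1, 5, 11)
with_or <- x == 1 | x == 5 | x == 11

identical(with_in, with_or)   # TRUE
which(with_in)                # 1 5 11
```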

Exercise 7

To find all of the flights in November and December, start a new pipe with flights. Pipe it into filter(). Within filter(), add month %in% c(11,12).


flights |>
  filter(month %in% c(..., ...))
flights |>
  filter(month %in% c(11, 12))

Fun Fact: %in% obeys different rules for NA than ==, as NA %in% NA is TRUE while NA == NA is NA.

Exercise 8

Type c(1, 2, NA) == NA and click "Run code".


c(1, 2, NA) == ...
c(1, 2, NA) == NA

This should output #> [1] NA NA NA because when comparing any value (including NA) with NA using the == operator, the result will be NA. This is because the value of NA represents unknown information, so the result of most comparisons involving NA is also unknown.

Exercise 9

Copy the previous code but instead of using ==, use %in%. This checks whether each value on the left side (x) appears on the right side (y).


c(1, 2, NA) %in% ...
c(1, 2, NA) %in% NA

This should output #> [1] FALSE FALSE TRUE because NA is the only value that appears on both sides.

Exercise 10

Start a new pipe with flights. Pipe it into filter(). Within filter(), add dep_time %in% c(NA, 0800) to find all rows where dep_time is NA or 0800.


flights |> 
  filter(dep_time %in% c(..., ...))
flights |> 
  filter(dep_time %in% c(NA, 0800))

This should result in a tibble where the only values in the dep_time column are NA and 800 (R reads 0800 as the number 800 and prints it without the leading zero).

Summaries

Next, we will describe some useful techniques for summarizing logical vectors. In addition to functions that only work with logical vectors, you can also use functions that work with numeric vectors.

Exercise 1

Start a pipe with flights. Pipe it into summarize(). Within summarize(), add all_delayed = all(dep_delay <= 60). This will make a new column named all_delayed that is TRUE only if every flight has a dep_delay of at most 60 minutes.


flights |>
  summarize(all_delayed = all(dep_delay <= ...))
flights |>
  summarize(all_delayed = all(dep_delay <= 60))

There are two main logical summary functions: any() and all(). any(x) is the equivalent of |; it’ll return TRUE if there are any TRUE’s in x. all(x) is the equivalent of &; it’ll return TRUE only if all values of x are TRUE’s.

Exercise 2

Copy the previous code. Within all(), add na.rm = TRUE, separated from the first argument by a comma. When na.rm is set to TRUE, missing values are removed before the summary is computed. na.rm is short for "NA remove".


flights |> 
  summarize(all_delayed = all(dep_delay <= 60, na.rm = ...))
flights |> 
  summarize(all_delayed = all(dep_delay <= 60, na.rm = TRUE))

Next, we will use all() and any() to find out if every flight was delayed on departure by at most an hour or if any flights were delayed on arrival by five hours or more. Grouping with the .by argument will allow us to do that by day.

Exercise 3

Copy the previous code. Within summarize(), add any_long_delay = any(arr_delay >= 300). This will make a new column named any_long_delay that is TRUE if at least one flight has an arr_delay greater than or equal to 300 minutes.


flights |> 
  summarize(
    all_delayed = all(dep_delay <= 60, na.rm = TRUE),
    any_long_delay = any(arr_delay >= ...))
flights |> 
  summarize(
    all_delayed = all(dep_delay <= 60, na.rm = TRUE),
    any_long_delay = any(arr_delay >= 300))

any() and all() can return NA when missing values are present and the remaining values do not settle the answer on their own. As usual, you can make the missing values go away with na.rm = TRUE.
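A few one-line checks show how any() and all() treat missing values. They follow the same logic as | and &: the result is NA only when the unknown value could change the answer.

```r
any(c(TRUE, NA))                  # TRUE: one TRUE already settles it
any(c(FALSE, NA))                 # NA: the missing value could be TRUE
all(c(FALSE, NA))                 # FALSE: one FALSE already settles it
all(c(TRUE, NA))                  # NA: the missing value could be FALSE

any(c(FALSE, NA), na.rm = TRUE)   # FALSE
all(c(TRUE, NA), na.rm = TRUE)    # TRUE
```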

Exercise 4

Copy the previous code. Add the na.rm = TRUE argument in any().


flights |> 
  summarize(
    all_delayed = all(dep_delay <= 60, na.rm = TRUE),
    any_long_delay = any(arr_delay >= 300, ...))
flights |> 
  summarize(
    all_delayed = all(dep_delay <= 60, na.rm = TRUE),
    any_long_delay = any(arr_delay >= 300, na.rm = TRUE))

>= is used when you want to find values of a variable that are greater than or equal to a number. Conversely, you could use <= to find values that are less than or equal to a number. Finally, == can be used to find values that are exactly equal to a number.

Exercise 5

Copy the previous code. Add .by = c(year, month) to the summarize() function. Make sure to separate the arguments with commas. The added code will make the tibble have one row for each combination of year and month.


flights |> 
  summarize(
    all_delayed = all(dep_delay <= 60, na.rm = TRUE),
    any_long_delay = any(arr_delay >= 300, na.rm = TRUE),
    .by = c(..., ...))
flights |> 
  summarize(
    all_delayed = all(dep_delay <= 60, na.rm = TRUE),
    any_long_delay = any(arr_delay >= 300, na.rm = TRUE),
    .by = c(year, month))

However, in most cases, any() and all() are a little too crude, and it would be helpful to get a little more detail about how many values are TRUE and FALSE. This is what numeric summaries are for.

Exercise 6

Add day to the .by argument.


flights |>
  summarize(
    all_delayed = all(dep_delay <= 60, na.rm = TRUE),
    any_long_delay = any(arr_delay >= 300, na.rm = TRUE),
    .by = c(year, month, ...))
flights |> 
  summarize(
    all_delayed = all(dep_delay <= 60, na.rm = TRUE),
    any_long_delay = any(arr_delay >= 300, na.rm = TRUE),
    .by = c(year, month, day))

This allows us to find out if every flight was delayed on departure by at most an hour and if any flight was delayed on arrival by five hours or more. The resulting tibble has one row for each combination of year, month, and day. Essentially, we will have one row for each day of the year.

Exercise 7

Start a new pipe with flights. Pipe it into summarize(). Within summarize(), add all_delayed = mean(dep_delay <= 60).


flights |> 
  summarize(
    all_delayed = ...(dep_delay <= 60))
flights |> 
  summarize(
    all_delayed = mean(dep_delay <= 60))

This should create a column called all_delayed that holds the proportion of flights whose dep_delay is at most 60 minutes. Because dep_delay contains missing values, the result is NA for now.

Exercise 8

Copy the previous code. Add na.rm = TRUE within mean().


flights|>
  summarize(all_delayed = mean(dep_delay <= 60,...))
flights |> 
  summarize(
    all_delayed = mean(dep_delay <= 60, na.rm = TRUE))

Setting na.rm equal to TRUE removes all NA values.

Exercise 9

Copy the previous code and add any_long_delay = sum(arr_delay >= 300) as a new argument in summarize(). Again, make sure to separate the arguments with commas.


flights |> 
  summarize(
    all_delayed = mean(dep_delay <= 60, na.rm = TRUE),
    any_long_delay = ...(arr_delay >= 300))
flights |> 
  summarize(
    all_delayed = mean(dep_delay <= 60, na.rm = TRUE),
    any_long_delay = sum(arr_delay >= 300))

When applied to a logical vector, sum() gives the number of TRUEs. Because arr_delay contains missing values, the result is NA until you add na.rm = TRUE.

Exercise 10

Copy the previous code and add na.rm = TRUE within sum().


flights|>
  summarize(all_delayed = mean(dep_delay <= 60, na.rm = TRUE),
            any_long_delay = sum(arr_delay >= 300, ...))
flights |> 
  summarize(
    all_delayed = mean(dep_delay <= 60, na.rm = TRUE),
    any_long_delay = sum(arr_delay >= 300, na.rm = TRUE))

When you use a logical vector in a numeric context, TRUE becomes 1 and FALSE becomes 0. This makes sum() and mean() very useful with logical vectors because sum(x) gives the number of TRUEs and mean(x) gives the proportion of TRUEs (mean() is sum() divided by length()).
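A small made-up vector shows the arithmetic (`on_time` is invented for this sketch):

```r
# Hypothetical flags: was each flight delayed by at most an hour?
on_time <- c(TRUE, TRUE, FALSE, NA, TRUE)

sum(on_time, na.rm = TRUE)    # 3 values are TRUE
mean(on_time, na.rm = TRUE)   # 0.75, i.e. 3 of the 4 non-missing values
sum(on_time)                  # NA, because one value is missing
```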

Exercise 11

Copy the previous code and add .by = c(year, month) as a new argument within summarize().


flights|>
  summarize(all_delayed = mean(dep_delay <= 60, na.rm = TRUE),
            any_long_delay = sum(arr_delay >= 300, na.rm = TRUE),
            .by = c(..., ...))
flights |> 
  summarize(
    all_delayed = mean(dep_delay <= 60, na.rm = TRUE),
    any_long_delay = sum(arr_delay >= 300, na.rm = TRUE),
    .by = c(year, month))

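Always wrap multiple grouping columns in c(), as in .by = c(year, month); otherwise, only the first column is treated as a grouping variable and the rest are read as separate arguments to summarize().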

Exercise 12

Copy the previous code and add day to the .by argument within summarize() to get a row for every single day of the year. Make sure to separate the column names with commas!


flights |>  
  summarize(
    all_delayed = mean(dep_delay <= 60, na.rm = TRUE),
    any_long_delay = sum(arr_delay >= 300, na.rm = TRUE),
    .by = c(year, month, ...))
flights |> 
  summarize(
    all_delayed = mean(dep_delay <= 60, na.rm = TRUE),
    any_long_delay = sum(arr_delay >= 300, na.rm = TRUE),
    .by = c(year, month, day))

This lets us see the proportion of flights that were delayed on departure by at most 1 hour (60 minutes) and the number of flights that were delayed on arrival by 5 hours (300 minutes) or more.

Exercise 13

Start a new pipe with flights. Pipe it into filter(). Within filter(), add arr_delay > 0 as the argument.


flights |>
  filter(arr_delay > ...)
flights |>
  filter(arr_delay > 0)

There’s one final use for logical vectors in summaries: you can use a logical vector to filter a single variable to a subset of interest. This makes use of the base [ (pronounced subset) operator.

Exercise 14

Continue the pipe with summarize(). Within summarize(), add the argument behind = mean(arr_delay).


flights |>
  filter(arr_delay > 0) |>
  summarize(behind = mean(...))
flights |>
  filter(arr_delay > 0) |>
  summarize(behind = mean(arr_delay))

This will look at the average delay only for flights that were actually delayed. We did this by first filtering the flights, then calculating the average delay.

Exercise 15

Copy the previous code and add the argument n = n() within summarize(). Make sure to separate arguments within a function with commas.


flights |>
  filter(arr_delay > 0)|>
  summarize(behind = mean(arr_delay),
            n = ...())
flights |>
  filter(arr_delay > 0) |>
  summarize(behind = mean(arr_delay),
            n = n())

n = n() is used to calculate the count of observations where arr_delay is greater than 0.

Exercise 16

Add .by = c(year, month, day) to have one row for each day of the year.


flights |>
  filter(arr_delay > 0) |>
  summarize(behind = mean(arr_delay),
            n = n(),
            .by = c(..., ..., ...))
flights |>
  filter(arr_delay > 0) |>
  summarize(behind = mean(arr_delay),
            n = n(),
            .by = c(year, month, day))

This looks at the average delay, for each day, just for flights that were actually delayed. We did this by first filtering the flights and then calculating the average delay.

This works, but what if we wanted to also compute the average delay for flights that arrived early? We’d need to perform a separate filter step, and then figure out how to combine the two data frames together. Alternatively, you could use [ to perform an inline filtering: arr_delay[arr_delay > 0] will yield only the positive arrival delays.

Exercise 17

Press "Run Code".

flights|> 
  summarize(
    behind = mean(arr_delay[arr_delay > 0], na.rm = TRUE),
    ahead = mean(arr_delay[arr_delay < 0], na.rm = TRUE),
    n = n(),
    .by = c(year, month, day)
  )

Also note the difference in the group size: in the first chunk n() gives the number of delayed flights per day; in the second, n() gives the total number of flights.
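The inline subsetting used above can be tried on a small vector. In this sketch `arr_delay` is a plain vector invented for illustration, not the flights column. Note that the subset keeps an NA wherever the comparison is NA, which is why na.rm = TRUE is still needed.

```r
# Hypothetical arrival delays in minutes, invented for illustration
arr_delay <- c(-5, 10, NA, 30, -2)

arr_delay[arr_delay > 0]   # 10 NA 30: the NA comparison yields an NA element

behind <- mean(arr_delay[arr_delay > 0], na.rm = TRUE)
ahead  <- mean(arr_delay[arr_delay < 0], na.rm = TRUE)
behind   # 20
ahead    # -3.5
```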

Conditional transformations

Exercise 1

Let’s begin with a simple example of labeling a numeric vector as either “+ve” (positive) or “-ve” (negative). Assign x to the vector c(-3:3, NA) and press "Run Code".


x <- c(-3:3, ...)
x <- c(-3:3, NA)

Exercise 2

Copy the previous code. Then, type if_else(x > 0, "+ve", "-ve"). This should output #> [1] "-ve" "-ve" "-ve" "-ve" "+ve" "+ve" "+ve" NA.


x <- c(-3:3, NA)
if_else(x > ..., "+ve", "-ve")
x <- c(-3:3, NA)
if_else(x > 0, "+ve", "-ve")

Exercise 3

There’s an optional fourth argument, missing, which will be used if the input is NA. In this example, we can add the string "???" as an argument to if_else().


x <- c(-3:3, NA)
if_else(x > 0, "+ve", "-ve", ...)
x <- c(-3:3, NA)
if_else(x > 0, "+ve", "-ve", "???")

The output should be #> [1] "-ve" "-ve" "-ve" "-ve" "+ve" "+ve" "+ve" "???"

Exercise 4

You can also use vectors for the true and false arguments. Let us attempt to create a minimal implementation of abs(). Below the code assigning a vector to x, type if_else(x < 0, -x, x). Click "Run Code".

x <- c(-3:3, NA)

x <- c(-3:3, NA)
if_else(x < ..., -x, x)
x <- c(-3:3, NA)
if_else(x < 0, -x, x)

This should output #> [1] 3 2 1 0 1 2 3 NA. So far, all the arguments have used the same vectors, but you can also mix and match!

Exercise 5

For example, you could implement a simple version of coalesce(). Below the code assigning vectors to x1 and y1, type if_else(is.na(x1), y1, x1) and click "Run Code".

x1 <- c(NA, 1, 2, NA)
y1 <- c(3, NA, 4, 6)

x1 <- c(NA, 1, 2, NA)
y1 <- c(3, NA, 4, 6)
if_else(is.na(...), y1, x1)
x1 <- c(NA, 1, 2, NA)
y1 <- c(3, NA, 4, 6)
if_else(is.na(x1), y1, x1)

The output should be #> [1] 3 1 2 6. You might have noticed a small infelicity in our labeling example above: zero is neither positive nor negative. We could resolve this by adding an additional if_else().

Exercise 6

Below the provided code, type if_else(x == 0, "0", if_else(x < 0, "-ve", "+ve"), "???") and click "Run Code".

x <- c(-3:3, NA)

x <- c(-3:3, NA)
if_else(x == 0, "0", ...(x < 0, "-ve", "+ve"), "???")
x <- c(-3:3, NA)
if_else(x == 0, "0", if_else(x < 0, "-ve", "+ve"), "???")

This is already a little hard to read, and you can imagine it would only get harder if you have more conditions. One solution is to switch to dplyr::case_when().

Exercise 7

The code below should have the same output as above, but instead of using if_else, we are using case_when(). Click "Run Code".

x <- c(-3:3, NA)
case_when(
  x == 0   ~ "0",
  x < 0    ~ "-ve", 
  x > 0    ~ "+ve",
  is.na(x) ~ "???"
)

dplyr’s case_when() is inspired by SQL’s CASE statement and provides a flexible way of performing different computations for different conditions. It has a special syntax that unfortunately looks like nothing else you’ll use in the tidyverse. It takes pairs that look like condition ~ output. condition must be a logical vector; when it’s TRUE, output will be used.

Exercise 8

To explain how case_when() works, let’s explore some simpler cases. If none of the cases match, the output gets an NA. Click "Run Code".

x <- c(-3:3, NA)
case_when(
  x < 0 ~ "-ve",
  x > 0 ~ "+ve"
)

The output should be #> [1] "-ve" "-ve" "-ve" NA "+ve" "+ve" "+ve" NA.

Exercise 9

If you want to create a “default”/catch all value, use TRUE on the left hand side. Copy the previous code and add TRUE ~ "???" as a new argument. Make sure to separate the arguments with a comma.


x <- c(-3:3, NA)
case_when(
  x < 0 ~ "-ve",
  x > 0 ~ "+ve",
  ... ~ "???"
)
x <- c(-3:3, NA)
case_when(
  x < 0 ~ "-ve",
  x > 0 ~ "+ve",
  TRUE ~ "???"
)

This should output #> [1] "-ve" "-ve" "-ve" "???" "+ve" "+ve" "+ve" "???".

Exercise 10

Also, note that if multiple conditions match, only the first will be used. Click "Run Code".

x <- c(-3:3, NA)
case_when(
  x > 0 ~ "+ve",
  x > 2 ~ "big"
)

This should output #> [1] NA NA NA NA "+ve" "+ve" "+ve" NA because only the first matching condition, x > 0, is used, so "big" never appears. Just like with if_else() you can use variables on both sides of the ~ and you can mix and match variables as needed for your problem.
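Here is a minimal sketch of using variables on both sides of the ~. The tibble `df` and its columns `x`, `y`, and `z` are invented for illustration.

```r
library(dplyr)

# Hypothetical columns, invented for illustration
df <- tibble(
  x = c(-2, 0, 5, NA),
  y = c(10, 20, 30, 40)
)

out <- df |>
  mutate(
    z = case_when(
      is.na(x) ~ y,    # fall back to y when x is missing
      x < 0    ~ -x,   # flip the sign of negative values
      TRUE     ~ x     # otherwise keep x
    )
  )

out$z   # 2 0 5 40
```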

Exercise 11

For example, we could use case_when() to provide some human readable labels for the arrival delay. Your gratitude for having the code written for you is much appreciated.

flights |> 
  mutate(
    status = case_when(
      is.na(arr_delay)      ~ "cancelled",
      arr_delay < -30       ~ "very early",
      arr_delay < -15       ~ "early",
      abs(arr_delay) <= 15  ~ "on time",
      arr_delay < 60        ~ "late",
      arr_delay < Inf       ~ "very late",
    ),
    .keep = "used"
  )

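Be wary when writing this sort of complex case_when() statement: mixing < and > makes it easy to create overlapping conditions by accident, and only the first matching condition is used.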

Exercise 12

Note that both if_else() and case_when() require compatible types in the output. If they’re not compatible, you’ll see errors. To demonstrate, click "Run Code".

if_else(TRUE, "a", 1)

This should output #> Error in if_else(): #> ! Can't combine true <character> and false <double>.
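One way around the error, sketched here as an assumption about what you want the result to be, is to convert the outputs to a common type yourself. The tryCatch() wrapper is only there so the failing call does not stop the script.

```r
library(dplyr)

# The mixed-type call fails; tryCatch() captures the error instead of stopping
result <- tryCatch(
  if_else(TRUE, "a", 1),
  error = function(e) "error"
)
result   # "error"

# Converting the number to a string makes both outputs character vectors
fixed <- if_else(c(TRUE, FALSE), "a", as.character(1))
fixed    # "a" "1"
```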

Summary

This tutorial covered Chapter 12: Logical vectors from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. You learned how to create logical vectors with >, <, <=, >=, ==, !=, and is.na(), how to combine them with !, &, and |, and how to summarize them with any(), all(), sum(), and mean(). You also learned about the powerful if_else() and case_when() functions that allow you to return values depending on the value of a logical vector.





r4ds.tutorials documentation built on April 3, 2025, 5:50 p.m.