Filter observations

library(learnr)
library(tidyverse)
library(nycflights13)
library(Lahman)

tutorial_options(
  exercise.timelimit = 60,
  # A simple checker function that just returns the message in the check chunk
  exercise.checker = function(check_code, ...) {
    list(
      message = eval(parse(text = check_code)),
      correct = logical(0),
      type = "info",
      location = "append"
    )
  }
)
knitr::opts_chunk$set(error = TRUE)

Welcome

This is a demo tutorial. Compare it to the source code that made it.

In this tutorial, you will learn how to:

The readings in this tutorial follow R for Data Science, section 5.2.

Prerequisites

To practice these skills, we will use the flights data set from the nycflights13 package. This data frame comes from the US Bureau of Transportation Statistics and contains all r format(nrow(nycflights13::flights), big.mark = ",") flights that departed from New York City in 2013. It is documented in ?flights.

We will also use the ggplot2 package to visualize the data.

If you are ready to begin, click on!

Filter rows with filter()

filter()

filter() lets you use a logical test to extract specific rows from a data frame. To use filter(), pass it the data frame followed by one or more logical tests. filter() will return every row that passes each logical test.

So for example, we can use filter() to select every flight in flights that departed on January 1st. Click Run Code to give it a try:

filter(flights, month == 1, day == 1)

output

Like all dplyr functions, filter() returns a new data frame for you to save or use. It doesn't overwrite the old data frame.

If you want to save the output of filter(), you'll need to use the assignment operator, <-.

Rerun the command in the code chunk below, but first arrange to save the output to an object named jan1.

filter(flights, month == 1, day == 1)
jan1 <- filter(flights, month == 1, day == 1)

Good job! You can now see the results by running the name jan1 by itself. Or you can pass jan1 to a function that takes data frames as input.

Did you notice that this code used the double equal operator, ==? == is one of R's logical comparison operators. Comparison operators are key to using filter(), so let's take a look at them.

Logical Comparisons

Comparison operators

R provides a suite of comparison operators that you can use to compare values: >, >=, <, <=, != (not equal), and == (equal). Each creates a logical test. For example, is pi greater than three?

pi > 3

When you place a logical test inside of filter(), filter applies the test to each row in the data frame and then returns the rows that pass, as a new data frame.

Our code above returned every row whose month value was equal to one and whose day value was equal to one.

Watch out!

When you start out with R, the easiest mistake to make is to test for equality with = instead of ==. When this happens you'll get an informative error:

filter(flights, month = 1)

Multiple tests

If you give filter() more than one logical test, filter() will combine the tests with an implied "and." In other words, filter() will return only the rows that return TRUE for every test. You can combine tests in other ways with Boolean operators...

Boolean operators

&, |, and !

R uses boolean operators to combine multiple logical comparisons into a single logical test. These include & (and), | (or), ! (not or negation), and xor() (exactly or).

Both | and xor() will return TRUE if one or the other logical comparison returns TRUE. xor() differs from | in that it will return FALSE if both logical comparisons return TRUE. The name xor stands for exactly or.

Study the diagram below to get a feel for how these operators work.

knitr::include_graphics("images/transform-logical.png")

Test Your Knowledge

question(" What will the following code return? `filter(flights, month == 11 | month == 12)`",
         answer("Every flight that departed in November _or_ December", correct = TRUE),
         answer("Every flight that departed in November _and_ December", message = "Technically a flight could not have departed in November _and_ December unless it departed twice."),
         answer("Every flight _except for_ those that departed in November or December"),
         answer("An error. This is an incorrect way to combine tests.", message = "The next section will say a little more about combining tests."),
         allow_retry = TRUE
)

Common mistakes

In R, the order of operations doesn't work like English. You can't write filter(flights, month == 11 | 12), even though you might say "finds all flights that departed in November or December". Be sure to write out a complete test on each side of a boolean operator.

Here are four more tips to help you use logical tests and Boolean operators in R:

  1. A useful short-hand for this problem is x %in% y. This will select every row where x is one of the values in y. We could use it to rewrite the code in the question above:

    r nov_dec <- filter(flights, month %in% c(11, 12))

  1. Sometimes you can simplify complicated subsetting by remembering De Morgan's law: !(x & y) is the same as !x | !y, and !(x | y) is the same as !x & !y. For example, if you wanted to find flights that weren't delayed (on arrival or departure) by more than two hours, you could use either of the following two filters:

    r filter(flights, !(arr_delay > 120 | dep_delay > 120)) filter(flights, arr_delay <= 120, dep_delay <= 120)

  1. As well as & and |, R also has && and ||. Don't use them with filter()! You'll learn when you should use them later.

  1. Whenever you start using complicated, multipart expressions in filter(), consider making them explicit variables instead. That makes it much easier to check your work. You'll learn how to create new variables shortly.

Missing values

NA

Missing values can make comparisons tricky in R. R uses NA to represent missing or unknown values. NAs are "contagious" because almost any operation involving an unknown value (NA) will also be unknown (NA). For example, can you determine what value these expressions that use missing values should evaluate to? Make a prediction and then click "Submit Answer".

NA > 5
10 == NA
NA + 10
NA / 2
NA == NA
"In every case, R does not have enough information to compute a result. Hence, each result is an unknown value, `NA`."

is.na()

The most confusing result above is this one:

NA == NA

It's easiest to understand why this is true with a bit more context:

# Let x be Mary's age. We don't know how old she is.
x <- NA

# Let y be John's age. We don't know how old he is.
y <- NA

# Are John and Mary the same age?
x == y
# We don't know!

If you want to determine if a value is missing, use is.na():

is.na(x)

filter() and NAs

filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA values. If you want to preserve missing values, ask for them explicitly:

df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)
filter(df, is.na(x) | x > 1)

Exercises

Exercise 1

Use the code chunks below to find all flights that

  1. Had an arrival delay of two or more hours

    ```r

    r filter(flights, arr_delay >= 120) # arr_delay is in minutes ```

  2. Flew to Houston (IAH or HOU)

    ```r

    r filter(flights, dest %in% c("IAH", "HOU")) ```

    Hint: This is a good case for the %in% operator.

  3. Were operated by United (UA), American (AA), or Delta (DL)

    ```r

    r filter(flights, carrier %in% c("UA", "AA", "DL")) ```

    Hint: The carrier variable lists the airline that operated each flight. This is another good case for the %in% operator.

  4. Departed in summer (July, August, and September)

    ```r

    r filter(flights, 6 < month, month < 10) ```

    Hint: When converted to numbers, July, August, and September become 7, 8, and 9.

  5. Arrived more than two hours late, but didn't leave late

    ```r

    r filter(flights, arr_delay > 120, dep_delay <= 0) ```

    Hint: Remember that departure and arrival delays are recorded in minutes.

  6. Were delayed more than an hour, but made up more than 30 minutes in flight

    ```r

    r filter(flights, dep_delay > 60, (dep_delay - arr_delay) > 30) ```

    Hint: The time a plane makes up is dep_delay - arr_delay.

  7. Departed between midnight and 6am (inclusive)

    ```r

    r filter(flights, dep_time <= 600 | dep_time == 2400) ```

    Hint: Don't forget flights that left at exactly midnight (2400). This is a good case for an "or" operator.

Exercise 2

Another useful dplyr filtering helper is between(). What does it do? Can you use between() to simplify the code needed to answer the previous challenges?

?between

Exercise 3

How many flights have a missing dep_time? What other variables are missing? What might these rows represent?


filter(flights, is.na(dep_time))
**Hint:** This is a good case for `is.na()`.
"Flights with a missing departure time are probably cancelled flights."

Exercise 4

Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)


# any number with a zero exponent is equal to one
NA ^ 0
# unknown value or true evaluates to true
# (because if one operand of "or" is true, we can be sure the result is true)
NA | TRUE
# false and unknown value evaluates to false
# (because if one operand of "and" is true, we can be sure the result is false)
FALSE & NA
# this is not a logical comparison, it's a numerical calculation involving an
# unknown value, thus resulting in an unknown value
NA * 0


Try the learnr package in your browser

Any scripts or data that you put into this service are public.

learnr documentation built on Sept. 28, 2023, 9:06 a.m.