library(learnr)
library(tidyverse)
library(nycflights13)
library(Lahman)

tutorial_options(
  exercise.timelimit = 60,
  # A simple checker function that just returns the message in the check chunk
  exercise.checker = function(check_code, ...) {
    list(
      message = eval(parse(text = check_code)),
      correct = logical(0),
      type = "info",
      location = "append"
    )
  }
)

knitr::opts_chunk$set(error = TRUE)
This is a demo tutorial. Compare it to the source code that made it.
In this tutorial, you will learn how to use filter() to extract observations from a data frame or tibble.
The readings in this tutorial follow R for Data Science, section 5.2.
To practice these skills, we will use the flights data set from the nycflights13 package. This data frame comes from the US Bureau of Transportation Statistics and contains all 336,776 flights that departed from New York City in 2013. It is documented in ?flights.
We will also use the ggplot2 package to visualize the data.
If you are ready to begin, click on!
filter()
filter() lets you use a logical test to extract specific rows from a data frame. To use filter(), pass it the data frame followed by one or more logical tests. filter() will return every row that passes each logical test.
For example, we can use filter() to select every flight in flights that departed on January 1st. Click Run Code to give it a try:
filter(flights, month == 1, day == 1)
Like all dplyr functions, filter() returns a new data frame for you to save or use. It doesn't overwrite the old data frame.
If you want to save the output of filter(), you'll need to use the assignment operator, <-.
Rerun the command in the code chunk below, but first arrange to save the output to an object named jan1.
filter(flights, month == 1, day == 1)
jan1 <- filter(flights, month == 1, day == 1)
Good job! You can now see the results by running the name jan1 by itself. Or you can pass jan1 to a function that takes data frames as input.
Did you notice that this code used the double equal operator, ==? == is one of R's logical comparison operators. Comparison operators are key to using filter(), so let's take a look at them.
R provides a suite of comparison operators that you can use to compare values: >, >=, <, <=, != (not equal), and == (equal). Each creates a logical test. For example, is pi greater than three?
pi > 3
When you place a logical test inside of filter(), filter() applies the test to each row in the data frame and then returns the rows that pass, as a new data frame.
Our code above returned every row whose month value was equal to one and whose day value was equal to one.
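To see what filter() does with a test, it can help to look at the logical vector the test produces. The sketch below uses a small made-up tibble named toy (a hypothetical stand-in for flights, not part of the tutorial data):

```r
library(dplyr)

# Hypothetical stand-in for flights; the values are made up for illustration
toy <- tibble(
  month = c(1, 1, 2, 12),
  day   = c(1, 2, 1, 1)
)

# The test produces one logical value per row
passes <- toy$month == 1
passes

# filter() keeps exactly the rows where the test is TRUE
january <- filter(toy, month == 1)
january
```

Comparing passes with the rows in january shows that filter() simply keeps the rows whose test result is TRUE.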
When you start out with R, the easiest mistake to make is to test for equality with = instead of ==. When this happens you'll get an informative error:
filter(flights, month = 1)
If you give filter() more than one logical test, filter() will combine the tests with an implied "and." In other words, filter() will return only the rows that return TRUE for every test. You can combine tests in other ways with Boolean operators...
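As a minimal sketch of the implied "and", the two calls below should select the same rows. The tibble toy is made up for illustration and is not part of the tutorial data:

```r
library(dplyr)

# Hypothetical data; the values are made up for illustration
toy <- tibble(
  month = c(1, 1, 2),
  day   = c(1, 2, 1)
)

# Tests separated by commas are combined with an implied "and"
with_comma <- filter(toy, month == 1, day == 1)

# The same condition written with an explicit &
with_and <- filter(toy, month == 1 & day == 1)

# Check that both approaches return the same rows
all.equal(with_comma, with_and)
```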
R uses Boolean operators to combine multiple logical comparisons into a single logical test. These include & (and), | (or), ! (not, or negation), and xor() (exclusive or).
Both | and xor() will return TRUE if one or the other logical comparison returns TRUE. xor() differs from | in that it will return FALSE if both logical comparisons return TRUE. The name xor stands for exclusive or.
Study the diagram below to get a feel for how these operators work.
knitr::include_graphics("images/transform-logical.png")
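The same relationships can be tabulated in base R. This is only a sketch; the vectors x and y and the data frame truth are names chosen here for illustration:

```r
# Every combination of two logical values
x <- c(TRUE, TRUE, FALSE, FALSE)
y <- c(TRUE, FALSE, TRUE, FALSE)

# One column per Boolean operator
truth <- data.frame(
  x     = x,
  y     = y,
  and   = x & y,
  or    = x | y,
  xor   = xor(x, y),
  not_x = !x
)
truth
```

The or and xor columns differ only in the first row, where both x and y are TRUE.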
question("What will the following code return? `filter(flights, month == 11 | month == 12)`",
  answer("Every flight that departed in November _or_ December", correct = TRUE),
  answer("Every flight that departed in November _and_ December", message = "Technically a flight could not have departed in November _and_ December unless it departed twice."),
  answer("Every flight _except for_ those that departed in November or December"),
  answer("An error. This is an incorrect way to combine tests.", message = "The next section will say a little more about combining tests."),
  allow_retry = TRUE
)
In R, the order of operations doesn't work like English. You can't write filter(flights, month == 11 | 12), even though you might say "find all flights that departed in November or December". R evaluates month == 11 first and then combines that result with 12, which it treats as TRUE, so the test passes for every row. Be sure to write out a complete test on each side of a Boolean operator.
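A short sketch of why the shortcut misbehaves, using a made-up vector of month values rather than the flights data:

```r
# Hypothetical month values, made up for illustration
month <- c(1, 6, 11, 12)

# Parsed as (month == 11) | 12; the number 12 is coerced to TRUE,
# so the result is TRUE for every element
shortcut <- month == 11 | 12
shortcut

# A complete test on each side of | gives the intended result
complete <- month == 11 | month == 12
complete
```

Inside filter(), the first form would therefore keep every row instead of only the November and December rows.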
Here are four more tips to help you use logical tests and Boolean operators in R:
A useful short-hand for this problem is x %in% y. This will select every row where x is one of the values in y. We could use it to rewrite the code in the question above:
nov_dec <- filter(flights, month %in% c(11, 12))
Sometimes you can simplify complicated subsetting by remembering De Morgan's law: !(x & y) is the same as !x | !y, and !(x | y) is the same as !x & !y. For example, if you wanted to find flights that weren't delayed (on arrival or departure) by more than two hours, you could use either of the following two filters:
filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120, dep_delay <= 120)
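One way to convince yourself that the two filters agree is to run both on a small example. The sketch below uses a made-up tibble named toy (hypothetical values, including some NAs) in place of flights:

```r
library(dplyr)

# Hypothetical delays in minutes; the values are made up for illustration
toy <- tibble(
  arr_delay = c(10, 130, NA, 50, NA),
  dep_delay = c(5, 20, 200, 150, 10)
)

# Negated "or" form
negated <- filter(toy, !(arr_delay > 120 | dep_delay > 120))

# Equivalent form after applying De Morgan's law
rewritten <- filter(toy, arr_delay <= 120, dep_delay <= 120)

# Check that both filters keep the same rows
all.equal(negated, rewritten)
```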
As well as & and |, R also has && and ||. Don't use them with filter()! You'll learn when you should use them later.
Whenever you start using complicated, multipart expressions in filter(), consider making them explicit variables instead. That makes it much easier to check your work. You'll learn how to create new variables shortly.
Missing values can make comparisons tricky in R. R uses NA to represent missing or unknown values. NAs are "contagious" because almost any operation involving an unknown value (NA) will also be unknown (NA). For example, can you determine what value these expressions that use missing values should evaluate to? Make a prediction and then click "Submit Answer".
NA > 5
10 == NA
NA + 10
NA / 2
NA == NA
"In every case, R does not have enough information to compute a result. Hence, each result is an unknown value, `NA`."
The most confusing result above is this one:
NA == NA
It's easiest to understand why this is true with a bit more context:
# Let x be Mary's age. We don't know how old she is.
x <- NA
# Let y be John's age. We don't know how old he is.
y <- NA
# Are John and Mary the same age?
x == y
# We don't know!
If you want to determine if a value is missing, use is.na():
is.na(x)
filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA values. If you want to preserve missing values, ask for them explicitly:
df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)
filter(df, is.na(x) | x > 1)
Use the code chunks below to find all flights that
Had an arrival delay of two or more hours
```r
filter(flights, arr_delay >= 120) # arr_delay is in minutes
```
Flew to Houston (IAH or HOU)
```r
filter(flights, dest %in% c("IAH", "HOU"))
```
This is a good case for the %in% operator.
Were operated by United (UA), American (AA), or Delta (DL)
```r
filter(flights, carrier %in% c("UA", "AA", "DL"))
```
The carrier variable lists the airline that operated each flight. This is another good case for the %in% operator.
Departed in summer (July, August, and September)
```r
filter(flights, 6 < month, month < 10)
```
Arrived more than two hours late, but didn't leave late
```r
filter(flights, arr_delay > 120, dep_delay <= 0)
```
Were delayed more than an hour, but made up more than 30 minutes in flight
```r
filter(flights, dep_delay > 60, (dep_delay - arr_delay) > 30)
```
The time a plane makes up in flight is dep_delay - arr_delay.
Departed between midnight and 6am (inclusive)
```r
filter(flights, dep_time <= 600 | dep_time == 2400)
```
Don't forget flights that departed at exactly midnight (2400). This is a good case for an "or" operator.
Another useful dplyr filtering helper is between(). What does it do? Can you use between() to simplify the code needed to answer the previous challenges?
?between
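As a starting point, here is a minimal sketch of between() on a made-up vector of month values (hypothetical data, not taken from flights). between(x, left, right) tests whether each value of x lies in the range from left to right, including both endpoints:

```r
library(dplyr)

# Hypothetical month values, made up for illustration
month <- c(5, 7, 9, 10)

# TRUE where 7 <= month <= 9
in_range <- between(month, 7, 9)
in_range

# The same test written with two comparisons
by_hand <- month >= 7 & month <= 9
by_hand
```

Because between() returns a logical vector, it can be used inside filter() like any other logical test.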
How many flights have a missing dep_time? What other variables are missing? What might these rows represent?
filter(flights, is.na(dep_time))
"Flights with a missing departure time are probably cancelled flights."
Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)
# any number with a zero exponent is equal to one
NA ^ 0
# unknown value or true evaluates to true
# (because if one operand of "or" is true, we can be sure the result is true)
NA | TRUE
# false and unknown value evaluates to false
# (because if one operand of "and" is false, we can be sure the result is false)
FALSE & NA
# this is not a logical comparison, it's a numerical calculation involving an
# unknown value, thus resulting in an unknown value
NA * 0