| filter | R Documentation |
These functions are used to subset a data frame, applying the expressions in
... to determine which rows should be kept (for filter()) or dropped (
for filter_out()).
Multiple conditions can be supplied separated by a comma. These will be
combined with the & operator. To combine comma separated conditions using
| instead, wrap them in when_any().
Both filter() and filter_out() treat NA like FALSE. This subtle
behavior can impact how you write your conditions when missing values are
involved. See the section on Missing values for important details and
examples.
filter(.data, ..., .by = NULL, .preserve = FALSE)
filter_out(.data, ..., .by = NULL, .preserve = FALSE)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
< |
.by |
< |
.preserve |
Relevant when the |
An object of the same type as .data. The output has the following
properties:
Rows are a subset of the input, but appear in the same order.
Columns are not modified.
The number of groups may be reduced (if .preserve is not TRUE).
Data frame attributes are preserved.
Both filter() and filter_out() treat NA like FALSE. This results in
the following behavior:
filter() drops both NA and FALSE.
filter_out() keeps both NA and FALSE.
This means that filter(data, <conditions>) + filter_out(data, <conditions>)
captures every row within data exactly once.
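The partition property above can be sketched in base R. This is only an illustration of the NA-as-FALSE rule, not how dplyr implements it; `keep_rows()` and `drop_rows()` are hypothetical helpers introduced for this example.

```r
# Base R sketch of the NA-as-FALSE rule. `keep_rows()` and `drop_rows()` are
# hypothetical helpers for illustration, not dplyr functions.
keep_rows <- function(df, cond) df[!is.na(cond) & cond, , drop = FALSE]
drop_rows <- function(df, cond) df[!(!is.na(cond) & cond), , drop = FALSE]

cars <- data.frame(class = c("suv", NA, "coupe"), stringsAsFactors = FALSE)
cond <- cars$class == "suv"  # TRUE, NA, FALSE

kept <- keep_rows(cars, cond)     # only the "suv" row
dropped <- drop_rows(cars, cond)  # the NA row and the "coupe" row

# Every input row lands in exactly one of the two results
nrow(kept) + nrow(dropped) == nrow(cars)
```

Because an NA condition is never counted as a match, the NA row is excluded from the "keep" result and retained in the "drop" result.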
The NA handling of these functions has been designed to match your
intent. When your intent is to keep rows, use filter(). When your intent
is to drop rows, use filter_out().
For example, if your goal with this cars data is to "drop rows where the
class is suv", then you might write this in one of two ways:
cars <- tibble(class = c("suv", NA, "coupe"))
cars
#> # A tibble: 3 x 1
#> class
#> <chr>
#> 1 suv
#> 2 <NA>
#> 3 coupe
cars |> filter(class != "suv")
#> # A tibble: 1 x 1
#> class
#> <chr>
#> 1 coupe
cars |> filter_out(class == "suv")
#> # A tibble: 2 x 1
#> class
#> <chr>
#> 1 <NA>
#> 2 coupe
Note how filter() drops the NA rows even though our goal was only to drop
"suv" rows, but filter_out() matches our intuition.
To generate the correct result with filter(), you'd need to use:
cars |> filter(class != "suv" | is.na(class))
#> # A tibble: 2 x 1
#> class
#> <chr>
#> 1 <NA>
#> 2 coupe
This quickly gets unwieldy when multiple conditions are involved.
In general, if you find yourself:
Using "negative" operators like != or !
Adding in NA handling like | is.na(col) or & !is.na(col)
then you should consider if swapping to the other filtering variant would make your conditions simpler.
Base subsetting with [ doesn't treat NA like TRUE or FALSE. Instead,
it generates a fully missing row, which is different from how both filter()
and filter_out() work.
cars <- tibble(class = c("suv", NA, "coupe"), mpg = c(10, 12, 14))
cars
#> # A tibble: 3 x 2
#> class mpg
#> <chr> <dbl>
#> 1 suv 10
#> 2 <NA> 12
#> 3 coupe 14
cars[cars$class == "suv",]
#> # A tibble: 2 x 2
#> class mpg
#> <chr> <dbl>
#> 1 suv 10
#> 2 <NA> NA
cars |> filter(class == "suv")
#> # A tibble: 1 x 2
#> class mpg
#> <chr> <dbl>
#> 1 suv 10
There are many functions and operators that are useful when constructing the expressions used to filter the data:
==, >, >= etc
&, |, !, xor()
is.na()
between(), near()
when_any(), when_all()
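As a brief sketch of two of the helpers listed above, the following uses a small made-up tibble (the data and column names are assumptions for illustration). `between()` tests an inclusive range, and `near()` compares floating point numbers with a tolerance, which matters because `sqrt(2)^2 == 2` is `FALSE`.

```r
library(dplyr)

# Hypothetical toy data for illustration
df <- tibble(
  x = c(1, 2.5, NA, 4),
  y = c(sqrt(2)^2, 2, 2, 5)
)

# Keeps rows where 1 <= x <= 3; the NA row is dropped because NA is treated
# like FALSE
in_range <- df |> filter(between(x, 1, 3))

# Keeps rows where y equals 2 up to floating point tolerance, including the
# sqrt(2)^2 row that an exact `==` comparison would miss
close_to_two <- df |> filter(near(y, 2))
```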
Because filtering expressions are computed within groups, they may yield different results on grouped tibbles. This will be the case as soon as an aggregating, lagging, or ranking function is involved. Compare this ungrouped filtering:
starwars |> filter(mass > mean(mass, na.rm = TRUE))
With the grouped equivalent:
starwars |> filter(mass > mean(mass, na.rm = TRUE), .by = gender)
In the ungrouped version, filter() compares the value of mass in each row
to the global average (taken over the whole data set), keeping only the rows
with mass greater than this global average. In contrast, the grouped
version calculates the average mass separately for each gender group, and
keeps rows with mass greater than the relevant within-gender average.
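The difference between the two comparisons can be sketched in base R on a small made-up data frame (the values below are assumptions, not taken from starwars). This only illustrates the global versus per-group average; it is not how dplyr evaluates grouped filters internally.

```r
# Hypothetical toy data, not the starwars data set
df <- data.frame(
  gender = c("a", "a", "b", "b"),
  mass = c(10, 30, 100, 200)
)

# Ungrouped: compare each row to the global mean (85)
global_keep <- df$mass > mean(df$mass, na.rm = TRUE)

# Grouped: compare each row to the mean of its own group
# (20 for "a", 150 for "b")
group_mean <- ave(df$mass, df$gender, FUN = function(x) mean(x, na.rm = TRUE))
group_keep <- df$mass > group_mean

df[global_keep, ]  # rows 3 and 4
df[group_keep, ]   # rows 2 and 4
```

Row 2 is below the global average but above the average of its own group, so it is kept only by the grouped comparison; row 3 shows the opposite case.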
This function is a generic, which means that packages can provide implementations (methods) for other classes. See the documentation of individual methods for extra arguments and differences in behaviour.
The methods currently available depend on which packages are loaded; the list is generated when the documentation is rendered.
Other single table verbs:
arrange(),
mutate(),
reframe(),
rename(),
select(),
slice(),
summarise()
# Filtering for one criterion
filter(starwars, species == "Human")
# Filtering for multiple criteria within a single logical expression
filter(starwars, hair_color == "none" & eye_color == "black")
filter(starwars, hair_color == "none" | eye_color == "black")
# Multiple comma separated expressions are combined using `&`
starwars |> filter(hair_color == "none", eye_color == "black")
# To combine comma separated expressions using `|` instead, use `when_any()`
starwars |> filter(when_any(hair_color == "none", eye_color == "black"))
# Filtering out to drop rows
filter_out(starwars, hair_color == "none")
# When filtering out, it can be useful to first interactively filter for the
# rows you want to drop, just to double check that you've written the
# conditions correctly. Then, just change `filter()` to `filter_out()`.
filter(starwars, mass > 1000, eye_color == "orange")
filter_out(starwars, mass > 1000, eye_color == "orange")
# The filtering operation may yield different results on grouped
# tibbles because the expressions are computed within groups.
#
# The following keeps rows where `mass` is greater than the
# global average:
starwars |> filter(mass > mean(mass, na.rm = TRUE))
# Whereas this keeps rows with `mass` greater than the per `gender`
# average:
starwars |> filter(mass > mean(mass, na.rm = TRUE), .by = gender)
# If you find yourself trying to use a `filter()` to drop rows, then
# you should consider if switching to `filter_out()` can simplify your
# conditions. For example, to drop blond individuals, you might try:
starwars |> filter(hair_color != "blond")
# But this also drops rows with an `NA` hair color! To retain those:
starwars |> filter(hair_color != "blond" | is.na(hair_color))
# But explicit `NA` handling like this can quickly get unwieldy, especially
# with multiple conditions. Since your intent was to specify rows to drop
# rather than rows to keep, use `filter_out()`. This also removes the need
# for any explicit `NA` handling.
starwars |> filter_out(hair_color == "blond")
# To refer to column names that are stored as strings, use the `.data`
# pronoun:
vars <- c("mass", "height")
cond <- c(80, 150)
starwars |>
  filter(
    .data[[vars[[1]]]] > cond[[1]],
    .data[[vars[[2]]]] > cond[[2]]
  )
# Learn more in ?rlang::args_data_masking