if(packageVersion("dplyr") < "1.0.0"){ print(glue::glue( "This tutorial was designed to work with dplyr version >= 1.0.0. You have dplyr version {packageVersion(\"dplyr\")}. Parts of the tutorial will give errors unless you update dplyr." )) } # continue knitting after errors knitr::opts_chunk$set(echo = TRUE, rows.print = 3, error = TRUE) # packages library("learnr") library("tidyverse") theme_set(theme_classic()) tutorial_options(exercise.cap = "Exercise")
dplyr
Data frames are usually the most convenient objects for storing, plotting or analysing data in R. We also need to be able to manipulate data in data frames.
This tutorial will show you how to manipulate data frames using the dplyr
package, part of the tidyverse
.
Everything that can be done with dplyr
can be done with base R.
The dplyr
solution is usually easier to write and understand, and can be faster.
There are many function in dplyr
, this tutorial focuses on the functions I find most useful.
This tutorial is designed for dplyr
version >= 1.0.0.
library("tidyverse") library("conflicted") # prevent conflict errors conflict_prefer("select", winner = "dplyr") conflict_prefer("filter", winner = "dplyr")
tidylog
We can get more feedback on what dplyr
functions have done to the data by loading the tidylog
package.
library("tidylog") # use purrr::map to iterate over tidylog functions to prevent conflict with dplyr map(getNamespaceExports("tidylog"), ~conflict_prefer(.x, "tidylog", quiet = TRUE))
More than once, tidylog
has helped identified bugs in my code.
penguins
datasetThis tutorial will use the penguins
dataset from the palmerpenguins
package.
This dataset includes measurements of three species of penguin.
# Load the data data("penguins", package = "palmerpenguins") # Show the data penguins
select
You can choose which columns of the data frame you want with select()
.
The first argument is the data, which is supplied by the pipe %>%
, the next arguments are the names of the columns you want.
The names do not need quote marks.
#select species, bill_length_mm & bill_depth_mm penguins %>% select(species, bill_length_mm, bill_depth_mm)
This is equivalent to the base R code
#select species, bill_length_mm & bill_depth_mm penguins[, c("species", "bill_length_mm", "bill_depth_mm")]
Remember that if you want to use the output of this code in a further analysis, you need to assign it to an object name with <-
.
select
helpersSometimes we don't want to write out the names of all the columns we want to select. We might not even know them all in advance. Fortunately there are some helper functions.
If you want to select
adjacent columns, you can use the notation first:last
.
#select species to bill_depth_mm penguins %>% select(species:bill_depth_mm)
Sometimes it is easier to remove the columns you don't want.
You can do this by putting a -
in front of the column name.
#select everything but year and sex penguins %>% select(-year, -sex)
If there is a pattern to the column names that we want to select (or remove), there are some helper functions.
For example, to select columns that start with "bill", we can use the function starts_with
.
#select bill_length_mm & bill_depth_mm" penguins %>% select(starts_with("bill"))
Conversely, if we want to select all columns that end with "mm", we can use the function ends_with()
.
contains()
is more flexible and matches()
is the most powerful of the helper functions, using regular expressions to identify the columns.
Sometimes, you might want to select all the columns of a certain type.
For example, to select all the numeric columns we can use the is.numeric
function inside select
with the helper where
.
penguins %>% select(where(is.numeric)) # No brackets on the function
Other is.*
functions exist, for example, is.character
for text, lubridate::is.Date
for dates.
You can also select columns by number (1 being the first column), but this is generally a bad idea because it makes the code difficult to understand and if a new column is added, or the column order is changed, the code will break.
Find at least three different ways to select the bill_length_mm and bill_depth_mm columns from penguins
penguins %>% select() penguins %>% select() penguins %>% select()
#direct penguins %>% select(bill_length_mm, bill_depth_mm) #negative penguins %>% select(-species, -island, -flipper_length_mm, -body_mass_g, - sex, -year) #starts with penguins %>% select(starts_with("bill")) #matches regular expression penguins %>% select(matches("^bill")) #numeric position penguins %>% select(3:4)
Which of these strategies works best is context dependent.
rename
You can use rename()
to rename columns
penguins %>% rename(Species = species)
The syntax is new_name = current_name
.
You can also rename a column when selecting.
This is convenient if you are using select
anyway.
penguins %>% select(Species = species)
relocate
Sometimes it is useful to reorder the columns. This is never necessary for data analysis or plotting, but can be needed when making a table for presentation.
penguins %>% relocate(island)
The default is to move the named column first, the .before
and .after
arguments let you move the column into any position.
filter
Filtering rows that meet some condition is a very common task.
For example, to filter rows of penguins
that have a bill length greater than 40 mm, we can use
penguins %>% filter(bill_length_mm > 40)
This will filter out each row where the condition is TRUE.
The base R equivalent of this is
penguins[penguins$bill_length_mm > 40, ]
Generally, filter
makes it easier to understand, especially with more complex criteria.
Other tests include
==
exactly equals. Often a bad idea to use with numeric data near
safe function for testing equality of numeric data as it has a tolerance for rounding errors.sqrt(2) ^ 2 == 2 # should be true, but rounding errors sqrt(2) ^ 2 - 2 # the difference near(sqrt(2) ^ 2, 2) # safe alternative
!=
not equal to<
less than<=
less than or equal to>
greater than>=
greater than or equal tois.na()
for filtering by missing values.between()
for filtering values with a range%in%
is used when you want to test if a value is in a vectorpenguins %>% filter(species %in% c("Adelie", "Chinstrap")) #equivalent to penguins %>% filter(species == "Adelie" | species == "Chinstrap") # with many alternatives, this gets long
If we want to filter on multiple criteria, we need to decide whether we want all criteria to be TRUE (AND in Boolean logic), or for one or more to be TRUE (OR in Boolean logic).
If we want all criteria to be TRUE, we can separate them by a comma (or by an &
if you want to be explicit).
penguins %>% filter(bill_length_mm > 40, bill_depth_mm > 18)
If we want any criteria to be TRUE, we can separate them by a |
.
penguins %>% filter(bill_length_mm > 40 | bill_depth_mm > 18)
We can negate a criterion by putting !
in front of it.
So to filter rows that do not have bills longer than 40 mm we can use
penguins %>% filter(!bill_length_mm > 40)
Of course, in this example, we could also use <=
as the test.
Filter the penguins dataset so that it only has the species Chinstrap.
penguins %>% filter()
penguins %>% filter(species == "Chinstrap")
Filter the penguins dataset so that it only has flipper lengths between 200 and 210 mm.
penguins %>% filter()
penguins %>% filter(between(flipper_length_mm, left = 200, right = 210))
The commonest error is to use a single =
rather than ==
.
Only the latter is a test of equality.
If you do this, the error message is quite helpful.
penguins %>% filter(species = "Chinstrap")
Another common error is to forget to quote any strings.
penguins %>% filter(species == Chinstrap)
slice
Sometimes it is useful to extract rows by row number.
penguins %>% slice(3:7)
You can use negative numbers to remove rows.
Be careful using slice()
as if the data change, different rows may be returned.
distinct
If there are duplicates in the data, we can remove these with distinct()
.
distinct()
with no extra arguments will remove duplicate rows.
penguins %>% distinct()
If we are only interested in some of the columns, we can supply the names of these columns.
penguins %>% distinct(island)
Other columns will be removed unless the argument .keep_all = TRUE
is used.
Sometimes you want to sample rows at random from a data.frame.
Two useful functions for this are sample_n
and sample_frac
.
The first samples a constant n rows, the second samples a constant fraction of the rows.
penguins %>% sample_n(size = 10)
mutate
The function mutate()
can add an new column or replace an existing one.
To make a new column called body_mass_kg
we can use
penguins %>% mutate(body_mass_kg = body_mass_g / 1000)
There are lots of functions that are useful to use with mutate. Any function that returns either a single value or as many values as are in the data can be used.
mutate()
is very useful when cleaning data.
stringr
package.lubridate
package.Create a column with the mean flipper length. Then filter the dataset to keep just the rows with a flipper length greater than the mean flipper length
penguins %>% mutate() %>% filter()
penguins %>% mutate(flip_len_mean = mean(flipper_length_mm, na.rm = TRUE)) %>% filter(flipper_length_mm > flip_len_mean) #Also possible in one step penguins %>% filter(flipper_length_mm > mean(flipper_length_mm, na.rm = TRUE))
summarise
summarise
lets us summarise data.
We can use it if we want to calculate a summary statistic of the data.
Remember to separate arguments with a comma.
penguins %>% summarise( flipper_len_mean = mean(flipper_length_mm, na.rm = TRUE), flipper_len_sd = sd(flipper_length_mm, na.rm = TRUE) )
Only the columns created in the summarise
and any grouping columns (see below) will be kept.
Sometimes you want to summarise multiple columns at the same time.
This can be done with the across
helper function.
across
needs to be told which columns to process and what function or functions to use.
penguins %>% summarise(across(c(bill_length_mm, bill_depth_mm), .fns = mean, na.rm = TRUE)) #using a list of functions penguins %>% summarise(across(.cols = starts_with("bill"), .fns = list(sd = sd, mean = mean), na.rm = TRUE))
You can also use across
with mutate
to mutate several columns at the same time.
group_by
group_by()
changes the way that many of the dplyr
functions work.
Instead of working on the entire dataset, they now work on each group in the data
To find the mean flipper length for each species, we need to group_by
species and then summarise
.
penguins %>% group_by(species) %>% summarise(mean_flipper_length = mean(flipper_length_mm))
Grouped data can be ungrouped with ungroup()
.
This can help prevent surprises!
Modify your code from the mutate
task to filter rows where the flipper length is greater than the mean of the relevant species.
penguins %>%
penguins %>% group_by(species) %>% mutate(flip_len_mean = mean(flipper_length_mm, na.rm = TRUE)) %>% filter(flipper_length_mm > flip_len_mean) #Also possible in one step penguins %>% group_by(species) %>% filter(flipper_length_mm > mean(flipper_length_mm, na.rm = TRUE))
arrange
To sort the data frame by one or more of the variables we can use arrange
.
penguins %>% arrange(bill_length_mm, bill_depth_mm)
This will sort smallest first.
To reverse the sort order, use desc()
penguins %>% arrange(desc(bill_length_mm), desc(bill_depth_mm))
count
and n
The function n
can count how many rows there are in the each group (or, less usefully, the entire data frame if it is not grouped).
It can be used with either mutate
or summarise
.
penguins %>% group_by(species) %>% summarise(n = n())
count
roughly equivalent to this
penguins %>% count(species)
All the dplyr
functions we have looked at so far work on a single data frame.
Joins work on two data frames, combining them according to a common variable.
For example, the penguin_islands
data frame has the location of the three islands in the penguins dataset.
penguin_islands <- tribble( ~ island, ~ Latitude, ~ Longitude, # tribble is a convenient way to make small datasets "Torgersen", -64.766667,-64.083333, "Biscoe", -64.818569, -63.775636, "Dream", -64.733333, -64.233333, "Alpha", -64.316667, -63) penguin_islands
We can use a join to merge these data with the original penguin data.
The by
argument tells the join which column to make the join by.
Here, we are joining by a single column with the same name in both data frames.
It is possible to join by multiple columns and where the columns have different names in each dataset.
penguin2 <- penguin %>% group_by(species) %>% slice(1:2) # small version of data for easy viewing penguin2 %>% left_join(penguin_islands, by = "island")
Only three of the four islands in penguin_islands
have data in the penguins
data set.
Different variants of join will treat this in different ways.
left_join
will take all rows from the first (left) data frame and matching rows from the second (right). right_join
does the opposite to left_join
, taking all rows from the second (right) data frame and matching rows from the first.inner_join
will take only rows that match in both data frames.full_join
will take all rows from in both data frames.In all cases, missing values will are given an NA
.
If we have two or more data frames that we want to combine.
bind_cols
If the data frames contain information about the same observations, they can be combined with bind_cols
.
So data1, data2, and data3 can be combined to make one data frame with many columns
bind_cols(data1, data2, data3)
Note that bind_cols
expects that the row order is the same in both datasets, but cannot check this.
It only checks that the number of rows is the same in each data frame.
If possible, use a join instead.
bind_rows
If the data frame contain more observations (rows), and typically at least some of the same columns, they can be combined with bind_rows
.
This is useful if, for example, there are data from two years that need combining.
One feature of bind_rows
that I find useful is the .id
argument that makes an extra column for an identifier.
svalbard_islands <- tribble( ~ island, ~ Latitude, ~ Longitude "Nordaustlandet", 79.558405, 24.017351, "Prins Karls Forland", 78.554090, 11.256545) bind_rows(Palmer = penguin_islands, Svalbard = svalbard_islands, .id = "Archipelago")
Ideally column names should follow the standard rules for naming objects in R - UPPER and lower case letters, numbers, "." and "_" with the first character being a letter (or a dot if you want an invisible object). Sometimes when you import data, it has non-standard names with spaces or extra characters. If you need to refer to a column name that doesn't follow the rules, you need to enclose it with back-ticks.
df <- tibble(`Region/Country` = "Norway", value = 42) df df %>% rename(region_country = `Region/Country`)
It is sometimes best to rename these columns to make them easier to refer to.
stat545
R for data science (Online copy of book)
dplyr
cheatsheet
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.