Questions
How can I add variables to my data?
How can I alter the variables already in my data?
Objectives
To be able to add new variables to the data
To understand the basic concepts behind different data types
Often, the data we have do not contain exactly what we need. We might need to change the order of factors, create new variables based on other columns in the data, or even variables conditional on specific values in other columns. Perhaps we even want to alter all columns of a specific type in the same way?
These are the things we will cover in this session.
Yesterday we went through functions in the {dplyr} package that are about subsetting and ordering the data in a data set.
This session will focus on the mutate() function, which is {dplyr}'s function to create or alter variables in a data set.
select() (covered in Day 1 session)
filter() (covered in Day 1 session)
arrange() (covered in Day 1 session)
mutate() (covered in this session)
group_by() (covered in this session)
summarize() (covered in Day 3 session)
Let us first talk about selecting columns. In {dplyr}, the function name for selecting columns is select()!
In fact, the {tidyverse} names for functions are inspired by English grammar, which will help us when we are writing our code.
As yesterday, we need to start off by making sure we have the {tidyverse} package loaded, and the penguins data set ready at hand.
library(tidyverse)
penguins <- palmerpenguins::penguins
In {tidyverse}, when we add new variables, we use the mutate() function. Just like the other {tidyverse} functions, mutate() works specifically with data sets, and provides a nice shorthand for working directly with the columns in the data set.
penguins %>% mutate(new_var = 1)
The output of this can be hard to spot, depending on the size of the screen. But you should be able to spot a new column in the data set called "new_var", and it has the value 1 for all rows!
This is what we told mutate() to do! We specified a new column by name, and gave it a specific value, 1.
This works because it's easy to assign a single value to all rows. What if we try to give it three values? What would we expect?
penguins %>% mutate(var = 1:3)
Here, it's failing with a mysterious message. The error is telling us that the input must be of size 344 or 1. 344 is the number of rows in the data set, so it's telling us the input we gave is not suitable because it is neither of length 344 nor of length 1.
So now we know the premise for mutate(): it takes inputs that are either of length 1 or of the same length as the number of rows in the data set.
penguins %>% mutate(var = 1:344)
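The length rule is easier to see on a tiny data set. The sketch below uses a made-up four-row tibble called `toy` (a hypothetical name, not part of the lesson data) and wraps the failing call in `tryCatch()` so the script keeps running instead of stopping at the error.

```r
library(dplyr)

# `toy` is a hypothetical four-row data set, used only for illustration
toy <- tibble(id = 1:4)

# Length 1: the single value is recycled to every row
toy_constant <- toy %>% mutate(new_var = 1)

# Same length as the number of rows: one value per row
toy_index <- toy %>% mutate(new_var = 1:4)

# Any other length fails; tryCatch() captures the error instead of stopping
toy_bad <- tryCatch(
  toy %>% mutate(new_var = 1:3),
  error = function(e) "error"
)
```

Printing `toy_constant` and `toy_index` shows the new column in both cases, while `toy_bad` only holds the placeholder string returned by the error handler.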
But generally, we create new columns based on other data in the data set. So let's do a more useful example. For instance, perhaps we want to use the ratio between the bill length and depth as a measurement for a model.
penguins %>% mutate(bill_ratio = bill_length_mm / bill_depth_mm) %>% select(starts_with("bill"))
So, here we have asked for the ratio between bill length and depth to be calculated and stored in a column named bill_ratio. Then we selected just the bill columns to have a peek at the output more directly.
We can do almost anything within a mutate() to get the values as we want them, including using functions that exist in R to transform the data. For instance, perhaps we want to scale the variables of interest to have a mean of 0 and a standard deviation of 1, which is quite common to improve statistical modeling. We can do that with the scale() function.
penguins %>% mutate(bill_ratio = bill_length_mm / bill_depth_mm, bill_length_mm_z = scale(bill_length_mm)) %>% select(starts_with("bill"))
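One detail worth knowing is that scale() returns a one-column matrix rather than a plain vector, which is why the scaled column can print a little differently from the others. The sketch below, on a made-up three-row tibble (`toy` and its column names are assumptions for illustration), shows one possible way to get an ordinary numeric column by wrapping the call in as.numeric().

```r
library(dplyr)

# Hypothetical toy data: three measurements in millimeters
toy <- tibble(length_mm = c(10, 20, 30))

scaled <- toy %>%
  mutate(
    # scale() on its own gives a one-column matrix
    length_z_matrix = scale(length_mm),
    # as.numeric() drops the matrix attributes and leaves a plain vector
    length_z = as.numeric(scale(length_mm))
  )

is.matrix(scaled$length_z_matrix)
is.matrix(scaled$length_z)
```

Both columns hold the same values; whether the plain vector is worth the extra call depends on what you plan to do with the column afterwards.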
Sometimes, we want to assign certain data values based on other variables in the data set. For instance, maybe we want to classify all penguins with body mass above 4.5 kg as "large" while the rest are "normal"?
The if_else() function takes expressions, much like filter(). The first value after the expression is the value assigned if the expression is TRUE, while the second is the value assigned if the expression is FALSE.
penguins %>% mutate(size = if_else(condition = body_mass_g > 4500, true = "large", false = "normal"))
Now we have a column with two values, large and normal, based on whether the penguins are above or below 4.5 kilos.
We can for instance use that in a plot.
penguins %>% mutate(size = if_else(condition = body_mass_g > 4500, true = "large", false = "normal")) %>% ggplot() + geom_jitter(mapping = aes(x = year, y = body_mass_g, colour = size))
That shows us clearly that we have grouped the penguins based on their size. But there is this strange NA in the plot legend. What is that?
In R, missing values are usually given the value NA, which stands for Not Available, i.e., missing data. This is a very special name in R. Like TRUE and FALSE, it is written in capital letters, and RStudio immediately recognizes it and gives it a different color from other values.
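To see where the NA comes from, it can help to try the same condition on a few hand-picked values. The sketch below uses a made-up tibble `toy` (an assumption for illustration) with one missing body mass. It also shows if_else()'s `missing` argument, which is one possible way to give missing rows their own label if that is what you want.

```r
library(dplyr)

# Hypothetical toy data with one missing body mass
toy <- tibble(body_mass_g = c(5000, 3500, NA))

# A comparison with NA is itself NA, neither TRUE nor FALSE
toy$body_mass_g > 4500

sized <- toy %>%
  mutate(
    # The NA condition carries through, so size is NA for that row
    size = if_else(body_mass_g > 4500, true = "large", false = "normal"),
    # `missing` supplies a value to use when the condition is NA
    size_labelled = if_else(body_mass_g > 4500,
                            true = "large",
                            false = "normal",
                            missing = "unknown")
  )
```

Keeping the NA is usually the safer default, since it avoids pretending we know something about a penguin that was never weighed.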
Now we know how to create new variables, and even how to make them if there are conditions on how to add the data.
But, we often want to add several columns of different types, and maybe even add new variables based on other new columns! Oh, it's starting to sound complicated, but it does not have to be!
mutate() evaluates its arguments in sequence. This sounds weird, but it means that each new column is made in the order you write it, so a later column can use one that was created earlier in the same call. So as long as you think about the order of your mutate() creations, you can do that in a single mutate() call.
penguins %>% mutate(bill_ratio = bill_depth_mm / bill_length_mm, bill_type = if_else(condition = bill_ratio < 0.5, true = "elongated", false = "stumped")) %>% select(starts_with("bill"))
Now you've created two variables. One for bill_ratio, and then another one conditional on the values of the bill_ratio.
If you switched the order of these two, R would produce an error, because there would be no bill ratio to create the other column.
penguins %>% mutate(bill_type = if_else(condition = bill_ratio < 0.5, true = "elongated", false = "stumped"), bill_ratio = bill_depth_mm / bill_length_mm) %>% select(starts_with("bill"))
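The effect of ordering can be checked on a small made-up data set. In the sketch below, `toy`, `depth`, and `length` are hypothetical names chosen for illustration, and the failing call is wrapped in `tryCatch()` so the error is captured rather than stopping the script. It assumes there is no object called `ratio` already defined in your R session.

```r
library(dplyr)

# Hypothetical toy data: two bills with the same length but different depth
toy <- tibble(depth = c(2, 6), length = c(10, 10))

# ratio is created first, so type can use it
right_order <- toy %>%
  mutate(
    ratio = depth / length,
    type = if_else(ratio < 0.5, true = "elongated", false = "stumped")
  )

# type is asked for before ratio exists, so mutate() fails
wrong_order <- tryCatch(
  toy %>%
    mutate(
      type = if_else(ratio < 0.5, true = "elongated", false = "stumped"),
      ratio = depth / length
    ),
  error = function(e) "error"
)
```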
But what if we want to categorize based on more than one condition? Nested if_else()?
penguins %>% mutate( bill_ratio = bill_depth_mm / bill_length_mm, bill_type = if_else(condition = bill_ratio < 0.35, true = "elongated", false = if_else(condition = bill_ratio < 0.45, true = "normal", false = "stumped"))) %>% select(starts_with("bill"))
What if you have even more conditionals? Thankfully, {dplyr} has a smarter way of doing this, called case_when(). This function is similar to if_else(), but here you specify, for each condition, which value should be assigned.
On the left you have the logical expression, and on the right of the tilde (~) is the value to be assigned if that expression is TRUE.
penguins %>% mutate( bill_ratio = bill_depth_mm / bill_length_mm, bill_type = case_when(bill_ratio < 0.35 ~ "elongated", bill_ratio < 0.45 ~ "normal", TRUE ~ "stumped")) %>% ggplot(mapping = aes(x = bill_length_mm, y = bill_depth_mm, colour = bill_type)) + geom_point()
That looks almost the same. The NAs are gone! That's not right. We cannot categorize values that are missing. It's our last statement that does this, which just says "make the remainder this value". That is not what we want. We need the NAs to stay NAs.
case_when(), like mutate(), evaluates the expressions in sequence. This is why we can have two statements evaluating the same column with similar expressions (below 0.35 and then below 0.45). All values that are below 0.35 are also below 0.45. But since we first assign everything below 0.35, the second statement only applies to the values that are left, so they do not collide. We can do the same for our last statement, saying that all values that are not NA should be given this category.
penguins %>% mutate( bill_ratio = bill_depth_mm / bill_length_mm, bill_type = case_when(bill_ratio < 0.35 ~ "elongated", bill_ratio < 0.45 ~ "normal", !is.na(bill_ratio) ~ "stumped")) %>% ggplot(mapping = aes(x = bill_length_mm, y = bill_depth_mm, colour = bill_type)) + geom_point()
Here, we use the is.na() function we saw earlier, on bill_ratio. But it also has an ! in front, what does that mean? In R's logical expressions, the ! is a negation specifier. It means it flips the logical so the TRUE becomes FALSE, and vice versa. So here, it means the bill_ratio is not NA.
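The difference between the two final statements is easy to check on a short vector. The sketch below uses a made-up vector `ratio` (an assumption for illustration) with one missing value and calls case_when() directly, outside of mutate().

```r
library(dplyr)

# Hypothetical ratios, the last one missing
ratio <- c(0.30, 0.40, 0.50, NA)

# ! flips each logical value
!c(TRUE, FALSE)

# A final TRUE matches everything that is left, including the NA
catch_all <- case_when(
  ratio < 0.35 ~ "elongated",
  ratio < 0.45 ~ "normal",
  TRUE ~ "stumped"
)

# !is.na(ratio) is FALSE for the NA, so no statement matches and it stays NA
keep_na <- case_when(
  ratio < 0.35 ~ "elongated",
  ratio < 0.45 ~ "normal",
  !is.na(ratio) ~ "stumped"
)
```

Note that 0.30 is only labelled "elongated" even though it is also below 0.45, because the first matching statement wins.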
So far, we've been looking at adding variables one by one. This is of course something we do all the time, but sometimes we need to make the same change to multiple columns at once. Imagine you have a data set with 20 columns and you want to scale them all to the same scale. Writing the same command with different columns 20 times is very tedious. This is where the {dplyr} package truly starts to shine!
In our case, let us say we want to scale the three columns with millimeter measurements so that they have a mean of 0 and a standard deviation of 1. We've already used the scale() function once before, so we will do it again.
In this simple example we might have done so:
penguins %>% mutate(bill_depth_sc = scale(bill_depth_mm), bill_length_sc = scale(bill_length_mm), flipper_length_sc = scale(flipper_length_mm))
It's just three columns, we can do that. But let us imagine we have 20 of these; typing all that out is tedious and error-prone. You might forget to alter the name or to keep the same naming convention. We are only human, we easily make mistakes.
With {dplyr}'s across() we can combine our knowledge of tidy-selectors and mutate() to create the entire transformation for these columns at once.
penguins %>% mutate(across(.cols = ends_with("mm"), .fns = scale))
Whoa! So fast and so simple! Now the three columns are scaled. But oh no! The columns have been overwritten. Rather than creating new ones, we replaced the old ones.
This might be your intention in some instances, or maybe you will just create a new data set with the scaled variables.
penguins_mm_sc <- penguins %>% mutate(across(.cols = ends_with("mm"), .fns = scale))
But often, we'd like to keep the originals and add the new variants. We can do that too, within across()!
penguins %>% mutate(across(.cols = ends_with("mm"), .fns = scale, .names = "{.col}_sc")) %>% select(contains("mm"))
Now they are all there! Neat! But that .names argument is a little weird. What does it really mean?
Inside the .names argument, {.col} stands for the name of the column currently being transformed. We can use this knowledge to tell across() what to name our new columns. In this case, we want to append _sc to the column name.
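The naming pattern can be tried out on a small made-up data set where the result is easy to verify by eye. In the sketch below, `toy`, `wing_mm`, and `tail_mm` are hypothetical names chosen for illustration, and the scaling is wrapped in as.numeric() as one possible way to get plain numeric columns back.

```r
library(dplyr)

# Hypothetical toy data with two millimeter columns and one id column
toy <- tibble(
  id = 1:3,
  wing_mm = c(10, 20, 30),
  tail_mm = c(1, 2, 3)
)

scaled <- toy %>%
  mutate(across(
    .cols = ends_with("mm"),                  # only the millimeter columns
    .fns = function(x) as.numeric(scale(x)),  # scale each selected column
    .names = "{.col}_sc"                      # original name plus _sc
  ))

names(scaled)
```

The original columns are kept, and each selected column gets a scaled partner whose name is built from the pattern. Moving `{.col}` to the end of the pattern, as in `"sc_{.col}"`, would add a prefix instead.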
Room: Break-out
Duration: 10 minutes
1a: Create a column named bill_ld_ratio_log that is the natural logarithm (using the log() function) of bill_length_mm divided by bill_depth_mm.
1b: Create a new column called body_type, where animals below 3 kg are small, animals between 3 and 4.5 kg are normal, and animals larger than 4.5 kg are large. In the same command, create a new column named biscoe whose content should be TRUE if the island is Biscoe and FALSE for everything else.
1c: Transform all the columns with millimetre measurements so they are scaled, and add the prefix sc_ to the column names.
## 1a
penguins %>%
  mutate(bill_ld_ratio_log = log(bill_length_mm / bill_depth_mm))

## 1b
penguins %>%
  mutate(body_type = case_when(body_mass_g < 3000 ~ "small",
                               body_mass_g >= 3000 & body_mass_g < 4500 ~ "normal",
                               body_mass_g >= 4500 ~ "large"),
         biscoe = if_else(condition = island == "Biscoe", true = TRUE, false = FALSE))

## 1c
penguins %>%
  mutate(across(.cols = ends_with("mm"), .fns = scale, .names = "sc_{.col}")) %>%
  select(contains("mm"))
Now we've learned a little about adding and altering variables in data sets using {dplyr}'s mutate() function, both alone and together with across() and group_by().
You should be able to play around with the examples provided and learn more about how things work through trial and error.