library(learnr) library(dplyr) library(gradethis) gradethis_setup() bfi_scored <- psych::scoreVeryFast(keys = psych::bfi.keys, items = psych::bfi, min = 1, max = 6) %>% as.data.frame() %>% tibble::rownames_to_column(var = ".id") bfi_data <- psych::bfi %>% tibble::rownames_to_column(var = ".id") %>% select(.id, gender, education, age) %>% mutate(.id = as.character(.id), gender = factor(recode(gender, "1" = "male", "2" = "female"))) %>% left_join(bfi_scored, by = ".id") rm(bfi_scored)
Today we'll get started with learning to "wrangle" data---that is, to subset it, rearrange it, transform it, summarize it, and otherwise make it ready for analysis. We are going to be working with the dplyr
package. Specifically, we're going to consider three lessons today:
dplyr
syntax%>%
pipe and the dplyr
advantagefilter
; relational/comparison and logical operators in RSpecific dplyr
functions we will cover
select()
arrange()
filter()
mutate()
summarize()
group_by()
grouped mutate()
STAT 545 chapters:
More detail can be found in the r4ds: transform chapter.
Here are some supplementary resources:
Some advanced topics you might find useful:
dplyr
package. dplyr
syntaxHere are the concepts we'll be exploring in this lesson:
tidyverse
dplyr functions:
select()
arrange()
piping
By the end of this lesson, students are expected to be able to:
%>%
) when implementing function chainsR
to dplyr
Self-documenting code
This is where the tidyverse shines.
Example of dplyr
vs base R
:
gapminder[gapminder$country == "Cambodia", c("year", "lifeExp")]
vs.
gapminder %>% filter(country == "Cambodia") %>% select(year, lifeExp)
This: %>%
is called a pipe. You can think of it as like saying the word "then" when reading the code. What it does is take whatever is passed on the left side (before) of the pipe and put it as the first argument to whatever is passed on the right side (after).
Here is an example of the logic of using a pipe:
Check out my morning routine!
leave_house(get_dressed(get_out_of_bed(wake_up(me, time = "8:00"), side = "correct"), pants = TRUE, shirt = TRUE), car = TRUE, bike = FALSE)
vs.
me %>% wake_up(time = "8:00") %>% get_out_of_bed(side = "correct") %>% get_dressed(pants = TRUE, shirt = TRUE) %>% leave_house(car = TRUE, bike = FALSE)
See how using the %>%
can make the code more intuitive and read-able?
%>%
) ExerciseUsing mtcars
print the head()
using %>%
.
mtcars %>% head()
HINT: You can use ctrl
+shift
+m
(or cmd
+shift
+m
) as a shortcut for the %>%
in R Studio (not in these code chunks :-( )
%>%
Now, what if you use a function on the right side (after) the %>%
and the first argument of THAT function isn't for my data?
We can use a period .
as a placeholder for whatever is being passed on the left side.
For example, if I want to pipe %>%
mtcars
into the lm()
function I can use data = .
to do that.
mtcars %>% lm(mpg ~ hp, data = .)
dplyr
first%>%
your data into a plot/analysisWhen it comes to coding, here are some helpful steps:
Do one thing at a time
Chain multiple operations together using the pipe %>%
Use readable object and variable names
Subset a dataset (i.e., select variables) by name, not by "magic numbers"
Note that you need to use the assignment operator <-
to store changes!
select()
?
to find helpNow that you've mastered the logic of the %>%
we are going to look at the select()
function.
We can look up the documentation of code by putting a ?
before the function. Click run code to pull up the help page:
?select
select()
ing some BFI DataSay for example that I am working with the Big-Five Inventory (bfi_data
) data shown below:
head(bfi_data)
and I want to subset my data to only include the .id
,gender
, and neuroticism
columns.
Here is an example of what I would do:
gender_neuro <- bfi_data %>% select(.id, gender, neuroticism) gender_neuro
select()
Woah, woah, woah, wait a minute. I just realized that some of the rows for education
are NA
. I want to remove that column but keep everything else. To do that, I put a "-
" infront of my "column_name
".
non_edu_bfi_data <- bfi_data %>% select(-education) non_edu_bfi_data
select()
question("What do I put infront of my column name to remove it using `select()`?", answer("`+`"), answer("`remove()`"), answer("`-`", correct = TRUE), answer("`!=`"), random_answer_order = TRUE, allow_retry = TRUE )
arrange()
Great job on select()
ing data!
When looking at data sometimes we might want our data in a certain order. This is where arrange()
comes in!
Looking at the bfi_data
again, say I wanted to order my data in the descending order based off of the age
column.
desc_bfi <- bfi_data %>% arrange(desc(age)) desc_bfi
select()
ingNow you try!
From the bfi_data
data frame, select .id
,gender
, and extraversion
and name it gender_extra
gender_extra <- bfi_data %>% select(.id, gender, extraversion)
grade_code()
arrange()
ingarrange()
our bfi_data
to the descending order of the .id
column and call it desc_bfi
.
desc_bfi <- bfi_data %>% arrange(desc(.id))
grade_code()
R
Here are the concepts we'll be exploring in this lesson:
filter()
mutate()
summarize()
group_by()
By the end of this lesson, you will be able to:
filter()
and mutate()
Arithmetic operators allow us to carry out mathematical operations:
| Operator | Description | |:--------:|-------------------------------------------| | + | Add | | - | Subtract | | * | Multiply | | / | Divide | | ^ | Exponent | | %/% | Integer division | | %% | Modulus (remainder from integer division) |
Relational operators allow us to compare values:
| Operator | Description | |:--------:|-------------------------------------------| | < | Less than | | > | Greater than | | <= | Less than or equal to | | >= | Greater than or equal to | | == | Equal to | | != | Not equal to | | %% | Modulus (remainder from integer division) |
Logical operators allow us to carry out boolean operations:
| Operator | Description | |:--------:|--------------------| | ! | Not | | \| | Or (element_wise) | | & | And (element-wise) | | \|\| | Or | | && | And |
|
and ||
is that ||
evaluates only the first element of the two vectors, whereas |
evaluates element-wise.filter()
Great job so far!
Lets talk about filter()
ing. In our data, sometimes we want to subset rows based off of certain criteria.
Looking back at the bfi_data
, say I wanted to look at entries that have openness
scores of greater than or equal to 3:
openness_cut_off <- bfi_data %>% filter(openness >= 3) openness_cut_off
filter()
ing with multiple-criteriaSay I was only interested in looking at ages 20-29.
HINT: the &
operator is used to check if two logical operations are TRUE
age_bfi <- bfi_data %>% filter(age > 20 & age < 29)
grade_code()
%in%
with filter()
Now, there is a special use case when trying to filter out specific variables.
Lets demonstrate using the starwars
data set (from dplyr
)
head(starwars)
I want to select only eye_color
that are blue
and red
. Some might think that the best way to do it is like so:
starwars %>% filter(eye_color == c("blue","red"))
Great! Except, if we look at things under the hood of what R
is doing its not what we hoped.
R
cycles through the c("blue","red")
vector list looking at the entries. In other words it checks if the first row has "blue"
then checks the second row if it has "red"
and so on. This can be an issue.
So to specifically get what we want we use %in%
:
starwars %>% filter(eye_color %in% c("blue","red"))
If we compare the output of both of the code chunks above, you'll see that using %in%
yielded 24 rows while the other method only yielded 15.
filter()
quiz( question("What line of code would I use to `filter()` ages between 20 and 50?", answer("`filter(20 < column > 50`"), answer("`filter(column == 20 - 50)`"), answer("`filter(column > 20 & column < 50)`", correct = TRUE), answer("`filter(column[20:50])`"), random_answer_order = TRUE, allow_retry = TRUE ) )
mutate()
mutate
adds new variables and preserves existing ones to a data frame.
You're probably going to be using it a lot.
Here is the general use case of mutate()
modified_data <- data %>% mutate(new_column = any_operations(existing_variables))
Looking at mtcars
say I wanted to divide wt
by mpg
.
mtcars %>% mutate(wt_mpg = wt / mpg)
We can even use functions:
mtcars %>% mutate(mean_mpg = mean(mpg))
filter()
ingGive it a go!
With bfi_data
filter()
the age
column to only include ages less than 50 and call it age_bfi
.
age_bfi <- bfi_data %>% filter(age < 50)
grade_code()
mutate()
ingUsing mtcars
, create a column called wt_disp
that multiplies the wt
and disp
columns with mutate()
.
mtcars %>% mutate(wt_disp = wt * disp )
grade_code()
summarize()
Like mutate()
, the summarize()
function also creates new columns, but the calculations that make the new columns must reduce down to a single number.
For example, let's compute the mean standard deviation of age
in bfi_data
.
age_bfi <- bfi_data %>% summarize(mu = mean(age), sigma = sd(age)) age_bfi
Notice that all other columns were dropped. This is necessary, because there’s no obvious way to compress the other columns down to a single row. This is unlike mutate(), which keeps all columns, and more like transmute(), which drops all other columns.
As it is, this is hardly useful. (Though it is useful for creating Table 1 in your papers.) But summarizing is more useful in the context of grouping, coming up next.
group_by()
The true power of dplyr
lies in its ability to group a tibble
, with the group_by()
function. As usual, this function takes in a tibble
and returns a (grouped) tibble.
starwars %>% group_by(species, eye_color)
The only thing different from a regular tibble
is the indication of grouping variables above the tibble.
This means that the tibble
is recognized as having “chunks” defined by unique combinations of species
and eye_color
.
Notice that the data frame isn’t rearranged by chunk! The grouping is something stored internally about the grouped tibble.
Now that the tibble
is grouped, operations that you do on a grouped tibble
will be done independently within each chunk, as if no other chunks exist.
summarize()
ingGet the mean and standard deviation of conscientious
in bfi_data
using the summarize()
function.
bfi_data %>% summarize(mu = mean(conscientious), sigma = sd(conscientious))
grade_code()
group_by()
You can also create new variables and group by that variable simultaneously.
Try splitting height
by “small” and “large” using 60 as a threshold:
starwars %>% group_by(small_height = height < 60)
grade_code()
We’ve seen cases of transforming variables using mutate() and summarize(), both with and without group_by(). How can you know what combination to use? Here’s a summary based on one of three types of functions.
| Function type | Explanation | Examples | In dplyr |
|:--------------------:|:-----------------------------------------------------------------------------------------------------------------------------------------------:|:-----------------------------------:|:---------------------------------:|
| Vectorized functions | These take a vector, and operate on each component independently to return a vector of the same length. In other words, they work element-wise. | cos()
, sin()
, log()
, exp()
, round()
| mutate()
|
| Aggregate functions | These take a vector, and return a vector of length 1 | mean()
, sd()
, length()
| summarize()
, esp with group_by()
. |
| Window Functions | these take a vector, and return a vector of the same length that depends on the vector as a whole. | lag()
, rank()
, cumsum()
| mutate()
, esp with group_by()
|
Great! You've made it to the end. Pat yourself on the back!
In your data analysis, you're more than likely going to use a combination of the functions shown in this tutorial.
Lets practice!
Looking at our bfi_data
, I want to get the mean and standard deviation of openness
all 40-49 year old participants grouped by their education.
bfi_data %>% group_by(education) %>% filter(age >= 40 & age <= 49) %>% summarize(mean = mean(openness), sd = sd(openness))
grade_code()
I want to subset female neuroticism
scores with level 3 education.
bfi_data %>% filter(gender == "female") %>% filter(education == 3) %>% select(neuroticism)
grade_code()
Using mtcars
I want to group by cyl
, and find the mean of mpg
, disp
, and wt
for each group of cyl
.
mtcars %>% group_by(cyl) %>% summarize(mean_mpg = mean(mpg), mean_disp = mean(disp), mean_wt = mean(wt))
grade_code()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.