options(htmltools.dir.version = FALSE) # see: https://github.com/yihui/xaringan # install.packages("xaringan") # see: # https://github.com/yihui/xaringan/wiki # https://github.com/gnab/remark/wiki/Markdown options(width=110) options(digits = 4)
.pull-left3[
dplyr is a package for managing dataframes.
Anytime you want to slice, dice, aggregate, or manipulate a dataframe, there is almost certainly a way to do it in dplyr.
knitr::include_graphics("images/dplyr_hex.png")
]
.pull-right3[
Can you calculate the mean survival times for each treatment separated by gender and time?
I need to know the mean birth rate only for countries in Africa from 1980 to 1980.
What percent of female patients had adverse events to drug X during weeks 5 through 10?
https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
knitr::include_graphics("images/data_wrangling_ss.png")
.pull-left3[
dplyr is a combination of 3 things:
objects
like dataframesverbs
that do things to objects.pipes
%>%
that string together objects and verbsknitr::include_graphics("images/pipe.jpg")
]
.pull-right3[
knitr::include_graphics("images/sequential.png")
dplyr is meant to be sequential and work like language
Take data X, then do Y, then do Z...
Here's the basic structure of dplyr
in action
data %>% # Start with data, and THEN VERB1 %>% # Do VERB1, (and THEN) VERB2 %>% ... # Do VERB2, (and THEN)
]
.pull-left3[
knitr::include_graphics("http://tailandfur.com/wp-content/uploads/2016/09/40-Funny-and-Cute-Ideas-Of-Chicken-Pictures-11.jpg")
]
.pull-right3[
# Show me the first few rows head(ChickWeight) #What are the column names again? names(ChickWeight)
]
.pull-left4[
From the ChickWeight dataframe, calculate the mean weight and time for each diet
]
.pull-right4[
library(dplyr) x <- ChickWeight %>% # Start with ChickWeight group_by(Diet) %>% # Group by Diet summarise( # Get ready to summarise.... weight.mean = mean(weight), # Mean weight time.mean = mean(Time), # Mean time N = n() # Number of cases ) x
]
| verb| action| example |
|:---|:----|:----------------|
| filter()
| Select rows based on some criteria| filter(age > 40 & sex == "m")
|
| arrange()
| Sort rows| arrange(date, group)
|
| select()
| Select columns (and ignore all others)| select(age, sex)
|
| rename()
| Rename columns| rename(DATE_MONTHS_X24, date
)|
| mutate()
| Add new columns| mutate(height.m = height.cm / 100)
|
| case_when()
| Recode values of a column| sex.n = case_when(sex == 0 ~ "m", sex == 1 ~ "f")
|
| group_by(), summarise()
| Group data and then calculate summary statistics|group_by(treatment) %>% summarise(...)
|
mutate()
Add a column called weight_d_time
that is weight divided by time
library(dplyr) x <- ChickWeight %>% # Start with the ChickWeight data mutate( # Create new columns... weight_d_time = weight / Time ) head(x) # Print the result
Add a column called weight_d_time
that is weight divided by time AND time_d
that is time in days
x <- ChickWeight %>% # Start with the ChickWeight data mutate( # Create new columns... weight_d_time = weight / Time, # weight_d_time is weight divided by Time time_d = Time * 7 # time_d is Time times 7 ) head(x) # Print the result
case_when()
.pull-left3[
data %>% mutate( var_new = case_when( var_old == OLD_A ~ NEW_A, var_old == OLD_B ~ NEW_B )
For example, in a dataset, the column sex
might be coded with 1s and 0s.
You might want to create a new column sex_new
where 1 = "female" and 0 = "male":
| sex_car| sex_new| |------:|----:| | 1| "female"| | 0| "male"|
]
.pull-right3[
To change the value of 1 to "female"
, and 0 to "male"
, you can use case_when()
:
# Add a column sex_new to data data <- data %>% mutate( sex_char = case_when( sex == 1 ~ "female", sex == 0 ~ "male" ) )
You can think about the code above as follows:
sex_char
wheresex == 1
, then set the value to "female"
sex == 0
, then set the value to "male"
]
.pull-left3[
case_when()
examplesChickWeight
set.seed(104) ChickWeight <- ChickWeight[sample(nrow(ChickWeight)),] head(ChickWeight)
Create a new variable Diet_name which shows Diet in text format. Here is a table of the values
| Diet| Diet_name| |------:|----:| | 1| "fruit"| | 2| "vegetables"| | 3| "meat"| | 4| "grains"|
]
.pull-right3[
ChickWeight <- ChickWeight %>% # Start with the ChickWeight data mutate( Diet_name = case_when( Diet == 1 ~ "fruit", Diet == 2 ~ "vegetables", Diet == 3 ~ "meat", Diet == 4 ~ "grains" ) ) head(ChickWeight) # Print final result!
]
.pull-left3[
Two of the most powerful dplyr verbs are group_by()
and summarise()
.
When you combine these two, you can easily calculate summary statistics across groups of data
Q1: What is the average beard length of pirates on the Jolly Roger and the Slippery Seal?
Q2: How many patients were on each treatment arm, and for each arm, what was the mean success rate for the primary measure
full_recovery
and the secondary measurefeelbetter
?
]
.pull-right3[
q1 <- pirates %>% group_by(ship) %>% summarise( beard_mean = mean(beardlength) ) q2 <- pirates %>% group_by(arm) %>% summarise( N = n(), primary_success = mean(full_recovery), secondary_success = mean(feelbetter) )
]
For each Diet, calculate the mean weight
ChickWeight %>% # Start with the ChickWeight data group_by(Diet) %>% # Group the data by Diet summarise( # Now summarise.... weight.mean = mean(weight) # Mean weight )
For each time period less than 10, calculate the mean weight
ChickWeight %>% # Start with the ChickWeight data filter(Time < 10) %>% # Only Time periods less than 10 group_by(Time) %>% # Group the data by Diet summarise( # Now summarise.... weight.mean = mean(weight) # Mean weight )
For each Diet, calculate the mean weight, maximum time, and the number of chicks on each diet:
ChickWeight %>% # Start with the ChickWeight data group_by(Diet) %>% # Group the data by Diet summarise( # Now summarise.... weight.mean = mean(weight), # Mean weight time.max = max(Time), # Max time N = n() # Number of observations )
| verb| action| example |
|:---|:----|:----------------|
| sample_n()
| Select a random sample of n rows| sample_n(10)
|
| sample_frac()
| Select a random fraction of rows| sample_frac(.20)
|
| first(), last()
| Give the first (or last) observation| first(), last()
|
sample_n()
Give me a random sample of 10 rows from the ChickWeight dataframe, but only show me the values for Chick and weight
# Give me a random sample of 10 rows, but only show me columns Chick and weight ChickWeight %>% select(Chick, weight) %>% sample_n(10)
.pull-left3[
dplyr operations (almost) always return a dataframe which you can assign to a new object:
Send me a text file with the average weight for each time period and nothing else!!
]
.pull-right3[
# Create a new object called time_agg time_agg <- ChickWeight %>% group_by(Time) %>% summarise( weight.mean = mean(weight) ) head(time_agg) # Make sure it looks good # save the aggregated data as a .csv file for my colleague #write_csv(time_agg, path = "data/time_agg.csv")
]
.pull-left3[
dplyr is great for elegantly performing sequential operations on data.
The 'pipe' operator %>%
helps you string multiple objects (like dataframes) and verbs (summarise, order, aggregate...) together.
knitr::include_graphics("images/dplyr_hex.png")
]
.pull-right3[
Basic structure of dplyr commands:
data %>% # Start with data, AND THEN... VERB1 %>% # Do VERB1, AND THEN... VERB2 %>% # Do VERB2, AND THEN... VERB3 %>% # Do VERB3, AND THEN... group_by(x, y) %>% # Group by variables x, y summarise( VAR_A_New = fun(X), VAR_B_New = fun(Y) ) )
]
.pull-left3[
tidyr
is a package for 'tidying' data.
'tidying' data means getting it into a different format (usually moving data between rows and columns)
knitr::include_graphics("images/tidyr_hex.png")
]
.pull-right3[
Right now there are different rows for each patient, I need every patient's data to be in a single row.
Can you restructure the data so data in different columns are shown in a single column?
]
There are two important functions in tidyr
, gather()
and spread()
:
.pull-left3[
gather()
: Move data from columns into many rows
knitr::include_graphics("images/gather_ss.png")
]
.pull-right3[
spread()
: Move data from several rows into columns
knitr::include_graphics("images/spread_ss.png")
]
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.