options(htmltools.dir.version = FALSE)
# see: https://github.com/yihui/xaringan
# install.packages("xaringan")
# see: 
# https://github.com/yihui/xaringan/wiki
# https://github.com/gnab/remark/wiki/Markdown
options(width=110)
options(digits = 4)

dplyr

.pull-left3[

dplyr is a package for managing dataframes.

Anytime you want to slice, dice, aggregate, or manipulate a dataframe, there is almost certainly a way to do it in dplyr.

knitr::include_graphics("images/dplyr_hex.png")

]

.pull-right3[

Questions you can answer with dplyr

Can you calculate the mean survival times for each treatment separated by gender and time?

I need to know the mean birth rate only for countries in Africa from 1980 to 1980.

What percent of female patients had adverse events to drug X during weeks 5 through 10?

]

dplyr CheatSheet!

https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

knitr::include_graphics("images/data_wrangling_ss.png")

.pull-left3[

dplyr

dplyr is a combination of 3 things:

  1. objects like dataframes
  2. verbs that do things to objects.
  3. pipes %>% that string together objects and verbs


knitr::include_graphics("images/pipe.jpg")

]

.pull-right3[




knitr::include_graphics("images/sequential.png")

dplyr is meant to be sequential and work like language

Take data X, then do Y, then do Z...

Here's the basic structure of dplyr in action

data %>%           # Start with data, and THEN
     VERB1 %>%     # Do VERB1, (and THEN)
     VERB2 %>% ... # Do VERB2, (and THEN)

]


ChickWeight dataframe

.pull-left3[

knitr::include_graphics("http://tailandfur.com/wp-content/uploads/2016/09/40-Funny-and-Cute-Ideas-Of-Chicken-Pictures-11.jpg")

]

.pull-right3[

# Show me the first few rows
head(ChickWeight)

#What are the column names again?
names(ChickWeight)

]


.pull-left4[

dplyr

Question:

From the ChickWeight dataframe, calculate the mean weight and time for each diet

]

.pull-right4[

Answer:

library(dplyr)

x <- ChickWeight %>%  # Start with ChickWeight
  group_by(Diet) %>%  # Group by Diet
  summarise(          # Get ready to summarise....
    weight.mean = mean(weight), # Mean weight
    time.mean = mean(Time),     # Mean time
    N = n()                     # Number of cases
  )


x

]


Common dplyr verbs

| verb| action| example | |:---|:----|:----------------| | filter()| Select rows based on some criteria| filter(age > 40 & sex == "m")| | arrange()| Sort rows| arrange(date, group)| | select()| Select columns (and ignore all others)| select(age, sex)| | rename()| Rename columns| rename(DATE_MONTHS_X24, date)| | mutate()| Add new columns| mutate(height.m = height.cm / 100)| | case_when()| Recode values of a column| sex.n = case_when(sex == 0 ~ "m", sex == 1 ~ "f")| | group_by(), summarise()| Group data and then calculate summary statistics|group_by(treatment) %>% summarise(...) |


To add a column to a dataframe, used mutate()

Add a column called weight_d_time that is weight divided by time

library(dplyr)

x <- ChickWeight %>%          # Start with the ChickWeight data
      mutate(                 # Create new columns...
            weight_d_time = weight / Time
            )

head(x)   # Print the result

You can create many new columns sequentially

Add a column called weight_d_time that is weight divided by time AND time_d that is time in days

x <- ChickWeight %>%          # Start with the ChickWeight data
      mutate(                 # Create new columns...
            weight_d_time = weight / Time,  # weight_d_time is weight divided by Time
            time_d = Time * 7               # time_d is Time times 7
            )

head(x)   # Print the result

To recode values, use case_when()

.pull-left3[

data %>%
  mutate(
  var_new = case_when(
    var_old == OLD_A ~ NEW_A,
    var_old == OLD_B ~ NEW_B
  )

For example, in a dataset, the column sex might be coded with 1s and 0s.

You might want to create a new column sex_new where 1 = "female" and 0 = "male":

| sex_car| sex_new| |------:|----:| | 1| "female"| | 0| "male"|

]

.pull-right3[

To change the value of 1 to "female", and 0 to "male", you can use case_when():

# Add a column sex_new to data 

data <- data %>% 
  mutate(
        sex_char = case_when(
          sex == 1 ~ "female",
          sex == 0 ~ "male"
        )
  )

You can think about the code above as follows:

]


.pull-left3[

More case_when() examples

ChickWeight

set.seed(104)
ChickWeight <- ChickWeight[sample(nrow(ChickWeight)),]
head(ChickWeight)

Create a new variable Diet_name which shows Diet in text format. Here is a table of the values

| Diet| Diet_name| |------:|----:| | 1| "fruit"| | 2| "vegetables"| | 3| "meat"| | 4| "grains"|

]

.pull-right3[

ChickWeight <- ChickWeight %>%           # Start with the ChickWeight data
                 mutate(
                   Diet_name = case_when(
                     Diet == 1 ~ "fruit",
                     Diet == 2 ~ "vegetables",
                     Diet == 3 ~ "meat",
                     Diet == 4 ~ "grains"
                   )

 )

head(ChickWeight)   # Print final result!

]


group_by(), summarise()

.pull-left3[

Two of the most powerful dplyr verbs are group_by() and summarise().

When you combine these two, you can easily calculate summary statistics across groups of data

Q1: What is the average beard length of pirates on the Jolly Roger and the Slippery Seal?

Q2: How many patients were on each treatment arm, and for each arm, what was the mean success rate for the primary measure full_recovery and the secondary measure feelbetter?

]

.pull-right3[

q1 <- pirates %>%
  group_by(ship) %>%
  summarise(
    beard_mean = mean(beardlength)
  )


q2 <- pirates %>%
  group_by(arm) %>%
  summarise(
    N = n(),
    primary_success = mean(full_recovery),
    secondary_success = mean(feelbetter)
  )

]


Ex: Chick 3

For each Diet, calculate the mean weight

ChickWeight %>%           # Start with the ChickWeight data
  group_by(Diet) %>%      # Group the data by Diet
  summarise(              # Now summarise....
    weight.mean = mean(weight) # Mean weight
  )

You can easily filter data before grouping and summarising

For each time period less than 10, calculate the mean weight

ChickWeight %>%                # Start with the ChickWeight data
  filter(Time < 10) %>%        # Only Time periods less than 10
  group_by(Time) %>%           # Group the data by Diet
  summarise(                   # Now summarise....
    weight.mean = mean(weight) # Mean weight
  )

You can calculate as many summary columns as you want!

For each Diet, calculate the mean weight, maximum time, and the number of chicks on each diet:

ChickWeight %>%           # Start with the ChickWeight data
  group_by(Diet) %>%      # Group the data by Diet
  summarise(              # Now summarise....
    weight.mean = mean(weight), # Mean weight
    time.max = max(Time),       # Max time
    N = n()                     # Number of observations
  )

Other dplyr verbs

| verb| action| example | |:---|:----|:----------------| | sample_n()| Select a random sample of n rows| sample_n(10)| | sample_frac()| Select a random fraction of rows| sample_frac(.20)| | first(), last()| Give the first (or last) observation| first(), last()|


Getting random values from a dataframe with sample_n()

Give me a random sample of 10 rows from the ChickWeight dataframe, but only show me the values for Chick and weight

# Give me a random sample of 10 rows, but only show me columns Chick and weight

ChickWeight %>% 
  select(Chick, weight) %>%
  sample_n(10)

.pull-left3[

dplyr

dplyr operations (almost) always return a dataframe which you can assign to a new object:

Send me a text file with the average weight for each time period and nothing else!!

]

.pull-right3[

# Create a new object called time_agg

time_agg <- ChickWeight %>%
  group_by(Time) %>%
  summarise(
    weight.mean = mean(weight)
  )

head(time_agg) # Make sure it looks good

# save the aggregated data as a .csv file for my colleague
#write_csv(time_agg, path = "data/time_agg.csv")

]


.pull-left3[

dplyr summary

dplyr is great for elegantly performing sequential operations on data.

The 'pipe' operator %>% helps you string multiple objects (like dataframes) and verbs (summarise, order, aggregate...) together.

knitr::include_graphics("images/dplyr_hex.png")

]

.pull-right3[



Basic structure of dplyr commands:

data %>%    # Start with data, AND THEN...
  VERB1 %>% # Do VERB1, AND THEN...
  VERB2 %>% # Do VERB2, AND THEN...
  VERB3 %>% # Do VERB3, AND THEN...
  group_by(x, y) %>%  # Group by variables x, y
    summarise(
      VAR_A_New = fun(X),
      VAR_B_New = fun(Y)
    )
  )

Questions?

]


.pull-left3[

tidyr

tidyr is a package for 'tidying' data.

'tidying' data means getting it into a different format (usually moving data between rows and columns)

knitr::include_graphics("images/tidyr_hex.png")

]

.pull-right3[



Questions you can answer with tidyr

Right now there are different rows for each patient, I need every patient's data to be in a single row.

Can you restructure the data so data in different columns are shown in a single column?

]


tidyr

There are two important functions in tidyr, gather() and spread():

.pull-left3[

gather(): Move data from columns into many rows

knitr::include_graphics("images/gather_ss.png")

]

.pull-right3[

spread(): Move data from several rows into columns

knitr::include_graphics("images/spread_ss.png")

]



therbootcamp/BaselRBootcamp2017 documentation built on May 3, 2019, 10:45 p.m.