inst/book/07-dplyr.md

Data Wrangling {#dplyr}

A cute golden retriever labelled 'iris & mtcars' and a scary werewolf labelled 'Real world data'

Learning Objectives {#ilo-dplyr}

Basic {-}

  1. Be able to use the 6 main dplyr one-table verbs: (video){class="video"}
  2. Be able to wrangle data by chaining tidyr and dplyr functions (video){class="video"}
  3. Be able to use these additional one-table verbs: (video){class="video"}

Intermediate {-}

  1. Fine control of select() operations (video){class="video"}
  2. Use window functions (video){class="video"}

Setup {#setup-dplyr}

# libraries needed for these examples
library(tidyverse)
library(lubridate)
library(reprores)
set.seed(8675309) # makes sure random numbers are reproducible

The disgust dataset {#data-disgust}

These examples will use data from reprores::disgust, which contains data from the Three Domain Disgust Scale. Each participant is identified by a unique user_id and each questionnaire completion has a unique id. Look at the Help for this dataset to see the individual questions.

data("disgust", package = "reprores")

#disgust <- read_csv("https://psyteachr.github.io/reprores/data/disgust.csv")

Six main dplyr verbs

Most of the data wrangling you'll want to do with psychological data will involve the tidyr functions you learned in Chapter 4 and the six main dplyr verbs: select, filter, arrange, mutate, summarise, and group_by.

select() {#select}

Select columns by name or number.

You can select each column individually, separated by commas (e.g., col1, col2). You can also select all columns between two columns by separating them with a colon (e.g., start_col:end_col).

moral <- disgust %>% select(user_id, moral1:moral7)
names(moral)
## [1] "user_id" "moral1"  "moral2"  "moral3"  "moral4"  "moral5"  "moral6" 
## [8] "moral7"

You can select columns by number, which is useful when the column names are long or complicated.

sexual <- disgust %>% select(2, 11:17)
names(sexual)
## [1] "user_id" "sexual1" "sexual2" "sexual3" "sexual4" "sexual5" "sexual6"
## [8] "sexual7"

You can use a minus symbol to unselect columns, leaving all of the other columns. If you want to exclude a span of columns, put parentheses around the span first (e.g., -(moral1:moral7), not -moral1:moral7).

pathogen <- disgust %>% select(-id, -date, -(moral1:sexual7))
names(pathogen)
## [1] "user_id"   "pathogen1" "pathogen2" "pathogen3" "pathogen4" "pathogen5"
## [7] "pathogen6" "pathogen7"

Select helpers {#select_helpers}

You can select columns based on criteria about the column names.

starts_with() {#starts_with}

Select columns that start with a character string.

u <- disgust %>% select(starts_with("u"))
names(u)
## [1] "user_id"
ends_with() {#ends_with}

Select columns that end with a character string.

firstq <- disgust %>% select(ends_with("1"))
names(firstq)
## [1] "moral1"    "sexual1"   "pathogen1"
contains() {#contains}

Select columns that contain a character string.

pathogen <- disgust %>% select(contains("pathogen"))
names(pathogen)
## [1] "pathogen1" "pathogen2" "pathogen3" "pathogen4" "pathogen5" "pathogen6"
## [7] "pathogen7"
num_range() {#num_range}

Select columns with a name that matches the pattern prefix.

moral2_4 <- disgust %>% select(num_range("moral", 2:4))
names(moral2_4)
## [1] "moral2" "moral3" "moral4"

::: {.info data-latex=""} Use width to set the number of digits with leading zeros. For example, num_range('var_', 8:10, width=2) selects columns var_08, var_09, and var_10. :::

filter() {#filter}

Select rows by matching column criteria.

Select all rows where the user_id is 1 (that's Lisa).

disgust %>% filter(user_id == 1)
id user_id date moral1 moral2 moral3 moral4 moral5 moral6 moral7 sexual1 sexual2 sexual3 sexual4 sexual5 sexual6 sexual7 pathogen1 pathogen2 pathogen3 pathogen4 pathogen5 pathogen6 pathogen7 1 1 2008-07-10 2 2 1 2 1 1 1 3 1 1 2 1 2 2 3 2 3 3 2 3 3

::: {.warning data-latex=""} Remember to use == and not = to check if two things are equivalent. A single = assigns the righthand value to the lefthand variable and (usually) evaluates to TRUE. :::

You can select on multiple criteria by separating them with commas.

amoral <- disgust %>% filter(
  moral1 == 0, 
  moral2 == 0,
  moral3 == 0, 
  moral4 == 0,
  moral5 == 0,
  moral6 == 0,
  moral7 == 0
)

You can use the symbols &, |, and ! to mean "and", "or", and "not". You can also use other operators to make equations.

# everyone who chose either 0 or 7 for question moral1
moral_extremes <- disgust %>% 
  filter(moral1 == 0 | moral1 == 7)

# everyone who chose the same answer for all moral questions
moral_consistent <- disgust %>% 
  filter(
    moral2 == moral1 & 
    moral3 == moral1 & 
    moral4 == moral1 &
    moral5 == moral1 &
    moral6 == moral1 &
    moral7 == moral1
  )

# everyone who did not answer 7 for all 7 moral questions
moral_no_ceiling <- disgust %>%
  filter(moral1+moral2+moral3+moral4+moral5+moral6+moral7 != 7*7)

Match operator (%in%) {#match-operator}

Sometimes you need to exclude some participant IDs for reasons that can't be described in code. The match operator (%in%) is useful here for testing if a column value is in a list. Surround the equation with parentheses and put ! in front to test that a value is not in the list.

no_researchers <- disgust %>%
  filter(!(user_id %in% c(1,2)))

Dates {#dates}

You can use the lubridate package to work with dates. For example, you can use the year() function to return just the year from the date column and then select only data collected in 2010.

disgust2010 <- disgust %>%
  filter(year(date) == 2010)
(\#tab:dates-year)Rows 1-6 from `disgust2010` id user_id date moral1 moral2 moral3 moral4 moral5 moral6 moral7 sexual1 sexual2 sexual3 sexual4 sexual5 sexual6 sexual7 pathogen1 pathogen2 pathogen3 pathogen4 pathogen5 pathogen6 pathogen7 6902 5469 2010-12-06 0 1 3 4 1 0 1 3 5 2 4 6 6 5 5 2 4 4 2 2 6 6158 6066 2010-04-18 4 5 6 5 5 4 4 3 0 1 6 3 5 3 6 5 5 5 5 5 5 6362 7129 2010-06-09 4 4 4 4 3 3 2 4 2 1 3 2 3 6 5 2 0 4 5 5 4 6302 39318 2010-05-20 2 4 1 4 5 6 0 1 0 0 1 0 0 1 3 2 3 2 3 2 4 5429 43029 2010-01-02 1 1 1 3 6 4 2 2 0 1 4 6 6 6 4 6 6 6 6 6 4 6732 71955 2010-10-15 2 5 3 6 3 2 5 4 3 3 6 6 6 5 4 2 6 5 6 6 3

Or select data from at least 5 years ago. You can use the range function to check the minimum and maximum dates in the resulting dataset.

disgust_5ago <- disgust %>%
  filter(date < today() - dyears(5))

range(disgust_5ago$date)
## [1] "2008-07-10" "2017-04-04"

arrange() {#arrange}

Sort your dataset using arrange(). You will find yourself needing to sort data in R much less than you do in Excel, since you don't need to have rows next to each other in order to, for example, calculate group means. But arrange() can be useful when preparing data from display in tables.

disgust_order <- disgust %>%
  arrange(date, moral1)
(\#tab:arrange)Rows 1-6 from `disgust_order` id user_id date moral1 moral2 moral3 moral4 moral5 moral6 moral7 sexual1 sexual2 sexual3 sexual4 sexual5 sexual6 sexual7 pathogen1 pathogen2 pathogen3 pathogen4 pathogen5 pathogen6 pathogen7 1 1 2008-07-10 2 2 1 2 1 1 1 3 1 1 2 1 2 2 3 2 3 3 2 3 3 3 155324 2008-07-11 2 4 3 5 2 1 4 1 0 1 2 2 6 1 4 3 1 0 4 4 2 6 155386 2008-07-12 2 4 0 4 0 0 0 6 0 0 6 4 4 6 4 5 5 1 6 4 2 7 155409 2008-07-12 4 5 5 4 5 1 5 3 0 1 5 2 0 0 5 5 3 4 4 2 6 4 155366 2008-07-12 6 6 6 3 6 6 6 0 0 0 0 0 0 3 4 4 5 5 4 6 0 5 155370 2008-07-12 6 6 4 6 6 6 6 2 6 4 3 6 6 6 6 6 6 2 4 4 6

Reverse the order using desc()

disgust_order_desc <- disgust %>%
  arrange(desc(date))
(\#tab:arrange-desc)Rows 1-6 from `disgust_order_desc` id user_id date moral1 moral2 moral3 moral4 moral5 moral6 moral7 sexual1 sexual2 sexual3 sexual4 sexual5 sexual6 sexual7 pathogen1 pathogen2 pathogen3 pathogen4 pathogen5 pathogen6 pathogen7 39456 356866 2017-08-21 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 39447 128727 2017-08-13 2 4 1 2 2 5 3 0 0 1 0 0 2 1 2 0 2 1 1 1 1 39371 152955 2017-06-13 6 6 3 6 6 6 6 1 0 0 2 1 4 4 5 0 5 4 3 6 3 39342 48303 2017-05-22 4 5 4 4 6 4 5 2 1 4 1 1 3 1 5 5 4 4 4 4 5 39159 151633 2017-04-04 4 5 6 5 3 6 2 6 4 0 4 0 3 6 4 4 6 6 6 6 4 38942 370464 2017-02-01 1 5 0 6 5 5 5 0 0 0 0 0 0 0 5 0 3 3 1 6 3

mutate() {#mutate}

Add new columns. This is one of the most useful functions in the tidyverse.

Refer to other columns by their names (unquoted). You can add more than one column in the same mutate function, just separate the columns with a comma. Once you make a new column, you can use it in further column definitions e.g., total below).

disgust_total <- disgust %>%
  mutate(
    pathogen = pathogen1 + pathogen2 + pathogen3 + pathogen4 + pathogen5 + pathogen6 + pathogen7,
    moral = moral1 + moral2 + moral3 + moral4 + moral5 + moral6 + moral7,
    sexual = sexual1 + sexual2 + sexual3 + sexual4 + sexual5 + sexual6 + sexual7,
    total = pathogen + moral + sexual,
    user_id = paste0("U", user_id)
  )
(\#tab:mutate)Rows 1-6 from `disgust_total` id user_id date moral1 moral2 moral3 moral4 moral5 moral6 moral7 sexual1 sexual2 sexual3 sexual4 sexual5 sexual6 sexual7 pathogen1 pathogen2 pathogen3 pathogen4 pathogen5 pathogen6 pathogen7 pathogen moral sexual total 1199 U0 2008-10-07 5 6 4 6 5 5 6 4 0 1 0 1 4 5 6 1 6 5 4 5 6 33 37 15 85 1 U1 2008-07-10 2 2 1 2 1 1 1 3 1 1 2 1 2 2 3 2 3 3 2 3 3 19 10 12 41 1599 U2 2008-10-27 1 1 1 1 NA NA 1 1 NA 1 NA 1 NA NA NA NA 1 NA NA NA NA NA NA NA NA 13332 U2118 2012-01-02 0 1 1 1 1 2 1 4 3 0 6 0 3 5 5 6 4 6 5 5 4 35 7 21 63 23 U2311 2008-07-15 4 4 4 4 4 4 4 2 1 2 1 1 1 5 5 5 4 4 5 4 3 30 28 13 71 1160 U3630 2008-10-06 1 5 NA 5 5 5 1 0 5 0 2 0 1 0 6 3 1 1 3 1 0 15 NA 8 NA

::: {.warning data-latex=""} You can overwrite a column by giving a new column the same name as the old column (see user_id) above. Make sure that you mean to do this and that you aren't trying to use the old column value after you redefine it. :::

summarise() {#summarise}

Create summary statistics for the dataset. Check the Data Wrangling Cheat Sheet or the Data Transformation Cheat Sheet for various summary functions. Some common ones are: mean(), sd(), n(), sum(), and quantile().

disgust_summary<- disgust_total %>%
  summarise(
    n = n(),
    q25 = quantile(total, .25, na.rm = TRUE),
    q50 = quantile(total, .50, na.rm = TRUE),
    q75 = quantile(total, .75, na.rm = TRUE),
    avg_total = mean(total, na.rm = TRUE),
    sd_total  = sd(total, na.rm = TRUE),
    min_total = min(total, na.rm = TRUE),
    max_total = max(total, na.rm = TRUE)
  )
(\#tab:summarise)All rows from `disgust_summary` n q25 q50 q75 avg_total sd_total min_total max_total 20000 59 71 83 70.6868 18.24253 0 126

group_by() {#group_by}

Create subsets of the data. You can use this to create summaries, like the mean value for all of your experimental groups.

Here, we'll use mutate to create a new column called year, group by year, and calculate the average scores.

disgust_groups <- disgust_total %>%
  mutate(year = year(date)) %>%
  group_by(year) %>%
  summarise(
    n = n(),
    avg_total = mean(total, na.rm = TRUE),
    sd_total  = sd(total, na.rm = TRUE),
    min_total = min(total, na.rm = TRUE),
    max_total = max(total, na.rm = TRUE),
    .groups = "drop"
  )
(\#tab:group-by)All rows from `disgust_groups` year n avg_total sd_total min_total max_total 2008 2578 70.29975 18.46251 0 126 2009 2580 69.74481 18.61959 3 126 2010 1514 70.59238 18.86846 6 126 2011 6046 71.34425 17.79446 0 126 2012 5938 70.42530 18.35782 0 126 2013 1251 71.59574 17.61375 0 126 2014 58 70.46296 17.23502 19 113 2015 21 74.26316 16.89787 43 107 2016 8 67.87500 32.62531 0 110 2017 6 57.16667 27.93862 21 90

::: {.warning data-latex=""} If you don't add .groups = "drop" at the end of the summarise() function, you will get the following message: "summarise() ungrouping output (override with .groups argument)". This just reminds you that the groups are still in effect and any further functions will also be grouped.

Older versions of dplyr didn't do this, so older code will generate this warning if you run it with newer version of dplyr. Older code might ungroup() after summarise() to indicate that groupings should be dropped. The default behaviour is usually correct, so you don't need to worry, but it's best to explicitly set .groups in a summarise() function after group_by() if you want to "keep" or "drop" the groupings. :::

You can use filter after group_by. The following example returns the lowest total score from each year (i.e., the row where the rank() of the value in the column total is equivalent to 1).

disgust_lowest <- disgust_total %>%
  mutate(year = year(date)) %>%
  select(user_id, year, total) %>%
  group_by(year) %>%
  filter(rank(total) == 1) %>%
  arrange(year)
(\#tab:group-by-filter)All rows from `disgust_lowest` user_id year total U236585 2009 3 U292359 2010 6 U245384 2013 0 U206293 2014 19 U407089 2015 43 U453237 2016 0 U356866 2017 21

You can also use mutate after group_by. The following example calculates subject-mean-centered scores by grouping the scores by user_id and then subtracting the group-specific mean from each score. Note the use of gather to tidy the data into a long format first.

disgust_smc <- disgust %>%
  gather("question", "score", moral1:pathogen7) %>%
  group_by(user_id) %>%
  mutate(score_smc = score - mean(score, na.rm = TRUE)) %>% 
  ungroup()

::: {.warning data-latex=""} Use ungroup() as soon as you are done with grouped functions, otherwise the data table will still be grouped when you use it in the future. :::

(\#tab:group-by-mutate)Rows 1-6 from `disgust_smc` id user_id date question score score_smc 1199 0 2008-10-07 moral1 5 0.9523810 1 1 2008-07-10 moral1 2 0.0476190 1599 2 2008-10-27 moral1 1 0.0000000 13332 2118 2012-01-02 moral1 0 -3.0000000 23 2311 2008-07-15 moral1 4 0.6190476 1160 3630 2008-10-06 moral1 1 -1.2500000

All Together {#all-together}

A lot of what we did above would be easier if the data were tidy, so let's do that first. Then we can use group_by to calculate the domain scores.

After that, we can spread out the 3 domains, calculate the total score, remove any rows with a missing (NA) total, and calculate mean values by year.

disgust_tidy <- reprores::disgust %>%
  gather("question", "score", moral1:pathogen7) %>%
  separate(question, c("domain","q_num"), sep = -1) %>%
  group_by(id, user_id, date, domain) %>%
  summarise(score = mean(score), .groups = "drop")
(\#tab:all-tidy)Rows 1-6 from `disgust_tidy` id user_id date domain score 1 1 2008-07-10 moral 1.428571 1 1 2008-07-10 pathogen 2.714286 1 1 2008-07-10 sexual 1.714286 3 155324 2008-07-11 moral 3.000000 3 155324 2008-07-11 pathogen 2.571429 3 155324 2008-07-11 sexual 1.857143
disgust_scored <- disgust_tidy %>%
  spread(domain, score) %>%
  mutate(
    total = moral + sexual + pathogen,
    year = year(date)
  ) %>%
  filter(!is.na(total)) %>%
  arrange(user_id) 
(\#tab:all-scored)Rows 1-6 from `disgust_scored` id user_id date moral pathogen sexual total year 1199 0 2008-10-07 5.285714 4.714286 2.142857 12.142857 2008 1 1 2008-07-10 1.428571 2.714286 1.714286 5.857143 2008 13332 2118 2012-01-02 1.000000 5.000000 3.000000 9.000000 2012 23 2311 2008-07-15 4.000000 4.285714 1.857143 10.142857 2008 7980 4458 2011-09-05 3.428571 3.571429 3.000000 10.000000 2011 552 4651 2008-08-23 3.857143 4.857143 4.285714 13.000000 2008
disgust_summarised <- disgust_scored %>%
  group_by(year) %>%
  summarise(
    n = n(),
    avg_pathogen = mean(pathogen),
    avg_moral = mean(moral),
    avg_sexual = mean(sexual),
    first_user = first(user_id),
    last_user = last(user_id),
    .groups = "drop"
  )
(\#tab:all-summarised)Rows 1-6 from `disgust_summarised` year n avg_pathogen avg_moral avg_sexual first_user last_user 2008 2392 3.697265 3.806259 2.539298 0 188708 2009 2410 3.674333 3.760937 2.528275 6093 251959 2010 1418 3.731412 3.843139 2.510075 5469 319641 2011 5586 3.756918 3.806506 2.628612 4458 406569 2012 5375 3.740465 3.774591 2.545701 2118 458194 2013 1222 3.771920 3.906944 2.549100 7646 462428 2014 54 3.759259 4.000000 2.306878 11090 461307 2015 19 3.781955 4.451128 2.375940 102699 460283 2016 8 3.696429 3.625000 2.375000 4976 453237 2017 6 3.071429 3.690476 1.404762 48303 370464

Additional dplyr one-table verbs

Use the code examples below and the help pages to figure out what the following one-table verbs do. Most have pretty self-explanatory names.

rename() {#rename}

You can rename columns with rename(). Set the argument name to the new name, and the value to the old name. You need to put a name in quotes or backticks if it doesn't follow the rules for a good variable name (contains only letter, numbers, underscores, and full stops; and doesn't start with a number).

sw <- starwars %>%
  rename(Name = name,
         Height = height,
         Mass = mass,
         `Hair Colour` = hair_color,
         `Skin Colour` = skin_color,
         `Eye Colour` = eye_color,
         `Birth Year` = birth_year)

names(sw)
##  [1] "Name"        "Height"      "Mass"        "Hair Colour" "Skin Colour"
##  [6] "Eye Colour"  "Birth Year"  "sex"         "gender"      "homeworld"  
## [11] "species"     "films"       "vehicles"    "starships"

::: {.try data-latex=""} Almost everyone gets confused at some point with rename() and tries to put the original names on the left and the new names on the right. Try it and see what the error message looks like. :::

distinct() {#distinct}

Get rid of exactly duplicate rows with distinct(). This can be helpful if, for example, you are merging data from multiple computers and some of the data got copied from one computer to another, creating duplicate rows.

# create a data table with duplicated values
dupes <- tibble(
  id = c( 1,   2,   1,   2,   1,   2),
  dv = c("A", "B", "C", "D", "A", "B")
)

distinct(dupes)
id dv 1 A 2 B 1 C 2 D

count() {#count}

The function count() is a quick shortcut for the common combination of group_by() and summarise() used to count the number of rows per group.

starwars %>%
  group_by(sex) %>%
  summarise(n = n(), .groups = "drop")
sex n female 16 hermaphroditic 1 male 60 none 6 NA 4
count(starwars, sex)
sex n female 16 hermaphroditic 1 male 60 none 6 NA 4

slice() {#slice}

slice(starwars, 1:3, 10)
name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films vehicles starships Luke Skywalker 172 77 blond fair blue 19 male masculine Tatooine Human The Empire Strikes Back, Revenge of the Sith , Return of the Jedi , A New Hope , The Force Awakens Snowspeeder , Imperial Speeder Bike X-wing , Imperial shuttle C-3PO 167 75 NA gold yellow 112 none masculine Tatooine Droid The Empire Strikes Back, Attack of the Clones , The Phantom Menace , Revenge of the Sith , Return of the Jedi , A New Hope R2-D2 96 32 NA white, blue red 33 none masculine Naboo Droid The Empire Strikes Back, Attack of the Clones , The Phantom Menace , Revenge of the Sith , Return of the Jedi , A New Hope , The Force Awakens Obi-Wan Kenobi 182 77 auburn, white fair blue-gray 57 male masculine Stewjon Human The Empire Strikes Back, Attack of the Clones , The Phantom Menace , Revenge of the Sith , Return of the Jedi , A New Hope Tribubble bongo Jedi starfighter , Trade Federation cruiser, Naboo star skiff , Jedi Interceptor , Belbullab-22 starfighter

pull() {#pull}

starwars %>%
  filter(species == "Droid") %>%
  pull(name)
## [1] "C-3PO"  "R2-D2"  "R5-D4"  "IG-88"  "R4-P17" "BB8"

Window functions {#window}

Window functions use the order of rows to calculate values. You can use them to do things that require ranking or ordering, like choose the top scores in each class, or accessing the previous and next rows, like calculating cumulative sums or means.

The dplyr window functions vignette has very good detailed explanations of these functions, but we've described a few of the most useful ones below.

Ranking functions

grades <- tibble(
  id = 1:5,
  "Data Skills" = c(16, 17, 17, 19, 20), 
  "Statistics"  = c(14, 16, 18, 18, 19)
) %>%
  gather(class, grade, 2:3) %>%
  group_by(class) %>%
  mutate(row_number = row_number(),
         rank       = rank(grade),
         min_rank   = min_rank(grade),
         dense_rank = dense_rank(grade),
         quartile   = ntile(grade, 4),
         percentile = ntile(grade, 100))
(\#tab:unnamed-chunk-1)All rows from `grades` id class grade row_number rank min_rank dense_rank quartile percentile 1 Data Skills 16 1 1.0 1 1 1 1 2 Data Skills 17 2 2.5 2 2 1 2 3 Data Skills 17 3 2.5 2 2 2 3 4 Data Skills 19 4 4.0 4 3 3 4 5 Data Skills 20 5 5.0 5 4 4 5 1 Statistics 14 1 1.0 1 1 1 1 2 Statistics 16 2 2.0 2 2 1 2 3 Statistics 18 3 3.5 3 3 2 3 4 Statistics 18 4 3.5 3 3 3 4 5 Statistics 19 5 5.0 5 4 4 5

::: {.try data-latex=""} What are the differences among row_number(), rank(), min_rank(), dense_rank(), and ntile()? Why doesn't row_number() need an argument? What would happen if you gave it the argument grade or class? What do you think would happen if you removed the group_by(class) line above? What if you added id to the grouping? What happens if you change the order of the rows? * What does the second argument in ntile() do? :::

You can use window functions to group your data into quantiles.

sw_mass <- starwars %>%
  group_by(tertile = ntile(mass, 3)) %>%
  summarise(min = min(mass),
            max = max(mass),
            mean = mean(mass),
            .groups = "drop")
(\#tab:unnamed-chunk-2)All rows from `sw_mass` tertile min max mean 1 15 68 45.6600 2 74 82 78.4100 3 83 1358 171.5789 NA NA NA NA

::: {.try data-latex=""} Why is there a row of NA values? How would you get rid of them? :::

Offset functions

The function lag() gives a previous row's value. It defaults to 1 row back, but you can change that with the n argument. The function lead() gives values ahead of the current row.

lag_lead <- tibble(x = 1:6) %>%
  mutate(lag = lag(x),
         lag2 = lag(x, n = 2),
         lead = lead(x, default = 0))
(\#tab:unnamed-chunk-3)All rows from `lag_lead` x lag lag2 lead 1 NA NA 2 2 1 NA 3 3 2 1 4 4 3 2 5 5 4 3 6 6 5 4 0

You can use offset functions to calculate change between trials or where a value changes. Use the order_by argument to specify the order of the rows. Alternatively, you can use arrange() before the offset functions.

trials <- tibble(
  trial = sample(1:10, 10),
  cond = sample(c("exp", "ctrl"), 10, T),
  score = rpois(10, 4)
) %>%
  mutate(
    score_change = score - lag(score, order_by = trial),
    change_cond = cond != lag(cond, order_by = trial, 
                              default = "no condition")
  ) %>%
  arrange(trial)
(\#tab:offset-adv)All rows from `trials` trial cond score score_change change_cond 1 ctrl 8 NA TRUE 2 ctrl 4 -4 FALSE 3 exp 6 2 TRUE 4 ctrl 2 -4 TRUE 5 ctrl 3 1 FALSE 6 ctrl 6 3 FALSE 7 ctrl 2 -4 FALSE 8 exp 4 2 TRUE 9 ctrl 4 0 TRUE 10 exp 3 -1 TRUE

::: {.try data-latex=""} Look at the help pages for lag() and lead().

Cumulative aggregates

cumsum(), cummin(), and cummax() are base R functions for calculating cumulative means, minimums, and maximums. The dplyr package introduces cumany() and cumall(), which return TRUE if any or all of the previous values meet their criteria.

cumulative <- tibble(
  time = 1:10,
  obs = c(2, 2, 1, 2, 4, 3, 1, 0, 3, 5)
) %>%
  mutate(
    cumsum = cumsum(obs),
    cummin = cummin(obs),
    cummax = cummax(obs),
    cumany = cumany(obs == 3),
    cumall = cumall(obs < 4)
  )
(\#tab:unnamed-chunk-4)All rows from `cumulative` time obs cumsum cummin cummax cumany cumall 1 2 2 2 2 FALSE TRUE 2 2 4 2 2 FALSE TRUE 3 1 5 1 2 FALSE TRUE 4 2 7 1 2 FALSE TRUE 5 4 11 1 4 FALSE FALSE 6 3 14 1 4 TRUE FALSE 7 1 15 1 4 TRUE FALSE 8 0 15 0 4 TRUE FALSE 9 3 18 0 4 TRUE FALSE 10 5 23 0 5 TRUE FALSE

::: {.try data-latex=""} What would happen if you change cumany(obs == 3) to cumany(obs > 2)? What would happen if you change cumall(obs < 4) to cumall(obs < 2)? * Can you think of circumstances in your own data where you might need to use cumany() or cumall()? :::

Glossary {#glossary-dplyr}

term definition [data wrangling](https://psyteachr.github.io/glossary/d.html#data-wrangling){class="glossary" target="_blank"} The process of preparing data for visualisation and statistical analysis.

Further Resources {#resources-dplyr}



PsyTeachR/reprores-v2 documentation built on Sept. 26, 2022, 10:06 a.m.