library(tidyquintro)
library(learnr)
library(gradethis)

knitr::opts_chunk$set(echo = FALSE,
                 exercise.warn_invisible = FALSE)

# enable code checking
tutorial_options(exercise.checker = grade_learnr)

Summarising the whole dataset

Summarising takes some practice to get right, so it's best to just give it a go!

Start by summarising a single column, bill_length_mm, by calculating its mean.

penguins |> 
  summarise(_(_, na.rm = _))
penguins |> 
  summarise(mean(bill_length_mm, na.rm = TRUE))
grade_code(
  correct = random_praise(),
  incorrect = random_encouragement()
)
Did you remember to place the function first, then the column name inside the function?
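To see why the na.rm argument matters, here is a minimal sketch on a made-up one-column table. The table `toy` and the output name `mean_bill` are hypothetical stand-ins for illustration and are not part of the exercise.

```r
library(dplyr)

# Hypothetical toy data standing in for the penguins table
toy <- tibble(bill_length_mm = c(40, NA, 46, 50))

# Without na.rm = TRUE, a single missing value makes the mean NA
with_na <- toy |>
  summarise(mean_bill = mean(bill_length_mm))

# With na.rm = TRUE, missing values are dropped before averaging
without_na <- toy |>
  summarise(mean_bill = mean(bill_length_mm, na.rm = TRUE))

print(with_na)
print(without_na)
```

Both results are one-row tables: summarise collapses the whole dataset into a single row per summary.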

Summarise two columns

Often, we'd like to summarise several columns at once. Get the mean for both bill_depth_mm and bill_length_mm by summarising each.

penguins |> 
  summarise(bill_length_mm = mean(__, na.rm = _),
            bill_depth_mm = mean(__, na.rm = _))
penguins |> 
  summarise(bill_length_mm = mean(bill_length_mm, na.rm = TRUE),
            bill_depth_mm = mean(bill_depth_mm, na.rm = TRUE))
grade_code(
  correct = random_praise(),
  incorrect = random_encouragement()
)
Make sure the correct column names go to the correct summary!
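As a rough illustration of how named summaries end up as columns, the sketch below runs the same pattern on a small made-up table (`toy` is hypothetical). The name on the left of each `=` becomes the output column; the column inside `mean()` is the one being summarised.

```r
library(dplyr)

# Hypothetical toy data standing in for the penguins table
toy <- tibble(
  bill_length_mm = c(40, NA, 46, 50),
  bill_depth_mm  = c(18, 20, 14, NA)
)

# Each named argument produces one output column
bill_means <- toy |>
  summarise(bill_length_mm = mean(bill_length_mm, na.rm = TRUE),
            bill_depth_mm  = mean(bill_depth_mm, na.rm = TRUE))

print(bill_means)
```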

Summarise across many columns

Even more often, we'd like to summarise a collection of columns. In the tidyverse we do this with the across function, which summarises multiple columns at once using tidy-selectors. Get the mean of all the columns starting with "bill".

penguins |> 
  summarise(across(__, .fns = mean, na.rm = TRUE))
penguins |> 
  summarise(across(starts_with("bill"), .fns = mean, na.rm = TRUE))
grade_code(
  correct = random_praise(),
  incorrect = random_encouragement()
)
Remember to use the tidy selectors like ends_with, contains, and starts_with.
The expectation here is to use the tidy-selector starts_with.
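The sketch below shows a tidy-selector picking columns by name prefix on a made-up table (`toy` is hypothetical). It passes `na.rm = TRUE` through an anonymous function rather than as an extra argument to across, since dplyr 1.1.0 and later deprecate passing extra arguments that way. The exercise itself still expects the form shown in the solution.

```r
library(dplyr)

# Hypothetical toy data; body_mass_g is there to show what is NOT selected
toy <- tibble(
  bill_length_mm = c(40, NA, 46, 50),
  bill_depth_mm  = c(18, 20, 14, NA),
  body_mass_g    = c(3700, 3800, 5000, 5200)
)

# starts_with("bill") selects only the two bill columns
bill_means <- toy |>
  summarise(across(starts_with("bill"), \(x) mean(x, na.rm = TRUE)))

print(bill_means)
```

The output keeps the original column names, and body_mass_g is absent because it does not match the selector.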

Summarise across many columns with several functions

We can also apply several functions to each column in one go by passing across a named list of functions. Get the descriptive statistics (mean, sd, min and max) of all the columns starting with "bill".

penguins |> 
  summarise(across(__, .fns = list(mean = mean,
                                   _ = _,
                                   _ = _,
                                   _ = _), 
                     na.rm = TRUE)
  )
penguins |> 
  summarise(across(starts_with("bill"), 
                   .fns = list(mean = mean,
                               sd = sd,
                               min = min,
                               max = max), 
                   na.rm = TRUE)
  )
grade_code(
  correct = random_praise(),
  incorrect = random_encouragement()
)
The expectation here is to give each output exactly the same name as its function.
Be sure to use all lowercase letters here.
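The names in the function list matter because across uses them to build the output column names. A minimal sketch on a made-up table (`toy` is hypothetical), using anonymous functions to pass `na.rm = TRUE`:

```r
library(dplyr)

# Hypothetical toy data standing in for the penguins table
toy <- tibble(
  bill_length_mm = c(40, NA, 46, 50),
  bill_depth_mm  = c(18, 20, 14, NA)
)

# Each list name is appended to the column name: <column>_<function name>
bill_stats <- toy |>
  summarise(across(starts_with("bill"),
                   .fns = list(mean = \(x) mean(x, na.rm = TRUE),
                               sd   = \(x) sd(x, na.rm = TRUE),
                               min  = \(x) min(x, na.rm = TRUE),
                               max  = \(x) max(x, na.rm = TRUE))))

print(names(bill_stats))
```

With two selected columns and four functions, the result has eight columns, such as bill_length_mm_mean and bill_depth_mm_max.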

Summarising grouped data

Tidyverse summaries become even more powerful when paired with grouped data. Grouping makes it possible to compute each summary separately for every meaningful group in the data.

Start out slow by grouping the data by species and getting the mean of the bill_length_mm column.

penguins |> 
  group_by() |> 
  summarise(_(_, na.rm = _))
penguins |> 
  group_by(species) |> 
  summarise(mean(bill_length_mm, na.rm = TRUE))
grade_code(
  correct = random_praise(),
  incorrect = random_encouragement()
)
Did you remember to place the function first, then the column name inside the function?
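To illustrate what grouping changes, the sketch below summarises a made-up table with two species (the table `toy`, the species labels and the output name `mean_bill` are all hypothetical). Instead of a single row, the result has one row per group.

```r
library(dplyr)

# Hypothetical toy data with two made-up species
toy <- tibble(
  species        = c("A", "A", "B", "B"),
  bill_length_mm = c(40, NA, 46, 50)
)

# One summary row per species instead of one row overall
by_species <- toy |>
  group_by(species) |>
  summarise(mean_bill = mean(bill_length_mm, na.rm = TRUE))

print(by_species)
```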

Summarise two columns

Maybe the islands play a larger role? Group the data by island instead, and get the mean of both bill_length_mm and bill_depth_mm.

penguins |> 
  group_by(_) |> 
  summarise(bill_length_mm = mean(__, na.rm = _),
            bill_depth_mm = mean(__, na.rm = _))
penguins |> 
  group_by(island) |> 
  summarise(bill_length_mm = mean(bill_length_mm, na.rm = TRUE),
            bill_depth_mm = mean(bill_depth_mm, na.rm = TRUE))
grade_code(
  correct = random_praise(),
  incorrect = random_encouragement()
)
Make sure the correct column names go to the correct summary!

Summarise across many columns

Actually, I'm convinced that both species and island make meaningful groups here. Group the data by both, and get the mean of all bill measurements.

penguins |> 
  group_by(_) |> 
  summarise(across(__, .fns = mean, na.rm = TRUE))
penguins |> 
  group_by(species, island) |> 
  summarise(across(starts_with("bill"), .fns = mean, na.rm = TRUE))
grade_code(
  correct = random_praise(),
  incorrect = random_encouragement()
)
Remember to use the tidy selectors like ends_with, contains, and starts_with.
The expectation here is to use the tidy-selector starts_with.
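When grouping by more than one column, it can help to know that summarise by default removes only the last grouping level, so the result is still grouped by the first. The sketch below shows this on a made-up table (`toy` and its labels are hypothetical) and uses summarise's `.groups` argument to return an ungrouped result instead. This goes slightly beyond what the exercise asks for.

```r
library(dplyr)

# Hypothetical toy data with made-up species and island labels
toy <- tibble(
  species        = c("A", "A", "B", "B", "B"),
  island         = c("X", "Y", "X", "X", "Y"),
  bill_length_mm = c(40, 42, 46, NA, 50),
  bill_depth_mm  = c(18, 20, NA, 15, 16)
)

# Default behaviour: one row per species/island pair, still grouped by species
by_default <- toy |>
  group_by(species, island) |>
  summarise(across(starts_with("bill"), \(x) mean(x, na.rm = TRUE)))

# .groups = "drop" returns the same rows without any grouping
dropped <- toy |>
  group_by(species, island) |>
  summarise(across(starts_with("bill"), \(x) mean(x, na.rm = TRUE)),
            .groups = "drop")

print(group_vars(by_default))
print(dropped)
```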

Summarise across many columns with several functions

We can also apply several functions to each column in one go by passing across a named list of functions. Get the descriptive statistics (mean, sd, min and max) of all the columns starting with "bill".

penguins |> 
  summarise(across(__, .fns = list(mean = mean,
                                   _ = _,
                                   _ = _,
                                   _ = _), 
                     na.rm = TRUE)
  )
penguins |> 
  summarise(across(starts_with("bill"), 
                   .fns = list(mean = mean,
                               sd = sd,
                               min = min,
                               max = max), 
                   na.rm = TRUE)
  )
grade_code(
  correct = random_praise(),
  incorrect = random_encouragement()
)
The expectation here is to give each output exactly the same name as its function.
Be sure to use all lowercase letters here.

Play around

The best way to get a feel for how things work is to play around. Adapt the code below and try different things: see what happens, look at the errors you get, and so on.

penguins |> 
  group_by(_) |> 
  summarise(across(__, 
                     .fns = list(), 
                     na.rm = TRUE)
  )


Athanasiamo/tidyquintro documentation built on Oct. 11, 2022, 7:15 p.m.