In Athanasiamo/tidyquintro: Quick Intro to Tidyverse

library(tidyquintro)
library(learnr)
library(gradethis)

knitr::opts_chunk$set(echo = FALSE,
                 exercise.warn_invisible = FALSE)

# enable code checking
tutorial_options(exercise.checker = grade_learnr)

Subsetting rows - filter

Let us start with some exercises in filtering, i.e. subsetting rows. Fill in the code below so that you subset the data by the species column and only have the Gentoo penguins in your output.

filter(penguins, __ == "Gentoo")

filter(penguins, species == "Gentoo")

grade_code(
  correct = random_praise(),
  incorrect = random_encouragement()
)

The column name is 'species'.

When evaluating whether something is TRUE or FALSE, remember to use '==' and not '='.
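To see why '==' is the right operator, it can help to look at what the comparison returns on its own. The sketch below uses a small made-up data frame called `toy` (an assumption for illustration, not the real penguins data) so that the result is easy to check by eye.

```r
library(dplyr)

# Hypothetical toy data standing in for the penguins dataset
toy <- data.frame(
  species = c("Adelie", "Gentoo", "Gentoo", "Chinstrap"),
  flipper_length_mm = c(178, 180, 215, 195)
)

# '==' compares each value and returns one TRUE/FALSE per row
is_gentoo <- toy$species == "Gentoo"
is_gentoo

# filter() keeps only the rows where the comparison is TRUE
gentoo <- filter(toy, species == "Gentoo")
gentoo
```

A single '=' would instead be read as a named argument, which filter() does not accept as a condition.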

Subset evaluating numerical columns

When we are subsetting based on numerical columns, we can use comparison operators. Complete the code below so you are left with only data where the flipper length is larger than 180.

filter(penguins, flipper_length_mm _ 180)

filter(penguins, flipper_length_mm > 180)

grade_code(
  correct = random_praise(),
  incorrect = random_encouragement()
)

Comparisons can be done with '==', '>' and '<'.

Subset evaluating numerical columns 2

The above code will not include any row where flipper length is exactly 180. To include those rows, you have to indicate that the value can be larger than or equal to 180.

filter(penguins, flipper_length_mm >_ 180)

filter(penguins, flipper_length_mm >= 180)

grade_code(
  correct = random_praise(),
  incorrect = random_encouragement()
)

Comparisons can also be done with '>=' (larger than or equal to) and '<=' (smaller than or equal to).
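The difference between '>' and '>=' is easiest to see on a value that sits exactly on the boundary. The following is a minimal sketch on a made-up data frame (`toy` and its `id` column are assumptions for illustration). It also includes one missing value, since filter() drops rows where the condition evaluates to NA.

```r
library(dplyr)

# Hypothetical toy data: one value exactly at 180 and one missing value
toy <- data.frame(
  id = 1:5,
  flipper_length_mm = c(178, 180, 181, 215, NA)
)

# Strictly larger than 180: the row with exactly 180 is excluded
strictly_larger <- filter(toy, flipper_length_mm > 180)
strictly_larger$id

# Larger than or equal to 180: the row with exactly 180 is included
at_least <- filter(toy, flipper_length_mm >= 180)
at_least$id
```

In both cases the row with the missing flipper length is left out, because a comparison with NA is neither TRUE nor FALSE.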

Subset with multiple conditions

We can add several conditions when we are evaluating. When the conditions are separated by a comma (','), every one of them must be TRUE for a row to be kept. Here, choose all data where flipper length is larger than or equal to 180 and the species is "Gentoo".

filter(penguins, 
       flipper_length_mm __ 180,
       ____ == "Gentoo")

filter(penguins, 
       flipper_length_mm >= 180,
       species == "Gentoo")

grade_code(
  correct = random_praise(),
  incorrect = random_encouragement()
)

If you are not succeeding, make sure each expression works individually.

Separate the different expressions with a comma.
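Separating conditions with a comma behaves like combining them with the logical AND operator '&'. The sketch below checks this on a made-up data frame (`toy` is an assumption for illustration); both calls should keep the same rows.

```r
library(dplyr)

# Hypothetical toy data standing in for the penguins dataset
toy <- data.frame(
  species = c("Adelie", "Gentoo", "Gentoo", "Chinstrap"),
  flipper_length_mm = c(190, 175, 215, 195)
)

# Conditions separated by a comma: all must be TRUE
with_comma <- filter(toy,
                     flipper_length_mm >= 180,
                     species == "Gentoo")

# The same conditions written as one expression with '&'
with_and <- filter(toy, flipper_length_mm >= 180 & species == "Gentoo")

# Compare the two results
all.equal(with_comma, with_and)
```

Only the Gentoo row with a flipper length of 215 passes both conditions; the Gentoo at 175 fails the first condition, and the Adelie and Chinstrap rows fail the second.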

Subsetting columns - select

Subsetting columns is a great way to reduce large datasets to more manageable sizes. Using the select() function from dplyr, select the first, second, fourth and sixth column from the penguins dataset using their numerical positions.

select(penguins, _, _, _, _)

select(penguins, 1, 2, 4, 6)

grade_code(
  correct = random_praise(),
  incorrect = random_encouragement()
)

Each column number should be separated by a comma

Subsetting ranges

Sometimes we want to subset whole ranges, and maybe a couple of extra columns. We can do this using the colon (':'). Complete the code below so you select columns 1 through 4, and also column 6.

select(penguins, _:_, _)

select(penguins, 1:4, 6)

grade_code(
  correct = random_praise(),
  incorrect = random_encouragement()
)

Subsetting ranges 2

While using numbers for the columns can be convenient, in most cases you'll likely want to base your selection on the names of the columns. The syntax you learned above works exactly the same for column names. Take the same code as before, but this time use the column names instead of the index numbers.

Column 1 is species, column 4 is bill_depth_mm, and column 6 is body_mass_g

select(penguins, _:_, _)

select(penguins, species:bill_depth_mm, body_mass_g)

grade_code(
  correct = random_praise(),
  incorrect = random_encouragement()
)
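Selecting a range by position and by name should give the same columns, as long as the names match the positions. Here is a small sketch on a made-up data frame whose column order mirrors the first six penguins columns (the values in `toy` are invented for illustration).

```r
library(dplyr)

# Hypothetical toy data with the same column order as the first six
# penguins columns; the values are made up
toy <- data.frame(
  species = c("Adelie", "Gentoo"),
  island = c("Torgersen", "Biscoe"),
  bill_length_mm = c(39.1, 46.1),
  bill_depth_mm = c(18.7, 13.2),
  flipper_length_mm = c(181, 211),
  body_mass_g = c(3750, 4500)
)

# Columns 1 through 4, plus column 6, by position
by_position <- select(toy, 1:4, 6)

# The same selection written with column names
by_name <- select(toy, species:bill_depth_mm, body_mass_g)

names(by_position)
names(by_name)
```

Names tend to be the more robust choice, since a selection by position silently picks different columns if the column order changes.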

Subsetting based on naming convention

Other times, it might be handy to grab columns based on their naming. If you are lucky, your dataset has some overarching naming convention that makes it possible to grab columns based on their names.

Complete the code below so that you are selecting species, island and all the columns starting with "bill".

select(penguins, _, _, starts_with("_"))

select(penguins, species, island, starts_with("bill"))

grade_code(
  correct = random_praise(),
  incorrect = random_encouragement()
)

Subsetting based on naming convention 2

Now we lost flipper length! To make sure we keep flipper length, select the columns that end with "mm" instead.

select(penguins, _, _, ends_with("_"))

select(penguins, species, island, ends_with("mm"))

grade_code(
  correct = random_praise(),
  incorrect = random_encouragement()
)
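The two helpers differ only in which end of the column name they match. The sketch below compares them on a made-up, one-row data frame that reuses the penguins column names (the values in `toy` are invented for illustration).

```r
library(dplyr)

# Hypothetical one-row toy data reusing the penguins column names
toy <- data.frame(
  species = "Gentoo",
  island = "Biscoe",
  bill_length_mm = 46.1,
  bill_depth_mm = 13.2,
  flipper_length_mm = 211,
  body_mass_g = 4500
)

# Matches on the beginning of the column name
bill_cols <- select(toy, starts_with("bill"))
names(bill_cols)

# Matches on the end of the column name
mm_cols <- select(toy, ends_with("mm"))
names(mm_cols)
```

Here starts_with("bill") picks up the two bill columns, while ends_with("mm") picks up those two and flipper_length_mm as well.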

Selecting based on column type

We should get a better idea of what type each column in our data is stored as. In particular, which columns in this data set are factors?

Complete the code to select only columns that are factors.

select(penguins, where(is._))

select(penguins, where(is.factor))

grade_code(
  correct = random_praise(),
  incorrect = random_encouragement()
)

The function for checking whether an object is a vector is `is.vector`; the check for factors follows the same naming pattern.
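where() takes a function that returns TRUE or FALSE for each column, and select() keeps the columns for which it returns TRUE. A minimal sketch on made-up data follows (`toy` is an assumption for illustration, and where() inside select() assumes dplyr 1.0.0 or later).

```r
library(dplyr)

# Hypothetical toy data with two factor columns and two numeric columns
toy <- data.frame(
  species = factor(c("Adelie", "Gentoo")),
  island = factor(c("Torgersen", "Biscoe")),
  bill_length_mm = c(39.1, 46.1),
  year = c(2007L, 2007L)
)

# Keep only the columns for which is.factor() returns TRUE
factors_only <- select(toy, where(is.factor))
names(factors_only)

# The same pattern works with other type checks, such as is.numeric()
numeric_only <- select(toy, where(is.numeric))
names(numeric_only)
```

Note that the function is passed without parentheses: where(is.factor), not where(is.factor()).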

Quiz

quiz(
  question("What functions can you use to subset a data set by rows?",
    answer("dplyr's `filter()`", correct = TRUE),
    answer("dplyr's `select()`"),
    answer("`subset()`", correct = TRUE),
    allow_retry = TRUE
  ),
  question("What functions can you use to subset a data set by columns?",
    answer("dplyr's `filter()`"),
    answer("dplyr's `select()`", correct = TRUE),
    answer("`subset()`", correct = TRUE),
    allow_retry = TRUE
  ),
  question("If you want to select all columns in data 'df' that contain the string 'something', you can do that by",
    answer("`df[grepl('something', names(df))]`", correct = TRUE),
    answer("`select(df, starts_with('something'))`"),
    answer("`df[,'something']`"),
    answer("`select(df, contains('something'))`", correct = TRUE),
    allow_retry = TRUE
  ),
  question("If you want to subset rows so that you only have those below 18 years of age, you can do that by",
    answer("`df$age < 18`"),
    answer("`filter(df, age < 18)`", correct = TRUE),
    answer("`df[df$age < 18,]`", correct = TRUE),
    answer("`filter(df, age <= 18)`"),
    allow_retry = TRUE
  )
)
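If you want to check the base R and dplyr options from the quiz side by side, the sketch below runs them on a made-up data frame (`df` and its column names are assumptions for illustration).

```r
library(dplyr)

# Hypothetical data: two column names contain the string 'something'
df <- data.frame(
  age = c(12, 18, 30),
  my_something = 1:3,
  something_else = 4:6
)

# A comparison on its own returns a logical vector, not a subset
mask <- df$age < 18
mask

# Subsetting rows: base R indexing and dplyr's filter()
base_rows <- df[df$age < 18, ]
dplyr_rows <- filter(df, age < 18)

# Subsetting columns by a string anywhere in the name
base_cols <- df[grepl("something", names(df))]
dplyr_cols <- select(df, contains("something"))

# starts_with() only matches names that begin with the string
start_cols <- select(df, starts_with("something"))

names(base_cols)
names(dplyr_cols)
names(start_cols)
```

The person aged exactly 18 is not kept by '<', and starts_with() misses the my_something column because the string is not at the start of that name.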