library(learnr)
library(gradethis)

knitr::opts_chunk$set(
  echo = FALSE,
  exercise.warn_invisible = FALSE
)

# enable code checking
tutorial_options(
  exercise.checker = grade_learnr,
  exercise.lines = 8,
  exercise.reveal_solution = TRUE
)
Subsetting columns is a great way to reduce large datasets to more manageable sizes.
Using the select() function from dplyr, select the first, second, fourth and sixth columns from the penguins dataset using their column numbers.
select(penguins, _, _, _, _)
select(penguins, 1, 2, 4, 6)
grade_code( correct = random_praise(), incorrect = random_encouragement() )
Each column number should be separated by a comma.
Sometimes we want to subset whole ranges, and maybe a couple of extra columns. We can do this using the colon. Complete the code below so you select columns 1 through 4, and also column 6.
select(penguins, _:_, _)
select(penguins, 1:4, 6)
grade_code( correct = random_praise(), incorrect = random_encouragement() )
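As a rough illustration of positional selection, the sketch below uses a small hand-made data frame called `birds`. The data frame and its column names are hypothetical and only serve the example. It shows single positions and a range combined with one extra position.

```r
library(dplyr)

# Hypothetical toy data for illustration (not the penguins dataset)
birds <- data.frame(
  id       = c("a", "b", "c"),
  site     = c("north", "south", "north"),
  wing_mm  = c(180, 195, 210),
  tail_mm  = c(60, 72, 80),
  weight_g = c(3000, 3600, 5100),
  year     = c(2007L, 2008L, 2009L)
)

# Single positions: columns 1 and 3
by_position <- select(birds, 1, 3)
names(by_position)  # "id" "wing_mm"

# A range plus one extra column: columns 1 to 3, then column 6
by_range <- select(birds, 1:3, 6)
names(by_range)     # "id" "site" "wing_mm" "year"
```

The colon expands to every position between its two ends, so `1:3` is shorthand for `1, 2, 3`.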
While using numbers for the columns can be convenient, in most cases you'll likely want to base your selection on the names of the columns. The syntax you learned above works exactly the same for column names. Take the same code as before, but this time use the column names instead of the index numbers.
Column 1 is species, column 4 is bill_depth_mm, and column 6 is body_mass_g.
select(penguins, _:_, _)
select(penguins, species:bill_depth_mm, body_mass_g)
grade_code( correct = random_praise(), incorrect = random_encouragement() )
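To see that names and positions are interchangeable, here is a minimal sketch on a hypothetical data frame `birds` (invented for this example). The same selection is written once with positions and once with names.

```r
library(dplyr)

# Hypothetical toy data for illustration (not the penguins dataset)
birds <- data.frame(
  id       = c("a", "b", "c"),
  site     = c("north", "south", "north"),
  wing_mm  = c(180, 195, 210),
  tail_mm  = c(60, 72, 80),
  weight_g = c(3000, 3600, 5100),
  year     = c(2007L, 2008L, 2009L)
)

# Range by position
by_position <- select(birds, 1:3, 6)

# The same range by name: every column from id through wing_mm, plus year
by_name <- select(birds, id:wing_mm, year)

# Both approaches pick the same columns in the same order
identical(names(by_position), names(by_name))
```

Name-based ranges depend on the column order in the data, so `id:wing_mm` only means "the first three columns" as long as that order does not change.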
Other times, it might be handy to grab columns based on patterns in their names. If you are lucky, your dataset follows an overarching naming convention that makes this possible.
Complete the code below so that you are selecting species, island and all the columns starting with "bill".
select(penguins, _, _, starts_with("_"))
select(penguins, species, island, starts_with("bill"))
grade_code( correct = random_praise(), incorrect = random_encouragement() )
Now we lost flipper length! To make sure we keep flipper length, instead select the columns that end with "mm".
select(penguins, _, _, ends_with("_"))
select(penguins, species, island, ends_with("mm"))
grade_code( correct = random_praise(), incorrect = random_encouragement() )
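The difference between the two helpers can be sketched on a hypothetical data frame `birds`, whose column names are made up for this example. `starts_with()` matches a prefix and `ends_with()` matches a suffix.

```r
library(dplyr)

# Hypothetical toy data for illustration (not the penguins dataset)
birds <- data.frame(
  id             = c("a", "b", "c"),
  wing_length_mm = c(180, 195, 210),
  wing_width_mm  = c(55, 60, 64),
  tail_length_mm = c(60, 72, 80),
  weight_g       = c(3000, 3600, 5100)
)

# Prefix match: only the two wing columns are picked up
wing_cols <- select(birds, id, starts_with("wing"))
names(wing_cols)  # "id" "wing_length_mm" "wing_width_mm"

# Suffix match: every column measured in millimetres, tail length included
mm_cols <- select(birds, id, ends_with("mm"))
names(mm_cols)    # "id" "wing_length_mm" "wing_width_mm" "tail_length_mm"
```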
Take the same code below, and add the tidy-selector everything() to it. What does it do?
select(penguins, _, _, ends_with("_"))
select(penguins, species, island, ends_with("mm"), everything())
grade_code( correct = random_praise(), incorrect = random_encouragement() )
everything()
This function is a tidyselector that selects all columns not yet mentioned. It's a very convenient way of re-arranging your columns, so that you keep everything, but the columns you are most interested in are at the beginning of the data.
We should get a better idea of how the columns in our data are coded. In particular, which columns in this data set are factors?
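A small sketch of this re-ordering, using a hypothetical data frame `birds` invented for the example: the columns named first move to the front, and `everything()` appends the rest without dropping any.

```r
library(dplyr)

# Hypothetical toy data for illustration (not the penguins dataset)
birds <- data.frame(
  id       = c("a", "b", "c"),
  site     = c("north", "south", "north"),
  wing_mm  = c(180, 195, 210),
  tail_mm  = c(60, 72, 80),
  weight_g = c(3000, 3600, 5100)
)

# Move weight and the millimetre columns to the front, keep all the others
reordered <- select(birds, weight_g, ends_with("mm"), everything())
names(reordered)  # "weight_g" "wing_mm" "tail_mm" "id" "site"

# No columns were lost, only the order changed
ncol(reordered) == ncol(birds)
```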
Complete the code to select only columns that are factors.
select(penguins, where(is._))
select(penguins, where(is.factor))
grade_code( correct = random_praise(), incorrect = random_encouragement() )
The function for checking whether a vector is a factor is `is.factor`.
Select only columns that are integers.
select(penguins, where(is._))
select(penguins, where(is.integer))
grade_code( correct = random_praise(), incorrect = random_encouragement() )
Try using the `is.integer` function.
Select island, species, and all columns that are numeric.
select(penguins, _, _, where(_))
select(penguins, island, species, where(is.numeric))
grade_code( correct = random_praise(), incorrect = random_encouragement() )
Try using the `is.numeric` function.
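To compare the type-checking functions, the sketch below applies `where()` to a hypothetical data frame `birds` (made up for this example) that has one factor, two integer columns and one double column. It assumes dplyr 1.0.0 or later, where `where()` is available inside `select()`.

```r
library(dplyr)

# Hypothetical toy data for illustration (not the penguins dataset)
birds <- data.frame(
  id      = 1:3,                                   # integer
  site    = factor(c("north", "south", "north")),  # factor
  wing_mm = c(180.5, 195, 210),                    # double
  year    = c(2007L, 2008L, 2009L)                 # integer
)

factors  <- select(birds, where(is.factor))
names(factors)   # "site"

integers <- select(birds, where(is.integer))
names(integers)  # "id" "year"

# is.numeric is TRUE for both integer and double columns
numerics <- select(birds, where(is.numeric))
names(numerics)  # "id" "wing_mm" "year"
```

Note that `where()` takes the function itself, without parentheses, and calls it once per column.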
Let us start with some exercises in filtering, i.e. subsetting rows.
Fill in the code below so that you subset the data by the species column, so you only have the Gentoo penguins in your output.
filter(penguins, __ __ "Gentoo")
filter(penguins, species == "Gentoo")
grade_code( correct = random_praise(), incorrect = random_encouragement() )
The column name is 'species'.
When evaluating something as TRUE or FALSE, remember to use '==' and not '='.
When we are subsetting based on numerical columns, we can use comparison operators. Complete the code below so you are left with only data where the flipper length is larger than 180.
filter(penguins, flipper_length_mm _ 180)
filter(penguins, flipper_length_mm > 180)
grade_code( correct = random_praise(), incorrect = random_encouragement() )
Comparisons can be done with '==', '>', '<'.
The above code will not include any row where flipper length is exactly 180. To include those rows, you have to indicate that the value can be larger than or equal to 180.
filter(penguins, flipper_length_mm >_ 180)
filter(penguins, flipper_length_mm >= 180)
grade_code( correct = random_praise(), incorrect = random_encouragement() )
Comparisons can also be done with '>=' (larger than or equal to) and '<=' (smaller than or equal to).
When you separate expressions with a comma (','), every expression must be TRUE for a row to be kept. Choose all data where flipper length is larger than or equal to 180, and the species is "Gentoo".
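The boundary case can be sketched on a hypothetical data frame `birds` (invented for this example) that contains one row sitting exactly on the threshold.

```r
library(dplyr)

# Hypothetical toy data for illustration (not the penguins dataset)
birds <- data.frame(
  id      = c("a", "b", "c", "d"),
  wing_mm = c(175, 180, 195, 210)
)

# Strictly larger: the row with exactly 180 is dropped
strictly_larger <- filter(birds, wing_mm > 180)
nrow(strictly_larger)  # 2

# Larger than or equal: the row with exactly 180 is kept
at_least <- filter(birds, wing_mm >= 180)
nrow(at_least)         # 3
```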
filter(penguins, flipper_length_mm __ 180_ ____ == "Gentoo")
filter(penguins, flipper_length_mm >= 180, species == "Gentoo")
grade_code( correct = random_praise(), incorrect = random_encouragement() )
If you are not succeeding, make sure each expression works individually.
Separate the different expressions with a comma.
Do the same using the & (and) sign.
filter(penguins, flipper_length_mm >= 180, species == "Gentoo")
filter(penguins, flipper_length_mm >= 180 & species == "Gentoo")
grade_code( correct = random_praise(), incorrect = random_encouragement() )
If you are not succeeding, make sure each expression works individually.
Separate the different expressions with a `&`.
Filter the penguins data so that you keep penguins that are either chinstraps or have a body mass of 3 kilos (3000 g) or less.
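As a quick check that the two spellings behave the same, this sketch filters a hypothetical data frame `birds` (made up for the example) once with a comma and once with `&`.

```r
library(dplyr)

# Hypothetical toy data for illustration (not the penguins dataset)
birds <- data.frame(
  id      = c("a", "b", "c", "d"),
  site    = c("north", "south", "south", "north"),
  wing_mm = c(175, 180, 195, 210)
)

# Comma-separated conditions: a row is kept only if all of them are TRUE
with_comma <- filter(birds, wing_mm >= 180, site == "south")

# The same conditions combined into one expression with &
with_and <- filter(birds, wing_mm >= 180 & site == "south")

# Both calls keep the same rows
isTRUE(all.equal(with_comma, with_and))
```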
filter(penguins, species __ "Chinstrap" _ body_mass_g __ 3000)
filter(penguins, species == "Chinstrap" | body_mass_g <= 3000)
grade_code( correct = random_praise(), incorrect = random_encouragement() )
If you are not succeeding, make sure each expression works individually.
Separate the different expressions with a `|`.
Create an object named gentoos that contains only data from the species "Gentoo" in the penguins data set.
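The "or" operator keeps a row when at least one condition is TRUE. A minimal sketch on a hypothetical data frame `birds` (invented for this example):

```r
library(dplyr)

# Hypothetical toy data for illustration (not the penguins dataset)
birds <- data.frame(
  id       = c("a", "b", "c", "d"),
  site     = c("north", "north", "south", "north"),
  weight_g = c(2900, 3000, 3600, 5100)
)

# Keep rows from the south site OR rows weighing 3000 g or less
south_or_light <- filter(birds, site == "south" | weight_g <= 3000)
south_or_light$id  # "a" "b" "c"
```

Rows "a" and "b" pass on weight alone, row "c" passes on site alone, and row "d" fails both conditions.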
gentoos <- filter(penguins, __ == __)
gentoos <- filter(penguins, species == "Gentoo")
grade_code( correct = random_praise(), incorrect = random_encouragement() )
Make sure you spell `species` in lowercase and `Gentoo` with a capital G! R is case-sensitive.
Create another object with only penguins that are over 4 kilos, and call it large_penguins.
__ <- filter(penguins, body_mass_g _ ___)
large_penguins <- filter(penguins, body_mass_g > 4000)
grade_code( correct = random_praise(), incorrect = random_encouragement() )
Create a third object with observations from Dream island, keeping only the island column and the bill measurements, and call it dream_penguins. Do all this by chaining the commands with the pipe.
__ <- penguins __ filter(__ == "_") __ select(_, __("bill"))
dream_penguins <- penguins %>% filter(island == "Dream") %>% select(island, starts_with("bill"))
grade_code( correct = random_praise(), incorrect = random_encouragement() )
Make sure you spell `island` in lowercase and `Dream` with a capital D! R is case-sensitive.
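The chaining pattern can be sketched on a hypothetical data frame `birds` (made up for the example): each step receives the result of the previous one, and the final result is stored under a new name.

```r
library(dplyr)

# Hypothetical toy data for illustration (not the penguins dataset)
birds <- data.frame(
  id             = c("a", "b", "c", "d"),
  site           = c("north", "south", "south", "north"),
  wing_length_mm = c(175, 180, 195, 210),
  wing_width_mm  = c(52, 55, 60, 64),
  weight_g       = c(2900, 3000, 3600, 5100)
)

# First keep the rows from one site, then keep only the columns of interest
south_wings <- birds %>%
  filter(site == "south") %>%
  select(site, starts_with("wing"))

names(south_wings)  # "site" "wing_length_mm" "wing_width_mm"
nrow(south_wings)   # 2
```

The order matters here: filtering on `site` has to happen while that column is still present, so it is safest to filter before dropping columns.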
Arrange the penguins data by body mass.
penguins %>% __(body_mass_g)
penguins %>% arrange(body_mass_g)
grade_code( correct = random_praise(), incorrect = random_encouragement() )
Use the `arrange` function.
Arrange the penguins data by descending order of flipper length.
penguins %>% __(__(flipper_length_mm))
penguins %>% arrange(desc(flipper_length_mm))
grade_code( correct = random_praise(), incorrect = random_encouragement() )
Use the `arrange` function together with `desc`.
You can arrange on multiple columns. Try arranging the data by ascending island and descending flipper length.
penguins %>% __(island, __(flipper_length_mm))
penguins %>% arrange(island, desc(flipper_length_mm))
grade_code( correct = random_praise(), incorrect = random_encouragement() )
Use two arguments, with a comma in between.
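Sorting on several columns works from left to right: later columns only break ties in earlier ones. The sketch below shows this on a hypothetical data frame `birds` invented for the example.

```r
library(dplyr)

# Hypothetical toy data for illustration (not the penguins dataset)
birds <- data.frame(
  site    = c("south", "north", "south", "north"),
  wing_mm = c(180, 210, 195, 188)
)

# Ascending by site, then descending by wing length within each site
sorted <- birds %>% arrange(site, desc(wing_mm))

sorted$site     # "north" "north" "south" "south"
sorted$wing_mm  # 210 188 195 180
```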
quiz(
  question("What functions can you use to subset a data set by rows?",
    answer("dplyr's `filter()`", correct = TRUE),
    answer("dplyr's `select()`"),
    answer("`subset()`", correct = TRUE),
    allow_retry = TRUE
  ),
  question("What functions can you use to subset a data set by columns?",
    answer("dplyr's `filter()`"),
    answer("dplyr's `select()`", correct = TRUE),
    answer("`subset()`", correct = TRUE),
    allow_retry = TRUE
  ),
  question("If you want to select all columns in data 'df' whose names contain the string 'something', you can do that by",
    answer("`df[grepl('something', names(df))]`", correct = TRUE),
    answer("`select(df, starts_with('something'))`"),
    answer("`df[,'something']`"),
    answer("`select(df, contains('something'))`", correct = TRUE),
    allow_retry = TRUE
  ),
  question("If you want to subset rows so that you only have those below 18 years of age, you can do that by",
    answer("`df$age < 18`"),
    answer("`filter(df, age < 18)`", correct = TRUE),
    answer("`df[df$age < 18,]`", correct = TRUE),
    answer("`filter(df, age <= 18)`"),
    allow_retry = TRUE
  )
)