library(tidyverse)
library(lubridate)
library(stringr)
library(learnr)
library(skimr)
library(shiny)
library(PPBDS.data)
knitr::opts_chunk$set(echo = FALSE, message = FALSE)
options(tutorial.exercise.timelimit = 60, tutorial.storage = "local")

# Set up stringr-objects
library(dslabs)
murders <- as_tibble(murders)
states <- murders$state
states2 <- murders %>%
  select(state, abb)

# Set up cars
# DK: Creating a column called "type" is a bad idea since it is an R function
# name. Also, why are we using this random R package?
library(fueleconomy)
lexus_2000 <- vehicles %>%
  filter(year == 2000, make == "Lexus") %>%
  select(id, make, model, class, drive)
lexus_1999 <- vehicles %>%
  filter(year == 1999, make == "Lexus") %>%
  select(id, make, model, class, trans, drive)
lexus_1998 <- vehicles %>%
  filter(year == 1998, make == "Lexus") %>%
  select(id, make, model, class, trans, drive) %>%
  rename("type" = class)
lexus_mileage <- vehicles %>%
  filter(year == 2000, make == "Lexus") %>%
  select(id, hwy, cty) %>%
  slice(3:9)

# Tidy section setup
library(babynames)
cases <- tribble(
  ~Country, ~"2011", ~"2012", ~"2013",
  "FR", 7000, 6900, 7000,
  "DE", 5800, 6000, 6200,
  "US", 15000, 14000, 13000
)
cases2 <- tribble(
  ~city, ~country, ~continent, ~"2011", ~"2012", ~"2013",
  "Paris", "FR", "Europe", 7000, 6900, 7000,
  "Berlin", "DE", "Europe", 5800, 6000, 6200,
  "Chicago", "US", "North America", 15000, 14000, 13000
)
pollution <- tribble(
  ~city, ~size, ~amount,
  "New York", "large", 23,
  "New York", "small", 14,
  "London", "large", 22,
  "London", "small", 16,
  "Beijing", "large", 121,
  "Beijing", "small", 121
)

# Needed for later sections of the tutorial
library(fivethirtyeight)
library(nycflights13)
library(ggthemes)
Confirm that you have the correct version of PPBDS.data installed by pressing "Run Code."
packageVersion('PPBDS.data')
The answer should be ‘0.3.2.9004’, or a higher number. If it is not, you should upgrade your installation by issuing these commands:
remove.packages('PPBDS.data')
library(remotes)
remotes::install_github('davidkane9/PPBDS.data')
Strictly speaking, it should not be necessary to remove a package. Just installing it again should overwrite the current version. But weird things sometimes happen, so removing first is the safest approach.
``` {r name, echo=FALSE}
question_text(
  "Student Name:",
  answer(NULL, correct = TRUE),
  incorrect = "Ok",
  try_again_button = "Modify your answer",
  allow_retry = TRUE
)
```

## Email

``` {r email, echo=FALSE}
question_text(
  "Email:",
  answer(NULL, correct = TRUE),
  incorrect = "Ok",
  try_again_button = "Modify your answer",
  allow_retry = TRUE
)
```
This tutorial uses the following libraries:
library(fivethirtyeight)
library(nycflights13)
library(dslabs)
library(fueleconomy)
library(babynames)
library(ggthemes)
If you have not installed these packages, you will likely encounter issues when attempting the tutorial, so make sure to install them first. If you don't remember how, see the `install.packages()` function discussed in The Primer.
The first few exercises focus on various functions that can be used to manipulate strings.
The `states` character vector has been created for you from the `murders` data set in the dslabs package. It contains the names of all US states and the District of Columbia. Have a look at it by printing it to the console.
# Type `states` and press Run Code
Use `str_detect()` on `states` to create a vector which will be TRUE for states that contain the pattern "ana" and FALSE otherwise.
str_detect(..., pattern = ...)
Use `str_subset()` on `states` to create a vector of the names of the states that contain the pattern "ana".
str_subset(..., pattern = ...)
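As a minimal sketch of how these two functions differ, consider a small made-up vector. The `fruits` vector is an assumption for illustration only and is not part of the tutorial data.

```r
library(stringr)

# Hypothetical example vector, not part of the tutorial data
fruits <- c("banana", "apple", "cantaloupe", "mango")

# str_detect() returns one logical value per element
has_an <- str_detect(fruits, pattern = "an")
has_an

# str_subset() returns only the elements that match
with_an <- str_subset(fruits, pattern = "an")
with_an
```

In other words, `str_subset(x, p)` gives the same result as `x[str_detect(x, p)]`.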
Use `str_split()` on `states` to help identify states which have two or more words in their names. Set the `simplify` argument to TRUE. The result will be a character matrix with 51 rows and three columns.
# " " is the simplest pattern that identifies the spaces between words.
str_split(..., pattern = ..., simplify = TRUE)
Try again to identify states whose names consist of two or more words, this time using `str_split_fixed()`. Set the `n` argument to 2, which should split elements into two parts. Observe what happens to District of Columbia.
str_split_fixed(..., pattern = ..., n = ...)
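The two splitting functions can be compared on a toy input. This is only a sketch: the `places` vector below is hypothetical and was chosen to have one-, two-, and three-word names.

```r
library(stringr)

# Hypothetical place names with one, two, and three words
places <- c("Paris", "New Delhi", "Rio de Janeiro")

# simplify = TRUE pads shorter results with "" so every row has the same width
split_all <- str_split(places, pattern = " ", simplify = TRUE)
dim(split_all)

# n = 2 stops after the first split, so the remainder stays in one piece
split_two <- str_split_fixed(places, pattern = " ", n = 2)
split_two[3, ]
```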
Using `str_sub()`, create a character vector that contains only the first three letters of each state.
str_sub(states, 1, 3)
Collapse `states` using `str_c()`. Separate the names with a comma followed by a space. This should create a single character string containing all the states.
str_c(..., collapse = ", ")
Use `str_c()` to combine states into the form `state1 & state2`. Combine states 1 through 25 with states 26 through 50. Note that we are excluding the 51st state.
str_c(..., ..., sep = ...)
# One approach is to use brackets ([]) to subset the elements of `states`
# which you want to combine.
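The `collapse` and `sep` arguments of `str_c()` do different jobs, which a short sketch may make clearer. The vector `x` is made up for this illustration.

```r
library(stringr)

# Hypothetical vector of six letters
x <- c("a", "b", "c", "d", "e", "f")

# collapse joins all elements of one vector into a single string
one_string <- str_c(x, collapse = ", ")
one_string

# sep joins two vectors element by element, giving three strings here
pairs <- str_c(x[1:3], x[4:6], sep = " & ")
pairs
```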
Use `str_replace()` to replace the pattern `North` with `N.`. For example, transform North Carolina into N. Carolina.
str_replace(..., pattern = ..., replacement = ...)
Next, let's see how the above functions can be combined with regular expressions. Use `str_subset()` on `states` to create a vector of states that have two a's with a single intervening character in their name:
# Consider using the regex "."
str_subset(..., pattern = ...)
Use `str_subset()` to identify the same pattern as in the previous question, including now only those states where the pattern occurs at the end of their name.
# Consider using the pattern "a.a$"
str_subset(..., pattern = ...)
Use `str_subset()` to find `states` that contain the letter "a", then one or more characters, and then another "a".
# Consider using the pattern "a.+a"
str_subset(..., pattern = ...)
Does capitalization matter? Repeat the previous question but replace the first letter with a capital "A".
str_subset(..., pattern = ...)
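The regular expressions used in the last few exercises can be tried on a small made-up vector first. This is a sketch only, and `words` is a hypothetical vector rather than the tutorial's `states`.

```r
library(stringr)

# Hypothetical words, not the tutorial's states
words <- c("Alaska", "banana", "salad", "Atlanta", "area")

# "a", any single character, "a"
one_between <- str_subset(words, "a.a")

# the same pattern, anchored to the end of the string with $
at_end <- str_subset(words, "a.a$")

# "a", one or more characters, "a"
many_between <- str_subset(words, "a.+a")

# matching is case-sensitive, so "A" only matches a capital letter
capital_first <- str_subset(words, "A.+a")
```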
The remaining exercises in this section contain a few tasks that should improve your understanding of the above concepts. First, `glimpse()` the `states2` tibble. We will be building some pipes which always start with `states2`.
glimpse(states2)
Start a pipe with the `states2` data set. Add a column `state_length` that takes the `str_length()` of each `state`.
states2 %>% mutate(... = str_length(...))
Add `arrange()` to the previous pipe so that the state with the shortest name is first.
states2 %>% mutate(... = str_length(...)) %>% arrange(...)
Change the last line of the pipe so that it arranges by `desc(state)`. Note how this arrangement differs from the one you got in exercise 15.
... %>% arrange(desc(...))
Create a new column in the pipe, called `state_12`, which contains only the first two letters of each state name.
... %>% mutate(state_12 = str_sub(state, 1, 2))
Use `str_to_upper()` and `mutate()` to transform `state_12` so that both of the letters are capital.
Remark: The function `str_to_upper()` has not yet been introduced in the Primer. Can you still guess what it does? Have a look at the help page by running `?str_to_upper`.
... %>% mutate(state_12 = str_sub(state, 1, 2)) %>% mutate(state_12 = str_to_upper(state_12))
Use `mutate()` to add a new column called `matches` that is TRUE if the first two letters of the state name (the `state_12` column) and the `abb` column match, and FALSE otherwise.
# You can use ifelse() to test for conditions and assign values. There is also
# a very similar function, if_else(), which does the same thing but more
# carefully. See The Primer for details.
... %>% mutate(matches = ifelse(state_12 == abb, TRUE, FALSE))
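For comparison, here is a small sketch of `ifelse()`, `if_else()`, and a bare comparison side by side. The tibble `df` and its columns are hypothetical. Because `x == y` already returns a logical vector, all three columns end up identical in this case.

```r
library(dplyr)

# Hypothetical two-column tibble, not part of the tutorial data
df <- tibble(x = c("AL", "AK", "AR"), y = c("AL", "AL", "AR"))

df <- df %>%
  mutate(with_ifelse  = ifelse(x == y, TRUE, FALSE),   # base R
         with_if_else = if_else(x == y, TRUE, FALSE),  # dplyr, stricter about types
         direct       = x == y)                        # the comparison on its own
df
```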
Add `count()` to the end of the pipe to count the number of TRUE values in the `matches` column.
... %>% count(matches)
quiz( question("How many state abbreviations match the first two letters of the state's name?", answer("15"), answer("42"), answer("23"), answer("19", correct = TRUE), allow_retry = FALSE))
Let's use a data set in the `fueleconomy` package to better understand factors. First, `glimpse()` the `vehicles` data set.
Make the `class` column in the `vehicles` dataframe a factor instead of a character (`chr`) variable. Do this by using the `mutate()` and `as.factor()` functions. Assign the changed dataframe to `vehicles_fct`.
... <- vehicles %>% mutate(... = as.factor(...))
Now use the `group_by()` function to group by the `class` variable. Reassign the result to `vehicles_fct`.
vehicles_fct <- vehicles %>% mutate(class = as.factor(class))
Create a `mean_cty` variable using the `summarize()` and `mean()` functions on the `cty` column of the `vehicles_fct` dataframe. Reassign the summarized dataframe to `vehicles_fct`.
vehicles_fct <- vehicles %>% mutate(class = as.factor(class)) %>% group_by(class)
Create a ggplot with `class` as the independent variable, `mean_cty` as the dependent variable, and the `geom_point()` function. Name this plot `vehicles_plot`.
vehicles_fct <- vehicles %>% mutate(class = as.factor(class)) %>% group_by(class) %>% summarize(mean_cty = mean(cty))
# The independent variable should always be on the x-axis and the dependent # variable on the y-axis
vehicles_plot <- ggplot(data = ..., mapping = aes(x = ..., y = ...)) + ...
Flip the coordinates of the `vehicles_plot` graphic. Reassign the flipped graphic to `vehicles_plot`.
vehicles_fct <- vehicles %>%
  mutate(class = as.factor(class)) %>%
  group_by(class) %>%
  summarize(mean_cty = mean(cty), .groups = "drop_last")

vehicles_plot <- ggplot(data = vehicles_fct, mapping = aes(x = class, y = mean_cty)) +
  geom_point()
# Look at the coord_flip() function
Now use `fct_reorder()` to rearrange the levels of the `class` variable according to the `mean_cty` variable. You will need to do this within the `ggplot()` call, so create a new plot here, and call it `vehicles_plot_2`. Once again, flip the coordinates for this new graphic.
vehicles_fct <- vehicles %>% mutate(class = as.factor(class)) %>% group_by(class) %>% summarize(mean_cty = mean(cty))
vehicles_plot_2 <- ggplot(data = ..., mapping = aes(x = fct_reorder(..., ...), y = ...)) + ... + ...
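To see what `fct_reorder()` does to a factor's levels on its own, outside of a plot, here is a minimal sketch. The vectors `grp` and `val` are made up for illustration.

```r
library(forcats)

# Hypothetical categories with one summary value each
grp <- factor(c("a", "b", "c"))
val <- c(30, 10, 20)

# Default level order is alphabetical
default_levels <- levels(grp)

# fct_reorder() orders the levels by val, smallest first
reordered_levels <- levels(fct_reorder(grp, val))
reordered_levels
```

The data values themselves are unchanged; only the order of the levels, and therefore the order along a plot axis, is different.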
Apply the classic theme, `theme_classic()`, and add a title and axis titles to `vehicles_plot_2`.
vehicles_fct <- vehicles %>%
  mutate(class = as.factor(class)) %>%
  group_by(class) %>%
  summarize(mean_cty = mean(cty))

vehicles_plot_2 <- ggplot(data = vehicles_fct, mapping = aes(x = fct_reorder(class, mean_cty), y = mean_cty)) +
  geom_point() +
  coord_flip()
Now, let's move on to lists. This is section 2.4 in the textbook. First, we'd like you to create a list. Call this list `mylist` and let it have three items: `a`, `b`, and `c`. Let `a` be a vector containing 1, 2, and 3, let `b` be a vector containing 4, 5, and 6, and let `c` be a vector containing 7, 8, and 9.
# Consider using the c() function to create the individual vectors for a, b, and c
# You could also use the : operator
Now, call `str()` on `mylist`.
mylist <- list(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))
Extract the sub-list containing `b` and `c` using `[`.
mylist <- list(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))
Now extract the single component `a` from `mylist`.
mylist <- list(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))
# Consider using [[]]
# Within the brackets, you can either put the index of a or "a"
Now, extract the number 5 from `mylist`.
mylist <- list(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))
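The subsetting operators used in this section can be compared on a different list. This is a sketch under the assumption of a made-up list called `pets`, which is not part of the tutorial.

```r
# Hypothetical list, deliberately different from `mylist`
pets <- list(dogs = c("Rex", "Fido"), cats = c("Tom", "Felix"), fish = "Nemo")

sub_list <- pets[c("dogs", "cats")]  # `[` keeps the result a list
cats     <- pets[["cats"]]           # `[[` returns the component itself
felix    <- pets[["cats"]][2]        # then index into that vector
same     <- pets$cats                # `$` is shorthand for `[[` with a name

str(sub_list)
```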
Run the `today()` and `now()` functions.
Use functions such as `ymd()` or `mdy()` to convert the strings below into the proper date-time format.
date_1 <- "February 29, 2020"
date_2 <- "29 February 2020"
date_3 <- "2020-2-29"
date_4 <- "2/29/2020 16:00:00 UTC"
Use the `make_datetime()` function to create a date-time for the first moment of the year 2000.
# The ideal output should be "2000-01-01 UTC"
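As a rough sketch of how the lubridate parsers are named, the letters in each function give the order of the components in the input. The strings below are made up, and parsing month names assumes an English locale.

```r
library(lubridate)

# Hypothetical strings, each written in a different component order
d1 <- mdy("July 4, 2021")   # month, day, year
d2 <- dmy("4 July 2021")    # day, month, year
d3 <- ymd("2021-7-4")       # year, month, day

# Adding _hms parses a time of day as well
dt <- mdy_hms("7/4/2021 09:30:00", tz = "UTC")

# make_datetime() builds a date-time from numeric components
new_year <- make_datetime(year = 2001, month = 1, day = 1)
```

All three of `d1`, `d2`, and `d3` represent the same date, even though the input strings look different.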
Run `lexus_2000` in this code chunk:
lexus_2000
Run `lexus_1999` in the code chunk:
lexus_1999
quiz( question("Which variable is included in lexus_1999 but not lexus_2000", answer("id"), answer("make"), answer("model"), answer("class"), answer("trans", correct = TRUE), answer("drive"), allow_retry = TRUE ) )
Think about what will happen when we try to bind the rows of these two tibbles together.
Use `bind_rows()` to bind `lexus_1999` and `lexus_2000`:
What happens to the `trans` column?
bind_rows(..., ...)
Run `lexus_1998` in the following code chunk:
Consider the discrepancy between the columns of `lexus_1998` and `lexus_1999`. Predict, in your head, what will happen when binding the rows of the two tibbles.
Bind the two dataframes `lexus_1998` and `lexus_1999`:
bind_rows(..., ...)
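`bind_rows()` matches columns by name, so differently named columns are not merged. A minimal sketch with two made-up tibbles, `a` and `b`, which are assumptions for illustration:

```r
library(dplyr)

# Hypothetical tibbles whose columns only partly overlap
a <- tibble(id = 1:2, colour = c("red", "blue"))
b <- tibble(id = 3:4, color = c("green", "grey"), size = c(10, 12))

# Columns are matched by name; anything missing on one side is filled with NA
stacked <- bind_rows(a, b)
stacked
```

Here `colour` and `color` stay as two separate columns, each half-filled with `NA`, which is why renaming before binding is usually the cleaner approach.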
Use the `rename()` function to change the `type` variable to `class` in the `lexus_1998` dataframe. Use the assignment operator to reassign the changed dataframe to `lexus_1998`.
... <- ... %>% rename(...)
Use `bind_rows()` to bind `lexus_1998` and `lexus_1999` again. Name this new tibble `lexus_two_year`:
... <- bind_rows(..., ...)
lexus_1998 <- lexus_1998 %>%
  rename("class" = type)
lexus_two_year <- bind_rows(lexus_1998, lexus_1999)
Now bind the rows of `lexus_two_year` with `lexus_2000`. Use the assignment operator to call this tibble `lexus`.
... <- bind_rows(..., ...)
lexus_two_year <- bind_rows(lexus_1998, lexus_1999)
lexus <- bind_rows(lexus_two_year, lexus_2000)
Now, `unite()` the `make` and `model` columns of the `lexus` dataframe. Name this united column `vehicle`, and make the separator between the previous columns a space.
lexus %>% unite(..., ..., ..., sep = ...)
lexus %>% unite("vehicle", ..., ..., sep = " ")
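For reference, a small sketch of `unite()` on a made-up tibble. The `people` tibble and its columns are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical tibble, not part of the tutorial data
people <- tibble(first = c("Ada", "Alan"), last = c("Lovelace", "Turing"))

# The first argument names the new column; the inputs are dropped by default
united <- people %>%
  unite("full_name", first, last, sep = " ")
united
```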
Let's return to the `lexus_2000` data. `glimpse()` both the `lexus_2000` dataframe and the `lexus_mileage` dataframe.
glimpse(...)
glimpse(...)
Now use the extraction operator `$` on the `id` columns of both `lexus_2000` and `lexus_mileage`.
lexus_2000$...
lexus_mileage$...
quiz( question("Which id(s) are included in the lexus_2000 dataframe but not in the lexus_mileage dataframe?", answer("16365, 16366"), answer("15921, 15922, 16038, 16366, 16039, 15685, 15686"), answer("15801"), answer("15920, 15801", correct = TRUE), answer("15801, 15920, 15921"), answer("15685, 16039"), allow_retry = FALSE))
quiz( question("Which id(s) will be excluded when you full_join() both data sets?", answer("None", correct = TRUE), answer("16365, 16366"), answer("15920, 15801"), answer("15801, 15922"), allow_retry = FALSE))
`full_join()` the `lexus_2000` and `lexus_mileage` dataframes by the `id` columns. Visually confirm your answer to the question above.
full_join(..., ..., by = ...)
quiz( question("Which id(s) will be excluded when you inner_join() both data sets?", answer("None"), answer("16365, 16366"), answer("15920, 15801", correct = TRUE), answer("15801, 15922"), allow_retry = FALSE))
`inner_join()` the `lexus_2000` and `lexus_mileage` dataframes by the `id` columns. Visually confirm your answer to the question above.
inner_join(..., ..., by = ...)
quiz( question("Which id(s) will be excluded when you run left_join(lexus_2000, lexus_mileage, by = 'id')?", answer("None", correct = TRUE), answer("All"), answer("15920, 15801"), answer("15801, 15922"), allow_retry = FALSE))
quiz( question("Which id(s) will be excluded when you run left_join(lexus_mileage, lexus_2000, by = 'id')?", answer("None"), answer("All"), answer("15920, 15801", correct = TRUE), answer("15801, 15922"), allow_retry = FALSE))
quiz( question("Which id(s) will be excluded when you run right_join(lexus_2000, lexus_mileage, by = 'id')?", answer("None"), answer("All"), answer("15920, 15801", correct = TRUE), answer("15801, 15922"), allow_retry = FALSE))
quiz( question("Which id(s) will be excluded when you run right_join(lexus_mileage, lexus_2000, by = 'id')?", answer("None", correct = TRUE), answer("All"), answer("15920, 15801"), answer("15801, 15922"), allow_retry = FALSE))
quiz( question("Which id(s) will be excluded when you run anti_join(lexus_mileage, lexus_2000, by = 'id')?", answer("None"), answer("All", correct = TRUE), answer("15920, 15801"), answer("15801, 15922"), allow_retry = FALSE))
quiz( question("Which id(s) will be included when you run anti_join(lexus_2000, lexus_mileage, by = 'id')?", answer("None"), answer("All"), answer("15920, 15801", correct = TRUE), answer("15801, 15922"), allow_retry = FALSE))
quiz( question("Which columns will be excluded when you run semi_join(lexus_2000, lexus_mileage, by = 'id')?", answer("id, hwy, cty"), answer("hwy, cty", correct = TRUE), answer("id, class, drive"), answer("id, make, model, class, drive"), allow_retry = FALSE))
quiz( question("Which columns will be excluded when you run semi_join(lexus_mileage, lexus_2000, by = 'id')?", answer("id, hwy, cty"), answer("hwy, cty"), answer("id, class, drive"), answer("make, model, class, drive", correct = TRUE), allow_retry = FALSE))
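The join behaviour asked about in these questions can be checked on a pair of small, made-up tibbles. This is a sketch only: `x` and `y` are hypothetical and unrelated to the Lexus data.

```r
library(dplyr)

# Hypothetical tibbles: ids 3-4 are in both, 1-2 only in x, 5-6 only in y
x <- tibble(id = 1:4, name = c("a", "b", "c", "d"))
y <- tibble(id = 3:6, score = c(30, 40, 50, 60))

full_ids  <- full_join(x, y, by = "id")$id   # every id from either table
inner_ids <- inner_join(x, y, by = "id")$id  # only ids found in both
left_ids  <- left_join(x, y, by = "id")$id   # every id in x
right_ids <- right_join(x, y, by = "id")$id  # every id in y
anti_ids  <- anti_join(x, y, by = "id")$id   # ids in x with no match in y

# Filtering joins never add columns from the second table
semi_cols <- names(semi_join(x, y, by = "id"))
```

Swapping the order of the two arguments changes which table counts as the "left" one, which is why `left_join(x, y)` and `right_join(y, x)` keep the same ids.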
Run `table1` and `table2` in the code chunk below.
question("Do the two data sets above contain the variables **country**, **year**, **cases**, and **population**?", answer("Yes", correct = TRUE, message = "If you look closely, you will see that this is the same data set as before, but organized in a new way."), answer("No", message = "Don't be misled by the two new column names: a variable and a column name are not necessarily the same thing."), allow_retry = FALSE)
These data sets reveal something important: you can reorganize the same set of variables, values, and observations in many different ways.
It's not hard to do. If you run the code chunks below, you can see the same data displayed in three more ways.
table3
table4a; table4b
table5
Among our tables above, only `table1` is tidy.
The tidy data format works so well for R because it aligns the structure of your data with the mechanics of R, so let's try to tidy a tibble.
Run `cases` in the code chunk below.
cases
quiz( question("What are the variables in cases?", answer("Country, 2011, 2012, and 2013"), answer("Country, year, and some unknown quantity (n, count, number of cases, etc.)", correct = TRUE), answer("FR, DE, and US"), allow_retry = TRUE ) )
You can use the `pivot_longer()` function in the tidyr package to convert wide data to long data. Let's use `pivot_longer()` to tidy the data, pivoting all of the columns except for the `Country` column, and setting `names_to` to "year" and `values_to` to "cases".
# Consider using - to select the columns you want to pivot
cases %>% pivot_longer(cols = ..., names_to = ..., values_to = ...)
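As a sketch of the same idea on different data, assume a small made-up `sales` tibble with one column per quarter:

```r
library(dplyr)
library(tidyr)

# Hypothetical wide data: one column per quarter
sales <- tibble(
  store = c("north", "south"),
  Q1    = c(10, 7),
  Q2    = c(12, 9)
)

# -store means "pivot every column except store"
sales_long <- sales %>%
  pivot_longer(cols = -store, names_to = "quarter", values_to = "units")
sales_long
```

The old column names become values in `quarter`, and the old cell values move into `units`, giving one row per store and quarter.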
Try this again with `cases2`. Make sure to only include the three year columns that you want to pivot.
The `pollution` data set below displays the amount of small and large particulate in the air of three cities. It illustrates another common type of untidy data. Narrow data uses a literal key column and a literal value column to store multiple variables. Can you tell here which is which? Run the code chunk below.
pollution
quiz( question("Which column in pollution contains key names (i.e. variable names)?", answer("city"), answer("size", correct = TRUE), answer("amount"), allow_retry = TRUE) )
You can "spread" the keys in a key column across their own set of columns with the `pivot_wider()` function in the tidyr package. With `pollution`, set the `names_from` argument to "size" and the `values_from` argument to "amount". In addition, set the `names_prefix` argument to "particulate_".
... %>% pivot_wider(names_from = ..., values_from = ..., names_prefix = ...)
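Here is a minimal sketch of `pivot_wider()` with `names_prefix` on a made-up tibble. The `readings` data and the "temp_" prefix are assumptions chosen for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical narrow data: `metric` holds variable names, `value` their values
readings <- tibble(
  site   = c("A", "A", "B", "B"),
  metric = c("min", "max", "min", "max"),
  value  = c(1, 9, 2, 8)
)

# Each distinct value of `metric` becomes its own column, with the prefix added
readings_wide <- readings %>%
  pivot_wider(names_from = metric, values_from = value, names_prefix = "temp_")
readings_wide
```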
Let's apply `pivot_wider()` to a real world inquiry. The ratio of girls to boys in `babynames` is not constant across time. To explore this phenomenon, we can directly plot the ratio of boys to girls over time. To make such a plot, you need to compute the ratio of boys to girls for each year from 1880 to 2015.
Call `babynames` first to see the data set.
First, create an object called `babynames_wider`: start a pipe with `babynames` and `group_by()` the `year` and `sex` variables.
Now take `babynames_wider` and use `summarize()` to take the `sum()` of the `n` column. Call this summarized column `total`. Be sure to reassign the result to `babynames_wider`.
babynames_wider <- babynames %>% group_by(year, sex)
But how can we plot this data? Our current iteration of `babynames` places the total number of boys and girls for each year in the same column, which makes it hard to use both totals in the same calculation. Use `pivot_wider()` to pivot the `sex` and `total` columns. Choose which should be the key/name and which should be the value. Be sure to reassign the result to `babynames_wider`.
babynames_wider <- babynames %>% group_by(year, sex) %>% summarize(total = sum(n), .groups = "drop_last")
babynames_wider <- babynames_wider %>% pivot_wider()
Now, use `mutate()` to add a column `ratio` that divides `M` by `F`.
babynames_wider <- babynames %>% group_by(year, sex) %>% summarize(total = sum(n), .groups = "drop_last") %>% pivot_wider(names_from = sex, values_from = total)
Now create a `ggplot` line plot with `year` on the x-axis and `ratio` on the y-axis.
babynames_wider <- babynames %>% group_by(year, sex) %>% summarize(total = sum(n), .groups = "drop_last") %>% pivot_wider(names_from = sex, values_from = total) %>% mutate(ratio = M / F)
submission_ui
submission_server()