library(tidyverse)
library(PPBDS.data)
library(learnr)
library(shiny)
library(ggthemes)
library(viridis)
library(nycflights13)

knitr::opts_chunk$set(echo = FALSE, message = FALSE)
options(tutorial.exercise.timelimit = 60, tutorial.storage = "local")
Welcome to your first Gov 50 tutorial on Chapter 1: Visualization! We hope this tutorial will be a great opportunity for you to learn and dive deeper into the course material. Most of the questions are exercises in which you put your coding skills to the test and practice the visualization techniques you read about in Chapter 1, but you will also test your knowledge with multiple choice and short answer questions. Let's get started!
```{r name}
question_text(
  "Student Name:",
  answer(NULL, correct = TRUE),
  allow_retry = TRUE,
  try_again_button = "Modify your answer",
  incorrect = "Ok"
)
```

## Email

```{r email}
question_text(
  "Email:",
  answer(NULL, correct = TRUE),
  allow_retry = TRUE,
  try_again_button = "Modify your answer",
  incorrect = "Ok"
)
```
In the code chunk below, use `library()` to load the `tidyverse` package. Whenever you load a package, R will also load all of the packages that the first package depends on. For example, whenever you load `tidyverse`, it also loads `ggplot2`, `dplyr`, `tibble`, `tidyr`, `readr`, and `purrr`.
Now use `library()` to load the `ggplot2` package, which is part of the tidyverse.
Functions are the commands that perform tasks in R. They take in inputs called arguments and return outputs.
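As a minimal sketch of this idea (the values are illustrative and not part of the exercises), a function call passes its arguments inside parentheses and returns an output that you can save or print:

```r
# sqrt() takes one argument and returns its square root
root <- sqrt(16)

# round() takes two arguments; the second one is passed by name here
rounded <- round(3.14159, digits = 2)

print(root)
print(rounded)
```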
Use the `sqrt()` function in the chunk below to compute the square root of 962.
Hit run code to examine the code that `sqrt()` runs.
sqrt
Compare the code in `sqrt()` to the code in another R function, `lm()`. Press run code to examine `lm()`'s code body in the chunk below.
lm
Help pages give us access to the documentation for R functions, data sets, and other objects.
Say we want to know what `lm()` does. Open the help page for `lm()` by typing `?lm()` below.
Code comments are text placed after a `#` symbol. R ignores everything that follows a `#` on the same line, which lets you write human-readable comments in your code.
Run the code chunk below. Afterwards, delete the `#` and re-run the chunk. You should see a result.
# sqrt(961)
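A small sketch of how comments behave, using a made-up object named `total` (an assumption for illustration only): a comment can take up a whole line or follow code on the same line, and commented-out code does not run.

```r
# This whole line is a comment, so R skips it
total <- 2 + 3  # a comment can also follow code on the same line

# total <- 100   (commented out, so this assignment never runs)
print(total)
```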
Objects are where values are saved in R. We'll show you how to assign values to objects and how to display the contents of objects. You can choose almost any name you like for an object, as long as the name does not begin with a number or a special character like `+`, `*`, `-`, `/`, `^`, `!`, `@`, or `&`.
question("Which of these would be valid object names?",
  answer("today", correct = TRUE),
  answer("1st"),
  answer("+1"),
  answer("vars", correct = TRUE),
  answer("\\^_^"),
  answer("foo", correct = TRUE),
  allow_retry = TRUE,
  correct = "Remember that the most helpful names will remind you what you put in your object."
)
Use the assignment operator `<-` to save the results of `rnorm(100, mean = 100, sd = 15)` to an object named `data`.
What do you think would happen if you assigned `data` to a new object named `copy`, like this? Run the code and then inspect both `data` and `copy`.
data <- rnorm(100, mean = 100, sd = 15)
copy <- data
R comes with many toy data sets pre-loaded. Examine the contents of `iris` to see a classic toy data set. Type `iris` in the line below.
A vector is a series of values. Vectors are created using the `c()` function.
question("How many types of data can you put into a single vector?",
  answer("1", correct = TRUE),
  answer("6"),
  answer("As many as you like"),
  allow_retry = TRUE
)
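To see why a vector holds a single type, here is a short sketch with a made-up vector named `mixed` (not part of the exercises). When you combine numbers, text, and logicals with `c()`, R converts every element to one common type:

```r
# Mixing a number, a string, and a logical value
mixed <- c(1, "two", TRUE)

class(mixed)  # every element has been converted to character
mixed
```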
In the chunk below, create a vector that contains the integers from one to ten.
# use the function c(...)
If your vector contains a sequence of contiguous integers, you can create it with the `:` shortcut. Run `1:10` in the chunk below.
You can extract any element of a vector by placing a pair of brackets `[ ]` behind the vector. Inside the brackets, place the number of the element that you'd like to extract. For example, `vec[3]` would return the third element of the vector named `vec`.
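A minimal sketch of extracting by position, using a hypothetical vector named `temps` rather than the exercise's `vec`:

```r
# A made-up vector, used only for illustration
temps <- c(61, 64, 70, 75, 68)

second <- temps[2]            # the second element
last <- temps[length(temps)]  # the final element, whatever the length

second
last
```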
Use the chunk below to extract the fourth element of `vec`.
vec <- c(1, 2, 4, 8, 16)
You can also use `[ ]` to extract multiple elements of a vector. Place the vector `c(1,2,5)` between the brackets below. What does R return?
vec <- c(1, 2, 4, 8, 16)
vec[]
If the elements of your vector have names, you can extract them by name. To do so, place a name or a vector of names in the brackets behind a vector. Surround each name with quotation marks, e.g. `vec2[c("alpha", "beta")]`.
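As a brief sketch of extracting by name, with a hypothetical named vector `heights` (the names and values are made up):

```r
# A made-up named vector, used only for illustration
heights <- c(ana = 160, ben = 175, cy = 182)

one <- heights["ben"]            # one element, by name
two <- heights[c("ana", "cy")]   # several elements, by a vector of names

one
two
```

The result keeps the names, which makes the output easier to read than a bare number.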
Extract the element named "gamma" from the vector below.
vec2 <- c(alpha = 1, beta = 2, gamma = 3)
Below is the `flights` data frame.
flights
The letter abbreviations that appear under the column names of `flights` describe the type of data that is stored in each column of `flights`:

* `int`: integers.
* `dbl`: doubles, or real numbers.
* `chr`: character vectors, or strings.
* `dttm`: date-times (a date + a time).
One of the most common mistakes in R is to call an object when you mean to call a character string and vice versa.
question("Which of these are object names? What is the difference between object names and character strings?",
  answer("foo", correct = TRUE),
  answer('"num"'),
  answer("mu", correct = TRUE),
  answer('"sigma"'),
  answer('"data"'),
  answer("a", correct = TRUE),
  allow_retry = TRUE,
  correct = "Character strings are surrounded by quotation marks, object names are not."
)
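A short sketch of the distinction, using a made-up object named `rate`: without quotation marks R looks up the object's value, and with quotation marks R treats the text as a string.

```r
rate <- 0.25         # create an object named rate

value <- rate        # no quotes: R looks up the object
label <- "rate"      # quotes: just the four characters r-a-t-e

value
label
```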
Data frames are "spreadsheet"-type datasets. You can make a data frame with the `data.frame()` function, which works similarly to `c()`.
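A minimal sketch of building a data frame, using hypothetical vectors and column names (`id`, `item`) that are not part of the exercises. Each `name = vector` pair becomes one column:

```r
# Made-up vectors of equal length
ids <- c(1, 2, 3)
fruit <- c("kiwi", "pear", "plum")

# Each argument is column_name = vector
pantry <- data.frame(id = ids, item = fruit)

pantry
nrow(pantry)   # one row per element of the input vectors
```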
Assemble the vectors below into a data frame using `data.frame()` with the column names `numbers`, `logicals`, `strings`. Assign the data frame to the object named `df`.
nums <- c(1, 2, 3, 4)
logs <- c(TRUE, TRUE, FALSE, TRUE)
strs <- c("apple", "banana", "carrot", "duck")
Extract the `strings` column of the `df` data frame using the `$` operator.
nums <- c(1, 2, 3, 4)
logs <- c(TRUE, TRUE, FALSE, TRUE)
strs <- c("apple", "banana", "carrot", "duck")
df <- data.frame(numbers = nums, logicals = logs, strings = strs)
Load the `PPBDS.data` package using `library()`. Then run the code below.
trains
Use the `glimpse()` function to look at the `trains` data set. We already loaded the `tidyverse` and `PPBDS.data` packages.
library(tidyverse)
library(PPBDS.data)
Extract the `income` variable in the `trains` data set using the `$` operator.
Graphics are designed to emphasize the findings and insights you want your audience to understand.
quiz(
  question("What are the three essential components of a graphic?",
    answer("the data set containing the variables in question", correct = TRUE),
    answer("the geometric object we can observe in a plot", correct = TRUE),
    answer("axes labels on a plot"),
    answer("the aesthetic attributes", correct = TRUE),
    allow_retry = TRUE
  ),
  question("What are the two important arguments that we need to provide the `ggplot()` function?",
    answer("`data` and `mapping`", correct = TRUE),
    answer("`data` and `aesthetics`"),
    answer("`data` and `layers`"),
    allow_retry = TRUE
  )
)
ggplot(data = trains, mapping = aes(x = att_start, y = att_end, color = treatment)) + geom_point() + facet_wrap(~party)
quiz(
  question("Which `data` variable is mapped to the `x`-position `aes`thetic of the points?",
    answer("`att_start`", correct = TRUE),
    answer("`att_end`"),
    answer("`treatment`"),
    allow_retry = TRUE
  ),
  question("Which `data` variable is mapped to the `y`-position `aes`thetic of the points?",
    answer("`att_start`"),
    answer("`att_end`", correct = TRUE),
    answer("`treatment`"),
    allow_retry = TRUE
  ),
  question("Which `data` variable is mapped to the `color` `aes`thetic of the points?",
    answer("`att_start`"),
    answer("`att_end`"),
    answer("`treatment`", correct = TRUE),
    allow_retry = TRUE
  )
)
geom_point
Scatterplots allow you to visualize the relationship between two numerical variables.
Let's create the following scatterplot.
scat_p <- ggplot(data = qscores, mapping = aes(x = rating, y = hours, size = enrollment)) +
  geom_point()
scat_p
Load the `PPBDS.data` package and look at the `qscores` data set by simply typing the name of the data set.
library(...)
...
Nice! Now load the `ggplot2` package using `library()`. On the line below, use the `ggplot()` function to create a scatterplot using the `qscores` data set. Map `rating` to the x-axis and `hours` to the y-axis.
library(...)
ggplot(data = qscores, mapping = aes(x = ..., y = ...)) +
  geom_point()
Awesome! Now we want to add a `size` aesthetic based on the number of students enrolled in each course. Set the argument `size` to `enrollment` inside the `aes()` function.
ggplot(data = qscores, mapping = aes(x = rating, y = hours, size = ...))
Reminder: This is what our graph should look like.
scat_p
The following plot was created using the `mpg` data set. It only displays 126 points, but it visualizes a data set that contains 234 points. In this section, we will fix this issue.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point()
The missing points are hidden behind other points, a phenomenon known as overplotting. Overplotting provides an incomplete picture of the data set. You cannot determine where the mass of the points falls, which makes it difficult to spot relationships in the data.
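One way to check for overplotting is to compare the number of rows with the number of distinct (x, y) positions. The sketch below assumes the `ggplot2` package is installed, since it provides `mpg`; the object names `n_rows` and `n_positions` are made up for illustration.

```r
library(ggplot2)  # provides the mpg data set

# Total observations in the data
n_rows <- nrow(mpg)

# Distinct (displ, hwy) pairs, i.e. distinct positions on the plot
n_positions <- nrow(unique(mpg[, c("displ", "hwy")]))

n_rows
n_positions
```

If the second number is smaller than the first, some points sit exactly on top of others.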
Causes of overplotting
One method to fight overplotting is to make each point semi-transparent. The code chunk below provides the code for the `qscores` scatterplot from the previous section. Change the transparency of the points by setting `alpha = 0.2` within `geom_point()`.
ggplot(data = qscores, mapping = aes(x = rating, y = hours, size = enrollment)) + geom_point()
ggplot(data = qscores, mapping = aes(x = rating, y = hours, size = enrollment)) + geom_point(alpha = ...)
geom_jitter()
`geom_jitter()` is another method to deal with overplotting. It plots a scatterplot and then adds a small amount of random noise to each point in the plot.
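As a rough sketch of what jittering does (this imitates the idea in base R and is not ggplot2's internal code), we can add a small uniform random offset to each value ourselves. The vector `x`, the seed, and the `width` value are all made up for illustration:

```r
set.seed(1)  # arbitrary seed so the result is reproducible

x <- c(3, 3, 3, 5, 5)  # repeated values would overlap exactly on a plot
width <- 0.2           # maximum shift in either direction

# Shift each value by a random amount between -width and +width
x_jittered <- x + runif(length(x), min = -width, max = width)

x_jittered
```

Each value moves only slightly, so the overall pattern is preserved while the points no longer coincide.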
The following scatterplot, which was created using the `trains` tibble, has overplotting. In this section, we will fix this issue using `geom_jitter()`.
ggplot(data = trains, mapping = aes(x = att_start, y = att_end)) + geom_point()
The code for the graph above has been provided for you below. Replace `geom_point()` with `geom_jitter()`.
ggplot(data = trains, mapping = aes(x = att_start, y = att_end)) + geom_point()
ggplot(data = trains, mapping = aes(x = att_start, y = att_end)) + geom_jitter()
As you can see, jittering the points shifted them slightly so that they no longer sit directly on top of each other. We can adjust the amount that the points are jittered by setting the width. Set the `width` to `.2` within `geom_jitter()`.
ggplot(data = trains, mapping = aes(x = att_start, y = att_end)) + geom_jitter(width = .2)
Now set the `color` aesthetic to the `party` variable.
Because color is an aesthetic, set it inside of aes().
Now use what you've learned to recreate the plot below. The graph was created using the data set `diamonds`. The `alpha` of the plot is 0.2, and the `width` of the jitter distribution is 5. Use the `labs()` function to add titles.
ggplot(data = diamonds, mapping = aes(x = depth, y = price)) + geom_jitter(width = 5, alpha = 0.2) + labs(title = "Depth and Price in Diamonds", x = "Depth", y = "Price")
geom_histogram()
A histogram is a plot that visualizes the distribution of a numerical variable. Let's create the following plot.
hist_p <- ggplot(data = qscores, mapping = aes(x = rating)) +
  geom_histogram(binwidth = 1, color = "white", fill = "red4", bins = 10)
hist_p
Using the `qscores` data set and `ggplot()`, make a histogram where `rating` is on the x-axis to see the distribution of Harvard course ratings.
ggplot(data = ..., mapping = aes(x = ...)) + ...
Add white vertical borders demarcating the bins and a dark red fill by adding the `color = "white"` and `fill = "red4"` arguments to `geom_histogram()`.
ggplot(data = qscores, mapping = aes(x = rating)) + geom_histogram(color = ..., fill = ...)
Specify the number of bins to be 10 via the `bins` argument in `geom_histogram()`.
ggplot(data = qscores, mapping = aes(x = rating)) + geom_histogram(color = "white", fill = "red4", bins = 10)
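To see what a histogram counts, here is a rough base R sketch with made-up ratings (not the `qscores` data). `cut()` assigns each value to a bin and `table()` counts the values in each bin, which is the information the bar heights encode:

```r
# Made-up ratings, used only for illustration
ratings <- c(2.1, 2.8, 3.3, 3.4, 3.9, 4.0, 4.2, 4.4, 4.6, 4.9)

# Three bins of width 1: (2,3], (3,4], and (4,5]
bins <- cut(ratings, breaks = c(2, 3, 4, 5))

counts <- table(bins)
counts
```

Changing the breaks changes the counts, which is why the choice of bins affects how a histogram looks.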
Specify the width of the bins to be 1 via the `binwidth` argument. When both are supplied, `binwidth` overrides `bins`.
Reminder: This is what our graph should look like.
hist_p
ggplot(data = qscores, mapping = aes(x = rating)) + geom_histogram(color = "white", fill = "red4", bins = 10, binwidth = 1)
geom_boxplot()
A boxplot displays the five-number summary of a set of data: the minimum, first quartile, median, third quartile, and maximum.
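A minimal sketch of computing a five-number summary in base R with `quantile()`, using made-up values. Note as a caveat that `geom_boxplot()` by default draws whiskers only out to 1.5 times the interquartile range and plots more extreme values as separate points, so the whisker ends need not equal the minimum and maximum.

```r
# Made-up data, used only for illustration
values <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

# Minimum, first quartile, median, third quartile, maximum
five_num <- quantile(values, probs = c(0, 0.25, 0.5, 0.75, 1))

five_num
```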
Let's create the following boxplot.
box_p <- ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot() +
  labs(title = "Highway Fuel Efficiency in Different Types of Cars")
box_p
Using the `mpg` tibble, which shows the fuel efficiency for different types of cars, set `x = class` and `y = hwy`.
ggplot(data = mpg, mapping = aes(x = ... , y = ... )) + geom_boxplot()
Nice! Using `labs()`, title the plot "Highway Fuel Efficiency in Different Types of Cars".
labs(title = ...)
geom_bar()
A barplot is a plot that visualizes the distribution of a categorical variable.
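As a small sketch of what a barplot summarizes, the base R function `table()` counts how often each category occurs. Those counts are what the bar heights represent. The vector below is made up and is not the `trains` data:

```r
# Made-up categorical values, used only for illustration
party <- c("Democrat", "Republican", "Democrat", "Democrat", "Republican")

# Count the observations in each category
counts <- table(party)

counts
```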
Let's create the following barplot.
ggplot(data = trains, mapping = aes(x = treatment, fill = party)) + geom_bar(position = "dodge")
Let's go back to the `trains` data set in the `PPBDS.data` package. Use `ggplot()` and `geom_bar()` to plot `treatment` on the x-axis.
ggplot(data = ..., mapping = aes(x = ...)) + ...
We can now map the additional variable `party` by adding `fill = party` inside the `aes()` aesthetic mapping.
ggplot(data = trains, mapping = aes(x = treatment, fill = ...)) + geom_bar()
Let's make our graph a side-by-side barplot. Set `position` to "dodge" in `geom_bar()`.
ggplot(data = trains, mapping = aes(x = treatment, fill = party)) + geom_bar(position = ...)
geom_smooth()
`geom_smooth()` adds a smoothed trend line, such as a regression line, to a scatterplot.
Let's start by creating a scatterplot using the `nhanes` dataset. Run the code below.
ggplot(data = nhanes, mapping = aes(x = weight, y = height)) + geom_point()
Now add `geom_smooth()` to the graph. Remember that you are adding a layer, so you need to include `+`.
ggplot(data = nhanes, mapping = aes(x = weight, y = height)) + geom_point()
Nice! As you can see, the line you just graphed represents the trend that we see in the scatterplot. See the message that R gave us? Because we didn't set the method that R uses to calculate the line, R defaulted to using "gam". Let's try setting the method to "lm" inside `geom_smooth()`.
Remember to put "lm" in quotes.
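For intuition about what the "lm" method draws, here is a minimal sketch that fits a straight line with base R's `lm()` on made-up numbers (not the `nhanes` data). The fitted intercept and slope describe the kind of line that `geom_smooth(method = "lm")` adds to a plot:

```r
# Made-up data that roughly follow a straight line
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Fit y as a linear function of x
fit <- lm(y ~ x)

coef(fit)  # intercept and slope of the fitted line
```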
Great! Now get rid of the scatterplot by deleting `geom_point()` and only use `geom_smooth()`. Then, set the `color` aesthetic to the `gender` variable.
Because color is an aesthetic, set it inside of aes().
Great! Now, use what you've learned about `geom_smooth()` to recreate the plot below.
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + geom_smooth(color = "red") + labs(title = "City vs. Highway Fuel Economy", x = "City Miles Per Gallon", y = "Highway Miles Per Gallon")
geom_density()
`geom_density()` is used to make a density plot, a smoothed version of a histogram. It is a useful alternative to the histogram that displays continuous data in a smooth distribution.
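A rough base R sketch of the same idea, using simulated ages (an assumption for illustration, not the `cces` data): `density()` estimates a smooth curve from the values, and the area under that curve is approximately 1, just as the bars of a histogram on the density scale sum to 1.

```r
set.seed(2)  # arbitrary seed so the simulation is reproducible

# Simulated ages, used only for illustration
ages <- rnorm(200, mean = 45, sd = 12)

# Kernel density estimate evaluated on an evenly spaced grid
d <- density(ages)

# Approximate the area under the curve with a Riemann sum
area <- sum(d$y) * diff(d$x[1:2])
area
```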
Let's create the following density plot.
dens_p <- ggplot(data = cces, mapping = aes(x = age, color = ideology, fill = ideology)) +
  geom_density(alpha = .3, position = "fill") +
  xlim(20, 90)
dens_p
Use the dataset `cces` to make a density plot with `age` on the x axis and `ideology` set to the aesthetics `color` and `fill`.
ggplot(data = cces, mapping = aes(x = ..., color = ..., fill = ...))
Great! As you can see, we have a lot of overlapping going on. Try setting the `alpha` parameter in `geom_density()` to 0.3.
Nice! Now try setting the `position` parameter in `geom_density()` to "fill". Then try setting it to "stack".
Great! Now add a limit to the x axis with an upper bound of 90 and a lower bound of 20.
Reminder: Your graph should look something like this.
dens_p
Use xlim() to add a limit on the x axis
tidyverse
Now it's time to learn more about data wrangling using the `tidyverse`. We'll go over the following functions in this tutorial:
Before we get started, load the `tidyverse` collection of packages and the package `PPBDS.data`.
Use library()
Let's take a `glimpse()` at the data set `cces`. Press "Run Code".
glimpse(cces)
Using the pipe operator `%>%`, add `filter()`. Within `filter()`, use the arguments `state == "Massachusetts"` and `gender == "Female"`.
cces %>% filter(state == "...", gender == "...")
We now want to arrange our data in descending order of age. Copy and paste your code from above. Use `%>%` to add `arrange()`. Within `arrange()`, use the argument `desc(age)`.
cces %>% filter(state == "Massachusetts", gender == "Female") %>% arrange(desc(age))
Great! Now continue the code using `%>%` again to add `summarize()`. We want to find the mean and median ages from our filtered data.
For this section, we will be focusing on a really big data set called `nhanes`. Let's say we wanted to make a graph using data on `height` and `weight` from the 200 youngest black males. We would first `select()` the variables we want to focus on, in this case `height`, `weight`, `age`, `race`, and `gender`. Try doing this below:
Now we want to narrow down our data to only black males, so let's use `filter()` to create a tibble made up of rows from `nhanes` where `race == "Black"` and `gender == "Male"`.
Great! So what now? Well, we want to look at the youngest black males. To do this, we would have to `arrange()` by age. Do this below:
Nice! Now use `slice()` to isolate the 200 rows with the lowest age value.
Because of the way we arranged the data, the 200 rows with the lowest age value are the first 200 rows.
slice(1:200)
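The whole select, filter, arrange, and slice sequence can be sketched on a small made-up tibble. The `pets` data and its columns are hypothetical, and the sketch assumes the `dplyr` package is installed:

```r
library(dplyr)

# A made-up tibble, used only for illustration
pets <- tibble(
  name = c("Ada", "Bo", "Cy", "Di", "Eli"),
  species = c("cat", "dog", "cat", "cat", "dog"),
  age = c(7, 3, 2, 11, 5),
  weight = c(4.1, 20.3, 3.6, 5.0, 12.8)
)

youngest_cats <- pets %>%
  select(name, species, age) %>%   # keep only the columns we need
  filter(species == "cat") %>%     # keep only the cats
  arrange(age) %>%                 # youngest first
  slice(1:2)                       # first two rows after sorting

youngest_cats
```

Because the rows are sorted before `slice()` runs, taking the first two rows returns the two youngest cats.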
Congrats on finishing your first Gov 50 tutorial! You're on your way to being a master in data visualization and wrangling! :)
submission_ui
submission_server()