library(tidyverse)
library(PPBDS.data)
library(learnr)
library(shiny)
library(ggthemes)
library(viridis)
library(nycflights13)

knitr::opts_chunk$set(echo = FALSE, message = FALSE)
options(tutorial.exercise.timelimit = 60, tutorial.storage="local")  

Welcome!

Welcome to your first Gov 50 tutorial on Chapter 1: Visualization! We hope that this tutorial will be a great opportunity for you to learn and dive deeper into the course material. Most of the questions are coding exercises in which you practice the visualization techniques you read about in Chapter 1, but you will also test your knowledge with multiple-choice and short-answer questions. Let's get started!

## Name

``` {r name}
question_text(
  "Student Name:",
  answer(NULL, correct = TRUE),
  allow_retry = TRUE,
  try_again_button = "Modify your answer",
  incorrect = "Ok"
)
```

## Email

``` {r email}
question_text(
  "Email:",
  answer(NULL, correct = TRUE),
  allow_retry = TRUE,
  try_again_button = "Modify your answer",
  incorrect = "Ok"
)
```

How do I code in R?

Exercise 1

In the code chunk below, use library() to load the tidyverse package. Whenever you load a package, R will also load all of the packages that the first package depends on. For example, whenever you load tidyverse, it also loads ggplot2, dplyr, tibble, tidyr, readr, and purrr.


Exercise 2

Now use library() to load ggplot2, one of the tidyverse packages.


Functions are the commands that perform tasks in R. They take in inputs called arguments and return outputs.
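
As a small illustration (the numbers are arbitrary and not part of the tutorial's exercises), the call below passes two arguments to round() and stores the output it returns:

```r
# round() takes a number and an optional `digits` argument,
# and returns the rounded value as its output.
rounded <- round(3.14159, digits = 2)
rounded
```

The first argument is matched by position, while digits is matched by name. Naming arguments is optional but often makes a call easier to read.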

Exercise 3

Use the sqrt() function in the chunk below to compute the square root of 962.


Exercise 4

Press "Run Code" to examine the code that sqrt() runs.

sqrt

Exercise 5

Compare the code in sqrt() to the code in another R function, lm(). Press "Run Code" to examine lm()'s code body in the chunk below.

lm

Help pages give us access to the documentation for R functions, data sets, and other objects.

Exercise 6

Say we want to know what lm() does. Open the help page for lm() by typing ?lm() below.


Code comments are text placed after a # symbol. R ignores everything after a # on the same line, which lets you write human-readable notes in your code.

Exercise 7

Run the code chunk below. Afterwards, delete the # and re-run the chunk. You should see a result.

# sqrt(961)

Objects are where values are saved in R. We’ll show you how to assign values to objects and how to display the contents of objects. You can choose almost any name you like for an object, as long as the name does not begin with a number or a special character like +, *, -, /, ^, !, @, or &.
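
The sketch below shows assignment and display with a made-up object, and one possible way to check candidate names. It relies on base R's make.names(), which converts a string into a syntactically valid name. The candidate names are hypothetical examples, not the ones from the quiz.

```r
# Assign a value to an object with <-, then type the name to display it.
class_size <- 30
class_size

# make.names() turns any string into a syntactically valid name, so a
# string that comes back unchanged was already a valid object name.
candidates <- c("class_size", "2nd_try", "!important")
is_valid <- make.names(candidates) == candidates
is_valid
```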

Exercise 8

question("Which of these would be valid object names?",
  answer("today", correct = TRUE),
  answer("1st"),
  answer("+1"),
  answer("vars", correct = TRUE),
  answer("\\^_^"),
  answer("foo", correct = TRUE),
  allow_retry = TRUE,
  correct = "Remember that the most helpful names will remind you what you put in your object."
)

Exercise 9

Use the assignment operator <- to save the results of rnorm(100, mean = 100, sd = 15) to an object named data.


Exercise 10

What do you think would happen if you assigned data to a new object named copy, like this? Run the code and then inspect both data and copy.

data <- rnorm(100, mean = 100, sd = 15)
copy <- data

Exercise 11

R comes with many toy data sets pre-loaded. Examine the contents of iris to see a classic toy data set. Type iris in the line below.


Exercise 12

A vector is a series of values. These are created using the c() function.

question('How many types of data can you put into a single vector?',
         answer("1", correct = TRUE),
         answer("6"),
         answer("As many as you like"),
         allow_retry = TRUE)

Exercise 13

In the chunk below, create a vector that contains the integers from one to ten.


# use the function c(...)

Exercise 14

If your vector contains a sequence of contiguous integers, you can create it with the : shortcut. Run 1:10 in the chunk below.


Exercise 15

You can extract any element of a vector by placing a pair of brackets [ ] behind the vector. Inside the brackets, place the number of the element that you'd like to extract. For example, vec[3] would return the third element of the vector named vec.
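
For instance, with a made-up vector of quiz scores (the name scores and its values are assumptions for illustration):

```r
# A hypothetical vector of quiz scores.
scores <- c(88, 92, 75, 60)

# Positions are counted from 1, so this returns the second element.
second_score <- scores[2]
second_score
```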

Use the chunk below to extract the fourth element of vec.

vec <- c(1, 2, 4, 8, 16)

Exercise 16

You can also use [ ] to extract multiple elements of a vector. Place the vector c(1,2,5) between the brackets below. What does R return?

vec <- c(1, 2, 4, 8, 16)
vec[]

Exercise 17

If the elements of your vector have names, you can extract them by name. To do so, place a name or a vector of names in the brackets behind the vector. Surround each name with quotation marks, e.g. vec2[c("alpha", "beta")].
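
Here is a short sketch of the same idea using a different, made-up named vector (prices and its contents are assumptions for illustration):

```r
# A hypothetical named vector.
prices <- c(apple = 0.5, bread = 2.25, milk = 1.8)

# One name returns one element; a vector of names returns several.
bread_price <- prices["bread"]
bread_price

two_prices <- prices[c("apple", "milk")]
two_prices
```

Note that the result keeps its names, so you can still see which value belongs to which element.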

Extract the element named "gamma" from the vector below.

vec2 <- c(alpha = 1, beta = 2, gamma = 3)

Below is the flights data frame.

flights

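The letter abbreviations that appear under the column names of flights describe the type of data that is stored in each column: for example, int stands for integers, dbl for doubles (real numbers), chr for character strings, and dttm for date-times.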

Exercise 18

One of the most common mistakes in R is to use an object name when you mean a character string, or vice versa.

question('Which of these are object names? What is the difference between object names and character strings?',
         answer("foo", correct = TRUE),
         answer('"num"'),
         answer("mu", correct = TRUE),
         answer('"sigma"'),
         answer('"data"'),
         answer("a", correct = TRUE),
         allow_retry = TRUE,
         correct = "Character strings are surrounded by quotation marks, object names are not.")
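
The difference can be seen directly in a small sketch. The object mu and its value are made up for this example:

```r
# Without quotes, mu is an object name: R looks up the value stored in it.
mu <- 5
mu

# With quotes, "mu" is a character string: R treats it as plain text.
"mu"

# The two are different kinds of things.
class(mu)
class("mu")
```

If you type an unquoted name that has never been assigned, R reports that the object was not found, which is often a sign that quotation marks are missing.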

Data Frames

Data frames are “spreadsheet”-type data sets. You can make a data frame with the data.frame() function, which works similarly to c().
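
As a minimal sketch with made-up values (the pets data frame is an assumption for illustration), each named argument to data.frame() becomes a column, and the $ operator used later in this section extracts one column:

```r
# Each argument to data.frame() becomes one column; the argument name
# becomes the column name.
pets <- data.frame(
  name = c("Rex", "Tom", "Polly"),
  legs = c(4, 4, 2)
)
pets

# The $ operator pulls out a single column as a vector.
pets$legs
```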

Exercise 1

Assemble the vectors below into a data frame using data.frame() with the column names numbers, logicals, and strings. Assign the data frame to the object named df.

nums <- c(1, 2, 3, 4)
logs <- c(TRUE, TRUE, FALSE, TRUE)
strs <- c("apple", "banana", "carrot", "duck")

Exercise 2

Extract the strings column of the df data frame using the $ operator.

nums <- c(1, 2, 3, 4)
logs <- c(TRUE, TRUE, FALSE, TRUE)
strs <- c("apple", "banana", "carrot", "duck")
df <- data.frame(numbers = nums, logicals = logs, strings = strs)

Exercise 3

Load the PPBDS.data package using library(). Then run the code below.

trains

Exercise 4

Use the glimpse() function to look at the trains data set. We already loaded the tidyverse and PPBDS.data packages.

library(tidyverse)
library(PPBDS.data)

Exercise 5

Extract the income variable in the trains data set using the $ operator.


Grammar of graphics

Graphics are designed to emphasize the findings and insights you want your audience to understand.

Exercise 1

quiz(
  question("What are the three essential components of a graphic?",
           answer("the data set containing the variables in question", correct = TRUE),
           answer("the geometric object we can observe in a plot", correct = TRUE),
           answer("axes labels on a plot"),
           answer("the aesthetic attributes", correct = TRUE),
           allow_retry = TRUE
  ),
  question("What are the two important arguments that we need to provide the `ggplot()` function?",
           answer("`data` and `mapping`", correct = TRUE),
           answer("`data` and `aesthetics`"),
           answer("`data` and `layers`"),
           allow_retry = TRUE
  )
)

Exercise 2

ggplot(data = trains, mapping = aes(x = att_start, y = att_end, color = treatment)) +
  geom_point() + facet_wrap(~party)
quiz(
  question("Which `data` variable is mapped to the `x`-position `aes`thetic of the points?",
           answer("`att_start`", correct = TRUE),
           answer("`att_end`"),
           answer("`treatment`"),
           allow_retry = TRUE
  ),
  question("Which `data` variable is mapped to the `y`-position `aes`thetic of the points?",
           answer("`att_start`"),
           answer("`att_end`", correct = TRUE),
           answer("`treatment`"),
           allow_retry = TRUE
  ),
  question("Which `data` variable is mapped to the `color` `aes`thetic of the points?",
           answer("`att_start`"),
           answer("`att_end`"),
           answer("`treatment`", correct = TRUE),
           allow_retry = TRUE
  )
)

geom_point

Scatterplots allow you to visualize the relationship between two numerical variables.

Let's create the following scatterplot.

scat_p <- ggplot(data = qscores, mapping = aes(x = rating, y = hours, size = enrollment)) +
  geom_point()

scat_p

Exercise 1

Load the PPBDS.data package and look at the qscores data set by simply typing the name of the data set.


library(...)
...

Exercise 2

Nice! Now load the ggplot2 package using library(). On the line below, use the ggplot() function to create a scatterplot using the qscores data set. Map rating to the x-axis and hours to the y-axis.


library(...)
ggplot(data = qscores, mapping = aes(x = ..., y = ...)) +
  geom_point()

Exercise 3

Awesome! Now we want to add a size aesthetic based on the number of students enrolled in each course. Set the argument size to enrollment inside the aes() function.


ggplot(data = qscores, mapping = aes(x = rating, y = hours, size = ...)) +
  geom_point()

Reminder: This is what our graph should look like.

scat_p

Exercise 4

The following plot was created using the mpg data set. It only displays 126 points, but it visualizes a data set that contains 234 points. In this section, we will fix this issue.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

The missing points are hidden behind other points, a phenomenon known as overplotting. Overplotting gives an incomplete picture of the data set: you cannot tell where the mass of the points falls, which makes it difficult to spot relationships in the data.

Causes of overplotting

  1. The data points have been rounded to a "grid" of common values, as in the plot above
  2. The data set is so large that it cannot be plotted without points overlapping each other
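
One rough way to check for the first cause is to compare the number of rows with the number of distinct positions. The sketch below uses a small made-up data frame (grid_data is an assumption for illustration) rather than the tutorial's data:

```r
# A made-up data set in which values are rounded to a coarse grid,
# so several rows share exactly the same (x, y) position.
grid_data <- data.frame(
  x = c(1, 1, 1, 2, 2, 3),
  y = c(5, 5, 6, 7, 7, 7)
)

total_points <- nrow(grid_data)            # rows in the data
visible_points <- nrow(unique(grid_data))  # distinct positions on a plot

total_points
visible_points
```

When the second number is smaller than the first, some points are drawn exactly on top of others.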

Exercise 5

One method to fight overplotting is to make each point semi-transparent. The code chunk below provides the code for the qscores scatterplot from earlier in this section. Change the transparency of the points by setting alpha = 0.2 within geom_point().

ggplot(data = qscores, mapping = aes(x = rating, y = hours, size = enrollment)) +
  geom_point()
ggplot(data = qscores, mapping = aes(x = rating, y = hours, size = enrollment)) +
  geom_point(alpha = ...)

geom_jitter()

geom_jitter() is another method to deal with overplotting. It plots a scatterplot and then adds a small amount of random noise to each point in the plot.
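
To see the idea behind jittering, here is a sketch in base R. It is only an illustration of adding random noise and should not be read as ggplot2's actual implementation. The values and the width are made up.

```r
set.seed(1)  # make the random noise reproducible

# Made-up values that all sit on the same grid point.
x <- c(2, 2, 2, 2, 2)

# Sketch of the idea: shift each value by a random amount
# drawn uniformly from -width to +width.
width <- 0.2
x_jittered <- x + runif(length(x), min = -width, max = width)
x_jittered
```

Each point moves by at most width in either direction, so the points separate while staying close to their true values.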

The following scatterplot, which was created using the trains tibble, has overplotting. In this section, we will fix this issue using geom_jitter().

ggplot(data = trains, mapping = aes(x = att_start, y = att_end)) +
  geom_point()

Exercise 1

The code for the graph above has been provided for you below. Replace geom_point() with geom_jitter().

ggplot(data = trains, mapping = aes(x = att_start, y = att_end)) + geom_point()
ggplot(data = trains, mapping = aes(x = att_start, y = att_end)) +
  geom_jitter()

Exercise 2

As you can see, jittering shifted the points slightly so that they no longer sit directly on top of each other. We can adjust the amount of horizontal jitter by setting the width. Set the width to .2 within geom_jitter().


ggplot(data = trains, mapping = aes(x = att_start, y = att_end)) +
  geom_jitter(width = .2)

Exercise 3

Now set the color aesthetic to the party variable.


Because color is an aesthetic, set it inside of aes().

Exercise 4

Now use what you've learned to recreate the plot below. The graph was created using the data set diamonds. The alpha of the plot is 0.2, and the width of the jitter distribution is 5. Use the labs() function to add titles.

ggplot(data = diamonds, mapping = aes(x = depth, y = price)) +
  geom_jitter(width = 5, alpha = 0.2) +
  labs(title = "Depth and Price in Diamonds",
       x = "Depth",
       y = "Price")

geom_histogram()

A histogram is a plot that visualizes the distribution of a numerical variable. Let's create the following plot.

hist_p <- ggplot(data = qscores, mapping = aes(x = rating)) +
  geom_histogram(binwidth = 1, color = "white", fill = "red4", bins = 10)

hist_p
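
The bar heights in a histogram are counts of values per bin. The base R sketch below reproduces that counting step on made-up ratings. The exact bin boundaries that geom_histogram() chooses may differ, so treat this as an illustration of the idea only.

```r
# Made-up ratings on a 1-to-5 scale.
ratings <- c(1.2, 2.5, 2.7, 3.1, 3.4, 3.8, 3.9, 4.4, 4.6, 4.9)

# cut() assigns each value to a bin of width 1; table() counts each bin.
bins <- cut(ratings, breaks = 1:5, include.lowest = TRUE)
bin_counts <- table(bins)
bin_counts
```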

Exercise 1

Using the qscores data set and ggplot(), make a histogram where rating is on the x-axis to see the distribution of Harvard course ratings.


ggplot(data = ..., mapping = aes(x = ...)) + ...

Exercise 2

Add white borders around the bars and a dark red fill by adding the color = "white" and fill = "red4" arguments to geom_histogram().


ggplot(data = qscores, mapping = aes(x = rating)) +
  geom_histogram(color = ..., fill = ...)

Exercise 3

Specify the number of bins to be 10 via the bins argument in geom_histogram().


ggplot(data = qscores, mapping = aes(x = rating)) +
  geom_histogram(color = "white", fill = "red4", bins = 10)

Exercise 4

Specify the width of the bins to be 1 via the binwidth argument. When both are supplied, binwidth overrides bins.


Reminder: This is what our graph should look like.

hist_p
ggplot(data = qscores, mapping = aes(x = rating)) +
  geom_histogram(color = "white", fill = "red4", bins = 10, binwidth = 1)

geom_boxplot()

A boxplot displays the five-number summary of a set of data: the minimum, first quartile, median, third quartile, and maximum.
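
Base R can compute these five numbers directly with fivenum(). The mileage values below are made up for illustration:

```r
# Made-up highway mileage values.
mileage <- c(18, 21, 24, 26, 27, 29, 31, 35, 44)

# fivenum() returns the minimum, lower hinge, median, upper hinge,
# and maximum. The hinges are close to the first and third quartiles.
five <- fivenum(mileage)
five
```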

Let's create the following boxplot.

box_p <- ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot() +
  labs(title = "Highway Fuel Efficiency in Different Types of Cars")

box_p

Exercise 1

Using the mpg tibble, which shows the fuel efficiency for different types of cars, create a boxplot with x = class and y = hwy.


ggplot(data = mpg, mapping = aes(x = ... , y = ... )) +
  geom_boxplot()

Exercise 2

Nice! Using labs(), title the plot "Highway Fuel Efficiency in Different Types of Cars".


labs(title = ...)

geom_bar()

A barplot is a plot that visualizes the distribution of a categorical variable.

Let's create the following barplot.

ggplot(data = trains, mapping = aes(x = treatment, fill = party)) +
  geom_bar(position = "dodge")
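
By default, the height of each bar in a barplot is the number of rows in that category. The base R sketch below computes those counts for a small made-up data frame (survey and its values are assumptions, not the trains data):

```r
# Made-up data with two categorical variables.
survey <- data.frame(
  treatment = c("Control", "Control", "Treated", "Treated", "Treated"),
  party     = c("Democrat", "Republican", "Democrat", "Democrat", "Republican")
)

# One bar per treatment group: its height is the number of rows.
table(survey$treatment)

# Splitting each bar by party corresponds to a two-way table of counts.
counts <- table(survey$treatment, survey$party)
counts
```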

Exercise 1

Let's go back to the trains data set in the PPBDS.data package. Use ggplot() and geom_bar() to plot treatment on the x-axis.


ggplot(data = ..., mapping = aes(x = ...)) + ...

Exercise 2

We can now map the additional variable party by adding fill = party inside the aes() aesthetic mapping.


ggplot(data = trains, mapping = aes(x = treatment, fill = ...)) + 
  geom_bar()

Exercise 3

Let's make our graph a side-by-side barplot. Set position to "dodge" in geom_bar().


ggplot(data = trains, mapping = aes(x = treatment, fill = party)) +
  geom_bar(position = ...)

geom_smooth()

geom_smooth() adds a smoothed trend line to a scatterplot. With method = "lm", that line is a linear regression line.

Exercise 1

Let's start by creating a scatterplot using the nhanes dataset. Run the code below.

ggplot(data = nhanes, mapping = aes(x = weight, y = height)) +
  geom_point()

Exercise 2

Now add geom_smooth() to the graph. Remember you are adding a layer so you need to include +.

ggplot(data = nhanes, mapping = aes(x = weight, y = height)) +
  geom_point()

Exercise 3

Nice! As you can see, the line you just graphed represents the trend that we see in the scatterplot. See the message that R gave us? Because we didn't set the method that R uses to calculate the line, R defaulted to using "gam". Let's try setting the method to "lm" inside geom_smooth().


Remember to put "lm" in quotes.
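
Setting method = "lm" asks for a straight line fitted by a linear model. As a sketch of what such a fit produces, the base R code below fits lm() to made-up data that lie exactly on a known line:

```r
# Made-up data that lie exactly on the line y = 1 + 2x.
x <- c(1, 2, 3, 4, 5)
y <- 1 + 2 * x

# lm() fits a straight line; coef() returns its intercept and slope.
fit <- lm(y ~ x)
line_coefs <- coef(fit)
line_coefs
```

With real data the points scatter around the line, and the fitted intercept and slope describe the line that best summarizes the trend.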

Exercise 4

Great! Now get rid of the scatterplot by deleting geom_point() and only use geom_smooth(). Then, set the color aesthetic to the gender variable.


Because color is an aesthetic, set it inside of aes().

Exercise 5

Great! Now, use what you've learned about geom_smooth() to recreate the plot below.

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_smooth(color = "red") +
  labs(title = "City vs. Highway Fuel Economy",
       x = "City Miles Per Gallon",
       y = "Highway Miles Per Gallon")

geom_density()

geom_density() is used to make a density plot, a smoothed version of a histogram. It is a useful alternative to the histogram that displays continuous data in a smooth distribution.
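
To get a feel for what a density curve is, the sketch below uses base R's density(), which computes a kernel density estimate, on simulated ages. The data are made up, and the area check is a rough approximation.

```r
set.seed(50)  # make the simulated data reproducible

# Simulated ages, roughly centered on 45.
ages <- rnorm(200, mean = 45, sd = 12)

# density() returns a smooth curve as paired x and y values.
age_density <- density(ages)

# The area under the curve is approximately 1.
area <- sum(age_density$y) * diff(age_density$x[1:2])
area
```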

Let's create the following density plot.

dens_p <- ggplot(data = cces, mapping = aes(x = age, color = ideology, fill = ideology)) +
  geom_density(alpha = .3, position = "fill") + xlim(20, 90)

dens_p

Exercise 1

Use the dataset cces to make a density plot with age on the x axis and ideology set to the aesthetics color and fill.


ggplot(data = cces, mapping = aes(x = ..., color = ..., fill = ...))

Exercise 2

Great! As you can see, we have a lot of overlapping going on. Try setting the alpha parameter in geom_density() to 0.3.


Exercise 3

Nice! Now try setting the position parameter in geom_density() to "fill". Then try setting it to "stack".


Exercise 4

Great! Now add a limit to the x axis with an upper bound of 90 and a lower bound of 20.


Reminder: Your graph should look something like this.

dens_p
Use xlim() to add a limit on the x axis

The tidyverse

Now it's time to learn more about data wrangling using the tidyverse. The exercises in this section use the pipe operator %>% together with filter(), arrange(), summarize(), select(), and slice().

Exercise 1

Before we get started, load the tidyverse collection of packages and the package PPBDS.data.


Use library()

Exercise 2

Let's take a glimpse() at the data set cces. Press "Run Code".

glimpse(cces)

Exercise 3

Using the pipe operator %>%, add filter(). Within filter(), use the arguments state == "Massachusetts" and gender == "Female".


cces %>% 
  filter(state == "...", gender == "...")

Exercise 3

We now want to organize our data in descending order of age. Copy and paste your code from above. Use %>% to add arrange(). Within arrange(), use the argument desc(age).


cces %>% 
  filter(state == "Massachusetts", gender == "Female") %>% 
  arrange(desc(age))

Exercise 4

Great! Now continue the code using %>% again to add summarize(). We want to find the mean and median ages from our filtered data.
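
The same pattern can be sketched on a tiny made-up tibble. The tibble people and the summary column names mean_age and median_age are assumptions for illustration; the real cces data has many more rows and columns.

```r
library(dplyr)

# A made-up tibble standing in for a survey data set.
people <- tibble(
  state  = c("Massachusetts", "Massachusetts", "Texas", "Massachusetts"),
  gender = c("Female", "Female", "Female", "Male"),
  age    = c(34, 52, 41, 29)
)

age_summary <- people %>%
  filter(state == "Massachusetts", gender == "Female") %>%  # keep matching rows
  arrange(desc(age)) %>%                                    # oldest first
  summarize(mean_age = mean(age), median_age = median(age))

age_summary
```

summarize() collapses the filtered rows into a single row with one column per summary.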


Exercise 5

For this section, we will be focusing on a really big data set called nhanes. Let's say we wanted to make a graph using data on height and weight from the 200 youngest black males. We first would select() the variables we want to focus on, in this case height, weight, age, race, and gender. Try doing this below:


Exercise 6

Now we want to narrow down our data to only black males, so let's use filter() to create a tibble made up of rows from nhanes where race == "Black" and gender == "Male".


Exercise 7

Great! So what now? Well, we want to look at the youngest black males. To do this, we would have to arrange() by age. Do this below:


Exercise 8

Nice! Now use slice() to isolate the 200 rows with the lowest age value.


Because of the way we arranged the data, the 200 rows with the lowest age values are the first 200 rows.
slice(1:200)
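
Putting the four steps together, here is a sketch of the full pipeline on a small made-up tibble (runners and its columns are assumptions for illustration, with slice(1:2) standing in for slice(1:200)):

```r
library(dplyr)

# A made-up tibble standing in for a larger data set.
runners <- tibble(
  name = c("Ana", "Ben", "Cy", "Dee", "Eli"),
  team = c("Red", "Blue", "Red", "Red", "Blue"),
  age  = c(31, 24, 19, 45, 28),
  shoe = c(8, 10, 9, 7, 11)
)

youngest_red <- runners %>%
  select(name, team, age) %>%   # keep only the columns of interest
  filter(team == "Red") %>%     # keep only rows for the Red team
  arrange(age) %>%              # sort from youngest to oldest
  slice(1:2)                    # take the first two rows

youngest_red
```

The order of the steps matters: slice() picks rows by position, so it only returns the youngest rows because arrange() sorted them first.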

Submit

Congrats on finishing your first Gov 50 tutorial! You're on your way to being a master in data visualization and wrangling! :)

submission_ui
submission_server()


davidkane9/PPBDS.data documentation built on Nov. 18, 2020, 1:17 p.m.