library(learnr)
library(tidyverse)
library(nycflights13)
library(tutorialExtras)
library(gradethis)
library(tutorial.helpers)
library(ggcheck)

gradethis_setup()
knitr::opts_chunk$set(echo = FALSE)

options(
  tutorial.exercise.timelimit = 60
  #tutorial.storage = "local"
)

alaska_flights <- flights %>%
  filter(carrier == "AS")
grade_server("grade")
question_text("Name:", answer_fn(function(value){ if(length(value) >= 1 ) { return(mark_as(TRUE)) } return(mark_as(FALSE) ) }), correct = "submitted", allow_retry = FALSE )
Complete this tutorial while reading Sections 2.0 - 2.3 of the textbook. Each question allows 3 'free' attempts. After the third attempt a 10% deduction occurs per attempt.
You can check your current grade and the number of attempts you have used in the "View grade" section. You can click this button as often as you like as you progress through the tutorial. Before submitting, make sure your grade is as expected.
"The Grammar of Graphics" defines a set of rules for constructing statistical graphics by combining different types of layers.
question("What are the 3 essential components of the grammar of graphics?", answer("data", TRUE), answer("geometric object", TRUE), answer("aesthetic attributes", TRUE), answer("scales"), answer("coordinate systems"), answer("faceting"), answer("position adjustments"), allow_retry = TRUE, random_answer_order = TRUE )
question("Which of the following are examples of aesthetic attributes of geometric objects?", answer("color", correct = TRUE), answer("point"), answer("line"), answer("shape", correct = TRUE), answer("position (e.g. x and/or y coordinates)", correct = TRUE), answer("size", correct = TRUE), allow_retry = TRUE, random_answer_order = TRUE )
question_wordbank(paste("Drag and drop the variables to match the aesthetics they are mapped onto in Figure 2.1", htmltools::img(src="images/Figure_02_1.png", height = 400, width = 700) ), choices = c("x","size", "color", "doesn't get mapped", "y"), wordbank = c("GDP per Capita", "Population", "Continent", "Country", "Life Expectancy"), answer(c("GDP per Capita", "Population", "Continent", "Country", "Life Expectancy"), correct = TRUE), allow_retry = TRUE, random_answer_order = TRUE )
Load the package we will be using for data visualization, which is an implementation of the Grammar of Graphics for R.
library(...)
library(ggplot2)
grade_this_code()
The simplest of the five named graphs (5NG) is the scatterplot, also called a bivariate plot.
question("Scatterplots allow you to visualize...", answer("the relationship between 2 numeric variables", correct = TRUE), answer("the relationship between 2 categorical variables"), answer("the distribution of 1 numeric variable"), answer("the relationship between 1 numeric and 1 categorical variables"), answer("the distribution of 1 categorical variable"), allow_retry = TRUE, random_answer_order = TRUE )
Let's visualize the relationship between departure delay and arrival delay for Alaska Airlines flights leaving NYC in 2013.
First load the `nycflights13` package using the `library()` function.
library(...)
library(nycflights13)
grade_this_code()
Recall that this package contains a dataset called `flights`, which contains data on all 336,776 flights that left NYC in 2013.
We’ll take the `flights` data frame, extract only the 714 rows corresponding to Alaska Airlines flights, and save this in a new data frame called `alaska_flights`. Click "Submit Answer" below to create this new dataset.
alaska_flights <- flights %>% filter(carrier == "AS")
grade_this_code()
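As a quick sanity check on the filtering step above, the row counts of the two data frames can be compared. This is a minimal sketch that assumes the `dplyr` and `nycflights13` packages are installed; it only uses the `flights` data and the `carrier == "AS"` condition already introduced.

```r
library(dplyr)
library(nycflights13)

# Keep only the rows whose carrier code is "AS" (Alaska Airlines)
alaska_flights <- flights %>%
  filter(carrier == "AS")

nrow(flights)         # number of rows in the full dataset
nrow(alaska_flights)  # number of rows kept after filtering

# Every remaining row should belong to the same carrier
unique(alaska_flights$carrier)
```

If the counts match the 336,776 and 714 quoted in the text, the filter did what we intended.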
`ggplot()` is the core function of the ggplot2 package. It creates a ggplot object that serves as a canvas for visualizations.
Run `ggplot(data = alaska_flights)`.
ggplot(...)
ggplot(alaska_flights)
grade_this_code()
You should see a blank, grey square. R has set up the area in which it can place a plot, but we have yet to tell it what to plot.
Recall that another core component of the grammar of graphics is the aesthetic mapping. To use the `mapping` argument, you have to give `ggplot()` an aesthetic mapping, which you create by calling the `aes()` function. For example, to map the variable `dep_delay` to the x-axis, you would add `mapping = aes(x = dep_delay)` to your call to `ggplot()`.
Copy the code from the previous exercise. Within the call to `ggplot()`, set `mapping = aes()`. Within `aes()`, set the `x` argument to `dep_delay` and the `y` argument to `arr_delay`, then run your code.
ggplot(data = alaska_flights, mapping = aes(x = ..., y = ...))
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay))
grade_this_code()
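To see what `aes()` actually hands to `ggplot()`, the mapping can be stored in its own object and printed before it is used. This is only an illustrative sketch: the name `delay_mapping` is made up for this example, and the packages are assumed to be installed.

```r
library(ggplot2)
library(dplyr)
library(nycflights13)

alaska_flights <- flights %>%
  filter(carrier == "AS")

# aes() only records which variable goes with which aesthetic;
# it does not look at the data yet
delay_mapping <- aes(x = dep_delay, y = arr_delay)
delay_mapping
names(delay_mapping)

# Passing the stored mapping is equivalent to writing aes(...) inline
ggplot(data = alaska_flights, mapping = delay_mapping)
```

The plot now has labelled axes, because ggplot2 knows which variables belong on `x` and `y`, but it still shows no data.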
Common scatterplot language is `y` by `x`.
We have now mapped `arr_delay` by `dep_delay`; however, we do not see any data yet.
That is because we need to add the third component of the grammar: the `geom`etric object. For a scatterplot, the geometric objects are points.
Copy the code from the previous exercise and add `geom_point()` to the plot.
To add a layer, use the `+` symbol.
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) + geom_...
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) + geom_point()
grade_this_code()
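The way layers accumulate can be made visible by building the plot in two steps. The sketch below is one possible illustration, not part of the exercise: the object names `base_plot` and `scatter` are made up, and it assumes that the `layers` element of a ggplot object can be inspected with `$`, which may differ between ggplot2 versions.

```r
library(ggplot2)
library(dplyr)
library(nycflights13)

alaska_flights <- flights %>%
  filter(carrier == "AS")

# Step 1: data and aesthetic mapping only, no geometric object yet
base_plot <- ggplot(data = alaska_flights,
                    mapping = aes(x = dep_delay, y = arr_delay))

# Step 2: the + at the END of a line adds a layer to the plot
scatter <- base_plot +
  geom_point()

length(base_plot$layers)  # layers before adding geom_point()
length(scatter$layers)    # layers after adding geom_point()

scatter
```

Printing `base_plot` gives the empty canvas with axes, while printing `scatter` draws the points.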
When describing the relationship between 2 numeric variables there are a few key concepts we look for:
question("What type of relationship exists between departure delays and arrival delays for Alaska Airlines flights from NYC in 2013?", answer("positive", correct = TRUE), answer("negative"), answer("no relationship"), allow_retry = TRUE, random_answer_order = TRUE)
As departure delays increase, arrival delays also tend to increase, meaning we have a positive relationship. We could also say this is a fairly strong linear relationship, with a large mass of points clustered near (0, 0).
Before we move on, let's make sure we understand the use of the `+` operator.
question("Select the statements that are TRUE about the + sign when using ggplot()", answer("Not using the + sign to add a geometric object will result in an empty plot.", correct = TRUE), answer("The + sign adds a layer to the plot.", correct = TRUE), answer("The + sign should go at the beginning of a new line."), allow_retry = TRUE, random_answer_order = TRUE)
The large mass of points near (0, 0) can cause some confusion as it is hard to tell the true number of points that are plotted. This is the result of a phenomenon called overplotting. As one may guess, this corresponds to values being plotted on top of each other over and over again. It is often difficult to know just how many values are plotted in this way when looking at a basic scatterplot as we have here.
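One way to get a feel for how much overplotting there is, without relying on the picture, is to count how many flights share exactly the same pair of delays. This is a rough sketch using `dplyr::count()`; the names `overlap_counts` and `n_flights` are made up for this example, and rows with missing delays are dropped first because they are not drawn.

```r
library(dplyr)
library(nycflights13)

alaska_flights <- flights %>%
  filter(carrier == "AS")

# Count flights for each distinct (dep_delay, arr_delay) pair,
# with the most heavily repeated pairs listed first
overlap_counts <- alaska_flights %>%
  filter(!is.na(dep_delay), !is.na(arr_delay)) %>%
  count(dep_delay, arr_delay, sort = TRUE, name = "n_flights")

head(overlap_counts)

# Distinct plotted positions versus flights actually plotted
nrow(overlap_counts)
sum(overlap_counts$n_flights)
```

Any pair with `n_flights` greater than 1 appears as a single point in the basic scatterplot, even though it stands for several flights.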
question("What is the name of the aesthetic argument that allows you to change the transparency of a geometric object in ggplot()?", answer("alpha", correct = TRUE), answer("color"), answer("jitter"), answer("transparency"), allow_retry = TRUE, random_answer_order = TRUE)
The first way of addressing overplotting is by changing the transparency of the points by using the alpha argument in geom_point().
The code from Exercise 6 has been copied for you below. Adjust the transparency by setting `alpha = 0.2`.
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) + geom_point(alpha = 0.2)
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) + geom_point(...)
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) + geom_point(alpha = 0.2)
grade_this_code()
The transparency of the points is cumulative: areas with a high degree of overplotting are darker, whereas areas with a lower degree are less dark.
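The cumulative effect can be approximated with a little arithmetic. The sketch below assumes the usual compositing model, in which each point lets through a fraction `1 - alpha` of whatever is beneath it; the exact rendering depends on the graphics device, so treat the numbers as a guide only. The variable names are made up for this example.

```r
alpha <- 0.2                      # transparency used in the exercise
n_points <- c(1, 2, 5, 10, 20)    # number of points stacked on one spot

# After n identical points, a fraction (1 - alpha)^n of the background
# still shows through, so the combined opacity is the remainder
combined_opacity <- 1 - (1 - alpha)^n_points

data.frame(n_points, combined_opacity = round(combined_opacity, 3))
```

Under this model, a single point is only 20% opaque, whereas a stack of many points is nearly solid, which is why the crowded region near (0, 0) looks darkest.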
The second way of adjusting for overplotting is by "jittering" or randomly "nudging" the points.
Copy the code from the previous exercise and replace `geom_point()` with `geom_jitter()`. Do not include any additional arguments to `geom_jitter()`.
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) + geom_...()
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) + geom_jitter()
grade_this_code()
It is possible to specify how much jitter to add by adjusting the `width` and `height` arguments.
Most of the time leaving these blank and letting R pick the default values is sufficient.
If you do specify `width` and `height`, check the scales of the x- and y-axes first, because both arguments are given in the units of those axes. It is important to add just enough jitter to break any overlap in points, but not so much that we completely alter the overall pattern in the points.
question("Which adjustment appears to be most appropriate for visualizing the relationship between `arr_delay` and `dep_delay`?", answer("setting alpha", correct = TRUE), answer("using geom_jitter()"), allow_retry = TRUE, random_answer_order = TRUE)
With the relatively large dataset and the fact that the points were clustered (as opposed to directly overlapping), it can be argued that setting the transparency was more effective at handling overplotting in our `alaska_flights` example.
With medium to large datasets, you may need to play around with the different modifications one can make to a scatterplot.
It is also possible to adjust both the transparency and the jitter, for example `geom_jitter(alpha = 0.2)`.
grade_button_ui(id = "grade")
Once you are finished:
grade_print_ui("grade")