library(learnr)
library(tidyverse)
library(nycflights13)
library(tutorialExtras)
library(gradethis)
library(tutorial.helpers)
library(ggcheck)

gradethis_setup()
knitr::opts_chunk$set(echo = FALSE)
options(
  tutorial.exercise.timelimit = 60
  #tutorial.storage = "local"
)

fruits <- tibble(
  fruit = c("apple", "apple", "orange", "apple", "orange")
)
fruits_counted <- tibble(
  fruit = c("apple", "orange"),
  number = c(3, 2)
)
grade_server("grade")
question_text("Name:", answer_fn(function(value){ if(length(value) >= 1 ) { return(mark_as(TRUE)) } return(mark_as(FALSE) ) }), correct = "submitted", allow_retry = FALSE )
Complete this tutorial while reading Sections 2.7 - 2.9 of the textbook. Each question allows 3 'free' attempts. After the third attempt a 10% deduction occurs per attempt.
You can check your current grade and the number of attempts you have used in the "View grade" section. You can click this button as often as you like as you progress through the tutorial. Before submitting, make sure your grade is as expected.
Similar to a histogram, a boxplot shows the distribution of a single numeric variable.
To compare the distribution of a numerical variable across the levels of another variable, an alternative to a faceted histogram is a side-by-side boxplot.
A boxplot is constructed from the information provided in the five-number summary of a numerical variable.
question("Which of the following summary statistics are included in the five-number summary and are used to construct a boxplot when there are no “outliers” in the data?", answer("minimum", correct = TRUE), answer("maximum", correct = TRUE), answer("mode"), answer("first quartile (Q1, 25th percentile)", correct = TRUE), answer("standard deviation"), answer("third quartile (Q3, 75th percentile)", correct = TRUE), answer("median", correct = TRUE), answer("mean"), allow_retry = TRUE, random_answer_order = TRUE)
question_wordbank("Drag and drop the features of a boxplot with the information they display about the data.", choices = c("lines extending from the box to points less than the 25th percentile or greater than the 75th percentile", "interquartile range (i.e. a measure of the spread of the data)", "outliers", "1st quartile, median, 3rd quartile (i.e. the middle 50% of the data)"), wordbank = c("whiskers", "length", "dots", "box"), answer(c("whiskers", "length", "dots", "box"), correct = TRUE), allow_retry = TRUE )
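As a rough sketch of how these features relate to the data, the snippet below computes the five-number summary and outlier cutoffs for a small vector using base R. The vector `temps` is made up for illustration, and the 1.5 × IQR cutoff is the common boxplot convention, assumed here rather than stated in this tutorial.

```r
# Hypothetical temperatures, made up for illustration only
temps <- c(2, 4, 5, 7, 8, 9, 30)

# Five-number summary: minimum, Q1, median, Q3, maximum
five_num <- quantile(temps, probs = c(0, 0.25, 0.5, 0.75, 1))

# Interquartile range: the length of the box
iqr <- IQR(temps)

# Assumed convention: points more than 1.5 * IQR beyond the box are drawn as dots
lower_fence <- five_num[["25%"]] - 1.5 * iqr
upper_fence <- five_num[["75%"]] + 1.5 * iqr
outliers <- temps[temps < lower_fence | temps > upper_fence]

print(five_num)
print(outliers)
```

Here the value 30 lies above the upper cutoff, so under this convention it would appear as a dot and the upper whisker would stop at 9.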
Let’s create a side-by-side boxplot of hourly temperatures split by the 12 months, as we did in the previous tutorial with the faceted histograms.
Within ggplot()
set the data = weather
. Set the second argument to mapping = aes()
and within aes()
define:
x
to be the variable month
y
to be the numeric variable temp
ggplot(data = weather, mapping = aes(x = ..., y = ...))
ggplot(data = weather, mapping = aes(x = month, y = temp))
grade_this_code()
Copy the previous code and use the +
operator to add geom_boxplot()
.
ggplot(data = weather, mapping = aes(x = month, y = temp)) + geom_...
ggplot(data = weather, mapping = aes(x = month, y = temp)) + geom_boxplot()
grade_this_code()
Oh no, this plot does not provide information about temperature separated by month! The warning messages clue us in as to why.
The first warning message is telling us that we have a “continuous”, or numerical variable, on the x-position aesthetic. Side-by-side boxplots require one categorical variable and one numeric variable.
Copy the previous code and convert the numerical variable month into a categorical variable by using the factor()
function.
ggplot(data = weather, mapping = aes(x = ...(month), y = temp)) + geom_boxplot()
ggplot(data = weather, mapping = aes(x = factor(month), y = temp)) + geom_boxplot()
grade_this_code()
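To see what `factor()` changes, here is a minimal base R sketch. The vector `month` below is a made-up stand-in for the `month` column, not the actual `weather` data.

```r
# Hypothetical month values stored as numbers
month <- c(1, 1, 2, 12, 12)

# Numeric: ggplot2 treats this as continuous on the x-axis
is.numeric(month)

# Convert to a categorical variable
month_factor <- factor(month)

# Categorical: one group (level) per distinct value
is.factor(month_factor)

# The levels keep the original numeric ordering
levels(month_factor)
```

Because each distinct month becomes its own level, `geom_boxplot()` can draw one box per month.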
Another common task is to visualize the distribution of a categorical variable. This is a simpler task, as we are simply counting the different categories, also known as levels, of a categorical variable.
Below is the code we used to manually create two data frames, fruits
and fruits_counted
, representing a collection of fruit: 3 apples and 2 oranges.
fruits <- tibble(
  fruit = c("apple", "apple", "orange", "apple", "orange")
)
fruits_counted <- tibble(
  fruit = c("apple", "orange"),
  number = c(3, 2)
)
Run fruits
in the code chunk to print the data frame.
...
fruits
grade_this_code()
Notice that fruits
just lists the fruit individually.
Now, run fruits_counted
in the code chunk to print the data frame.
...
fruits_counted
grade_this_code()
fruits_counted
has a variable number
which represents pre-counted values of each fruit.
Let’s first generate a barplot using the fruits
data frame where all 5 fruits are listed individually in 5 rows.
Use the ggplot()
function with data = fruits
and mapping = aes(x = fruit)
.
Be careful: the data frame is called fruits
and the variable is called fruit
.
ggplot(data = ..., mapping = aes(x = ...))
ggplot(data = fruits, mapping = aes(x = fruit))
grade_this_code()
Add a geom_bar()
layer.
ggplot(data = fruits, mapping = aes(x = fruit)) + geom_...
ggplot(data = fruits, mapping = aes(x = fruit)) + geom_bar()
grade_this_code()
Since the data was in list form (not pre-counted), there is no y
-aesthetic needed.
Copy the previous code and make the following modifications:
Change the data to data = fruits_counted
Add y = number
after the x
aesthetic
Replace geom_bar()
with geom_col()
ggplot(data = fruits, mapping = aes(x = fruit)) + geom_...
ggplot(data = fruits_counted, mapping = aes(x = fruit, y = number)) + geom_col()
grade_this_code()
Since this data frame is pre-counted we need to specify the counts of each fruit as the y
aesthetic (whereas geom_bar()
counts the list for us). Recall from Exercise 2 the name of the variable was number
.
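The relationship between the two data frames can be sketched in base R: counting the rows of the listed data yields the pre-counted form. The names `fruits_base` and `fruits_counted_base` are hypothetical, chosen to avoid overwriting the tutorial's objects, and `table()` is used here only to illustrate the counting step, not as a claim about how `geom_bar()` is implemented.

```r
# Same fruit list as the tutorial, built with base R so no packages are needed
fruits_base <- data.frame(
  fruit = c("apple", "apple", "orange", "apple", "orange")
)

# Count rows per category, the step geom_bar() handles for us
counts <- table(fruits_base$fruit)

# Reshape into a pre-counted data frame, the form geom_col() expects
fruits_counted_base <- data.frame(
  fruit = names(counts),
  number = as.vector(counts)
)

print(fruits_counted_base)
```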
question_wordbank("Which geometric layer do you use with categorical data that is...", choices = c("NOT pre-counted", "pre-counted"), answer(c("geom_bar()", "geom_col()"), correct = TRUE), allow_retry = TRUE, random_answer_order = TRUE)
Recall our flights
dataset from the nycflights13
package. The package has already been pre-loaded for you and a glimpse()
of the dataset is shown below.
glimpse(flights)
Using ggplot()
set the data = flights
and assign the x
-axis aesthetic to be carrier
. Then add the appropriate geom
layer.
ggplot(data = ..., mapping = aes(x = ...)) + geom_...()
ggplot(data = flights, mapping = aes(x = carrier)) + geom_bar()
grade_this_code()
Observe that United Air Lines (UA) had the most flights depart New York City in 2013 and SkyWest Airlines Inc. (OO) had the fewest.
If you don’t know which airlines correspond to which carrier codes, then run View(airlines) to see a directory of airlines.
Another use of barplots is to visualize the joint distribution of two categorical variables at the same time.
Let’s examine the joint distribution of outgoing domestic flights from NYC by carrier
and origin
, or in other words the number of flights for each carrier
and origin
combination.
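A joint distribution of two categorical variables is simply a two-way table of counts. The sketch below uses a tiny made-up data frame, `toy_flights`, which is hypothetical and not taken from `nycflights13`.

```r
# Hypothetical flights, made up for illustration (not taken from nycflights13)
toy_flights <- data.frame(
  carrier = c("UA", "UA", "UA", "DL", "DL", "B6"),
  origin  = c("EWR", "EWR", "JFK", "JFK", "LGA", "JFK")
)

# Joint distribution: number of flights per carrier and origin combination
joint_counts <- table(toy_flights$carrier, toy_flights$origin)

print(joint_counts)
```

Each cell of the table corresponds to one coloured segment (or one bar) in a barplot of `carrier` filled by `origin`.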
Copy the previous code and map the additional variable origin
by adding a fill = origin
inside the aes()
aesthetic mapping.
ggplot(data = flights, mapping = aes(x = carrier, ...)) + geom_bar()
ggplot(data = flights, mapping = aes(x = carrier, fill = origin)) + geom_bar()
grade_this_code()
This is an example of a stacked barplot. While easy to make, it is not always ideal.
An alternative to a stacked barplot is a side-by-side barplot, also known as a dodged barplot.
Copy the previous code and add the argument position = "dodge"
within geom_bar()
.
ggplot(data = flights, mapping = aes(x = carrier, fill = origin)) + geom_bar(position = ...)
ggplot(data = flights, mapping = aes(x = carrier, fill = origin)) + geom_bar(position = "dodge")
grade_this_code()
This shows the same information as a faceted barplot.
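For comparison, here is a sketch of both versions built from the same data. It assumes the `ggplot2` package is installed and uses a made-up data frame, `toy_flights`, rather than the real `flights` data. The `facet_wrap()` layout (`ncol = 1`) is just one reasonable choice.

```r
library(ggplot2)

# Hypothetical flights, made up for illustration (not taken from nycflights13)
toy_flights <- data.frame(
  carrier = c("UA", "UA", "UA", "DL", "DL", "B6"),
  origin  = c("EWR", "EWR", "JFK", "JFK", "LGA", "JFK")
)

# Dodged barplot: one bar per carrier and origin combination, side by side
p_dodge <- ggplot(data = toy_flights, mapping = aes(x = carrier, fill = origin)) +
  geom_bar(position = "dodge")

# Faceted barplot: the same counts, with one panel per origin
p_facet <- ggplot(data = toy_flights, mapping = aes(x = carrier)) +
  geom_bar() +
  facet_wrap(~ origin, ncol = 1)
```

Printing `p_dodge` or `p_facet` renders the plot. Both draw one bar for each carrier and origin combination that occurs in the data; they differ only in whether the origins are separated by colour or by panel.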
grade_button_ui(id = "grade")
Once you are finished:
grade_print_ui("grade")