library(learnr) library(tidyverse) checker <- function(label, user_code, check_code, envir_result, evaluate_result, ...) { list(message = check_code, correct = TRUE, location = "append") } tutorial_options(exercise.timelimit = 60, exercise.checker = checker)
Data visualization is one of the most important tools for data science. It is also a great way to start learning R; when you visualize data, you get an immediate payoff that will keep you motivated as you push through the initial frustrations of learning a language.
This tutorial will teach you how to begin visualizing data with ggplot2, which is one of the most popular packages in R for visualizing data. The tutorial focuses on three essential skills:
These skills will quickly add visualization to your data science toolkit. You can expand on them later with RStudio awesome Visualize Data primer.
This tutorial is copied from Rstudio primers and distributed in this package only for convenience purpose. The lessons they write are superb and it's much smarter to stand on the shoulders of giants and give them proper credit than to write copycats.
These examples are excerpted from R for Data Science by Hadley Wickham and Garrett Grolemund, published by O’Reilly Media, Inc., 2016, ISBN: 9781491910399. You can purchase the book at shop.oreilly.com.
This tutorial uses the core tidyverse packages, including ggplot2, which have been pre-loaded for your convenience.
"The simple graph has brought more information to the data analyst’s mind than any other device. " --- John Tukey
Let's begin the tutorial with a question to explore.
question("What do you think: Do cars with big engines use more fuel than cars with small engines?", type = "single", correct = "Great!", answer("Cars with bigger engines use _more_ fuel.", correct = TRUE, message = "In other words, there is a positive relationship between engine size and fuel efficiency. Now let's test your hypothesis against data."), answer("Cars with bigger engines use _less_ fuel.", correct = TRUE, message = "In other words, there is a negative relationship between engine size and fuel efficiency. Now let's test your hypothesis against data.") )
mpg
You can test your answer with two things: the mpg
dataset that comes in ggplot2 and a plot. mpg
contains observations collected on 38 models of cars by the US Environmental Protection Agency. Among the variables in mpg
are:
displ
, a car's engine size, in liters.hwy
, a car's fuel efficiency on the highway, in miles per gallon (mpg).
A car with a low fuel efficiency consumes more fuel than a car with a high fuel efficiency when they travel the same distance.
To see the mpg
data frame, type mpg
in the code block below and click "Submit Answer".
mpg
"Good job! We'll use interactive code chunks like this throughout these tutorials. Whenever you encounter one, you can click Submit Answer to run (or re-run) the code in the chunk. If there is a Solution button, you can click it to see the answer. Now let's use this data to make our first graph."
The code below uses functions from the ggplot2 package to plot the relationship between displ
and hwy
.
Let's look at the plot and then talk about the code itself. To see the plot, click "Submit Answer" to run the code below.
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))
"Excellent! Can you spot the relationship?"
The plot shows a negative relationship between engine size (displ
) and fuel efficiency (hwy
). Points that have a large value of displ
have a small value of hwy
and vice versa.
In other words, cars with big engines use more fuel. If that was your hypothesis, you were right!
Now let's look at how we made the plot.
ggplot()
Here's the code that we used to make the plot. Notice that it contains three functions: ggplot()
, geom_point()
, and aes()
.
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))
The first function, ggplot()
, creates a coordinate system that you can add layers to. The first argument of ggplot()
is the dataset to use in the graph.
By itself, ggplot(data = mpg)
creates an empty graph, but it is not very interesting so I'm not going to show it here.
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))
geom_point()
adds a layer of points to the empty plot created by ggplot()
. As a result, we get a scatterplot.
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))
geom_point()
takes a mapping
argument, which defines which variables in your dataset are mapped to which axes in your graph. The mapping
argument is always paired with the function aes()
, which you use to gather together all of the mappings that you want to create.
Here, we want to map the displ
variable to the x axis and the hwy
variable to the y axis, so we add x = displ
and y = hwy
inside of aes()
(and we separate them with a comma). Where will ggplot2 look for these mapped variables? In the data frame that we passed to the data
argument, in this case, mpg
.
Our code follows the common workflow for making graphs with ggplot2. To make a graph, you:
ggplot()
geom_
functionmapping = aes()
argumentIn fact, you can turn our code into a reusable template for making graphs. To make a graph, replace the bracketed sections in the code below with a data set, a geom_
function, or a collection of mappings.
Give it a try! Replace the bracketed sections with mpg
, geom_boxplot
, and x = class, y = hwy
to make a slightly different graph. Be sure to delete the #
symbols before you run the code.
# ggplot(data = <DATA>) + # <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
ggplot(data = mpg) + geom_boxplot(mapping = aes(x = class, y = hwy))
"Good job! This plot uses boxplots to compare the fuel efficiencies of different types of cars. ggplot2 comes with many geom functions that each add a different type of layer to a plot. You'll learn more about boxplots and other geoms in the tutorials that follow."
As you start to run R code, you're likely to run into problems. Don't worry --- it happens to everyone. I have been writing R code for years, and every day I still write code that doesn't work!
Start by carefully comparing the code that you're running to the code in the examples. R is extremely picky, and a misplaced character can make all the difference. Make sure that every (
is matched with a )
and every "
is paired with another "
. Also pay attention to capitalization; R is case sensitive.
One common problem when creating ggplot2 graphics is to put the +
in the wrong place: it has to come at the end of a line, not the start. In other words, make sure you haven't accidentally written code like this:
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))
If you're still stuck, try the help. You can get help about any R function by running ?function_name
in a code chunk, e.g. ?geom_point
. Don't worry if the help doesn't seem that helpful --- instead skip down to the bottom of the help page and look for a code example that matches what you're trying to do.
If that doesn't help, carefully read the error message that appears when you run your (non-working) code. Sometimes the answer will be buried there! But when you're new to R, you might not yet know how to understand the error message. Another great tool is Google: try googling the error message, as it's likely someone else has had the same problem, and has gotten help online.
Run ggplot(data = mpg)
what do you see?
"Good job! A ggplot that has no layers looks blank. To finish the graph, add a geom function."
Make a scatterplot of cty
vs hwy
.
"Excellent work!"
"Scatterplots are also called points plots and bubble plots. They use the point geom."
What happens if you make a scatterplot of class
vs drv
. Try it. Why is the plot not useful?
"Nice job! `class` and `drv` are both categorical variables. As a result, points can only appear at certain values, where many points overlap each other. You have no idea how many points fall on top of each other at each location. Experiment with geom_count() to find a better solution."
"The greatest value of a picture is when it forces us to notice what we never expected to see." --- John Tukey
In the plot below, one group of points (highlighted in red) seems to fall outside of the linear trend between engine size and gas mileage. These cars have a higher mileage than you might expect. How can you explain these cars?
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point() + geom_point(data = dplyr::filter(mpg, displ > 5, hwy > 20), colour = "red", size = 2.2)
Let's hypothesize that the cars are hybrids. One way to test this hypothesis is to look at the class
value for each car. The class
variable of the mpg
dataset classifies cars into groups such as compact, midsize, and SUV. If the outlying points are hybrids, they should be classified as compact cars or, perhaps, subcompact cars (keep in mind that this data was collected before hybrid trucks and SUVs became popular). To check this, we need to add the class
variable to the plot.
You can add a third variable, like class
, to a two dimensional scatterplot by mapping it to a new aesthetic. An aesthetic is a visual property of the objects in your plot. Aesthetics include things like the size, the shape, or the color of your points.
You can display a point (like the one below) in different ways by changing the values of its aesthetic properties. Since we already use the word "value" to describe data, let's use the word "level" to describe aesthetic properties. Here we change the levels of a point's size, shape, and color to make the point small, triangular, or blue:
ggplot() + geom_point(aes(1, 1), size = 20) + geom_point(aes(2, 1), size = 10) + geom_point(aes(3, 1), size = 20, shape = 17) + geom_point(aes(4, 1), size = 20, colour = "blue") + scale_x_continuous(NULL, limits = c(0.5, 4.5), labels = NULL) + scale_y_continuous(NULL, limits = c(0.9, 1.1), labels = NULL) + theme(aspect.ratio = 1/3)
We can add the class
variable to the plot by mapping the levels of an aesthetic (like color) to the values of class
. For example, we can color a point green if it belongs to the compact class, blue if it belongs to the midsize class, and so on.
Let's give this a try. Fill in the blank piece of code below with color = class
. What happens? Delete the commenting symbols (#
) before running your code. (If you prefer British English, you can use colour
instead of color
.)
# ggplot(data = mpg) + # geom_point(mapping = aes(x = displ, y = hwy, ____________))
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = class))
"Great Job! You can now tell which class of car each point represents by examining the color of the point."
The colors reveal that many of the unusual points in mpg
are two-seater cars. These cars don't seem like hybrids, and are, in fact, sports cars! Sports cars have large engines like SUVs and pickup trucks, but small bodies like midsize and compact cars, which improves their gas mileage. In hindsight, these cars were unlikely to be hybrids since they have large engines.
This isn't the only insight we've gleaned; you've also learned how to add new aesthetics to your graph. Let's review the process.
To map an aesthetic to a variable, set the name of the aesthetic equal to the name of the variable, and do this inside mapping = aes()
. ggplot2 will automatically assign a unique level of the aesthetic (here a unique color) to each unique value of the variable. ggplot2 will also add a legend that explains which levels correspond to which values.
This insight gives us a new way to think about the mapping argument. Mappings tell ggplot2 more than which variables to put on which axes, they tell ggplot2 which variables to map to which visual properties. The x and y locations of each point are just two of the many visual properties displayed by a point.
In the above example, we mapped color to class
, but we could have mapped size to class
in the same way.
Change the code below to map size
to class
. What happens?
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = class))
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, size = class))
"Great Job! Now the size of a point represents its class. Did you notice the warning message? ggplot2 gives us a warning here because mapping an unordered variable (class) to an ordered aesthetic (size) is not a good idea."
You can also map class
to the alpha
aesthetic, which controls the transparency of the points. Try it below.
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
"Great Job! If you look closely, you can spot something subtle: many locations contain multiple points stacked on top of each other (alpha is additive so multiple transparent points will appear opaque)."
Let's try one more aesthetic. This time map the class of the points to shape
, then look for the SUVs. What happened?
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, shape = class))
"Good work! What happened to the SUVs? ggplot2 will only use six shapes at a time. By default, additional groups will go unplotted when you use the shape aesthetic. So only use it when you have fewer than seven groups."
In the code below, map cty
, which is a continuous variable, to color
, size
, and shape
. How do these aesthetics behave differently for continuous variables, like cty
, vs. categorical variables, like class
?
# Map cty to color ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) # Map cty to size ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) # Map cty to shape ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))
"Very nice! ggplot2 treats continuous and categorical variables differently. Noteably, ggplot2 supplies a blue gradient when you map a continuous variable to color, and ggplot2 will not map continuous variables to shape."
Map class
to color
, size
, and shape
all in the same plot. Does it work?
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))
"Very nice! ggplot2 can map the same variable to multiple aesthetics."
What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)
? Try it.
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))
"Good job! ggplot2 will map the aesthetic to the results of the expression. Here, ggplot2 mapped the color of each point to TRUE or FALSE based on whther the point's `displ` value was less than five."
What if you just want to make all of the points in your plot blue, like in the plot below?
ggplot(mpg) + geom_point(aes(displ, hwy), color = "blue")
You can do this by setting the color aesthetic outside of the aes()
function, like this
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
Setting works for every aesthetic in ggplot2. If you want to manually set the aesthetic to a value in the visual space, set the aesthetic outside of aes()
.
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy), color = "blue", shape = 3, alpha = 0.5)
If you want to map the aesthetic to a variable in the data space, map the aesthetic inside aes()
.
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = class, shape = fl, alpha = displ))
What do you think went wrong in the code below? Fix the code so it does something sensible.
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
"Good job! Putting an aesthetic in the wrong location is one of the most common graphing errors. Sometimes it helps to think of legends. If you will need a legend to understand what the color/shape/etc. means, then you should probably put the aesthetic inside `aes()` --- ggplot2 will build a legend for every aesthetic mapped here. If the aesthetic has no meaning and is just... well, aesthetic, then set it outside of `aes()`."
For each aesthetic, you associate the name of the aesthetic with a variable to display, and you do this within aes()
.
Once you map a variable to an aesthetic, ggplot2 takes care of the rest. It selects a reasonable scale to use with the aesthetic, and it constructs a legend that explains the mapping between levels and values. For x and y aesthetics, ggplot2 does not create a legend, but it creates an axis line with tick marks and a label. The axis line acts as a legend; it explains the mapping between locations and values.
You've experimented with the most common aesthetics for points: x
, y
, color
, size
, alpha
and shape
. Each geom uses its own set of aesthetics (you wouldn't expect a line to have a shape, for example). To find out which aesthetics a geom uses, open its help page, e.g. ?geom_line
.
This raises a new question that we've only brushed over: what is a geom?
How are these two plots similar?
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) ggplot(data = mpg) + geom_smooth(mapping = aes(x = displ, y = hwy))
Both plots contain the same x variable, the same y variable, and both describe the same data. But the plots are not identical. Each plot uses a different visual object to represent the data. In ggplot2 syntax, we say that they use different geoms.
A geom is the geometrical object that a plot uses to represent observations. People often describe plots by the type of geom that the plot uses. For example, bar charts use bar geoms, line charts use line geoms, boxplots use boxplot geoms, and so on. Scatterplots break the trend; they use the point geom.
As we see above, you can use different geoms to plot the same data. The plot on the left uses the point geom, and the plot on the right uses the smooth geom, a smooth line fitted to the data.
To change the geom in your plot, change the geom function that you add to ggplot()
. For example, take this code which makes the plot on the left (above), and change geom_point()
to geom_smooth()
. What do you get?
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))
# right ggplot(data = mpg) + geom_smooth(mapping = aes(x = displ, y = hwy))
"Good job! You get the plot on the right (above)."
ggplot2 provides over 30 geom functions that you can use to make plots, and extension packages provide even more (see https://exts.ggplot2.tidyverse.org/index.html for a sampling). You'll learn how to use these geoms to explore data in the Visualize Data primer.
Until then, the best way to get a comprehensive overview of the available geoms is with the ggplot2 cheatsheet. To learn more about any single geom, look at its help page, e.g. ?geom_smooth
.
What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?
question("What does the `se` argument to `geom_smooth()` do?", answer("Nothing. `se` is not an argument of `geom_smooth()`"), answer("chooses a method for calculating the smooth line", message = "That is what the `method` argument of `geom_smooth()` does."), answer("controls whether or not to **s**how **e**rrors"), answer("Adds or removes a standard error ribbon around the smooth line", correct = TRUE), allow_retry = TRUE )
The ideas that you've learned here: geoms, aesthetics, and the implied existence of a data space and a visual space combine to form a system known as the Grammar of Graphics.
The Grammar of Graphics provides a systematic way to build any graph, and it underlies the ggplot2 package. In fact, the first two letters of ggplot2 stand for "Grammar of Graphics".
The best way to understand the Grammar of Graphics is to see it explained in action:
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.