Layers

library(learnr)
library(tutorial.helpers)
library(tidyverse)
library(ggridges)
library(maps)
knitr::opts_chunk$set(echo = FALSE)
options(tutorial.exercise.timelimit = 60, 
        tutorial.storage = "local") 

nz <- map_data("nz")

polling_data <- read_csv("data/vis4_polling_data.csv",
              col_types = cols(createddate = col_date(),
                               approval_type = col_character(),
                               rate = col_double()))


Introduction

This tutorial covers Chapter 9: Layers from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. You will learn how to use ggplot2 and related packages to create persuasive graphics. Key functions include geom_smooth(), geom_histogram(), and facet_wrap().

Aesthetic mappings

We’ll start with a deeper dive into aesthetic mappings.

Exercise 1

Load the tidyverse package.


library(...)
library(tidyverse)

“The greatest value of a picture is when it forces us to notice what we never expected to see.” — John Tukey

Exercise 2

Run glimpse() on mpg.


glimpse(...)
glimpse(mpg)

The mpg tibble bundled with the ggplot2 package contains 234 observations on 38 car models. The key variables for us include displ (engine size), hwy (fuel efficiency on the highway), and class (type of car).

Exercise 3

Run ggplot(), setting the first argument, data, equal to mpg.


ggplot(data = ...)
ggplot(data = mpg)

Note that this just creates an empty area in which a plot can be placed. Because data is the first argument, we don't need to name it. We get the same result with ggplot(mpg).

Exercise 4

Continue the code by setting the mapping argument equal to a call to aes(), with the x argument set to displ and the y argument set to hwy.


ggplot(mpg, aes(x = ..., 
                ... = hwy))
ggplot(mpg, aes(x = displ, 
                y = hwy))

We want to visualize the relationship between the car's engine size and fuel efficiency on the highway. Because we have not added a geom, no data is plotted. But the axes are now labelled because R knows the range of the displ and hwy variables.

Exercise 5

Add geom_point() --- don't forget the + after the call to ggplot() --- to see the actual data.


ggplot(mpg, aes(x = displ, 
                y = hwy)) +
  ...()
ggplot(mpg, aes(x = displ, 
                y = hwy)) +
  geom_point()

A scatter plot like this is the most common way to look at data. If there is a lot of overlap among the data points, you may want to use geom_jitter() to separate the points.

Exercise 6

Set the color argument within aes() to class. Each point will have a color based on the class of the car.


ggplot(mpg, aes(x = displ, 
                y = hwy, 
                color = ...)) +
  geom_point()
ggplot(mpg, aes(x = displ, 
                y = hwy,
                color = class)) +
  geom_point()

The plot is much richer, even though it isn't any larger, because we have used color to incorporate information about the type of car. Notice how the SUVs cluster in the lower right. Their engines are (mostly) bigger, and they all get worse gas mileage.

Exercise 7

Replace the color argument with the shape argument within aes(). Set the value to class, as before. Each point will have a shape based on the class of the car.


ggplot(mpg, aes(x = displ, 
                y = hwy, 
                shape = ...)) +
  geom_point()
ggplot(mpg, aes(x = displ, 
                y = hwy,
                shape = class)) +
  geom_point()

When class is mapped to shape, we get two warnings:

1: The shape palette can deal with a maximum of 6 discrete values because more than 6 becomes difficult to discriminate; you have 7. Consider specifying shapes manually if you must have them.

2: Removed 62 rows containing missing values (geom_point()).

Since ggplot() will only use six shapes at a time, by default, additional groups will go unplotted when you use the shape aesthetic. The second warning is related – there are 62 SUVs in the dataset and they’re not plotted.
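The first warning suggests specifying shapes manually. Here is a minimal sketch of one way to do that, assuming scale_shape_manual() and seven shape codes picked only for illustration (the names class_shapes and shape_p are hypothetical):

```r
library(ggplot2)

# One shape code per value of class. These particular codes are an
# arbitrary illustrative choice, not a recommendation.
class_shapes <- c(16, 17, 15, 3, 7, 8, 4)

shape_p <- ggplot(mpg, aes(x = displ,
                           y = hwy,
                           shape = class)) +
  geom_point() +
  scale_shape_manual(values = class_shapes)

shape_p
```

Because seven shapes are supplied, every class, including the SUVs, should now be plotted.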

Exercise 8

Replace the shape argument with the size argument within aes(). Set the value to class, as before. Each point will have a size based on the class of the car.


ggplot(mpg, aes(x = displ, 
                y = hwy, 
                size = ...)) +
  geom_point()
ggplot(mpg, aes(x = displ, 
                y = hwy,
                size = class)) +
  geom_point()

You should see a warning message:

Using size for a discrete variable is not advised. 

Mapping an unordered discrete (categorical) variable, like class, to an ordered aesthetic, like size, is generally not a good idea because it implies a ranking that does not in fact exist, hence the warning.

Exercise 9

Replace the size argument with the alpha argument within aes(). Set the value to class, as before. Each point will have a transparency based on the class of the car.


ggplot(mpg, aes(x = displ, 
                y = hwy, 
                alpha = ...)) +
  geom_point()
ggplot(mpg, aes(x = displ, 
                y = hwy,
                alpha = class)) +
  geom_point()

You should see a warning message, as above:

Using alpha for a discrete variable is not advised. 

Once you map an aesthetic, ggplot2 takes care of the rest. It selects a reasonable scale to use with the aesthetic, and it constructs a legend that explains the mapping between levels and values. For x and y aesthetics, ggplot2 does not create a legend, but it creates an axis line with tick marks and a label. The axis line provides the same information as a legend; it explains the mapping between locations and values.

Exercise 10

You can also set the visual properties of your geom manually as an argument of your geom function (outside of aes()) instead of relying on a variable mapping to determine the appearance. Remove the alpha argument from your code and add color = "blue" within the call to geom_point().


ggplot(mpg, aes(x = ..., 
                y = hwy)) + 
  geom_point(color = "...")
ggplot(mpg, aes(x = displ, 
                y = hwy)) + 
  geom_point(color = "blue")

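Here, the color doesn’t convey information about a variable, but only changes the appearance of the plot. You’ll need to pick a value that makes sense for that aesthetic: the name of a color as a character string, the size of a point in mm, or the shape of a point as a number.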
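The difference between setting and mapping an aesthetic can be seen side by side. This is only a sketch; the object names set_p and map_p are made up for illustration:

```r
library(ggplot2)

# Setting: color is a fixed property of the layer, so every point is blue.
set_p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "blue")

# Mapping: inside aes(), "blue" is treated as a constant variable with a
# single value, so ggplot2 picks a color from its default palette and
# adds a legend instead of drawing blue points.
map_p <- ggplot(mpg, aes(x = displ, y = hwy, color = "blue")) +
  geom_point()

set_p
map_p
```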

Geometric objects

So far we have discussed aesthetics that we can map or set in a scatterplot, when using a point geom. You can learn more about all possible aesthetic mappings in the aesthetic specifications vignette.

Exercise 1

To change the geom in your plot, change the geom function that you add to ggplot(). Let's start with the plot which we were just working with. (You need to copy this code by hand from the previous question because this is a new Section.) Remove color = "blue" from the call to geom_point().


ggplot(..., aes(x = displ, 
                y = hwy)) + 
  geom_point()
ggplot(mpg, aes(x = displ, 
                y = hwy)) + 
  geom_point()

Every geom function in ggplot2 takes a mapping argument, either defined locally in the geom layer or globally in the ggplot() layer.

Exercise 2

Replace geom_point() with geom_smooth(). Both plots contain the same x variable, the same y variable, and both describe the same data. But the plots are not identical. Each plot uses a different geometric object, geom, to represent the data.


ggplot(mpg, aes(x = displ, 
                y = hwy)) + 
  ...()
ggplot(mpg, aes(x = displ, 
                y = hwy)) + 
  geom_smooth()

Not every aesthetic works with every geom. You could set the shape of a point, but you couldn’t set the “shape” of a line. If you try, ggplot2 will silently ignore that aesthetic mapping.

Exercise 3

Note the message which gets displayed when calling geom_smooth().

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

geom_smooth() is informing us of the values which it is using, by default, for the method and formula arguments. Add method = 'loess' to the geom_smooth() call.


ggplot(mpg, aes(x = displ, 
                y = hwy)) + 
  geom_smooth(... = 'loess')
ggplot(mpg, aes(x = displ, 
                y = hwy)) + 
  geom_smooth(method = 'loess')

Note how the message has changed.

`geom_smooth()` using formula = 'y ~ x'

ggplot2 thinks that method = 'loess' and formula = 'y ~ x' are good default values, but it wants to be sure that you are aware of them. So, it will keep telling you about them until you make the choice explicitly in your code, which you should do.

Exercise 4

Make two more additions to the call to geom_smooth(). First, add formula = 'y ~ x' to remove the message. Second, add se = FALSE.


ggplot(mpg, aes(x = displ, 
                y = hwy)) + 
  geom_smooth(method = 'loess', 
              ... = 'y ~ x',
              se = ...)
ggplot(mpg, aes(x = displ, 
                y = hwy)) + 
  geom_smooth(method = 'loess', 
              formula = 'y ~ x',
              se = FALSE)

The se argument, which stands for "standard error," determines whether a confidence band is drawn around the line. By default, the value of se is TRUE. We set it to FALSE here to make the graphic less cluttered.
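For comparison, here is a sketch of the same smoother with the band turned back on. The level argument of geom_smooth() controls the confidence level of the band; the value 0.9 and the name band_p are illustrative assumptions:

```r
library(ggplot2)

# se = TRUE draws the confidence band; level sets its confidence level.
# 0.9 is an arbitrary illustrative choice (the default is 0.95).
band_p <- ggplot(mpg, aes(x = displ,
                          y = hwy)) +
  geom_smooth(method = "loess",
              formula = y ~ x,
              se = TRUE,
              level = 0.9)

band_p
```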

Exercise 5

You can set the linetype of a line. geom_smooth() will draw a different line, with a different linetype, for each unique value of the variable that you map to linetype. Within the call to aes(), add linetype = drv.


ggplot(mpg, aes(x = displ, 
                y = hwy, 
                linetype = ...)) + 
  geom_smooth(method = 'loess', 
              formula = 'y ~ x',
              se = FALSE)
ggplot(mpg, aes(x = displ, 
                y = hwy, 
                linetype = drv)) + 
  geom_smooth(method = 'loess', 
              formula = 'y ~ x',
              se = FALSE)

Here, geom_smooth() separates the cars into three lines based on their drv value, which describes a car’s drive train. One line describes all of the points that have a 4 value. One line describes all of the points that have an f value. And one line describes all of the points that have an r value. Here, 4 stands for four-wheel drive, f for front-wheel drive, and r for rear-wheel drive.

Exercise 6

Replace linetype = drv with color = drv within aes().


ggplot(mpg, aes(x = displ, 
                y = hwy, 
                ... = drv)) + 
  geom_smooth(method = 'loess', 
              formula = 'y ~ x',
              se = FALSE)
ggplot(mpg, aes(x = displ, 
                y = hwy, 
                color = drv)) + 
  geom_smooth(method = 'loess', 
              formula = 'y ~ x',
              se = FALSE)

The geom_smooth() geom is applied to each category of color separately. With one call to a geom, we get three different lines, all on the same graph. There are no front-wheel drive cars with a displ value greater than 5.3, so the green line does not extend to the right side of the graphic.

Exercise 7

If this sounds strange, we can make it clearer by overlaying the lines on top of the raw data. To do this, simply add a call to geom_point() in between ggplot() and geom_smooth(). Don't forget to separate different lines with +.


ggplot(mpg, aes(x = displ, 
                y = hwy, 
                color = drv)) + 
  ...() +
  geom_smooth(method = 'loess', 
              formula = 'y ~ x',
              se = FALSE)
ggplot(mpg, aes(x = displ, 
                y = hwy, 
                color = drv)) + 
  geom_point() +
  geom_smooth(method = 'loess', 
              formula = 'y ~ x',
              se = FALSE)

Notice that this plot contains two geoms in the same graph.

Exercise 8

The aes() values cascade across all the geoms, unless and until we change them ourselves. To see this, add aes(linetype = drv) to the call to geom_smooth().


ggplot(mpg, aes(x = displ, 
                y = hwy, 
                color = drv)) + 
  ... +
  geom_smooth(...(linetype = ...))
ggplot(mpg, aes(x = displ, 
                y = hwy, 
                color = drv)) + 
  geom_point() +
  geom_smooth(method = 'loess', 
              formula = 'y ~ x',
              se = FALSE,
              aes(linetype = drv))

Compared to the previous plot, the linetype now varies by drv.

Many geoms, like geom_smooth(), use a single geometric object to display multiple rows of data. For these geoms, you can set the group aesthetic to a categorical variable to draw multiple objects.

Exercise 9

To explore the use of the group aesthetic, begin with our standard plot, which uses just aes(x = displ, y = hwy) in the call to ggplot() and just method = 'loess', formula = 'y ~ x', and se = FALSE in the call to geom_smooth().


ggplot(mpg, aes(x = displ, 
                y = hwy)) +
  geom_smooth(method = ..., 
              ... = 'y ~ x',
              se = ...)
ggplot(mpg, aes(x = displ, 
                y = hwy)) +
  geom_smooth(method = 'loess', 
              formula = 'y ~ x',
              se = FALSE)

Note how the line extends across the full range of displ because we are not splitting up the geom for different values of drv.

Exercise 10

ggplot2 will draw a separate object for each unique value of the grouping variable. In practice, ggplot2 will automatically group the data for these geoms whenever you map an aesthetic to a discrete variable (as in the linetype example).

Add aes(group = drv) to the call to geom_smooth().


ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_smooth(method = 'loess', 
              formula = 'y ~ x',
              se = FALSE,
              ...(group = ...))
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_smooth(method = 'loess', 
              formula = 'y ~ x',
              se = FALSE,
              aes(group = drv))

It is convenient to rely on this feature because the group aesthetic by itself does not add a legend or distinguishing features to the geoms.

Exercise 11

Switch group = drv to color = drv and add show.legend = FALSE to the call to geom_smooth().


ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_smooth(method = 'loess', 
              formula = 'y ~ x',
              se = FALSE,
              aes(color = drv), 
              ... = FALSE)
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_smooth(method = 'loess', 
              formula = 'y ~ x',
              se = FALSE,
              aes(color = drv), 
              show.legend = FALSE)

It is sometimes difficult to know if the color aesthetic should be set in the ggplot() call or in the call to the individual geom. Experiment!

Exercise 12

To explore some of these concepts, let's construct this plot:

geom_p <- ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_point(
    data = mpg |> filter(class == "2seater"), 
    color = "red"
  ) +
  geom_point(
    data = mpg |> filter(class == "2seater"), 
    shape = "circle open", size = 3, color = "red"
  )

geom_p

Begin with ggplot(mpg, aes(x = displ, y = hwy)) and use geom_point().


ggplot(mpg, aes(x = displ, y = ...)) + 
  ...()
ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point()

The easiest way to build a complex plot is interactively. Add one line. See how it looks. Add another line. And so on.

Exercise 13

Add a second call to geom_point(). In that call, set data equal to mpg |> filter(class == "2seater") and color equal to "red".


ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_point(
    ... = mpg |> filter(class == "2seater"), 
    color = ...
  )
ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_point(
    data = mpg |> filter(class == "2seater"), 
    color = "red"
  )

By default, calls to geoms use whatever values for the data argument have been given previously. But, the data argument can also be changed. In this case, the second call to geom_point() only applies to the five rows for which class is "2seater".
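To check which rows the second layer uses, the filtered data can be stored and inspected before plotting. A small sketch; the name two_seaters is hypothetical:

```r
library(dplyr)
library(ggplot2)

# The subset passed to the second geom_point() layer.
two_seaters <- mpg |>
  filter(class == "2seater")

# Number of rows that the red layer will draw.
nrow(two_seaters)

# Same plot as in the exercise, reusing the stored subset.
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_point(data = two_seaters, color = "red")
```

Storing the subset once also avoids repeating the filter() call when a later layer needs the same rows.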

Exercise 14

Add a third call to geom_point(), setting data equal to mpg |> filter(class == "2seater"), color equal to "red" (as in the previous call), shape equal to "circle open", and size equal to 3.



ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_point(
    data = mpg |> filter(class == "2seater"),
    color = "red"
  ) +
  geom_point(
    data = mpg |> filter(class == "2seater"), 
    color = "red",
    shape = "circle open",
    size = 3 
  )

We need the two extra calls to geom_point() because the first one makes the central point red while the second draws the red circle around it.

Exercise 15

Geoms are the fundamental building blocks of ggplot2. You can completely transform the look of your plot by changing its geom, and different geoms can reveal different features of your data.

Make a new plot by hitting "Run Code".

ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2)

The histogram reveals that the distribution of highway mileage is bimodal and right skewed. You can see a similar plot if you replace geom_histogram(binwidth = 2) with geom_density().

Exercise 16

Hit "Run Code" to create a boxplot with the same data.

ggplot(mpg, aes(x = hwy)) +
  geom_boxplot()

geom_boxplot() helps to highlight the two outlier data points. ggplot2 provides more than 40 geoms but these don’t cover all possible plots one could make. If you need a different geom, look at extension packages first to see if someone else has already implemented it (see https://exts.ggplot2.tidyverse.org/gallery/ for a sampling).

Exercise 17

The ggridges package makes ridgeline plots, which are useful for visualizing the density of a numerical variable for different levels of a categorical variable. Hit "Run Code" to see an example.

library(ggridges)

ggplot(mpg, aes(x = hwy, y = drv, fill = drv, color = drv)) +
  geom_density_ridges(alpha = 0.5, show.legend = FALSE)

The best place to get a comprehensive overview of all of the geoms ggplot2 offers, as well as all functions in the package, is the reference page. To learn more about any single geom, use the help (e.g., ?geom_smooth).

Facets

Facets, created with functions like facet_wrap() and facet_grid(), are the easiest way to create plot "multiples," copies of the same plot for each value of a third variable.

Exercise 1

Start with our usual basic scatter plot, displ versus hwy from the mpg tibble.


ggplot(mpg, aes(... = displ, 
                y = ...)) + 
  ...()
ggplot(mpg, aes(x = displ, 
                y = hwy)) + 
  geom_point()

To facet your plot by a single variable, use facet_wrap(). The first argument of facet_wrap() is a formula, which you create with ~ followed by a variable name. The variable that you pass to facet_wrap() should be categorical, i.e., either character or factor.

Exercise 2

Add facet_wrap(~cyl) to the end of your plot code. Don't forget the + after geom_point().


ggplot(mpg, aes(x = displ, 
                y = hwy)) + 
  geom_point() + 
  facet_wrap(~ ...)
ggplot(mpg, aes(x = displ, 
                y = hwy)) + 
  geom_point() + 
  facet_wrap(~cyl)

facet_wrap() wraps a 1d sequence of panels into 2d. This is generally a better use of screen space than facet_grid() because most displays are roughly rectangular.

Exercise 3

To facet your plot with the combination of two variables, switch from facet_wrap() to facet_grid(). The first argument of facet_grid() is also a formula, but now it’s a double sided formula: rows ~ cols.

Replace facet_wrap(~cyl) with facet_grid(drv ~ cyl).


ggplot(mpg, aes(x = displ, 
                y = hwy)) + 
  geom_point() + 
  facet_...(... ~ cyl)
ggplot(mpg, aes(x = displ, 
                y = hwy)) + 
  geom_point() + 
  facet_grid(drv ~ cyl)

By default, each of the facets shares the same scale and range for the x and y axes. This is useful when you want to compare data across facets, but it can be limiting when you want to visualize the relationship within each facet better.

Exercise 4

Setting the scales argument in a faceting function to "free" will allow for different axis scales across both rows and columns, "free_x" will allow for different x-axis scales across columns, and "free_y" will allow for different y-axis scales across rows.

Add scales = "free_y" to the call to facet_grid(). Don't forget the comma after drv ~ cyl.


ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point() + 
  facet_grid(drv ~ cyl, ... = ...)
ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point() + 
  facet_grid(drv ~ cyl, scales = "free_y")

facet_wrap() and facet_grid() include lots of useful arguments, but note that the names of those arguments are not always consistent, for historical reasons. Consider nrow from facet_wrap() versus rows from facet_grid().
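The naming difference can be seen in a short sketch. The object names base_p, wrap_p, and grid_p are invented for illustration:

```r
library(ggplot2)

base_p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# facet_wrap(): nrow is a number, the count of rows the panels occupy.
wrap_p <- base_p +
  facet_wrap(~cyl, nrow = 1)

# facet_grid(): rows and cols name the faceting variables themselves,
# an alternative to the formula drv ~ cyl.
grid_p <- base_p +
  facet_grid(rows = vars(drv), cols = vars(cyl))

wrap_p
grid_p
```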

Statistical transformations

Instead of simply plotting data, it is sometimes useful to calculate a statistic based on the data and then plot the value of that statistic.

Exercise 1

The diamonds dataset is part of the ggplot2 package. Type diamonds and hit "Run Code".


diamonds
diamonds

The diamonds dataset contains information on about 54,000 diamonds, including the price, carat, color, clarity, and cut of each diamond.

Exercise 2

Pipe diamonds into ggplot().


diamonds |> 
  ...()
diamonds |> 
  ggplot()

Instead of passing in diamonds as the first argument to ggplot(), we can "pipe" it in with the |> operator. Recall that |> takes the object on its left side, in this case diamonds, and passes it as the value of the first argument to the function on its right side, in this case ggplot().

Exercise 3

Add aes(x = cut) to the call to ggplot().


diamonds |> 
  ggplot(aes(... = cut))
diamonds |> 
  ggplot(aes(x = cut))

The plot has no data but the x-axis is correctly labelled with all the possible values of the variable cut.

Exercise 4

Add geom_bar() as the geom. Don't forget the + after the call to ggplot().


diamonds |> 
  ggplot(aes(x = cut)) + 
  ...()
diamonds |> 
  ggplot(aes(x = cut)) + 
  geom_bar()

On the x-axis, the chart displays cut, a variable from diamonds. On the y-axis, it displays count, but count is not a variable in diamonds! It is a "statistical transformation" which ggplot2 has calculated for us.

Exercise 5

Pipe diamonds into count(cut).



diamonds |>
  count(cut)

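Many graphs, like scatterplots, plot the raw values of your dataset. Other graphs, like bar charts, calculate new values to plot. A bar chart, for example, bins your data and then plots the number of observations that fall in each bin.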

Exercise 6

count() outputs a tibble with values of the variable(s) you selected, along with the number of observations corresponding to each value. We can pipe this output directly into ggplot(), along with aes(x = cut, y = n).


diamonds |>
  count(cut) |>
  ggplot(aes(x = ..., ... = n))
diamonds |>
  count(cut) |>
  ggplot(aes(x = cut, y = n))

The algorithm used to calculate new values for a graph is called a "stat," short for statistical transformation. In this case, we are calculating the stat by hand, and then using the calculated value n to make a plot.

Exercise 7

Add geom_bar(stat = 'identity') to the pipeline.



diamonds |>
  count(cut) |>
  ggplot(aes(x = cut, y = n)) +
  geom_bar(stat = 'identity')

The result is the same as the much more straightforward call to geom_bar() which we used above.
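When the counts are already computed, ggplot2 also offers geom_col(), which plots the y values as given. A minimal sketch, with cut_counts and col_p as illustrative names:

```r
library(dplyr)
library(ggplot2)

# Pre-computed counts: one row per cut, with the count stored in n.
cut_counts <- diamonds |>
  count(cut)

# geom_col() uses the y values directly, so no stat argument is needed.
col_p <- ggplot(cut_counts, aes(x = cut, y = n)) +
  geom_col()

col_p
```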

Exercise 8

You might want to override the default mapping from transformed variables to aesthetics. For example, you might want to display a bar chart of proportions, rather than counts. Hit "Run Code".

ggplot(diamonds, aes(x = cut, y = after_stat(prop), group = 1)) + 
  geom_bar()

To find the possible variables that can be computed by the stat, look for the section titled “computed variables” in the help for geom_bar().
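To see roughly what after_stat(prop) with group = 1 computes, the proportions can be calculated by hand. This is a sketch under the assumption that each bar's proportion is its count divided by the total number of diamonds; cut_props is a hypothetical name:

```r
library(dplyr)
library(ggplot2)

# Count each cut, then divide by the overall total to get proportions.
cut_props <- diamonds |>
  count(cut) |>
  mutate(prop = n / sum(n))

cut_props

# Plot the hand-computed proportions directly.
ggplot(cut_props, aes(x = cut, y = prop)) +
  geom_col()
```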

Exercise 9

You might want to draw greater attention to the statistical transformation in your code. For example, you might use stat_summary(), which summarizes the y values for each unique x value, to draw attention to the summary that you’re computing. Hit "Run Code".

ggplot(diamonds) + 
  stat_summary(
    aes(x = cut, y = depth),
    fun.min = min,
    fun.max = max,
    fun = median
  )

ggplot2 provides more than 20 stats for you to use. Each stat is a function, so you can get help in the usual way, e.g., ?stat_bin.
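As a cross-check on the stat_summary() example, the same minimum, median, and maximum can be computed with dplyr and drawn with geom_pointrange(). A sketch only; depth_summary and its column names are illustrative choices:

```r
library(dplyr)
library(ggplot2)

# One row per cut with the three summaries of depth.
depth_summary <- diamonds |>
  group_by(cut) |>
  summarize(ymin = min(depth),
            y = median(depth),
            ymax = max(depth))

depth_summary

# Draw the median as a point and the min-to-max range as a line.
ggplot(depth_summary, aes(x = cut, y = y, ymin = ymin, ymax = ymax)) +
  geom_pointrange()
```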

Position adjustments

There’s one more piece of magic associated with bar charts. You can color a bar chart using either the color aesthetic, or, more usefully, the fill aesthetic.

Exercise 1

Create a plot which starts with ggplot(mpg, aes(x = drv, color = drv)) and then adds, after a +, geom_bar().


ggplot(mpg, aes(... = drv, color = drv)) + 
  ...()
ggplot(mpg, aes(x = drv, color = drv)) + 
  geom_bar()

This is a basic bar chart, but with the edges of each bar colored according to the value of drv.

Exercise 2

Change color = drv to fill = drv.


ggplot(mpg, aes(x = drv, ... = drv)) + 
  geom_bar()
ggplot(mpg, aes(x = drv, fill = drv)) + 
  geom_bar()

It may seem strange to use drv as the argument to both x and to fill. Yet this is a common approach. The two arguments are doing two different things but, in this case, both of those things are associated with drv.

Exercise 3

Change fill = drv to fill = class.


ggplot(mpg, aes(x = drv, fill = ...)) + 
  geom_bar()
ggplot(mpg, aes(x = drv, fill = class)) + 
  geom_bar()

Note what happens if you map the fill aesthetic to another variable, like class: the bars are automatically stacked. Each colored rectangle represents a combination of drv and class.

Exercise 4

The stacking is performed automatically using the position adjustment specified by the position argument. If you don’t want a stacked bar chart, you can use one of three other options: "identity", "dodge" or "fill". To explore, take the previous plot and add alpha = 1/5, position = "identity" to the call to geom_bar().


ggplot(mpg, aes(x = drv, fill = class)) + 
  geom_bar(alpha = ..., ... = "identity")
ggplot(mpg, aes(x = drv, fill = class)) + 
  geom_bar(alpha = 1/5, position = "identity")

position = "identity" will place each object exactly where it falls in the context of the graph. This is not very useful for bars, because it overlaps them. To see that overlapping, we need to either make the bars slightly transparent by setting alpha to a small value or make them completely transparent by setting fill = NA.

Exercise 5

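We can achieve a similar effect by using fill = NA instead of alpha = 1/5 in the call to geom_bar(). Also change fill = class to color = class within aes(), so that the outline of each bar shows its class.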


ggplot(mpg, aes(x = drv, color = class)) + 
  geom_bar(... = NA, position = "identity")
ggplot(mpg, aes(x = drv, color = class)) + 
  geom_bar(fill = NA, position = "identity")

The identity position adjustment is more useful for 2d geoms, like points, where it is the default.

Exercise 6

position = "fill" works like stacking, but makes each set of stacked bars the same height. Use the same code, but remove fill = NA, change color = class back to fill = class, and replace "identity" with "fill".



ggplot(mpg, aes(x = drv, fill = class)) + 
  geom_bar(position = "fill")

This makes it easier to compare proportions across groups.

Exercise 7

position = "dodge" places overlapping objects directly beside one another. Replace "fill" with "dodge".


ggplot(mpg, aes(x = drv, fill = class)) + 
  geom_bar(position = ...)
ggplot(mpg, aes(x = drv, fill = class)) + 
  geom_bar(position = "dodge")

This makes it easier to compare individual values.

Exercise 8

There’s one other type of adjustment that’s not useful for bar charts, but can be very useful for scatterplots. Recall our first scatterplot. Did you notice that the plot displays only 126 points, even though there are 234 observations in the dataset? Many observations share the same displ and hwy values, so their points are drawn on top of one another. Starting with our usual ggplot(mpg, aes(x = displ, y = hwy)), add geom_point(position = "jitter").


ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point(... = "jitter")
ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point(position = "jitter")

Adding randomness seems like a strange way to improve your plot, but while it makes your graph less accurate at small scales, it makes your graph more revealing at large scales. Because this is such a useful operation, ggplot2 comes with a shorthand for geom_point(position = "jitter"): geom_jitter().
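The amount of overlap can be checked by counting the distinct (displ, hwy) positions, and the amount of jitter can be controlled with position_jitter(). A sketch; the width, height, and seed values and the object names are arbitrary illustrative choices:

```r
library(dplyr)
library(ggplot2)

# Total observations versus distinct plotting positions.
n_obs <- nrow(mpg)
n_positions <- mpg |>
  distinct(displ, hwy) |>
  nrow()

n_obs
n_positions

# position_jitter() lets you set how far points move and fix the
# random seed so the plot is reproducible.
jitter_p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(position = position_jitter(width = 0.1,
                                        height = 0.5,
                                        seed = 1))

jitter_p
```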

To learn more about a position adjustment, look up the help page associated with each adjustment: ?position_dodge, ?position_fill, ?position_identity, ?position_jitter, and ?position_stack.

Coordinate Systems

Coordinate systems are probably the most complicated part of ggplot2. The default coordinate system is the Cartesian coordinate system where the x and y positions act independently to determine the location of each point.

Exercise 1

Load the maps package with library().


library(...)
library(maps)

The maps package provides outline data for geographic maps, including the "nz" map of New Zealand used in the next exercise. The map_data() function from ggplot2 turns such a map into a data frame suitable for plotting.

Exercise 2

Hit "Run Code" to create an object with data for New Zealand and then create a plot.

nz <- map_data("nz")

ggplot(nz, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "white", color = "black")

We don’t have the space to discuss maps extensively in this tutorial but you can learn more in the Maps chapter of ggplot2: Elegant Graphics for Data Analysis.

Exercise 3

coord_quickmap() sets the aspect ratio correctly for geographic maps. This is very important if you’re plotting spatial data with ggplot2. Add coord_quickmap() to the ggplot() call from the previous question.


... +
  coord_quickmap()
ggplot(nz, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "white", color = "black") +
  coord_quickmap()

coord_polar() uses polar coordinates. Polar coordinates reveal an interesting connection between a bar chart and a Coxcomb chart.
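The connection can be sketched by drawing the same bar chart in two coordinate systems. The choice of the clarity variable and the object name bar are assumptions made for illustration:

```r
library(ggplot2)

# A bar chart with bars that touch (width = 1) in a square panel.
bar <- ggplot(diamonds) +
  geom_bar(aes(x = clarity, fill = clarity),
           show.legend = FALSE,
           width = 1) +
  theme(aspect.ratio = 1)

# Cartesian coordinates: an ordinary bar chart.
bar

# Polar coordinates: the same bars become wedges of a Coxcomb chart.
bar + coord_polar()
```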

The layered grammar of graphics

Consider this overview:

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(
     mapping = aes(<MAPPINGS>),
     stat = <STAT>, 
     position = <POSITION>
  ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION>

Our new template takes seven parameters, the bracketed words that appear in the template. In practice, you rarely need to supply all seven parameters to make a graph because ggplot2 will provide useful defaults for everything except the data, the mappings, and the geom function.

The seven parameters in the template compose the grammar of graphics, a formal system for building plots. The grammar of graphics is based on the insight that you can uniquely describe any plot as a combination of a dataset, a geom, a set of mappings, a stat, a position adjustment, a coordinate system, and a faceting scheme, together with a theme, which the template leaves at its default.
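As an illustration, here is one way the template might be filled in with every parameter written out. The particular choices (geom_bar(), facet_wrap(~year), and the name template_p) are assumptions for the sketch; the stat, position, and coordinate values shown are the defaults that geom_bar() would use anyway:

```r
library(ggplot2)

template_p <- ggplot(data = mpg) +          # <DATA>
  geom_bar(                                 # <GEOM_FUNCTION>
    mapping = aes(x = drv, fill = class),   # <MAPPINGS>
    stat = "count",                         # <STAT>
    position = "stack"                      # <POSITION>
  ) +
  coord_cartesian() +                       # <COORDINATE_FUNCTION>
  facet_wrap(~year)                         # <FACET_FUNCTION>

template_p
```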

Biden Approval

Let's finish this tutorial with a graphic. We will plot Joe Biden's approval rating from January through June 2021, the first ~6 months of his term. We will explore different types of geom_smooth().

We will attempt to recreate the plot below.

biden_p <- ggplot(data = polling_data, 
       mapping = aes(x = createddate,
                     y = rate,
                     color = approval_type)) +
  geom_point(alpha = 0.5) + 
  geom_smooth(method = "gam",
              formula = y ~ s(x, bs = "cs"),
              se = TRUE) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  theme(legend.position = "bottom") +
  labs(x = "Month",
       y = "Percent",
       title = "Biden's Approval Rating For First Half of 2021",
       subtitle = "While Biden's approval rating has stayed relatively static, his disapproval rating \nhas been rising.",
       color = "",
       caption = "Source: FiveThirtyEight")

biden_p

Exercise 1

We will first explore the file that contains this data. Run summary() on polling_data in the console below.


summary(...)

This data was taken from the FiveThirtyEight website.

Exercise 2

Write down the names of the three variables.

question_text(NULL,
    message = "createddate, rate, approval_type",
    answer(NULL,
    correct = TRUE),
    allow_retry = FALSE,
    incorrect = NULL,
    rows = 6)

createddate is the day the poll was finished. rate is the percentage of respondents who approve or disapprove, and approval_type indicates whether the rate is an approval or a disapproval rating.

Exercise 3

We will get started building the plot. Start the function ggplot(), and set the argument data to polling_data.


ggplot(data = ...)

Remember: you should not see a plot here as you have not mapped anything, nor do you have a geom layer.

Exercise 4

Now, map createddate to the x-axis, and map rate to the y-axis.


ggplot(data = polling_data,
       mapping = aes(x = ...,
                     y = ...))

createddate is a date column, and rate is a double column.

Exercise 5

We want to be able to see both Joe Biden's approval and disapproval rate. Therefore, we'll change the color of the dots depending on whether the dot represents an approval/disapproval rating. Set color to approval_type.


ggplot(data = polling_data,
       mapping = aes(x = createddate,
                     y = rate,
                     color = ...))

Every geom function in ggplot2 takes a mapping argument, either defined locally in the geom layer or globally in the ggplot() layer. However, not every aesthetic works with every geom. Now we're ready to start building the plot.

Exercise 6

Next, we will add the points to the graph. Add the layer geom_point().


ggplot(data = polling_data,
       mapping = aes(x = createddate,
                     y = rate,
                     color = approval_type)) +
  ...()

Currently, we have a bit of overplotting. We will rectify this in the next code chunk.

Exercise 7

In the geom_point() layer set the argument alpha to 0.5.


ggplot(data = ...,
       mapping = aes(x = ...,
                     y = ...,
                     color = ...)) +
  geom_point(alpha = ...)

Good! Now you should be able to see more distinct points.

Exercise 8

We want to see the trend lines for Biden's approval and disapproval ratings over this period. Add a geom_smooth() layer.


... + 
  ...()

If you just add this layer, ggplot2 prints a message reporting the smoothing method and formula it chose by default. We will set both explicitly in the following exercises. If you place mappings in a geom function, ggplot2 will treat them as local mappings for the layer. It will use these mappings to extend or overwrite the global mappings for that layer only. This makes it possible to display different aesthetics in different layers.
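As a small illustration of local versus global mappings, the sketch below uses mpg rather than polling_data. The color mapping is local to geom_point(), so the smooth layer inherits only x and y and draws a single line. The name p is illustrative.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +      # global mappings: x and y
  geom_point(aes(color = class)) +               # local mapping: color, points only
  geom_smooth(method = "loess", formula = y ~ x) # inherits x and y, not color

p
```

Moving color = class into the ggplot() call instead would make it global, and geom_smooth() would then fit one line per class.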

Exercise 9

Within geom_smooth(), set the method argument to "gam". We use "gam" since there is a large quantity of data points.


... + 
  geom_smooth(method = ...)

"gam" stands for generalized additive model. It extends the linear model by letting the response depend on smooth functions of the predictors rather than only on straight lines. If you are curious to learn more, GAMs in R is a good resource.

Exercise 10

Within geom_smooth() set formula to y ~ s(x, bs = "cs").


... + 
  geom_smooth(method = "gam",
              formula = ...)

Just as y ~ x describes a straight-line relationship between x and y, y ~ s(x, bs = "cs") describes y as a smooth function of x. s() defines a smooth term, and bs = "cs" selects a cubic regression spline basis with shrinkage. A generalized additive model also copes well with large datasets.
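The same method and formula can be tried on any dataset. The sketch below applies them to mpg as a stand-in for polling_data; it assumes the mgcv package is installed, since ggplot2 relies on it to fit the model. The name gam_plot is illustrative.

```r
library(ggplot2)

gam_plot <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.5) +
  # s(x, bs = "cs") asks for a smooth curve in x instead of a straight line
  geom_smooth(method = "gam",
              formula = y ~ s(x, bs = "cs"))

gam_plot
```

Because both method and formula are given explicitly, ggplot2 no longer needs to report which defaults it picked.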

Exercise 11

Set se = TRUE.


... + 
  geom_smooth(method = "gam",
              formula = y ~ s(x, bs = "cs"),
              se = ...)

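se = TRUE, which is also the default, draws a confidence band around the smooth line. The band shows the uncertainty in the fitted trend; it does not change the data or make it more accurate.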
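The width of the band is controlled by the level argument of geom_smooth(), which defaults to 0.95. The sketch below, using mpg and a linear fit for simplicity, compares two levels. The helper band_width() is a hypothetical function written only for this illustration.

```r
library(ggplot2)

# Same linear fit, drawn with a 95% band and with a 50% band
wide <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE, level = 0.95)

narrow <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE, level = 0.50)

# Hypothetical helper: average vertical width of the band in a one-layer plot
band_width <- function(p) {
  d <- layer_data(p)
  mean(d$ymax - d$ymin)
}

band_width(wide)
band_width(narrow)
```

The fitted line is the same in both plots; only the band around it changes.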

Exercise 12

Next, we want to change the labels on the graph so they are in percentage format. Use scale_y_continuous to do so. Set the labels argument to scales::percent_format(accuracy = 1).


... +
  scale_y_continuous(labels = ...)

We use scale_y_continuous() when we want to modify the y-axis for continuous data. Its documentation describes all of the arguments you can use.
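To see what the labels argument receives, note that scales::percent_format() returns a function that turns proportions into percent labels. A minimal sketch, with fmt as an illustrative name:

```r
library(scales)

# percent_format() builds a labelling function; accuracy = 1 rounds to whole percents
fmt <- percent_format(accuracy = 1)

fmt(c(0.25, 0.5, 1))
#> [1] "25%"  "50%"  "100%"
```

This assumes the plotted values are proportions between 0 and 1, since the labeller multiplies by 100 before adding the percent sign.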

Exercise 13

Currently our legend is to the right of the graph. Let's move it. Add theme() to the plot with + and set legend.position to "bottom".


... +
  theme(... = "bottom")

With legend.position, we can place the legend to the left, right, top, or bottom of the graph.
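A quick way to compare placements is to save a base plot and add different themes to it. The sketch below uses mpg as a stand-in; the names base and bottom_legend are illustrative.

```r
library(ggplot2)

# Base plot with a color legend, which sits on the right by default
base <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point()

bottom_legend <- theme(legend.position = "bottom")

base + bottom_legend                      # legend below the plot
base + theme(legend.position = "top")     # legend above the plot
base + theme(legend.position = "none")    # legend hidden entirely
```

The value "none" is a further option that hides the legend altogether.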

Exercise 14

Finally, use labs() to add an appropriate heading, subtitle, caption, etc.

Reminder - your graph should look like this:

biden_p

... +
  labs(x = ...,
       y = ...,
       title = ...,
       subtitle = ...,
       color = ...,
       caption = ...)

Good! Now you have a plot of Biden's approval rating for the first six months of his term.

Summary

This tutorial covered Chapter 9: Layers from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. You learned how to use ggplot2 and related packages to create persuasive graphics. Key functions included geom_smooth(), geom_histogram(), and facet_wrap().





r4ds.tutorials documentation built on April 3, 2025, 5:50 p.m.