Data visualization

library(learnr)
library(tutorial.helpers)
library(knitr)
library(tidyverse)
library(palmerpenguins)
library(ggthemes) 

knitr::opts_chunk$set(echo = FALSE)
knitr::opts_chunk$set(out.width = '90%')
options(tutorial.exercise.timelimit = 60, 
        tutorial.storage = "local") 


Introduction

This tutorial covers Chapter 1: Data visualization from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. You will learn about the key functions from the ggplot2 package for data visualization, including ggplot(), geom_point(), geom_bar(), geom_boxplot(), geom_histogram(), and more.

First steps

R has several systems for making graphs, but ggplot2 is one of the most elegant and most versatile. ggplot2 implements the grammar of graphics, a coherent system for describing and building graphs. With ggplot2, you can do more and faster by learning one system and applying it in many places.

We will create this plot:

intro_p <- penguins |>
  drop_na() |> 
  ggplot(mapping = aes(x = flipper_length_mm, 
                       y = body_mass_g)) +
    geom_point(mapping = aes(color = species, 
                             shape = species)) +
    geom_smooth(method = "lm", formula = y ~ x) +
    labs(title = "Body Mass and Flipper Length",
         subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
         x = "Flipper Length (mm)", 
         y = "Body Mass (g)",
         color = "Species", 
         shape = "Species")

intro_p

Exercise 1

Load the tidyverse library using library().


library(...)
library(tidyverse)

We almost always begin our work by loading the tidyverse package. Note that the terms "package" and "library" are used interchangeably and that there is no package() function. To load a package, you need to use library().

Exercise 2

Load the palmerpenguins package using library(). This is the package which holds the data which we will be using.


library(...)
library(palmerpenguins)

Do penguins with longer flippers weigh more or less than penguins with shorter flippers? What does the relationship between flipper length and body mass look like? Is it positive? Negative? Linear? Nonlinear? Does the relationship vary by the species of the penguin? How about by the island where the penguin lives?

Exercise 3

In the Console, run library(palmerpenguins). The Console and the Tutorial are separate environments. Loading a library in one does not load it in another.

Run ?palmerpenguins in the Console after loading in the package. After doing so, copy and paste the description here. (As a reminder, running ?something in the Console will open the help page for something. Running ??something, though, will search for help pages that mention something.)

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 3)

The data includes size measurements, clutch observations, and blood isotope ratios for adult foraging Adélie, Chinstrap, and Gentoo penguins observed on islands in the Palmer Archipelago near Palmer Station, Antarctica.

Exercise 4

Load the ggthemes package by running the library() command with ggthemes as the argument.


library(...)
library(ggthemes)

ggthemes is one of many packages which add functionality to ggplot2.

Exercise 5

Type penguins and hit Run Code. penguins is the table that holds the data we'll be using to make the plot.


penguins
penguins

Tabular data is data organized in a table. A table is a group of cells, organized in rows and columns. Tabular data is considered tidy if and only if it satisfies the following rules:

  1. Each variable is a column; each column is a variable.
  2. Each observation is a row; each row is an observation.
  3. Each value is a cell; each cell is a single value.

knitr::include_graphics("images/tidy-data.png")

penguins is an example of tidy data.

Exercise 6

Run glimpse() with penguins as its argument.


glimpse(...)
glimpse(penguins)

Among the variables in penguins are species, island, flipper_length_mm, and body_mass_g, all of which we will use in this tutorial.

Exercise 7

ggplot() is the core function of the ggplot2 package. It creates a ggplot object that serves as a canvas for visualizations, like scatter and box plots.

Run ggplot(data = penguins).


ggplot(...)
ggplot(data = penguins)

You should see a blank, grey square. R has set up the area in which it can place a plot, but we have yet to tell it what to plot.

In the realm of data analysis, a fundamental concept is the notion of a variable. A variable represents a quantity, quality, or property that can be measured or observed. Variables can take different forms, depending on the nature of the data being studied. They can be numeric or categorical, continuous or discrete, qualitative or quantitative.

Exercise 8

Pipe penguins into ggplot().


penguins |> 
  ...
penguins |>
  ggplot()

penguins, on the left of the pipe, becomes the input to ggplot(), on the right side. This generates the same output as ggplot(data = penguins). While indentation may not affect how the code performs, it does make the code more readable. We start each command in a pipe on a new line. Each line of code in a pipe, except the last, ends with the pipe itself: |>.

When you're working with ggplot(), you typically won't be using just the data parameter (input into a function). You'll be using the mapping parameter as well. The mapping parameter lets you set, among other things, variables for the x- and y-axis.

To use the mapping parameter, you have to give ggplot() an aesthetic, which you get by calling the aes() function. For example, if you wanted to set the variable for the x-axis to be foo, you would add mapping = aes(x = foo) in your call to ggplot().

Exercise 9

Copy the previous code. Within the call to ggplot(), set the mapping to an aesthetic and set the x parameter to be flipper_length_mm and run your code.


penguins |>
  ggplot(... = aes(x = ...))
penguins |>
  ggplot(mapping = aes(x = flipper_length_mm))

mapping = aes(x = flipper_length_mm) specifies that the variable flipper_length_mm is mapped to the x-axis of a plot. This causes the length of the penguin's flipper to be plotted on the x-axis.

Exercise 10

Copy the previous code. Within the aes() set y equal to body_mass_g and run the code.


penguins |>
  ggplot(mapping = aes(x = flipper_length_mm, 
                       y = ...))
penguins |>
  ggplot(mapping = aes(x = flipper_length_mm, 
                       y = body_mass_g))

To display the data points, we need to add a geometric object, or, in ggplot terms, a geom. A geom is the geometrical object that a plot uses to represent data. These geometric objects are made available in ggplot2 with functions that start with geom_.

Exercise 11

The function geom_point() adds a layer of points to your plot, which creates a scatterplot. Copy the previous code and add + geom_point().


penguins |>
  ggplot(mapping = aes(x = flipper_length_mm, 
                       y = body_mass_g)) +
  ...
penguins |>
  ggplot(mapping = aes(x = flipper_length_mm, 
                       y = body_mass_g)) +
    geom_point() 

Note the warning message. There are two problematic data points in the penguins data set. In a real data science project, we would investigate these further. For this tutorial, we will just discard them. We do that with the drop_na() function.

drop_na() removes any row with NA values for any of the variables. If you provide drop_na() with the name of a variable as an argument, it will only remove rows that have an NA value for that variable.
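The difference between the two forms can be sketched with a small made-up table. The tibble toy and its columns id, height, and weight are hypothetical and are not part of penguins.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data: row 2 is missing height, row 3 is missing weight
toy <- tibble(
  id     = 1:3,
  height = c(10, NA, 30),
  weight = c(5, 6, NA)
)

# No arguments: drop every row that has an NA in any column
all_complete <- drop_na(toy)

# One column named: drop only the rows where height is NA
height_complete <- drop_na(toy, height)

nrow(all_complete)     # 1
nrow(height_complete)  # 2
```

Naming a column keeps more rows, which matters when a plot uses only a few of the variables.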

Exercise 12

Insert drop_na() |> (drop_na() with a pipe) in between the expressions, penguins |> and ggplot(mapping = ...).


penguins |>
  ... |> 
  ggplot(mapping = aes(x = flipper_length_mm, 
                       y = body_mass_g)) +
    geom_point() 
penguins |>
  drop_na() |> 
  ggplot(mapping = aes(x = flipper_length_mm, 
                       y = body_mass_g)) +
    geom_point() 

The aesthetic function, aes(), has more parameters than just x and y. aes() has parameters like color, shape, and size as well! You can add them in the same way you added the x and y parameters: add color = foo or shape = bar(baz) to your aes() call in ggplot().

Exercise 13

Copy your previous code. Set the color to species and run your code.


penguins |>
  drop_na() |> 
  ggplot(mapping = aes(x = flipper_length_mm, 
                       y = body_mass_g, 
                       color = ...)) +
    geom_point()
penguins |>
  drop_na() |>
  ggplot(mapping = aes(x = flipper_length_mm, 
                       y = body_mass_g,
                       color = species)) +
    geom_point() 

Scatter plots are useful for displaying relationships between x and y but it is also helpful to ask yourself if there are other variables which might contribute to the relationship. For example, does the relationship between x and y differ for different species of penguins?

When you ran the code, you saw that the color of each data point now depends on the species, thereby creating a more interesting plot. Color isn't the only aesthetic. See aesthetic mappings for more examples.

Exercise 14

Let's add another geom, geom_smooth(). geom_smooth() "smooths out" data into a line or curve to make patterns easier to see. geom_smooth(), though, is a geom that requires you to give it another parameter. In this case, it's method: the method used to smooth the data. Different methods are used for different purposes. You'll learn about them later on.

Add the geom_smooth() layer using +. Set method to "lm" within the call to geom_smooth().


penguins |>
  drop_na() |>
  ggplot(mapping = aes(x = flipper_length_mm, 
                       y = body_mass_g, 
                       color = species)) +
  geom_point() +
  geom_smooth(...)
penguins |>
  drop_na() |>
  ggplot(mapping = aes(x = flipper_length_mm, 
                       y = body_mass_g,
                       color = species)) +
    geom_point() +
    geom_smooth(method = "lm") 

The "lm" method stands for linear model, meaning that R does its best to fit a straight line through the points.

Note the warning about "geom_smooth() using formula = 'y ~ x'". Since we failed to specify an argument for formula, geom_smooth() provides a sensible default, using the x and y variables specified in aes() above. But geom_smooth() thinks we are sloppy for not confirming this decision, so it provides a warning. As always, we want to address any warnings or messages which our code issues.

Exercise 15

Address the warning by adding formula = y ~ x to the call to geom_smooth(). As always, different arguments to a function must be separated by a comma.


... +
  geom_smooth(method = "lm", ...) 
penguins |>
  drop_na() |>
  ggplot(mapping = aes(x = flipper_length_mm, 
                       y = body_mass_g,
                       color = species)) +
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ x) 

geom_smooth() creates a fitted line or curve which can help identify trends and patterns in data. It offers different smoothing methods like linear or polynomial regression and loess smoothing. The shaded error around the fitted line represents uncertainty about the estimate.
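A minimal sketch of the smoothing options mentioned above, assuming the palmerpenguins data. The object names (penguins_complete, base_plot, and the p_ plots) are illustrative only.

```r
library(ggplot2)
library(palmerpenguins)

# Remove rows with missing values so no warnings are issued
penguins_complete <- tidyr::drop_na(penguins)

base_plot <- ggplot(penguins_complete,
                    aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()

# Straight line fitted by a linear model
p_linear <- base_plot +
  geom_smooth(method = "lm", formula = y ~ x)

# Quadratic curve: still method = "lm", but with a polynomial formula
p_quadratic <- base_plot +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2))

# Local (loess) smoothing
p_loess <- base_plot +
  geom_smooth(method = "loess", formula = y ~ x)

# se = FALSE hides the shaded band around the fitted line
p_no_band <- base_plot +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

p_loess
```

Printing each object in turn makes it easy to compare how the fitted curve and its shaded band change with the method.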

Exercise 16

Our current plot shows separate lines for each penguin species instead of one line for the entire data set. To fit a single line, delete the color aesthetic within ggplot() and add it to geom_point() by running geom_point(mapping = aes(color = species)).


penguins |>
  drop_na() |>
  ggplot(mapping = aes(x = flipper_length_mm, 
                       y = body_mass_g)) +
  geom_point(mapping = ...) +
  geom_smooth(method = "lm", formula = y ~ x)
penguins |>
  drop_na() |>
  ggplot(mapping = aes(x = flipper_length_mm, 
                       y = body_mass_g)) +
    geom_point(mapping = aes(color = species)) +
    geom_smooth(method = "lm", formula = y ~ x) 

In ggplot2, when aesthetic mappings are defined at the global level, they are passed down to all subsequent geom layers in the plot. However, each geom function in ggplot2 can also accept a mapping argument, allowing for local-level aesthetic mappings that are added to those inherited from the global level.
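The effect of global versus local mappings can be sketched side by side. The object names p_global and p_local are illustrative.

```r
library(ggplot2)
library(palmerpenguins)

penguins_complete <- tidyr::drop_na(penguins)

# Global color mapping: both layers inherit it, so geom_smooth()
# fits one line per species
p_global <- ggplot(penguins_complete,
                   aes(x = flipper_length_mm, y = body_mass_g,
                       color = species)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# Local color mapping: only the points are colored, so geom_smooth()
# fits a single line through all of the data
p_local <- ggplot(penguins_complete,
                  aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species)) +
  geom_smooth(method = "lm", formula = y ~ x)

p_global
p_local
```

A useful rule of thumb is to put a mapping in ggplot() when every layer should use it, and in a single geom when only that layer should.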

Exercise 17

Not all individuals perceive colors the same due to color blindness or other color vision differences. To help, we can map species to the shape aesthetic, in addition to the color aesthetic, within geom_point().


penguins |>
  drop_na() |>
  ggplot(mapping = aes(x = flipper_length_mm, 
                       y = body_mass_g)) +
  geom_point(mapping = aes(color = species, ... = species)) +
  geom_smooth(method = "lm", formula = y ~ x)
penguins |>
  drop_na() |>
  ggplot(mapping = aes(x = flipper_length_mm, 
                       y = body_mass_g)) +
    geom_point(mapping = aes(color = species, 
                             shape = species)) +
    geom_smooth(method = "lm", formula = y ~ x) 

In addition to the built-in shapes, ggplot() also allows for custom shapes to be created and used in plots. This can be useful for creating unique visualizations or for incorporating custom symbols or logos into a plot.

Exercise 18

Now that we have the data points, let's add the title, subtitle, labels, et cetera. Copy the previous code, add a new layer using + and add labs(). Within labs, set title equal to "Body Mass and Flipper Length".


penguins |>
  drop_na() |>
  ggplot(mapping = aes(x = flipper_length_mm, 
                       y = body_mass_g)) +
    geom_point(mapping = aes(color = species, ... = species)) +
    geom_smooth(method = "lm", formula = y ~ x) + 
    labs(...)
penguins |>
  drop_na() |>
  ggplot(mapping = aes(x = flipper_length_mm, 
                       y = body_mass_g)) +
    geom_point(mapping = aes(color = species, 
                             shape = species)) +
    geom_smooth(method = "lm", formula = y ~ x) +
    labs(title = "Body Mass and Flipper Length")

The labs() function takes in several arguments to modify the plot labels, including x, y, title, subtitle, caption, and tag. The x and y arguments are used to modify the axis labels, while the title, subtitle, and caption arguments are used to modify the plot title, subtitle, and caption, respectively. The tag argument adds a short label, such as "A", to the plot, which is useful for identifying a plot when several are shown together.
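As a hedged illustration of two labs() arguments that the exercises do not use, the sketch below sets caption and tag. The caption text and the object name p_labeled are made up for this example.

```r
library(ggplot2)
library(palmerpenguins)

p_labeled <- ggplot(penguins,
                    aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(na.rm = TRUE) +   # na.rm = TRUE drops missing values silently
  labs(
    title = "Body Mass and Flipper Length",
    caption = "Source: palmerpenguins package",  # small text below the plot
    tag = "A"                                    # short label in a plot corner
  )

p_labeled
```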

Exercise 19

Copy the previous code. Add the subtitle by adding subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins" after the title in the call to labs() (separated by a comma).


penguins |>
  drop_na() |>
  ggplot(mapping = aes(x = flipper_length_mm, 
                       y = body_mass_g)) +
  geom_point(mapping = aes(color = species, ... = species)) +
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(title = "Body Mass and Flipper Length", 
       ...)
penguins |>
  drop_na() |>
  ggplot(mapping = aes(x = flipper_length_mm, 
                       y = body_mass_g)) +
    geom_point(mapping = aes(color = species, 
                             shape = species)) +
    geom_smooth(method = "lm", formula = y ~ x) +
    labs(title = "Body Mass and Flipper Length",
         subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins")

The ggtitle() function is a shortcut for setting the plot title and subtitle; ggtitle("Body Mass and Flipper Length") has the same effect as labs(title = "Body Mass and Flipper Length"). To control the formatting of the title, such as its font size or color, use theme() rather than labs() or ggtitle().

Exercise 20

Let's modify the x and y axis labels. In the call to labs(), type a comma, then add the x parameter and set it to "Flipper Length (mm)". Type another comma and a y, then set it equal to "Body Mass (g)".


penguins |>
  drop_na() |>
  ggplot(mapping = aes(x = flipper_length_mm, 
                       y = body_mass_g)) +
  geom_point(mapping = aes(color = species, ... = species)) +
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(title = "Body Mass and Flipper Length",
       subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
        ...)
penguins |>
  drop_na() |>
  ggplot(mapping = aes(x = flipper_length_mm, 
                       y = body_mass_g)) +
    geom_point(mapping = aes(color = species, 
                             shape = species)) +
    geom_smooth(method = "lm", formula = y ~ x) +
    labs(title = "Body Mass and Flipper Length",
         subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
         x = "Flipper Length (mm)", 
         y = "Body Mass (g)")

You can use the expression() function within the labs() function to include mathematical symbols or other special characters (like Greek letters) in your plot labels.
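A small sketch of expression() inside labs(), using R's plotmath syntax. The label text is purely illustrative and the object name p_math is hypothetical.

```r
library(ggplot2)
library(palmerpenguins)

p_math <- ggplot(penguins,
                 aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(na.rm = TRUE) +
  labs(
    # paste() inside expression() joins text and symbols;
    # 10^-3 is drawn as a superscript
    x = expression(paste("Flipper Length (", 10^-3, " m)")),
    y = "Body Mass (g)",
    # alpha and beta are drawn as Greek letters
    title = expression(paste("Greek letters: ", alpha, ", ", beta))
  )

p_math
```

Running ?plotmath in the Console lists the symbols and layout operators that expression() understands.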

Finally, it's worth noting that the labs() function is just one way to modify plot labels in ggplot2. Other functions, such as xlab(), ylab(), and ggtitle(), can be used to modify specific plot labels without affecting others. It's important to choose the appropriate function for your needs depending on the level of customization you require.
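The two approaches can be compared directly. In this sketch, p_labs and p_helpers are illustrative names, and both plots should end up with the same title and axis labels.

```r
library(ggplot2)
library(palmerpenguins)

p_base <- ggplot(penguins,
                 aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(na.rm = TRUE)

# All labels set in one call to labs()
p_labs <- p_base +
  labs(title = "Body Mass and Flipper Length",
       x = "Flipper Length (mm)",
       y = "Body Mass (g)")

# The same labels set one at a time with the helper functions
p_helpers <- p_base +
  ggtitle("Body Mass and Flipper Length") +
  xlab("Flipper Length (mm)") +
  ylab("Body Mass (g)")

p_helpers
```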

Exercise 21

The plot now looks almost the same as the one at the beginning, but we forgot one minor thing: capitalizing the legend title. We can do so by typing a comma and setting both color and shape equal to "Species" (in quotes) in the call to labs().


penguins |>
  drop_na() |>
  ggplot(mapping = aes(x = flipper_length_mm, 
                       y = body_mass_g)) +
  geom_point(mapping = aes(color = species, ... = species)) +
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(title = "Body Mass and Flipper Length",
       subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
       x = "Flipper Length (mm)", y = "Body Mass (g)", 
       ...)
penguins |>
  drop_na() |>
  ggplot(mapping = aes(x = flipper_length_mm, 
                       y = body_mass_g)) +
    geom_point(mapping = aes(color = species, 
                             shape = species)) +
    geom_smooth(method = "lm", formula = y ~ x) +
    labs(title = "Body Mass and Flipper Length",
         subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
         x = "Flipper Length (mm)", 
         y = "Body Mass (g)",
         color = "Species", 
         shape = "Species")

Reminder: This is what our graph should look like:

intro_p

ggplot2 calls

As we move on from these introductory sections, we’ll transition to a more concise expression of ggplot2 code.

Exercise 1

Run this code:

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point()

Typically, the first one or two arguments to a function are so important that you should know them by heart. The first two arguments to ggplot() are data and mapping. In the remainder of this tutorial, we won’t supply those names.

Exercise 2

Run this code:

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) + 
  geom_point()

Leaving out the data and mapping arguments saves typing, and, by reducing the amount of extra text, makes it easier to see what’s different between plots.

Exercise 3

Plots are often the last step in a pipeline which includes various sorts of data manipulation. Run this code:

penguins |> 
  drop_na() |> 
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) + 
  geom_point()

By including drop_na() in the pipeline, we prevent that annoying warning.

Always keep in mind that steps in the pipeline are separated by |> while steps in the construction of your ggplot() object are separated with +.

Visualizing distributions

How you visualize the distribution of a variable depends on the type of variable: categorical or numerical.

Exercise 1

A variable is categorical if it can only take one of a small set of values. To examine the distribution of a categorical variable, you can use a bar chart.

Run this code:

ggplot(penguins, aes(x = species)) +
  geom_bar()

The height of the bars displays how many observations occurred with each x value.

Exercise 2

In bar plots of categorical variables with non-ordered levels, like the penguin species above, it’s often preferable to reorder the bars based on their frequencies. Doing so requires transforming the variable to a factor (how R handles categorical data) and then reordering the levels of that factor.

Replace species in the previous exercise with fct_infreq(species).


ggplot(penguins, aes(x = fct_infreq(species))) +
  geom_bar()

The forcats package, which is part of the Tidyverse, provides a variety of functions for these sorts of manipulations.
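What fct_infreq() does to the level order can be seen on a tiny made-up factor. The factor f below is hypothetical.

```r
library(forcats)

# Hypothetical toy factor: "c" appears three times, "b" twice, "a" once
f <- factor(c("a", "b", "b", "c", "c", "c"))

levels(f)              # alphabetical by default: "a" "b" "c"
levels(fct_infreq(f))  # most frequent first:     "c" "b" "a"

# fct_rev() reverses the order, putting the rarest level first
levels(fct_rev(fct_infreq(f)))  # "a" "b" "c"
```

Because geom_bar() draws bars in level order, reordering the levels reorders the bars.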

Exercise 3

A variable is numerical (or quantitative) if it can take on a wide range of numerical values, and it is sensible to add, subtract, or take averages with those values. Numerical variables can be continuous or discrete. One commonly used visualization for distributions of continuous variables is a histogram. Run this code:

ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 200)

A histogram divides the x-axis into equally spaced bins and then uses the height of a bar to display the number of observations that fall in each bin. In the graph above, the tallest bar shows that 39 observations have a body_mass_g value between 3,500 and 3,700 grams, which are the left and right edges of the bar.
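The binning idea can be sketched by hand in base R with made-up numbers. This is only an illustration of counting values per bin: geom_histogram() chooses its own bin edges, so counts in a real histogram may be aligned differently.

```r
# Hypothetical toy values
x <- c(1, 2, 2, 3, 7, 8, 12)

# Equally spaced bin edges with a width of 5
edges <- seq(0, 15, by = 5)

# cut() assigns each value to a bin; table() counts the values per bin
counts <- table(cut(x, breaks = edges))
counts
# (0,5]  has 4 values, (5,10] has 2, and (10,15] has 1
```

Changing by = 5 to a smaller or larger width is the hand-made equivalent of changing binwidth.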

Exercise 4

You can set the width of the intervals in a histogram with the binwidth argument, which is measured in the units of the x variable. You should always explore a variety of binwidths when working with histograms, as different binwidths can reveal different patterns. Change the value of the binwidth argument from 200 to 20.


ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(... = 20)
ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 20)

A binwidth of 20 is too narrow, resulting in too many bars, making it difficult to determine the shape of the distribution.

Exercise 5

Change the value of the binwidth argument from 20 to 2000.


ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(... = 2000)
ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 2000)

A binwidth of 2,000 is too high, resulting in all data being binned into only three bars, and also making it difficult to determine the shape of the distribution. A binwidth of 200 provides a sensible balance.

Exercise 6

An alternative visualization for distributions of numerical variables is a density plot. A density plot is a smoothed-out version of a histogram and a practical alternative, particularly for continuous data that comes from an underlying smooth distribution. Replace geom_histogram(binwidth = 2000) in the previous exercise with geom_density().


ggplot(penguins, aes(x = body_mass_g)) +
  ..._density()
ggplot(penguins, aes(x = body_mass_g)) +
  geom_density()

We won’t go into how geom_density() estimates the density (you can read more about that in the function documentation), but let’s explain how the density curve is drawn with an analogy. Imagine a histogram made out of wooden blocks. Then, imagine that you drop a cooked spaghetti string over it. The shape the spaghetti will take draped over blocks can be thought of as the shape of the density curve. It shows fewer details than a histogram but can make it easier to quickly glean the shape of the distribution, particularly with respect to modes and skewness.

Visualizing relationships

To visualize a relationship we need to have at least two variables mapped to aesthetics of a plot. In the following exercises you will learn about commonly used plots for visualizing relationships between two or more variables and the geoms used for creating them.

Exercise 1

To visualize the relationship between a numerical and a categorical variable we can use side-by-side box plots. A boxplot is a type of visual shorthand for measures of position (percentiles) that describe a distribution. It is also useful for identifying potential outliers.

Let's start by running ggplot() with penguins as the first argument.


ggplot(...)
ggplot(penguins)

As usual, we get a blank plot.

Exercise 2

Add aes(x = species, y = body_mass_g) as the second argument to the ggplot(penguins) call.


ggplot(penguins, ...(x = species, ... = body_mass_g))
ggplot(penguins, aes(x = species, y = body_mass_g))

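The axes are now specified. A boxplot consists of a box that spans the middle half of the data, from the 25th to the 75th percentile (a distance known as the interquartile range, or IQR); a line inside the box that marks the median; whiskers that extend from each end of the box to the farthest non-outlier point; and individual points for observations that fall more than 1.5 times the IQR from either edge of the box.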

Exercise 3

Extend the plot by adding geom_boxplot().


ggplot(penguins, aes(x = species, y = body_mass_g)) +
  ...()
ggplot(penguins, aes(x = species, y = body_mass_g)) +
  geom_boxplot()

knitr::include_graphics("images/EDA-boxplot.png")

Exercise 4

Another way to look at this data is with geom_density(). Run this code:

ggplot(penguins, aes(x = body_mass_g, color = species)) +
  geom_density(linewidth = 0.75)

We’ve also customized the thickness of the lines using the linewidth argument in order to make them stand out a bit more against the background.

Exercise 5

Additionally, we can map species to both color and fill aesthetics and use the alpha aesthetic to add transparency to the filled density curves. This aesthetic takes values between 0 (completely transparent) and 1 (completely opaque). In the following plot it’s set to 0.5. Run this code:

ggplot(penguins, aes(x = body_mass_g, color = species, fill = species)) +
  geom_density(alpha = 0.5)

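Note the terminology we have used here: we map a variable to an aesthetic when we want that visual attribute to vary with the values of the variable, as with species and the color and fill aesthetics. Otherwise, we set the aesthetic to a fixed value, as with alpha = 0.5.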
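The difference between mapping an aesthetic and setting one can be sketched as follows. The object names p_mapped and p_set and the color "blue" are illustrative choices.

```r
library(ggplot2)
library(palmerpenguins)

# Mapping: inside aes(), so color varies with the values of species
p_mapped <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_density(aes(color = species), na.rm = TRUE)

# Setting: outside aes(), so the curve gets one fixed color and transparency
p_set <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_density(color = "blue", alpha = 0.5, na.rm = TRUE)

p_mapped
p_set
```

A mapped aesthetic produces a legend because its values carry information; a set aesthetic does not.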

Exercise 6

We can use stacked bar plots to visualize the relationship between two categorical variables. Run this code:

ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar()

This plot shows the frequencies of each species of penguins on each island. The plot of frequencies shows that there are equal numbers of Adelies on each island. But we don’t have a good sense of the percentage balance within each island.

Exercise 7

Change the code by adding position = "fill" as an argument to geom_bar(). In creating these bar charts, we map the variable that will be separated into bars to the x aesthetic, and the variable that will change the colors inside the bars to the fill aesthetic.


ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar(... = "fill")
ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar(position = "fill")

Using this plot we can see that Gentoo penguins all live on Biscoe island and make up roughly 75% of the penguins on that island, Chinstrap all live on Dream island and make up roughly 50% of the penguins on that island, and Adelie live on all three islands and make up all of the penguins on Torgersen.
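The proportions that position = "fill" displays can also be computed directly. This is a sketch using dplyr; the names island_props and prop are illustrative.

```r
library(dplyr)
library(palmerpenguins)

island_props <- penguins |>
  count(island, species) |>      # frequencies, as in the stacked bar plot
  group_by(island) |>
  mutate(prop = n / sum(n)) |>   # share of each species within its island
  ungroup()

island_props
```

Within each island the prop values sum to 1, which is why every bar in the filled plot has the same height.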

Exercise 8

So far you’ve learned about scatterplots (created with geom_point()) and smooth curves (created with geom_smooth()) for visualizing the relationship between two numerical variables. Run this code:

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()

A scatterplot is probably the most commonly used plot for visualizing the relationship between two numerical variables.

Exercise 9

We can incorporate more variables into a plot by mapping them to additional aesthetics. For example, in the following scatterplot the colors of points represent species and the shapes of points represent islands. Run this code:

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = island))

However, adding too many aesthetic mappings to a plot makes it cluttered and difficult to make sense of. Another way, which is particularly useful for categorical variables, is to split your plot into facets, subplots that each display one subset of the data.

Exercise 10

To facet your plot by a single variable, use facet_wrap(). Run this code:

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = species)) +
  facet_wrap(~island)

The first argument of facet_wrap() is a formula, which you create with ~ followed by a variable name. The variable that you pass to facet_wrap() should be categorical.
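As a hedged variation, ggplot2 also accepts vars() in place of the formula, and the ncol argument controls how the panels are arranged. The object name p_facets is illustrative.

```r
library(ggplot2)
library(palmerpenguins)

p_facets <- ggplot(penguins,
                   aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = species), na.rm = TRUE) +
  # vars(island) selects the same variable as ~island;
  # ncol = 1 stacks the three panels in a single column
  facet_wrap(vars(island), ncol = 1)

p_facets
```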

Saving your plots

Once you’ve made a plot, you might want to get it out of R by saving it as an image that you can use elsewhere. That’s the job of ggsave(), which will save the plot most recently created to disk.

Exercise 1

Run this code:

ggplot(penguins, aes(x = flipper_length_mm, 
                     y = body_mass_g)) +
  geom_point()

This creates a simple scatter plot. Ignore the warning about missing values.

If your code produces an error message, carefully read it. Sometimes the answer will be buried there! But when you’re new to R, even if the answer is in the error message, you might not yet know how to understand it. Another great tool is Google: try googling the error message, as it’s likely someone else has had the same problem, and has gotten help online.

Exercise 2

In order to save a copy of this plot, we use ggsave() by running ggsave(filename = "penguin-plot.png") immediately after creating the plot.

ggplot(penguins, aes(x = flipper_length_mm, 
                     y = body_mass_g)) +
  geom_point()

ggsave(... = "penguin-plot.png")
ggplot(penguins, aes(x = flipper_length_mm, 
                     y = body_mass_g)) +
  geom_point()

ggsave(filename = "penguin-plot.png")

file.remove("penguin-plot.png")

By default, ggsave() saves the most recently created plot.

If you don’t specify the width and height they will be taken from the dimensions of the current plotting device. For reproducible code, you’ll want to specify them.
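A minimal sketch of saving with explicit dimensions. The sizes, the resolution, and the names sized_plot and out_file are arbitrary choices for illustration, and tempfile() is used only so that the example does not write into your working directory.

```r
library(ggplot2)
library(palmerpenguins)

sized_plot <- ggplot(penguins,
                     aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(na.rm = TRUE)

# Temporary file path ending in .png (illustrative only)
out_file <- tempfile(fileext = ".png")

# width and height are measured in the given units; dpi sets the resolution
ggsave(filename = out_file, plot = sized_plot,
       width = 6, height = 4, units = "in", dpi = 300)

file.exists(out_file)
```

With the dimensions fixed in the code, the saved image no longer depends on the size of the plotting window.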

Exercise 3

Run this code:

my_plot <- ggplot(penguins, aes(x = flipper_length_mm, 
                     y = body_mass_g)) +
  geom_point()

my_plot

my_plot <- ...(penguins, aes(x = flipper_length_mm, 
                     y = body_mass_g)) +
  geom_point()

...
my_plot <- ggplot(penguins, aes(x = flipper_length_mm, 
                     y = body_mass_g)) +
  geom_point()

my_plot

When creating plots to save, it is best practice to explicitly create an object. Don't count on the fact that ggsave() saves the last plot you created. What happens when you move the plotting code to another location? Instead, make it certain which object you are saving.

Exercise 4

Use the code from the last exercise, but replace my_plot with ggsave(filename = "penguin-plot.png", plot = my_plot).


my_plot <- ggplot(penguins, aes(x = flipper_length_mm, 
                     y = body_mass_g)) +
  geom_point()

ggsave(... = "penguin-plot.png", plot = ...)
my_plot <- ggplot(penguins, aes(x = flipper_length_mm, 
                     y = body_mass_g)) +
  geom_point()

ggsave(filename = "penguin-plot.png", plot = my_plot)

file.remove("penguin-plot.png")

R is extremely picky, and a misplaced character can make all the difference. Make sure that every ( is matched with a ) and every " is paired with another ". Sometimes you’ll run the code and nothing happens. Check the left-hand side of your console: if it’s a +, it means that R doesn’t think you’ve typed a complete expression and it’s waiting for you to finish it. In this case, it’s usually easy to start from scratch again by pressing ESCAPE to abort processing the current command.

Exercise 5

One common problem when creating ggplot2 graphics is to put the + in the wrong place: it has to come at the end of the line, not the start. For example, run this code:

ggplot(data = mpg) 
+ geom_point(mapping = aes(x = displ, y = hwy))

Note how the error message explains the problem.
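For comparison, here is the same plot with the + moved to the end of the first line. The object name fixed_plot is illustrative; mpg is a data set that ships with ggplot2.

```r
library(ggplot2)

# The + ends the first line, so R knows the expression continues
fixed_plot <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

fixed_plot
```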

If you get stuck when coding, try the help. You can get help about any R function by running ?function_name in the Console, or highlighting the function name and pressing F1 in RStudio. Don’t worry if the help doesn’t seem that helpful - instead skip down to the examples and look for code that matches what you’re trying to do.

Summary

This tutorial covered Chapter 1: Data visualization from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. You learned about the key functions from the ggplot2 package for data visualization, including ggplot(), geom_point(), geom_bar(), geom_boxplot(), geom_histogram(), and more.




r4ds.tutorials documentation built on April 3, 2025, 5:50 p.m.