Exploratory data analysis

library(learnr)
library(tutorial.helpers)
library(knitr)

library(tidyverse)
library(tidymodels)
library(hexbin)

knitr::opts_chunk$set(echo = FALSE)
knitr::opts_chunk$set(out.width = '90%')
options(tutorial.exercise.timelimit = 60, 
        tutorial.storage = "local") 

smaller <- diamonds |> 
  filter(carat < 3)

p1 <- nycflights13::flights |> 
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + (sched_min / 60)
  ) |> 
  ggplot(aes(x = sched_dep_time)) + 
  geom_freqpoly(aes(color = cancelled), binwidth = 1/4) + 
  labs(title = "Frequency of Cancelled and Non-cancelled Flights for Scheduled Flights",
       subtitle = "There are a lot of non-cancelled flights at 3 PM",
       x = "Scheduled Departure Time",
       y = "Frequency",
       color = "Cancelled or Non-Cancelled")

p2 <- mpg |> 
  ggplot(aes(x = hwy, y = fct_reorder(class, hwy, median))) +
  geom_boxplot() + 
  labs(
    title = "Variation of Highway Mileage over Classes", 
    subtitle = "The compact and midsized cars are tied for the highest highway mileage.",
    x = "Highway Mileage",
    y = "Car Classes"
  )

p3 <- diamonds |> 
  count(color, cut) |>  
  ggplot(aes(x = color, y = cut)) +
  geom_tile(aes(fill = n)) +
  labs(
    title = "Covariation between Diamond Cut and the Color of the Diamond",
    subtitle = "There are more ideally cut diamonds than other cut types of diamonds",
    x = "Color of diamond",
    y = "Cut of diamond"
  )

p4 <- smaller |> 
  ggplot( aes(x = carat, y = price)) + 
  geom_boxplot(aes(group = cut_width(carat, 0.1))) +
  labs(title = "Covariation between Number of Carats and Price of Diamonds",
       subtitle = "As the number of carats of a diamond increases the price also increases",
       x = "Number of Carats",
       y = "Price of the Diamond")

diamonds3 <- diamonds |>
  mutate(
    log_price = log(price),
    log_carat = log(carat)
  )

diamonds_fit <- linear_reg() |>
  fit(log_price ~ log_carat, data = diamonds3)

diamonds_aug <- augment(diamonds_fit, new_data = diamonds3) |>
  mutate(.resid = exp(.resid))

p5 <- diamonds_aug |> 
  ggplot(aes(x = cut, y = .resid)) + 
  geom_boxplot() + 
  labs(
    title = "Prediction Model between Carat and Residual Price",
    subtitle = "As the number of carat increases the Residual Price decreases",
    x = "Number of Carats",
    y = "Residual Price"
  )


Introduction

This tutorial covers Chapter 10: Exploratory data analysis from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. We will be making use of the ggplot2 and the dplyr packages to learn about variation and patterns. Some key commands include coord_cartesian(), geom_freqpoly(), fct_reorder(), geom_tile(), and geom_count().

Variation

Your goal during EDA is to develop an understanding of your data. The easiest way to do this is to use questions as tools to guide your investigation. When you ask a question, the question focuses your attention on a specific part of your dataset and helps you decide which graphs, models, or transformations to make.

Exercise 1

Load in the tidyverse package using library().


library(...)
library(tidyverse)

In this tutorial, we’ll combine what you’ve learned about the dplyr package and the ggplot2 package to interactively ask questions, answer them with data, and then ask new questions.

Exercise 2

Type in diamonds and hit "Run Code"


diamonds
diamonds

The diamonds dataset contains the prices and other attributes of r nrow(diamonds) diamonds, such as the price, carat, cut, color, and so on.

Exercise 3

With a new pipeline, pipe diamonds to ggplot().


diamonds |> 
  ...
diamonds |>
  ggplot()

Note that this plot is blank. ggplot() has simply created an empty plot object which, when printed, shows as a blank recatangle.

You can see variation easily in real life; if you measure any continuous variable twice, you will get two different results. Each of your measurements will include a small amount of error that varies from measurement to measurement.

Exercise 4

Within ggplot(), using aes(), set the x argument to carat.


diamonds |> 
  ggplot(aes(x = ...))
diamonds |>
  ggplot(aes(x = carat))

Variables can also vary if you measure across different subjects (e.g., the eye colors of different people) or at different times (e.g., the energy levels of an electron at different moments).

Exercise 5

Add geom_histogram() to the pipeline.


... + 
  geom_histogram()
diamonds |>
  ggplot(aes(x = carat)) + 
  geom_histogram()

Every variable has its own pattern of variation, which can reveal interesting information about how that it varies between measurements on the same observation as well as across observations. The best way to understand that pattern is to visualize the distribution of the variable’s values.

Exercise 6

Within geom_histogram(), add the binwidth argument and set it equal to 0.5.


... +
  geom_histogram(binwidth = ...)
diamonds |>
  ggplot(aes(x = carat)) + 
  geom_histogram(binwidth = 0.5)

We’ll start our exploration by visualizing the distribution of weights (carat) of ~54,000 diamonds from the diamonds dataset. Since carat is a numerical variable, we can use a histogram.

Exercise 7

Create a new variable called smaller and set it equal to diamonds.


smaller <- ...
smaller <- diamonds

Some valuable questions to ask yourself when dealing with variation are the following:

Exercise 8

We are still creating the smaller tibble. Copy your previous code and create a pipeline with diamonds. Pipe diamonds with filter() and filter out rows for which carat is less than 3.


smaller <- diamonds |> 
  filter(carat < ...)
smaller <- diamonds |>
  filter(carat < 3)

“There are no routine statistical questions, only questionable statistical routines.” — Sir David Cox

Exercise 9

With a new pipeline, pipe smaller to ggplot().


smaller |> 
  ...
smaller |>
  ggplot()

EDA, or exploratory data analysis, is an iterative cycle. You:

  1. Generate questions about your data.

  2. Search for answers by visualizing, transforming, and modelling your data.

  3. Use what you learn to refine your questions and/or generate new questions.

Exercise 10

Within ggplot(), using aes(), set the x argument to carat.


smaller |> 
  ggplot(aes(x = ...))
smaller |>
  ggplot(aes(x = carat))

EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends.

Exercise 11

Add geom_histogram() to the pipeline.


... + 
  geom_histogram()
smaller |>
  ggplot(aes(x = carat)) +
  geom_histogram()

This histogram suggests several interesting questions:

Exercise 12

Within geom_histogram(), set the binwidth argument to 0.01.


... +
  geom_histogram(binwidth = ...)
smaller |>
  ggplot(aes(x = carat)) +
  geom_histogram(binwidth = 0.01)

This visualization also revealed clusters, which suggest that subgroups exist in your data. To understand the subgroups, ask:

Exercise 13

Let's observe and take note of unusual values in the dataset. With a new pipeline, pipe diamonds to ggplot().


diamonds |> 
  ...
diamonds |> 
  ggplot()

Outliers are observations that are unusual; data points that don’t seem to fit the pattern.

Exercise 14

Within ggplot(), using aes(), set the x argument to y.


diamonds |> 
  ggplot(aes(x = ...))
diamonds |> 
  ggplot(aes(x = y))

Be careful! The x in this code is an argument, the usual usage with aes(). The y, on the other hand, is not an argument! It is not the usual y for which we provide a value when using aes(). In this case, y is a variable in the diamonds tibble.

Always be careful when variable names happen to be the same as argument names!

Sometimes outliers are data entry errors, sometimes they are simply values at the extremes that happened to be observed in this data collection, and other times they suggest important new discoveries.

Exercise 15

Add geom_histogram() to the pipeline


... +
  geom_histogram()
diamonds |> 
  ggplot(aes(x = y)) +
    geom_histogram()

When you have a lot of data, outliers are sometimes difficult to see in a histogram. We know that they are there because the x-axis extends all the way to 60.

Exercise 16

Within geom_histogram(), add the binwidth argument and set it equal to 0.5.


... +
  geom_histogram(binwidth = ...)
diamonds |> 
  ggplot(aes(x = y)) +
    geom_histogram(binwidth = 0.5)

There are so many observations in the common bins that the rare bins are very short, making it very difficult to see them (although maybe if you stare intently at 0 you’ll spot something). To make it easy to see the unusual values, we need to zoom to small values of the y-axis with coord_cartesian().

Exercise 17

Add coord_cartesian() to the pipeline. Within coord_cartesian(), add the ylim argument and set it equal to c(0, 50).


... + 
  coord_cartesian(ylim = ...)
diamonds |> 
  ggplot(aes(x = y)) +
    geom_histogram(binwidth = 0.5) +
    coord_cartesian(ylim = c(0, 50))

The function, coord_cartesian(), also has an xlim argument for when you need to zoom into the x-axis. ggplot2 also has xlim() and ylim() functions that work slightly differently: they throw away the data outside the limits. This allows us to see that there are three unusual values: 0, ~30, and ~60.

Exercise 18

Pipe diamonds to filter() and filter y to be less than 3.


diamonds |> 
  filter(y < ...)
diamonds |> 
  filter(y < 3)

The y variable measures one of the three dimensions of these diamonds, in mm. We know that diamonds can’t have a width of 0 mm, so these values must be incorrect.

Exercise 19

Modify the pipe by adding, to the filter() command, | and then y greater than 20.


diamonds |> 
  filter(y < 3 | y > ...)
diamonds |> 
  filter(y < 3 | y > 20)

EDA is an important part of any data analysis, even if the primary research questions are handed to you on a platter, because you always need to investigate the quality of your data.

Exercise 20

Add select() to the pipeline. Within select(), select the variables price, x, y, and z.


... |> 
  select(price, x, ...)
diamonds |> 
  filter(y < 3 | y > 20) |> 
  select(price, x, y, z)

Exercise 21

Add arrange() to the pipeline. Within arrange(), add the variable y


... |> 
  arrange(...)
diamonds |> 
  filter(y < 3 | y > 20) |> 
  select(price, x, y, z) |> 
  arrange(y)

This arranges the data by the width of the diamonds. By doing EDA, we have discovered missing data that was coded as 0, which we never would have found by simply searching for NAs. Going forward we might choose to re-code these values as NAs in order to prevent misleading calculations. We might also suspect that measurements of 32mm and 59mm are implausible: those diamonds are over an inch long, but don’t cost hundreds of thousands of dollars!

Unusual Values

In this section we will learn about the ways to deal with unusual values in a dataset. One way is to drop the entire row with the strange values. By the end of this section we will be creating the plot below:

p1 <- nycflights13::flights |> 
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + (sched_min / 60)
  ) |> 
  ggplot(aes(x = sched_dep_time)) + 
  geom_freqpoly(aes(color = cancelled), binwidth = 1/4) + 
  labs(title = "Frequency of Cancelled and Non-cancelled Flights for Scheduled Flights",
       subtitle = "There are a lot of non-cancelled flights at 3 PM",
       x = "Scheduled Departure Time",
       y = "Frequency",
       color = "Cancelled or Non-Cancelled")

p1

Exercise 1

With a new pipeline, pipe diamonds to filter(). Within filter(), type between(y, 3, 20).




This just keeps the rows with the normal values and drops all of the rows with the strange values. We don’t recommend this option because one invalid value doesn’t imply that all the other values for that observation are also invalid. Additionally, if you have low quality data, by the time that you’ve applied this approach to every variable you might find that you don’t have any data left!

Exercise 2

Create a new variable called diamonds2 and set it equal to diamonds.


diamonds2 <- ...

The function if_else() is a vectorized if-else. Compared to the base R equivalent, ifelse(), this function allows you to handle missing values in the condition with missing and always takes true, false, and missing into account when determining what the output type should be.

Exercise 3

Copy your previous code and pipe diamonds to mutate(). Within mutate(), create a new variable, y_new, and set it equal to if_else(y < 3 | y > 20, NA, y).


diamonds2 <- diamonds |> 
  mutate(y_new = ...)

Here we replaced the unusual values with missing values. Here, we use the mutate() function to replace the variable with a modified copy. We used the if_else() function to replace unusual values with NA:

Exercise 4

Add ggplot() to the pipeline.


... |> 
  ggplot()

Exercise 5

Within ggplot(), using aes(), set the x argument to x and the y argument to y_new.


... +
  ggplot(aes(x = ..., y = ...))

Exercise 6

Add geom_point() to the pipeline.


... +
  geom_point()

It’s not obvious where you should plot missing values, so ggplot2 doesn’t include them in the plot, but it does warn that they’ve been removed. To suppress that warning, set na.rm = TRUE.

Exercise 7

Within geom_point(), add the na.rm argument and set it equal to TRUE.


... + 
  geom_point(na.rm = ...)

Exercise 8

Add a title, subtitle, x axis title, and a y axis title using labs().


... + 
  labs(
    title = ...,
    subtitle = ...,
    x = ...,
    y = ...
    )

Other times you want to understand what makes observations with missing values different to observations with recorded values. For example, in nycflights13::flights, missing values in the dep_time variable indicate that the flight was cancelled. So you might want to compare the scheduled departure times for cancelled and non-cancelled times. You can do this by making a new variable, using is.na() to check if dep_time is missing. Let's make a plot using geom_freqpoly().

Exercise 9

With a new pipeline, pipe nycflights13::flights to mutate(). Within mutate(), create a new variable called cancelled and set it equal to is.na(dep_time).


nycflights13::flights |> 
  mutate(
    cancelled = ...
  )

In the dataset nycflights13::flights1, missing values in the dep_time variable indicate that the flight was cancelled. So you might want to compare the scheduled departure times for cancelled and non-cancelled times. You can do this by making a new variable, using is.na() to check if dep_time is missing.

Exercise 10

Within mutate(), create the sched_hour variable and set it equal to sched_dep_time %/% 100.


nycflights13::flights |> 
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = ...
  )

Data cleaning is just one application of EDA: you ask questions about whether your data meets your expectations or not. To do data cleaning, you’ll need to deploy all the tools of EDA: visualization, transformation, and modelling.

Exercise 11

Within mutate(), create the sched_min variable and set it equal to sched_dep_time %% 100.


nycflights13::flights |> 
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = ...
  )

“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” — John Tukey

Exercise 12

Within mutate(), create the sched_dep_time variable and set it equal to sched_hour plus (sched_min / 60).


nycflights13::flights |> 
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = ...
  )

EDA is fundamentally a creative process. And like most creative processes, the key to asking quality questions is to generate a large quantity of questions. It is difficult to ask revealing questions at the start of your analysis because you do not know what insights can be gleaned from your dataset.

Exercise 13

Add ggplot() to the pipeline.


... + 
  ggplot()

Each new question that you ask will expose you to a new aspect of your data and increase your chance of making a discovery. You can quickly drill down into the most interesting parts of your data—and develop a set of thought-provoking questions—if you follow up each question with a new question based on what you find.

Exercise 14

Within ggplot(), using aes(), set the x argument to the sched_dep_time variable.


... + 
  ggplot(aes(x = ...))

There is no rule about which questions you should ask to guide your research. However, two types of questions will always be useful for making discoveries within your data. You can loosely word these questions as:

  1. What type of variation occurs within my variables?

  2. What type of covariation occurs between my variables?

Exercise 15

Add geom_freqpoly() to the pipeline.


... + 
  geom_freqpoly()

Exercise 16

Within geom_freqpoly(), using aes(), set the color argument equal to the cancelled variable.


... + 
  geom_freqpoly(aes(color = ...))

Exercise 17

Within geom_freqpoly(), add the binwidth argument and set it equal to 1/4.


... +
  geom_freqpoly(aes(color = cancelled), binwidth = ...)

Exercise 18

Add a title, subtitle, x axis title, and a y axis title by adding labs() to the pipeline.


... + 
  labs(
    title = ...,
    subtitle = ...,
    x = ...,
    y = ...
  )

However this plot isn’t great because there are many more non-cancelled flights than cancelled flights. In the next section we’ll explore some techniques for improving this comparison.

Reminder: This is what our graph should look like

p1

Covariation

If variation describes the behavior within a variable, covariation describes the behavior between variables. By the end of this section we will have made this graph:

p2 <- mpg |> 
  ggplot(aes(x = hwy, y = fct_reorder(class, hwy, median))) +
  geom_boxplot() + 
  labs(
    title = "Variation of Highway Mileage over Classes", 
    subtitle = "The compact and midsized cars are tied for the highest highway mileage.",
    x = "Highway Mileage",
    y = NULL
  )

p2

Exercise 1

With a new pipeline, pipe diamonds to ggplot().


diamonds |> 
  ...
diamonds |> 
  ggplot()

Covariation is the tendency for the values of two or more variables to vary together in a related way. The best way to spot covariation is to visualize the relationship between two or more variables.

Exercise 2

Within ggplot(), using aes(), set the x argument to the variable price.


diamonds |> 
  ggplot(aes(x = ...))
diamonds |> 
  ggplot(aes(x = price))

Every variable has its own pattern of variation, which can reveal interesting information about how that it varies between measurements on the same observation as well as across observations. The best way to understand that pattern is to visualize the distribution of the variable’s values.

Exercise 3

Add geom_freqpoly() to the pipeline.


... + 
  geom_freqpoly()
diamonds |> 
  ggplot(aes(x = price)) +
    geom_freqpoly()

The function, geom_freqpoly(), visualizes the distribution of a single continuous variable by dividing the x-axis into bins and counting the number of observations in each bin. Frequency polygons display the counts with lines. Frequency polygons are more suitable when you want to compare the distribution across the levels of a categorical variable.

Exercise 4

Within geom_freqpoly(), using aes(), set the color argument to the cut variable.


... + 
  geom_freqpoly(aes(color = ...))
diamonds |> 
  ggplot(aes(x = price)) +
    geom_freqpoly(aes(color = cut))

Note that ggplot2 uses an ordered color scale for cut because it’s defined as an ordered factor variable in the data.

Exercise 5

Within geom_freqpoly(), set binwidth to 500 and linewidth to 0.75.


... + 
  geom_freqpoly(aes(color = cut), 
                binwidth = ...,
                ... = 0.75)
diamonds |> 
  ggplot(aes(x = price)) +
    geom_freqpoly(aes(color = cut), 
                  binwidth = 500,
                  linewidth = 0.75)

The default appearance of geom_freqpoly() is not that useful here because the height, determined by the overall count, differs so much across cuts, making it hard to see the differences in the shapes of their distributions. To make the comparison easier we need to swap what is displayed on the y-axis. Instead of displaying count, we’ll display the density, which is the count standardized so that the area under each frequency polygon is one.

Exercise 6

Within ggplot(), within aes(), set the y argument to after_stat(density).


diamonds |> 
  ggplot(aes(x = price, y = ...) + 
  geom_freqpoly(aes(color = cut), binwidth = 500, linewidth = 0.75)
diamonds |> 
  ggplot(aes(x = price, y = after_stat(density))) +
    geom_freqpoly(aes(color = cut), 
                  binwidth = 500,
                  linewidth = 0.75)

Note that we’re mapping the density to y, but since density is not a variable in the diamonds dataset, we need to first calculate it. We use the after_stat() function to do so.

Exercise 7

Add a title, subtitle, x axis title, a y axis title, and a legend title (color) by adding labs() to the pipeline.


... + 
  labs(
    title = ...,
    subtitle = ...,
    x = ...,
    y = ...
  )

There’s something rather surprising about this plot - it appears that fair diamonds (the lowest quality) have the highest average price! But maybe that’s because frequency polygons are a little hard to interpret - there’s a lot going on in this plot. A visually simpler plot for exploring this relationship is using side-by-side boxplots.

Exercise 8

With a new pipeline, pipe diamonds to ggplot(), setting x to cut and y to price within aes(). Add geom_boxplot() at the end.


... |> 
  ggplot(...(x = cut, y = price)) +
    ...
diamonds |> 
  ggplot(aes(x = cut, y = price)) +
    geom_boxplot()

We see much less information about the distribution, but the boxplots are much more compact so we can more easily compare them (and fit more on one plot). It supports the counter-intuitive finding that better quality diamonds are typically cheaper! In the exercises, you’ll be challenged to figure out why.

Exercise 9

Add a title, subtitle, x axis title, and a y axis title by adding labs() to the pipeline.


... + 
  labs(
    title = ...,
    subtitle = ...,
    x = ...,
    y = ...
  )

cut is an ordered factor: Fair is worse than Good, which is worse than Very Good and so on. Many categorical variables don’t have such an intrinsic order, so you might want to reorder them to make a more informative display. One way to do that is with fct_reorder(). Let's use a new dataset to explore this approach.

Exercise 10

Type in mpg and hit "Run Code".


mpg

This dataset contains a subset of the fuel economy data that the EPA makes available on https://fueleconomy.gov/. It contains only models which had a new release every year between 1999 and 2008 - this was used as a proxy for the popularity of the car.

Exercise 11

With a new pipeline, pipe mpg to ggplot(), setting x to class and y to hwy within aes(). Add geom_boxplot() at the end.


mpg |> 
  ggplot(aes(... = class, y = ...)) +
    ...
mpg |> 
  ggplot(aes(x = class, y = hwy)) +
    geom_boxplot()

The trend is pretty hard to see, so to make it easier on the eyes, we can reorder class based on the median value of hwy.

Exercise 12

Within ggplot(), within aes(), set the x argument to fct_reorder(class, hwy, median).


mpg |> 
  ggplot(aes(x = ..., y = hwy)) +
    geom_boxplot()
mpg |> 
  ggplot(aes(x = fct_reorder(class, hwy, median), y = hwy)) +
    geom_boxplot()

The function, fct_reorder(), is useful for 1d displays where the factor is mapped to position and the function fct_reorder2() is for 2d displays where the factor is mapped to a non-position aesthetic.

Exercise 13

Since we have long variable names, geom_boxplot() will work better if you flip it 90°. You can do that by exchanging the x and y aesthetic mappings. Within ggplot(), within aes(), swap the aesthetic mappings of x and y.


mpg |> 
  ggplot(aes(y = fct_reorder(class, hwy, median), ... = ...)) +
    geom_boxplot()
  ...
mpg |> 
  ggplot(aes(y = fct_reorder(class, hwy, median), x = hwy)) +
    geom_boxplot()

If you have a small dataset, it’s sometimes useful to use geom_jitter() to avoid overplotting to more easily see the relationship between a continuous and categorical variable. The ggbeeswarm package provides a number of methods similar to geom_jitter().

Exercise 14

Add a title, subtitle and x axis title by adding labs() to the pipeline. Set the y-axis label to NULL.


... + 
  labs(
    title = ...,
    subtitle = ...,
    x = ...,
    y = NULL
  )
mpg |> 
  ggplot(aes(x = hwy, y = fct_reorder(class, hwy, median))) +
  geom_boxplot() + 
  labs(
    title = "Variation of Highway Mileage over Classes", 
    subtitle = "The compact and midsized cars are tied for the highest highway mileage.",
    x = "Highway Mileage",
    y = NULL
  )

Reminder this is what our plot should look like:

p2

Covariation with two categorical variables

In this section we will learn how to visualize the covariation between categorical variables.

By the end of this section, we will be creating this plot:

p3 <- diamonds |> 
  count(color, cut) |>  
  ggplot(aes(x = color, y = cut)) +
  geom_tile(aes(fill = n)) +
  labs(
    title = "Covariation between Diamond Cut and the Color of the Diamond",
    subtitle = "There are more ideally cut diamonds than other cut types of diamonds",
    x = "Color of diamond",
    y = "Cut of diamond"
  )

p3

Exercise 1

With a new pipeline, pipe the diamonds dataset to ggplot() with aes(x = cut, y = color).


diamonds |> 
  ggplot(aes(... = cut, y = ...))
diamonds |> 
  ggplot(aes(x = cut, y = color))

To visualize the covariation between categorical variables, you’ll need to count the number of observations for each combination of levels of these categorical variables.

Exercise 2

One way to visualize two categorical variables is by using the function geom_count(). This function counts the number of observations at each location. Add geom_count() to the pipeline.


... + 
  geom_count()
diamonds |> 
  ggplot(aes(x = cut, y = color)) +
    geom_count()

The size of each circle in the plot displays how many observations occurred at each combination of values. Covariation will appear as a strong correlation between specific x values and specific y values.

Exercise 3

Another approach for exploring the relationship between these variables is computing the counts with dplyr.

With a new pipeline, pipe the diamonds dataset to count(). Within count add the variables color and cut.


diamonds |> 
  count(color, ...)
diamonds |> 
  count(color, cut)

Now, let's visualize this with geom_tile() and the fill aesthetic.

Exercise 4

Add ggplot() to the pipeline, using aes(), with x set to color and y set to cut.


... + 
  ggplot(aes(x = color, y = ...))
diamonds |> 
  count(color, cut) |> 
  ggplot(aes(x = color, y = cut))

If the categorical variables are unordered, you might want to use the seriation package to simultaneously reorder the rows and columns in order to more clearly reveal interesting patterns.

Exercise 5

To the pipeline, add geom_tile(), within which using aes(), setting the fill argument to the n variable.


... + 
  geom_tile()
diamonds |> 
  count(color, cut) |> 
  ggplot(aes(x = color, y = cut)) +
    geom_tile(aes(fill = n))

For larger plots, you might want to try the heatmaply package, which creates interactive plots.

Exercise 6

Add a title, subtitle, x axis title, and a y axis title by adding labs() to the pipeline.


... + 
  labs(title = ...,
       subtitle = ...,
       x = ...,
       y = ...)
diamonds |> 
  count(color, cut) |> 
  ggplot(aes(x = color, y = cut)) +
    geom_tile(aes(fill = n)) +
    labs(
      title = "Covariation between Diamond Cut and the Color of the Diamond",
      subtitle = "There are more ideally cut diamonds than other cut types of diamonds",
      x = "Color of diamond",
      y = "Cut of diamond"
      )

Reminder this is what your plot should look like:

p3

Covariation with two numerical variables

In this section, we will learn how to visualize covariation between two numerical variables.

p4 <- smaller |> 
  ggplot(aes(x = carat, y = price)) + 
  geom_boxplot(aes(group = cut_width(carat, 0.1))) +
  labs(title = "Covariation between Number of Carats and Price of Diamonds",
       subtitle = "As the number of carats of a diamond increases the price also increases",
       x = "Number of Carats",
       y = "Price of the Diamond")

p4

Let's build this plot.

Exercise 1

Type in smaller and hit "Run Code".


smaller
smaller

This tibble only includes diamonds that have less than 3 carats.

Exercise 2

You’ve already seen one great way to visualize the covariation between two numerical variables: draw a scatterplot with geom_point(). You can see covariation as a pattern in the points. Pipe smaller into ggplot(), using aes(x = carat, y = price)) as the value for mapping. Add geom_point() at the end.


smaller |> 
  ggplot(aes(x = ..., y = price)) +
    ...
smaller |> 
  ggplot(aes(x = carat, y = price)) +
    geom_point()

For example, you can see a positive relationship between the carat size and the price of a diamond: diamonds with more carats have a higher price. The relationship seems exponential.

Exercise 3

Scatterplots become less useful as the size of your dataset grows, because points begin to overplot, and pile up into areas of uniform black, making it hard to judge differences in the density of the data across the 2-dimensional space as well as making it hard to spot the trend.

Take the previous code and add alpha = 1 / 100 to the call to geom_point().


smaller |> 
  ggplot(aes(x = carat, y = price)) +
    geom_point(alpha = ...)
smaller |> 
  ggplot(aes(x = carat, y = price)) +
    geom_point(alpha = 1 / 100)

Using the alpha aesthetic to add transparency is one way to deal with too much data.

Exercise 4

Using transparency can be challenging for very large datasets. Another solution is to "bin" the data, using functions like geom_bin2d() and geom_hex(). Replace the call to geom_point() with geom_bin2d().


smaller |> 
  ggplot(aes(x = carat, y = price)) +
    ...
smaller |> 
  ggplot(aes(x = carat, y = price)) +
    geom_bin2d()

geom_bin2d() and geom_hex() divide the coordinate plane into 2d bins and then use a fill color to display how many points fall into each bin. geom_bin2d() creates rectangular bins. geom_hex() creates hexagonal bins. You will need to install the hexbin package to use geom_hex().

Exercise 5

Another option is to bin one continuous variable so it acts like a categorical variable. Then you can use one of the techniques for visualizing the combination of a categorical and a continuous variable that you learned about. For example, you could bin carat and then for each group, display a boxplot.

To create such a plot, begin by replacing geom_hex() with geom_boxplot() from the previous pipeline.


... +
  geom_boxplot()
smaller |> 
  ggplot(aes(x = carat, y = price)) +
    geom_boxplot()

Not a particularly informative plot! We need to change carat from a continuous numeric variable to a category variable, each value of which would cover a range of carat.

Exercise 6

Within geom_boxplot(), using aes(), set the group argument to cut_width(carat, 0.1).


... +
  geom_boxplot(aes(... = cut_width(..., 0.1)))
smaller |> 
  ggplot(aes(x = carat, y = price)) +
    geom_boxplot(aes(group = cut_width(carat, 0.1)))

cut_width(x, width), as used above, divides x into bins of width width. By default, boxplots look roughly the same (apart from number of outliers) regardless of how many observations there are, so it’s difficult to tell that each boxplot summarizes a different number of points.

Exercise 7

Add a title, subtitle, x axis title, and a y axis title by adding labs() to the pipeline.


... + 
  labs(
    title = ...,
    subtitle = ...,
    x = ...,
    y = ...,
  )
smaller |> 
  ggplot(aes(x = carat, y = price)) + 
  geom_boxplot(aes(group = cut_width(carat, 0.1))) +
  labs(title = "Covariation between Number of Carats and Price of Diamonds",
       subtitle = "As the number of carats of a diamond increases the price also increases",
       x = "Number of Carats",
       y = "Price of the Diamond")

This is a reminder that our plot should look like this:

p4

One way to deal with different number of observations in each bin is to make the width of the boxplot proportional to the number of points with varwidth = TRUE.

Patterns and models

In this section, we will discuss how to find and display patterns in your data. We will create this plot:

p5 <- diamonds_aug |> 
  ggplot(aes(x = cut, y = .resid)) + 
  geom_boxplot() + 
  labs(
    title = "Cut Quality and (adjusted) Price",
    subtitle = "A better cut means higher prices, adjusted for the number of carats",
    x = "Quality of the Cut",
    y = "Price (adjusted for carats)"
  )

p5

If a systematic relationship exists between two variables it will appear as a pattern in the data. If you spot a pattern, you can ask yourself the following questions:

Exercise 1

Load in the tidymodels package by using library().


library(...)
library(tidymodels)

Other relevant questions of data patterns include:

The tidymodels package provides tools for answering these questions.

Exercise 2

Create a new variable diamonds3 and set it equal to diamonds.


diamonds3 <- diamonds
diamonds3 <- diamonds

Patterns in your data provide clues about relationships, i.e., they reveal covariation. If you think of variation as a phenomenon that creates uncertainty, covariation is a phenomenon that reduces it.

Exercise 3

Let's manipulate the variables in diamonds3. Retain the code from the previous question but pipe diamonds to mutate(). Within mutate(), create a new variable log_price and set it equal to log(price).


diamonds3 <- diamonds |> 
  mutate(
    log_price = ...
  )
diamonds3 <- diamonds |> 
  mutate(
    log_price = log(price)
  )

If two variables covary, you can use the values of one variable to make better predictions about the values of the second. If the covariation is due to a causal relationship (a special case), then you can use the value of one variable to control the value of the second.

Exercise 4

Within mutate(), create another variable called log_carat and set it equal to log(carat).


diamonds3 <- diamonds |> 
  mutate(
    log_price = log(price),
    log_carat = ...
  )
diamonds3 <- diamonds |> 
  mutate(
    log_price = log(price),
    log_carat = log(carat)
  )

We will be using diamonds3, with these two new variables, to make a model that predicts log_price from log_carat and then computes the residuals (the difference between the predicted value and the actual value).

Exercise 5

Create a new variable called diamonds_fit and set it equal to linear_reg().


diamonds_fit <- ...
diamonds_fit <- linear_reg()

Models are a tool for extracting patterns out of data. For example, consider the diamonds data. It’s hard to understand the relationship between cut and price, because cut and carat, and carat and price are tightly related.

Exercise 6

Using your previous code, pipe linear_reg() with the following code fit(log_price ~ log_carat, data = diamonds3).


diamonds_fit <- linear_reg() |> 
  ...
diamonds_fit <- linear_reg() |> 
  fit(log_price ~ log_carat, data = diamonds3)

The functon linear_reg() defines a model that can predict numeric values from predictors using a linear function. This function can fit regression models. There are different ways to fit this model, and the method of estimation is chosen by setting the model engine.

Exercise 7

Create a new variable called diamonds_aug and set it equal to augment(diamonds_fit, new_data = diamonds3).


diamonds_aug <- ...
diamonds_aug <- augment(diamonds_fit, new_data = diamonds3)

If you look at diamonds_aug, you will notice a new variable: .resid. The residuals give us a view of the log(price) of the diamond, once the effect of log(carat) has been removed.

Exercise 8

Using your previous code, pipe augment(diamonds_fit, new_data = diamonds3) to mutate(). Within mutate(), create a new variable .resid and set it equal to exp(.resid).


diamonds_aug <- augment(diamonds_fit, new_data = diamonds3) |>
  ...
diamonds_aug <- augment(diamonds_fit, new_data = diamonds3) |>
  mutate(.resid = exp(.resid))

Note that instead of using the raw values of price and carat, we log transformed them first, and fitted a model to the log-transformed values. Then, we exponentiate the residuals to put them back on the scale of raw prices.

Exercise 9

With a new pipeline, pipe diamonds_aug to ggplot().


diamonds_aug |> 
  ggplot()
diamonds_aug |> 
  ggplot()

Note that this plot is empty, as usual when we call ggplot() without specifying a value for mapping.

Exercise 10

Within ggplot(), using aes(), set the x argument to the carat variable.


diamonds_aug |> 
  ggplot(aes(x = ...))
diamonds_aug |> 
  ggplot(aes(x = carat))

Big picture: We are sure that there is a major relationship between price and carat. Diamonds with more carats are much more expensive. If we want to study the relationship between price and cut, we first need to "adjust" for the number of carats.

Exercise 11

Within ggplot(), using aes(), set the y argument to the .resid variable.


diamonds_aug |> 
  ggplot(aes(x = carat, y = ...))
diamonds_aug |> 
  ggplot(aes(x = carat, y = .resid))

If we had not log-transformed price and carat, this adjustment process would not have worked nearly as well.

Exercise 12

Add geom_point() to the pipeline.


... +
  geom_point()
diamonds_aug |> 
  ggplot(aes(x = carat, y = .resid)) +
    geom_point()

Once you’ve removed the strong relationship between carat and price, you can see what you expect in the relationship between cut and price: relative to their size, better quality diamonds are more expensive.

Exercise 13

In the previous pipeline, replace carat with cut and replace geom_point() with geom_boxplot().


diamonds_aug |> 
  ggplot(aes(x = ..., y = .resid)) +
    ...
diamonds_aug |> 
  ggplot(aes(x = cut, y = .resid)) +
    geom_boxplot()

A boxplot is the best way to see the relationship between the type of diamond cut and the price of the diamond, adjusted for carat.

Exercise 14

Add a title, subtitle, x axis title, and a y axis title by adding labs() to the pipeline.


... +
  labs(
    title = ...,
    subtitle = ...,
    x = ...,
    y = ...
  )
diamonds_aug |> 
  ggplot(aes(x = cut, y = .resid)) + 
  geom_boxplot() + 
  labs(
    title = "Cut Quality and (adjusted) Price",
    subtitle = "A better cut means higher prices, adjusted for the number of carats",
    x = "Quality of the Cut",
    y = "Price (adjusted for carats)"
  )

Reminder this is what our plot should look like.

p5

Summary

This tutorial covered Chapter 10: Exploratory data analysis from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. We made use of the ggplot2 and the dplyr packages to learn about variation and patterns. Some key commands included coord_cartesian(), geom_freqpoly(), fct_reorder(), geom_tile(), and geom_count().




Try the r4ds.tutorials package in your browser

Any scripts or data that you put into this service are public.

r4ds.tutorials documentation built on April 3, 2025, 5:50 p.m.