library(learnr) library(tutorial.helpers) library(knitr) library(tidyverse) library(tidymodels) library(hexbin) knitr::opts_chunk$set(echo = FALSE) knitr::opts_chunk$set(out.width = '90%') options(tutorial.exercise.timelimit = 60, tutorial.storage = "local") smaller <- diamonds |> filter(carat < 3) p1 <- nycflights13::flights |> mutate( cancelled = is.na(dep_time), sched_hour = sched_dep_time %/% 100, sched_min = sched_dep_time %% 100, sched_dep_time = sched_hour + (sched_min / 60) ) |> ggplot(aes(x = sched_dep_time)) + geom_freqpoly(aes(color = cancelled), binwidth = 1/4) + labs(title = "Frequency of Cancelled and Non-cancelled Flights for Scheduled Flights", subtitle = "There are a lot of non-cancelled flights at 3 PM", x = "Scheduled Departure Time", y = "Frequency", color = "Cancelled or Non-Cancelled") p2 <- mpg |> ggplot(aes(x = hwy, y = fct_reorder(class, hwy, median))) + geom_boxplot() + labs( title = "Variation of Highway Mileage over Classes", subtitle = "The compact and midsized cars are tied for the highest highway mileage.", x = "Highway Mileage", y = "Car Classes" ) p3 <- diamonds |> count(color, cut) |> ggplot(aes(x = color, y = cut)) + geom_tile(aes(fill = n)) + labs( title = "Covariation between Diamond Cut and the Color of the Diamond", subtitle = "There are more ideally cut diamonds than other cut types of diamonds", x = "Color of diamond", y = "Cut of diamond" ) p4 <- smaller |> ggplot( aes(x = carat, y = price)) + geom_boxplot(aes(group = cut_width(carat, 0.1))) + labs(title = "Covariation between Number of Carats and Price of Diamonds", subtitle = "As the number of carats of a diamond increases the price also increases", x = "Number of Carats", y = "Price of the Diamond") diamonds3 <- diamonds |> mutate( log_price = log(price), log_carat = log(carat) ) diamonds_fit <- linear_reg() |> fit(log_price ~ log_carat, data = diamonds3) diamonds_aug <- augment(diamonds_fit, new_data = diamonds3) |> mutate(.resid = exp(.resid)) p5 <- diamonds_aug |> ggplot(aes(x = cut, y = .resid)) + geom_boxplot() + labs( title = "Prediction Model between Carat and Residual Price", subtitle = "As the number of carat increases the Residual Price decreases", x = "Number of Carats", y = "Residual Price" )
This tutorial covers Chapter 10: Exploratory data analysis from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. We will be making use of the ggplot2 and the dplyr packages to learn about variation and patterns. Some key commands include coord_cartesian()
, geom_freqpoly()
, fct_reorder()
, geom_tile()
, and geom_count()
.
Your goal during EDA is to develop an understanding of your data. The easiest way to do this is to use questions as tools to guide your investigation. When you ask a question, the question focuses your attention on a specific part of your dataset and helps you decide which graphs, models, or transformations to make.
Load in the tidyverse package using library()
.
library(...)
library(tidyverse)
In this tutorial, we’ll combine what you’ve learned about the dplyr package and the ggplot2 package to interactively ask questions, answer them with data, and then ask new questions.
Type in diamonds
and hit "Run Code"
diamonds
diamonds
The diamonds
dataset contains the prices and other attributes of r nrow(diamonds)
diamonds, such as the price
, carat
, cut
, color
, and so on.
With a new pipeline, pipe diamonds
to ggplot()
.
diamonds |> ...
diamonds |> ggplot()
Note that this plot is blank. ggplot()
has simply created an empty plot object which, when printed, shows as a blank recatangle.
You can see variation easily in real life; if you measure any continuous variable twice, you will get two different results. Each of your measurements will include a small amount of error that varies from measurement to measurement.
Within ggplot()
, using aes()
, set the x
argument to carat
.
diamonds |> ggplot(aes(x = ...))
diamonds |> ggplot(aes(x = carat))
Variables can also vary if you measure across different subjects (e.g., the eye colors of different people) or at different times (e.g., the energy levels of an electron at different moments).
Add geom_histogram()
to the pipeline.
... + geom_histogram()
diamonds |> ggplot(aes(x = carat)) + geom_histogram()
Every variable has its own pattern of variation, which can reveal interesting information about how that it varies between measurements on the same observation as well as across observations. The best way to understand that pattern is to visualize the distribution of the variable’s values.
Within geom_histogram()
, add the binwidth
argument and set it equal to 0.5
.
... + geom_histogram(binwidth = ...)
diamonds |> ggplot(aes(x = carat)) + geom_histogram(binwidth = 0.5)
We’ll start our exploration by visualizing the distribution of weights (carat) of ~54,000 diamonds from the diamonds dataset. Since carat is a numerical variable, we can use a histogram.
Create a new variable called smaller
and set it equal to diamonds
.
smaller <- ...
smaller <- diamonds
Some valuable questions to ask yourself when dealing with variation are the following:
We are still creating the smaller
tibble. Copy your previous code and create a pipeline with diamonds
. Pipe diamonds
with filter()
and filter out rows for which carat
is less than 3
.
smaller <- diamonds |> filter(carat < ...)
smaller <- diamonds |> filter(carat < 3)
“There are no routine statistical questions, only questionable statistical routines.” — Sir David Cox
With a new pipeline, pipe smaller
to ggplot()
.
smaller |> ...
smaller |> ggplot()
EDA, or exploratory data analysis, is an iterative cycle. You:
Generate questions about your data.
Search for answers by visualizing, transforming, and modelling your data.
Use what you learn to refine your questions and/or generate new questions.
Within ggplot()
, using aes()
, set the x
argument to carat
.
smaller |> ggplot(aes(x = ...))
smaller |> ggplot(aes(x = carat))
EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends.
Add geom_histogram()
to the pipeline.
... + geom_histogram()
smaller |> ggplot(aes(x = carat)) + geom_histogram()
This histogram suggests several interesting questions:
Why are there more diamonds at whole carats and common fractions of carats?
Why are there more diamonds slightly to the right of each peak than there are slightly to the left of each peak?
Within geom_histogram()
, set the binwidth
argument to 0.01
.
... + geom_histogram(binwidth = ...)
smaller |> ggplot(aes(x = carat)) + geom_histogram(binwidth = 0.01)
This visualization also revealed clusters, which suggest that subgroups exist in your data. To understand the subgroups, ask:
How are the observations within each subgroup similar to each other?
How are the observations in separate clusters different from each other?
How can you explain or describe the clusters?
Why might the appearance of clusters be misleading?
Let's observe and take note of unusual values in the dataset. With a new pipeline, pipe diamonds
to ggplot()
.
diamonds |> ...
diamonds |> ggplot()
Outliers are observations that are unusual; data points that don’t seem to fit the pattern.
Within ggplot()
, using aes()
, set the x
argument to y
.
diamonds |> ggplot(aes(x = ...))
diamonds |> ggplot(aes(x = y))
Be careful! The x
in this code is an argument, the usual usage with aes()
. The y
, on the other hand, is not an argument! It is not the usual y
for which we provide a value when using aes()
. In this case, y
is a variable in the diamonds
tibble.
Always be careful when variable names happen to be the same as argument names!
Sometimes outliers are data entry errors, sometimes they are simply values at the extremes that happened to be observed in this data collection, and other times they suggest important new discoveries.
Add geom_histogram()
to the pipeline
... + geom_histogram()
diamonds |> ggplot(aes(x = y)) + geom_histogram()
When you have a lot of data, outliers are sometimes difficult to see in a histogram. We know that they are there because the x-axis extends all the way to 60.
Within geom_histogram()
, add the binwidth
argument and set it equal to 0.5
.
... + geom_histogram(binwidth = ...)
diamonds |> ggplot(aes(x = y)) + geom_histogram(binwidth = 0.5)
There are so many observations in the common bins that the rare bins are very short, making it very difficult to see them (although maybe if you stare intently at 0 you’ll spot something). To make it easy to see the unusual values, we need to zoom to small values of the y-axis with coord_cartesian()
.
Add coord_cartesian()
to the pipeline. Within coord_cartesian()
, add the ylim
argument and set it equal to c(0, 50)
.
... + coord_cartesian(ylim = ...)
diamonds |> ggplot(aes(x = y)) + geom_histogram(binwidth = 0.5) + coord_cartesian(ylim = c(0, 50))
The function, coord_cartesian()
, also has an xlim
argument for when you need to zoom into the x-axis. ggplot2 also has xlim()
and ylim()
functions that work slightly differently: they throw away the data outside the limits. This allows us to see that there are three unusual values: 0
, ~30
, and ~60
.
Pipe diamonds
to filter()
and filter y
to be less than 3
.
diamonds |> filter(y < ...)
diamonds |> filter(y < 3)
The y
variable measures one of the three dimensions of these diamonds, in mm
. We know that diamonds can’t have a width of 0 mm, so these values must be incorrect.
Modify the pipe by adding, to the filter()
command, |
and then y
greater than 20
.
diamonds |> filter(y < 3 | y > ...)
diamonds |> filter(y < 3 | y > 20)
EDA is an important part of any data analysis, even if the primary research questions are handed to you on a platter, because you always need to investigate the quality of your data.
Add select()
to the pipeline. Within select()
, select the variables price
, x
, y
, and z
.
... |> select(price, x, ...)
diamonds |> filter(y < 3 | y > 20) |> select(price, x, y, z)
Add arrange()
to the pipeline. Within arrange()
, add the variable y
... |> arrange(...)
diamonds |> filter(y < 3 | y > 20) |> select(price, x, y, z) |> arrange(y)
This arranges the data by the width of the diamonds. By doing EDA, we have discovered missing data that was coded as 0
, which we never would have found by simply searching for NAs
. Going forward we might choose to re-code these values as NAs
in order to prevent misleading calculations. We might also suspect that measurements of 32mm
and 59mm
are implausible: those diamonds are over an inch long, but don’t cost hundreds of thousands of dollars!
In this section we will learn about the ways to deal with unusual values in a dataset. One way is to drop the entire row with the strange values. By the end of this section we will be creating the plot below:
p1 <- nycflights13::flights |> mutate( cancelled = is.na(dep_time), sched_hour = sched_dep_time %/% 100, sched_min = sched_dep_time %% 100, sched_dep_time = sched_hour + (sched_min / 60) ) |> ggplot(aes(x = sched_dep_time)) + geom_freqpoly(aes(color = cancelled), binwidth = 1/4) + labs(title = "Frequency of Cancelled and Non-cancelled Flights for Scheduled Flights", subtitle = "There are a lot of non-cancelled flights at 3 PM", x = "Scheduled Departure Time", y = "Frequency", color = "Cancelled or Non-Cancelled") p1
With a new pipeline, pipe diamonds
to filter()
. Within filter()
, type between(y, 3, 20)
.
This just keeps the rows with the normal values and drops all of the rows with the strange values. We don’t recommend this option because one invalid value doesn’t imply that all the other values for that observation are also invalid. Additionally, if you have low quality data, by the time that you’ve applied this approach to every variable you might find that you don’t have any data left!
Create a new variable called diamonds2
and set it equal to diamonds
.
diamonds2 <- ...
The function if_else()
is a vectorized if-else. Compared to the base R equivalent, ifelse()
, this function allows you to handle missing values in the condition
with missing
and always takes true
, false
, and missing
into account when determining what the output type should be.
Copy your previous code and pipe diamonds
to mutate()
. Within mutate()
, create a new variable, y_new
, and set it equal to if_else(y < 3 | y > 20, NA, y)
.
diamonds2 <- diamonds |> mutate(y_new = ...)
Here we replaced the unusual values with missing values. Here, we use the mutate()
function to replace the variable with a modified copy. We used the if_else()
function to replace unusual values with NA
:
Add ggplot()
to the pipeline.
... |> ggplot()
Within ggplot()
, using aes()
, set the x
argument to x
and the y
argument to y_new
.
... + ggplot(aes(x = ..., y = ...))
Add geom_point()
to the pipeline.
... + geom_point()
It’s not obvious where you should plot missing values, so ggplot2 doesn’t include them in the plot, but it does warn that they’ve been removed. To suppress that warning, set na.rm = TRUE
.
Within geom_point()
, add the na.rm
argument and set it equal to TRUE
.
... + geom_point(na.rm = ...)
Add a title
, subtitle
, x
axis title, and a y
axis title using labs()
.
... + labs( title = ..., subtitle = ..., x = ..., y = ... )
Other times you want to understand what makes observations with missing values different to observations with recorded values. For example, in nycflights13::flights
, missing values in the dep_time
variable indicate that the flight was cancelled. So you might want to compare the scheduled departure times for cancelled and non-cancelled times. You can do this by making a new variable, using is.na()
to check if dep_time
is missing. Let's make a plot using geom_freqpoly()
.
With a new pipeline, pipe nycflights13::flights
to mutate()
. Within mutate()
, create a new variable called cancelled
and set it equal to is.na(dep_time)
.
nycflights13::flights |> mutate( cancelled = ... )
In the dataset nycflights13::flights1
, missing values in the dep_time
variable indicate that the flight was cancelled. So you might want to compare the scheduled departure times for cancelled and non-cancelled times. You can do this by making a new variable, using is.na()
to check if dep_time
is missing.
Within mutate()
, create the sched_hour
variable and set it equal to sched_dep_time %/% 100
.
nycflights13::flights |> mutate( cancelled = is.na(dep_time), sched_hour = ... )
Data cleaning is just one application of EDA: you ask questions about whether your data meets your expectations or not. To do data cleaning, you’ll need to deploy all the tools of EDA: visualization, transformation, and modelling.
Within mutate()
, create the sched_min
variable and set it equal to sched_dep_time %% 100
.
nycflights13::flights |> mutate( cancelled = is.na(dep_time), sched_hour = sched_dep_time %/% 100, sched_min = ... )
“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” — John Tukey
Within mutate()
, create the sched_dep_time
variable and set it equal to sched_hour
plus (sched_min / 60)
.
nycflights13::flights |> mutate( cancelled = is.na(dep_time), sched_hour = sched_dep_time %/% 100, sched_min = sched_dep_time %% 100, sched_dep_time = ... )
EDA is fundamentally a creative process. And like most creative processes, the key to asking quality questions is to generate a large quantity of questions. It is difficult to ask revealing questions at the start of your analysis because you do not know what insights can be gleaned from your dataset.
Add ggplot()
to the pipeline.
... + ggplot()
Each new question that you ask will expose you to a new aspect of your data and increase your chance of making a discovery. You can quickly drill down into the most interesting parts of your data—and develop a set of thought-provoking questions—if you follow up each question with a new question based on what you find.
Within ggplot()
, using aes()
, set the x
argument to the sched_dep_time
variable.
... + ggplot(aes(x = ...))
There is no rule about which questions you should ask to guide your research. However, two types of questions will always be useful for making discoveries within your data. You can loosely word these questions as:
What type of variation occurs within my variables?
What type of covariation occurs between my variables?
Add geom_freqpoly()
to the pipeline.
... + geom_freqpoly()
Within geom_freqpoly()
, using aes()
, set the color
argument equal to the cancelled
variable.
... + geom_freqpoly(aes(color = ...))
Within geom_freqpoly()
, add the binwidth
argument and set it equal to 1/4
.
... + geom_freqpoly(aes(color = cancelled), binwidth = ...)
Add a title
, subtitle
, x
axis title, and a y
axis title by adding labs()
to the pipeline.
... + labs( title = ..., subtitle = ..., x = ..., y = ... )
However this plot isn’t great because there are many more non-cancelled flights than cancelled flights. In the next section we’ll explore some techniques for improving this comparison.
Reminder: This is what our graph should look like
p1
If variation describes the behavior within a variable, covariation describes the behavior between variables. By the end of this section we will have made this graph:
p2 <- mpg |> ggplot(aes(x = hwy, y = fct_reorder(class, hwy, median))) + geom_boxplot() + labs( title = "Variation of Highway Mileage over Classes", subtitle = "The compact and midsized cars are tied for the highest highway mileage.", x = "Highway Mileage", y = NULL ) p2
With a new pipeline, pipe diamonds
to ggplot()
.
diamonds |> ...
diamonds |> ggplot()
Covariation is the tendency for the values of two or more variables to vary together in a related way. The best way to spot covariation is to visualize the relationship between two or more variables.
Within ggplot()
, using aes()
, set the x
argument to the variable price
.
diamonds |> ggplot(aes(x = ...))
diamonds |> ggplot(aes(x = price))
Every variable has its own pattern of variation, which can reveal interesting information about how that it varies between measurements on the same observation as well as across observations. The best way to understand that pattern is to visualize the distribution of the variable’s values.
Add geom_freqpoly()
to the pipeline.
... + geom_freqpoly()
diamonds |> ggplot(aes(x = price)) + geom_freqpoly()
The function, geom_freqpoly()
, visualizes the distribution of a single continuous variable by dividing the x-axis into bins and counting the number of observations in each bin. Frequency polygons display the counts with lines. Frequency polygons are more suitable when you want to compare the distribution across the levels of a categorical variable.
Within geom_freqpoly()
, using aes()
, set the color
argument to the cut
variable.
... + geom_freqpoly(aes(color = ...))
diamonds |> ggplot(aes(x = price)) + geom_freqpoly(aes(color = cut))
Note that ggplot2 uses an ordered color scale for cut
because it’s defined as an ordered factor variable in the data.
Within geom_freqpoly()
, set binwidth
to 500
and linewidth
to 0.75
.
... + geom_freqpoly(aes(color = cut), binwidth = ..., ... = 0.75)
diamonds |> ggplot(aes(x = price)) + geom_freqpoly(aes(color = cut), binwidth = 500, linewidth = 0.75)
The default appearance of geom_freqpoly()
is not that useful here because the height, determined by the overall count, differs so much across cut
s, making it hard to see the differences in the shapes of their distributions. To make the comparison easier we need to swap what is displayed on the y-axis. Instead of displaying count, we’ll display the density, which is the count standardized so that the area under each frequency polygon is one.
Within ggplot()
, within aes()
, set the y
argument to after_stat(density)
.
diamonds |> ggplot(aes(x = price, y = ...) + geom_freqpoly(aes(color = cut), binwidth = 500, linewidth = 0.75)
diamonds |> ggplot(aes(x = price, y = after_stat(density))) + geom_freqpoly(aes(color = cut), binwidth = 500, linewidth = 0.75)
Note that we’re mapping the density
to y
, but since density
is not a variable in the diamonds
dataset, we need to first calculate it. We use the after_stat()
function to do so.
Add a title
, subtitle
, x
axis title, a y
axis title, and a legend title (color
) by adding labs()
to the pipeline.
... + labs( title = ..., subtitle = ..., x = ..., y = ... )
There’s something rather surprising about this plot - it appears that fair diamonds (the lowest quality) have the highest average price! But maybe that’s because frequency polygons are a little hard to interpret - there’s a lot going on in this plot. A visually simpler plot for exploring this relationship is using side-by-side boxplots.
With a new pipeline, pipe diamonds
to ggplot()
, setting x
to cut
and y
to price
within aes()
. Add geom_boxplot()
at the end.
... |> ggplot(...(x = cut, y = price)) + ...
diamonds |> ggplot(aes(x = cut, y = price)) + geom_boxplot()
We see much less information about the distribution, but the boxplots are much more compact so we can more easily compare them (and fit more on one plot). It supports the counter-intuitive finding that better quality diamonds are typically cheaper! In the exercises, you’ll be challenged to figure out why.
Add a title
, subtitle
, x
axis title, and a y
axis title by adding labs()
to the pipeline.
... + labs( title = ..., subtitle = ..., x = ..., y = ... )
cut
is an ordered factor: Fair is worse than Good, which is worse than Very Good and so on. Many categorical variables don’t have such an intrinsic order, so you might want to reorder them to make a more informative display. One way to do that is with fct_reorder()
. Let's use a new dataset to explore this approach.
Type in mpg
and hit "Run Code".
mpg
This dataset contains a subset of the fuel economy data that the EPA makes available on https://fueleconomy.gov/. It contains only models which had a new release every year between 1999 and 2008 - this was used as a proxy for the popularity of the car.
With a new pipeline, pipe mpg
to ggplot()
, setting x
to class
and y
to hwy
within aes()
. Add geom_boxplot()
at the end.
mpg |> ggplot(aes(... = class, y = ...)) + ...
mpg |> ggplot(aes(x = class, y = hwy)) + geom_boxplot()
The trend is pretty hard to see, so to make it easier on the eyes, we can reorder class
based on the median
value of hwy
.
Within ggplot()
, within aes()
, set the x
argument to fct_reorder(class, hwy, median)
.
mpg |> ggplot(aes(x = ..., y = hwy)) + geom_boxplot()
mpg |> ggplot(aes(x = fct_reorder(class, hwy, median), y = hwy)) + geom_boxplot()
The function, fct_reorder()
, is useful for 1d displays where the factor is mapped to position and the function fct_reorder2()
is for 2d displays where the factor is mapped to a non-position aesthetic.
Since we have long variable names, geom_boxplot()
will work better if you flip it 90°
. You can do that by exchanging the x
and y
aesthetic mappings. Within ggplot()
, within aes()
, swap the aesthetic mappings of x
and y
.
mpg |> ggplot(aes(y = fct_reorder(class, hwy, median), ... = ...)) + geom_boxplot() ...
mpg |> ggplot(aes(y = fct_reorder(class, hwy, median), x = hwy)) + geom_boxplot()
If you have a small dataset, it’s sometimes useful to use geom_jitter()
to avoid overplotting to more easily see the relationship between a continuous and categorical variable. The ggbeeswarm package provides a number of methods similar to geom_jitter()
.
Add a title
, subtitle
and x
axis title by adding labs()
to the pipeline. Set the y-axis label to NULL
.
... + labs( title = ..., subtitle = ..., x = ..., y = NULL )
mpg |> ggplot(aes(x = hwy, y = fct_reorder(class, hwy, median))) + geom_boxplot() + labs( title = "Variation of Highway Mileage over Classes", subtitle = "The compact and midsized cars are tied for the highest highway mileage.", x = "Highway Mileage", y = NULL )
Reminder this is what our plot should look like:
p2
In this section we will learn how to visualize the covariation between categorical variables.
By the end of this section, we will be creating this plot:
p3 <- diamonds |> count(color, cut) |> ggplot(aes(x = color, y = cut)) + geom_tile(aes(fill = n)) + labs( title = "Covariation between Diamond Cut and the Color of the Diamond", subtitle = "There are more ideally cut diamonds than other cut types of diamonds", x = "Color of diamond", y = "Cut of diamond" ) p3
With a new pipeline, pipe the diamonds
dataset to ggplot()
with aes(x = cut, y = color)
.
diamonds |> ggplot(aes(... = cut, y = ...))
diamonds |> ggplot(aes(x = cut, y = color))
To visualize the covariation between categorical variables, you’ll need to count the number of observations for each combination of levels of these categorical variables.
One way to visualize two categorical variables is by using the function geom_count()
. This function counts the number of observations at each location. Add geom_count()
to the pipeline.
... + geom_count()
diamonds |> ggplot(aes(x = cut, y = color)) + geom_count()
The size of each circle in the plot displays how many observations occurred at each combination of values. Covariation will appear as a strong correlation between specific x values and specific y values.
Another approach for exploring the relationship between these variables is computing the counts with dplyr.
With a new pipeline, pipe the diamonds
dataset to count()
. Within count
add the variables color
and cut
.
diamonds |> count(color, ...)
diamonds |> count(color, cut)
Now, let's visualize this with geom_tile()
and the fill aesthetic.
Add ggplot()
to the pipeline, using aes()
, with x
set to color
and y
set to cut
.
... + ggplot(aes(x = color, y = ...))
diamonds |> count(color, cut) |> ggplot(aes(x = color, y = cut))
If the categorical variables are unordered, you might want to use the seriation package to simultaneously reorder the rows and columns in order to more clearly reveal interesting patterns.
To the pipeline, add geom_tile()
, within which using aes()
, setting the fill
argument to the n
variable.
... + geom_tile()
diamonds |> count(color, cut) |> ggplot(aes(x = color, y = cut)) + geom_tile(aes(fill = n))
For larger plots, you might want to try the heatmaply package, which creates interactive plots.
Add a title
, subtitle
, x
axis title, and a y
axis title by adding labs()
to the pipeline.
... + labs(title = ..., subtitle = ..., x = ..., y = ...)
diamonds |> count(color, cut) |> ggplot(aes(x = color, y = cut)) + geom_tile(aes(fill = n)) + labs( title = "Covariation between Diamond Cut and the Color of the Diamond", subtitle = "There are more ideally cut diamonds than other cut types of diamonds", x = "Color of diamond", y = "Cut of diamond" )
Reminder this is what your plot should look like:
p3
In this section, we will learn how to visualize covariation between two numerical variables.
p4 <- smaller |> ggplot(aes(x = carat, y = price)) + geom_boxplot(aes(group = cut_width(carat, 0.1))) + labs(title = "Covariation between Number of Carats and Price of Diamonds", subtitle = "As the number of carats of a diamond increases the price also increases", x = "Number of Carats", y = "Price of the Diamond") p4
Let's build this plot.
Type in smaller
and hit "Run Code".
smaller
smaller
This tibble only includes diamonds that have less than 3 carats.
You’ve already seen one great way to visualize the covariation between two numerical variables: draw a scatterplot with geom_point()
. You can see covariation as a pattern in the points. Pipe smaller
into ggplot()
, using aes(x = carat, y = price))
as the value for mapping
. Add geom_point()
at the end.
smaller |> ggplot(aes(x = ..., y = price)) + ...
smaller |> ggplot(aes(x = carat, y = price)) + geom_point()
For example, you can see a positive relationship between the carat size and the price of a diamond: diamonds with more carats have a higher price. The relationship seems exponential.
Scatterplots become less useful as the size of your dataset grows, because points begin to overplot, and pile up into areas of uniform black, making it hard to judge differences in the density of the data across the 2-dimensional space as well as making it hard to spot the trend.
Take the previous code and add alpha = 1 / 100
to the call to geom_point()
.
smaller |> ggplot(aes(x = carat, y = price)) + geom_point(alpha = ...)
smaller |> ggplot(aes(x = carat, y = price)) + geom_point(alpha = 1 / 100)
Using the alpha
aesthetic to add transparency is one way to deal with too much data.
Using transparency can be challenging for very large datasets. Another solution is to "bin" the data, using functions like geom_bin2d()
and geom_hex()
. Replace the call to geom_point()
with geom_bin2d()
.
smaller |> ggplot(aes(x = carat, y = price)) + ...
smaller |> ggplot(aes(x = carat, y = price)) + geom_bin2d()
geom_bin2d()
and geom_hex()
divide the coordinate plane into 2d bins and then use a fill color to display how many points fall into each bin. geom_bin2d()
creates rectangular bins. geom_hex()
creates hexagonal bins. You will need to install the hexbin package to use geom_hex()
.
Another option is to bin one continuous variable so it acts like a categorical variable. Then you can use one of the techniques for visualizing the combination of a categorical and a continuous variable that you learned about. For example, you could bin carat and then for each group, display a boxplot.
To create such a plot, begin by replacing geom_hex()
with geom_boxplot()
from the previous pipeline.
... + geom_boxplot()
smaller |> ggplot(aes(x = carat, y = price)) + geom_boxplot()
Not a particularly informative plot! We need to change carat
from a continuous numeric variable to a category variable, each value of which would cover a range of carat
.
Within geom_boxplot()
, using aes()
, set the group
argument to cut_width(carat, 0.1)
.
... + geom_boxplot(aes(... = cut_width(..., 0.1)))
smaller |> ggplot(aes(x = carat, y = price)) + geom_boxplot(aes(group = cut_width(carat, 0.1)))
cut_width(x, width)
, as used above, divides x into bins of width
width. By default, boxplots look roughly the same (apart from number of outliers) regardless of how many observations there are, so it’s difficult to tell that each boxplot summarizes a different number of points.
Add a title
, subtitle
, x
axis title, and a y
axis title by adding labs()
to the pipeline.
... + labs( title = ..., subtitle = ..., x = ..., y = ..., )
smaller |> ggplot(aes(x = carat, y = price)) + geom_boxplot(aes(group = cut_width(carat, 0.1))) + labs(title = "Covariation between Number of Carats and Price of Diamonds", subtitle = "As the number of carats of a diamond increases the price also increases", x = "Number of Carats", y = "Price of the Diamond")
This is a reminder that our plot should look like this:
p4
One way to deal with different number of observations in each bin is to make the width of the boxplot proportional to the number of points with varwidth = TRUE
.
In this section, we will discuss how to find and display patterns in your data. We will create this plot:
p5 <- diamonds_aug |> ggplot(aes(x = cut, y = .resid)) + geom_boxplot() + labs( title = "Cut Quality and (adjusted) Price", subtitle = "A better cut means higher prices, adjusted for the number of carats", x = "Quality of the Cut", y = "Price (adjusted for carats)" ) p5
If a systematic relationship exists between two variables it will appear as a pattern in the data. If you spot a pattern, you can ask yourself the following questions:
Could this pattern be due to coincidence (i.e. random chance)?
How can you describe the relationship implied by the pattern?
Load in the tidymodels
package by using library()
.
library(...)
library(tidymodels)
Other relevant questions of data patterns include:
How strong is the relationship implied by the pattern?
What other variables might affect the relationship?
Does the relationship change if you look at individual subgroups of the data?
The tidymodels package provides tools for answering these questions.
Create a new variable diamonds3
and set it equal to diamonds
.
diamonds3 <- diamonds
diamonds3 <- diamonds
Patterns in your data provide clues about relationships, i.e., they reveal covariation. If you think of variation as a phenomenon that creates uncertainty, covariation is a phenomenon that reduces it.
Let's manipulate the variables in diamonds3
. Retain the code from the previous question but pipe diamonds
to mutate()
. Within mutate()
, create a new variable log_price
and set it equal to log(price)
.
diamonds3 <- diamonds |> mutate( log_price = ... )
diamonds3 <- diamonds |> mutate( log_price = log(price) )
If two variables covary, you can use the values of one variable to make better predictions about the values of the second. If the covariation is due to a causal relationship (a special case), then you can use the value of one variable to control the value of the second.
Within mutate()
, create another variable called log_carat
and set it equal to log(carat)
.
diamonds3 <- diamonds |> mutate( log_price = log(price), log_carat = ... )
diamonds3 <- diamonds |> mutate( log_price = log(price), log_carat = log(carat) )
We will be using diamonds3
, with these two new variables, to make a model that predicts log_price
from log_carat
and then computes the residuals (the difference between the predicted value and the actual value).
Create a new variable called diamonds_fit
and set it equal to linear_reg()
.
diamonds_fit <- ...
diamonds_fit <- linear_reg()
Models are a tool for extracting patterns out of data. For example, consider the diamonds data. It’s hard to understand the relationship between cut and price, because cut and carat, and carat and price are tightly related.
Using your previous code, pipe linear_reg()
with the following code fit(log_price ~ log_carat, data = diamonds3)
.
diamonds_fit <- linear_reg() |> ...
diamonds_fit <- linear_reg() |> fit(log_price ~ log_carat, data = diamonds3)
The functon linear_reg()
defines a model that can predict numeric values from predictors using a linear function. This function can fit regression models. There are different ways to fit this model, and the method of estimation is chosen by setting the model engine
.
Create a new variable called diamonds_aug
and set it equal to augment(diamonds_fit, new_data = diamonds3)
.
diamonds_aug <- ...
diamonds_aug <- augment(diamonds_fit, new_data = diamonds3)
If you look at diamonds_aug
, you will notice a new variable: .resid
. The residuals give us a view of the log(price)
of the diamond, once the effect of log(carat)
has been removed.
Using your previous code, pipe augment(diamonds_fit, new_data = diamonds3)
to mutate()
. Within mutate()
, create a new variable .resid
and set it equal to exp(.resid)
.
diamonds_aug <- augment(diamonds_fit, new_data = diamonds3) |> ...
diamonds_aug <- augment(diamonds_fit, new_data = diamonds3) |> mutate(.resid = exp(.resid))
Note that instead of using the raw values of price
and carat
, we log transformed them first, and fitted a model to the log-transformed values. Then, we exponentiate the residuals to put them back on the scale of raw prices.
With a new pipeline, pipe diamonds_aug
to ggplot()
.
diamonds_aug |> ggplot()
diamonds_aug |> ggplot()
Note that this plot is empty, as usual when we call ggplot()
without specifying a value for mapping
.
Within ggplot()
, using aes()
, set the x
argument to the carat
variable.
diamonds_aug |> ggplot(aes(x = ...))
diamonds_aug |> ggplot(aes(x = carat))
Big picture: We are sure that there is a major relationship between price
and carat
. Diamonds with more carats are much more expensive. If we want to study the relationship between price
and cut
, we first need to "adjust" for the number of carats.
Within ggplot()
, using aes()
, set the y
argument to the .resid
variable.
diamonds_aug |> ggplot(aes(x = carat, y = ...))
diamonds_aug |> ggplot(aes(x = carat, y = .resid))
If we had not log-transformed price
and carat
, this adjustment process would not have worked nearly as well.
Add geom_point()
to the pipeline.
... + geom_point()
diamonds_aug |> ggplot(aes(x = carat, y = .resid)) + geom_point()
Once you’ve removed the strong relationship between carat and price, you can see what you expect in the relationship between cut and price: relative to their size, better quality diamonds are more expensive.
In the previous pipeline, replace carat
with cut
and replace geom_point()
with geom_boxplot()
.
diamonds_aug |> ggplot(aes(x = ..., y = .resid)) + ...
diamonds_aug |> ggplot(aes(x = cut, y = .resid)) + geom_boxplot()
A boxplot is the best way to see the relationship between the type of diamond cut and the price of the diamond, adjusted for carat
.
Add a title
, subtitle
, x
axis title, and a y
axis title by adding labs()
to the pipeline.
... + labs( title = ..., subtitle = ..., x = ..., y = ... )
diamonds_aug |> ggplot(aes(x = cut, y = .resid)) + geom_boxplot() + labs( title = "Cut Quality and (adjusted) Price", subtitle = "A better cut means higher prices, adjusted for the number of carats", x = "Quality of the Cut", y = "Price (adjusted for carats)" )
Reminder this is what our plot should look like.
p5
This tutorial covered Chapter 10: Exploratory data analysis from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. We made use of the ggplot2 and the dplyr packages to learn about variation and patterns. Some key commands included coord_cartesian()
, geom_freqpoly()
, fct_reorder()
, geom_tile()
, and geom_count()
.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.