library(learnr)
library(tutorial.helpers)
library(tidyverse)
knitr::opts_chunk$set(echo = FALSE)
knitr::opts_chunk$set(out.width = '90%')
options(tutorial.exercise.timelimit = 60, tutorial.storage = "local")
scat_p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point(alpha = 0.5, color = "steelblue") + labs(title = "Measurements for Engine Displacement and Highway Fuel Efficiency for Selected Car Models", subtitle = "Cars with greater engine displacement are less fuel efficient", x = "Engine Displacement (L)", y = "Highway Efficiency (mpg)", caption = "EPA (2008)")
This tutorial covers the Introduction from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. You will learn how to work with data sets using the R packages dplyr and ggplot2, how to direct the result of one function to another using the pipe, |>
, how to make a professional plot using the ggplot()
function, and how to store the plot-creation code in an R script.
This tutorial assumes that you have already completed the "Getting Started with Tutorials" tutorial in the tutorial.helpers package. If you haven't, do so now. It is quick!
You will learn how to explore new data sets using functions like summary()
, glimpse()
, and slice_sample()
to get an overview of 2 data sets: diamonds
and midwest
.
Data science is a vast field, and there’s no way you can master it all by reading a single book. This book aims to give you a solid foundation on the most important tools and enough knowledge to find the resources to learn more when necessary. The steps of a typical data science project look something like this:
knitr::include_graphics("images/base.png")
Looking at the graphic above, the first step in a data science project is to import your data into R. This typically means that you take data stored in a file, database, or web application programming interface (API) and load it into a data frame in R. If you can’t get your data into R, you can’t do data science on it!
Before you start doing data science, you must load the packages you are going to use. Use the function library()
to load the tidyverse package.
library(...)
library(tidyverse)
Nothing is returned, which is often the case with R code. But note the check mark which has appeared next to "Exercise 2" above. It indicates only that you have submitted an answer; it does not verify that the answer is correct.
Data frames, also referred to as "tibbles", are spreadsheet-type data sets. Type diamonds
in the line below.
diamonds
diamonds
After importing your data, it is a good idea to tidy it. Tidying your data means storing it in a consistent form that matches the semantics of the dataset with how it is stored.
Run summary()
on diamonds
. This function provides a quick statistical overview of each variable in the data set.
summary(...)
summary(diamonds)
When your data is tidy, each column is a variable and each row is an observation. Tidy data is important because the consistent structure lets you focus your efforts on answering questions about the data, not fighting to get the data into the right form for different functions.
Visualization is a fundamentally human activity. A good visualization will show you things you did not expect or raise new questions about the data.
ggplot(data = diamonds, mapping = aes(x = carat, y = price)) + geom_point(alpha = 0.5, color = "steelblue") + labs(title = "Price of Diamonds by Carat", x = "Carat", y = "Price", caption = "Diamonds (2008)")
A good visualization might also hint that you’re asking the wrong question or that you need to collect different data. Visualizations can surprise you, but they don’t scale particularly well because they require a human to interpret them.
Run slice_sample()
on diamonds
. This selects a random row from the data set.
slice_sample(...)
slice_sample(diamonds)
Once you have tidy data, a common next step is to transform it. Transformation includes narrowing in on observations of interest (like all people in one city or all data from the last year), creating new variables that are functions of existing variables (like computing speed from distance and time), and calculating a set of summary statistics (like counts or means).
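The three kinds of transformation described above can be sketched with dplyr on the diamonds tibble. This is only an illustration: the choice of the "Ideal" cut and the names ideal_diamonds, price_per_carat, and ideal_summary are assumptions made for the example, not part of the tutorial.

```r
library(tidyverse)

# Narrow in on observations of interest: diamonds with an "Ideal" cut
ideal_diamonds <- diamonds |>
  filter(cut == "Ideal")

# Create a new variable that is a function of existing variables
ideal_diamonds <- ideal_diamonds |>
  mutate(price_per_carat = price / carat)

# Calculate summary statistics: a count and a mean
ideal_summary <- ideal_diamonds |>
  summarize(n_diamonds = n(), mean_price_per_carat = mean(price_per_carat))

ideal_summary
```

Each step takes a tibble and returns a tibble, which is why the steps can be chained.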
Copy/paste the code above, but add the argument n = 10
to slice_sample()
. This will return 10 random rows from the diamonds
data set.
slice_sample(..., n = 10)
slice_sample(diamonds, n = 10)
Together, tidying and transforming are called wrangling because getting your data in a form that’s natural to work with often feels like a fight!
Run print()
on diamonds
. This returns the same result as typing diamonds
in the code block.
...(diamonds)
print(diamonds)
You can choose how many rows to display by using the n
argument in the print()
function, and how many columns to display by using the width
argument.
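As a small sketch of how the two arguments combine, the call below limits the rows with n and widens the display with width. Setting width = Inf asks the tibble print method to show every column rather than only those that fit on screen; the value 5 is arbitrary.

```r
library(tidyverse)

# Show only the first 5 rows, and all columns regardless of console width
print(diamonds, n = 5, width = Inf)
```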
Run print()
on diamonds
with the argument n = 3
. This returns the first 3 rows of the diamonds
data set.
print(..., n = 3)
print(diamonds, n = 3)
The diamonds
data set contains 53,940 rows and 10 columns. Each row represents a single diamond, and each column represents a different characteristic of the diamond.
Type ?diamonds
to look up the help page for the diamonds
tibble from the ggplot2 package, which is one of the core packages in the Tidyverse.
?...
Copy/paste the Description from the help page into the box below.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
You can find help about an entire package with help(package = "ggplot2")
. It is confusing, but unavoidable, that package names are sometimes unquoted, as in library(ggplot2)
, and sometimes quoted, as in help(package = "ggplot2")
. If one does not work, try the other.
Run glimpse()
on diamonds
.
...(diamonds)
glimpse(diamonds)
glimpse()
displays columns running down the page and the data running across. Note how the "type" of each variable is listed next to the variable name. For example, price
is listed as <int>
, meaning that it is an integer variable. To learn more about the glimpse()
function, run ?glimpse
.
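If you want to check a single variable's type without printing the whole tibble, one option is base R's class() function. The sketch below assumes the tidyverse is loaded so that diamonds is available.

```r
library(tidyverse)

# price is stored as an integer, which glimpse() abbreviates as <int>
class(diamonds$price)

# carat is stored as a double, which glimpse() abbreviates as <dbl>
class(diamonds$carat)
```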
Type midwest
and hit "Run Code."
midwest
midwest
Models are complementary tools to visualization. Once you have made your questions sufficiently precise, you can use a model to answer them. Models are fundamentally mathematical or computational tools, so they generally scale well.
Run summary()
on midwest
.
summary(...)
summary(midwest)
Even when they don’t, it’s usually cheaper to buy more computers than it is to buy more brains! But every model makes assumptions, and by its very nature, a model cannot question its own assumptions. That means a model cannot fundamentally surprise you.
Run slice_sample()
on midwest
.
slice_sample(...)
slice_sample(midwest)
Referring again to our graphic, the last step of data science is communication, an absolutely critical part of any data analysis project. It doesn’t matter how well your models and visualization have led you to understand the data unless you can also communicate your results to others.
Copy/paste the code above, but add the argument n = 50
to slice_sample()
.
slice_sample(..., n = 50)
slice_sample(midwest, n = 50)
Surrounding all these tools is programming. Programming is a cross-cutting tool that you use in nearly every part of a data science project. You don’t need to be an expert programmer to be a successful data scientist, but learning more about programming pays off because becoming a better programmer allows you to automate common tasks and solve new problems with greater ease.
Run glimpse()
on midwest
.
glimpse(...)
glimpse(midwest)
view()
is another useful function, but, because it is interactive, we should not use it within a tutorial.
If you are ever stuck while coding, R has help pages. Let's say we want to know what the function sqrt()
does. Open the help page for sqrt()
by typing ?sqrt
below.
?...
Note that "library" and "package" are often used interchangeably in R, although, strictly speaking, a library is the directory in which packages are installed. To load a package, use the library()
command, which gives us access to the functions and data that the package contains.
Assign the value of sqrt(144)
to the variable x
. Remember to use the assignment operator <-
.
x <- ...(144)
x <- sqrt(144)
The assignment operator <-
is used to assign values to variables. The left side of the operator is the variable name, and the right side is the value to be assigned. The value can be a number, a string, a logical value, or the result of a function.
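A short sketch of each kind of value mentioned above. The variable names (n_rows, dataset_name, is_tidy, root) are made up for this illustration.

```r
# A number
n_rows <- 53940

# A string
dataset_name <- "diamonds"

# A logical value
is_tidy <- TRUE

# The result of a function
root <- sqrt(n_rows)

# Typing a variable's name prints its value
n_rows
dataset_name
is_tidy
root
```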
Type x
in the exercise code block below. Note that it will return an error.
x
This is because the variable x
is only available in the code block where it was created. If you want to use x
in another code block, you must assign it again.
Code comments are text placed after a #
symbol. Nothing will be run after a #
symbol on the same line, which is useful if you want to write human-readable comments in your code.
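A minimal illustration of the two common uses of comments: explaining code, and temporarily disabling a line ("commenting it out"). The variable name root is just an example.

```r
# This whole line is a comment, so R ignores it.
root <- sqrt(144)  # Text after the # on this line is ignored too.

# The next line is commented out, so it produces no result when run:
# sqrt(144)

root
```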
Press "Run Code." Afterwards, add a #
at the start of the line and re-run the code block. You should no longer see a result.
sqrt(144)
The previous description of the tools of data science is organized roughly according to the order in which you use them in an analysis (although, of course, you’ll iterate through them multiple times). In our experience, however, learning data importing and tidying first is suboptimal because, 80% of the time, it’s routine and boring, and the other 20% of the time, it’s weird and frustrating. That’s a bad place to start learning a new subject!
Instead, we’ll start with visualization and transformation of data that’s already been imported and tidied.
Let's create the following scatterplot from the mpg
dataset, which provides measurements of attributes from various car models.
scat_p
Run ?mpg
to look up the help page for the mpg
tibble from the ggplot2 package.
?mpg
This dataset contains a subset of the fuel economy data that the EPA makes available on https://fueleconomy.gov/. It contains only models which had a new release every year between 1999 and 2008 - this was used as a proxy for the popularity of the car.
First, let's glimpse()
the mpg
data set. Looking at the axis titles above, can you determine what the names are for the two variables we will use?
glimpse(...)
glimpse(mpg)
glimpse()
is most effective when you want to see both all the variables in a data set and many observations.
We are going to use displ
and hwy
to create the plot.
Referring to what pops up when you run ?mpg
, describe in your own words what displ
and hwy
stand for.
question_text(NULL, message = "In the `mpg` data set, `displ` stands for engine displacement, measured in litres while `hwy` stands for highway miles per gallon.", answer(NULL, correct = TRUE), allow_retry = FALSE, incorrect = NULL, rows = 6)
Run ggplot()
, setting data
equal to mpg
.
...(data = mpg)
ggplot(data = mpg)
ggplot()
initializes a ggplot object. Your output should be an empty screen.
The first argument to ggplot()
is data
, as above. The second argument is mapping
. Set mapping
equal to aes()
, which is the "aesthetics" function for plotting.
ggplot(data = mpg, ... = ...())
ggplot(data = mpg, mapping = aes())
This produces the same blank canvas as above. We need to specify some arguments to aes()
in order to generate the plot.
The two most important arguments in aes()
are x
and y
. Set x
equal to displ
. Set y
equal to hwy
.
ggplot(data = mpg, mapping = aes(x = ..., y = ...))
ggplot(data = mpg, mapping = aes(x = displ, y = hwy))
Anything included in aes()
brings some information from the data in our tibble onto the graph. In this case, R knows that displ
(a measure of the size of a car's engine) goes on the x-axis and hwy
(miles per gallon for highway driving) goes on the y-axis.
R can also see the range of values in mpg
for both displ
and hwy
, thereby determining the range of values which the axes should cover.
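You can inspect those ranges yourself with base R's range() function, which returns the minimum and maximum of a vector. This is only a way to see the numbers; ggplot2 computes the axis limits on its own.

```r
library(tidyverse)

# Smallest and largest engine displacement in mpg
range(mpg$displ)

# Smallest and largest highway mileage in mpg
range(mpg$hwy)
```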
Let's now add the layer geom_point()
. Steps within a series of plotting commands are connected by plus signs (+
).
Remember, when you add a layer, you use `+`.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_...()
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point()
geom_*
functions (such as geom_point()
) add additional layers to the base ggplot. This allows us to create a graphic piece-by-piece.
The code above uses the mpg
tibble to create a scatterplot that displays only 126 distinct points, even though the data set contains 234 observations. Because many observations share the same values, some points are hidden behind others. This is known as overplotting.
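One way to see the overplotting in numbers is to compare the number of rows in mpg with the number of distinct (displ, hwy) pairs, since each distinct pair occupies one position on the plot. The names n_obs and n_positions are assumptions made for this sketch.

```r
library(tidyverse)

# Total number of observations in mpg
n_obs <- nrow(mpg)

# Number of distinct (displ, hwy) pairs, i.e. distinct positions on the plot
n_positions <- mpg |>
  distinct(displ, hwy) |>
  nrow()

n_obs
n_positions
```

The second number is smaller than the first, so some positions hold more than one car.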
One method to fight overplotting is to make each point semi-transparent. Change the transparency of the points by setting alpha
equal to 0.5
within the call to geom_point()
.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + ...(alpha = 0.5)
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point(alpha = 0.5)
alpha
only changes the appearance of the graph and does not add new information from the data. Thus, this argument is within the geom
and is not nested within a call to aes()
.
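To make the distinction concrete, here is a hedged side-by-side sketch. The first plot sets a fixed appearance outside aes(); the second maps an appearance to a column of the data inside aes(). Mapping color to class is only an illustrative choice, and the names p_fixed and p_mapped are made up.

```r
library(tidyverse)

# Outside aes(): a fixed appearance applied to every point
p_fixed <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.5)

# Inside aes(): the appearance is driven by a column of the data (here, class),
# so ggplot2 also adds a legend
p_mapped <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class))

p_fixed
p_mapped
```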
Now, also within geom_point()
, set color
equal to "steelblue"
.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point(alpha = ..., color = "...")
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point(alpha = 0.5, color = "steelblue")
R has 657 [built-in color names](https://www.nceas.ucsb.edu/sites/default/files/2020-04/colorPaletteCheatsheet.pdf?page=3). Like alpha
, color
does not bring any information from the data onto the graph, so the argument is placed within geom
.
Now, use labs()
to add the title to the graph using the argument title
. Reminder: This is what our graph should look like.
scat_p
... + labs(title = "...")
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point(alpha = 0.5, color = "steelblue") + labs(title = "Measurements for Engine Displacement and Highway Fuel Efficiency for Selected Car Models")
This book proudly and primarily focuses on small, in-memory datasets. This is the right place to start because you can’t tackle big data unless you have experience with small data.
Add the subtitle
to the graph using the argument subtitle
.
... + labs(title = "...", subtitle = "...")
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point(alpha = 0.5, color = "steelblue") + labs(title = "Measurements for Engine Displacement and Highway Fuel Efficiency for Selected Car Models", subtitle = "Cars with greater engine displacement are less fuel efficient")
The subtitle should be the one sentence of information about the graph with which you would hope a reader walks away. What is the most important fact demonstrated in the graphic?
Set x
to "Engine Displacement (L)"
.
... + labs(title = "...", subtitle = "...", x = "...")
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point(alpha = 0.5, color = "steelblue") + labs(title = "Measurements for Engine Displacement and Highway Fuel Efficiency for Selected Car Models", subtitle = "Cars with greater engine displacement are less fuel efficient", x = "Engine Displacement (L)")
The tools you’ll learn throughout the majority of this book and these tutorials will easily handle hundreds of megabytes of data, and with a bit of care, you can typically use them to work with a few gigabytes of data.
Set y
to "Highway Efficiency (mpg)"
.
... + labs(title = "...", subtitle = "...", x = "...", y = "...")
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point(alpha = 0.5, color = "steelblue") + labs(title = "Measurements for Engine Displacement and Highway Fuel Efficiency for Selected Car Models", subtitle = "Cars with greater engine displacement are less fuel efficient", x = "Engine Displacement (L)", y = "Highway Efficiency (mpg)")
We’ll also show you how to get data out of databases and parquet files, both of which are often used to store big data. You won’t necessarily be able to work with the entire dataset, but that’s not a problem because you only need a subset or subsample to answer the question that you’re interested in.
Finally, set the caption
to "EPA (2008)"
. The caption is a place where you credit the source of your data.
... + labs(title = "...", subtitle = "...", x = "...", y = "...", caption = "...")
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point(alpha = 0.5, color = "steelblue") + labs(title = "Measurements for Engine Displacement and Highway Fuel Efficiency for Selected Car Models", subtitle = "Cars with greater engine displacement are less fuel efficient", x = "Engine Displacement (L)", y = "Highway Efficiency (mpg)", caption = "EPA (2008)")
If you routinely work with larger data sets (10-100 GB, say), we recommend learning more about data.table. It uses a different interface than the tidyverse and requires you to learn some different conventions.
R for Data Science (2e) and these associated tutorials cover a lot of material. But we can't cover everything. In particular, the book and these tutorials do not cover modeling, big data, or other programming languages like Python and Julia.
If you are running this tutorial, then you probably already know about R and RStudio. If, for some reason, you don't, this Getting Started chapter is the best place to start.
If you plan to do several of the tutorials in this package, you may find it useful to install all the necessary packages. Simply copy/paste this code into the Console.
install.packages( c("arrow", "babynames", "curl", "duckdb", "gapminder", "ggrepel", "ggridges", "ggthemes", "hexbin", "janitor", "Lahman", "leaflet", "maps", "nycflights13", "openxlsx", "palmerpenguins", "repurrrsive", "tidymodels", "writexl") )
Below are two background questions which are sometimes used by instructors in organizing their breakout rooms.
How old are you?
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 1)
What is your highest level of education? (Just completed 10th grade. Sophomore in college. Et cetera.)
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 1)
Although the Tidyverse includes hundreds of commands, the most important are filter()
, select()
, arrange()
, mutate()
, and summarize()
. Whenever you face a new problem, try to think about which one of these commands might be a good way to start.
Let's warm up by examining the gss_cat
tibble from the forcats package. Since forcats is a core tidyverse package, you have already loaded it. Type gss_cat
and hit "Run Code."
...
gss_cat
Whenever we print a tibble, the number of rows and columns is displayed at the top:
A tibble: 21,483 × 9
You can also see the variable type under each of the column names.
Run summary()
on gss_cat
.
summary(...)
summary(gss_cat)
Note that there are missing values in some columns. The word NA
stands for "Not Available" and is used to represent missing data in R.
Pipe gss_cat
to drop_na()
. This function removes rows with missing values.
... |> drop_na()
gss_cat |> drop_na()
Note the number of rows in the tibble after drop_na()
. Since drop_na()
removes rows with missing values, the number of rows in the tibble will be smaller than the original number of rows.
A tibble: 11,299 × 9
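The behavior of drop_na() is easier to see on a tiny example. The tibble below is made up for illustration (the name toy and its columns are not from the tutorial); two of its four rows contain a missing value.

```r
library(tidyverse)

# A small, made-up tibble with a missing value in rows 2 and 3
toy <- tibble(
  id    = c(1, 2, 3, 4),
  hours = c(2, NA, 5, 1),
  group = c("a", "b", NA, "a")
)

# drop_na() keeps only the rows with no NA in any column
toy_complete <- toy |> drop_na()

nrow(toy)
nrow(toy_complete)
```

A row is dropped if any one of its columns is missing, which is why a data set with many columns can lose a large share of its rows.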
Run ?forcats::gss_cat
in the Console. This should work even if you have not loaded the forcats package. The double colon --- ::
--- notation allows us to access the inside of a package even if we have not loaded it.
Copy/paste the Description.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
Throughout these tutorials, we use a consistent set of conventions to refer to code:
Functions are displayed in a code font and followed by parentheses, like sum()
or mean()
.
Other R objects (such as data or function arguments) are in a code font, without parentheses, like flights
or x
.
Sometimes, to make it clear which package an object comes from, we’ll use the package name followed by two colons, like dplyr::mutate()
or nycflights13::flights
. This is also valid R code.
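As a sketch of the double-colon form in use, the call below reaches into dplyr and forcats without a library() call. It assumes both packages are installed; the name sampled is made up for the example.

```r
# No library() call is needed with the package::object form,
# as long as the packages themselves are installed
sampled <- dplyr::slice_sample(forcats::gss_cat, n = 3)

sampled
```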
Recall that the +
sign is used to "chain" different pieces of plot creation code together. When doing data analysis, we use the "pipe" symbol, |>
, to do the same thing between different pieces of code which manipulate the data.
As a simple example, "pipe" the gss_cat
tibble to the print()
command.
... |> print()
gss_cat |> print()
Note the language. We write "pipe this to that." That is, we pipe the gss_cat
tibble to the print()
command. This accomplishes the same effect as simply running print(gss_cat)
, but allows us to string together several commands in a row.
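The equivalence between nested calls and piped calls can be checked directly. This sketch uses a plain numeric vector rather than a tibble to keep it small; the names values, nested, and piped are made up. The native pipe requires R 4.1 or later.

```r
values <- c(1, 4, 9)

# Nested calls: read from the inside out
nested <- sum(sqrt(values))

# Piped calls: read from left to right
piped <- values |> sqrt() |> sum()

# Both forms compute the same result
identical(nested, piped)
```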
Pipe gss_cat
to filter()
. Within filter()
, use the argument age > 88
.
gss_cat |> ...(age > 88)
gss_cat |> filter(age > 88)
This workflow --- in which we pipe a tibble to a function, which then outputs another tibble, which we can then pipe to another function, and so on --- is very common in R programming.
The resulting tibble has the same number of columns --- filter()
only affects the rows --- as gss_cat
but many fewer rows, because there are only 150 people in the data older than 88.
Continue the pipe with select()
, using the arguments age, marital, race, relig, tvhours
.
... |> select(age, ..., race, ..., tvhours)
gss_cat |> filter(age > 88) |> select(age, marital, race, relig, tvhours)
Note how the Hint only gives the most recent line of the pipe. Because select()
does not affect the rows, we have the same number as after filter()
. But we only have 5 columns now, consistent with what we told select()
to do.
Copy the previous code. Continue the pipe with summary()
.
... |> summary()
gss_cat |> filter(age > 88) |> select(age, marital, race, relig, tvhours) |> summary()
Note that there are missing values in the tvhours
column. Let's remove them.
Copy the previous code. Replace summary()
with drop_na()
.
... |> drop_na()
gss_cat |> filter(age > 88) |> select(age, marital, race, relig, tvhours) |> drop_na()
Note that the number of rows has decreased as we removed rows with missing values.
Continue the pipe with arrange()
, using tvhours
as the argument.
... |> arrange(...)
gss_cat |> filter(age > 88) |> select(age, marital, race, relig, tvhours) |> drop_na() |> arrange(tvhours)
The arrange()
function sorts the rows of a tibble. By default, it sorts in ascending order.
Copy the previous code. Put desc()
around tvhours
to sort in descending order.
... |> arrange(desc(...))
gss_cat |> filter(age > 88) |> select(age, marital, race, relig, tvhours) |> drop_na() |> arrange(desc(tvhours))
Got to respect someone who watches TV 18 hours a day!
Let's make a plot. Copy the previous code, and pipe to ggplot()
. Set aes(x = tvhours, y = age)
.
... |> ggplot(mapping = aes(x = ..., y = ...))
gss_cat |> filter(age > 88) |> select(age, marital, race, relig, tvhours) |> drop_na() |> arrange(desc(tvhours)) |> ggplot(aes(x = tvhours, y = age))
Note that this returns a blank plot with labeled axes. We have mapped variables to the x- and y-axes, but we have not yet added a geom to draw the data.
Add another layer with geom_point()
using the +
sign.
... + geom_point()
gss_cat |> filter(age > 88) |> select(age, marital, race, relig, tvhours) |> drop_na() |> arrange(desc(tvhours)) |> ggplot(aes(x = tvhours, y = age)) + geom_point()
This is a scatterplot of tvhours
versus age
. The x-axis is the number of hours of TV watched per day, and the y-axis is the age of the person.
Let's rescale the y
axis. Add scale_y_continuous(breaks = c(89), limits = c(89, 89))
to the code.
gss_cat |> filter(age > 88) |> select(age, marital, race, relig, tvhours) |> drop_na() |> arrange(desc(tvhours)) |> ggplot(aes(x = tvhours, y = age)) + geom_point() + scale_y_continuous(breaks = c(89), limits = c(89, 89))
By looking at the graph, we can see that most people watch TV for less than 10 hours a day. However, there is one person who watches TV for 18 hours a day.
Finally, add a title, a subtitle, and labels for the x- and y-axes using labs()
. Remember this is what your graph should look like.
gss_cat |> filter(age > 88) |> select(age, marital, race, relig, tvhours) |> drop_na() |> arrange(desc(tvhours)) |> ggplot(aes(x = tvhours, y = age)) + geom_point() + scale_y_continuous(breaks = c(89), limits = c(89, 89)) + labs(title = "TV Hours Watched by Age", subtitle = "Got to respect someone who watches TV 18 hours a day!", x = "TV Hours", y = "Age")
... + labs(title = "...", subtitle = "...", x = "...", y = "...")
gss_cat |> filter(age > 88) |> select(age, marital, race, relig, tvhours) |> drop_na() |> arrange(desc(tvhours)) |> ggplot(aes(x = tvhours, y = age)) + geom_point() + scale_y_continuous(breaks = c(89), limits = c(89, 89)) + labs(title = "TV Hours Watched by Age", subtitle = "Got to respect someone who watches TV 18 hours a day!", x = "TV Hours", y = "Age")
Note that the code in the code block is not saved. If you want to save the code, you can copy/paste it into an R script file.
At the top of your RStudio window, click on "File" and then "New File." Choose "R Script." Save the script as analysis.R
.
In the Console, run:
list.files(pattern = "analysis")
CP/PR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
Do not worry if this fails. Directory locations are tricky.
Copy/paste the code from the plot into the script. Hit "Run" on the script. This will return the plot in the Plots window.
In the Console, run:
show_file("analysis.R")
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
This tutorial covered the Introduction from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. You learned how to work with data sets using the R packages dplyr and ggplot2, how to direct the result of one function to another using the pipe, |>
, how to make a professional plot using the ggplot()
function, and how to store the plot-creation code in an R script.