Introduction to R

library(learnr)
library(tutorial.helpers)
library(knitr)
library(tidyverse)

# The df_print: default above is the incantation which shows tibbles normally,
# without the fancy formatting associated with the pillar package.

knitr::opts_chunk$set(echo = FALSE)
knitr::opts_chunk$set(out.width = '90%')
options(tutorial.exercise.timelimit = 60, 
        tutorial.storage = "local") 


Introduction

This tutorial introduces you to the R language. Our approach is inspired by R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. You will learn how to work with data sets using the tidyverse meta-package. You will learn how to direct the result of one function to another using the pipe, |>, and how to make a plot using the ggplot() function.

This tutorial assumes that you have already completed the "Getting Started" tutorial in the tutorial.helpers package. If you haven't, do so now. It is quick!

From the main Positron menu, start a new window with File -> New Window. This new window is the location in which you will do all the work for the tutorial. The current window, the one in which you are reading these words, is just used to run this tutorial.

Working with data

Learn how to explore a data set using functions like summary(), glimpse(), and slice_sample().

Exercise 1

Before you start doing data science, you must load the packages you are going to use. Use the function library() to load the tidyverse package. Click "Run Code." The check mark which appears next to "Exercise 1" above indicates that you have submitted your answer. It doesn't verify that you have answered the question correctly.


library(...)
library(tidyverse)

"Library" and "package" mean the same thing in R. We have different words for historical reasons. However, only the library() command will load a package/library, giving us access to the functions and data which it contains.

Exercise 2

In this tutorial, you will sometimes enter code into the exercise blocks, as you did above. But we will also ask you to run code in the Console. (You will do this in the other Positron window, since the Console in this window is currently busy running this tutorial.) Example:

In the Console, run library(tidyverse).

With Console questions, we will usually ask you to Copy/Paste the Command/Response into an answer block, like the one below. We usually shorten those instructions as CP/CR. Do that now.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 15)

Your answer should look like:

> library(tidyverse)
── Attaching core tidyverse packages ─────────────────────────────────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package to force all conflicts to become errors
>

Your answer never needs to match ours perfectly. Our goal is just to ensure that you are actually following the instructions.

Exercise 3

Data frames, also referred to as "tibbles," are spreadsheet-type data sets.

In the Console, run diamonds.

CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 15)

diamonds

Whenever we show outputs like this after a question, then we are showing our answer to the previous question, even if we do not label it as such.

Exercise 4

In the Console, run summary() on diamonds.

CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 5)

summary(diamonds)

This function provides a quick statistical overview of each variable in the data set. In some cases, as here, the tutorial displays the same object differently from what you were able to copy/paste. And that is OK! Your answer does not need to match our answer.

Exercise 5

In the Console, run slice_sample() on diamonds. This selects a random row from the data set.

CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 7)

slice_sample(diamonds)

Your answer will differ from this answer because of the inherent randomness in functions like slice_sample().
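If you ever need the same "random" row twice, one common approach is to set R's random seed before each draw. The following is a minimal sketch, assuming the tidyverse is installed; the seed value 42 and the names first_draw and second_draw are arbitrary choices for illustration.

```r
library(tidyverse)

# Setting the seed makes the random draw repeatable.
set.seed(42)
first_draw <- slice_sample(diamonds)

# Resetting the same seed reproduces the same row.
set.seed(42)
second_draw <- slice_sample(diamonds)

# Compare the two draws.
identical(first_draw, second_draw)
```

Without the calls to set.seed(), the two draws would almost always differ.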

Exercise 6

In the Console, hit the Up Arrow to retrieve the previous command. Edit it to add the argument n = 4 to slice_sample(diamonds). This will return 4 random rows from the diamonds data set.

CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 8)

slice_sample(diamonds, n = 4)

Editing code directly in the Console quickly becomes annoying. See the positron.tutorials package for tutorials about using Positron to write and organize your code.

Exercise 7

In the Console, run print() on diamonds. This returns the same result as typing diamonds.

CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 5)

print(diamonds)

You can choose how many rows to display by using the n argument in the print() function, and how many columns to display by using the width argument.
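As a rough illustration of the two arguments together, the sketch below asks for 5 rows and limits the output to 50 characters across. The specific values are arbitrary choices, not recommendations.

```r
library(tidyverse)

# Show the first 5 rows, using at most 50 characters of width.
print(diamonds, n = 5, width = 50)
```

Setting width = Inf instead should display every column, however wide the result.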

Exercise 8

In the Console, run print() on diamonds with the argument n = 3. This returns the first 3 rows of the diamonds data set.

CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 5)

print(diamonds, n = 3)

print(), by default, gives the top of the tibble, so your answer should match our answer. slice_sample(), on the other hand, picks random rows to return. But, in both cases, the result is a tibble.

A central organizing principle of the Tidyverse is that most functions take a tibble as their first argument and return a tibble. This allows us to "chain" commands together, one after the other.

Exercise 9

In the Console, run ?diamonds.

This will look up the help page for the diamonds tibble from the ggplot2 package, which is one of the core packages in the Tidyverse. The help page will appear on the right side of your Positron window, in the Secondary Activity Bar, which you might need to activate in order to see.

Copy/paste the Description section of the help page below.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 5)

You can find help about an entire package with help(package = "ggplot2"). It is confusing, but unavoidable, that package names are sometimes unquoted, as in library(ggplot2), and sometimes quoted, as in help(package = "ggplot2"). If one does not work, try the other.

Exercise 10

In the Console, run glimpse() on diamonds. CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 5)

glimpse(diamonds, width = 60)

glimpse() displays columns running down the page and the data running across. Note how the "type" of each variable is listed next to the variable name. For example, price is listed as <int>, meaning that it is an integer variable. To learn more about the glimpse() function, run ?glimpse.
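The type abbreviations which glimpse() prints correspond to R's underlying column types. Here is a small sketch for checking them directly, assuming the tidyverse is loaded. The $ operator, which extracts a single column, is base R and is not otherwise covered in this tutorial.

```r
library(tidyverse)

# price holds whole numbers, which glimpse() labels <int>.
class(diamonds$price)

# carat holds decimal numbers, which glimpse() labels <dbl>.
class(diamonds$carat)
```

The first call reports "integer" and the second reports "numeric", which is R's name for double-precision numbers.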

view() is another useful function, but, because it is interactive, we should not use it within a tutorial.

Exercise 11

In the Console, run sqrt(144).

CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 5)

sqrt(144)

The square root function is one of many built-in functions in R. Most return their result, which R then, by default, prints out.

Exercise 12

In the Console, run x <- sqrt(144).

CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 5)

x <- sqrt(144)

The symbol <- is the assignment operator. In this case, we are assigning the value of sqrt(144) to the variable x. Nothing is printed out because of that assignment.

Also, you can see x in the "Variables" tab under the "Session" pane in the Secondary Activity Bar on the right-hand side of the Positron window.

Exercise 13

In the Console, run x.

CP/CR.

question_text(NULL,
    answer(NULL, correct = TRUE),
    allow_retry = TRUE,
    try_again_button = "Edit Answer",
    incorrect = NULL,
    rows = 5)

x

Now that x has been defined in the Console, it is available for your use. Above, we just print it out. But we could also use it in other calculations, e.g., x + 5.
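For instance, a stored value can feed a new calculation, and that result can be stored in turn. This is only a sketch; y is a hypothetical variable name chosen for illustration.

```r
# Store the square root of 144 in x.
x <- sqrt(144)

# Use x in a further calculation and store that result as well.
y <- x + 5

# Typing a variable's name prints its value.
y
```

Since sqrt(144) is 12, y holds 17.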

Pipes and plots

Although the Tidyverse includes hundreds of commands for data manipulation, the most important are filter(), select(), arrange(), mutate(), and summarize().

Exercise 1

Let's warm up by examining the gss_cat tibble from the forcats package. Since forcats is a core tidyverse package, you have already loaded it. Type gss_cat and hit "Run Code."

Instead of using the Console, we will be doing the exercises in this section using exercise blocks.


...
gss_cat

As the help page notes, gss_cat is a "sample of categorical variables from the General Social Survey."

Exercise 2

Run summary() on gss_cat.


summary(...)
summary(gss_cat)

Note that there are missing values in some columns. The word NA stands for "Not Available" and is used to represent missing data in R.

Exercise 3

Pipe gss_cat to drop_na(). This function removes rows with missing values. The pipe symbol, |>, allows us to chain R commands together, one after the other, with each one connected to the next with the pipe symbol. In this case, we want:

gss_cat |> 
  drop_na()

... |> 
  drop_na()
gss_cat |> 
  drop_na()

Note the number of rows in the tibble after drop_na(). Since drop_na() removes rows with missing values, the tibble has fewer rows than the original.

We could achieve the same result by running drop_na(gss_cat). The symbol |> just "pipes" gss_cat into drop_na() as its first argument.
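One way to convince yourself that the two forms are equivalent is to store both results and compare them. This is a minimal sketch; piped and nested are hypothetical variable names used only here.

```r
library(tidyverse)

# The piped form: gss_cat becomes the first argument of drop_na().
piped <- gss_cat |>
  drop_na()

# The nested form: gss_cat is passed as the first argument directly.
nested <- drop_na(gss_cat)

# Check that the two results are exactly the same.
identical(piped, nested)
```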

Exercise 4

Pipe gss_cat to filter(). Within filter(), use the argument year == 2014.


gss_cat |> 
  ...(year == 2014)
gss_cat |> 
  filter(year == 2014)

This workflow --- in which we pipe a tibble to a function, which then outputs another tibble, which we can then pipe to another function, and so on --- is very common in R programming.

The resulting tibble has the same number of columns as gss_cat because filter() only affects the rows. But there are many fewer rows.

Exercise 5

Continue the code and pipe with select(), using the argument age, marital, race, relig, tvhours. Note that you do not need to retype the code from the last exercise. You can just click the "Copy Code" button.


... |> 
  select(age, ..., race, ..., tvhours)
gss_cat |> 
  filter(year == 2014) |> 
  select(age, marital, race, relig, tvhours)

Note how the Hint only gives the most recent line of the pipe. Because select() does not affect the rows, we have the same number as after filter(). But we only have 5 columns now, consistent with what we told select() to do.

Exercise 6

Copy the previous code. Continue the pipe with summary().


... |> 
  summary()
gss_cat |> 
  filter(year == 2014) |> 
  select(age, marital, race, relig, tvhours) |>
  summary()

Note that there are missing values in the tvhours column. Let's remove them.

Exercise 7

Copy the previous code. Replace summary() with drop_na().


... |> 
  drop_na()
gss_cat |> 
  filter(year == 2014) |> 
  select(age, marital, race, relig, tvhours) |>
  drop_na()

The number of rows has decreased because we removed rows with missing values. drop_na() removes all rows which have a missing value for any of the variables. If we wanted to just remove the rows which are missing tvhours, we would use drop_na(tvhours).
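The difference between the two calls can be seen by counting rows. The sketch below assumes the tidyverse is loaded; gss_2014, any_missing, and tv_missing are hypothetical names introduced for illustration.

```r
library(tidyverse)

gss_2014 <- gss_cat |>
  filter(year == 2014) |>
  select(age, marital, race, relig, tvhours)

# Drop rows with a missing value in any column.
any_missing <- gss_2014 |>
  drop_na()

# Drop only the rows in which tvhours is missing.
tv_missing <- gss_2014 |>
  drop_na(tvhours)

# Compare how many rows survive each version.
nrow(any_missing)
nrow(tv_missing)
```

The second count can never be smaller than the first, because drop_na(tvhours) ignores missing values in the other columns.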

Exercise 8

Continue the pipe with arrange(), using tvhours as the argument.


... |> 
  arrange(...)
gss_cat |> 
  filter(year == 2014) |> 
  select(age, marital, race, relig, tvhours) |>
  drop_na() |>
  arrange(tvhours)

The arrange() function sorts the rows of a tibble. By default, it sorts in ascending order.

Exercise 9

Copy the previous code. Put desc() around tvhours to sort in descending order.


... |> 
  arrange(desc(...))
gss_cat |> 
  filter(year == 2014) |> 
  select(age, marital, race, relig, tvhours) |>
  drop_na() |>
  arrange(desc(tvhours))

Got to respect someone who watches TV 24 hours a day!
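The pipeline so far uses filter(), select(), and arrange(). The other two verbs named at the start of this section, mutate() and summarize(), follow the same tibble-in, tibble-out pattern. Below is a minimal sketch rather than part of the exercises; tv_summary, tv_minutes, and mean_tv_minutes are hypothetical names chosen for illustration.

```r
library(tidyverse)

tv_summary <- gss_cat |>
  filter(year == 2014) |>
  select(age, tvhours) |>
  drop_na() |>
  # mutate() adds a new column computed from existing ones.
  mutate(tv_minutes = tvhours * 60) |>
  # summarize() collapses all rows into a single row of summary values.
  summarize(mean_tv_minutes = mean(tv_minutes))

tv_summary
```

The result is a tibble with one row and one column, so it could itself be piped to another function.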

Exercise 10

Let's make a plot. Copy the previous code, and pipe to ggplot(). Set aes(x = age, y = tvhours).


... |> 
  ggplot(aes(x = ..., y = ...))
gss_cat |> 
  filter(year == 2014) |> 
  select(age, marital, race, relig, tvhours) |>
  drop_na() |>
  arrange(desc(tvhours)) |>
  ggplot(aes(x = age, y = tvhours))

This returns a blank plot. The x and y aesthetics are mapped, so the axes appear, but we have not yet added a geom layer to draw the data.

Exercise 11

Add another layer with geom_jitter() using the + sign. Plotting code in the ggplot2 package uses +, not |>, to connect different commands together. This difference comes from the fact that ggplot2 was written 10+ years before the pipe was invented.


... + 
  geom_jitter()
gss_cat |> 
  filter(year == 2014) |> 
  select(age, marital, race, relig, tvhours) |>
  drop_na() |>
  arrange(desc(tvhours)) |>
  ggplot(aes(x = age, y = tvhours)) + 
    geom_jitter()

This is a scatterplot of age versus tvhours. The x-axis is age, and the y-axis is the number of hours of TV watched per day.
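To see how |> and + divide the work, it can help to build the plot in two steps. This sketch assumes the tidyverse is loaded; base_plot and jitter_plot are hypothetical names, and the pipeline is shortened for illustration.

```r
library(tidyverse)

# Use |> to send data into ggplot(); this alone draws axes but no points.
base_plot <- gss_cat |>
  drop_na(age, tvhours) |>
  ggplot(aes(x = age, y = tvhours))

# Use + to add a layer to an existing plot.
jitter_plot <- base_plot + geom_jitter()

jitter_plot
```

Everything before ggplot() is data manipulation and uses the pipe. Everything after it builds up the plot and uses +.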

Exercise 12

Finally, add a title, a subtitle, and labels for the x and y axes using labs(). The subtitle should be the one sentence of information about the graph with which you would hope a reader walks away. What is the most important fact demonstrated in the graphic?

Consider this example graph:

gss_cat |> 
  filter(year == 2014) |> 
  select(age, marital, race, relig, tvhours) |>
  drop_na() |>
  arrange(desc(tvhours)) |>
  ggplot(aes(x = age, y = tvhours)) + 
  geom_jitter() + 
  labs(title = "TV Hours Watched by Age", 
       subtitle = "Got to respect someone who watches TV 24 hours a day!", 
       x = "Age", 
       y = "TV Hours")

You can make yours look like ours, if you like.


... + 
  labs(title = "...", 
       subtitle = "...", 
       x = "...", 
       y = "...")
gss_cat |> 
  filter(year == 2014) |> 
  select(age, marital, race, relig, tvhours) |>
  drop_na() |>
  arrange(desc(tvhours)) |>
  ggplot(aes(x = age, y = tvhours)) + 
  geom_jitter() + 
  labs(title = "TV Hours Watched by Age", 
       subtitle = "Got to respect someone who watches TV 24 hours a day!", 
       x = "Age", 
       y = "TV Hours")

Note that the code in the exercise block is not saved. If you want to save the code, you can copy/paste it into an R script file.

Summary

This tutorial introduced you to the R language. Our approach was inspired by R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. You learned how to work with data sets using the tidyverse meta-package. You learned how to direct the result of one function to another using the pipe, |>, and how to make a plot using the ggplot() function.





tutorial.helpers documentation built on Sept. 11, 2025, 9:09 a.m.