In NUstat/ISDStutorials: Tutorial Lessons for Introduction to Statistics and Data Science

library(learnr)
library(tidyverse)
library(nycflights13)
library(tutorialExtras)
library(gradethis)
library(tutorial.helpers)
library(ggcheck)

gradethis_setup()
knitr::opts_chunk$set(echo = FALSE)
options(
  tutorial.exercise.timelimit = 60
  #tutorial.storage = "local"
  )

grade_server("grade")

question_text("Name:",
  answer_fn(function(value) {
    if (length(value) >= 1) {
      return(mark_as(TRUE))
    }
    return(mark_as(FALSE))
  }),
  correct = "submitted",
  allow_retry = FALSE
)

Instructions

Complete this tutorial while reading Sections 3.0 - 3.3 of the textbook. Each question allows 3 'free' attempts. After the third attempt, each additional attempt incurs a 10% deduction.

You can check your current grade and the number of attempts you have used in the "View grade" section. You can click this button as often as you like as you progress through the tutorial. Before submitting, make sure your grade is as expected.

Goals

Understand and execute the basics of data wrangling.
Know how to use the pipe operator.
Use the filter() and summarize() functions.
Understand how to handle missing values.

The pipe operator: %>%

The pipe operator %>% allows you to chain multiple data wrangling functions, each named after a verb, into a single sequence of actions.

Say you have a data frame x and would like to apply 3 functions f(), then g(), and finally h().

One way to achieve this is

h(g(f(x)))

However, this can get messy and difficult to read. Instead we can use the pipe operator to chain the sequence of events together.

x %>% 
  f() %>% 
  g() %>% 
  h()

You would read the sequence above as:

Take x then
Use this output as the input to the next function f() then
Use this output as the input to the next function g() then
Use this output as the input to the next function h()
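
As a concrete illustration of the equivalence described above, the sketch below swaps the placeholder functions f(), g(), and h() for three base R functions: sqrt(), mean(), and round(). That choice, and the sample vector x, are assumptions made only for this example.

```r
library(dplyr)  # makes the %>% pipe operator available

x <- c(1, 4, 9)

# Nested form: read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped form: read top to bottom as "take x, then ..., then ..."
piped <- x %>%
  sqrt() %>%
  mean() %>%
  round(1)

# Both forms perform the same computation
identical(nested, piped)
```

Note that extra arguments, such as the 1 passed to round(), go inside the parentheses; the piped value is supplied as the first argument.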

`filter()` rows

The filter() function lets you specify criteria about the values of a variable in your dataset and keeps only the rows that match those criteria.

Exercise 1

We will begin by focusing only on flights from New York City to Portland, Oregon. But before we do that, we need to load the needed packages.

Load the dplyr package followed by the nycflights13 package by using the library() command.

library(...)
library(...)

library(dplyr)
library(nycflights13)

grade_this_code()

Exercise 2

Pipe flights to filter(dest == "PDX").

flights %>% 
  filter(...)

flights %>%
  filter(dest == "PDX")

grade_this_code()

The easiest way to pronounce the pipe is “then”.

The pipe takes the flights dataset and "then" filters it to only contain observations where the destination is equal to "PDX".

filter() changes which rows are present without changing their order.

Note that only `r scales::comma(nrow(nycflights13::flights %>% filter(dest == "PDX")))` rows remain after we filter for flights to Portland. Why do we only see 1,000 rows here? Because Quarto, by default, only keeps 1,000 rows for display purposes.
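
To see on a small scale that filter() keeps only the matching rows and leaves them in their original order, here is a minimal sketch. The data frame toy_flights and its values are hypothetical stand-ins for flights, invented for this illustration.

```r
library(dplyr)

# Hypothetical toy data standing in for the flights data frame
toy_flights <- tibble(
  flight = c(101, 102, 103, 104, 105),
  dest   = c("PDX", "SEA", "PDX", "LAX", "PDX")
)

# Keep only the rows whose destination is "PDX"
to_portland <- toy_flights %>%
  filter(dest == "PDX")

# The surviving flights appear in the same order as in toy_flights
to_portland$flight
```

Three of the five rows remain, and toy_flights itself is unchanged because filter() returns a new data frame.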

Exercise 3

In the last exercise, our filtered dataset was printed but not saved. If we want to use the dataset in the future, it is useful to store the wrangled dataset as a new object.

Use the assignment arrow <- to assign this new dataset the name portland_flights.

In other words type portland_flights <- before flights.

... <- flights %>% 
  filter(dest == "PDX")

portland_flights <- flights %>%
  filter(dest == "PDX")

grade_this_code()

If you run this code in RStudio you will see portland_flights appear in the Environment pane.

Exercise 4

We test for equality using the double equal sign == and not a single equal sign =.
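
The following sketch shows what happens with each sign, using a hypothetical three-row data frame invented for this example. With a single =, filter() treats the input as a named argument and stops with an error; tryCatch() is used here only so the example keeps running.

```r
library(dplyr)

# Hypothetical toy data for illustration
toy_flights <- tibble(dest = c("PDX", "SEA", "PDX"))

# Correct: == compares each value of dest with "PDX"
n_match <- toy_flights %>%
  filter(dest == "PDX") %>%
  nrow()

# Incorrect: a single = is read as naming an argument, so filter() errors
single_equals <- tryCatch(
  toy_flights %>% filter(dest = "PDX"),
  error = function(e) "error"
)

n_match
single_equals
```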

question_wordbank("Match the following definitions with their mathematical operators.",
        choices = c("equal to",
                    "greater than",
                    "and",
                    "less than or equal to",
                    "not equal to",
                    "or"),
        wordbank = c(">", "&", "<=", "!=", ">=", "|", "<", "=", "=="),
        answer(c("==",">", "&", "<=", "!=", "|"), 
        correct = TRUE), 
        allow_retry = TRUE )

`summarize()` variables

The next common task when working with data is to return summary statistics: a single numerical value that summarizes a large number of values, for example the mean/average or the median.

Exercise 1

Let’s calculate the mean and the standard deviation of the temperature variable temp in the weather data frame included in the nycflights13 package.

Pipe weather to summarize(mean(temp)).

weather %>% 
  summarize(mean(temp))

weather %>% 
  summarize(mean(temp))

grade_this_code()

Notice two things.

1) The new variable is named `mean(temp)`.
2) The result is NA.
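
Both observations can be reproduced on a small scale. The sketch below uses a hypothetical data frame, toy_weather, with a single missing temperature; the values are invented for this illustration.

```r
library(dplyr)

# Hypothetical toy data with one missing temperature
toy_weather <- tibble(temp = c(40, NA, 50, 60))

result <- toy_weather %>%
  summarize(mean(temp))

# The column is named after the expression that produced it
names(result)

# A single missing value is enough to make the mean NA
is.na(result[[1]])
```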

Exercise 2

Let's start by fixing the variable name. You should always name your new variable; otherwise the output gets very messy.

Let's assign the calculation to the name mean_temp by instead typing summarize(mean_temp = mean(temp))

weather %>% 
  summarize(... = mean(temp))

weather %>% 
  summarize(mean_temp = mean(temp))

grade_this_code()

Much better! Now let's fix our NA issue.

Exercise 3

NA is how R encodes missing values; it stands for “not available” or “not applicable.”

We can work around this by "removing" the NA values by setting na.rm = TRUE within the mean() function as follows: mean(temp, na.rm = TRUE).

weather %>% 
  summarize(mean_temp = mean(temp, ...))

weather %>% 
  summarize(mean_temp = mean(temp, na.rm = TRUE))

grade_this_code()

The average temperature is 55.3 degrees Fahrenheit.
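
As a small illustration of what na.rm = TRUE does, the sketch below compares the mean with and without it and counts the missing values. The data frame toy_weather and the column name n_missing are hypothetical, chosen only for this example.

```r
library(dplyr)

# Hypothetical toy data with one missing temperature
toy_weather <- tibble(temp = c(40, NA, 50, 60))

toy_summary <- toy_weather %>%
  summarize(
    mean_with_na = mean(temp),                # NA: the missing value propagates
    mean_temp    = mean(temp, na.rm = TRUE),  # NA dropped: (40 + 50 + 60) / 3
    n_missing    = sum(is.na(temp))           # how many values were dropped
  )

toy_summary
```

Counting the missing values alongside the mean is one way to check how much data na.rm = TRUE is silently discarding.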

Exercise 4

We also wanted to calculate the standard deviation.

Within the same summarize function, add a comma (,) after mean(temp, na.rm = TRUE) and then add std_dev = sd(temp, na.rm = TRUE).

weather %>% 
  summarize(mean_temp = mean(temp, na.rm = TRUE),
            ... = ...)

weather %>% 
  summarize(mean_temp = mean(temp, na.rm = TRUE),
            std_dev = sd(temp, na.rm = TRUE))

grade_this_code()

The name of the new variable you calculated is std_dev and the function that calculated the standard deviation was sd().

It is good practice to always run your code after each step or function to ensure there are no errors and to check the output.
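
The same pattern can be checked by hand on a few values. In this sketch the data frame toy_weather is hypothetical, with temperatures picked so the mean and standard deviation are easy to verify.

```r
library(dplyr)

# Hypothetical toy data: the non-missing values are 40, 50, and 60
toy_weather <- tibble(temp = c(40, NA, 50, 60))

toy_stats <- toy_weather %>%
  summarize(mean_temp = mean(temp, na.rm = TRUE),
            std_dev = sd(temp, na.rm = TRUE))

# One row, with one column per named summary
dim(toy_stats)
toy_stats
```

Here the mean is 50 and the standard deviation is 10, since the squared deviations 100, 0, and 100 sum to 200, and 200 / (3 - 1) = 100.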

Exercise 5

Notice in the previous exercise the summary statistics were being directly printed out.

This is because we did not store the results. Copy the previous code and type summary_temp <- before weather.

... <- weather %>% 
  summarize(mean_temp = mean(temp, na.rm = TRUE),
            std_dev = sd(temp, na.rm = TRUE))

summary_temp <- weather %>% 
  summarize(mean_temp = mean(temp, na.rm = TRUE),
            std_dev = sd(temp, na.rm = TRUE))

grade_this_code()

Now there is no output because the results have been stored in a new data frame called summary_temp. If you run this code in RStudio this new object would appear in the Environment pane.

Exercise 6

Now we can print/use this new data frame by typing summary_temp.

Copy the previous code and on the next line type summary_temp.

summary_temp <- weather %>% 
  summarize(mean_temp = mean(temp, na.rm = TRUE),
            std_dev = sd(temp, na.rm = TRUE))
...

summary_temp <- weather %>% 
  summarize(mean_temp = mean(temp, na.rm = TRUE),
            std_dev = sd(temp, na.rm = TRUE))
summary_temp

grade_this_code()
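
The difference between storing and printing can be inspected directly. The sketch below relies on base R's withVisible(), which reports whether an expression's value would be printed; the data frame toy_weather and the object name toy_summary are hypothetical.

```r
library(dplyr)

# Hypothetical toy data for illustration
toy_weather <- tibble(temp = c(40, NA, 50, 60))

# Assignment returns its value invisibly, so nothing is printed
assignment <- withVisible(
  toy_summary <- toy_weather %>%
    summarize(mean_temp = mean(temp, na.rm = TRUE))
)
assignment$visible

# Typing the name on its own line prints the stored data frame
toy_summary

# The stored result can also be reused, for example by extracting a column
toy_summary$mean_temp
```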

Exercise 7

There are a variety of different functions you could also use within summarize().

question_wordbank("Match the following definitions with their functions.",
        choices = c("interquartile range",
                    "mean/average",
                    "minimum",
                    "number of observations"),
        wordbank = c("max()", "IQR()", "mean()", "min()", "n()", "sd()", "iqr()", "avg()", "count()"),
        answer(c("IQR()", "mean()", "min()", "n()"), 
        correct = TRUE), 
        allow_retry = TRUE )

View grade

grade_button_ui(id = "grade")

Submit

Once you are finished:

Click the 'Download Grade' button below. This will download an html document of your grade summary.
Make sure your grade is correct and as expected!
Submit the downloaded html to Canvas.

grade_print_ui("grade")

NUstat/ISDStutorials documentation built on April 17, 2025, 6:15 p.m.
