Summarise Tables

library(learnr)
library(tidyverse)
library(nycflights13)
library(Lahman)

tutorial_options(
  exercise.timelimit = 60,
  # A simple checker function that just returns the message in the check chunk
  exercise.checker = function(check_code, ...) {
    list(
      message = eval(parse(text = check_code)),
      correct = logical(0),
      type = "info",
      location = "append"
    )
  }
)
knitr::opts_chunk$set(error = TRUE)

Welcome

In this tutorial, you will learn how to summarise a table of data, including how to summarise groups with group_by() and summarise(), and how to combine multiple operations with the pipe, %>%.

The readings in this tutorial follow R for Data Science, section 5.6.

Setup

To practice these skills, we will use the flights data set from the nycflights13 package, which you met in Data Basics. This data frame comes from the US Bureau of Transportation Statistics and contains all 336,776 flights that departed from New York City in 2013. It is documented in ?flights.

To visualize the data, we will use the ggplot2 package that you met in Data Visualization Basics.

I've preloaded the packages for this tutorial with

library(tidyverse) # loads dplyr, ggplot2, and others
library(nycflights13)

Summarise groups with summarise()

summarise()

summarise() collapses a data frame to a single row of summaries. You get to choose how many summaries appear in the row and how they are computed:

summarise(flights, delay = mean(dep_delay, na.rm = TRUE),
                  total =  sum(dep_delay, na.rm = TRUE))

(We'll come back to what that na.rm = TRUE means very shortly.)

Notice that the syntax of summarise() is similar to mutate(). As with mutate(), you give summarise():

  1. The name of a data frame to transform
  2. One or more column names to appear in the transformed output. Each column name is set equal to the R expression that will generate the content of the column.

The main difference between summarise() and mutate() is the type of function that you use to generate the new columns. mutate() takes functions that return an entire vector of output (to append to the original data frame). summarise() takes functions that return a single value (or summary). These values will appear in a new data frame that has only one row.
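
To make the contrast concrete, here is a minimal sketch on a made-up three-row tibble (the data frame `df` is invented for illustration, not part of flights):

```r
library(dplyr)

df <- tibble(x = c(1, 2, 3))

# mutate() appends a full-length column: the single sum is recycled across all 3 rows
mutate(df, total = sum(x))

# summarise() collapses the data frame to one row containing just the summary
summarise(df, total = sum(x))
```

The same expression, `sum(x)`, drives both; only the shape of the result differs.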

group_by()

summarise() is not terribly useful unless you pair it with group_by(). group_by() changes the unit of analysis of the data frame: it assigns observations in the data frame to separate groups, and it instructs dplyr to apply functions separately to each group. group_by() assigns groups by grouping together observations that have the same combinations of values for the variables that you pass to group_by().

For example, the summarise() code above computes the average delay for the entire data set. If we apply exactly the same code to a data set that has been grouped by date (i.e. the unique combinations of year, month, and day), we get the average delay per date. Click "Run Code" to see what I mean:

by_day <- group_by(flights, year, month, day)
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE),
                  total = sum(dep_delay, na.rm = TRUE))
"Good job!"

Exercise 1

Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights %>% group_by(carrier, dest) %>% summarise(n()))


flights %>%
  group_by(carrier) %>%
  summarise(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
  mutate(rank = min_rank(desc(avg_delay))) %>%
  filter(rank == 1)
**Hint:** Use `min_rank(desc(avg_delay))` to rank `avg_delay` (for example) such that the largest delay receives rank one.
"Great work! Frontier Airlines (`F9`) had the highest average departure delay."

Exercise 2

For each plane, count the number of flights before the first delay of greater than 1 hour.


flights %>%
  filter(!is.na(dep_delay)) %>%
  group_by(tailnum) %>%
  mutate(big_delay = dep_delay > 60,
         before = !cumany(big_delay)) %>%
  summarise(sum(before))
**Hint:** One strategy would be to:

* filter out all rows where `dep_delay` is `NA`
* then group by plane
* create a variable that tests whether each flight was delayed longer than an hour
* create a variable that identifies flights that occur before the first big delay with `!cumany()`
* sum up the number of `TRUE`s
"Great work! That was tough. Be sure you understand each of the steps and functions involved."

Grouping by multiple variables

When you group by multiple variables, each summary peels off one level of the grouping. That makes it easy to progressively roll up a dataset. Run the code below and inspect each result to see how its grouping has changed (the grouping is displayed at the top of the tibble).

daily <- group_by(flights, year, month, day)
(per_day <- summarise(daily, total = sum(dep_delay, na.rm = TRUE)))
(per_month <- summarise(per_day, total = sum(total, na.rm = TRUE)))
(per_year  <- summarise(per_month, total = sum(total, na.rm = TRUE)))

Be careful when you progressively roll up summaries: it's OK for sums and counts, but you need to think about weighting means and variances, and it's not possible to do it exactly for rank-based statistics like the median. In other words, the sum of groupwise sums is the overall sum, but the median of groupwise medians is not the overall median.
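
Here is a minimal base-R illustration of that caveat, using made-up numbers:

```r
x <- c(10, 10, 10, 40)        # four delays
g <- c("a", "a", "a", "b")    # two groups of unequal size

group_means <- tapply(x, g, mean)   # a: 10, b: 40
mean(group_means)                   # 25 -- NOT the overall mean
mean(x)                             # 17.5 -- the true overall mean

# weighting each group mean by its group size recovers the overall mean
weighted.mean(group_means, as.numeric(table(g)))  # 17.5
```

Sums and counts avoid this trap because adding up groupwise sums always gives the overall sum.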

Ungrouping

If you need to remove grouping, and return to operations on ungrouped data, use ungroup().

daily <- group_by(flights, year, month, day)
daily <- ungroup(daily) # no longer grouped by date
summarise(daily, total = sum(dep_delay, na.rm = TRUE))  # all flights

Groups and dplyr

group_by() also works with the other dplyr functions; dplyr will apply filter(), select(), arrange(), and mutate() in a groupwise fashion to grouped data. However, group_by() is the most useful when combined with summarise(). Together group_by() and summarise() provide one of the tools that you'll use most commonly when working with dplyr: grouped summaries. But before we go any further with this, we need to introduce a powerful new idea: the pipe.
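
As a quick sketch of groupwise filter() and mutate() (the tibble `df` here is made up for illustration):

```r
library(dplyr)

df <- tibble(g = c("a", "a", "b", "b"), x = c(1, 5, 2, 8))
by_g <- group_by(df, g)

# grouped filter: keeps the row with the largest x *within each group*
filter(by_g, x == max(x))

# grouped mutate: divides each x by its own group's total, not the grand total
mutate(by_g, share = x / sum(x))
```
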

Combining multiple operations

Multiple steps

Imagine that we want to explore the relationship between the distance and average delay for each destination in flights. Using what you know about dplyr, you might write code like this:

by_dest <- group_by(flights, dest)
delay <- summarise(by_dest,
  count = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE)
)
delay <- filter(delay, count > 20, dest != "HNL")

ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
  geom_point(aes(size = count), alpha = 1/3) +
  geom_smooth(se = FALSE)

The code works, and we find an interesting effect: it looks like delays increase with distance up to ~750 miles and then decrease. Maybe as flights get longer there's more ability to make up delays in the air?

Now let's look at how we prepared the data. There are three steps:

  1. Group flights by destination.

  2. Summarise to compute distance, average delay, and number of flights.

  3. Filter to remove noisy points and Honolulu airport, which is almost twice as far away as the next closest airport.

This code is a little frustrating to write because we have to give each intermediate data frame a name, even though we don't care about it. Naming things is hard, so this slows down our analysis.

Pipes

There's another way to tackle the same problem. We can turn the code into a pipe with the pipe operator, %>%:

delays <- flights %>%
  group_by(dest) %>%
  summarise(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  filter(count > 20, dest != "HNL")

Behind the scenes, x %>% f(y) turns into f(x, y), and x %>% f(y) %>% g(z) turns into g(f(x, y), z) and so on. You can use the pipe to rewrite multiple operations in a way that you can read left-to-right, top-to-bottom.
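
For example, applied to plain numeric functions (the vector `x` here is made up for illustration; `%>%` comes from the magrittr package, which the tidyverse attaches for you):

```r
library(magrittr)  # provides %>%; also re-exported by dplyr

x <- c(4, 9, 16)

round(mean(sqrt(x)), 1)               # nested form: read inside-out
x %>% sqrt() %>% mean() %>% round(1)  # piped form: read left-to-right

# both return 3: the pipe is just a different way of writing the same calls
```
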

This focuses on the transformations, not what's being transformed, which makes the code easier to read. You can read it as a series of imperative statements: group, then summarise, then filter. A good way to pronounce %>% when reading code is "then".

We'll use piping frequently from now on because it considerably improves the readability of code, and we'll come back to it in more detail in Pipes.

The pipe is a defining feature of the tidyverse: packages in the tidyverse all contain functions that are designed to work with the pipe. The only exception is ggplot2: it was written before the pipe was discovered. Unfortunately, the next iteration of ggplot2, ggvis, which does use the pipe, isn't quite ready for prime time yet.

Useful summary functions {#summarise-funs}

Aggregating functions

You can get a long way with means and sums, but R provides many other useful functions to use with summarise(). Each of these functions acts as an aggregating function: it takes a vector of values and returns a single value.

Let's demonstrate some of the most useful aggregating functions with this data set, which removes flights that have no delay information (because they were cancelled).

not_cancelled <- flights %>%
  filter(!is.na(dep_delay), !is.na(arr_delay))

Exercise 3

Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. Consider the following scenarios:

Which is more important: arrival delay or departure delay?


**Hint:** Consider R's measures of location and measures of spread.

Missing values

You may have wondered about the na.rm argument we used in a previous section. What happens if we don't set it?

flights %>%
  group_by(year, month, day) %>%
  summarise(mean = mean(dep_delay))

We get a lot of missing values! That's because aggregation functions obey the usual rule of missing values: if there's any missing value in the input, the output will be a missing value. Fortunately, all aggregation functions have an na.rm argument which removes the missing values prior to computation:

flights %>%
  group_by(year, month, day) %>%
  summarise(mean = mean(dep_delay, na.rm = TRUE))
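
You can see the same rule in miniature on a bare vector:

```r
delays <- c(10, NA, 25)

mean(delays)                # NA -- one missing input makes the whole output missing
mean(delays, na.rm = TRUE)  # 17.5 -- the NA is dropped before the mean is computed
```
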

In this case, where missing values represent cancelled flights, we could also tackle the problem by first removing the cancelled flights, as we did to create not_cancelled.

not_cancelled <- flights %>%
  filter(!is.na(dep_delay), !is.na(arr_delay))

not_cancelled %>%
  group_by(year, month, day) %>%
  summarise(mean = mean(dep_delay))

Exercise 4

Our definition of cancelled flights (is.na(dep_delay) | is.na(arr_delay)) is slightly suboptimal. Why? Which is the most important column?

Counts

n()

Whenever you do any aggregation, it's always a good idea to include either a count (n()), or a count of non-missing values (sum(!is.na(x))). That way you can check that you're not drawing conclusions based on very small amounts of data. For example, let's look at the planes (identified by their tail number) that have the highest average delays:

delays <- not_cancelled %>%
  group_by(tailnum) %>%
  summarise(
    delay = mean(arr_delay)
  )

ggplot(data = delays, mapping = aes(x = delay)) +
  geom_freqpoly(binwidth = 10)

Wow, there are some planes that have an average delay of 5 hours (300 minutes)!

The story is actually a little more nuanced. We can get more insight if we draw a scatterplot of number of flights vs. average delay. Fill in the blank code below to compute and then plot the number of flights by the mean arrival delay (arr_delay).

# delays <- not_cancelled %>%
#   group_by(tailnum) %>%
#   summarise(
#     delay = _________,
#     n = n()
#   )
#
# ggplot(data = delays, mapping = aes(x = n, y = delay)) +
#   geom_point(alpha = 1/10)
delays <- not_cancelled %>%
  group_by(tailnum) %>%
  summarise(
    delay = mean(arr_delay),
    n = n()
  )

ggplot(data = delays, mapping = aes(x = n, y = delay)) +
  geom_point(alpha = 1/10)

Not surprisingly, there is much greater variation in the average delay when there are few flights. The shape of this plot is very characteristic: whenever you plot a mean (or other summary) vs. group size, you'll see that the variation decreases as the sample size increases.
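
You can reproduce this funnel shape with a quick simulation (the numbers are made up: a mean delay of 10 minutes with a standard deviation of 40 is just a stand-in):

```r
set.seed(42)

# 200 simulated "planes" at each of three group sizes
sizes <- rep(c(5, 50, 500), each = 200)
avg_delay <- sapply(sizes, function(n) mean(rnorm(n, mean = 10, sd = 40)))

# the spread of the group means shrinks roughly like sd / sqrt(n)
tapply(avg_delay, sizes, sd)
```
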

Accounting for variation based on sample size

When looking at this sort of plot, it's often useful to filter out the groups with the smallest numbers of observations, so you can see more of the pattern and less of the extreme variation in the smallest groups. This is what the following code does, as well as showing you a handy pattern for integrating ggplot2 into dplyr flows. It's a bit painful that you have to switch from %>% to +, but once you get the hang of it, it's quite convenient.

delays <- not_cancelled %>%
  group_by(tailnum) %>%
  summarise(
    delay = mean(arr_delay),
    n = n()
  )
delays %>%
  filter(n > 25) %>%
  ggplot(mapping = aes(x = n, y = delay)) +
    geom_point(alpha = 1/10)

RStudio tip: a useful keyboard shortcut is Cmd/Ctrl + Shift + P. This resends the previously sent chunk from the editor to the console. This is very convenient when you're (e.g.) exploring the value of n in the example above. You send the whole block once with Cmd/Ctrl + Enter, then you modify the value of n and press Cmd/Ctrl + Shift + P to resend the complete block.


Sample size, average performance, and rank

There's another common variation of this type of pattern. Let's look at how the average performance of batters in baseball is related to the number of times they're at bat. Here I use data from the Lahman package to compute the batting average (number of hits / number of attempts) of every major league baseball player.

When I plot the skill of the batter (measured by the batting average, ba) against the number of opportunities to hit the ball (measured by at bat, ab), you see two patterns:

  1. As above, the variation in our aggregate decreases as we get more data points.

  2. There's a positive correlation between skill (ba) and opportunities to hit the ball (ab). This is because teams control who gets to play, and obviously they'll pick their best players.

# Convert to a tibble so it prints nicely
batting <- as_tibble(Lahman::Batting)

batters <- batting %>%
  group_by(playerID) %>%
  summarise(
    ba = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE),
    ab = sum(AB, na.rm = TRUE)
  )

batters %>%
  filter(ab > 100) %>%
  ggplot(mapping = aes(x = ab, y = ba)) +
    geom_point() +
    geom_smooth(se = FALSE)

This also has important implications for ranking. If you look closely, the people with the best batting averages are clearly lucky, not skilled.
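
A tiny made-up example shows why ranking on raw averages rewards small samples (both players are hypothetical):

```r
batters <- data.frame(
  player = c("two_at_bats", "veteran"),
  H  = c(1, 2000),
  AB = c(2, 6000)
)
batters$ba <- batters$H / batters$AB

batters[order(batters$ba, decreasing = TRUE), ]
# the 1-for-2 batter (ba = 0.5) "outranks" the .333 career hitter
```

Filtering on a minimum number of at-bats, as in the code above, is the simplest defense against this.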

You can find a good explanation of this problem at http://varianceexplained.org/r/empirical_bayes_baseball/ and http://www.evanmiller.org/how-not-to-sort-by-average-rating.html.

count()

Counts are so useful that dplyr provides a simple helper if all you want is a count:

not_cancelled %>%
  count(dest)

You can optionally provide a weight variable. For example, you could use this to "count" (sum) the total number of miles a plane flew:

not_cancelled %>%
  count(tailnum, wt = distance)

Exercise 5

Come up with another approach that will give you the same output as not_cancelled %>% count(dest) and not_cancelled %>% count(tailnum, wt = distance) (without using count()).


not_cancelled %>%
  group_by(dest) %>%
  summarise(n = n())

not_cancelled %>%
  group_by(tailnum) %>%
  summarise(n = sum(distance))
**Hint:** Consider the tools at your disposal: `group_by()`, `summarise()`, `n()`, `sum()`, and `?count`.
"Excellent Job! This was a tricky one, but you can now see that `count()` is a handy short cut for `group_by()` + `summarise()` + `n()` (or `sum()`)."

Exercise 6

What does the sort argument to count() do? When might you use it?

?count

Exercise 7

Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay?

# Task 1
# begin with a variable that shows the day of the year
# flights %>%
#   mutate(date = as.Date(paste(year, month, day, sep = "-"))) %>%
# create a variable that shows whether a flight is cancelled
# group by day
# create a summary by counting up the number of flights where cancelled is TRUE
# Plot the result against day

# Task 2
# recreate the grouped data above
# create a summary by taking the mean of cancelled variable
# ...as well as the average delay
# plot one against the other
flights %>%
  mutate(date = as.Date(paste(year, month, day, sep = "-"))) %>%
  mutate(cancelled = is.na(dep_delay) | is.na(arr_delay)) %>%
  group_by(date) %>%
  summarise(n = sum(cancelled)) %>%
  ggplot(aes(x = date, y = n)) +
    geom_point() +
    geom_smooth()

flights %>%
  mutate(date = as.Date(paste(year, month, day, sep = "-"))) %>%
  mutate(cancelled = is.na(dep_delay) | is.na(arr_delay)) %>%
  group_by(date) %>%
  summarise(prop = mean(cancelled), avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot(aes(x = prop, y = avg_delay)) +
    geom_point()
**Hint:** Don't forget to use `na.rm = TRUE` where appropriate.
"Wow! You did awesome."


learnr documentation built on Sept. 28, 2023, 9:06 a.m.