Summarise Tables

library(learnr)
library(tidyverse)
library(nycflights13)
library(Lahman)

tutorial_options(
  exercise.timelimit = 60,
  # A simple checker function that just returns the message in the check chunk
  exercise.checker = function(check_code, ...) {
    list(
      message = eval(parse(text = check_code)),
      correct = logical(0),
      type = "info",
      location = "append"
    )
  }
)
knitr::opts_chunk$set(error = TRUE)

Welcome

In this tutorial, you will learn how to summarise a table of data, including how to summarise groups with group_by() and summarise(), and how to combine multiple operations with the pipe, %>%.

The readings in this tutorial follow R for Data Science, section 5.6.

Setup

To practice these skills, we will use the flights data set from the nycflights13 package, which you met in Data Basics. This data frame comes from the US Bureau of Transportation Statistics and contains all 336,776 flights that departed from New York City in 2013. It is documented in ?flights.

To visualize the data, we will use the ggplot2 package that you met in Data Visualization Basics.

I've preloaded the packages for this tutorial with

library(tidyverse) # loads dplyr, ggplot2, and others
library(nycflights13)

Summarise groups with summarise()

summarise()

summarise() collapses a data frame to a single row of summaries. You get to choose how many summaries appear in the row and how they are computed:

summarise(flights, delay = mean(dep_delay, na.rm = TRUE),
                  total =  sum(dep_delay, na.rm = TRUE))

(We'll come back to what that na.rm = TRUE means very shortly.)

Notice that the syntax of summarise() is similar to mutate(). As with mutate(), you give summarise():

  1. The name of a data frame to transform
  2. One or more column names to appear in the transformed output. Each column name is set equal to the R expression that will generate the content of the column.

The main difference between summarise() and mutate() is the type of function that you use to generate the new columns. mutate() takes functions that return an entire vector of output (to append to the original data frame). summarise() takes functions that return a single value (or summary). These values will appear in a new data frame that has only one row.
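
To make the contrast concrete, here is a minimal sketch on a made-up three-row tibble (the data frame `df` is invented for illustration, not part of flights):

```r
library(dplyr)

df <- tibble(x = c(1, 2, 3))

# mutate() appends a full-length column: the single sum is recycled across all 3 rows
mutate(df, total = sum(x))

# summarise() collapses the data frame to one row containing just the summary
summarise(df, total = sum(x))
```

The same expression, `sum(x)`, drives both; only the shape of the result differs.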

group_by()

summarise() is not terribly useful unless you pair it with group_by(). group_by() changes the unit of analysis of the data frame: it assigns observations in the data frame to separate groups, and it instructs dplyr to apply functions separately to each group. group_by() assigns groups by grouping together observations that have the same combinations of values for the variables that you pass to group_by().

For example, the summarise() code above computes the average delay for the entire data set. If we apply exactly the same code to a data set that has been grouped by date (i.e. the unique combinations of year, month, and day), we get the average delay per date. Click "Run Code" to see what I mean:

by_day <- group_by(flights, year, month, day)
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE),
                  total = sum(dep_delay, na.rm = TRUE))
"Good job!"

Exercise 1

Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights %>% group_by(carrier, dest) %>% summarise(n()))


flights %>%
  group_by(carrier) %>%
  summarise(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
  mutate(rank = min_rank(desc(avg_delay))) %>%
  filter(rank == 1)
**Hint:** Use `min_rank(desc(avg_delay))` to rank `avg_delay` (for example) such that the largest delay receives rank one.
"Great work! Frontier Airlines (`F9`) had the highest average departure delay."

Exercise 2

For each plane, count the number of flights before the first delay of greater than 1 hour.


flights %>%
  filter(!is.na(dep_delay)) %>%
  group_by(tailnum) %>%
  mutate(big_delay = dep_delay > 60,
         before = !cumany(big_delay)) %>%
  summarise(sum(before))
**Hint:** One strategy would be to:

* filter out all rows where `dep_delay` is `NA`
* then group by plane
* create a variable that tests whether each flight was delayed longer than an hour
* create a variable that identifies flights that occur before the first big delay with `!cumany()`
* sum up the number of `TRUE`s
"Great work! That was tough. Be sure you understand each of the steps and functions involved."

Grouping by multiple variables

When you group by multiple variables, each summary peels off one level of the grouping. That makes it easy to progressively roll up a dataset. Run the code below and inspect each result to see how its grouping has changed (the grouping is displayed at the top of the tibble).

daily <- group_by(flights, year, month, day)
(per_day <- summarise(daily, total = sum(dep_delay, na.rm = TRUE)))
(per_month <- summarise(per_day, total = sum(total, na.rm = TRUE)))
(per_year  <- summarise(per_month, total = sum(total, na.rm = TRUE)))

Be careful when you progressively roll up summaries: it's OK for sums and counts, but you need to think about weighting means and variances, and it's not possible to do it exactly for rank-based statistics like the median. In other words, the sum of groupwise sums is the overall sum, but the median of groupwise medians is not the overall median.
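
Here is a minimal base-R illustration of that caveat, using made-up numbers:

```r
x <- c(10, 10, 10, 40)        # four delays
g <- c("a", "a", "a", "b")    # two groups of unequal size

group_means <- tapply(x, g, mean)   # a: 10, b: 40
mean(group_means)                   # 25 -- NOT the overall mean
mean(x)                             # 17.5 -- the true overall mean

# weighting each group mean by its group size recovers the overall mean
weighted.mean(group_means, as.numeric(table(g)))  # 17.5
```

Sums and counts avoid this trap because adding up groupwise sums always gives the overall sum.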

Ungrouping

If you need to remove grouping, and return to operations on ungrouped data, use ungroup().

daily <- group_by(flights, year, month, day)
daily <- ungroup(daily) # no longer grouped by date
summarise(daily, total = sum(dep_delay, na.rm = TRUE))  # all flights

Groups and dplyr

group_by() also works with the other dplyr functions; dplyr will apply filter(), select(), arrange(), and mutate() in a groupwise fashion to grouped data. However, group_by() is the most useful when combined with summarise(). Together group_by() and summarise() provide one of the tools that you'll use most commonly when working with dplyr: grouped summaries. But before we go any further with this, we need to introduce a powerful new idea: the pipe.
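
As a quick sketch of groupwise filter() and mutate() (the tibble `df` here is made up for illustration):

```r
library(dplyr)

df <- tibble(g = c("a", "a", "b", "b"), x = c(1, 5, 2, 8))
by_g <- group_by(df, g)

# grouped filter: keeps the row with the largest x *within each group*
filter(by_g, x == max(x))

# grouped mutate: divides each x by its own group's total, not the grand total
mutate(by_g, share = x / sum(x))
```
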

Combining multiple operations

Multiple steps

Imagine that we want to explore the relationship between the distance and average delay for each destination in flights. Using what you know about dplyr, you might write code like this:

by_dest <- group_by(flights, dest)
delay <- summarise(by_dest,
  count = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE)
)
delay <- filter(delay, count > 20, dest != "HNL")

ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
  geom_point(aes(size = count), alpha = 1/3) +
  geom_smooth(se = FALSE)

The code works, and we find an interesting effect: it looks like delays increase with distance up to ~750 miles and then decrease. Maybe as flights get longer there's more ability to make up delays in the air?

Now let's look at how we prepared the data. There are three steps:

  1. Group flights by destination.

  2. Summarise to compute distance, average delay, and number of flights.

  3. Filter to remove noisy points and Honolulu airport, which is almost twice as far away as the next closest airport.

This code is a little frustrating to write because we have to give each intermediate data frame a name, even though we don't care about it. Naming things is hard, so this slows down our analysis.

Pipes

There's another way to tackle the same problem. We can turn the code into a pipe with the pipe operator, %>%:

delays <- flights %>%
  group_by(dest) %>%
  summarise(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  filter(count > 20, dest != "HNL")

Behind the scenes, x %>% f(y) turns into f(x, y), and x %>% f(y) %>% g(z) turns into g(f(x, y), z) and so on. You can use the pipe to rewrite multiple operations in a way that you can read left-to-right, top-to-bottom.
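
For example, applied to plain numeric functions (the vector `x` here is made up for illustration; `%>%` comes from the magrittr package, which the tidyverse attaches for you):

```r
library(magrittr)  # provides %>%; also re-exported by dplyr

x <- c(4, 9, 16)

round(mean(sqrt(x)), 1)               # nested form: read inside-out
x %>% sqrt() %>% mean() %>% round(1)  # piped form: read left-to-right

# both return 3: the pipe is just a different way of writing the same calls
```
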

This focuses on the transformations, not what's being transformed, which makes the code easier to read. You can read it as a series of imperative statements: group, then summarise, then filter. A good way to pronounce %>% when reading code is "then".

We'll use piping frequently from now on because it considerably improves the readability of code, and we'll come back to it in more detail in Pipes.

The pipe is a defining feature of the tidyverse: packages in the tidyverse all contain functions that are designed to work with the pipe. The only exception is ggplot2: it was written before the pipe was discovered. Unfortunately, the next iteration of ggplot2, ggvis, which does use the pipe, isn't quite ready for prime time yet.

Useful summary functions {#summarise-funs}

Aggregating functions

You can get a long way with means and sums, but R provides many other useful functions to use with summarise(). Each of these functions acts as an aggregating function: it takes a vector of values and returns a single value.

Let's demonstrate some of the most useful aggregating functions with this data set, which removes flights that have no delay information (because they were cancelled).

not_cancelled <- flights %>%
  filter(!is.na(dep_delay), !is.na(arr_delay))

Exercise 3

Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. Consider the following scenarios:

Which is more important: arrival delay or departure delay?


**Hint:** Consider R's measures of location and measures of spread.

Missing values

You may have wondered about the na.rm argument we used in a previous section. What happens if we don't set it?

flights %>%
  group_by(year, month, day) %>%
  summarise(mean = mean(dep_delay))

We get a lot of missing values! That's because aggregation functions obey the usual rule of missing values: if there's any missing value in the input, the output will be a missing value. Fortunately, all aggregation functions have an na.rm argument which removes the missing values prior to computation:

flights %>%
  group_by(year, month, day) %>%
  summarise(mean = mean(dep_delay, na.rm = TRUE))
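
You can see the same rule in miniature on a bare vector:

```r
delays <- c(10, NA, 25)

mean(delays)                # NA -- one missing input makes the whole output missing
mean(delays, na.rm = TRUE)  # 17.5 -- the NA is dropped before the mean is computed
```
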

In this case, where missing values represent cancelled flights, we could also tackle the problem by first removing the cancelled flights, as we did to create not_cancelled.

not_cancelled <- flights %>%
  filter(!is.na(dep_delay), !is.na(arr_delay))

not_cancelled %>%
  group_by(year, month, day) %>%
  summarise(mean = mean(dep_delay))

Exercise 4

Our definition of cancelled flights (is.na(dep_delay) | is.na(arr_delay)) is slightly suboptimal. Why? Which is the most important column?

Counts

n()

Whenever you do any aggregation, it's always a good idea to include either a count (n()), or a count of non-missing values (sum(!is.na(x))). That way you can check that you're not drawing conclusions based on very small amounts of data. For example, let's look at the planes (identified by their tail number) that have the highest average delays:

delays <- not_cancelled %>%
  group_by(tailnum) %>%
  summarise(
    delay = mean(arr_delay)
  )

ggplot(data = delays, mapping = aes(x = delay)) +
  geom_freqpoly(binwidth = 10)

Wow, there are some planes that have an average delay of 5 hours (300 minutes)!

The story is actually a little more nuanced. We can get more insight if we draw a scatterplot of number of flights vs. average delay. Fill in the blank code below to compute and then plot the number of flights by the mean arrival delay (arr_delay).

# delays <- not_cancelled %>%
#   group_by(tailnum) %>%
#   summarise(
#     delay = _________,
#     n = n()
#   )
#
# ggplot(data = delays, mapping = aes(x = n, y = delay)) +
#   geom_point(alpha = 1/10)
delays <- not_cancelled %>%
  group_by(tailnum) %>%
  summarise(
    delay = mean(arr_delay),
    n = n()
  )

ggplot(data = delays, mapping = aes(x = n, y = delay)) +
  geom_point(alpha = 1/10)

Not surprisingly, there is much greater variation in the average delay when there are few flights. The shape of this plot is very characteristic: whenever you plot a mean (or other summary) vs. group size, you'll see that the variation decreases as the sample size increases.
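
You can reproduce this funnel shape with a quick simulation (the numbers are made up: a mean delay of 10 minutes with a standard deviation of 40 is just a stand-in):

```r
set.seed(42)

# 200 simulated "planes" at each of three group sizes
sizes <- rep(c(5, 50, 500), each = 200)
avg_delay <- sapply(sizes, function(n) mean(rnorm(n, mean = 10, sd = 40)))

# the spread of the group means shrinks roughly like sd / sqrt(n)
tapply(avg_delay, sizes, sd)
```
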

Accounting for variation based on sample size

When looking at this sort of plot, it's often useful to filter out the groups with the smallest numbers of observations, so you can see more of the pattern and less of the extreme variation in the smallest groups. This is what the following code does, as well as showing you a handy pattern for integrating ggplot2 into dplyr flows. It's a bit painful that you have to switch from %>% to +, but once you get the hang of it, it's quite convenient.

delays <- not_cancelled %>%
  group_by(tailnum) %>%
  summarise(
    delay = mean(arr_delay),
    n = n()
  )
delays %>%
  filter(n > 25) %>%
  ggplot(mapping = aes(x = n, y = delay)) +
    geom_point(alpha = 1/10)

RStudio tip: a useful keyboard shortcut is Cmd/Ctrl + Shift + P. This resends the previously sent chunk from the editor to the console. This is very convenient when you're (e.g.) exploring the value of n in the example above. You send the whole block once with Cmd/Ctrl + Enter, then you modify the value of n and press Cmd/Ctrl + Shift + P to resend the complete block.


Sample size, average performance, and rank

There's another common variation of this type of pattern. Let's look at how the average performance of batters in baseball is related to the number of times they're at bat. Here I use data from the Lahman package to compute the batting average (number of hits / number of attempts) of every major league baseball player.

When I plot the skill of the batter (measured by the batting average, ba) against the number of opportunities to hit the ball (measured by at bat, ab), you see two patterns:

  1. As above, the variation in our aggregate decreases as we get more data points.

  2. There's a positive correlation between skill (ba) and opportunities to hit the ball (ab). This is because teams control who gets to play, and obviously they'll pick their best players.

# Convert to a tibble so it prints nicely
batting <- as_tibble(Lahman::Batting)

batters <- batting %>%
  group_by(playerID) %>%
  summarise(
    ba = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE),
    ab = sum(AB, na.rm = TRUE)
  )

batters %>%
  filter(ab > 100) %>%
  ggplot(mapping = aes(x = ab, y = ba)) +
    geom_point() +
    geom_smooth(se = FALSE)

This also has important implications for ranking. If you look closely, the people with the best batting averages are clearly lucky, not skilled.
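
A tiny made-up example shows why ranking on raw averages rewards small samples (both players are hypothetical):

```r
batters <- data.frame(
  player = c("two_at_bats", "veteran"),
  H  = c(1, 2000),
  AB = c(2, 6000)
)
batters$ba <- batters$H / batters$AB

batters[order(batters$ba, decreasing = TRUE), ]
# the 1-for-2 batter (ba = 0.5) "outranks" the .333 career hitter
```

Filtering on a minimum number of at-bats, as in the code above, is the simplest defense against this.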

You can find a good explanation of this problem at http://varianceexplained.org/r/empirical_bayes_baseball/ and http://www.evanmiller.org/how-not-to-sort-by-average-rating.html.

count()

Counts are so useful that dplyr provides a simple helper if all you want is a count:

not_cancelled %>%
  count(dest)

You can optionally provide a weight variable. For example, you could use this to "count" (sum) the total number of miles a plane flew:

not_cancelled %>%
  count(tailnum, wt = distance)

Exercise 5

Come up with another approach that will give you the same output as not_cancelled %>% count(dest) and not_cancelled %>% count(tailnum, wt = distance) (without using count()).


not_cancelled %>%
  group_by(dest) %>%
  summarise(n = n())

not_cancelled %>%
  group_by(tailnum) %>%
  summarise(n = sum(distance))
**Hint:** Consider the tools at your disposal: `group_by()`, `summarise()`, `n()`, `sum()`, and `?count`.
"Excellent Job! This was a tricky one, but you can now see that `count()` is a handy short cut for `group_by()` + `summarise()` + `n()` (or `sum()`)."

Exercise 6

What does the sort argument to count() do? When might you use it?

?count

Exercise 7

Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay?

# Task 1
# begin with a variable that shows the day of the year
# flights %>%
#   mutate(date = as.Date(paste(year, month, day, sep = "-"))) %>%
# create a variable that shows whether a flight is cancelled
# group by day
# create a summary by counting up the number of flights where cancelled is TRUE
# Plot the result against day

# Task 2
# recreate the grouped data above
# create a summary by taking the mean of cancelled variable
# ...as well as the average delay
# plot one against the other
flights %>%
  mutate(date = as.Date(paste(year, month, day, sep = "-"))) %>%
  mutate(cancelled = is.na(dep_delay) | is.na(arr_delay)) %>%
  group_by(date) %>%
  summarise(n = sum(cancelled)) %>%
  ggplot(aes(x = date, y = n)) +
    geom_point() +
    geom_smooth()

flights %>%
  mutate(date = as.Date(paste(year, month, day, sep = "-"))) %>%
  mutate(cancelled = is.na(dep_delay) | is.na(arr_delay)) %>%
  group_by(date) %>%
  summarise(prop = mean(cancelled), avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot(aes(x = prop, y = avg_delay)) +
    geom_point()
**Hint:** Don't forget to use `na.rm = TRUE` where appropriate.
"Wow! You did awesome."


learnr documentation built on Sept. 28, 2023, 9:06 a.m.