library(learnr) library(tidyverse) library(nycflights13) library(Lahman) tutorial_options(exercise.timelimit = 60,exercise.eval = TRUE) knitr::opts_chunk$set(error = TRUE)
In this tutorial, you will learn how to summarise a table of data, including:
summarise()
summarise()
%>%
n()
group_by()
and summarise()
The readings in this tutorial follow R for Data Science, section 5.6.
To practice these skills, we will use the flights
data set from the nycflights13 package, which you met in Data Basics. This data frame comes from the US Bureau of Transportation Statistics and contains all r format(nrow(nycflights13::flights), big.mark = ",")
flights that departed from New York City in 2013. It is documented in ?flights
.
To visualize the data, we will use the ggplot2 package that you met in Data Visualization Basics.
I've preloaded the packages for this tutorial with
library(tidyverse) # loads dplyr, ggplot2, and others library(nycflights13)
summarise()
collapses a data frame to a single row of summaries. You get to choose how many summaries appear in the row and how they are computed:
summarise(flights, delay = mean(dep_delay, na.rm = TRUE), total = sum(dep_delay, na.rm = TRUE))
(We'll come back to what that na.rm = TRUE
means very shortly.)
Notice that the syntax of summarise()
is similar to mutate()
. As with mutate()
, you give summarise:
The main difference between summarise()
and mutate()
is the type of function that you use to generate the new columns. mutate()
takes functions that return an entire vector of output (to append to the original data frame). summarise()
takes functions that return a single value (or summary). These values will appear in a new data frame that has only one row.
summarise()
is not terribly useful unless you pair it with group_by()
. group_by()
changes the unit of analysis of the data frame: it assigns observations in the data frame to separate groups, and it instructs dplyr to apply functions separately to each group. group_by()
assigns groups by grouping together observations that have the same combinations of values for the variables that you pass to group_by()
.
For example, the summarise()
code above computes the average delay for the entire data set. If we apply exactly the same code to a data set that has been grouped by date (i.e. the unique combinations of year
, month
, and day
), we get the average delay per date. Click "Run Code" to see what I mean:
by_day <- group_by(flights, year, month, day) summarise(by_day, delay = mean(dep_delay, na.rm = TRUE), total = sum(dep_delay, na.rm = TRUE))
"Good job!"
Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights %>% group_by(carrier, dest) %>% summarise(n())
)
flights %>% group_by(carrier) %>% summarise(avg_delay = mean(dep_delay, na.rm = TRUE)) %>% mutate(rank = min_rank(desc(avg_delay))) %>% filter(rank == 1)
"Great work! Frontier airlines (`F9`) was the highest average departure delay."
For each plane, count the number of flights before the first delay of greater than 1 hour.
flights %>% filter(!is.na(dep_delay)) %>% group_by(tailnum) %>% mutate(big_delay = dep_delay > 60, before = !cumany(big_delay)) %>% summarise(sum(before))
"Great work! That was tough. Be sure you understand each of the steps and functions involved."
When you group by multiple variables, each summary peels off one level of the grouping. That makes it easy to progressively roll up a dataset. Run the code below and inspect each result to see how its grouping criteria has changed (the grouping criteria is displayed at the top of the tibble).
daily <- group_by(flights, year, month, day) (per_day <- summarise(daily, total = sum(dep_delay, na.rm = TRUE))) (per_month <- summarise(per_day, total = sum(total, na.rm = TRUE))) (per_year <- summarise(per_month, total = sum(total, na.rm = TRUE)))
Be careful when you progressively roll up summaries: it's OK for sums and counts, but you need to think about weighting means and variances, and it's not possible to do it exactly for rank-based statistics like the median. In other words, the sum of groupwise sums is the overall sum, but the median of groupwise medians is not the overall median.
If you need to remove grouping, and return to operations on ungrouped data, use ungroup()
.
daily <- group_by(flights, year, month, day)
daily <- ungroup(daily) # no longer grouped by date summarise(daily, total = sum(dep_delay, na.rm = TRUE)) # all flights
group_by()
also works with the other dplyr functions; dplyr will apply filter()
, select()
, arrange()
, and mutate()
in a groupwise fashion to grouped data. However, group_by()
is the most useful when combined with summarise()
. Together group_by()
and summarise()
provide one of the tools that you'll use most commonly when working with dplyr: grouped summaries. But before we go any further with this, we need to introduce a powerful new idea: the pipe.
Imagine that we want to explore the relationship between the distance and average delay for each destination in flights
. Using what you know about dplyr, you might write code like this:
by_dest <- group_by(flights, dest) delay <- summarise(by_dest, count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE) ) delay <- filter(delay, count > 20, dest != "HNL") ggplot(data = delay, mapping = aes(x = dist, y = delay)) + geom_point(aes(size = count), alpha = 1/3) + geom_smooth(se = FALSE)
The code works, and we find an interesting effect: It looks like delays increase with distance up to ~750 miles and then decrease. Maybe as flights get longer there's more ability to make up delays in the air?
Now let's look at how we prepared the data. There are three steps:
Group flights by destination.
Summarise to compute distance, average delay, and number of flights.
Filter to remove noisy points and Honolulu airport, which is almost twice as far away as the next closest airport.
This code is a little frustrating to write because we have to give each intermediate data frame a name, even though we don't care about it. Naming things is hard, so this slows down our analysis.
There's another way to tackle the same problem. We can turn the code into a pipe with the pipe operator, %>%
:
delays <- flights %>% group_by(dest) %>% summarise( count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE) ) %>% filter(count > 20, dest != "HNL")
Behind the scenes, x %>% f(y)
turns into f(x, y)
, and x %>% f(y) %>% g(z)
turns into g(f(x, y), z)
and so on. You can use the pipe to rewrite multiple operations in a way that you can read left-to-right, top-to-bottom.
This focuses on the transformations, not what's being transformed, which makes the code easier to read. You can read it as a series of imperative statements: group, then summarise, then filter. As suggested by this reading, a good way to pronounce %>%
when reading code is "then".
We'll use piping frequently from now on because it considerably improves the readability of code, and we'll come back to it in more detail in Pipes.
The pipe is a defining feature of the tidyverse: packages in the tidyverse all contain functions that are designed to work with the pipe. The only exception is ggplot2: it was written before the pipe was discovered. Unfortunately, the next iteration of ggplot2, ggvis, which does use the pipe, isn't quite ready for prime time yet.
You can get a long way with means and sum; but R provides many other useful functions to use with summary. Each of these functions acts as an aggregating function: it takes a vector of values and returns a single value.
Let's demonstrate some of the most useful aggregating functions with this data set, which removes flights that have no delay information (because they were cancelled).
not_cancelled <- flights %>% filter(!is.na(dep_delay), !is.na(arr_delay))
Measures of location: we've used mean(x)
, but median(x)
is also useful. The mean is the sum divided by the length; the median is a value where 50% of x
is above it, and 50% is below it.
It's sometimes useful to combine aggregation with logical subsetting. We haven't talked about this sort of subsetting yet, but you'll learn more about it in Subsetting.
r
not_cancelled %>%
group_by(year, month, day) %>%
summarise(
avg_delay1 = mean(arr_delay),
avg_delay2 = mean(arr_delay[arr_delay > 0]) # the average positive delay
)
Measures of spread: sd(x)
, IQR(x)
, mad(x)
. The mean squared deviation, or standard deviation or sd for short, is the standard measure of spread. The interquartile range IQR()
and median absolute deviation mad(x)
are robust equivalents that may be more useful if you have outliers.
```r
not_cancelled %>% group_by(dest) %>% summarise(distance_sd = sd(distance)) %>% arrange(desc(distance_sd)) ```
Measures of rank: min(x)
, quantile(x, 0.25)
, max(x)
. Quantiles are a generalisation of the median. For example, quantile(x, 0.25)
will find a value of x
that is greater than 25% of the values, and less than the remaining 75%.
```r
not_cancelled %>% group_by(year, month, day) %>% summarise( first = min(dep_time), last = max(dep_time) ) ```
Measures of position: first(x)
, nth(x, 2)
, last(x)
. These work similarly to x[1]
, x[2]
, and x[length(x)]
but let you set a default value if that position does not exist (i.e. you're trying to get the 3rd element from a group that only has two elements). For example, we can find the first and last departure for each day:
r
not_cancelled %>%
group_by(year, month, day) %>%
summarise(
first_dep = first(dep_time),
last_dep = last(dep_time)
)
These functions are complementary to filtering on ranks. Filtering gives you all variables, with each observation in a separate row:
r
not_cancelled %>%
group_by(year, month, day) %>%
mutate(r = min_rank(desc(dep_time))) %>%
filter(r %in% range(r))
Counts: In the next section, you will meet n()
, which takes no arguments, and returns the size of the current group. You can count other useful quantities as well. To count the number of non-missing values, use sum(!is.na(x))
. To count the number of distinct (unique) values, use n_distinct(x)
.
```r
not_cancelled %>% group_by(dest) %>% summarise(carriers = n_distinct(carrier)) %>% arrange(desc(carriers)) ```
Counts and proportions of logical values: sum(x > 10)
, mean(y == 0)
. When used with numeric functions, TRUE
is converted to 1 and FALSE
to 0. This makes sum()
and mean()
very useful: sum(x)
gives the number of TRUE
s in x
, and mean(x)
gives the proportion.
```r
not_cancelled %>% group_by(year, month, day) %>% summarise(n_early = sum(dep_time < 500))
not_cancelled %>% group_by(year, month, day) %>% summarise(hour_perc = mean(arr_delay > 60)) ```
Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. Consider the following scenarios:
A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.
A flight is always 10 minutes late.
A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.
99% of the time a flight is on time. 1% of the time it's 2 hours late.
Which is more important: arrival delay or departure delay?
You may have wondered about the na.rm
argument we used in a previous section. What happens if we don't set it?
flights %>% group_by(year, month, day) %>% summarise(mean = mean(dep_delay))
We get a lot of missing values! That's because aggregation functions obey the usual rule of missing values: if there's any missing value in the input, the output will be a missing value. Fortunately, all aggregation functions have an na.rm
argument which removes the missing values prior to computation:
flights %>% group_by(year, month, day) %>% summarise(mean = mean(dep_delay, na.rm = TRUE))
In this case, where missing values represent cancelled flights, we could also tackle the problem by first removing the cancelled flights, as we did to create not_cancelled
.
not_cancelled <- flights %>% filter(!is.na(dep_delay), !is.na(arr_delay)) not_cancelled %>% group_by(year, month, day) %>% summarise(mean = mean(dep_delay))
Our definition of cancelled flights (is.na(dep_delay) | is.na(arr_delay)
) is slightly suboptimal. Why? Which is the most important column?
Whenever you do any aggregation, it's always a good idea to include either a count (n()
), or a count of non-missing values (sum(!is.na(x))
). That way you can check that you're not drawing conclusions based on very small amounts of data. For example, let's look at the planes (identified by their tail number) that have the highest average delays:
delays <- not_cancelled %>% group_by(tailnum) %>% summarise( delay = mean(arr_delay) ) ggplot(data = delays, mapping = aes(x = delay)) + geom_freqpoly(binwidth = 10)
Wow, there are some planes that have an average delay of 5 hours (300 minutes)!
The story is actually a little more nuanced. We can get more insight if we draw a scatterplot of number of flights vs. average delay. Fill in the blank code below to compute and then plot the number of flights by the mean arrival delay (arr_delay
).
# delays <- not_cancelled %>% # group_by(tailnum) %>% # summarise( # delay = _________, # n = n() # ) # # ggplot(data = delays, mapping = aes(x = n, y = delay)) + # geom_point(alpha = 1/10)
delays <- not_cancelled %>% group_by(tailnum) %>% summarise( delay = mean(arr_delay), n = n() ) ggplot(data = delays, mapping = aes(x = n, y = delay)) + geom_point(alpha = 1/10)
Not surprisingly, there is much greater variation in the average delay when there are few flights. The shape of this plot is very characteristic: whenever you plot a mean (or other summary) vs. group size, you'll see that the variation decreases as the sample size increases.
When looking at this sort of plot, it's often useful to filter out the groups with the smallest numbers of observations, so you can see more of the pattern and less of the extreme variation in the smallest groups. This is what the following code does, as well as showing you a handy pattern for integrating ggplot2 into dplyr flows. It's a bit painful that you have to switch from %>%
to +
, but once you get the hang of it, it's quite convenient.
delays <- not_cancelled %>% group_by(tailnum) %>% summarise( delay = mean(arr_delay), n = n() )
delays %>% filter(n > 25) %>% ggplot(mapping = aes(x = n, y = delay)) + geom_point(alpha = 1/10)
RStudio tip: a useful keyboard shortcut is Cmd/Ctrl + Shift + P. This resends the previously sent chunk from the editor to the console. This is very convenient when you're (e.g.) exploring the value of n
in the example above. You send the whole block once with Cmd/Ctrl + Enter, then you modify the value of n
and press Cmd/Ctrl + Shift + P to resend the complete block.
There's another common variation of this type of pattern. Let's look at how the average performance of batters in baseball is related to the number of times they're at bat. Here I use data from the Lahman package to compute the batting average (number of hits / number of attempts) of every major league baseball player.
When I plot the skill of the batter (measured by the batting average, ba
) against the number of opportunities to hit the ball (measured by at bat, ab
), you see two patterns:
As above, the variation in our aggregate decreases as we get more data points.
There's a positive correlation between skill (ba
) and opportunities to hit the ball (ab
). This is because teams control who gets to play, and obviously they'll pick their best players.
# Convert to a tibble so it prints nicely batting <- as_tibble(Lahman::Batting) batters <- batting %>% group_by(playerID) %>% summarise( ba = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE), ab = sum(AB, na.rm = TRUE) ) batters %>% filter(ab > 100) %>% ggplot(mapping = aes(x = ab, y = ba)) + geom_point() + geom_smooth(se = FALSE)
This also has important implications for ranking. If you look closely, the people with the best batting averages are clearly lucky, not skilled.
You can find a good explanation of this problem at http://varianceexplained.org/r/empirical_bayes_baseball/ and http://www.evanmiller.org/how-not-to-sort-by-average-rating.html.
Counts are so useful that dplyr provides a simple helper if all you want is a count:
not_cancelled %>% count(dest)
You can optionally provide a weight variable. For example, you could use this to "count" (sum) the total number of miles a plane flew:
not_cancelled %>% count(tailnum, wt = distance)
Come up with another approach that will give you the same output as not_cancelled %>% count(dest)
and not_cancelled %>% count(tailnum, wt = distance)
(without using count()
).
not_cancelled %>% group_by(dest) %>% summarise(n = n()) not_cancelled %>% group_by(tailnum) %>% summarise(n = sum(distance))
"Excellent Job! This was a tricky one, but you can now see that `count()` is a handy short cut for `group_by()` + `summarise()` + `n()` (or `sum()`)."
What does the sort
argument to count()
do. When might you use it?
?count
Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay?
# Task 1 # begin with a variable that shows the day of the year # flights %>% # mutate(date = as.Date(paste(year, month, day, sep = "-"))) %>% # create a variable that shows whether a flight is cancelled # group by day # create a summary by counting up the number of flights where cancelled is TRUE # Plot the result against day # Task 2 # recreate the grouped data above # create a summary by taking the mean of cancelled variable # ...as well as the average delay # plot one against the other
flights %>% mutate(date = as.Date(paste(year, month, day, sep = "-"))) %>% mutate(cancelled = is.na(dep_delay) | is.na(arr_delay)) %>% group_by(date) %>% summarise(n = sum(cancelled)) %>% ggplot(aes(x = date, y = n)) + geom_point() + geom_smooth() flights %>% mutate(date = as.Date(paste(year, month, day, sep = "-"))) %>% mutate(cancelled = is.na(dep_delay) | is.na(arr_delay)) %>% group_by(date) %>% summarise(prop = mean(cancelled), avg_delay = mean(dep_delay, na.rm = TRUE)) %>% ggplot(aes(x = prop, y = avg_delay)) + geom_point()
"Wow! You did awesome."
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.