knitr::opts_chunk$set(echo = TRUE)
Questions
How can I create summary tables of my data?
Objectives
To be able to understand how to group data to create convenient summaries.
Next to visualizing data, creating summaries of the data in tables is a quick way to get an idea of what type of data you have at hand. It might help you spot incorrect data or extreme values, or whether specific analysis approaches are needed.
To summarize data with the {tidyverse} efficiently, we need to utilize the tools we have learned the previous days, like adding new variables, tidy-selections, pivots and grouping data. All these tools combine amazingly when we start making summaries.
Let us start from the beginning with summaries, and work our way up to the more complex variations as we go.
First, we must again prepare our workspace with our packages and data.
library(tidyverse) penguins <- palmerpenguins::penguins
We should start to feel quite familiar with our penguins by now. Let us start by finding the mean of the bill length
penguins %>% summarise(bill_length_mean = mean(bill_length_mm))
NA
. as we remember, there are some NA
values in our data. We can omit these by adding the na.rm = TRUE
argument, which will remove all NA
's before calculating the mean.
penguins %>% summarise(bill_length_mean = mean(bill_length_mm, na.rm = TRUE))
An alternative way to remove missing values from a column is to pass the column to {tidyr}'s drop_na()
function.
penguins %>% drop_na(bill_length_mm) %>% summarise(bill_length_mean = mean(bill_length_mm))
penguins %>% drop_na(bill_length_mm) %>% summarise(bill_length_mean = mean(bill_length_mm), bill_length_min = min(bill_length_mm), bill_length_max = max(bill_length_mm))
Room: break-out
Duration: 10 minutes
1a: First start by trying to summarise a single column,
body_mass_g
, by calculating its mean in kilograms.1b: Add a column with the standard deviation of
body_mass_g
on kilogram scale.1c: Now add the same two metrics for
flipper_length_mm
on centimeter scale and give the columns clear names. Why could thedrop_na()
step give us wrong results?
## 1a penguins %>% drop_na(body_mass_g) %>% summarise(body_mass_kg_mean = mean(body_mass_g / 1000)) # 1b penguins %>% drop_na(body_mass_g) %>% summarise(body_mass_kg_mean = mean(body_mass_g / 1000), body_mass_kg_sd = sd(body_mass_g / 1000)) ## 1c penguins %>% summarise(body_mass_kg_mean = mean(body_mass_g / 1000, na.rm = TRUE), body_mass_kg_sd = sd(body_mass_g / 1000, na.rm = TRUE), flipper_length_cm_mean = mean(flipper_length_mm / 10, na.rm = TRUE), flipper_length_cm_sd = sd(flipper_length_mm / 10, na.rm = TRUE)) penguins %>% drop_na(body_mass_g, flipper_length_mm) %>% summarise(body_mass_kg_mean = mean(body_mass_g / 1000), body_mass_kg_sd = sd(body_mass_g / 1000), flipper_length_cm_mean = mean(flipper_length_mm / 10), flipper_length_cm_sd = sd(flipper_length_mm / 10))
Here, we also added some extra space after the column names, to align the functions up. This is a fairly common coding practice for this type of code, that usually makes it easier for others to read.
All the examples we have gone through so far with summarizing data, we have summarized the entire data set. But most times, we want to have a look at groups in our data, and summarize based on these groups. How can we manage to summarize while preserving grouping information?
We've already worked a little with the group_by()
function, and we will use it again! Because, once we know how to summarize data, summarizing data by groups is as simple as adding one more line to our code.
Let us start with our first example of getting the mean of a single column.
penguins %>% drop_na(body_mass_g) %>% summarise(body_mass_g_mean = mean(body_mass_g))
Here, we are getting a single mean for the entire data set. In order to get, for instance the means of each of the species, we can group by species before we summarize.
penguins %>% drop_na(body_mass_g) %>% group_by(species) %>% summarise(body_mass_kg_mean = mean(body_mass_g / 1000))
And now we suddenly have three means! And they are tidily collected in each their row. To this we can keep adding as we did before.
penguins %>% drop_na(body_mass_g) %>% group_by(species) %>% summarise(body_mass_kg_mean = mean(body_mass_g / 1000), body_mass_kg_min = min(body_mass_g / 1000), body_mass_kg_max = max(body_mass_g / 1000))
Now we are suddenly able to easily compare groups within our data, since they are so neatly summarized here.
We've been grouping a lot and not ungrouping. Which might seem fine now, because we have not really done anything more after the summarize. But in many cases we might continue our merry data handling way and do lots more, and then the preserving of the grouping can give us some unexpected results. Let us explore that a little.
penguins %>% group_by(species) %>% summarise(records = n())
When we group by a single column and summarize, the output data is no longer grouped. In a way, the summarize()
uses up one group while summarizing, as based on species, the data can not be condensed any further than this.
penguins %>% group_by(species, island) %>% summarise(records = n())
When we group by two columns, it actually has the same behavior. But because we used to have two groups, we now are left with one. In this case "species" is still a grouping variable. Lets say we want a column now, that counts the total number of penguins observations. That would be the sum of the "n" column.
penguins %>% group_by(species, island) %>% summarise(records = n()) %>% mutate(total = sum(records))
But that is not what we are expecting! why? Because the data is still grouped by species, it is now taking the sum within each species, rather than the whole. To get the whole we need first to ungroup()
, and then try again.
penguins %>% group_by(species, island) %>% summarise(records = n()) %>% ungroup() %>% mutate(total = sum(records))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.