In Athanasiamo/swc.tidyverse: Introduction to Tidyverse

Questions

How can I create summary tables of my data?

Objectives

To be able to understand how to group data to create convenient summaries.

Data summaries

Next to visualizing data, creating summaries of the data in tables is a quick way to get an idea of what type of data you have at hand. It might help you spot incorrect data or extreme values, or whether specific analysis approaches are needed.

To summarize data with the {tidyverse} efficiently, we need to use the tools we have learned in the previous days, like adding new variables, tidy-selections, pivots and grouping data. All these tools combine amazingly when we start making summaries.

Let us start from the beginning with summaries, and work our way up to the more complex variations as we go.

First, we must again prepare our workspace with our packages and data.

library(tidyverse)
penguins <- palmerpenguins::penguins

We should start to feel quite familiar with our penguins by now. Let us start by finding the mean of the bill length.

penguins %>% 
  summarise(bill_length_mean = mean(bill_length_mm))

The result is NA. As we remember, there are some NA values in our data. We can omit these by adding the na.rm = TRUE argument, which removes all NAs before calculating the mean.

penguins %>% 
  summarise(bill_length_mean = mean(bill_length_mm, na.rm = TRUE))
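
To see what na.rm = TRUE does without the whole data set in the way, here is a minimal sketch on a made-up three-row table. The object names toy and toy_mean and the values are invented for illustration and are not part of the penguins data; the sketch loads {dplyr} directly instead of the whole {tidyverse}.

```r
library(dplyr)

# Hypothetical toy data (not the penguins data set): one value is missing
toy <- tibble(bill_length_mm = c(40, NA, 50))

# Without na.rm, a single NA makes the whole mean NA
toy %>%
  summarise(bill_length_mean = mean(bill_length_mm))

# With na.rm = TRUE, the mean is taken over the two known values only
toy_mean <- toy %>%
  summarise(bill_length_mean = mean(bill_length_mm, na.rm = TRUE))
toy_mean
```

The second summary is the mean of 40 and 50 only, so the missing row does not count towards the number of observations.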

An alternative way to remove missing values from a column is to pass the column to {tidyr}'s drop_na() function.

penguins %>% 
  drop_na(bill_length_mm) %>% 
  summarise(bill_length_mean = mean(bill_length_mm))

We can also calculate several summaries at once by giving summarise() more than one argument. Each argument becomes its own column in the output.

penguins %>% 
  drop_na(bill_length_mm) %>% 
  summarise(bill_length_mean = mean(bill_length_mm),
            bill_length_min = min(bill_length_mm),
            bill_length_max = max(bill_length_mm))

Data summaries, challenges

Assignment

Room: break-out
Duration: 10 minutes

1a: Start by summarising a single column, body_mass_g, by calculating its mean in kilograms.

1b: Add a column with the standard deviation of body_mass_g on the kilogram scale.

1c: Now add the same two metrics for flipper_length_mm on the centimeter scale and give the columns clear names. Why could the drop_na() step give us wrong results?

Solution

## 1a
penguins %>% 
  drop_na(body_mass_g) %>% 
  summarise(body_mass_kg_mean = mean(body_mass_g / 1000))

## 1b
penguins %>% 
  drop_na(body_mass_g) %>% 
  summarise(body_mass_kg_mean = mean(body_mass_g / 1000),
            body_mass_kg_sd = sd(body_mass_g / 1000))

## 1c 
penguins %>% 
  summarise(body_mass_kg_mean      = mean(body_mass_g / 1000, na.rm = TRUE),
            body_mass_kg_sd        = sd(body_mass_g / 1000, na.rm = TRUE),
            flipper_length_cm_mean = mean(flipper_length_mm / 10, na.rm = TRUE),
            flipper_length_cm_sd   = sd(flipper_length_mm / 10, na.rm = TRUE))

penguins %>% 
  drop_na(body_mass_g, flipper_length_mm) %>% 
  summarise(body_mass_kg_mean      = mean(body_mass_g / 1000),
            body_mass_kg_sd        = sd(body_mass_g / 1000),
            flipper_length_cm_mean = mean(flipper_length_mm / 10),
            flipper_length_cm_sd   = sd(flipper_length_mm / 10))

Here, we also added some extra space after the column names to line up the function calls. This is a fairly common coding practice for this type of code, and it usually makes the code easier for others to read.
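
Returning to the question in 1c: drop_na() removes a whole row as soon as any of the listed columns is missing, while na.rm = TRUE handles each column on its own. The sketch below uses a made-up table (toy, per_column and complete_rows are names invented for this illustration) in which one row lacks only the flipper length. Whether the two approaches differ on a real data set depends on whether the missing values fall in the same rows.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: the second row is missing only the flipper length
toy <- tibble(
  body_mass_g       = c(3000, 4500, 5100),
  flipper_length_mm = c(180, NA, 200)
)

# na.rm = TRUE works column by column: all three body masses are used
per_column <- toy %>%
  summarise(body_mass_kg_mean      = mean(body_mass_g / 1000, na.rm = TRUE),
            flipper_length_cm_mean = mean(flipper_length_mm / 10, na.rm = TRUE))

# drop_na() removes the whole second row, so its body mass is lost as well
complete_rows <- toy %>%
  drop_na(body_mass_g, flipper_length_mm) %>%
  summarise(body_mass_kg_mean      = mean(body_mass_g / 1000),
            flipper_length_cm_mean = mean(flipper_length_mm / 10))

per_column     # body mass mean is based on 3 rows
complete_rows  # body mass mean is based on 2 rows
```

The flipper length means agree, but the body mass means do not, because drop_na() discarded a perfectly valid body mass along with the missing flipper length.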

Summarising grouped data

In all the examples so far, we have summarized the entire data set. But most of the time, we want to look at groups in our data and summarize based on these groups. How can we summarize while preserving grouping information?

We've already worked a little with the group_by() function, and we will use it again! Once we know how to summarize data, summarizing data by groups is as simple as adding one more line to our code.

Let us start with our first example of getting the mean of a single column.

penguins %>% 
  drop_na(body_mass_g) %>% 
  summarise(body_mass_g_mean = mean(body_mass_g))

Here, we are getting a single mean for the entire data set. To get, for instance, the mean for each species, we can group by species before we summarize.

penguins %>% 
  drop_na(body_mass_g) %>% 
  group_by(species) %>% 
  summarise(body_mass_kg_mean = mean(body_mass_g / 1000))

And now we suddenly have three means! And they are tidily collected, each in its own row. We can keep adding to this as we did before.

penguins %>% 
  drop_na(body_mass_g) %>% 
  group_by(species) %>%
  summarise(body_mass_kg_mean = mean(body_mass_g / 1000),
            body_mass_kg_min = min(body_mass_g / 1000),
            body_mass_kg_max = max(body_mass_g / 1000))
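
One way to think about a grouped summary is that it does the same work as filtering out each group in turn and summarising it separately, then stacking the results. The following sketch checks this on a made-up table; toy, grouped and only_a are names invented for this illustration, and the species labels "A" and "B" are placeholders.

```r
library(dplyr)

# Hypothetical toy data with two groups
toy <- tibble(
  species     = c("A", "A", "B", "B"),
  body_mass_g = c(3000, 4000, 5000, 6000)
)

# Grouped summary: one row per species
grouped <- toy %>%
  group_by(species) %>%
  summarise(body_mass_kg_mean = mean(body_mass_g / 1000))

# The same number for species "A", calculated by hand with filter()
only_a <- toy %>%
  filter(species == "A") %>%
  summarise(body_mass_kg_mean = mean(body_mass_g / 1000))

grouped
only_a
```

The row for species "A" in grouped holds the same mean as only_a, but group_by() gives us all the groups in one go.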

Now we are suddenly able to easily compare groups within our data, since they are so neatly summarized here.

Ungrouping for future control

We've been grouping a lot and not ungrouping. This might seem fine now, because we have not really done anything more after the summarize. But in many cases we might continue on our merry data handling way and do lots more, and then the preserved grouping can give us some unexpected results. Let us explore that a little.

penguins %>% 
  group_by(species) %>% 
  summarise(records = n())

When we group by a single column and summarize, the output data is no longer grouped. In a way, summarise() uses up one group while summarizing: based on species, the data cannot be condensed any further than this.

penguins %>% 
  group_by(species, island) %>% 
  summarise(records = n())
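
A quick way to check which grouping is left after summarise() is {dplyr}'s group_vars() function, which returns the names of the current grouping variables. The sketch below uses a made-up table; toy, one_group and two_groups are names invented for this illustration.

```r
library(dplyr)

# Hypothetical toy data with two columns we can group by
toy <- tibble(
  species = c("A", "A", "A", "B"),
  island  = c("x", "x", "y", "y")
)

# Grouped by one column before summarising
one_group <- toy %>%
  group_by(species) %>%
  summarise(records = n())

# Grouped by two columns before summarising
two_groups <- toy %>%
  group_by(species, island) %>%
  summarise(records = n())

group_vars(one_group)   # no grouping variables left
group_vars(two_groups)  # "species" is still a grouping variable
```

Depending on the {dplyr} version, summarise() may also print a message about the grouping of its output when there is more than one grouping variable.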

When we group by two columns, it actually has the same behavior. But because we used to have two grouping variables, we are now left with one. In this case "species" is still a grouping variable. Let's say we now want a column that counts the total number of penguin observations. That would be the sum of the "records" column.

penguins %>% 
  group_by(species, island) %>% 
  summarise(records = n()) %>% 
  mutate(total = sum(records))

But that is not what we are expecting! Why? Because the data is still grouped by species, it is now taking the sum within each species, rather than over the whole data set. To get the overall total, we first need to ungroup(), and then try again.

penguins %>% 
  group_by(species, island) %>% 
  summarise(records = n()) %>% 
  ungroup() %>% 
  mutate(total = sum(records))
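
As an alternative to a separate ungroup() step, summarise() has a .groups argument, and .groups = "drop" returns ungrouped output directly. This assumes {dplyr} 1.0.0 or later. The sketch below compares the grouped and ungrouped totals on a made-up table; toy, within_species and overall are names invented for this illustration.

```r
library(dplyr)

# Hypothetical toy data: three records for species "A", one for species "B"
toy <- tibble(
  species = c("A", "A", "A", "B"),
  island  = c("x", "x", "y", "y")
)

# Still grouped by species: totals are calculated within each species
within_species <- toy %>%
  group_by(species, island) %>%
  summarise(records = n()) %>%
  mutate(total = sum(records))

# .groups = "drop" removes all grouping, so the total covers every row
overall <- toy %>%
  group_by(species, island) %>%
  summarise(records = n(), .groups = "drop") %>%
  mutate(total = sum(records))

within_species$total  # 3 3 1
overall$total         # 4 4 4
```

Which form to use is mostly a matter of taste: ungroup() is explicit and works after any grouped step, while .groups = "drop" keeps the decision next to the summarise() call that it affects.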

Athanasiamo/swc.tidyverse documentation built on Dec. 17, 2021, 9:48 a.m.
