We’re going to work with the PanTHERIA database, a dataset of mammal life-history, geography, and ecology traits:

Jones, K.E., et al. PanTHERIA: a species-level database of life history, ecology, and geography of extant and recently extinct mammals. Ecology 90:2648. http://esapubs.org/archive/ecol/E090/184/

First we’ll load the data:

library(traitdata)
data(pantheria)

Next we’ll load the dplyr package:

library(dplyr)

Looking at the data

Data frames look a bit different in dplyr. Below, I call the as_tibble() function on our data, which converts it to a tibble (older dplyr code used tbl_df() for this; that function is now deprecated). A tibble prints more usefully in the console: only the first few rows and the columns that fit on screen are shown. Ever accidentally printed a massive data frame in the console before? Yeah, this avoids that. You don’t need to convert your data to a tibble first; the dplyr functions work on ordinary data frames as well. This is what the data look like on the console:

as_tibble(pantheria)

dplyr also provides a function glimpse() that makes it easy to look at our data in a transposed view. It’s similar to the str() (structure) function, but has a few advantages (see ?glimpse).

glimpse(pantheria)

Selecting columns

select() lets you subset by columns. This is similar to subset() in base R, but it also allows for some fancy use of helper functions such as contains(), starts_with(), and ends_with(). The example here selects a range of adjacent columns by name:

select(pantheria, AdultHeadBodyLen_mm:LitterSize) %>% head()
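
The helper functions mentioned above can be sketched on a small made-up data frame. The `toy` object and its values are hypothetical; only the column names are borrowed from PanTHERIA so the patterns look familiar.

```r
library(dplyr)

# Hypothetical toy data: PanTHERIA-style column names, invented values
toy <- data.frame(
  Order = c("Carnivora", "Rodentia"),
  AdultBodyMass_g = c(150, 20),
  AdultHeadBodyLen_mm = c(200, 80),
  LitterSize = c(4, 6)
)

# Columns whose names begin with "Adult"
adult_cols <- select(toy, starts_with("Adult"))

# Columns whose names end with "_mm"
mm_cols <- select(toy, ends_with("_mm"))

# A named column plus any column whose name contains "Mass"
order_mass <- select(toy, Order, contains("Mass"))

names(adult_cols)
names(mm_cols)
names(order_mass)
```

Helpers can be mixed freely with bare column names and ranges in the same select() call.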

Filtering rows

filter() lets you subset by rows. You can use any valid logical statements:

filter(pantheria, Order == "Carnivora" & AdultBodyMass_g < 200) %>% head()
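
A few other common filter() patterns, sketched on a hypothetical toy data frame (the values are invented for illustration). Note that rows where a condition evaluates to NA are dropped, which matters for trait data with many missing values.

```r
library(dplyr)

# Hypothetical toy data with one missing body mass
toy <- data.frame(
  Order = c("Carnivora", "Carnivora", "Rodentia", "Primates"),
  AdultBodyMass_g = c(150, 4000, 20, NA)
)

# Comma-separated conditions are combined with AND, like & in the example above
small_carnivores <- filter(toy, Order == "Carnivora", AdultBodyMass_g < 200)

# %in% keeps rows matching any of several values
two_orders <- filter(toy, Order %in% c("Rodentia", "Primates"))

# Missing values have to be tested for explicitly with is.na()
missing_mass <- filter(toy, is.na(AdultBodyMass_g))

nrow(small_carnivores)
nrow(two_orders)
nrow(missing_mass)
```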

Arranging rows

arrange() lets you order the rows by one or more columns in ascending or descending order. I’m selecting the first three columns only to make the output easier to read:

arrange(pantheria, AdultBodyMass_g)[ , 1:3] %>% head()
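
The example above sorts in ascending order by a single column. As a minimal sketch of the other two cases, descending order and sorting by more than one column, here is a hypothetical toy data frame with invented values:

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  Order = c("Rodentia", "Carnivora", "Rodentia", "Carnivora"),
  AdultBodyMass_g = c(20, 4000, 300, 150)
)

# desc() reverses the sort order for a column
heaviest_first <- arrange(toy, desc(AdultBodyMass_g))

# Several columns: sort by Order first, then by descending mass within each order
by_order_then_mass <- arrange(toy, Order, desc(AdultBodyMass_g))

heaviest_first
by_order_then_mass
```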

Mutating columns

mutate() lets you add new columns. Within a single mutate() call, new columns can build on ones created earlier in the same call. I’ll wrap this in head() to keep the output short:

head(mutate(pantheria, 
    g_per_mm = AdultBodyMass_g / AdultHeadBodyLen_mm))
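
To illustrate columns building on each other, here is a small sketch on a hypothetical toy data frame. The names `mass_kg`, `length_m`, and `kg_per_m` are made up for this example; the last one is computed from the two columns created just before it in the same call.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  AdultBodyMass_g = c(150, 4000),
  AdultHeadBodyLen_mm = c(200, 600)
)

with_ratio <- mutate(toy,
  mass_kg  = AdultBodyMass_g / 1000,      # first new column
  length_m = AdultHeadBodyLen_mm / 1000,  # second new column
  kg_per_m = mass_kg / length_m           # built from the two columns above
)

with_ratio
```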

Summarising columns

Finally, summarise() lets you calculate summary statistics. On its own summarise() isn’t that useful, but when combined with group_by() you can summarise by chunks of data. This is similar to what you might be familiar with through ddply() and summarise() from the plyr package:

head(summarise(group_by(pantheria, Order),
  mean_mass = mean(AdultBodyMass_g, na.rm = TRUE)))
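
summarise() can compute several statistics per group in one call. The sketch below uses a hypothetical toy data frame with invented values; n() counts the rows in each group, and counting non-missing values separately shows how many observations the mean is actually based on.

```r
library(dplyr)

# Hypothetical toy data with one missing body mass
toy <- data.frame(
  Order = c("Carnivora", "Carnivora", "Rodentia", "Rodentia", "Rodentia"),
  AdultBodyMass_g = c(150, 4000, 20, 300, NA)
)

by_order <- summarise(group_by(toy, Order),
  n_species   = n(),                                # rows per group
  n_with_mass = sum(!is.na(AdultBodyMass_g)),       # non-missing values per group
  mean_mass   = mean(AdultBodyMass_g, na.rm = TRUE) # mean of the non-missing values
)

by_order
```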

Piping data

Pipes take the output from one function and feed it to the first argument of the next function.

The magrittr R package provides the pipe function %>%, and dplyr makes it available when you load it. Yes, it might look bizarre at first, but it makes more sense when you think about it. The R language allows symbols wrapped in % to be defined as functions, the > helps imply a chain, and you can hit these 2 characters one after the other very quickly on a keyboard by holding down the Shift key. Try pronouncing %>% “then” whenever you see it.

Let’s try a simple piped dplyr example. We’ll pipe the pantheria data frame to the arrange function’s first argument, choose to arrange by the AdultBodyMass_g column, and then pipe the result to head():

pantheria %>% arrange(AdultBodyMass_g) %>% head()
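
To make the "first argument" rule concrete, the sketch below writes the same operation twice on a hypothetical toy data frame: once as nested function calls and once with pipes. Both forms should give the same result; the piped version simply reads left to right.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  Order = c("Rodentia", "Carnivora", "Rodentia"),
  AdultBodyMass_g = c(300, 4000, 20)
)

# Nested calls: read from the inside out
nested <- head(arrange(toy, AdultBodyMass_g), 2)

# Piped calls: take toy, then arrange, then keep the first two rows
piped <- toy %>% arrange(AdultBodyMass_g) %>% head(2)

identical(nested, piped)
```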

An awesome example

OK, here’s where it gets cool. We can chain dplyr functions in succession. This lets us write data manipulation steps in the order we think of them and avoid creating temporary variables in the middle to capture the output. This works because the output from every dplyr function is a data frame and the first argument of every dplyr function is a data frame.

Say we wanted to find the species with the highest body-mass-to-length ratio:

pantheria %>%
  mutate(mass_to_length = AdultBodyMass_g / AdultHeadBodyLen_mm) %>%
  arrange(desc(mass_to_length)) %>%
  select(scientificNameStd, mass_to_length) %>% head()

So, we took pantheria, fed it to mutate() to create a mass-length ratio column, arranged the resulting data frame in descending order by that ratio, and selected the columns we wanted to see. This is just the beginning. If you can imagine it, you can string it together. If you want to debug your code, just pull a pipe off the end and run the code down to that step. Or build your analysis up and add successive pipes.

Here’s one more example. Let’s ask which taxonomic orders have a median litter size greater than 3:

pantheria %>% group_by(Order) %>%
  summarise(median_litter = median(LitterSize, na.rm = TRUE)) %>%
  filter(median_litter > 3) %>%
  arrange(desc(median_litter)) %>%
  select(Order, median_litter)


RS-eco/traitdata documentation built on Oct. 29, 2022, 7:52 p.m.