knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )

The `purrr`

package has a comprehensive set of tools for working with lists and vectors, and especially for iterating operations and analyses with its `map()`

. This style of iteration is particularly powerful when used with the `tibble`

package, because—unlike regular data frames—tibbles allow you to use lists as data frame columns as well as vectors. That means that complex objects, including models, plots and even other data frames can be stored inside a data frame list-column alongside other data.

The biggest downside of being able to do complex analysis on many list elements is that one little error can bring a lot of computation down. Another is that a warning or other message might get lost: if the 37th statistical model you fit of 45 has a problem, are you going to catch it?

`purrr`

comes with tools for dealing with errors, warnings and other "side effects", but it's difficult to pair them effectively with `purrr`

's massively powerful iteration tools. And that's where `collateral`

comes in: it gives you ways to drop-in replacements for `map()`

that use employ these side-effect capturing tools, as well as an array of other helpers to make navigating everything you've captured more pleasant.

- Basic exploratory analysis
- Grouping the traditional way
- Nesting and handling side effects with purrr
- Collateral: capture, identify and isolate side effects
- Filtering on, and summarising, side effects

The aim of this vignette isn't just to get you acquainted with `collateral`

's tools: it's also to demonstrate the value of a tidy list-column workflow. (If you're already a pro at this stuff, skip ahead to section 4!)

We'll be using the `diamonds`

dataset, which comes with the `ggplot2`

package. Let's take a look:

library(tibble) library(dplyr) library(tidyr) library(purrr) library(ggplot2) diamonds

This dataset describes the prices and properties of about fifty thousand diamonds. We can see a number of categorical variables, including `cut`

, `color`

and `clarity`

, and several continuous variables, like the `price`

.

How is `price`

distributed?

ggplot(diamonds) + geom_histogram(aes(x = price, y = stat(count))) ggplot(diamonds) + geom_histogram(aes(x = price, y = stat(count))) + scale_x_log10()

Looking at a histogram with a logarithmic scale, we can see that it has multiple peaks. That probably means that there are several distinct groups in the data. Maybe there's a relationship between `price`

and one (or more) of the categorical variables.

ggplot(diamonds) + geom_histogram(aes(x = price, y = stat(count))) + facet_wrap(vars(cut), ncol = 1) + scale_x_log10() + ggtitle("Price vs. cut") ggplot(diamonds) + geom_histogram(aes(x = price, y = stat(count))) + facet_wrap(vars(color), ncol = 1) + scale_x_log10() + ggtitle("Price vs. color")

It looks like there might be several relationships here, but let's focus on `cut`

and `color`

for now. `ggplot2`

makes it very easy to visualise differences across a couple of grouping variables, but we may not always want to create a facetted plot. In fact, there may be lots of activities we want to do that don't involve plotting at all.

This is a broad class of problem in data analysis called **split-apply-combine**:

*split*data up into groups,*apply*operations to those groups, and*combine*the results.

One way of doing this with base R is to use `split()`

, which takes in a data frame and some grouping variables and gives you back a list of identically-structured data frames broken up row-wise according to the values of the grouping variables.

diamonds %>% split(diamonds$cut)

We could use a loop to iterate over each of these data frames and perform our analysis, adding the results to a larger data frame or list as we go, but there's a better solution.

As it happens, the `purrr::map()`

functions work quite well on the output of `split()`

:

diamonds %>% split(diamonds$cut) %>% map_dbl(~ mean(.$price))

But this starts to get really unwieldy as you add more grouping variables. In particular, "meta data"— data about the groups—are either stuck inside the original data frames or are encoded in the name of each data frame in the list. It's also difficult to do a sequence of operations and keep everything associated with the original groupings:

diamonds_list <- diamonds %>% split(list(diamonds$cut, diamonds$color)) map_dbl(diamonds_list, ~ mean(.$price)) map_dbl(diamonds_list, ~ cor(.$price, .$depth))

Yuck. Nested data frames, which are supported by tibbles, make this much easier to handle:

nested_diamonds <- diamonds %>% select(cut, color, clarity, depth, price) %>% nest(data = c(clarity, depth, price)) nested_diamonds

What are we looking at here? We still have our grouping variables as columns, but each unique combination only takes up one row—and we have another, `data`

, that says it's a tibble. The column isn't a vector but a list, and if we look at the elements in it we can see the other columns.

Let's see the first two elements of `nested_diamonds$data`

:

nested_diamonds$data[[1]] nested_diamonds$data[[2]]

Not only do we have our group identifiers associated with each data frame now, but we can create other outputs and keep *them* associated with the groups too.

For example, let's extract some summary statistics from each group:

nested_diamonds %>% mutate( mean_price = map_dbl(data, ~ mean(.$price)), pricedepth_cor = map_dbl(data, ~ cor(.$price, .$depth)))

We can even do complex things like build regression models on the groups. (Note that we use the regular `map`

when we're returning a complex object or data frame for each group, where as we use the typed variants, like `map_dbl`

or `map_chr`

, when we're returning atomic elements like numbers and strings.)

diamonds_models <- nested_diamonds %>% mutate( price_mod = map(data, ~ lm(.$price ~ .$depth)), price_summary = map(price_mod, summary), price_rsq = map_dbl(price_summary, "r.squared")) diamonds_models

(These are pretty lousy models, but they'll do for our purposes!)

We can carry on like this, adding analyses to the groups, but it's also a pretty reckless way to operate. There could be anything going on in these list columns, and without being able to see them from the outside it'd be easy to miss a problem that turns up. In addition, we might hit an error and be unsure which group is causing it.

For example, if we remove all of the rows in one of the `data`

groups, we'll get this error:

# sabotage a group by removing all its rows nested_diamonds$data[[5]] <- nested_diamonds$data[[5]] %>% filter(price < 300) # now attempt to calculate summary statistics diamonds_models = nested_diamonds %>% mutate( price_mod = map(data, ~ lm(.$price ~ .$depth)), price_summary = map(price_mod, summary), price_rsq = map_dbl(price_summary, "r.squared")) diamonds_models #> Error in mutate_impl(.data, dots) : Evaluation error: 0 (non-NA) cases.

We can see that there's *a* problem here, but we have no idea which group is causing it without inspecting them (at least, we wouldn't if we hadn't just set this up!).

`purrr`

tries to tackle this problem with two functions: `safely()`

and `quietly()`

. The former catches errors; the latter catches warnings, messages and other output. You use them by wrapping other functions with them, like this:

safe_lm <- safely(lm) purrr_models <- nested_diamonds %>% mutate(price_mod = map(data, ~ safe_lm(.$price ~ .$depth))) purrr_models

This is great! We bulldozed our way through the error, so we still have the other 34 models! In lieu of a list of model objects, `price_mod`

is now a list of *lists*: each element of the column is a list with two elements: `result`

and `error`

(`quietly()`

returns four components).

Let's see what our first two component lists, corresponding to the first two groups, look like:

purrr_models$price_mod[[1]] purrr_models$price_mod[[5]] purrr_models %>% mutate(mod_result = map(price_mod, "result"))

We can now extract the `result`

element using `map()`

, and there's a `NULL`

value for the group that failed. But there are still some big problems here:

- We have to remember to wrap each function in
`safely()`

or`quietly()`

ahead of time; and - We have to check each element of the list column, or carefully extract the results, to locate the problem.

The idea of `collateral`

is to both make these functions easier to use and to make them more powerful. Instead of wrapping our functions ourselves, we use `map()`

variants:

library(collateral) nested_diamonds$data[[5]] <- nested_diamonds$data[[5]] %>% filter(price < 300) collat_models <- nested_diamonds %>% mutate(price_mod = map_peacefully(data, ~ lm(.x$price ~ .x$depth))) print(collat_models)

The `price_mod`

column is still a list of lists containing `result`

and `error`

components, but it now prints nicely when we view the entire tibble. Scanning down it, we can see:

- which group hit an error immediately,
- that all rows but the fifth result a result (labelled "R"), and
- that the fifth row threw an error (labelled "E").

You can, of course, pull out the results you would've gotten from a regular map (assuming you didn't hit any errors), and you'll probably *want* to do that in order to continue operating on those results. You can also extract other side effects, although keep in mind that:

- The regular typed
`map`

variants, like`map_chr`

, are excellent for extracting side effects quickly. - Unlike other kinds of output, errors aren't just simple character vectors. Generally, you'll want the
`error$message`

. - Sometimes an operation might deliver more than one warning, message or output per row. That means that you may need to concatenate several warnings together---using the
`paste`

function with the`collapse`

option, for example.

collat_models %>% mutate( # this returns a list of `lm` objects mod_result = map(price_mod, "result"), # this returns a character vector mod_error = map_chr(price_mod, c("error", "message"), .null = NA))

`collateral`

also comes with some additional helpers. You can get a quick `summary`

of the side effects (which also invisibly returns a named vector for non-interactive use):

summary(collat_models$price_mod)

You can also use the `tally_*()`

functions in conjunction with `dplyr::summarise()`

to get a count of each component (even if the top-level data frame grouped again!) or `has_*()`

to `dplyr::filter()`

out the rows that didn't make it (or the ones that did):

collat_models %>% group_by(color) %>% summarise( n_res = tally_results(price_mod), n_err = tally_errors(price_mod)) collat_models %>% filter(has_errors(price_mod)) collat_models %>% filter(!has_results(price_mod))

Together, these tools make debugging problems drastically easier, even if you need to run a thousand statistical models or analyse 500 datasets.

You might've noticed that our `map`

variant was called `map_peacefully`

, not `map_safely`

or `map_quietly`

. Those functions are available too, but `map_peacefully`

does both; giving you returned results, errors, warnings, messages and other output.

Why combine the two? We find that, more often that not operations can return either errors, warnings or *both*. For example, `log`

outputs a warning when it's given negative numbers but throws an error if its input is non-numeric. Regression functions are particularly prone to this.

You're welcome to use `map_safely`

or `map_quietly`

if you prefer, but for nearly all use cases, `map_peacefully`

is the easier option. Good luck!

**Any scripts or data that you put into this service are public.**

Embedding an R snippet on your website

Add the following code to your website.

For more information on customizing the embed code, read Embedding Snippets.