knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "man/figures/README-", out.width = "100%", warning = FALSE, message = FALSE )
The goal of pwiser is to make applying arbitrary functions across combinations of columns within {dplyr} easy. Currently, the only function is pairwise(), which applies a function to all pairs of columns.
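To make "all pairs of columns" concrete, here is a minimal base R sketch that does not use pwiser at all. It enumerates column pairs with `utils::combn()` and applies a function to each one. The helper name `apply_to_pairs()` and the `x_y` result names are assumptions made for illustration; they are not pwiser's implementation or its default naming.

```r
# Hypothetical helper (not part of pwiser): apply `fn` to every unordered
# pair of columns in `df` and return the results as a named list.
apply_to_pairs <- function(df, fn) {
  # Each element of `pairs` is a character vector of two column names.
  pairs <- utils::combn(names(df), 2, simplify = FALSE)
  out <- lapply(pairs, function(p) fn(df[[p[1]]], df[[p[2]]]))
  names(out) <- vapply(pairs, paste, character(1), collapse = "_")
  out
}

df <- data.frame(a = c(1, 2, 3), b = c(2, 4, 6), c = c(3, 2, 1))

# Element-wise ratio for each of the three column pairs.
ratios <- apply_to_pairs(df, `/`)
names(ratios)
```

With three columns there are three unordered pairs, so `ratios` has three elements.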
pairwise() is an altered version of dplyr::across() and, similarly, is meant to be used within the mutate() / transmute() and summarise() verbs. pwiser sprang from an RStudio Community thread and related conversations.
summarise()
library(dplyr)
library(pwiser)
library(palmerpenguins)

penguins <- na.omit(penguins)
pairwise()
respects grouped dataframes:
# When using `pairwise()` within `summarise()`, the function(s) applied should
# have an output length of 1 (for each group). (You could wrap the output in
# `list()` to make a list column instead.)
cor_p_value <- function(x, y) {
  stats::cor.test(x, y)$p.value
}

penguins %>%
  group_by(species) %>%
  summarise(pairwise(contains("_mm"), cor_p_value, .is_commutative = TRUE),
            n = n())
Setting .is_commutative = TRUE
can save time on redundant calculations.
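As a rough illustration of where the savings come from (a base R sketch, not pwiser's internals): a commutative function returns the same value for (x, y) and (y, x), so only unordered pairs need to be evaluated. The example below uses the built-in `mtcars` data and an arbitrary column count of three; both are assumptions chosen for illustration.

```r
# A commutative function gives the same answer regardless of argument order.
cor_p_value <- function(x, y) stats::cor.test(x, y)$p.value

p_xy <- cor_p_value(mtcars$mpg, mtcars$wt)
p_yx <- cor_p_value(mtcars$wt, mtcars$mpg)
isTRUE(all.equal(p_xy, p_yx))

# Number of function calls for, e.g., three columns:
n_cols <- 3
n_unordered <- choose(n_cols, 2)      # order ignored: 3 calls
n_ordered   <- n_cols * (n_cols - 1)  # order matters: 6 calls
```

For a function such as `/` or `-`, the two orders give different results, so the ordered pairs are not redundant.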
Equivalently, you could have written this with .x and .y in a lambda function:
penguins %>%
  group_by(species) %>%
  summarise(pairwise(contains("_mm"),
                     ~stats::cor.test(.x, .y)$p.value,
                     .is_commutative = TRUE),
            n = n())
mutate()
You can apply multiple functions via a named list:
penguins %>%
  mutate(pairwise(contains("_mm"),
                  list(ratio = `/`, difference = `-`),
                  .names = "features_{.fn}_{.col_x}_{.col_y}")) %>%
  glimpse()
Use .names to customize the output column names.
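To show how a template such as "features_{.fn}_{.col_x}_{.col_y}" could expand into a column name, here is a small base R sketch. `fill_template()` is a hypothetical helper written for this example; it only mimics glue-style placeholder substitution and is not how pwiser builds names internally.

```r
# Hypothetical helper (not pwiser's code): replace each "{key}" placeholder
# in `template` with the matching entry of `values`.
fill_template <- function(template, values) {
  for (key in names(values)) {
    template <- gsub(paste0("{", key, "}"), values[[key]], template, fixed = TRUE)
  }
  template
}

col_name <- fill_template(
  "features_{.fn}_{.col_x}_{.col_y}",
  list(.fn = "ratio", .col_x = "bill_length_mm", .col_y = "bill_depth_mm")
)
col_name
# "features_ratio_bill_length_mm_bill_depth_mm"
```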
Install from GitHub with:
# install.packages("devtools")
devtools::install_github("brshallo/pwiser")
There are other tools in R for doing tidy pairwise operations. widyr (by David Robinson) and corrr (in the tidymodels suite) offer solutions (primarily) for summarising contexts (corrr::colpair_map() is the closest comparison as it also supports arbitrary functions). recipes::step_ratio() and recipes::step_interact() can be used for making pairwise products or ratios in mutating contexts. (See Appendix section of prior blog post on Tidy Pairwise Operations for a few cataloged tweets on these approaches.)
The novelty of pwiser::pairwise() is its integration in both mutating and summarising verbs in {dplyr}.
dplyover
The dplyover package offers a wide range of extensions on across() for iteration problems (and was identified after sharing the initial version of {pwiser}). dplyover::across2x() can be used to do the same things as pwiser::pairwise(). As {dplyover} continues to mature its interface and improve its performance, we may eventually mark {pwiser} as superseded.
For problems with lots of data you should use more efficient approaches.
Matrix operations (compared to dataframes) are much more computationally efficient for problems involving combinations (which can get big very quickly). We've done nothing to optimize the computation of functions run through pwiser.
For example, when calculating Pearson correlations, pairwise() calculates the correlation separately for each pair, whereas stats::cor() (or corrr::correlate(), which calls cor() under the hood) uses R's matrix operations to calculate all correlations simultaneously.
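The difference between the two strategies can be sketched in base R. The example below uses four columns of the built-in `mtcars` data (an arbitrary choice for illustration) and checks that a single `stats::cor()` call on a matrix gives the same values as computing each pair separately; it does not measure speed or use pwiser.

```r
m <- as.matrix(mtcars[, c("mpg", "wt", "hp", "qsec")])

# Matrix approach: one call returns every pairwise correlation.
cor_mat <- stats::cor(m)

# Pair-by-pair approach: one call to cor() per unordered column pair.
pairs <- utils::combn(colnames(m), 2, simplify = FALSE)
cor_pairs <- vapply(
  pairs,
  function(p) stats::cor(m[, p[1]], m[, p[2]]),
  numeric(1)
)

# Look up the same pairs in the matrix result and compare.
from_matrix <- vapply(pairs, function(p) cor_mat[p[1], p[2]], numeric(1))
all.equal(cor_pairs, from_matrix)
```

Both routes give the same numbers; the matrix route simply replaces `choose(4, 2) = 6` separate calls with one.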
library(modeldata)

data(cells)
cells_numeric <- select(cells, where(is.numeric))

dim(cells_numeric)
Let's do a speed test using the 56 numeric columns from the cells dataset (which means choose(56, 2) = 1540 pairwise combinations or 56*55 = 3080 permutations) imported from {modeldata}.
library(corrr)
if (!requireNamespace("dplyover")) devtools::install_github('TimTeaFan/dplyover')
library(dplyover)

set.seed(123)

microbenchmark::microbenchmark(
  cor = cor(cells_numeric),
  correlate = correlate(cells_numeric),
  colpair_map = colpair_map(cells_numeric, cor),
  pairwise = summarise(cells_numeric, pairwise(where(is.numeric), cor, .is_commutative = TRUE)),
  dplyover = summarise(cells_numeric, across2x(where(is.numeric), where(is.numeric), cor, .comb = "minimal")),
  times = 10L,
  unit = "ms"
)
The stats::cor() and corrr::correlate() approaches are many times faster than using pairwise(). However, pairwise() still only takes about one fifth of a second to calculate 1540 correlations in this case. Hence, on relatively constrained problems, pairwise() is still quite usable. (Though there are many cases where you should go for a matrix-based solution.)
pairwise() seems to be faster than corrr::colpair_map() (a more apples-to-apples comparison as both can handle arbitrary functions), though much of this speed difference goes away when .is_commutative = FALSE.
pairwise() (at the moment) seems to also be faster than running the equivalent operation with dplyover::across2x().
Session info
sessionInfo()
See issue #1 for notes on limitations in current set-up.