knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
The goal of this Article is to compare the performances of lay()
to alternatives described here.
As you will see, the code using lay()
is quite efficient.
The only alternative that is clearly more efficient is the one labeled below "vectorized".
Unfortunately, such a vectorized approach imply to refer explicitly to all column names which data are used.
Furthermore, such a vectorized approach is not applicable generally, as it can only deal with logical
and arithmetic operators and does allow the use of other types of functions.
This Article requires you to load the following packages:
library(lay) ## for lay() and the data library(dplyr) ## for many things library(tidyr) ## for pivot_longer() and pivot_wider() library(purrr) ## for pmap_lgl() library(slider) ## for slide() library(data.table) ## for an alternative to base and dplyr library(bench) ## for running the benchmarks library(forcats) ## for sorting levels in plot with fct_reorder()
Please install them if they are not present on your system.
Consider the dataset drugs_full
from our package {lay}:
drugs_full
In this dataset, all columns but caseid
record the use of pain relievers for non medical purpose.
For each drug there is a certain number of users and non-users:
drugs_full |> pivot_longer(-caseid, names_to = "drug", values_to = "used") |> count(drug, used) |> mutate(used = if_else(used == 1, "have_used", "have_not_used")) |> pivot_wider(names_from = used, values_from = n)
In this Article, we compare the efficiency of alternative ways to create a new variable named everused
which indicates if each respondent has used any of the considered pain relievers for non medical purpose or not.
We will run benchmarks on the dataset drugs_full
and its r nrow(drugs_full)
rows, as well as on a subset of this data called drugs
that only contains r nrow(drugs)
rows.
r nrow(drugs_full)
rows)Let's compare the running time of different methods to do this job on the full dataset:
drugs_full_dt <- data.table(drugs_full) ## coercion to data.table benchmark1 <- mark( vectorized = { drugs_full |> mutate(everused = codeine | hydrocd | methdon | morphin | oxycodp | tramadl | vicolor)}, lay = { drugs_full |> select(-caseid) |> mutate(everused = lay(pick(everything()), any))}, lay_alternative = { drugs_full |> mutate(everused = lay(pick(-caseid), any, .method = "tidy"))}, c_across = { drugs_full |> rowwise() |> mutate(everused = any(c_across(-caseid))) |> ungroup()}, pivot_pivot = { drugs_full |> pivot_longer(-caseid) |> group_by(caseid) |> mutate(everused = any(value)) |> ungroup() |> pivot_wider()}, pmap = { drugs_full |> mutate(everused = pmap_lgl(pick(-caseid), ~ any(...)))}, slider = { drugs_full |> mutate(everused = slide_vec(pick(-caseid), any))}, data.table = { drugs_full_dt[, ..I := .I] drugs_full_dt[, everused := any(.SD), by = ..I, .SDcols = -"caseid"]}, apply = { drugs_full |> mutate(everused = apply(pick(-caseid), 1, any))}, 'for' = { everused <- logical(nrow(drugs_full)) columns_in <- colnames(drugs_full) != "caseid" for (i in seq_len(nrow(drugs_full))) everused[i] <- any(drugs_full[i, columns_in])}, iterations = 5, time_unit = "ms", check = FALSE )
Here are the results of this first series of benchmarks:
benchmark1 |> arrange(median)
benchmark1 |> mutate(expression = fct_reorder(as.character(expression), median, .desc = TRUE)) |> plot()
Note that the x-axis of the plot is on a logarithmic scale.
r nrow(drugs)
rows)Let's repeat our benchmarks using a only a subset of the original dataset:
drugs_dt <- data.table(drugs) ## coercion to data.table benchmark2 <- mark( vectorized = { drugs |> mutate(everused = codeine | hydrocd | methdon | morphin | oxycodp | tramadl | vicolor)}, lay = { drugs |> select(-caseid) |> mutate(everused = lay(pick(everything()), any))}, lay_alternative = { drugs |> mutate(everused = lay(pick(-caseid), any, .method = "tidy"))}, c_across = { drugs |> rowwise() |> mutate(everused = any(c_across(-caseid))) |> ungroup()}, pivot_pivot = { drugs |> pivot_longer(-caseid) |> group_by(caseid) |> mutate(everused = any(value)) |> ungroup() |> pivot_wider()}, pmap = { drugs |> mutate(everused = pmap_lgl(pick(-caseid), ~ any(...)))}, slider = { drugs |> mutate(everused = slide_vec(pick(-caseid), any))}, data.table = { drugs_dt[, ..I := .I] drugs_dt[, everused := any(.SD), by = ..I, .SDcols = -"caseid"]}, apply = { drugs |> mutate(everused = apply(pick(-caseid), 1, any))}, 'for' = { everused <- logical(nrow(drugs)) columns_in <- colnames(drugs) != "caseid" for (i in seq_len(nrow(drugs))) everused[i] <- any(drugs[i, columns_in])}, iterations = 30, time_unit = "ms", check = FALSE )
Here are the results of this second series of benchmarks:
benchmark2 |> arrange(median)
benchmark2 |> mutate(expression = fct_reorder(as.character(expression), median, .desc = TRUE)) |> plot(type = "violin")
Note again that the x-axis of the plot is on a logarithmic scale.
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.