knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Article overview

The goal of this Article is to compare the performances of lay() to alternatives described here. As you will see, the code using lay() is quite efficient. The only alternative that is clearly more efficient is the one labeled below "vectorized". Unfortunately, such a vectorized approach imply to refer explicitly to all column names which data are used. Furthermore, such a vectorized approach is not applicable generally, as it can only deal with logical and arithmetic operators and does allow the use of other types of functions.

Loading packages

This Article requires you to load the following packages:

library(lay)        ## for lay() and the data
library(dplyr)      ## for many things
library(tidyr)      ## for pivot_longer() and pivot_wider()
library(purrr)      ## for pmap_lgl()
library(slider)     ## for slide()
library(data.table) ## for an alternative to base and dplyr
library(bench)      ## for running the benchmarks
library(forcats)    ## for sorting levels in plot with fct_reorder()

Please install them if they are not present on your system.

An example of a rowwise task

Consider the dataset drugs_full from our package {lay}:

drugs_full

In this dataset, all columns but caseid record the use of pain relievers for non medical purpose.

For each drug there is a certain number of users and non-users:

drugs_full |>
  pivot_longer(-caseid, names_to = "drug", values_to = "used") |>
  count(drug, used) |>
  mutate(used = if_else(used == 1, "have_used", "have_not_used")) |>
  pivot_wider(names_from = used, values_from = n)

In this Article, we compare the efficiency of alternative ways to create a new variable named everused which indicates if each respondent has used any of the considered pain relievers for non medical purpose or not.

We will run benchmarks on the dataset drugs_full and its r nrow(drugs_full) rows, as well as on a subset of this data called drugs that only contains r nrow(drugs) rows.

Benchmarks on the full dataset (r nrow(drugs_full) rows)

Let's compare the running time of different methods to do this job on the full dataset:

drugs_full_dt <- data.table(drugs_full) ## coercion to data.table

benchmark1 <- mark(
  vectorized = {
    drugs_full |>
      mutate(everused = codeine | hydrocd | methdon | morphin | oxycodp | tramadl | vicolor)},
  lay = {
    drugs_full |>
      select(-caseid) |>
      mutate(everused = lay(pick(everything()), any))},
  lay_alternative = {
    drugs_full |>
      mutate(everused = lay(pick(-caseid), any, .method = "tidy"))},
  c_across = {
    drugs_full |>
      rowwise() |>
      mutate(everused = any(c_across(-caseid))) |>
      ungroup()},
  pivot_pivot = {
    drugs_full |>
      pivot_longer(-caseid) |>
      group_by(caseid) |>
      mutate(everused = any(value)) |>
      ungroup() |>
      pivot_wider()},
  pmap = {
    drugs_full |>
      mutate(everused = pmap_lgl(pick(-caseid), ~ any(...)))},
  slider = {
    drugs_full |>
      mutate(everused = slide_vec(pick(-caseid), any))},
  data.table = {
    drugs_full_dt[, ..I := .I]
    drugs_full_dt[, everused := any(.SD), by = ..I, .SDcols = -"caseid"]},
  apply = {
    drugs_full |>
      mutate(everused = apply(pick(-caseid), 1, any))},
  'for' = {
    everused <- logical(nrow(drugs_full))
    columns_in <- colnames(drugs_full) != "caseid"
    for (i in seq_len(nrow(drugs_full))) everused[i] <- any(drugs_full[i, columns_in])},
  iterations = 5,
  time_unit = "ms",
  check = FALSE
  )

Here are the results of this first series of benchmarks:

benchmark1 |>
  arrange(median)
benchmark1 |>
  mutate(expression = fct_reorder(as.character(expression), median, .desc = TRUE)) |>
  plot()

Note that the x-axis of the plot is on a logarithmic scale.

Benchmarks on a subset of the data (r nrow(drugs) rows)

Let's repeat our benchmarks using a only a subset of the original dataset:

drugs_dt <- data.table(drugs) ## coercion to data.table

benchmark2 <- mark(
  vectorized = {
    drugs |>
      mutate(everused = codeine | hydrocd | methdon | morphin | oxycodp | tramadl | vicolor)},
  lay = {
    drugs |>
      select(-caseid) |>
      mutate(everused = lay(pick(everything()), any))},
  lay_alternative = {
    drugs |>
      mutate(everused = lay(pick(-caseid), any, .method = "tidy"))},
  c_across = {
    drugs |>
      rowwise() |>
      mutate(everused = any(c_across(-caseid))) |>
      ungroup()},
  pivot_pivot = {
    drugs |>
      pivot_longer(-caseid) |>
      group_by(caseid) |>
      mutate(everused = any(value)) |>
      ungroup() |>
      pivot_wider()},
  pmap = {
    drugs |>
      mutate(everused = pmap_lgl(pick(-caseid), ~ any(...)))},
  slider = {
    drugs |>
      mutate(everused = slide_vec(pick(-caseid), any))},
  data.table = {
    drugs_dt[, ..I := .I]
    drugs_dt[, everused := any(.SD), by = ..I, .SDcols = -"caseid"]},
  apply = {
    drugs |>
      mutate(everused = apply(pick(-caseid), 1, any))},
  'for' = {
    everused <- logical(nrow(drugs))
    columns_in <- colnames(drugs) != "caseid"
    for (i in seq_len(nrow(drugs))) everused[i] <- any(drugs[i, columns_in])},
  iterations = 30,
  time_unit = "ms",
  check = FALSE
  )

Here are the results of this second series of benchmarks:

benchmark2 |>
  arrange(median)
benchmark2 |>
  mutate(expression = fct_reorder(as.character(expression), median, .desc = TRUE)) |>
  plot(type = "violin")

Note again that the x-axis of the plot is on a logarithmic scale.

Benchmarks' environment

sessionInfo()


romainfrancois/lay documentation built on Nov. 4, 2023, 6:20 p.m.