over: Apply functions to a list or vector in 'dplyr'


View source: R/over.R

Description

over() makes it easy to create new columns inside a dplyr::mutate() or dplyr::summarise() call by applying a function (or a set of functions) to an atomic vector or list, using a syntax similar to dplyr::across(). The main difference is that dplyr::across() transforms or creates new columns based on existing ones, while over() creates new columns based on a vector or list to which it applies one or several functions. Whereas dplyr::across() allows tidy-selection helpers to select columns, over() provides its own helper functions to select strings or values based on either (1) values of specified columns or (2) column names. See the examples below and vignette("why_dplyover") for more details.

Usage

over(.x, .fns, ..., .names = NULL, .names_fn = NULL)

Arguments

.x

An atomic vector or list to apply functions to. Alternatively a <selection helper> can be used to create a vector.

.fns

Functions to apply to each of the elements in .x. For functions that expect variable names as input, the selected strings need to be turned into symbols and evaluated. dplyover comes with a dedicated helper function, .(), that evaluates strings as names.

Possible values are:

  • A function

  • A purrr-style lambda

  • A list of functions/lambdas

For examples see the example section below.

Note that, unlike across(), over() does not accept NULL as a value to .fns.

...

Additional arguments for the function calls in .fns.

.names

A glue specification that describes how to name the output columns. This can use {x} to stand for the selected vector element, and {fn} to stand for the name of the function being applied. The default (NULL) is equivalent to "{x}" for the single function case and "{x}_{fn}" for the case where a list is used for .fns.

Note that, depending on the nature of the underlying object in .x, specifying {x} will yield different results:

  • If .x is an unnamed atomic vector, {x} will represent each value.

  • If .x is a named list or atomic vector, {x} will represent each name.

  • If .x is an unnamed list, {x} will be the index number running from 1 to length(.x).

This default behavior (the interpretation of {x}) can be overridden by directly specifying:

  • {x_val} for .x's values

  • {x_nm} for its names

  • {x_idx} for its index numbers

Alternatively, a character vector of length equal to the number of columns to be created can be supplied to .names. Note that in this case, the glue specification described above is not supported.
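
The naming options above can be sketched as follows. This is a minimal illustration, assuming dplyr and dplyover are installed; the tibble df, the vector mult, and the output names are made up for this example.

```r
library(dplyr)
library(dplyover)

# Hypothetical toy data
df <- tibble(x = 1:5)

# Named atomic vector: by default, {x} would resolve to the names "a" and "b"
mult <- c(a = 1, b = 2)

# Glue specification that requests values, names and index numbers explicitly
res_glue <- df %>%
  mutate(over(mult,
              ~ x * .x,
              .names = "times{x_val}_{x_nm}_{x_idx}"))

# Character vector with one name per output column (no glue syntax here)
res_chr <- df %>%
  mutate(over(mult,
              ~ x * .x,
              .names = c("one", "two")))

names(res_glue)
names(res_chr)
```

Comparing names(res_glue) with names(res_chr) shows how each placeholder contributes to the final column names.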

.names_fn

Optionally, a function that is applied after the glue specification in .names has been evaluated. This is, for example, helpful in case the resulting names need to be further cleaned or trimmed.

Value

A tibble with one column for each element in .x and each function in .fns.

Note

Similar to dplyr::across(), over() works only inside dplyr verbs.

Examples

over() has two main use cases. They differ in how the elements in .x are used. Let's first attach dplyr and dplyover:

library(dplyr)
library(dplyover)

# For better printing
iris <- as_tibble(iris)

(1) The General Use Case

Here the values in .x are used as inputs to one or more functions in .fns. This is useful when we want to create several new variables based on the same function with varying arguments. A good example is creating a set of lagged variables.

tibble(x = 1:25) %>%
  mutate(over(c(1:3),
              ~ lag(x, .x)))
#> # A tibble: 25 x 4
#>       x   `1`   `2`   `3`
#>   <int> <int> <int> <int>
#> 1     1    NA    NA    NA
#> 2     2     1    NA    NA
#> 3     3     2     1    NA
#> 4     4     3     2     1
#> # ... with 21 more rows

Let's create a dummy variable for each unique value in 'Species':

iris %>%
  mutate(over(unique(Species),
              ~ if_else(Species == .x, 1, 0)),
         .keep = "none")
#> # A tibble: 150 x 3
#>   setosa versicolor virginica
#>    <dbl>      <dbl>     <dbl>
#> 1      1          0         0
#> 2      1          0         0
#> 3      1          0         0
#> 4      1          0         0
#> # ... with 146 more rows

With over() it is also possible to create several dummy variables with different thresholds. We can use the .names argument to control the output names:

iris %>%
  mutate(over(seq(4, 7, by = 1),
              ~ if_else(Sepal.Length < .x, 1, 0),
              .names = "Sepal.Length_{x}"),
         .keep = "none")
#> # A tibble: 150 x 4
#>   Sepal.Length_4 Sepal.Length_5 Sepal.Length_6 Sepal.Length_7
#>            <dbl>          <dbl>          <dbl>          <dbl>
#> 1              0              0              1              1
#> 2              0              1              1              1
#> 3              0              1              1              1
#> 4              0              1              1              1
#> # ... with 146 more rows

A similar approach can be used with dates. Below we loop over a date sequence to check whether each date falls between a given start and end date. We can use the .names_fn argument to clean the resulting output names:

# some dates
dat_tbl <- tibble(start = seq.Date(as.Date("2020-01-01"),
                                   as.Date("2020-01-15"),
                                   by = "days"),
                  end = start + 10)

dat_tbl %>%
  mutate(over(seq(as.Date("2020-01-01"),
                  as.Date("2020-01-21"),
                  by = "weeks"),
              ~ .x >= start & .x <= end,
              .names = "day_{x}",
              .names_fn = ~ gsub("-", "", .x)))
#> # A tibble: 15 x 5
#>    start      end        day_20200101 day_20200108 day_20200115
#>    <date>     <date>     <lgl>        <lgl>        <lgl>       
#>  1 2020-01-01 2020-01-11 TRUE         TRUE         FALSE       
#>  2 2020-01-02 2020-01-12 FALSE        TRUE         FALSE       
#>  3 2020-01-03 2020-01-13 FALSE        TRUE         FALSE       
#>  4 2020-01-04 2020-01-14 FALSE        TRUE         FALSE       
#>  5 2020-01-05 2020-01-15 FALSE        TRUE         TRUE        
#>  6 2020-01-06 2020-01-16 FALSE        TRUE         TRUE        
#>  7 2020-01-07 2020-01-17 FALSE        TRUE         TRUE        
#>  8 2020-01-08 2020-01-18 FALSE        TRUE         TRUE        
#>  9 2020-01-09 2020-01-19 FALSE        FALSE        TRUE        
#> 10 2020-01-10 2020-01-20 FALSE        FALSE        TRUE        
#> 11 2020-01-11 2020-01-21 FALSE        FALSE        TRUE        
#> 12 2020-01-12 2020-01-22 FALSE        FALSE        TRUE        
#> 13 2020-01-13 2020-01-23 FALSE        FALSE        TRUE        
#> 14 2020-01-14 2020-01-24 FALSE        FALSE        TRUE        
#> 15 2020-01-15 2020-01-25 FALSE        FALSE        TRUE

over() can summarise data in wide format. In the example below, we want to know, for each group of customers (new, existing, reactivate), what percentage of the respondents gave which rating on a five-point Likert scale (item1). A usual approach in the tidyverse would be to use count %>% group_by %>% mutate, which yields the same result in the usually preferred long format. Sometimes, however, we might want this kind of summary in wide format, and in this case over() comes in handy:

csatraw %>%
  group_by(type) %>%
  summarise(over(c(1:5),
                 ~ mean(item1 == .x)))
#> # A tibble: 3 x 6
#>   type          `1`   `2`   `3`   `4`    `5`
#>   <chr>       <dbl> <dbl> <dbl> <dbl>  <dbl>
#> 1 existing   0.156  0.234 0.234 0.266 0.109 
#> 2 new        0.0714 0.268 0.357 0.214 0.0893
#> 3 reactivate 0.0667 0.267 0.133 0.4   0.133
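
For comparison, the long-format approach mentioned above (count %>% group_by %>% mutate) could look roughly like this. It is a sketch on a small made-up tibble: toy_csat and its values are hypothetical stand-ins, not the csatraw data.

```r
library(dplyr)

# Hypothetical toy data standing in for csatraw (values are made up)
toy_csat <- tibble(
  type  = c("new", "new", "existing", "existing", "existing"),
  item1 = c(1, 2, 2, 2, 5)
)

# One row per type/rating combination, with the share of each rating within its type
long_tbl <- toy_csat %>%
  count(type, item1) %>%
  group_by(type) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

long_tbl
```

Note that the long format only lists ratings that actually occur within a group, whereas the over() call returns a column for every value in .x.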

Instead of a vector we can provide a named list of vectors to calculate the top two and bottom two categories on the fly:

csatraw %>%
  group_by(type) %>%
  summarise(over(list(bot2 = c(1:2),
                      mid  = 3,
                      top2 = c(4:5)),
                 ~ mean(item1 %in% .x)))
#> # A tibble: 3 x 4
#>   type        bot2   mid  top2
#>   <chr>      <dbl> <dbl> <dbl>
#> 1 existing   0.391 0.234 0.375
#> 2 new        0.339 0.357 0.304
#> 3 reactivate 0.333 0.133 0.533

over() can also loop over columns of a data.frame. In the example below we want to create four different dummy variables of item1: (i) the top and (ii) bottom category as well as (iii) the top two and (iv) the bottom two categories. We can create a lookup data.frame and use all columns but the first as input to over(). In the function call we make use of base R's match(), where .x represents the new values and recode_df[, 1] refers to the old values.

recode_df <- data.frame(old  = c(1, 2, 3, 4, 5),
                        top1 = c(0, 0, 0, 0, 1),
                        top2 = c(0, 0, 0, 1, 1),
                        bot1 = c(1, 0, 0, 0, 0),
                        bot2 = c(1, 1, 0, 0, 0))

csatraw %>%
  mutate(over(recode_df[,-1],
              ~ .x[match(item1, recode_df[, 1])],
              .names = "item1_{x}")) %>%
  select(starts_with("item1"))
#> # A tibble: 150 x 6
#>   item1 item1_open item1_top1 item1_top2 item1_bot1 item1_bot2
#>   <dbl> <chr>           <dbl>      <dbl>      <dbl>      <dbl>
#> 1     3 12                  0          0          0          0
#> 2     2 22                  0          0          0          1
#> 3     2 21, 22, 23          0          0          0          1
#> 4     4 12, 13, 11          0          1          0          0
#> # ... with 146 more rows
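
The match()-based lookup used inside the function call can be traced step by step in base R. The sketch below uses only two columns of the lookup table and a short made-up vector of ratings (item1 here is hypothetical, not the csatraw column).

```r
# Two columns of the lookup table from the example above
recode_df <- data.frame(old  = c(1, 2, 3, 4, 5),
                        top2 = c(0, 0, 0, 1, 1))

# Hypothetical ratings standing in for csatraw$item1
item1 <- c(3, 5, 4, 1)

# match() returns, for each rating, its row position in the `old` column
pos <- match(item1, recode_df[, 1])

# Indexing a lookup column by those positions yields the recoded values
item1_top2 <- recode_df$top2[pos]

pos
item1_top2
```

Inside over(), .x takes the place of recode_df$top2, once for each column of recode_df[, -1].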

over() works nicely with comma-separated values stored in character vectors. In the example below, the column csat_open contains one or more comma-separated reasons why a specific customer satisfaction rating was given. We can easily create a column for each response category with the help of dist_values(), a wrapper around unique() that can split vector elements using a separator:

csat %>%
  mutate(over(dist_values(csat_open, .sep = ", "),
              ~ as.integer(grepl(.x, csat_open)),
              .names = "rsp_{x}",
              .names_fn = ~ gsub("\\s", "_", .x)),
         .keep = "none") %>%
  glimpse()
#> Rows: 150
#> Columns: 6
#> $ rsp_friendly_staff <int> 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0,~
#> $ rsp_good_service   <int> 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0,~
#> $ rsp_great_product  <int> 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0,~
#> $ rsp_no_response    <int> 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1,~
#> $ rsp_too_expensive  <int> 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0,~
#> $ rsp_unfriendly     <int> 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1,~
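
To make the splitting step more concrete, here is a rough base R approximation of what the text describes for dist_values(): split on the separator, then keep the distinct values. This is an assumption-laden sketch, not the package's actual implementation, and the csat_open vector below is made up.

```r
# Hypothetical open-ended responses (not the csat data)
csat_open <- c("good service",
               "too expensive, unfriendly",
               "good service, friendly staff",
               NA)

# Split each element on ", " and keep the sorted distinct values
# (sort() drops the NA by default)
vals <- sort(unique(unlist(strsplit(csat_open, ", ", fixed = TRUE))))

# One dummy vector per distinct value, mirroring the grepl() call above
dummies <- lapply(vals, function(v) as.integer(grepl(v, csat_open, fixed = TRUE)))
names(dummies) <- gsub("\\s", "_", vals)

vals
dummies$unfriendly
```

Because grepl() matches substrings, a category that is contained in another one (for example a hypothetical "friendly" next to "unfriendly") would be flagged for both, so the category labels should be distinct enough for this approach.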

(2) A Very Specific Use Case

Here strings are supplied to .x to construct column names (sharing the same stem). This allows us to dynamically use more than one column in the function calls in .fns. To work properly, the strings need to be turned into symbols and evaluated. For this, dplyover provides a dedicated helper function, .(), which evaluates strings and helps to declutter the otherwise rather verbose code. .() supports glue syntax and takes a string as its argument.

Below are a few examples using two columns in the function calls in .fns. For the two-column case, across2() provides a more intuitive API that is closer to the original dplyr::across(). Using .() inside over() is really useful for cases with more than two columns.

Consider the following example of a purrr-style formula in .fns using .():

iris %>%
  mutate(over(c("Sepal", "Petal"),
              ~ .("{.x}.Width") + .("{.x}.Length")
              ))
#> # A tibble: 150 x 7
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal Petal
#>          <dbl>       <dbl>        <dbl>       <dbl> <fct>   <dbl> <dbl>
#> 1          5.1         3.5          1.4         0.2 setosa    8.6   1.6
#> 2          4.9         3            1.4         0.2 setosa    7.9   1.6
#> 3          4.7         3.2          1.3         0.2 setosa    7.9   1.5
#> 4          4.6         3.1          1.5         0.2 setosa    7.7   1.7
#> # ... with 146 more rows

The above syntax is equivalent to the more verbose:

iris %>%
  mutate(over(c("Sepal", "Petal"),
              ~ eval(sym(paste0(.x, ".Width"))) +
                eval(sym(paste0(.x, ".Length")))
              ))
#> # A tibble: 150 x 7
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal Petal
#>          <dbl>       <dbl>        <dbl>       <dbl> <fct>   <dbl> <dbl>
#> 1          5.1         3.5          1.4         0.2 setosa    8.6   1.6
#> 2          4.9         3            1.4         0.2 setosa    7.9   1.6
#> 3          4.7         3.2          1.3         0.2 setosa    7.9   1.5
#> 4          4.6         3.1          1.5         0.2 setosa    7.7   1.7
#> # ... with 146 more rows

.() also works with anonymous functions:

iris %>%
  summarise(over(c("Sepal", "Petal"),
                function(x) mean(.("{x}.Width"))
                ))
#> # A tibble: 1 x 2
#>   Sepal Petal
#>   <dbl> <dbl>
#> 1  3.06  1.20

A named list of functions:

iris %>%
  mutate(over(c("Sepal", "Petal"),
              list(product = ~ .("{.x}.Width") * .("{.x}.Length"),
                   sum = ~ .("{.x}.Width") + .("{.x}.Length"))
                   ),
         .keep = "none")
#> # A tibble: 150 x 4
#>   Sepal_product Sepal_sum Petal_product Petal_sum
#>           <dbl>     <dbl>         <dbl>     <dbl>
#> 1          17.8       8.6          0.28       1.6
#> 2          14.7       7.9          0.28       1.6
#> 3          15.0       7.9          0.26       1.5
#> 4          14.3       7.7          0.3        1.7
#> # ... with 146 more rows

Again, use the .names argument to control the output names:

iris %>%
  mutate(over(c("Sepal", "Petal"),
              list(product = ~ .("{.x}.Width") * .("{.x}.Length"),
                   sum = ~ .("{.x}.Width") + .("{.x}.Length")),
              .names = "{fn}_{x}"),
         .keep = "none")
#> # A tibble: 150 x 4
#>   product_Sepal sum_Sepal product_Petal sum_Petal
#>           <dbl>     <dbl>         <dbl>     <dbl>
#> 1          17.8       8.6          0.28       1.6
#> 2          14.7       7.9          0.28       1.6
#> 3          15.0       7.9          0.26       1.5
#> 4          14.3       7.7          0.3        1.7
#> # ... with 146 more rows

See Also

over2() to apply a function to two objects.

All members of the <over-across function family>.


TimTeaFan/dplyover documentation built on Sept. 27, 2021, 3:14 p.m.