#' Apply functions to a list or vector in 'dplyr'
#'
#' @description
#' `over()` makes it easy to create new colums inside a [dplyr::mutate()] or
#' [dplyr::summarise()] call by applying a function (or a set of functions) to
#' an atomic vector or list using a syntax similar to [dplyr::across()].
#' The main difference is that [dplyr::across()] transforms or creates new columns
#' based on existing ones, while `over()` can create new columns based on a
#' vector or list to which it will apply one or several functions.
#' Whereas [dplyr::across()] allows `tidy-selection` helpers to select columns,
#' `over()` provides its own helper functions to select strings or values based
#' on either (1) values of specified columns or (2) column names. See the
#' examples below and the `vignette("why_dplyover")` for more details.
#'
#' @param .x An atomic vector or list to apply functions to. Alternatively a
#' <[`selection helper`][selection_helpers]> can be used to create
#' a vector.
#'
#' @param .fns Functions to apply to each of the elements in `.x`. For
#' functions that expect variable names as input, the selected strings need to
#' be turned into symbols and evaluated. `dplyrover` comes with a genuine helper
#' function that evaluates strings as names [`.()`].
#'
#' Possible values are:
#'
#' - A function
#' - A purrr-style lambda
#' - A list of functions/lambdas
#'
#' For examples see the example section below.
#'
#' Note that, unlike `across()`, `over()` does not accept `NULL` as a
#' value to `.fns`.
#'
#' @param ... Additional arguments for the function calls in `.fns`.
#'
#' @param .names A glue specification that describes how to name the output
#' columns. This can use `{x}` to stand for the selected vector element, and
#' `{fn}` to stand for the name of the function being applied. The default
#' (`NULL`) is equivalent to `"{x}"` for the single function case and
#' `"{x}_{fn}"` for the case where a list is used for `.fns`.
#'
#' Note that, depending on the nature of the underlying object in `.x`,
#' specifying `{x}` will yield different results:
#'
#' - If `.x` is an unnamed atomic vector, `{x}` will represent each value.
#' - If `.x` is a named list or atomic vector, `{x}` will represent each name.
#' - If `.x` is an unnamed list, `{x}` will be the index number running from 1 to `length(x)`.
#'
#' This standard behavior (interpretation of `{x}`) can be overwritten by
#' directly specifying:
#'
#' - `{x_val}` for `.x`'s values
#' - `{x_nm}` for its names
#' - `{x_idx}` for its index numbers
#'
#' Alternatively, a character vector of length equal to the number of columns to
#' be created can be supplied to `.names`. Note that in this case, the glue
#' specification described above is not supported.
#'
#' @param .names_fn Optionally, a function that is applied after the glue
#' specification in `.names` has been evaluated. This is, for example, helpful
#' in case the resulting names need to be further cleaned or trimmed.
#'
#' @returns
#' A tibble with one column for each element in `.x` and each function in `.fns`.
#'
#' @section Note:
#' Similar to `dplyr::across()` `over()` works only inside dplyr verbs.
#'
#' @seealso
#' [over2()] to apply a function to two objects.
#'
#' All members of the <[`over-across function family`][over_across_family]>.
#'
#' @section Examples:
#'
#' ```{r, child = "man/rmd/setup.Rmd"}
#' ```
#'
#' It has two main use cases. They differ in how the elements in `.x`
#' are used. Let's first attach `dplyr`:
#'
#' ```{r, comment = "#>", collapse = TRUE}
#' library(dplyr)
#'
#' # For better printing
#' iris <- as_tibble(iris)
#' ```
#'
#'
#' #### (1) The General Use Case
#' Here the values in `.x` are used as inputs to one or more functions in `.fns`.
#' This is useful, when we want to create several new variables based on the same
#' function with varying arguments. A good example is creating a bunch of lagged
#' variables.
#'
#' ```{r, comment = "#>", collapse = TRUE}
#' tibble(x = 1:25) %>%
#' mutate(over(c(1:3),
#' ~ lag(x, .x)))
#' ```
#'
#' Lets create a dummy variable for each unique value in 'Species':
#' ```{r, comment = "#>", collapse = TRUE}
#' iris %>%
#' mutate(over(unique(Species),
#' ~ if_else(Species == .x, 1, 0)),
#' .keep = "none")
#' ```
#'
#' With `over()` it is also possible to create several dummy variables with
#' different thresholds. We can use the `.names` argument to control the output
#' names:
#'
#' ```{r, comment = "#>", collapse = TRUE}
#' iris %>%
#' mutate(over(seq(4, 7, by = 1),
#' ~ if_else(Sepal.Length < .x, 1, 0),
#' .names = "Sepal.Length_{x}"),
#' .keep = "none")
#' ```
#'
#' A similar approach can be used with dates. Below we loop over a date
#' sequence to check whether the date falls within a given start and end
#' date. We can use the `.names_fn` argument to clean the resulting output
#' names:
#'
#' ```{r, comment = "#>", collapse = TRUE}
#' # some dates
#' dat_tbl <- tibble(start = seq.Date(as.Date("2020-01-01"),
#' as.Date("2020-01-15"),
#' by = "days"),
#' end = start + 10)
#'
#' dat_tbl %>%
#' mutate(over(seq(as.Date("2020-01-01"),
#' as.Date("2020-01-21"),
#' by = "weeks"),
#' ~ .x >= start & .x <= end,
#' .names = "day_{x}",
#' .names_fn = ~ gsub("-", "", .x)))
#' ```
#'
#' `over()` can summarise data in wide format. In the example below, we want to
#' know for each group of customers (`new`, `existing`, `reactivate`), how much
#' percent of the respondents gave which rating on a five point likert scale
#' (`item1`). A usual approach in the tidyverse would be to use
#' `count %>% group_by %>% mutate`, which yields the same result in the usually
#' prefered long format. Sometimes, however, we might want this kind of summary
#' in the wide format, and in this case `over()` comes in handy:
#'
#' ```{r, comment = "#>", collapse = TRUE}
#' csatraw %>%
#' group_by(type) %>%
#' summarise(over(c(1:5),
#' ~ mean(item1 == .x)))
#' ```
#'
#' Instead of a vector we can provide a named list of vectors to calculate the
#' top two and bottom two categories on the fly:
#'
#' ```{r, comment = "#>", collapse = TRUE}
#' csatraw %>%
#' group_by(type) %>%
#' summarise(over(list(bot2 = c(1:2),
#' mid = 3,
#' top2 = c(4:5)),
#' ~ mean(item1 %in% .x)))
#' ```
#'
#' `over()` can also loop over columns of a data.frame. In the example below we
#' want to create four different dummy variables of `item1`: (i) the top and (ii)
#' bottom category as well as (iii) the top two and (iv) the bottom two categories.
#' We can create a lookup `data.frame` and use all columns but the first as input to
#' `over()`. In the function call we make use of base R's `match()`, where `.x`
#' represents the new values and `recode_df[, 1]` refers to the old values.
#'
#' ```{r, comment = "#>", collapse = TRUE}
#'
#' recode_df <- data.frame(old = c(1, 2, 3, 4, 5),
#' top1 = c(0, 0, 0, 0, 1),
#' top2 = c(0, 0, 0, 1, 1),
#' bot1 = c(1, 0, 0, 0, 0),
#' bot2 = c(1, 1, 0, 0, 0))
#'
#' csatraw %>%
#' mutate(over(recode_df[,-1],
#' ~ .x[match(item1, recode_df[, 1])],
#' .names = "item1_{x}")) %>%
#' select(starts_with("item1"))
#' ```
#'
#' `over()` work nicely with comma separated values stored in character vectors.
#' In the example below, the colum `csat_open` contains one or more comma
#' separated reasons why a specific customer satisfaction rating was given.
#' We can easily create a column for each response category with the help of
#' `dist_values` - a wrapper around `unique` which can split vector elements
#' using a separator:
#'
#' ```{r, comment = "#>", collapse = TRUE}
#' csat %>%
#' mutate(over(dist_values(csat_open, .sep = ", "),
#' ~ as.integer(grepl(.x, csat_open)),
#' .names = "rsp_{x}",
#' .names_fn = ~ gsub("\\s", "_", .x)),
#' .keep = "none") %>% glimpse
#' ```
#'
#'
#' #### (2) A Very Specific Use Case
#' Here strings are supplied to `.x` to construct column names (sharing the
#' same stem). This allows us to dynamically use more than one column in the
#' function calls in `.fns`. To work properly, the strings need to be
#' turned into symbols and evaluated. For this {dplyover} provides a genuine
#' helper function `.()` that evaluates strings and helps to declutter the
#' otherwise rather verbose code. `.()` supports glue syntax and takes a string
#' as argument.
#'
#' Below are a few examples using two colums in the function calls in `.fns`.
#' For the two column case [across2()] provides a more intuitive API that is
#' closer to the original `dplyr::across`. Using `.()` inside `over` is really
#' useful for cases with more than two columns.
#'
#' Consider the following example of a purrr-style formula in `.fns` using `.()`:
#'
#' ```{r, comment = "#>", collapse = TRUE}
#' iris %>%
#' mutate(over(c("Sepal", "Petal"),
#' ~ .("{.x}.Width") + .("{.x}.Length")
#' ))
#' ```
#'
#' The above syntax is equal to the more verbose:
#' ```{r, comment = "#>", collapse = TRUE}
#' iris %>%
#' mutate(over(c("Sepal", "Petal"),
#' ~ eval(sym(paste0(.x, ".Width"))) +
#' eval(sym(paste0(.x, ".Length")))
#' ))
#' ```
#'
#' `.()` also works with anonymous functions:
#' ```{r, comment = "#>", collapse = TRUE}
#' iris %>%
#' summarise(over(c("Sepal", "Petal"),
#' function(x) mean(.("{x}.Width"))
#' ))
#' ```
#'
#' A named list of functions:
#' ```{r, comment = "#>", collapse = TRUE}
#' iris %>%
#' mutate(over(c("Sepal", "Petal"),
#' list(product = ~ .("{.x}.Width") * .("{.x}.Length"),
#' sum = ~ .("{.x}.Width") + .("{.x}.Length"))
#' ),
#' .keep = "none")
#' ```
#'
#' Again, use the `.names` argument to control the output names:
#' ```{r, comment = "#>", collapse = TRUE}
#' iris %>%
#' mutate(over(c("Sepal", "Petal"),
#' list(product = ~ .("{.x}.Width") * .("{.x}.Length"),
#' sum = ~ .("{.x}.Width") + .("{.x}.Length")),
#' .names = "{fn}_{x}"),
#' .keep = "none")
#' ```
#' @export
over <- function(.x, .fns, ..., .names = NULL, .names_fn = NULL){
grp_id <- tryCatch({
dplyr::cur_group_id()
}, error = function(e) {
rlang::abort("`over()` must only be used inside dplyr verbs.")
})
setup <- meta_setup(dep_call = deparse_call(sys.call()),
grp_id = grp_id,
par_frame = parent.frame(),
setup_fn = "over_setup",
x1 = .x,
fns = .fns,
names = .names,
names_fn = .names_fn)
x <- setup$x
fns <- setup$fns
names <- setup$names
# check empty input
if (length(x) == 0L) {
return(tibble::new_tibble(list(), nrow = 1L))
}
n_x <- length(x)
n_fns <- length(fns)
seq_n_x <- seq_len(n_x)
seq_fns <- seq_len(n_fns)
k <- 1L
out <- vector("list", n_x * n_fns)
for (i in seq_n_x) {
xi <- x[[i]]
for (j in seq_fns) {
fn <- fns[[j]]
out[[k]] <- fn(xi, ...)
k <- k + 1L
}
}
size <- vctrs::vec_size_common(!!!out)
out <- vctrs::vec_recycle_common(!!!out, .size = size)
names(out) <- names
tibble::new_tibble(out, nrow = size)
}
over_setup <- function(x1, fns, names, names_fn) {
# setup name variants
x1_nm <- names(x1)
x1_idx <- as.character(seq_along(x1))
x1_val <- if (is.data.frame(x1) && nrow(x1) != 1) {
NULL
} else if (is.list(x1) && is.vector(x1) &&
any(purrr::map_lgl(x1, ~ length(.x) != 1))) {
NULL
} else {
x1
}
# apply `.names` smart default
if (is.function(fns) || rlang::is_formula(fns)) {
names <- names %||% "{x}"
fns <- list(`1` = fns)
} else {
names <- names %||% "{x}_{fn}"
}
if (!is.list(fns)) {
rlang::abort(c("Problem with `over()` input `.fns`.",
i = "Input `.fns` must be a function, a formula, or a list of functions/formulas."))
}
# use index for unnamed lists
if (is.list(x1) && !rlang::is_named(x1)) {
names(x1) <- x1_idx
}
# handle formulas
fns <- purrr::map(fns, rlang::as_function)
# make sure fns has names, use number to replace unnamed
if (is.null(names(fns))) {
names_fns <- seq_along(fns)
} else {
names_fns <- names(fns)
empties <- which(names_fns == "")
if (length(empties)) {
names_fns[empties] <- empties
}
}
# setup control flow:
vars_no <- length(x1) * length(fns)
maybe_glue <- any(grepl("{.*}", names, perl = TRUE))
is_glue <- any(grepl("{(x|x_val|x_nm|x_idx|fn)}", names, perl = TRUE))
# if .names use glue syntax:
if (is_glue) {
if (length(names) > 1) {
rlang::abort(c("Problem with `over()` input `.names`.",
i = "Glue specification must be a character vector of length == 1.",
x = paste0("`.names` is of length: ", length(names), ".")))
}
# warn that default values are used if conditions not met
if (is.null(x1_val) && grepl("{x_val}", names, perl = TRUE)) {
rlang::warn("in `over()` `.names`: used 'x_idx' instead of 'x_val'. The latter only works with lists if all elements are length 1.")
}
if (is.null(x1_nm) && grepl("{x_nm}", names, perl = TRUE)) {
rlang::warn("in `over()` `.names`: used 'x_idx' instead of 'x_nm', since the input object is unnamed.")
}
names <- vctrs::vec_as_names(glue::glue(names,
x = rep(names(x1) %||% x1, each = length(fns)),
x_val = rep(x1_val %||% x1_idx, each = length(fns)),
x_nm = rep(x1_nm %||% x1_idx, each = length(fns)),
x_idx = rep(x1_idx, each = length(fns)),
fn = rep(names_fns, length(x1))),
repair = "check_unique")
# no correct glue syntax detected
} else {
# glue syntax might be wrong
if (maybe_glue && length(names) == 1 && vars_no > 1) {
rlang::abort(c("Problem with `over()` input `.names`.",
x = "Unrecognized glue specification `{...}` detected in `.names`.",
i = "`.names` only supports the following expressions: '{x}', '{x_val}', '{x_nm}', '{x_idx}' or '{fn}'."
))
}
# check that non-glue names are unique
vctrs::vec_as_names(names, repair = "check_unique")
# check number of names
if (length(names) != vars_no) {
rlang::abort(c("Problem with `over()` input `.names`.",
i = "The number of elements in `.names` must equal the number of new columns.",
x = paste0(length(names), " elements provided to `.names`, but the number of new columns is ", vars_no, ".")
))
}
}
# apply names_fn
if (!is.null(names_fn)) {
nm_f <- rlang::as_function(names_fn)
names <- purrr::map_chr(names, nm_f)
}
value <- list(x = x1, fns = fns, names = names)
value
}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.