select_vars: Select string parts or patterns of column names


Description

These functions are selection helpers. They are intended to be used inside over() to extract parts or patterns of the column names of the underlying data.

Usage

cut_names(.pattern, .remove = NULL, .vars = NULL)

extract_names(.pattern, .remove = NULL, .vars = NULL)

Arguments

.pattern

Pattern to look for.

.remove

Pattern used to exclude variable names from .vars. When this argument is provided, all variable names in .vars that match the pattern specified in .remove are dropped before .pattern is applied.

.vars

A character vector of variable names. When used inside over(), all column names of the underlying data are automatically supplied to .vars. This argument is useful for testing the functionality outside the context of over().
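The order in which .remove and .pattern are applied can be sketched in base R. The helper sketch_select() below is hypothetical and written only for illustration. It approximates the behaviour described in the argument documentation and is not dplyover's actual implementation.

```r
# Hypothetical stand-in for the documented behaviour; not dplyover's code.
sketch_select <- function(pattern, remove = NULL, vars,
                          type = c("cut", "extract")) {
  type <- match.arg(type)
  # Step 1: drop every name that matches `remove`
  if (!is.null(remove)) {
    vars <- vars[!grepl(remove, vars, perl = TRUE)]
  }
  # Step 2: apply `pattern` to the remaining names only
  hit <- grepl(pattern, vars, perl = TRUE)
  out <- if (type == "cut") {
    gsub(pattern, "", vars[hit], perl = TRUE)
  } else {
    regmatches(vars[hit], regexpr(pattern, vars[hit], perl = TRUE))
  }
  # Step 3: keep each result once
  unique(out)
}

vars <- names(iris)
cut_all      <- sketch_select("Width", vars = vars)
cut_no_sepal <- sketch_select("Width", remove = "^Sepal", vars = vars)
ext_no_sepal <- sketch_select("Length|Width", remove = "^Sepal",
                              vars = vars, type = "extract")
```

With remove = "^Sepal", both Sepal columns are discarded first, so only names derived from the Petal columns can appear in the result.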

Value

A character vector.

Examples

Selection helpers can be used inside dplyover::over(), which in turn must be used inside dplyr::mutate() or dplyr::summarise(). Let's first attach dplyr (and stringr for comparison):

library(dplyr)
library(stringr)

# For better printing
iris <- as_tibble(iris)

Next, let's compare cut_names() and extract_names() to their stringr equivalents stringr::str_remove_all() and stringr::str_extract().

We can observe two main differences:

(1) cut_names() and extract_names() only return strings where the function was applied successfully (when characters have actually been removed or extracted). stringr::str_remove_all() returns unmatched strings as is, while stringr::str_extract() returns NA.

cut_names("Width", .vars = names(iris))
#> [1] "Sepal." "Petal."
str_remove_all(names(iris), "Width")
#> [1] "Sepal.Length" "Sepal."       "Petal.Length" "Petal."       "Species"

extract_names("Length|Width", .vars = names(iris))
#> [1] "Length" "Width"
str_extract(rep(names(iris), 2), "Length|Width")
#>  [1] "Length" "Width"  "Length" "Width"  NA       "Length" "Width"  "Length" "Width" 
#> [10] NA

(2) cut_names() and extract_names() return only unique values:

cut_names("Width", .vars = rep(names(iris), 2))
#> [1] "Sepal." "Petal."
str_remove_all(rep(names(iris), 2), "Width")
#>  [1] "Sepal.Length" "Sepal."       "Petal.Length" "Petal."       "Species"     
#>  [6] "Sepal.Length" "Sepal."       "Petal.Length" "Petal."       "Species"

extract_names("Length|Width", .vars = names(iris))
#> [1] "Length" "Width"
str_extract(rep(names(iris), 2), "Length|Width")
#>  [1] "Length" "Width"  "Length" "Width"  NA       "Length" "Width"  "Length" "Width" 
#> [10] NA
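Both differences can be reproduced with stringr alone, which may help clarify what the helpers add on top. This is a sketch of the documented behaviour, not dplyover's implementation; the object names removed, extracted, cut_like and extract_like are chosen here for illustration.

```r
library(stringr)

vars <- rep(names(iris), 2)

# cut_names()-like result: keep only strings that actually changed,
# then drop duplicates
removed  <- str_remove_all(vars, "Width")
cut_like <- unique(removed[removed != vars])

# extract_names()-like result: drop the NA entries returned for
# non-matching strings, then drop duplicates
extracted    <- str_extract(vars, "Length|Width")
extract_like <- unique(extracted[!is.na(extracted)])
```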

What the examples above do not show is that cut_names() removes every occurrence of .pattern from a string, while extract_names() extracts only the first match:

cut_names("Width", .vars = "Width.Petal.Width")
#> [1] ".Petal."
str_remove_all("Width.Petal.Width", "Width")
#> [1] ".Petal."

extract_names("Width", .vars = "Width.Petal.Width")
#> [1] "Width"
str_extract("Width.Petal.Width", "Width")
#> [1] "Width"
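For comparison, stringr offers a "first match" and an "all matches" variant of both operations. The short sketch below contrasts them on the same input; the object names are chosen here for illustration.

```r
library(stringr)

x <- "Width.Petal.Width"

remove_first <- str_remove(x, "Width")             # removes the first match only
remove_all   <- str_remove_all(x, "Width")         # removes every match
extract_one  <- str_extract(x, "Width")            # returns the first match only
extract_all  <- str_extract_all(x, "Width")[[1]]   # returns every match
```

Following the examples above, cut_names() corresponds to the "all matches" removal and extract_names() to the "first match" extraction.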

Within over(), cut_names() and extract_names() automatically use the column names of the underlying data:

iris %>%
  mutate(over(cut_names(".Width"),
              ~ .("{.x}.Width") * .("{.x}.Length"),
              .names = "Product_{x}"))
#> # A tibble: 150 x 7
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Product_Sepal
#>          <dbl>       <dbl>        <dbl>       <dbl> <fct>           <dbl>
#> 1          5.1         3.5          1.4         0.2 setosa           17.8
#> 2          4.9         3            1.4         0.2 setosa           14.7
#> 3          4.7         3.2          1.3         0.2 setosa           15.0
#> 4          4.6         3.1          1.5         0.2 setosa           14.3
#> # ... with 146 more rows, and 1 more variable: Product_Petal <dbl>

iris %>%
  mutate(over(extract_names("Length|Width"),
              ~ .("Petal.{.x}") * .("Sepal.{.x}"),
              .names = "Product_{x}"))
#> # A tibble: 150 x 7
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Product_Length
#>          <dbl>       <dbl>        <dbl>       <dbl> <fct>            <dbl>
#> 1          5.1         3.5          1.4         0.2 setosa            7.14
#> 2          4.9         3            1.4         0.2 setosa            6.86
#> 3          4.7         3.2          1.3         0.2 setosa            6.11
#> 4          4.6         3.1          1.5         0.2 setosa            6.9 
#> # ... with 146 more rows, and 1 more variable: Product_Width <dbl>

What problem does cut_names() solve? In the example above, using cut_names() might not seem helpful, since we could easily use c("Sepal", "Petal") instead. However, there are cases where the data contain many similar pairs of variables sharing a common prefix or suffix. If we want to loop over them using over(), then cut_names() comes in handy.
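To make explicit what the cut_names() example computes, here is a rough base R equivalent that derives the shared prefixes from the column names and loops over them. It does not use dplyover, and the names width_cols and prefixes are chosen here for illustration.

```r
# Base R sketch; does not use dplyover.
df <- iris

# Derive the prefixes from the column names instead of typing them out
width_cols <- grep("\\.Width$", names(df), value = TRUE)
prefixes   <- unique(sub("\\.Width$", "", width_cols))

# For each prefix, multiply the matching Width and Length columns
for (p in prefixes) {
  df[[paste0("Product_", p)]] <-
    df[[paste0(p, ".Width")]] * df[[paste0(p, ".Length")]]
}

head(df[, c("Product_Sepal", "Product_Petal")], 4)
```

The more variable pairs a data set has, the more this boilerplate grows, which is the case over() with cut_names() is meant to shorten.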

The usage of extract_names() might be less obvious. Let's look at raw data from a customer satisfaction survey, which contains the following variables.

csatraw %>% glimpse(width = 50)
#> Rows: 150
#> Columns: 15
#> $ cust_id    <chr> "61297", "07545", "03822", "8~
#> $ type       <chr> "existing", "existing", "exis~
#> $ product    <chr> "advanced", "advanced", "prem~
#> $ item1      <dbl> 3, 2, 2, 4, 4, 3, 1, 3, 3, 2,~
#> $ item1_open <chr> "12", "22", "21, 22, 23", "12~
#> $ item2a     <dbl> 2, 2, 2, 3, 3, 0, 3, 2, 2, 0,~
#> $ item2b     <dbl> 3, 2, 5, 5, 2, NA, 3, 3, 4, N~
#> $ item3a     <dbl> 2, 3, 3, 2, 3, 2, 3, 3, 0, 1,~
#> $ item3b     <dbl> 2, 4, 5, 3, 5, 3, 4, 2, NA, 2~
#> $ item4a     <dbl> 0, 2, 0, 0, 3, 3, 3, 2, 2, 2,~
#> $ item4b     <dbl> NA, 3, NA, NA, 5, 2, 3, 5, 3,~
#> $ item5a     <dbl> 2, 3, 2, 2, 3, 1, 3, 2, 3, 1,~
#> $ item5b     <dbl> 5, 2, 3, 4, 1, 3, 3, 1, 3, 2,~
#> $ item6a     <dbl> 2, 2, 3, 1, 3, 3, 3, 2, 3, 2,~
#> $ item6b     <dbl> 3, 1, 2, 2, 5, 4, 4, 2, 2, 2,~

The survey has several items, each consisting of two sub-questions (variables) 'a' and 'b'. Let's say we want to calculate the product of those two variables for each item. extract_names() helps us select all variables containing 'item' followed by a digit, using the regex "item\\d" as .pattern. However, 'item1' and 'item1_open' have no 'a' and 'b' variants. extract_names() lets us exclude them by setting the .remove argument to "^item1":

csatraw %>%
 transmute(over(extract_names("item\\d", "^item1"),
                ~ .("{.x}a") * .("{.x}b"))
 )
#> # A tibble: 150 x 5
#>   item2 item3 item4 item5 item6
#>   <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     6     4    NA    10     6
#> 2     4    12     6     6     2
#> 3    10    15    NA     6     6
#> 4    15     6    NA     8     2
#> # ... with 146 more rows
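The same selection logic can be checked by hand in base R. The sketch below uses only the first four rows of item1, item2a/item2b and item3a/item3b, copied from the glimpse() output above. The data frame csat_head and the other object names are made up for this illustration, and the code approximates the documented behaviour rather than calling dplyover.

```r
# First four rows of a few columns, copied from the glimpse() output
csat_head <- data.frame(
  item1  = c(3, 2, 2, 4),
  item2a = c(2, 2, 2, 3), item2b = c(3, 2, 5, 5),
  item3a = c(2, 3, 3, 2), item3b = c(2, 4, 5, 3)
)

vars <- names(csat_head)

# .remove = "^item1": drop names starting with "item1"
kept <- vars[!grepl("^item1", vars, perl = TRUE)]

# .pattern = "item\\d": extract the first match from each name, keep unique
items <- unique(regmatches(kept, regexpr("item\\d", kept, perl = TRUE)))

# Multiply the 'a' and 'b' column of each item
res <- sapply(items, function(i) {
  csat_head[[paste0(i, "a")]] * csat_head[[paste0(i, "b")]]
})
res
```

The resulting columns item2 and item3 agree with the first four rows of the transmute() output shown above.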

TimTeaFan/dplyover documentation built on Sept. 27, 2021, 3:14 p.m.