These functions are selection helpers. They are intended to be used inside over() to extract parts or patterns of the column names of the underlying data.
cut_names() selects strings by removing (cutting off) the specified .pattern. This functionality resembles stringr::str_remove_all().
extract_names() selects strings by extracting the specified .pattern. This functionality resembles stringr::str_extract().
.pattern | Pattern to look for.
.remove | Pattern to remove from the variable names provided in .vars.
.vars | A character vector of variable names. When used inside over(), the column names of the underlying data are used automatically.
A character vector.
Selection helpers can be used inside dplyover::over(), which in turn must be used inside dplyr::mutate() or dplyr::summarise(). Let's first attach dplyr (and stringr for comparison):
library(dplyr)
library(stringr)

# For better printing
iris <- as_tibble(iris)
Let's now compare cut_names() and extract_names() to their stringr equivalents, stringr::str_remove_all() and stringr::str_extract(). We can observe two main differences:
(1) cut_names() and extract_names() only return strings where the function was applied successfully (that is, where characters have actually been removed or extracted). stringr::str_remove_all() returns unmatched strings as is, while stringr::str_extract() returns NA.
cut_names("Width", .vars = names(iris))
#> [1] "Sepal." "Petal."

str_remove_all(names(iris), "Width")
#> [1] "Sepal.Length" "Sepal." "Petal.Length" "Petal." "Species"

extract_names("Length|Width", .vars = names(iris))
#> [1] "Length" "Width"

str_extract(rep(names(iris), 2), "Length|Width")
#> [1] "Length" "Width" "Length" "Width" NA "Length" "Width" "Length" "Width"
#> [10] NA
(2) cut_names() and extract_names() return only unique values:
cut_names("Width", .vars = rep(names(iris), 2))
#> [1] "Sepal." "Petal."

str_remove_all(rep(names(iris), 2), "Width")
#> [1] "Sepal.Length" "Sepal." "Petal.Length" "Petal." "Species"
#> [6] "Sepal.Length" "Sepal." "Petal.Length" "Petal." "Species"

extract_names("Length|Width", .vars = names(iris))
#> [1] "Length" "Width"

str_extract(rep(names(iris), 2), "Length|Width")
#> [1] "Length" "Width" "Length" "Width" NA "Length" "Width" "Length" "Width"
#> [10] NA
The examples above do not show that cut_names() removes every occurrence of .pattern within a string, while extract_names() extracts only the first match:
cut_names("Width", .vars = "Width.Petal.Width")
#> [1] ".Petal."

str_remove_all("Width.Petal.Width", "Width")
#> [1] ".Petal."

extract_names("Width", .vars = "Width.Petal.Width")
#> [1] "Width"

str_extract("Width.Petal.Width", "Width")
#> [1] "Width"
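The behaviour compared above (keep only matching names, remove all occurrences or extract the first one, return unique values) can be approximated in base R. This is only a rough sketch for building intuition: cut_names_sketch() and extract_names_sketch() are hypothetical helpers, not dplyover's actual implementation, and base R regular expressions can differ from stringr's in edge cases.

```r
# Hypothetical stand-ins for illustration only; not dplyover's source code.
cut_names_sketch <- function(pattern, vars) {
  matched <- vars[grepl(pattern, vars)]  # keep only names where the pattern matches
  unique(gsub(pattern, "", matched))     # remove every occurrence, drop duplicates
}

extract_names_sketch <- function(pattern, vars) {
  matched <- vars[grepl(pattern, vars)]  # keep only names where the pattern matches
  # regexpr() locates the first match only, similar to str_extract()
  unique(regmatches(matched, regexpr(pattern, matched)))
}

# Column names of iris, written out so the sketch does not depend on any data
iris_names <- c("Sepal.Length", "Sepal.Width", "Petal.Length",
                "Petal.Width", "Species")

cut_names_sketch("Width", rep(iris_names, 2))
#> [1] "Sepal." "Petal."

extract_names_sketch("Length|Width", iris_names)
#> [1] "Length" "Width"

cut_names_sketch("Width", "Width.Petal.Width")
#> [1] ".Petal."
```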
Within over(), cut_names() and extract_names() automatically use the column names of the underlying data:
iris %>%
  mutate(over(cut_names(".Width"),
              ~ .("{.x}.Width") * .("{.x}.Length"),
              .names = "Product_{x}"))
#> # A tibble: 150 x 7
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Product_Sepal
#>          <dbl>       <dbl>        <dbl>       <dbl> <fct>           <dbl>
#> 1          5.1         3.5          1.4         0.2 setosa           17.8
#> 2          4.9         3            1.4         0.2 setosa           14.7
#> 3          4.7         3.2          1.3         0.2 setosa           15.0
#> 4          4.6         3.1          1.5         0.2 setosa           14.3
#> # ... with 146 more rows, and 1 more variable: Product_Petal <dbl>

iris %>%
  mutate(over(extract_names("Length|Width"),
              ~ .("Petal.{.x}") * .("Sepal.{.x}"),
              .names = "Product_{x}"))
#> # A tibble: 150 x 7
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Product_Length
#>          <dbl>       <dbl>        <dbl>       <dbl> <fct>            <dbl>
#> 1          5.1         3.5          1.4         0.2 setosa            7.14
#> 2          4.9         3            1.4         0.2 setosa            6.86
#> 3          4.7         3.2          1.3         0.2 setosa            6.11
#> 4          4.6         3.1          1.5         0.2 setosa            6.9
#> # ... with 146 more rows, and 1 more variable: Product_Width <dbl>
What problem does cut_names() solve?
In the example above, using cut_names() might not seem helpful, since we could easily use c("Sepal", "Petal") instead. However, there are cases where we have data with many similar pairs of variables sharing a common prefix or suffix. If we want to loop over them using over(), then cut_names() comes in handy.
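To make the "many similar pairs" case concrete, here is a small base-R sketch. The column names (score1_pre, score1_post, and so on) are invented for illustration, and the grepl()/sub() combination only approximates what cut_names("_pre$") would be expected to return for such names.

```r
# Hypothetical column names: 20 pre/post pairs sharing the suffixes "_pre" and "_post"
vars <- c("id", paste0("score", 1:20, "_pre"), paste0("score", 1:20, "_post"))

# Base-R approximation of cutting off the "_pre" suffix:
matched <- vars[grepl("_pre$", vars)]       # only names that carry the suffix
stems <- unique(sub("_pre$", "", matched))  # strip the suffix, keep unique stems

length(stems)
#> [1] 20
head(stems, 3)
#> [1] "score1" "score2" "score3"
```

Typing out twenty stems by hand would be tedious and error-prone, which is the situation where deriving them from the column names pays off.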
The usage of extract_names() might be less obvious. Let's look at raw data from a customer satisfaction survey which contains the following variables.
csatraw %>% glimpse(width = 50)
#> Rows: 150
#> Columns: 15
#> $ cust_id    <chr> "61297", "07545", "03822", "8~
#> $ type       <chr> "existing", "existing", "exis~
#> $ product    <chr> "advanced", "advanced", "prem~
#> $ item1      <dbl> 3, 2, 2, 4, 4, 3, 1, 3, 3, 2,~
#> $ item1_open <chr> "12", "22", "21, 22, 23", "12~
#> $ item2a     <dbl> 2, 2, 2, 3, 3, 0, 3, 2, 2, 0,~
#> $ item2b     <dbl> 3, 2, 5, 5, 2, NA, 3, 3, 4, N~
#> $ item3a     <dbl> 2, 3, 3, 2, 3, 2, 3, 3, 0, 1,~
#> $ item3b     <dbl> 2, 4, 5, 3, 5, 3, 4, 2, NA, 2~
#> $ item4a     <dbl> 0, 2, 0, 0, 3, 3, 3, 2, 2, 2,~
#> $ item4b     <dbl> NA, 3, NA, NA, 5, 2, 3, 5, 3,~
#> $ item5a     <dbl> 2, 3, 2, 2, 3, 1, 3, 2, 3, 1,~
#> $ item5b     <dbl> 5, 2, 3, 4, 1, 3, 3, 1, 3, 2,~
#> $ item6a     <dbl> 2, 2, 3, 1, 3, 3, 3, 2, 3, 2,~
#> $ item6b     <dbl> 3, 1, 2, 2, 5, 4, 4, 2, 2, 2,~
The survey has several 'item's consisting of two sub-questions / variables 'a' and 'b'. Let's say we want to calculate the product of those two variables for each item. extract_names() helps us to select all variables containing 'item' followed by a digit using the regex "item\\d" as .pattern.
However, there are 'item1' and 'item1_open', which are not followed by a and b. extract_names() lets us exclude these items by setting the .remove argument to "^item1":
csatraw %>%
  transmute(over(extract_names("item\\d", "^item1"),
                 ~ .("{.x}a") * .("{.x}b")))
#> # A tibble: 150 x 5
#>   item2 item3 item4 item5 item6
#>   <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     6     4    NA    10     6
#> 2     4    12     6     6     2
#> 3    10    15    NA     6     6
#> 4    15     6    NA     8     2
#> # ... with 146 more rows
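The interplay of .remove and .pattern in the call above can be traced step by step in base R. This is a sketch under an assumption: that names matching .remove are dropped before .pattern is applied. The column names are copied from the glimpse() output; the intermediate variables kept and matched are illustrative only.

```r
# Column names of csatraw, taken from the glimpse() output
vars <- c("cust_id", "type", "product", "item1", "item1_open",
          "item2a", "item2b", "item3a", "item3b", "item4a", "item4b",
          "item5a", "item5b", "item6a", "item6b")

# Step 1 (assumed order): drop every name matching the .remove pattern "^item1"
kept <- vars[!grepl("^item1", vars)]

# Step 2: keep names matching the .pattern "item\\d" and extract the first match
matched <- kept[grepl("item\\d", kept)]
unique(regmatches(matched, regexpr("item\\d", matched)))
#> [1] "item2" "item3" "item4" "item5" "item6"
```

The resulting five stems correspond to the five columns of the transmute() result.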