In m-clark/tidyext: Tidy Extensions for Data Processing

knitr::opts_chunk$set(
  echo = TRUE, message = FALSE, warning = FALSE, error = FALSE, collapse = TRUE,
  comment = NA, R.options = list(width = 220),            # code
  dev.args = list(bg = 'transparent'), dev = 'svglite',   # viz
  fig.align = 'center', out.width = '75%', fig.asp = .75,
  cache.rebuild = FALSE, cache = FALSE                    # cache
)

Getting started with tidyext

Data Summaries

To begin, we can load up the tidyverse and this package. I'll also create some data that will be useful for demonstration.

library(tidyverse)
library(tidyext)

set.seed(8675309)

df1 <- tibble(
  g1 = factor(sample(1:2, 50, replace = TRUE), labels = c('a', 'b')),
  g2 = sample(1:4, 50, replace = TRUE),
  a = rnorm(50),
  b = rpois(50, 10),
  c = sample(letters, 50, replace = TRUE),
  d = sample(c(TRUE, FALSE), 50, replace = TRUE)
)

df_miss = df1
df_miss[sample(1:nrow(df1), 10), sample(1:ncol(df1), 3)] = NA

We can start by getting a quick numerical summary for a single column. As the name suggests, this will only work with numeric data.

num_summary(mtcars$mpg)

num_summary(df_miss$a, extra = TRUE)
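
For readers curious what such a single-column summary involves, the following is a minimal base R sketch. The helper name num_summary_sketch and its column names are hypothetical, chosen for illustration; this is not tidyext's implementation and its output will not match num_summary exactly.

```r
# Hypothetical helper for illustration; not tidyext's implementation.
num_summary_sketch <- function(x, digits = 2) {
  stopifnot(is.numeric(x))

  # Quartiles computed on the non-missing values
  q <- quantile(x, probs = c(.25, .5, .75), na.rm = TRUE, names = FALSE)

  out <- data.frame(
    N       = sum(!is.na(x)),        # number of non-missing values
    Mean    = mean(x, na.rm = TRUE),
    SD      = sd(x, na.rm = TRUE),
    Min     = min(x, na.rm = TRUE),
    Q1      = q[1],
    Median  = q[2],
    Q3      = q[3],
    Max     = max(x, na.rm = TRUE),
    Missing = sum(is.na(x))          # number of missing values
  )

  round(out, digits)
}

res <- num_summary_sketch(c(1, 2, 3, 4, NA))
res
```

Returning a one-row data.frame rather than a named vector is what makes this kind of summary easy to stack across columns.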

Note that the result is a data.frame, which makes it easy to work with. For example, we can summarize every column of a data set and bind the results together.

x = num_summary(mtcars$mpg)
glimpse(x)

mtcars %>% 
  map_dfr(num_summary, .id = 'Variable')

There are also functions for summarizing missingness.

sum_NA(df_miss$a)

sum_blank(c(letters, '', '   '))

sum_NaN(c(1, NaN, 2))
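
As a rough guide to what these counters do, here are base R equivalents. This is a sketch under assumptions: in particular, treating whitespace-only strings as blank is a guess about sum_blank's behavior, not something confirmed here.

```r
x <- c('a', '', '   ', NA)

# Count of NA values
n_na <- sum(is.na(x))

# Count of empty strings; trimws() also catches whitespace-only
# strings (assumption: this may differ from what sum_blank counts)
n_blank <- sum(trimws(x) == '', na.rm = TRUE)

# Count of NaN values; note that is.na() is also TRUE for NaN
n_nan <- sum(is.nan(c(1, NaN, 2)))

c(n_na = n_na, n_blank = n_blank, n_nan = n_nan)
```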

When dealing with a data frame of mixed types we can use the describe_* functions.

describe_all(df1)

describe_all_cat(df1)

describe_all_num(df1, digits = 1)

Note how the categorical result is just as ready for visualization as the numeric one, since it can be filtered by the Variable column. It also has options to include NA as its own category (include_NAcat) and to sort the result by frequency (sort_by_freq).

describe_all_cat(df_miss, include_NAcat = TRUE, sort_by_freq = TRUE) %>% 
  filter(Variable == 'g1') %>% 
  ggplot(aes(x=Group, y=`%`)) +
  geom_col(width = .25)

Typically during data processing, we are performing grouped operations. As such, there are corresponding num_by and cat_by functions that provide the same information by some grouping variable. This saves you from writing group_by() %>% summarize() and creating a variable for each of these values. You can also pass a set of variables to summarize using vars().

df_miss %>% 
  num_by(a, group_var = g2)

df_miss %>% 
  num_by(vars(a, b), group_var = g2)
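
For comparison, this is roughly the group_by() and summarise() code that a grouped numeric summary replaces. It is a sketch using a small made-up data frame and an arbitrary choice of statistics; the column names are illustrative and do not reproduce num_by's output.

```r
library(dplyr)

# Small made-up data set: two groups, one missing value
d <- data.frame(
  g = rep(c(1, 2), each = 5),
  a = c(1:9, NA)
)

# Each statistic has to be written out by hand
by_group <- d %>%
  group_by(g) %>%
  summarise(
    N       = sum(!is.na(a)),
    Mean    = mean(a, na.rm = TRUE),
    SD      = sd(a, na.rm = TRUE),
    Missing = sum(is.na(a))
  )

by_group
```

Repeating this for several variables means repeating every line of the summarise() call, which is the tedium the grouped helpers are meant to remove.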

For categorical variables summarized by group, you can choose whether the resulting percentages are calculated within each group or irrespective of the grouping.

df1 %>% 
  cat_by(d, 
         group_var = g1, 
         perc_by_group = TRUE)

df1 %>% 
  cat_by(d, 
         group_var = g1, 
         perc_by_group = FALSE, 
         sort_by_group = FALSE)
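
The difference between the two kinds of percentage can be seen with base R's prop.table on a tiny made-up example. This only illustrates the arithmetic; it is not how cat_by is implemented.

```r
g <- c('a', 'a', 'a', 'b')
d <- c(TRUE, FALSE, FALSE, TRUE)

counts <- table(g, d)

# Percentages within each group: every row sums to 100
within_group <- 100 * prop.table(counts, margin = 1)

# Percentages of the whole data set: all cells together sum to 100
overall <- 100 * prop.table(counts)

within_group
overall
```

In the first table, group b shows 100% TRUE even though it holds a single observation; in the second, that same cell is 25% of all rows.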

Data Processing

In addition, there are some functions for data processing. We can start with a simple one-hot encoding function.

onehot(iris) %>% 
  slice(c(1:2, 51:52, 101:102))

The result can also be returned in sparse form.

iris %>% 
  slice(c(1:2, 51:52, 101:102)) %>% 
  onehot(sparse = TRUE)

You can also choose specific variables, whether to keep the others, and how to deal with NA.

df_miss %>%
  onehot(var = c('g1', 'g2'), nas = 'na.omit', keep.original = FALSE) %>%
  head()
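
For readers unfamiliar with one-hot encoding, the idea can be sketched in base R with model.matrix. The data below are made up, and the resulting column names follow model.matrix conventions rather than onehot's.

```r
d <- data.frame(
  y       = c(1.5, 2.0, 3.5),
  Species = factor(c('setosa', 'versicolor', 'virginica'))
)

# Dropping the intercept (- 1) keeps one indicator column per factor level
indicators <- model.matrix(~ Species - 1, data = d)

# Combine the untouched column with the new indicator columns
encoded <- cbind(d['y'], indicators)
encoded
```

Each row has exactly one indicator set to 1, marking the level that row belonged to.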

With create_prediction_data, we can quickly create data for use with predict after fitting a model. By default, it sets numeric variables to their mean and categorical variables to their most common category.

create_prediction_data(iris)

create_prediction_data(iris, num = function(x) quantile(x, probs = .25))

We can also supply specific values.

cd = data.frame(cyl=4, hp=100)
create_prediction_data(mtcars, conditional_data = cd)
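
The default behavior described above (numeric variables at their mean, categorical variables at their most common category) can be sketched in a few lines of base R. The helper typical_row is hypothetical and is not tidyext's implementation; ties between categories are broken by taking the first one.

```r
# Hypothetical helper for illustration; not tidyext's implementation.
typical_row <- function(data) {
  # Most frequent value; which.max() returns the first in case of ties
  most_common <- function(x) names(which.max(table(x)))

  out <- lapply(data, function(x) {
    if (is.numeric(x)) mean(x, na.rm = TRUE) else most_common(x)
  })

  as.data.frame(out, stringsAsFactors = FALSE)
}

d <- data.frame(
  x = c(1, 2, 6),
  f = c('u', 'v', 'v')
)

typical <- typical_row(d)
typical
```

Note that this sketch returns categorical values as character; a model fit on factors may need the original factor levels restored before calling predict.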

For modeling purposes, we often want to center or scale the data, take logs etc. The pre_process function will standardize numeric data by default.

pre_process(df1)

Other options are to simply center the data (scale_by = 0), start some variables at zero (e.g. time indicators), log some variables (with chosen base), and scale some to range from zero to one.

pre_process(mtcars, 
            scale_by = 0, 
            log_vars = vars(mpg, wt), 
            zero_start = vars(cyl), 
            zero_one = vars(hp, starts_with('d'))) %>% 
  describe_all_num()

Note that centering/standardizing is applied to any numeric variables not chosen for log_vars, zero_start, or zero_one.
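
The transformations named above amount to simple arithmetic, shown here on a small made-up vector in base R. This illustrates what each option means and makes no claim about how pre_process computes them internally.

```r
x <- c(2, 4, 6, 8)

centered     <- x - mean(x)                       # mean 0, original units
standardized <- (x - mean(x)) / sd(x)             # mean 0, standard deviation 1
zero_start   <- x - min(x)                        # smallest value becomes 0
zero_one     <- (x - min(x)) / (max(x) - min(x))  # rescaled to range from 0 to 1
logged       <- log(x, base = 2)                  # log with a chosen base

data.frame(x, centered, standardized, zero_start, zero_one, logged)
```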

Here's a specific function you will probably never need, but will be glad to have if you do. Some data columns have multiple entries for each observation/cell. While it's understandable why someone would do this, it's not very good practice. combn_2_col splits out the entries, or combinations of them, into their own indicator columns.

d = data.frame(id = 1:4, labs = c('A-B', 'B-C-D-E', 'A-E', 'D-E'))

combn_2_col(
  data = d,
  var = 'labs',
  max_m = 2,
  sep = '-',
  collapse = ':',
  toInteger = TRUE
)

combn_2_col(
  data = d,
  var = 'labs',
  max_m = 2,
  sparse = TRUE
)
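
To make the idea concrete, here is a base R sketch of the simplest case: one indicator column per individual label, with no combinations of labels. It is an illustration only and does not reproduce combn_2_col's output or column naming.

```r
d <- data.frame(
  id   = 1:4,
  labs = c('A-B', 'B-C-D-E', 'A-E', 'D-E'),
  stringsAsFactors = FALSE
)

# Split each cell into its individual labels
parts <- strsplit(d$labs, split = '-', fixed = TRUE)
all_labels <- sort(unique(unlist(parts)))

# One 0/1 column per label: 1 if the label appears in that row's cell
indicators <- sapply(all_labels, function(lab) {
  as.integer(sapply(parts, function(p) lab %in% p))
})

cbind(d, indicators)
```

Handling pairs or larger combinations would additionally require enumerating the combinations within each cell, for example with combn.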

Oftentimes I need to create a column that represents the total scores or means of just a few columns. This is a slight annoyance in the tidyverse, and there isn't much support behind the dplyr::rowwise function. As such, tidyext has a few simple wrappers: row_sums, row_means, and row_apply.

d = data.frame(x = 1:3,
               y = 4:6,
               z = 7:9,
               q = NA)

d %>%
  row_sums(x:y)

d %>%
  row_means(matches('x|z'))

row_apply(
  d,
  x:z,
  .fun = function(x) apply(x, 1, paste, collapse = '')
)
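
For reference, the same kind of row-wise totals can be computed in base R with rowSums and rowMeans on a subset of columns. This is a sketch of the underlying operation, with made-up new column names, not a description of how the wrappers are written.

```r
d <- data.frame(x = 1:3,
                y = 4:6,
                z = 7:9,
                q = NA)

# Sum and mean over selected columns only; q is left out,
# so its NA values do not affect the results
d$sum_xy  <- rowSums(d[, c('x', 'y')])
d$mean_xz <- rowMeans(d[, c('x', 'z')])

d
```

The wrappers above offer the same result inside a pipe, with tidyselect-style column selection instead of explicit indexing.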