combn_2_col: Combinations to columns
In m-clark/tidyext: Tidy Extensions for Data Processing


Convert a character or factor variable that holds multiple labels per cell into indicator columns, one for each combination of labels.

combn_2_col(
  data,
  var,
  sep = "[^[:alnum:]]+",
  max_m = 1,
  collapse = "_",
  toInteger = FALSE,
  sparse = FALSE
)

`data`	The data frame in question.
`var`	The quoted name for the variable in question. The variable can be character or factor.
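`sep`	The label separator, for example a comma or space. The default is a regular expression that matches any run of non-alphanumeric characters.
`max_m`	The maximum number of labels in any one combination, i.e. the largest combination size to create indicators for. Default is 1.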
`collapse`	In the names of the new columns, how do you want the label combinations separated?
`toInteger`	Convert the logical indicators to integers (0/1)? Default is FALSE.
`sparse`	Return only the new indicators as a sparse matrix?

This comes up every once in a while. Someone has for whatever reason coded multiple labels into cells within a single column, and now you need those individual labels for analysis. This function will create indicator columns for every combination of labels up to max_m labels. It will also return a list column, called 'combo', a version of the original, but where the entries are more usable vectors of labels, which might be useful for further processing.
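The idea can be illustrated in base R. The following is only a minimal sketch of the general technique (split each cell, enumerate combinations up to a size limit, build one logical column per combination); it is not the tidyext implementation, and the helper `label_combos` and the toy data are hypothetical names made up for the illustration.

```r
# Illustrative sketch only; NOT the code used by combn_2_col().
d <- data.frame(id = 1:3, labs = c("A/B", "B/C", "A"), stringsAsFactors = FALSE)

# Split each cell into a vector of labels (the analogue of the 'combo' list column)
combo <- strsplit(d$labs, "[^[:alnum:]]+")

max_m <- 2

# Hypothetical helper: for one vector of labels, build every combination
# of size 1..max_m and paste the members together with `collapse`
label_combos <- function(x, max_m, collapse = "_") {
  x <- sort(unique(x))
  sizes <- seq_len(min(max_m, length(x)))
  unlist(lapply(sizes, function(m) combn(x, m, FUN = paste, collapse = collapse)))
}

per_row <- lapply(combo, label_combos, max_m = max_m)
all_combos <- sort(unique(unlist(per_row)))

# One logical indicator column per observed combination
indicators <- vapply(
  all_combos,
  function(k) vapply(per_row, function(r) k %in% r, logical(1)),
  logical(nrow(d))
)

result <- cbind(d, indicators)
result
```

Here the indicator columns cover only combinations that actually occur in the data; whether `combn_2_col` does the same is not stated above, so treat that as an assumption of the sketch.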

Note that the number of possible combinations grows very quickly when there are many unique labels, and can easily exceed what your machine can handle, so use sensible values. If you think there might be an issue, check the count first with choose(n_unique_labels, n_combinations), which gives the number of combinations without generating them as combn would.
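As a rough check before running anything expensive, the number of combinations can be counted with base R's `choose()`. The values of `n_unique_labels` and `max_m` below are made up for illustration, and the total is an upper bound that assumes every combination of every size up to `max_m` could get its own column.

```r
# Hypothetical sizes, chosen only to show how fast the count grows
n_unique_labels <- 30
max_m <- 5

# Combinations of each size 1..max_m
per_size <- choose(n_unique_labels, seq_len(max_m))
per_size

# Upper bound on the number of indicator columns
n_columns <- sum(per_size)
n_columns
```

With 30 labels the count is already in the hundreds of thousands at `max_m = 5`, which is why a small `max_m` is usually the sensible choice.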

Usually this situation is a result of poor data entry, and you'll likely need to do a little text pre-processing just to get started.

This can actually be used for one-hot encoding if max_m is set to 1, though I'll make a more efficient version of that process in a later function. The combo column becomes superfluous in this case.

If you don't need combinations and each cell has the same pattern of entry, you could use tidyr::separate.

I tested this against a model.matrix approach and two text-analysis approaches (see examples), and with a problem that was notably more sizeable than the examples. Using model.matrix wasn't viable with even that size, and surprisingly, a simple tidytext approach was consistently fastest. However, this implementation is parallelizable in two parts, and requires nothing beyond what comes with a base R installation, so it wins.

A data frame with the new indicator columns, or a sparse matrix of only the indicator columns.

library(tidyext)

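# Labels separated by '/', which the default sep matches;
# create indicators for combinations of up to 3 labels
d = data.frame(id = 1:4, labs = c('A/B', 'B/C/D/E', 'A/E', 'D/E'))
test = combn_2_col(data = d, var = 'labs', max_m = 3)
test
str(test)

# Space-separated labels; max_m = 1 gives one indicator per label
d$labs = c('A B', 'B C D E', 'A E', 'D E')
combn_2_col(data = d, var = 'labs', max_m = 1)

# Messier entries, including other punctuation and a missing value;
# split on commas only and join combination names with '-'
d$labs = c('Tom, Dick & Harriet', "J'Sean", "OBG, Andreas", NA)

combn_2_col(
  data = d,
  var = 'labs',
  sep = ',',
  max_m = 2,
  collapse = '-'
)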