combn_2_col: Combinations to columns


View source: R/combn_2_col.R

Description

Convert a character or factor variable whose cells contain multiple labels into indicator columns for the labels and their combinations.

Usage

combn_2_col(data, var, sep = "[^[:alnum:]]+", max_m = 1, collapse = "_",
  toInteger = FALSE, sparse = FALSE)

Arguments

data

The data frame in question.

var

The quoted name for the variable in question. The variable can be character or factor.

sep

The label separator, for example a comma or space. The default is a regular expression that matches any run of non-alphanumeric characters.

max_m

The maximum number of labels in any one combination. Default is 1.

collapse

In the names of the new columns, how do you want the label combinations separated?

toInteger

Convert the logical result to integers (0/1)? Default is FALSE.

sparse

Return only the new indicators as a sparse matrix?

Details

This comes up every once in a while: someone has, for whatever reason, coded multiple labels into the cells of a single column, and now you need the individual labels for analysis. This function creates an indicator column for every combination of labels, up to max_m labels per combination. It also returns a list column called 'combo', a version of the original variable in which each entry is a more usable vector of labels, which might be useful for further processing.
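To make the idea concrete, here is a minimal base R sketch of building indicators for label combinations. It is not the package's implementation: the helper name label_combo_indicators and its internals are hypothetical and only illustrate the general technique.

```r
# Hypothetical helper for illustration only; not the code behind combn_2_col.
label_combo_indicators <- function(x, sep = "[^[:alnum:]]+", max_m = 2,
                                   collapse = "_") {
  # split each cell into a sorted vector of unique, non-missing labels
  labels <- lapply(strsplit(as.character(x), split = sep), function(l) {
    sort(unique(l[!is.na(l) & nzchar(l)]))
  })

  # within each cell, form every combination of 1 to max_m labels
  combos <- lapply(labels, function(l) {
    if (length(l) == 0) return(character(0))
    sizes <- seq_len(min(max_m, length(l)))
    unlist(lapply(sizes, function(m) {
      utils::combn(l, m, FUN = paste, collapse = collapse)
    }))
  })

  # one logical column per combination observed anywhere in the data
  all_combos <- sort(unique(unlist(combos)))
  out <- matrix(FALSE, nrow = length(x), ncol = length(all_combos),
                dimnames = list(NULL, all_combos))
  for (i in seq_along(combos)) {
    if (length(combos[[i]]) > 0) out[i, combos[[i]]] <- TRUE
  }
  out
}

ind <- label_combo_indicators(c("A/B", "B/C/D/E", "A/E", "D/E"), max_m = 2)
ind
```

Only combinations that actually occur in some cell get a column in this sketch, which keeps the result far smaller than the full set of possible combinations.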

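Note that the number of possible combinations grows very quickly with the number of unique labels and can easily exceed what your machine can handle, so use sensible values of max_m. If you think there might be an issue, check with combn(n_unique_labels, n_combinations), or use choose(n_unique_labels, n_combinations) to get just the count without generating the combinations.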
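As a rough check before running anything expensive, the count of possible combinations can be computed directly with base R's choose. The values of n_unique_labels and max_m below are made up for illustration, and the total is an upper bound that assumes every possible combination could occur.

```r
# Illustrative values, not taken from any real data set
n_unique_labels <- 50
max_m <- 3

# number of possible combinations of each size from 1 to max_m
per_size <- choose(n_unique_labels, seq_len(max_m))

# upper bound on the number of distinct label combinations
n_possible <- sum(per_size)
n_possible

# choose() gives the same count as generating the combinations with combn()
small_check <- ncol(utils::combn(5, 2)) == choose(5, 2)
```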

Usually this situation is a result of poor data entry, and you'll likely need to do a little text pre-processing just to get started.

This can also be used for one-hot encoding by setting max_m to 1, though I'll make a more efficient version of that process in a later function. The combo column is superfluous in this case.
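For comparison, the single-label case can be sketched in a few lines of base R. This is an illustrative alternative, not what combn_2_col does internally, and it assumes the labels are separated by a single space.

```r
x <- c("A B", "B C D E", "A E", "D E")

# split each cell into its labels (assumes a single-space separator)
labels <- strsplit(x, " ", fixed = TRUE)

# the set of unique labels defines the indicator columns
lev <- sort(unique(unlist(labels)))

# one row per observation, one logical column per label
one_hot <- t(vapply(labels, function(l) lev %in% l, logical(length(lev))))
colnames(one_hot) <- lev
one_hot
```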

If you don't need combinations and each cell has the same pattern of entry, you could use tidyr::separate.

I tested this against a model.matrix approach and two text-analysis approaches (see examples), and with a problem that was notably more sizeable than the examples. Using model.matrix wasn't viable with even that size, and surprisingly, a simple tidytext approach was consistently fastest. However, this implementation is parallelizable in two parts, and requires nothing beyond what comes with a base R installation, so it wins.

Value

A data frame with the new indicator columns, or a sparse matrix of only the indicator columns.

Examples

library(lazerhawk)
d = data.frame(id = 1:4,
               labs = c('A/B', 'B/C/D/E', 'A/E', 'D/E'))
test = combn_2_col(data=d, var='labs', max_m=3)
test
str(test)
d$labs =  c('A B', 'B C D E', 'A E', 'D E')
combn_2_col(data=d, var='labs', max_m=1)
d$labs =  c('Tom, Dick & Harriet', "J'Sean", "OBG, Andreas", NA)
combn_2_col(data=d, var='labs', sep=',', max_m=2, collapse='-')

## Not run: 
# requires at least tidytext, plus stringr for str_split and magrittr for %>%
library(magrittr)
tidy_dtm <- function(data, var, sep='-', max_m=3) {
  init = stringr::str_split(data[[var]], pattern = sep) # creates a list of separated letters

  # the following gets the combos with an underscore separating labels in a given combo
  # this first lapply could be parallelized if need be and is probably slowest
  # probably want to change to m = min(c(4, m)) so as to only limit to 4
  # see also, combinat::combn which is slightly faster than base R below
  observation_combos = init %>%
    lapply(function(x)
      sapply(seq_along(x), function(m)
        utils::combn(x,  min(max_m, m), FUN=paste, collapse = '_')))

  # now we have a standard text analysis problem in need of a document term matrix
  documents = observation_combos %>% lapply(unlist)

  # create a 'tidy' form of documents and terms; each term (i.e. combo) only
  # occurs once in a document
  doc_df = data.frame(id=rep(data$id, sapply(documents, length)),
                      combos=unlist(documents),
                      count=1)  # each term only occurs once in the document
  doc_df %>%
    tidytext::cast_dfm(document=id, term=combos, value=count)
}

# requires at least text2vec
ttv <- function(data, var, sep='-', max_m=3) {
  docs = sapply(stringr::str_split(data[[var]], pattern=sep),
                function(str_vec)
                  sapply(seq_along(str_vec),
                         function(m)
                           combn(str_vec,
                                 m = min(max_m, m),
                                 FUN = paste,
                                 collapse = '_')
                  ) %>% unlist()
  )

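  # text2vec functions are namespaced so the package need not be attached
  toks = text2vec::itoken(docs, progressbar = FALSE)
  vocab = text2vec::create_vocabulary(toks)
  text2vec::create_dtm(toks, vectorizer = text2vec::vocab_vectorizer(vocab), progressbar = FALSE) %>%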
    as.matrix() %>%
    cbind(data,.)
}


## End(Not run)

mclark--/lazerhawk documentation built on July 17, 2018, 3:11 a.m.