brew: New brew
In bcjaeger/midy: Imputation for Predictive Analytics


brew() is the first function in the ipa workflow.

brew(data, outcome, flavor, bind_miss = FALSE)

brew_nbrs(data, outcome, bind_miss = FALSE)

brew_soft(data, outcome, bind_miss = FALSE)

`data`	a data frame with missing values.
`outcome`	column name(s) of outcomes. These values can be provided as symbols (e.g., outcome = c(a,b,c) for multiple outcomes or outcome = a for one outcome) or character values (e.g., outcome = c('a','b','c') for multiple outcomes or outcome = 'a' for a single outcome).
`flavor`	the computational approach that will be used to impute missing data. Valid options are 'kneighbors' and 'softImpute'. These values should be input as characters (e.g., 'kneighbors').
`bind_miss`	(`TRUE` / `FALSE`). If `TRUE`, a set of additional indicator columns (one for each non-outcome column) is added to `data`. Each indicator column takes the value 0 when the corresponding variable is observed in that row and 1 when it is missing. If `FALSE`, no additional columns are added to `data`.
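To make the `bind_miss` description concrete, here is a minimal base R sketch of what such indicator columns could look like. It does not call ipa; the toy data and the `_missing` column suffix are assumptions made for illustration, and the names the package actually assigns may differ.

```r
# Toy data with missing values; column names are illustrative only.
df <- data.frame(
  x1 = c(NA, 2, 3),
  x2 = c(1, NA, 3),
  outcome = c(10, 11, 12)
)

# Indicators are built for non-outcome columns only.
predictors <- setdiff(names(df), "outcome")

# 1 = missing in this row, 0 = observed.
miss <- as.data.frame(
  lapply(df[predictors], function(col) as.integer(is.na(col)))
)
names(miss) <- paste0(predictors, "_missing")  # hypothetical naming scheme

# Bind the indicators to the original columns.
df_bound <- cbind(df, miss)
print(df_bound)
```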

Brewing a great beer is not that different from imputing missing data. Once a brew is started, you can add spices (set primary parameters; see spice) and then mash the mixture (fitting imputation models; see mash). To finish the brew, add yeast (new data; see ferment), and then bottle it up (see bottle) as a tibble or matrix.

brew() includes an input variable called flavor that determines how data will be imputed. brew_nbrs() and brew_soft() are convenience functions, e.g. brew_nbrs() is a shortcut for calling brew(flavor = 'kneighbors').

Each of these functions returns an ipa_brew object with your specified flavor.

The neighbor's brew is an adaptation of Max Kuhn's nearest neighbor imputation functions in the recipes and caret packages. It also uses the gower package to compute Gower's distance.

What makes this type of nearest neighbor imputation different is its flexibility in the number of neighbors used to impute missing values and in the aggregation function applied. For example, creating 10 imputed datasets that use 1, 2, ..., 10 neighbors to impute missing values would require fitting 10 separate nearest neighbor models with conventional functions. The ipa package lets a user create all of these imputed sets with a single fit of a nearest neighbor model. Additionally, for users who want to use nearest neighbors for multiple imputation, ipa gives the option to sample 1 neighbor value at random from a neighborhood rather than aggregate the values into a summary.
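The idea of reusing one neighbor search for many neighborhood sizes can be sketched in base R. This is an illustration under simplifying assumptions, not ipa's implementation: it uses a single observed predictor and absolute distance instead of Gower's distance, and the object names (`donors`, `target_x2`) are hypothetical.

```r
set.seed(1)

# Donor rows where the variable to impute (x1) is observed.
donors <- data.frame(x1 = c(5, 7, 9, 11, 13), x2 = c(1, 2, 3, 4, 5))

# Target row: x1 is missing, x2 is observed.
target_x2 <- 2.4

# Step 1 (done once): distance from the target to every donor, then sort.
d <- abs(donors$x2 - target_x2)
ranked_x1 <- donors$x1[order(d)]

# Step 2: one imputed value per neighborhood size k = 1, ..., 5.
# cumsum / k is the mean of the k nearest donors, with no refitting.
k <- seq_along(ranked_x1)
imputed_mean <- cumsum(ranked_x1) / k

# Alternative for multiple imputation: draw one donor value at random
# from each neighborhood instead of averaging.
imputed_draw <- vapply(
  k,
  function(i) ranked_x1[sample.int(i, 1)],
  numeric(1)
)

print(data.frame(k = k, mean = imputed_mean, draw = imputed_draw))
```

The expensive part (computing and ranking distances) happens once; every neighborhood size is then a cheap summary of the same ranked donor values.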

The soft brew uses the softImpute algorithm to impute missing values. For more details on this strategy for handling missing values, please see softImpute.
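For intuition, the core idea in Mazumder et al. (2010), repeatedly filling the missing cells from a singular value decomposition with soft-thresholded singular values, can be sketched in base R. This is a simplified illustration and not the code the softImpute package or ipa runs; the function name `soft_impute_sketch`, its defaults, and the toy matrix are all assumptions.

```r
# Rank-1 toy matrix with three entries removed.
x_full <- c(1, 2, 3, 4, 5) %o% c(1, 0.5, 2)
x <- x_full
x[cbind(c(1, 3, 5), c(2, 3, 1))] <- NA

soft_impute_sketch <- function(x, lambda = 0.1, max_iter = 200, tol = 1e-8) {
  miss <- is.na(x)
  z <- x
  z[miss] <- 0  # start with zeros in the missing cells
  for (iter in seq_len(max_iter)) {
    s <- svd(z)
    d_shrunk <- pmax(s$d - lambda, 0)        # soft-threshold singular values
    low_rank <- s$u %*% (d_shrunk * t(s$v))  # U diag(d_shrunk) V'
    z_new <- x
    z_new[miss] <- low_rank[miss]            # observed cells stay fixed
    change <- sum((z_new - z)^2) / max(sum(z^2), 1e-12)
    z <- z_new
    if (change < tol) break                  # stop once the fill-in settles
  }
  z
}

x_imputed <- soft_impute_sketch(x)
print(round(x_imputed, 2))
```

Larger values of `lambda` shrink the singular values more aggressively, which yields a lower-rank and smoother fill-in.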

Gower (1971) originally defined a similarity measure (s, say) with values ranging from 0 (completely dissimilar) to 1 (completely similar). The distance returned here equals 1-s.
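The relationship between similarity and distance can be checked with a small base R sketch. It assumes complete data, range-scaled numeric variables, and exact matching for categorical variables; `gower_similarity` is a hypothetical helper written for this illustration, not a function from the gower package.

```r
df <- data.frame(
  height = c(150, 170, 190),
  weight = c(50, 70, 90),
  group  = c("a", "a", "b"),
  stringsAsFactors = FALSE
)

# Per-variable similarity, averaged over variables:
#   numeric:     1 - |x_i - x_j| / range
#   categorical: 1 if equal, 0 otherwise
gower_similarity <- function(df, i, j) {
  s <- vapply(names(df), function(nm) {
    col <- df[[nm]]
    if (is.numeric(col)) {
      1 - abs(col[i] - col[j]) / diff(range(col))
    } else {
      as.numeric(col[i] == col[j])
    }
  }, numeric(1))
  mean(s)
}

# Distance is one minus similarity.
d12 <- 1 - gower_similarity(df, 1, 2)
d13 <- 1 - gower_similarity(df, 1, 3)
d11 <- 1 - gower_similarity(df, 1, 1)
print(c(d12 = d12, d13 = d13, d11 = d11))
```

Rows 1 and 3 sit at opposite ends of every numeric range and differ in `group`, so their distance is the maximum possible, while a row compared with itself has distance 0.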

Gower, John C. "A general coefficient of similarity and some of its properties." Biometrics (1971): 857-871.

Mazumder, Rahul, Trevor Hastie, and Rob Tibshirani. "Spectral regularization algorithms for learning large incomplete matrices." Journal of Machine Learning Research 11 (2010): 2287-2322. http://www.stanford.edu/~hastie/Papers/mazumder10a.pdf

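library(ipa)  # assumes the ipa package is installed

# Toy data: three predictors and one outcome
data <- data.frame(
  x1 = 1:10,
  x2 = 10:1,
  x3 = 1:10,
  outcome = 11 + runif(10)
)

# Set x1 and x2 to missing in the first two rows
data[1:2, 1:2] <- NA

# Start a brew by naming the flavor explicitly
knn_brew <- brew(data, outcome = outcome, flavor = 'kneighbors')
sft_brew <- brew(data, outcome = outcome, flavor = 'softImpute')

# Equivalent shortcuts for the two calls above
knn_brew <- brew_nbrs(data, outcome = outcome)
sft_brew <- brew_soft(data, outcome = outcome)

print(knn_brew)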