R/smirf.R

Defines functions smirf

Documented in smirf

#' Single of multiple imputation of missing data using random forests
#'
#' Missing data (multiple) imputation using the missForest algorithm by
#' Stekhoven and Buehlmann (2012) (default) or, alternatively, the MICE with
#' random forest procedure of Doove et al (2014). The ranger (Wright and
#' Ziegler, 2017) fast implementation of random forest (training) algorithm is
#' used.
#'
#' For a full description of the missForest algorithm, see Stekhoven and
#' Buehlmann (2012). In brief, at each iteration missing values are imputed for
#' each variable by the predictions of a random forest trained on the observed
#' cases of that variable using the values of predictors from the completed data
#' set from the previous iteration. This is repeated until some measure of the
#' relationship between iterations indicates convergence - usually by decreasing
#' from the measure at the previous iteration.
#'
#' By default the columns are imputed in the order of least missing to most
#' missing. This can be over-ridden by the \code{model} argument. Columns that
#' are entirely missing are excluded. Non-integer numeric data is treated as
#' continuous and predicted by regression forests while all other data,
#' including integer and logical data, are predicted via classification forests.
#' No special treatment is given to ordered categorical data.
#'
#' The call to \code{\link[ranger]{ranger}} may be modified by the \code{...}
#' arguments, and any variable-specific argument to pass may be specified in the
#' \code{overrides} argument.
#'
#' The key modifications to the missForest procedure governed by the arguments:
#' \describe{
#'     \item{\code{gibbs}}{use the most recent predictions for each variable
#'         in training and prediction as they become available, like a Gibbs
#'         sampler by setting this to \code{T} (default is \code{F});}
#'     \item{\code{tree.imp}}{predict using a randomly selected tree for each
#'         missing value rather than use the whole-of-forest aggregated
#'         prediction by setting this to \code{T} (default is \code{F});}
#'     \item{\code{boot.train}}{train on a boot-strapped resample of the data,
#'         (default is \code{F}), and;}
#'     \item{\code{obs.only}}{train on all rows in the data set by setting to
#'         \code{F} or train on observed data only (default is \code{T}).}
#' }
#'
#' Switching the first two to \code{T} invokes a similar procedure to Multiple
#' Imputation via Chained Equations of Doove et al (2014). The third option
#' can be used to improve CI coverage (Bartlett 2014). The final option (along
#' with changes to the first two) will mimic van Buuren and
#' Groothuis-Oudshoorn (2012), except for the process for drawing values from
#' leaf nodes.
#'
#' The convergence criterion can be modified by the \code{stop.measure}
#' argument. The default is to measure the mean rank correlation between
#' iterations of the ordered data and the stationary rate of the categorical
#' data (see \code{\link{measure_correlation}}. The procedure halts when both of
#' these values are less than or equal to the previous values (see
#' \code{\link{stop_condition}}). The original Stekhoven and Buehlmann (2012)
#' measure is provided by the \code{\link{measure_stekhoven_2012}} function.
#'
#' @param X data.frame;
#'            a incomplete data set including any of numeric, logical, integer,
#'            factor and ordered data types.
#' @param model matrix;
#'            logical matrix which indicates inclusion of a predictor (named 
#'            column) in the model of an imputed value (named row), with the
#'            order of imputation being the row order, default is a matrix of
#'            ones with rows for each partially but not-completely missing
#'            variable (in order of least to most missing), and columns
#'            for every partially complete variable.
#' @param n numeric scalar;
#'            the number of imputations - i.e. number of times the missForest
#'            algorithm is used.
#' @param gibbs logical;
#'            use Gibbs sampling in training steps (\code{T}) rather than the
#'            predictions from the previous iteration (default).
#' @param tree.imp logical;
#'            use a prediction of missing data from single tree in the forest 
#'            when training (\code{T}) rather than the bagged predicted value
#'            (default).
#' @param boot.train logical;
#'            train each forest on a bootstrap sample of the observed data
#'            when \code{T}, rather than the observed data (default).
#' @param obs.only logical;
#'            train on only observed outcomes (default) or use all data
#'            including predicted/sampled values of missing outcomes (\code{T}).
#' @param verbose logical;
#'            print additional output.
#' @param X.init.fn function;
#'            creates a completed data set to be used as the initial state of
#'            the missForest procedure given two arguments; \itemize{
#'                \item a data.frame
#'                \item a list with and item indicating the missing (\code{T})
#'                      or not-missing (\code{F}) status of each column of the
#'                      first argument,
#'            } the default \code{\link{no_information_impute}} serves as an
#'            example.
#' @param stop.measure function;
#'            evaluates the difference or relationship between the two most
#'            recently completed data sets during iteration, must accept the
#'            following arguments;
#'            \describe{
#'                 \item{\code{X}}{named list with imputed values (in order of
#'                     appearance by row) for each column in the data set;}
#'                 \item{\code{Y}}{named list with imputed values (in order of
#'                     appearance by row) for each column in the data set;}
#'                 \item{\code{X_init}}{the original (mised-type) data set with
#'                      missing values replaced as at the starting point of
#'                      missForest;}
#'                 \item{\code{indicator}}{a list with the missing (\code{=T})
#'                      or not missing (\code{=F}) status of the original data
#'                      set;}
#'            } and should return a numeric (vector), the default
#'            \code{\link{measure_correlation}} serves as an example, or see the
#'            original measure proposed by Stekhoven and Buehlmann (2012) in
#'            \code{\link{measure_stekhoven_2012}}.
#' @param loop.limit numeric;
#'            maximum number of iterations within missForest procedure. 
#' @param overrides named list;
#'            (variable-wise) over-rides for arguments passed to
#'            \code{\link[ranger]{ranger}} when training on the response
#'            variable given by the name of the item.
#' @param clean.step named list;
#'            each item is a function to clean or post-process the named imputed
#'            data immediately after it is imputed, taking two arguments;
#'            \itemize{
#'                \item the subset of the data used in the current training step
#'                      which had missing values of the named data,
#'                \item the most recently imputed values of the named data,
#'            } and should return (post-processed) data of the same length and
#'            type as the second argument.
#' @param ... further arguments passed to all calls to 
#'            \code{\link[ranger]{ranger}}, e.g. \code{num.trees} for the number
#'            of trees in each forest.
#' @return list;
#'             containing the following items; \describe{
#'                 \item{call}{the call used to create the multiply imputed
#'                     data sets;}
#'                 \item{results}{list where each item (numbered) is itself a
#'                     named list of the output for an imputed data set;
#'                     \describe{
#'                         \item{\code{converged}}{boolean convergence status;}
#'                         \item{\code{imputed}}{list of imputed data by
#'                             iteration and variable;}
#'                         \item{\code{iterations}}{numeric count of iterations
#'                             before stopping criteria met;}
#'                         \item{\code{oob_error}}{list of oob error by
#'                             iteration and variable;}
#'                         \item{\code{stop_measures}}{output of the call to
#'                             \code{stop.measure} at each iteration;}
#'                     }
#'                 }
#'                 \item{which_imputed}{named list of which rows the imputed
#'                     named data belong to.}
#'             }
#'
#' @seealso \code{\link{measure_correlation}}
#'          \code{\link{measure_stekhoven_2012}}
#'          \code{\link{stop_condition}}
#'          \code{\link{no_information_impute}} \code{\link{sample_impute}}
#'          \code{\link[missForest]{missForest}}
#'          \code{\link[ranger]{ranger}}
#'
#' @references
#'
#' Bartlett, J., 2014. 'Methodology for multiple imputation for missing data in
#' electronic health record data', presented to _27th International Biometric
#' Conference_, Florence, July 6-11.
#'
#' Doove, L.L., Van Buuren, S. and Dusseldorp, E., 2014. Recursive partitioning
#' for missing data imputation in the presence of interaction effects. 
#' \emph{Computational Statistics & Data Analysis, 72}, pp. 92-104.
#' \href{https://dx.doi.org/10.1016/j.csda.2013.10.025}{doi.10.1016/j.csda.2013.10.025}
#'
#' Stekhoven, D.J. and Buehlmann, P., 2012. MissForest--non-parametric
#' missing value imputation for mixed-type data. \emph{Bioinformatics, 28}(1),
#' pp. 112-118.
#' \href{https://dx.doi.org/10.1093/bioinformatics/btr597}{doi.1.1093/bioinformatics/btr597}
#'
#' Wright, M. N. and Ziegler, A., 2017. ranger: A fast implementation of random
#' forests for high dimensional data in C++ and R. \emph{Journal of Statistical
#' Software, 77}(i01), pp. 1-17. \href{https://dx.doi.org/10.18637/jss.v077.i01}{doi.10.18637/jss.v077.i01}
#'
#' @export
smirf <- function(X,
                  model=NULL,
                  n=5L,
                  gibbs=F,
                  tree.imp=F,
                  boot.train=F,
                  obs.only=T,
                  verbose=F,
                  X.init.fn=no_information_impute,
                  stop.measure=measure_correlation,
                  loop.limit=10L,
                  overrides=list(),
                  clean.step=list(),
                  ...) {

    ranger_args <- enquos(...)

    when_verbose_print <- function(...) if (verbose) print(...)

    check_smirf_args(         X,     model,          n,     gibbs,     tree.imp,
                     boot.train,  obs.only,    verbose, X.init.fn, stop.measure,
                     loop.limit, overrides, clean.step)

    n_obs <- nrow(X)

    if (!identical(floor(n), as.numeric(n)))
        warning('non-integer value of n supplied.')

    if (!identical(floor(loop.limit), as.numeric(loop.limit)))
        warning('non-integer value of loop.limit supplied.')

    if (inherits(X, 'grouped_df'))
        warning('groups in X will be ignored.')

    # store location of missing data as list
    indicator <- lapply(X, is.na)

    # filter completely missing variables
    miss_tally <- sapply(indicator, sum)
    if (any(miss_tally == n_obs))
        warning(paste0('excluding the following entirely missing data in X:',
                       '\n  - ',
                       paste(names(miss_tally)[miss_tally == n_obs],
                             collapse=', '), '.'))

    v_use <- names(miss_tally)[miss_tally != n_obs]
    indicator <- indicator[v_use]
    miss_tally <- miss_tally[v_use]

    # sort columns by least to most missing
    if (is.null(model))
        model <- matrix(1, nrow=sum(miss_tally > 0), ncol=length(v_use),
                        dimnames=list(v_use[order(miss_tally)[miss_tally > 0]],
                                      v_use))

    # convert integers and logical values to factors
    to_categories <- get_maps_to_categories(X[v_use])
    to_categorical_functions <- lapply(X[names(to_categories)],
                                       function(j, x)
                                           unname(x[[storage.mode(j)]]),
                                       x=list('integer'=as.ordered,
                                              'logical'=as.factor))

    # invoke as.data.frame here due to ranger not supporting tibble argument
    X_init_call <- call2(X.init.fn,
                         X=as.data.frame(X[v_use]),
                         indicator=indicator)

    ranger_call <- call_modify(call2(ranger::ranger,
                                     write.forest=T,
                                     verbose=verbose),
                               !!! ranger_args)

    res <- list()
    when_verbose_print('smirf: begin imputations')

    for (j in seq_len(n)) {

        X_init <- eval_tidy(X_init_call)
        msgs <- is_valid_initial_state(X_init, X[v_use])
        if (length(msgs) > 0)
            stop(paste(msgs, collapse='\n'))

        X_init[names(to_categories)] <- mapply(function(f, x) f(x),
                                               to_categorical_functions,
                                               X_init[names(to_categories)],
                                               SIMPLIFY=F)

        res[[j]] <- perform_missforest(     X_init,     model,    indicator,
                                       ranger_call,     gibbs,     tree.imp,
                                        boot.train,  obs.only, stop.measure,
                                        loop.limit, overrides,   clean.step)

        # convert back to integer/logical
        res[[j]]$imputed <- lapply(res[[j]]$imputed,
                                   unmap_categories,
                                   to_categories)

        when_verbose_print(paste('  - imputation', j, 'complete.'))
    }

    list(call=match.call(),
         model=model,
         results=lapply(res,
                        post_process_missforest,
                        to_categories=to_categories),
         which_imputed=lapply(indicator, which))

}
stephematician/miForang documentation built on July 23, 2019, 5:11 p.m.