R/imp.rfemp.R

#' Perform multiple imputation using the empirical error distributions and
#' predicted probabilities of random forests
#'
#' @description
#' The \code{RfEmp} multiple imputation method handles mixed types of
#' variables and calls the corresponding imputation function based on each
#' variable's type.
#' Categorical variables should be of type \code{factor} or \code{logical}.
#'
#' \code{RfPred.Emp} is used for continuous variables, and \code{RfPred.Cate}
#' is used for categorical variables.
#'
#' @details
#' For continuous variables, \code{mice.impute.rfpred.emp} is called, performing
#' imputation based on the empirical distribution of out-of-bag
#' prediction errors of random forests.
#'
#' For categorical variables, \code{mice.impute.rfpred.cate} is called,
#' performing imputation based on predicted probabilities.
#'
#' @param data A data frame or a matrix containing the incomplete data. Missing
#' values should be coded as \code{NA}s.
#'
#' @param num.imp Number of multiple imputations. The default is
#' \code{num.imp = 5}.
#'
#' @param max.iter Number of iterations. The default is \code{max.iter = 5}.
#'
#' @param num.trees Number of trees to build. The default is
#' \code{num.trees = 10}.
#'
#' @param alpha.emp The "significance level" for the empirical distribution of
#' out-of-bag prediction errors. It can be used to guard against outliers
#' (helpful for highly skewed variables).
#' For example, set \code{alpha.emp = 0.05} to use the 95\% confidence level.
#' The default is \code{alpha.emp = 0.0}, which keeps the empirical
#' distribution of out-of-bag prediction errors intact.
#'
#' @param sym.dist If \code{TRUE}, the empirical distribution of out-of-bag
#' prediction errors will be assumed to be symmetric; if \code{FALSE}, the
#' empirical distribution will be kept intact. The default is
#' \code{sym.dist = TRUE}.
#'
#' @param pre.boot If \code{TRUE}, bootstrapping is performed prior to
#' imputation to achieve 'proper' multiple imputation, accommodating sampling
#' variation in estimating population regression parameters
#' (refer to Shah et al. 2014).
#' Note that if \code{TRUE}, this option remains valid even when the
#' number of trees is set to one.
#'
#' @param num.trees.cont Number of trees to build for continuous variables.
#' The default is \code{num.trees.cont = NULL} and the value of \code{num.trees}
#' will be used.
#'
#' @param num.trees.cate Number of trees to build for categorical variables.
#' The default is \code{num.trees.cate = NULL} and the value of \code{num.trees}
#' will be used.
#'
#' @param num.threads Number of threads for parallel computing. The default is
#' \code{num.threads = NULL}, in which case all available processors can be
#' used.
#'
#' @param print.flag If \code{TRUE}, details will be printed to the console.
#' The default is \code{print.flag = FALSE}.
#'
#' @param ... Other arguments to pass down.
#'
#' @return An object of S3 class \code{mids}.
#'
#' @name imp.rfemp
#'
#' @author Shangzhi Hong
#'
#' @references
#' Hong, Shangzhi, et al. "Multiple imputation using chained random forests."
#' Preprint, submitted April 30, 2020. https://arxiv.org/abs/2004.14823.
#'
#' Zhang, Haozhe, et al. "Random Forest Prediction Intervals."
#' The American Statistician (2019): 1-20.
#'
#' Shah, Anoop D., et al. "Comparison of random forest and parametric
#' imputation models for imputing missing data using MICE: a CALIBER study."
#' American journal of epidemiology 179.6 (2014): 764-774.
#'
#' Malley, James D., et al. "Probability machines." Methods of information
#' in medicine 51.01 (2012): 74-81.
#'
#' @examples
#' # Prepare data: convert categorical variables to factors
#' nhanes.fix <- nhanes
#' nhanes.fix[, c("age", "hyp")] <- lapply(nhanes[, c("age", "hyp")], as.factor)
#' # Perform imputation using imp.rfemp
#' imp <- imp.rfemp(nhanes.fix)
#' # Do repeated analyses
#' anl <- with(imp, lm(chl ~ bmi + hyp))
#' # Pool the results
#' pool <- pool(anl)
#' # Get pooled estimates
#' reg.ests(pool)
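#' # A further sketch, assuming the data prepared above: grow more trees and
#' # trim the tails of the out-of-bag error distribution. The values 50 and
#' # 0.05 are illustrative choices, not recommendations.
#' imp.trim <- imp.rfemp(nhanes.fix, num.trees = 50, alpha.emp = 0.05)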
#'
#' @export
imp.rfemp <- function(
    data,
    num.imp = 5,
    max.iter = 5,
    num.trees = 10,
    alpha.emp = 0.0,
    sym.dist = TRUE,
    pre.boot = TRUE,
    num.trees.cont = NULL,
    num.trees.cate = NULL,
    num.threads = NULL,
    print.flag = FALSE,
    ...) {
    return(mice(
        data = data,
        method = "rfemp",
        m = num.imp,
        maxit = max.iter,
        num.trees = num.trees,
        alpha.emp = alpha.emp,
        sym.dist = sym.dist,
        pre.boot = pre.boot,
        num.trees.cont = num.trees.cont,
        num.trees.cate = num.trees.cate,
        num.threads = num.threads,
        printFlag = print.flag,
        # Bypass remove.lindep() in mice >= 3.9.0
        maxcor = 1.0,
        eps = 0,
        # Bypass collinearity and constant checks
        remove.collinear = FALSE,
        remove.constant = FALSE,
        ...))
}
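
# Illustrative sketch only, NOT the package's implementation: one plausible way
# to draw imputed values for a continuous variable from the empirical
# distribution of out-of-bag (OOB) prediction errors, written in base R.
# The function name and its arguments `pred.mis` (predictions for the missing
# cells) and `oob.err` (OOB errors from the observed cells) are hypothetical;
# the exact trimming rule used here is an assumption.
sketch.emp.draw <- function(pred.mis, oob.err, alpha.emp = 0.0,
                            sym.dist = TRUE) {
    if (sym.dist) {
        # Assume a symmetric error distribution: work with absolute errors
        abs.err <- abs(oob.err)
        if (alpha.emp > 0) {
            # Drop absolute errors above the (1 - alpha.emp) quantile
            lim <- stats::quantile(abs.err, probs = 1 - alpha.emp)
            abs.err <- abs.err[abs.err <= lim]
        }
        # Mirror the retained errors around zero
        err.pool <- c(-abs.err, abs.err)
    } else {
        err.pool <- oob.err
        if (alpha.emp > 0) {
            # Two-sided trimming: keep errors between the alpha.emp / 2 and
            # 1 - alpha.emp / 2 quantiles
            lims <- stats::quantile(
                oob.err, probs = c(alpha.emp / 2, 1 - alpha.emp / 2))
            err.pool <- oob.err[oob.err >= lims[1] & oob.err <= lims[2]]
        }
    }
    # Imputed value = point prediction + randomly drawn empirical error
    pred.mis + sample(err.pool, size = length(pred.mis), replace = TRUE)
}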

RfEmpImp documentation built on July 2, 2020, 2:13 a.m.