In RfEmpImp: Multiple Imputation using Chained Random Forests

imp.rfemp

R Documentation

Perform multiple imputation using the empirical error distributions and predicted probabilities of random forests

Description

The RfEmp multiple imputation method handles mixed types of variables and calls the corresponding imputation function for each variable type. Categorical variables should be stored as a categorical type such as factor or logical.

RfPred.Emp is used for continuous variables, and RfPred.Cate is used for categorical variables.

Usage

imp.rfemp(
  data,
  num.imp = 5,
  max.iter = 5,
  num.trees = 10,
  alpha.emp = 0,
  sym.dist = TRUE,
  pre.boot = TRUE,
  num.trees.cont = NULL,
  num.trees.cate = NULL,
  num.threads = NULL,
  print.flag = FALSE,
  ...
)

Arguments

`data`	A data frame or a matrix containing the incomplete data. Missing values should be coded as `NA`s.
`num.imp`	Number of multiple imputations. The default is `num.imp = 5`.
`max.iter`	Number of iterations. The default is `max.iter = 5`.
`num.trees`	Number of trees to build. The default is `num.trees = 10`.
`alpha.emp`	The "significance level" for the empirical distribution of out-of-bag prediction errors. It can be used to guard against outliers (helpful for highly skewed variables). For example, set `alpha.emp = 0.05` to use the 95% confidence level. The default is `alpha.emp = 0.0`, which keeps the empirical distribution of out-of-bag prediction errors intact.
`sym.dist`	If `TRUE`, the empirical distribution of out-of-bag prediction errors will be assumed to be symmetric; if `FALSE`, the empirical distribution will be kept intact. The default is `sym.dist = TRUE`.
`pre.boot`	If `TRUE`, the data are bootstrapped prior to imputation so that the multiple imputation is 'proper', that is, it accommodates sampling variation in estimating the population regression parameters (refer to Shah et al. 2014). Note that when `TRUE`, this option is valid even if the number of trees is set to one. The default is `pre.boot = TRUE`.
`num.trees.cont`	Number of trees to build for continuous variables. The default is `num.trees.cont = NULL` and the value of `num.trees` will be used.
`num.trees.cate`	Number of trees to build for categorical variables. The default is `num.trees.cate = NULL` and the value of `num.trees` will be used.
`num.threads`	Number of threads for parallel computing. The default is `num.threads = NULL` and all available processors can be used.
`print.flag`	If `TRUE`, details will be sent to the console. The default is `print.flag = FALSE`.
`...`	Other arguments to pass down.
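
As an illustration of how these arguments combine, the following sketch overrides several defaults in one call. It assumes the mice and RfEmpImp packages are installed; the seed and the particular argument values are arbitrary choices for the sketch, not recommendations.

```r
library(mice)      # provides the nhanes example data
library(RfEmpImp)

set.seed(2020)  # arbitrary seed, chosen only for this sketch

# Categorical variables must be factors (or logical) before imputation
nhanes.fix <- nhanes
nhanes.fix[, c("age", "hyp")] <- lapply(nhanes[, c("age", "hyp")], as.factor)

imp <- imp.rfemp(
  nhanes.fix,
  num.imp = 3,          # three imputed datasets instead of five
  max.iter = 2,         # two iterations per imputation
  num.trees.cont = 20,  # trees for continuous variables
  num.trees.cate = 10,  # trees for categorical variables
  alpha.emp = 0.05,     # trim the out-of-bag error distribution
  sym.dist = FALSE,     # do not assume symmetric errors
  num.threads = 1       # single thread
)
```

Setting `num.trees.cont` and `num.trees.cate` separately leaves `num.trees` unused for both variable types, since each type-specific value takes precedence over the shared default.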

Details

For continuous variables, mice.impute.rfpred.emp is called, performing imputation based on the empirical distribution of out-of-bag prediction errors of random forests.
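
The idea of imputing from an empirical error distribution can be sketched in base R. This is a simplified illustration under stated assumptions, not the package's code: the out-of-bag errors and predictions are simulated rather than taken from a fitted forest, and `make.err.pool` is a hypothetical helper showing one plausible way `alpha.emp` and `sym.dist` could shape the pool of errors.

```r
set.seed(1)

# Hypothetical inputs: out-of-bag errors for the observed cases, and forest
# predictions for the cases whose values are missing
oob.err <- rnorm(200, mean = 0, sd = 2)
pred.mis <- c(10.2, 11.5, 9.8)

# Hypothetical helper: build the pool of errors that imputations draw from
make.err.pool <- function(err, alpha.emp = 0, sym.dist = TRUE) {
  if (sym.dist) {
    # Treat the distribution as symmetric: work with absolute errors,
    # optionally drop the largest ones, then mirror around zero
    abs.err <- abs(err)
    if (alpha.emp > 0) {
      abs.err <- abs.err[abs.err <= quantile(abs.err, 1 - alpha.emp)]
    }
    c(-abs.err, abs.err)
  } else {
    # Keep the shape of the distribution, optionally trimming both tails
    if (alpha.emp > 0) {
      bounds <- quantile(err, c(alpha.emp / 2, 1 - alpha.emp / 2))
      err <- err[err >= bounds[1] & err <= bounds[2]]
    }
    err
  }
}

err.pool <- make.err.pool(oob.err, alpha.emp = 0.05, sym.dist = TRUE)

# Imputed value = prediction + an error drawn at random from the pool
imputed <- pred.mis + sample(err.pool, size = length(pred.mis), replace = TRUE)
```

Drawing errors from the observed out-of-bag residuals, rather than from a fitted normal distribution, lets the imputations reflect skewness or heavy tails in the data when `sym.dist = FALSE`.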

For categorical variables, mice.impute.rfpred.cate is called, performing imputation based on predicted probabilities.
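
Imputing from predicted probabilities can be illustrated with a short base R sketch. The probability matrix below is made up for the example (in practice it would come from a fitted random forest), and the code is an assumption about the general technique rather than the package's implementation.

```r
set.seed(1)

# Hypothetical predicted class probabilities:
# one row per case with a missing value, one column per factor level
prob.mat <- matrix(
  c(0.7, 0.2, 0.1,
    0.1, 0.1, 0.8,
    0.3, 0.4, 0.3),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("a", "b", "c"))
)

# Draw one level per case, using that case's predicted probabilities
draws <- apply(prob.mat, 1, function(p) {
  sample(colnames(prob.mat), size = 1, prob = p)
})

# Store the result as a factor with the full set of levels
imputed <- factor(draws, levels = colnames(prob.mat))
```

Sampling from the probabilities, instead of always assigning the most likely level, keeps between-imputation variability in the imputed categories.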

Value

An object of S3 class mids.
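
Because the returned object is a mids object, the usual mice tools should apply to it. The sketch below assumes the mice and RfEmpImp packages are installed and uses small, arbitrary values of `num.imp` and `max.iter` to keep the run short.

```r
library(mice)      # provides nhanes and complete()
library(RfEmpImp)

nhanes.fix <- nhanes
nhanes.fix[, c("age", "hyp")] <- lapply(nhanes[, c("age", "hyp")], as.factor)

imp <- imp.rfemp(nhanes.fix, num.imp = 2, max.iter = 2)

class(imp)  # includes "mids"
imp$m       # number of imputations stored in the object

# Extract the first completed dataset
first <- complete(imp, 1)

# Stack all completed datasets in long format
long <- complete(imp, action = "long")
```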

Author(s)

Shangzhi Hong

References

Hong, Shangzhi, et al. "Multiple imputation using chained random forests." Preprint, submitted April 30, 2020. https://arxiv.org/abs/2004.14823.

Zhang, Haozhe, et al. "Random Forest Prediction Intervals." The American Statistician (2019): 1-20.

Shah, Anoop D., et al. "Comparison of random forest and parametric imputation models for imputing missing data using MICE: a CALIBER study." American Journal of Epidemiology 179.6 (2014): 764-774.

Malley, James D., et al. "Probability machines." Methods of Information in Medicine 51.01 (2012): 74-81.

Examples

# Load the packages: mice provides the nhanes data and pool()
library(mice)
library(RfEmpImp)
# Prepare data: convert categorical variables to factors
nhanes.fix <- nhanes
nhanes.fix[, c("age", "hyp")] <- lapply(nhanes[, c("age", "hyp")], as.factor)
# Perform imputation using imp.rfemp
imp <- imp.rfemp(nhanes.fix)
# Fit the analysis model on each imputed dataset
anl <- with(imp, lm(chl ~ bmi + hyp))
# Pool the results (named pooled to avoid masking the pool() function)
pooled <- pool(anl)
# Get pooled estimates
reg.ests(pooled)

RfEmpImp documentation built on Oct. 20, 2022, 9:06 a.m.