View source: R/mice.impute.rfcont.R
This method can be used to impute continuous variables in MICE by specifying method = 'rfcont'. It was developed independently from the mice.impute.rf algorithm of Doove et al., and differs from it in drawing imputed values from a normal distribution.
y | a vector of observed values and missing values of the variable to be imputed.
ry | a logical vector stating whether y is observed or not.
x | a matrix of predictors to impute y.
ntree_cont | number of trees, default = 10. A global option can be set with setRFoptions.
nodesize_cont | minimum size of nodes, default = 5. A global option can be set with setRFoptions.
maxnodes_cont | maximum number of nodes, default NULL. If NULL, the number of nodes is determined by the number of observations and nodesize_cont.
ntree | an alternative argument for specifying the number of trees, over-ridden by
... | other arguments to pass to randomForest.
This Random Forest imputation algorithm has been developed as an alternative to normal-based linear regression, and can accommodate non-linear relations and interactions among the predictor variables without requiring them to be specified in the model. The algorithm takes a bootstrap sample of the data to simulate sampling variability, fits a regression forest and calculates the out-of-bag mean squared error. Each value is imputed as a random draw from a normal distribution with mean equal to the Random Forest prediction and variance equal to the out-of-bag mean squared error.
If only one tree is used (not recommended), a bootstrap sample is not taken in the first stage because the Random Forest algorithm performs an internal bootstrap sample before fitting the tree.
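The steps described above can be sketched in a few lines of R. This is a minimal illustration under stated assumptions, not the package's source code: the function name rfcont_sketch, its argument names and the simulated data are hypothetical. It relies only on randomForest(), predict() and the mse component that randomForest returns for regression forests.

```r
library(randomForest)

# Hypothetical helper (not the package's implementation): one imputation pass.
rfcont_sketch <- function(y, ry, x, ntree = 10, nodesize = 5) {
  x <- as.matrix(x)
  n_obs <- sum(ry)
  # Bootstrap the observed rows to simulate sampling variability.
  # With a single tree this step is skipped, because randomForest
  # already bootstraps internally before fitting the tree.
  idx <- if (ntree > 1) sample(n_obs, replace = TRUE) else seq_len(n_obs)
  xobs <- x[ry, , drop = FALSE][idx, , drop = FALSE]
  yobs <- y[ry][idx]
  xmiss <- x[!ry, , drop = FALSE]

  # Fit a regression forest on the (bootstrapped) observed data.
  fit <- randomForest(xobs, yobs, ntree = ntree, nodesize = nodesize)
  yhat <- predict(fit, xmiss)
  # fit$mse[k] is the out-of-bag mean squared error using the first k trees.
  oob_mse <- fit$mse[ntree]
  # Draw each imputed value from N(prediction, out-of-bag MSE).
  rnorm(length(yhat), mean = yhat, sd = sqrt(oob_mse))
}

# Assumed toy data: a non-linear outcome with 20 values set to missing.
set.seed(1)
n <- 100
x <- cbind(x1 = rnorm(n), x2 = rnorm(n))
y <- x[, "x1"] + x[, "x2"]^2 + rnorm(n)
ry <- rep(TRUE, n)
ry[1:20] <- FALSE
y[!ry] <- NA
imputed <- rfcont_sketch(y, ry, x)
```

Because the draw adds noise with variance equal to the out-of-bag error, repeated calls give different imputed values for the same missing entries. This is what multiple imputation needs in order to reflect uncertainty in the imputed data.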
A vector of imputed values of y.
This algorithm has been tested on simulated data with linear regression, and in survival analysis of real data with artificially introduced missingness at random. On the simulated data there was slight bias if the distribution of missing values was very different from that of the observed values, because imputed values were closer to the centre of the data than the missing values. However, in the survival analysis the hazard ratios were unbiased and the coverage of confidence intervals was more conservative than with normal-based MICE, but the mean length of confidence intervals was shorter with mice.impute.rfcont.
Anoop Shah
Shah AD, Bartlett JW, Carpenter J, Nicholas O, Hemingway H. Comparison of Random Forest and parametric imputation models for imputing missing data using MICE: a CALIBER study. American Journal of Epidemiology 2014. doi: 10.1093/aje/kwt312
setRFoptions, mice.impute.rfcat, mice, mice.impute.rf, mice.impute.cart, randomForest
library(CALIBERrfimpute)  # also loads mice
set.seed(1)
# A small dataset with a single row to be imputed
mydata <- data.frame(x1 = c(2, 3, NA, 4, 5, 1, 6, 8, 7, 9), x2 = 1:10,
x3 = c(1, 3, NA, 4, 2, 8, 7, 9, 6, 5))
mice(mydata, method = c('norm', 'norm', 'norm'), m = 2, maxit = 2)
mice(mydata[, 1:2], method = c('rfcont', 'rfcont'), m = 2, maxit = 2)
mice(mydata, method = c('rfcont', 'rfcont', 'rfcont'), m = 2, maxit = 2)
# A larger simulated dataset
mydata <- simdata(100)
cat('\nSimulated multivariate normal data:\n')
print(data.frame(mean = colMeans(mydata), sd = sapply(mydata, sd)))
# Apply missingness pattern
mymardata <- makemar(mydata)
cat('\nNumber of missing values:\n')
print(sapply(mymardata, function(x){sum(is.na(x))}))
# Test imputation of a single column in a two-column dataset
cat('\nTest imputation of a simple dataset')
print(mice(mymardata[, c('y', 'x1')], method = 'rfcont'))
# Analyse data
cat('\nFull data analysis:\n')
print(summary(lm(y ~ x1 + x2 + x3, data=mydata)))
cat('\nMICE using normal-based linear regression:\n')
print(summary(pool(with(mice(mymardata,
method = 'norm'), lm(y ~ x1 + x2 + x3)))))
# Set options for Random Forest
setRFoptions(ntree_cont = 10)
cat('\nMICE using Random Forest:\n')
print(summary(pool(with(mice(mymardata,
method = 'rfcont'), lm(y ~ x1 + x2 + x3)))))
Loading required package: mice
Loading required package: lattice
Attaching package: 'mice'
The following objects are masked from 'package:base':
cbind, rbind
iter imp variable
1 1 x1 x3
1 2 x1 x3
2 1 x1 x3
2 2 x1 x3
Class: mids
Number of multiple imputations: 2
Imputation methods:
x1 x2 x3
"norm" "" "norm"
PredictorMatrix:
x1 x2 x3
x1 0 1 1
x2 1 0 1
x3 1 1 0
iter imp variable
1 1 x1
1 2 x1
2 1 x1
2 2 x1
Class: mids
Number of multiple imputations: 2
Imputation methods:
x1 x2
"rfcont" ""
PredictorMatrix:
x1 x2
x1 0 1
x2 1 0
Warning message:
In randomForest.default(xobs, yobs, ntree = ntree_cont, nodesize = nodesize_cont, :
The response has five or fewer unique values. Are you sure you want to do regression?
iter imp variable
1 1 x1 x3
1 2 x1 x3
2 1 x1 x3
2 2 x1 x3
Class: mids
Number of multiple imputations: 2
Imputation methods:
x1 x2 x3
"rfcont" "" "rfcont"
PredictorMatrix:
x1 x2 x3
x1 0 1 1
x2 1 0 1
x3 1 1 0
Warning messages:
1: In randomForest.default(xobs, yobs, ntree = ntree_cont, nodesize = nodesize_cont, :
The response has five or fewer unique values. Are you sure you want to do regression?
2: In randomForest.default(xobs, yobs, ntree = ntree_cont, nodesize = nodesize_cont, :
The response has five or fewer unique values. Are you sure you want to do regression?
Simulated multivariate normal data:
mean sd
y 0.32938725 2.2081239
x1 0.13005773 1.0340474
x2 0.11864584 0.9351751
x3 0.03305713 1.0770718
x4 -0.16307731 0.9477759
Number of missing values:
y x1 x2 x3 x4
0 20 19 0 0
Test imputation of a simple dataset
iter imp variable
1 1 x1
1 2 x1
1 3 x1
1 4 x1
1 5 x1
2 1 x1
2 2 x1
2 3 x1
2 4 x1
2 5 x1
3 1 x1
3 2 x1
3 3 x1
3 4 x1
3 5 x1
4 1 x1
4 2 x1
4 3 x1
4 4 x1
4 5 x1
5 1 x1
5 2 x1
5 3 x1
5 4 x1
5 5 x1
Class: mids
Number of multiple imputations: 5
Imputation methods:
y x1
"" "rfcont"
PredictorMatrix:
y x1
y 0 1
x1 1 0
Full data analysis:
Call:
lm(formula = y ~ x1 + x2 + x3, data = mydata)
Residuals:
Min 1Q Median 3Q Max
-3.5180 -0.5501 -0.0813 0.7073 2.1814
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.07842 0.10316 0.760 0.449
x1 0.76546 0.10197 7.507 3.10e-11 ***
x2 0.95361 0.11724 8.134 1.48e-12 ***
x3 1.15782 0.09918 11.673 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.018 on 96 degrees of freedom
Multiple R-squared: 0.7938, Adjusted R-squared: 0.7873
F-statistic: 123.2 on 3 and 96 DF, p-value: < 2.2e-16
MICE using normal-based linear regression:
iter imp variable
1 1 x1 x2
1 2 x1 x2
1 3 x1 x2
1 4 x1 x2
1 5 x1 x2
2 1 x1 x2
2 2 x1 x2
2 3 x1 x2
2 4 x1 x2
2 5 x1 x2
3 1 x1 x2
3 2 x1 x2
3 3 x1 x2
3 4 x1 x2
3 5 x1 x2
4 1 x1 x2
4 2 x1 x2
4 3 x1 x2
4 4 x1 x2
4 5 x1 x2
5 1 x1 x2
5 2 x1 x2
5 3 x1 x2
5 4 x1 x2
5 5 x1 x2
estimate std.error statistic df p.value
(Intercept) 0.01549562 0.1067859 0.1451093 61.72944 8.850974e-01
x1 0.75760755 0.1154612 6.5615756 33.43159 1.247462e-08
x2 0.88846315 0.1209948 7.3429887 38.98938 5.596301e-10
x3 1.07808609 0.1108527 9.7253973 41.42490 4.529710e-14
Setting option CALIBERrfimpute_ntree_cont = 10
MICE using Random Forest:
iter imp variable
1 1 x1 x2
1 2 x1 x2
1 3 x1 x2
1 4 x1 x2
1 5 x1 x2
2 1 x1 x2
2 2 x1 x2
2 3 x1 x2
2 4 x1 x2
2 5 x1 x2
3 1 x1 x2
3 2 x1 x2
3 3 x1 x2
3 4 x1 x2
3 5 x1 x2
4 1 x1 x2
4 2 x1 x2
4 3 x1 x2
4 4 x1 x2
4 5 x1 x2
5 1 x1 x2
5 2 x1 x2
5 3 x1 x2
5 4 x1 x2
5 5 x1 x2
estimate std.error statistic df p.value
(Intercept) 0.1345584 0.1242239 1.083193 52.93743 2.821021e-01
x1 0.7741545 0.1262587 6.131494 77.02406 3.469562e-08
x2 0.8320893 0.1462016 5.691381 29.27539 2.177719e-07
x3 1.2248730 0.1426900 8.584152 17.30520 7.585044e-13