View source: R/forest_impute.R
In the unsupervised case, a tree ensemble is built on the imputed data of the previous iteration in an unsupervised way and used to impute the data again, until a stopping criterion is reached. In the supervised case, the forest is grown in a supervised way (a response is used) at every iteration. See 'Details'.
forest_impute(dataset, responseVarName, method = "synthetic",
  predictMethod = "terminalNodes", implementation = "ranger",
  tol = 0.05, maxIter = 10L, seed = 1L, nproc = 1L, ...)
dataset: A list with two components, in this order (as in the examples): the completed dataset, with missing values filled in by an initial imputation, and a data frame of logicals marking the originally missing positions.
responseVarName: (string) Name of the response variable (supervised case).
method: (string) Method used to build the tree ensemble when object is missing. Currently, only "synthetic" is implemented.
predictMethod: (string) Method used to compute the proximity matrix. Currently, only "terminalNodes" is implemented.
implementation: (string) One of "ranger" or "randomForest".
tol: (number between 0 and 1) Threshold for the change in the metric. See 'Details'.
maxIter: (positive integer) Maximum number of iterations.
seed: (positive integer) Seed for growing the forest.
nproc: (positive integer) Number of parallel processes to use.
...: Arguments passed to synthetic_forest in the unsupervised case.
In the unsupervised case, when the "synthetic" method is chosen, a random forest is grown on 'datasetComplete' to separate the actual data from synthetic data. When predictMethod is "terminalNodes", the proximity matrix is then computed. In the supervised case, the forest is grown with the specified response.
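As a rough illustration of what a terminal-node proximity matrix is, the sketch below counts, for each pair of observations, the fraction of trees in which they fall into the same terminal node. The helper `proximity_from_nodes` and the toy `nodes` matrix are hypothetical and not part of the package; the package's own computation may differ. With ranger, a node matrix of this shape can typically be obtained via `predict(fit, data, type = "terminalNodes")$predictions`.

```r
# Hypothetical helper (not part of the package).
# nodes: n x ntree matrix; nodes[i, t] is the terminal node id of
# observation i in tree t.
proximity_from_nodes <- function(nodes) {
  n <- nrow(nodes)
  prox <- matrix(0, n, n)
  for (t in seq_len(ncol(nodes))) {
    # outer() flags the pairs that share a terminal node in tree t
    prox <- prox + outer(nodes[, t], nodes[, t], "==")
  }
  # fraction of trees in which each pair shares a terminal node
  prox / ncol(nodes)
}

# toy input: 4 observations, 3 trees (made-up node ids)
nodes <- cbind(c(1, 1, 2, 2), c(3, 3, 3, 4), c(5, 6, 5, 6))
prox <- proximity_from_nodes(nodes)
prox
```

Observations that are often routed to the same leaf get a proximity close to 1; pairs that never share a leaf get 0.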
Missing values in each covariate are imputed by a weighted average of the non-missing values of that covariate, with the proximities as weights. The result is the new 'datasetComplete'.
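The weighted average described above can be sketched for a single numeric covariate as follows. This is a minimal sketch under stated assumptions: `impute_numeric` is a hypothetical helper, the fallback of keeping the current value when all weights are zero is an assumption, and the treatment of factor covariates is not covered here.

```r
# Hypothetical helper (not part of the package).
# x:    numeric covariate (current values, including earlier imputations)
# miss: logical vector, TRUE where the value was originally missing
# prox: n x n proximity matrix
impute_numeric <- function(x, miss, prox) {
  out <- x
  for (i in which(miss)) {
    w <- prox[i, !miss]            # proximities to the observed cases
    if (sum(w) > 0) {              # assumption: keep old value if no weight
      out[i] <- sum(w * x[!miss]) / sum(w)
    }
  }
  out
}

# toy input: the 4th value is a placeholder for a missing entry
x    <- c(1, 3, 10, 0)
miss <- c(FALSE, FALSE, FALSE, TRUE)
prox <- diag(4)
prox[4, 1:3] <- c(0.5, 0.5, 0)
x_new <- impute_numeric(x, miss, prox)
x_new
```

In the toy input the fourth observation is equally close to the first two and has no proximity to the third, so its imputed value is the plain average of 1 and 3.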
This is repeated until the maximum number of iterations specified by "maxIter" is reached, unless the change in the metric between consecutive iterations (MAPE, the mean absolute percentage error, for continuous data; proportion of disagreements for factors) falls below the threshold "tol" for every covariate, in which case the iterations stop early.
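The stopping rule can be pictured with the sketch below, which compares two consecutive imputed datasets column by column. The names `change_metric` and `converged` are hypothetical, and the choice of the newer value as the MAPE denominator is an assumption; the package may compute the change differently.

```r
# Hypothetical helpers (not part of the package).
# Change between two consecutive imputations of one covariate,
# evaluated only on the originally missing positions.
change_metric <- function(old, new, miss) {
  if (sum(miss) == 0) return(0)
  if (is.numeric(old)) {
    # MAPE-style change; denominator choice is an assumption
    mean(abs((new[miss] - old[miss]) / new[miss]))
  } else {
    # proportion of disagreements for factors
    mean(old[miss] != new[miss])
  }
}

# TRUE when every covariate changed by less than tol
converged <- function(old_df, new_df, miss_df, tol = 0.05) {
  changes <- unlist(Map(change_metric, old_df, new_df, miss_df))
  all(changes < tol)
}

# toy input: two consecutive imputations of a small dataset
old_df  <- data.frame(a = c(1, 2, 4),   b = factor(c("u", "v", "v")))
new_df  <- data.frame(a = c(1, 2, 4.1), b = factor(c("u", "v", "v")))
miss_df <- data.frame(a = c(FALSE, FALSE, TRUE), b = c(FALSE, TRUE, FALSE))

stop_loose  <- converged(old_df, new_df, miss_df, tol = 0.05)
stop_strict <- converged(old_df, new_df, miss_df, tol = 0.01)
```

Here the numeric column changes by roughly 2.4% and the factor column does not change, so the loop would stop at tol = 0.05 but continue at tol = 0.01.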
A list with these elements:
data: The imputed dataset.
iter: Number of iterations.
errors: A vector with the value of the metric at the last iteration, one entry per covariate.
## Not run:
# example of unsupervised imputation
library("magrittr")
# create 20% artificial missing values at random
iris_with_na <- missRanger::generateNA(iris, 0.2, seed = 1)
# impute with mean/mode
iris_complete <- randomForest::na.roughfix(iris_with_na)
# dataframe of missing positions
iris_missing <- is.na(iris_with_na) %>% as.data.frame()
imp1 <- forest_impute(list(iris_complete, iris_missing)
, implementation = "ranger"
)
imp1 <- forest_impute(list(iris_complete, iris_missing)
, implementation = "randomForest"
)
imp1$iter # number of iterations
imp1$errors # errors of the last iteration
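# error on the imputed positions: x = imputed, y = actual, z = missing mask
metric_relative <- function(x, y, z){
  if(sum(z) == 0){
    return(0) # nothing was imputed in this column
  }
  if(is.numeric(x)){
    mean(abs((y[z] - x[z])/y[z])) # mean absolute percentage error
  } else {
    sum(x[z] != y[z])/sum(z) # proportion of disagreements
  }
}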
compare_roughimpute_with_actual <-
Map(metric_relative, iris_complete, iris, iris_missing) %>%
unlist()
compare_forest_impute_with_actual <-
Map(metric_relative, imp1$data, iris, iris_missing) %>%
unlist()
perf <- data.frame(
colnames = names(compare_forest_impute_with_actual)
, rough = round(compare_roughimpute_with_actual, 2)
, forest = round(compare_forest_impute_with_actual, 2)
)
rownames(perf) <- NULL
perf
# example of supervised imputation
# create data for supervised case
iris_complete2 <- iris_complete
iris_complete2$Species <- iris$Species
iris_missing2 <- iris_missing
# the response has no missing values: one FALSE per row
iris_missing2$Species <- rep(FALSE, nrow(iris_missing))
imp2 <- forest_impute(list(iris_complete2, iris_missing2)
, "Species"
, implementation = "ranger"
)
imp2 <- forest_impute(list(iris_complete2, iris_missing2)
, "Species"
, implementation = "randomForest"
)
compare_forest_impute_sup_with_actual <-
Map(metric_relative, imp2$data, iris, iris_missing2) %>% unlist()
perf2 <- data.frame(
colnames = names(compare_forest_impute_sup_with_actual)
, rough = round(compare_roughimpute_with_actual, 2)
, forest_sup = round(compare_forest_impute_sup_with_actual, 2)
)
rownames(perf2) <- NULL
perf2
cbind(perf, forest_sup = perf2[,3])
## End(Not run)