forest_impute: Impute using a tree ensemble in an unsupervised or supervised setting


View source: R/forest_impute.R

Description

In the unsupervised case, a tree ensemble is built without a response on the imputed data of the previous iteration and is used to impute the data again, until a stopping criterion is reached. In the supervised case, the forest is grown with a response variable at every iteration and used to impute. See 'Details'.

Usage

forest_impute(dataset, responseVarName, method = "synthetic",
  predictMethod = "terminalNodes", implementation = "ranger",
  tol = 0.05, maxIter = 10L, seed = 1L, nproc = 1L, ...)

Arguments

dataset

A list with two components:

  • First item (datasetComplete) should be a dataframe without missing values.

  • Second item (datasetMissingBoolean) should be a dataframe with TRUE at the position where data is missing, FALSE otherwise. The dimension and column names should be identical to datasetComplete.
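As an illustration of the expected input shape, the following sketch builds the two components with base R only. The rough fill (median for numeric columns, most frequent level otherwise) and the names `raw`, `rough_fill`, `dataset_complete` and `dataset_missing_boolean` are assumptions made for this example; any complete starting imputation should serve the same purpose.

```r
# Sketch: build the two-component `dataset` list using base R only.
set.seed(1)
raw <- iris

# Blank out 30 randomly chosen cells (20%) in each column (illustrative choice)
for (col in names(raw)) {
  raw[sample(nrow(raw), size = 30), col] <- NA
}

# Second component: TRUE where a value is missing, FALSE otherwise
dataset_missing_boolean <- as.data.frame(is.na(raw))

# First component: a rough fill so that no missing values remain
rough_fill <- function(x) {
  if (is.numeric(x)) {
    x[is.na(x)] <- median(x, na.rm = TRUE)
  } else {
    x[is.na(x)] <- names(which.max(table(x)))
  }
  x
}
dataset_complete <- as.data.frame(lapply(raw, rough_fill))

# Checks mirroring the documented requirements
stopifnot(
  !anyNA(dataset_complete),
  identical(dim(dataset_complete), dim(dataset_missing_boolean)),
  identical(names(dataset_complete), names(dataset_missing_boolean))
)

dataset <- list(dataset_complete, dataset_missing_boolean)
```

The resulting `dataset` can then be passed as the first argument of forest_impute.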

responseVarName

(string) Name of the response variable (supervised case)

method

(string) A method to build the tree ensemble when object is missing. Currently, only "synthetic" is implemented.

predictMethod

(string) Method to compute the proximity matrix. Currently, only "terminalNodes" is implemented.
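How forest_impute computes and uses the proximity matrix internally is not described on this page. As a rough sketch of the general idea behind terminal-node proximities, the proximity of two observations can be taken as the fraction of trees in which they fall into the same terminal node. The matrix `terminal_nodes` below is made up for illustration, and the proximity-weighted average at the end follows the approach that randomForest::rfImpute documents for numeric variables; it is not necessarily what this function does.

```r
# Hypothetical terminal-node ids: rows are observations, columns are trees.
# In practice these would come from a fitted forest; here they are made up.
terminal_nodes <- matrix(
  c(1, 1, 2,
    1, 2, 2,
    3, 2, 5,
    3, 4, 5),
  ncol = 3, byrow = TRUE
)

# proximity[i, j] = fraction of trees in which i and j share a terminal node
proximity_from_nodes <- function(nodes) {
  n    <- nrow(nodes)
  prox <- matrix(0, n, n)
  for (tree in seq_len(ncol(nodes))) {
    prox <- prox + outer(nodes[, tree], nodes[, tree], "==")
  }
  prox / ncol(nodes)
}

prox <- proximity_from_nodes(terminal_nodes)

# Proximity-weighted average for a numeric column: a missing cell is replaced
# by the average of the observed values, weighted by proximity.
# (If all weights were zero, the division below would give NaN.)
x          <- c(10, NA, 30, 40)
is_missing <- is.na(x)
for (i in which(is_missing)) {
  w    <- prox[i, !is_missing]
  x[i] <- sum(w * x[!is_missing]) / sum(w)
}
```

Observations that share terminal nodes in many trees get a larger weight, so the imputed value is pulled towards the values of "similar" rows.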

implementation

(string) One among: 'ranger', 'randomForest'

tol

(number between 0 and 1) Threshold for the change of the metric. See 'details'.

maxIter

(positive integer) Maximum number of iterations.
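The exact metric that tol is compared against is not given on this page. The sketch below only illustrates how a tolerance and an iteration cap typically interact in an iterative imputation loop. The helpers `relative_change` and `one_pass` are hypothetical stand-ins: the first assumes a relative change between successive imputations, the second replaces the forest-based imputation step with a toy update.

```r
# Hypothetical metric: relative change between two successive imputations
relative_change <- function(previous, current) {
  sum(abs(current - previous)) / sum(abs(previous))
}

# Stand-in for one imputation pass: each pass moves the imputed values
# halfway towards a fixed target, so the loop converges.
one_pass <- function(values, target = c(5, 7)) {
  values + (target - values) / 2
}

tol      <- 0.05
max_iter <- 10L
values   <- c(1, 1)   # initial rough imputations
iter     <- 0L

repeat {
  iter       <- iter + 1L
  new_values <- one_pass(values)
  change     <- relative_change(values, new_values)
  values     <- new_values
  # stop when the change is small enough or the iteration cap is reached
  if (change < tol || iter >= max_iter) break
}
```

With a smaller tol the loop runs longer; max_iter bounds the run time when the change never drops below the threshold.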

seed

(positive integer) Seed for growing the forest.

nproc

(positive integer) Number of parallel processes to be used.

...

Arguments to be passed to synthetic_forest in the unsupervised case.

Details

Value

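A list whose elements include (as used in the examples):

  • data: the dataset after imputation.

  • iter: number of iterations.

  • errors: errors of the last iteration.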

See Also

rfImpute

Examples

## Not run: 
# example of unsupervised imputation

library("magrittr")

# create 20% artificial missing values at random
iris_with_na  <- missRanger::generateNA(iris, 0.2, seed = 1)
# impute with mean/mode
iris_complete <- randomForest::na.roughfix(iris_with_na)
# dataframe of missing positions
iris_missing  <- is.na(iris_with_na) %>% as.data.frame()

imp1        <- forest_impute(list(iris_complete, iris_missing)
                             , implementation = "ranger"
                             )

imp1        <- forest_impute(list(iris_complete, iris_missing)
                             , implementation = "randomForest"
                             )

imp1$iter # number of iterations
imp1$errors # errors of the last iteration

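# x: imputed column, y: actual column, z: logical vector of missing positions
metric_relative <- function(x, y, z){

  # nothing was imputed in this column
  if(sum(z) == 0){
    return(0)
  }

  if(is.numeric(x)){
    # mean absolute relative error over the imputed positions
    mean(abs((y[z] - x[z])/y[z]))
  } else {
    # misclassification rate over the imputed positions
    sum(x[z] != y[z])/sum(z)
  }

}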

compare_roughimpute_with_actual <-
  Map(metric_relative, iris_complete, iris, iris_missing) %>%
    unlist()
compare_forest_impute_with_actual <-
  Map(metric_relative, imp1$data, iris, iris_missing) %>%
    unlist()

perf <- data.frame(
  colnames = names(compare_forest_impute_with_actual)
  , rough  = round(compare_roughimpute_with_actual, 2)
  , forest = round(compare_forest_impute_with_actual, 2)
  )
rownames(perf) <- NULL
perf

# example of supervised imputation

# create data for supervised case
iris_complete2         <- iris_complete
iris_complete2$Species <- iris$Species

iris_missing2 <- iris_missing
# the response is fully observed: one FALSE per row
iris_missing2$Species <- rep(FALSE, nrow(iris_missing))

imp2        <- forest_impute(list(iris_complete2, iris_missing2)
                             , "Species"
                             , implementation = "ranger"
                             )


imp2        <- forest_impute(list(iris_complete2, iris_missing2)
                             , "Species"
                             , implementation = "randomForest"
                             )

compare_forest_impute_sup_with_actual <-
  Map(metric_relative, imp2$data, iris, iris_missing2) %>% unlist()

perf2 <- data.frame(
  colnames     = names(compare_forest_impute_sup_with_actual)
  , rough      = round(compare_roughimpute_with_actual, 2)
  , forest_sup = round(compare_forest_impute_sup_with_actual, 2)
  )
rownames(perf2) <- NULL
perf2
cbind(perf, forest_sup = perf2[,3])

## End(Not run)

talegari/forager documentation built on May 3, 2019, 4:01 p.m.