forestFloor: ForestFloor: Visualize topologies of randomForest model


Description

The function forestFloor computes a feature contribution matrix from a randomForest model fit and outputs a forestFloor S3 class object, including the variable importance and the original training set. The output object is the basis of all visualizations.

Usage

forestFloor(rf.fit,X,calc_np = FALSE, binary_reg = FALSE,...)

Arguments

rf.fit

rf.fit, a random forest object as output by randomForest::randomForest or trimTrees::cinbag

X

data.frame of input variables: numeric (continuous), discrete (treated as continuous), or factors (categorical); n_rows observations and n_columns features. X MUST be the same data.frame as was used to train the random forest, see the item above.

calc_np

Calculate node predictions (TRUE) or reuse information from rf.fit (FALSE)? Slightly faster when FALSE for regression.
This option only takes effect for rf.fit of class "randomForest" with type = "regression" and exists only for development purposes. Leave it at FALSE; the function overrides this choice when needed.

binary_reg

boolean; if TRUE, binary classification can be changed to the "percentage votes" of class 1 and thus be treated as regression (see the sketch after this argument list).

...

does nothing
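
As a minimal sketch of binary_reg (the simulated data and object names below are illustrative, not taken from the package), a two-class randomForest fit trained with keep.inbag = TRUE is passed to forestFloor with binary_reg = TRUE, so feature contributions are computed on the vote-fraction scale and treated as a regression:

library(forestFloor)
library(randomForest)
#simulate a two-class problem (illustrative only)
X = data.frame(replicate(4, rnorm(1000)))
Y = factor(X$X1 + sin(X$X2 * pi) + rnorm(1000) > 0)
#classification fit; inbag counts are required by forestFloor
rfo = randomForest(X, Y, keep.inbag = TRUE, ntree = 500)
#binary_reg = TRUE treats the class votes as a regression target
ff = forestFloor(rfo, X, binary_reg = TRUE)
plot(ff)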

Details

forestFloor computes feature contributions for random forest regression as suggested by Kuz'min et al. and for binary classification as suggested by Palczewska et al. Feature contributions are the sums of all local increments for each observation for each feature, divided by the number of trees. A local increment is the change in node prediction for a given observation when a node is split into a sub-node by a given feature. forestFloor uses in-bag samples to calculate local increments, but only sums local increments over out-of-bag samples and divides by OOBtimes. OOBtimes is the number of times a given observation has been out-of-bag, which normally is ~ ntree / 3. This implementation can be said to yield cross-validated feature contributions. In practice this lowers the leverage of any single observation on its own feature contributions, and the visualization hereby becomes less noisy. In systems with low or no noise, this implementation has no particular advantage.

Node predictions, the average target value of the in-bag samples in any terminal or intermediary node, are already calculated for regression fits of class randomForest and are stored in rfo$forest$nodepred. Node predictions for binary classification are the fraction of class 1 (out of 2) in any node of the forest and are not calculated in advance by the randomForest package.
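
As a rough check of this decomposition (a sketch under the assumption that rfo and ff come from a regression fit as in the Examples below), the sum of an observation's feature contributions plus the grand mean of the target should approximately equal that observation's out-of-bag prediction:

#assuming rfo = randomForest(X, Y, keep.inbag = TRUE) and ff = forestFloor(rfo, X)
recon = rowSums(ff$FC_matrix) + mean(Y)  #feature contributions plus grand mean
plot(recon, rfo$predicted,               #rfo$predicted holds the OOB predictions
     xlab = "rowSums(FC_matrix) + mean(Y)", ylab = "OOB prediction")
abline(0, 1)  #points should lie close to the identity line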

Value

The forestFloor function outputs (depending on the type of rf.fit) an object of either class "forestFloor_regression" or "forestFloor_multiClass" with the following elements:

X

a copy of the training data or feature space matrix/data.frame, X. The copy is passed from the input of this function. X is used in all visualizations to map the feature contributions onto the features from which they were recorded.

Y

a copy of the target vector, Y.

importance

The gini importance or permutation importance, a.k.a. variable importance, of the random forest object.
If rfo = randomForest(X, Y, importance = FALSE), gini importance is used.
Gini importance is less reproducible and more biased. The extra time needed to compute permutation importance is negligible.

imp_ind

imp_ind, the importance indices, give the order in which to sort the features by descending importance. imp_ind is used by the plotting functions to present the most relevant feature contributions first. If using gini importance, the order of the plots is less reproducible and will favour continuous variables. The plots themselves will not differ.

FC_matrix

[ONLY forestFloor_regression.] feature contributions in a matrix with
n_row observations and n_column features - the same dimensions as X.

FC_array

[ONLY forestFloor_multiClass.] feature contributions in an array with
n_row observations, n_column features, and n_layer classes. The first two dimensions match the dimensions of X.
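
A short sketch of how the returned elements might be inspected, assuming ff is a "forestFloor_regression" object such as the one built in the Examples below:

str(ff, max.level = 1)            #the elements described above
dim(ff$FC_matrix)                 #same dimensions as ff$X
ff$importance                     #variable importance copied from rf.fit
ff$imp_ind                        #feature indices sorted by descending importance
head(ff$FC_matrix[, ff$imp_ind])  #contributions in the order the plots use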

Note

this version

Author(s)

Soren Havelund Welling

References

Kuz'min, V. E. et al., Interpretation of QSAR Models Based on Random Forest Methods, http://dx.doi.org/10.1002/minf.201000173
Palczewska, A. et al., Interpreting random forest classification models using a feature contribution method, http://arxiv.org/abs/1312.1121

See Also

plot.forestFloor, show3d

Examples

## Not run: 
library(forestFloor)
library(randomForest)
#simulate data
obs=2500
vars = 6 

X = data.frame(replicate(vars,rnorm(obs)))
Y = with(X, X1^2 + sin(X2*pi) + 2 * X3 * X4 + 1 * rnorm(obs))


#grow a forest, remember to include inbag
rfo=randomForest(X,Y,keep.inbag = TRUE,sampsize=1500,ntree=500)

#compute topology
ff = forestFloor(rfo,X)


#print forestFloor
print(ff) 

#plot partial functions of most important variables first
plot(ff) 

#Non-interacting functions are well displayed, whereas X3 and X4 are not.
#By applying a different colour gradient, interactions reveal themselves.
Col = fcol(ff,3,orderByImportance=FALSE)
plot(ff,col=Col,compute_GOF=TRUE) 



#in 3D the interaction between X3 and X4 reveals itself completely
show3d(ff,3:4,col=Col,plot.rgl=list(size=5),sortByImportance=FALSE) 

#although there is no interaction, X1 and X2 have a joint additive effect
#colour by FC-components FC1 and FC2 summed
Col = fcol(ff,1:2,orderByImportance=FALSE,X.matrix=FALSE,RGB=TRUE)
plot(ff,col=Col) 
show3d(ff,1:2,col=Col,plot.rgl=list(size=5),sortByImportance=FALSE) 

#...or a two-way gradient is formed from the FC-components of X1 and X2.
Col = fcol(ff,1:2,orderByImportance=FALSE,X.matrix=TRUE,alpha=0.8) 
plot(ff,col=Col) 
show3d(ff,1:2,col=Col,plot.rgl=list(size=5),sortByImportance=FALSE)


## End(Not run)
