plot.forestFloor: plot.forestFloor_regression

Description Usage Arguments Details Author(s) Examples

Description

A method to plot an object of forestFloor-class. Plot partial feature contributions of the most important variables. Colour gradients can be applied to show possible interactions. Fitted function(plot_GOF) describe FC only as a main effect and quantifies 'Goodness Of Fit'.

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
## S3 method for class 'forestFloor_regression'
 plot(
  x,
  plot_seq=NULL, 
  plotTest = NULL,
  limitY=TRUE,
  orderByImportance=TRUE, 
  cropXaxes=NULL, 
  crop_limit=4,
  plot_GOF = TRUE,
  GOF_args = list(col="#33333399"),
  speedup_GOF = TRUE,
  ...)
                          
## S3 method for class 'forestFloor_multiClass'
 plot(
  x,
  plot_seq = NULL,
  label.seq = NULL,
  plotTest  = NULL,
  limitY = TRUE,
  col = NULL,
  colLists = NULL,
  orderByImportance = TRUE,
  fig.columns = NULL,
  plot_GOF = TRUE,
  GOF_args = list(),
  speedup_GOF = TRUE,
  jitter_these_cols = NULL,
  jitter.factor = NULL,
  ...)                             

Arguments

x

forestFloor-object, also abbreviated ff. Basically a list of class="forestFloor" containing feature contributions, features, targets and variable importance.

plot_seq

a numeric vector describing which variables and in what sequence to plot. Ordered by importance as default. If orderByImportance = F, then by feature/column order of training data.

label.seq

[only classification] a numeric vector describing which classes and in what sequence to plot. NULL is all classes ordered is in levels in x$Y of forestFloor_mulitClass object x.

plotTest

NULL(plot by test set if available), TRUE(plot by test set), FALSE(plot by train), "andTrain"(plot by both test and train)

fig.columns

[only for multiple plotting], how many columns per page. default(NULL) is 1 for one plot, 2 for 2, 3 for 3, 2 for 4 and 3 for more.

limitY

TRUE/FLASE, constrain all Yaxis to same limits to ensure relevance of low importance features is not over interpreted

col

Either a colur vector with one colour per plotted class label or a list of colour vectors. Each element is a colour vector one class. Colour vectors in list are normally either of length 1 with or of length equal to number of training observations. NULL will choose standard one colour per class.

colLists

Deprecetated, will be replaced by col input

jitter_these_cols

vector to apply jitter to x-axis in plots. Will refer to variables. Useful to for categorical variables. Default=NULL is no jitter.

jitter.factor

value to decide how much jitter to apply. often between .5 and 3

orderByImportance

TRUE / FALSE should plotting and plot_seq be ordered after importance. Most important feature plot first(TRUE)

cropXaxes

a vector of indices of which zooming of x.axis should look away from outliers

crop_limit

a number often between 1.5 and 5, referring limit in sigmas from the mean defining outliers if limit = 2, above selected plots will zoom to +/- 2 std.dev of the respective features.

plot_GOF

Boolean TRUE/FALSE. Should the goodness of fit be plotted as a line?

GOF_args

Graphical arguments fitted lines, see points for parameter names.

speedup_GOF

Should GOF only computed on reasonable sub sample of data set to speedup computation. GOF estimation leave-one-out-kNN becomes increasingly slow for +1500 samples.

...

... other arguments passed to par or plot
. e.g. mar=, mfrow=, is passed to par, and cex= is passed to plot. par() arguments are reset immediately as plot function returns.

Details

The method plot.forestFloor visualizes partial plots of the most important variables first. Partial dependence plots are available in the randomForest package. But such plots are single lines(1d-slices) and do not answer the question: Is this partial function(PF) a fair generalization or subject to global or local interactions.

Author(s)

Soren Havelund Welling

Examples

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
## Not run: 
  ## avoid testing of rgl 3D plot on headless non-windows OS
  ## users can disregard this sentence.
  if(!interactive() && Sys.info()["sysname"]!="Windows") skipRGL=TRUE
  
  ###
  #1 - Regression example:
  set.seed(1234)
  library(forestFloor)
  library(randomForest)
  
  #simulate data y = x1^2+sin(x2*pi)+x3*x4 + noise
  obs = 5000 #how many observations/samples
  vars = 6   #how many variables/features
  #create 6 normal distr. uncorr. variables
  X = data.frame(replicate(vars,rnorm(obs)))
  #create target by hidden function
  Y = with(X, X1^2 + sin(X2*pi) + 2 * X3 * X4 + 0.5 * rnorm(obs)) 
  
  #grow a forest
  rfo = randomForest(
    X, #features, data.frame or matrix. Recommended to name columns.
    Y, #targets, vector of integers or floats
    keep.inbag = TRUE,  # mandatory,
    importance = TRUE,  # recommended, else ordering by giniImpurity (unstable)
    sampsize = 1500 ,   # optional, reduce tree sizes to compute faster
    ntree = if(interactive()) 1000 else 25 #speedup CRAN testing
  )
  
  #compute forestFloor object, often only 5-10% time of growing forest
  ff = forestFloor(
    rf.fit = rfo,       # mandatory
    X = X,              # mandatory
    calc_np = FALSE,    # TRUE or FALSE both works, makes no difference
    binary_reg = FALSE  # takes no effect here when rfo$type="regression"
  )
  
  #print forestFloor
  print(ff) #prints a text of what an 'forestFloor_regression' object is
  plot(ff)
  
  #plot partial functions of most important variables first
  plot(ff,                       # forestFloor object
       plot_seq = 1:6,           # optional sequence of features to plot
       orderByImportance=TRUE    # if TRUE index sequence by importance, else by X column  
  )
       
  #Non interacting features are well displayed, whereas X3 and X4 are not
  #by applying color gradient, interactions reveal themself 
  #also a k-nearest neighbor fit is applied to evaluate goodness-of-fit
  Col=fcol(ff,3,orderByImportance=FALSE) #create color gradient see help(fcol)
  plot(ff,col=Col,plot_GOF=TRUE) 
  
  #feature contributions of X3 and X4 are well explained in the context of X3 and X4
  # as GOF R^2>.8
  
  show3d(ff,3:4,col=Col,plot_GOF=TRUE,orderByImportance=FALSE)
  
  #if needed, k-nearest neighbor parameters for goodness-of-fit can be accessed through convolute_ff
  #a new fit will be calculated and saved to forstFloor object as ff$FCfit
  ff = convolute_ff(ff,userArgs.kknn=alist(kernel="epanechnikov",kmax=5))
  plot(ff,col=Col,plot_GOF=TRUE) #this computed fit is now used in any 2D plotting.
  
  
  ###
  #2 - Multi classification example:   (multi is more than two classes)
  set.seed(1234)
  library(forestFloor)
  library(randomForest)
  
  data(iris)
  X = iris[,!names(iris) %in% "Species"]
  Y = iris[,"Species"]
  
  rf = randomForest(
    X,Y,               
    keep.forest=TRUE,  # mandatory
    keep.inbag=TRUE,   # mandatory
    samp=20,           # reduce complexity of mapping structure, with same OOB%-explained
    importance  = TRUE, # recommended, else ordering by giniImpurity (unstable)
    ntree = if(interactive()) 1000 else 25 #speedup CRAN testing
  )
  
  ff = forestFloor(rf,X)
  
  plot(ff,plot_GOF=TRUE,cex=.7,
    col=c("#FF0000A5","#00FF0050","#0000FF35") #one col per plotted class
  )
  
  #...and 3D plot, see show3d
  show3d(ff,1:2,1:2,plot_GOF=TRUE)
  
  #...and simplex plot (only for three class problems)
  plot_simplex3(ff)
  plot_simplex3(ff,zoom.fit = TRUE)
  
  #...and 3d simplex plots (rough look, Z-axis is feature)
  plot_simplex3(ff,fig3d = TRUE)
  
  ###
  #3 - binary regression example
  #classification of two classes can be seen as regression in 0 to 1 scale
  set.seed(1234)
  library(forestFloor)
  library(randomForest)
  data(iris)
  X = iris[-1:-50,!names(iris) %in% "Species"] #drop third class virginica
  Y = iris[-1:-50,"Species"]
  Y = droplevels((Y)) #drop unused level virginica
  
  rf = randomForest(
    X,Y,               
    keep.forest=TRUE,  # mandatory
    keep.inbag=TRUE,   # mandatory
    samp=20,           # reduce complexity of mapping structure, with same OOB%-explained
    importance  = TRUE, # recommended, else giniImpurity
     ntree = if(interactive()) 1000 else 25 #speedup CRAN testing
  )
  
  ff = forestFloor(rf,X,
                   calc_np=TRUE,    #mandatory to recalculate
                   binary_reg=TRUE) #binary regression, scale direction is printed
  Col = fcol(ff,1) #color by most important feature
  plot(ff,col=Col)   #plot features 
  
  #interfacing with rgl::plot3d
  show3d(ff,1:2,col=Col,plot.rgl.args = list(size=2,type="s",alpha=.5)) 

## End(Not run)

sorhawell/forestFloor documentation built on Oct. 23, 2021, 2:20 a.m.