rf.modelSel: Random Forest Model Selection

rf.modelSel    R Documentation

Random Forest Model Selection

Description

Implements the Random Forests model selection approach of Murphy et al. (2010).

Usage

rf.modelSel(
  xdata,
  ydata,
  imp.scale = c("mir", "se"),
  r = c(0.25, 0.5, 0.75),
  final.model = FALSE,
  seed = NULL,
  parsimony = NULL,
  kappa = FALSE,
  method = c("Breiman", "Wright"),
  pvalue = NULL,
  nperm = 99,
  ...
)

Arguments

xdata

Predictor (independent) variables for the model, e.g., a data.frame with one column per variable

ydata

Response (dependent) variable for the model; a factor for classification or a numeric vector for regression

imp.scale

Type of scaling for importance values (mir or se), default is mir

r

Vector of importance percentiles to test, e.g., seq(0,1,0.2)[2:5]

final.model

Run final model with selected variables (TRUE/FALSE)

seed

Sets the random seed in the R global environment. Setting a seed is strongly recommended so that results are reproducible.

parsimony

Threshold for competing model (0-1)

kappa

Use the chance-corrected kappa statistic rather than percent correctly classified (PCC) (TRUE/FALSE)

method

Use the fast C++ ranger implementation "Wright" or original "Breiman" Fortran code

pvalue

Significance threshold; when specified, a permutation p-value is calculated and parameters that do not meet this threshold are filtered out

nperm

Number of permutations to calculate p-value

...

Additional arguments to pass to randomForest or ranger (e.g., ntree=1000, replace=TRUE, proximity=TRUE)

Details

If you want to run classification, make sure that ydata is a factor; otherwise the randomForest model runs in regression mode. For classification problems, the model selection criteria are: smallest OOB error, smallest maximum within-class error, and fewest parameters. For regression problems, the model selection criteria are: largest percent variation explained, smallest MSE, and fewest parameters.
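The classification criteria above can be read as a lexicographic ordering. The following is a minimal base R sketch of that ordering, not the package's internal code; the `candidates` data frame, its column names, and its values are hypothetical.

```r
# Hypothetical summary of three candidate models (values made up for illustration).
candidates <- data.frame(
  model        = c("m1", "m2", "m3"),
  oob.error    = c(0.12, 0.10, 0.10),  # out-of-bag error
  max.class.er = c(0.20, 0.15, 0.15),  # largest within-class error
  n.params     = c(5, 4, 3),           # number of predictors
  stringsAsFactors = FALSE
)

# Sort by smallest OOB error, then smallest maximum class error,
# then fewest parameters; later keys only break ties in earlier ones.
ord <- order(candidates$oob.error, candidates$max.class.er, candidates$n.params)
ranked <- candidates[ord, ]
selected <- ranked$model[1]

print(ranked)
print(selected)
```

For regression, the analogous ordering would sort by decreasing percent variation explained, then increasing MSE, then fewest parameters.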

The "mir" scale option performs a row standardization and the "se" option performs normalization using the "standard errors" of the permutation-based importance measure. Both options result in a 0-1 range, but "se" sums to 1. The scaled importance measures are calculated as: mir = i/max(i) and se = (i / se) / ( sum(i) / se).
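As an illustration, the two scaling expressions can be evaluated directly in base R. This is a sketch that applies the formulas exactly as written in the text; the importance values `i` and standard errors `se` below are made up, and the package may compute the standard errors differently.

```r
# Hypothetical permutation importances and their standard errors (made-up values).
i  <- c(Temp = 10, Wind = 5, Solar.R = 2, Month = 1)
se <- c(Temp = 2, Wind = 1, Solar.R = 1, Month = 0.5)

# "mir": divide by the maximum, so the most important variable scores 1.
mir <- i / max(i)

# "se": the expression as written in the text above.
se.scaled <- (i / se) / (sum(i) / se)

print(round(mir, 3))
print(round(se.scaled, 3))
print(sum(se.scaled))  # the "se" scaling sums to 1
```

Because "mir" is relative to the top-ranked variable while "se" is relative to the total, the same percentile cutoff can retain different variables under the two scalings.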

The parsimony argument is the percent of allowable error surrounding competing models. For example, suppose there are two competing models, a selected model with 5 parameters and a competing model with 3 parameters, and parsimony = 0.05. If the error of the 3-parameter model is within +/- 5% of the selected model's error, the 3-parameter model will be selected as the final model.
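A minimal sketch of one way to read the parsimony rule, assuming the threshold is compared against the difference in error between the selected model and a competing model with fewer parameters. The function `pick_parsimonious` and its inputs are hypothetical and are not part of rfUtilities.

```r
# Hypothetical helper: prefer the smaller model when its error is within
# `parsimony` of the selected model's error (an assumed reading of the rule).
pick_parsimonious <- function(err.selected, n.selected,
                              err.competing, n.competing,
                              parsimony = 0.05) {
  within.tol <- abs(err.competing - err.selected) <= parsimony
  if (n.competing < n.selected && within.tol) "competing" else "selected"
}

# Selected model: 5 parameters, error 0.10; competing model: 3 parameters, error 0.13.
choice.close <- pick_parsimonious(0.10, 5, 0.13, 3, parsimony = 0.05)

# Same models, but the competing model's error is 0.20, outside the tolerance.
choice.far <- pick_parsimonious(0.10, 5, 0.20, 3, parsimony = 0.05)

print(c(choice.close, choice.far))
```

Under this reading, a larger parsimony value trades a small amount of accuracy for a model with fewer parameters.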

If you specify the pvalue and nperm arguments, a permutation test is applied and parameters that do not meet the specified significance threshold are removed before the model selection process. Note that the resolution of the p-value is a function of the number of permutations, so pvalue = 0.10 would be adequate for nperm = 99.
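The dependence of the p-value on the number of permutations can be seen in a toy permutation test. This sketch assumes the common (1 + count) / (nperm + 1) estimator and uses simulated data; it is not the test that rfUtilities applies to importance values.

```r
set.seed(42)
nperm <- 99

# Smallest p-value that nperm permutations can report under the
# assumed (1 + count) / (nperm + 1) estimator.
min.p <- 1 / (nperm + 1)

# Toy test: is the absolute correlation between x and y larger than chance?
x <- rnorm(50)
y <- x + rnorm(50)
obs <- abs(cor(x, y))

# Null distribution: break the x-y pairing by shuffling y.
perm <- replicate(nperm, abs(cor(x, sample(y))))
p.value <- (1 + sum(perm >= obs)) / (nperm + 1)

print(min.p)
print(p.value)
```

With nperm = 99 the reported p-value cannot fall below 0.01, so a very strict threshold needs a correspondingly larger number of permutations.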

Using the kappa = TRUE argument will base error optimization on the kappa statistic rather than percent correctly classified (PCC). This corrects the PCC for random agreement. The method = "Breiman" option specifies the use of the original Breiman Fortran code, whereas "Wright" uses the C++ implementation from the ranger package (which exhibits a considerable improvement in speed).
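To make the difference between PCC and kappa concrete, both can be computed from a confusion matrix. This is a base R sketch using Cohen's kappa and a made-up confusion matrix `cm`; the package's own calculation may differ in detail.

```r
# Hypothetical confusion matrix (rows = observed, columns = predicted).
cm <- matrix(c(40, 10,
                5, 45), nrow = 2, byrow = TRUE,
             dimnames = list(observed = c("0", "1"), predicted = c("0", "1")))

n   <- sum(cm)
pcc <- sum(diag(cm)) / n                     # percent correctly classified
pe  <- sum(rowSums(cm) * colSums(cm)) / n^2  # agreement expected by chance
kappa.stat <- (pcc - pe) / (1 - pe)          # Cohen's kappa

print(pcc)
print(kappa.stat)
```

Here the PCC is 0.85 while kappa is 0.70, because half of the agreement would be expected by chance given the row and column totals.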

Value

A rf.modelSel class object with the following components:

  • "rf.final" Final selected model, if final.model = TRUE (randomForest model object)

  • "sel.vars" Final selected variables (vector)

  • "test" Validation parameters used on model selection (data.frame)

  • "sel.importance" Importance values for selected model (data.frame)

  • "importance" Importance values for all models (data.frame)

  • "parameters" Variables used in each tested model (list)

  • "scaling" Type of scaling used for importance

Author(s)

Jeffrey S. Evans <jeffrey_evans@tnc.org>

References

Evans, J.S. and S.A. Cushman (2009) Gradient Modeling of Conifer Species Using Random Forest. Landscape Ecology 24(5):673-683.

Murphy, M.A., J.S. Evans, and A.S. Storfer (2010) Quantifying Bufo boreas connectivity in Yellowstone National Park with landscape genetics. Ecology 91:252-261.

Evans, J.S., M.A. Murphy, Z.A. Holden, and S.A. Cushman (2011) Modeling species distribution and change using Random Forests. Chapter 8 in Predictive Species and Habitat Modeling in Landscape Ecology, eds. C.A. Drew, F. Huettmann, and Y. Wiersma. Springer.

See Also

randomForest for randomForest ... model options when method = "Breiman"

ranger for ranger ... model options when method = "Wright"

rf.ImpScale for details on p-values

Examples

  library(randomForest)
  library(rfUtilities)
  data(airquality)
  airquality <- na.omit(airquality)

  # Predictors (columns 2-6) and response (Ozone, column 1)
  xdata <- airquality[,2:6]
  ydata <- airquality[,1]

 #### Regression example
 
 #### Using Breiman's original Fortran code from randomForest package
 ( rf.regress <- rf.modelSel(airquality[,2:6], airquality[,1], 
                             imp.scale="se") )
 
 #### Using Wright's C++ code from ranger package
 ( rf.regress <- rf.modelSel(airquality[,2:6], airquality[,1], 
                             method="Wright") )

 #### Classification example
 ydata = as.factor(ifelse(ydata < 40, 0, 1))
 
  #### Using Breiman's original Fortran code from randomForest package
  ( rf.class <- rf.modelSel(xdata, ydata, ntree=1000) )
  
  # Use selected variables (same as final.model = TRUE)
  vars <- rf.class$selvars
  ( rf.fit <- randomForest(x=xdata[,vars], y=ydata) )
  
  # Use results to select competing model (second tested model in the parameters list)
  vars <- na.omit(as.character(rf.class$parameters[[2]]))
  ( rf.fit <- randomForest(x=xdata[,vars], y=ydata) )
  
  #### Using Wright's C++ code from ranger package
  ( rf.class <- rf.modelSel(xdata, ydata, method="Wright") )

## Not run: 
   # Using ranger package, filter p-values for classification
   ( rf.class <- rf.modelSel(xdata, ydata, method="Wright", 
                             pvalue=0.1, nperm=99, num.trees=1000) )

   # Using ranger package, filter p-values for regression
   # (numeric response, so the original airquality columns are used)
   ( rf.regress <- rf.modelSel(airquality[,2:6], airquality[,1], method="Wright", 
                               pvalue=0.1, num.trees=1000) )

## End(Not run)


jeffreyevans/rfUtilities documentation built on Nov. 12, 2023, 6:52 p.m.