rf.crossValidation: Cross-validation for Random Forest models

View source: R/rf.crossValidation.R


Cross-validation for Random Forest models

Description

Implements a permutation-based cross-validation test for classification or regression Random Forest models.

Usage

rf.crossValidation(
  x,
  p = 0.1,
  n = 99,
  seed = NULL,
  normalize = FALSE,
  bootstrap = FALSE,
  p.threshold = 0.6,
  trace = FALSE
)

Arguments

x

A randomForest or ranger object

p

Proportion of data to withhold (default p = 0.10)

n

Number of cross-validations (default n = 99)

seed

Sets the random seed in the R global environment

normalize

(FALSE/TRUE) For regression, should rmse, mbe and mae be normalized using (max(y) - min(y))

bootstrap

(FALSE/TRUE) Should bootstrap sampling be applied. If FALSE, a p-proportion withhold is conducted

p.threshold

If a ranger probability forest, the probability threshold to use in validation

trace

Print iterations

Details

For classification problems, the cross-validation statistics are based on the prediction error on the withheld data: total observed accuracy represents the percent correctly classified (aka PCC) and is considered a naive measure of agreement.

The diagonal of the confusion matrix represents correctly classified observations, whereas the off-diagonals represent cross-classification error. The primary issue with this evaluation is that it does not reveal whether error is evenly distributed between classes.

To represent the balance of error one can use omission and commission statistics such as estimates of user's and producer's accuracy. User's accuracy corresponds to the error of commission (inclusion): observations being erroneously included in a given class.

The commission errors are represented by the row sums of the matrix. Producer's accuracy corresponds to the error of omission (exclusion): observations being erroneously excluded from a given class. The omission errors are represented by the column sums of the matrix.

None of the previous statistics account for random agreement influencing the accuracy measure. The kappa statistic is a chance-corrected metric that reflects the difference between observed agreement and the agreement expected by random chance. A kappa of k = 0.85 would indicate 85% better agreement than expected by chance alone. The statistics are summarized below, with an illustrative calculation following the list.

  • pcc = [Number of correct observations / total number of observations]

  • users accuracy = [Number of correct / total number of correct and commission errors]

  • producers accuracy = [Number of correct / total number of correct and omission errors]

  • k = (observed accuracy - chance agreement) / (1 - chance agreement), where chance agreement = sum[product of row and column totals for each class] / (total number of observations)^2
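
As an illustration only (not the function's internal code), these statistics can be computed directly from a confusion matrix in which rows are predicted classes and columns are observed classes:

  # Hypothetical 2-class confusion matrix: rows = predicted, columns = observed
  cm <- matrix(c(45,  5,
                  8, 42), nrow = 2, byrow = TRUE)
  pcc <- sum(diag(cm)) / sum(cm)                # total observed (naive) accuracy
  users.accuracy <- diag(cm) / rowSums(cm)      # 1 - commission error, by class
  producers.accuracy <- diag(cm) / colSums(cm)  # 1 - omission error, by class
  chance <- sum(rowSums(cm) * colSums(cm)) / sum(cm)^2
  kappa <- (pcc - chance) / (1 - chance)        # chance-corrected agreement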

For regression problems, a bootstrap is constructed and the subset model's MSE and percent variance explained are reported. Additionally, the RMSE, MBE and MAE between the withheld response variable (y) and the subset-model predictions are reported.
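
As an illustrative sketch (not the function's internal code), these error statistics for a single withheld set, assuming hypothetical vectors y (withheld observations) and y.hat (subset-model predictions), are:

  # Hypothetical withheld observations (y) and subset-model predictions (y.hat)
  y     <- c(23, 41, 15, 30, 52)
  y.hat <- c(25, 38, 18, 29, 47)
  rmse <- sqrt(mean((y - y.hat)^2))    # root mean squared error
  mbe  <- mean(y.hat - y)              # mean bias error (one common definition)
  mae  <- mean(abs(y - y.hat))         # mean absolute error
  rmse / (max(y) - min(y))             # normalized RMSE, as with normalize = TRUE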

Value

For classification, a "rf.cv", "classification" class object with the following components (an example of accessing these components follows the list):

  • cross.validation$cv.users.accuracy Class-level users accuracy for the subset cross validation data

  • cross.validation$cv.producers.accuracy Class-level producers accuracy for the subset cross validation data

  • cross.validation$cv.oob Global and class-level OOB error for the subset cross validation data

  • model$model.users.accuracy Class-level users accuracy for the model

  • model$model.producers.accuracy Class-level producers accuracy for the model

  • model$model.oob Global and class-level OOB error for the model
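
Assuming a classification cross-validation object rf.cv (as created in the Examples below), and that the components are nested under cross.validation and model as named above, they can be accessed directly, for example:

  # Illustration: extracting components of a classification "rf.cv" object
  rf.cv$cross.validation$cv.users.accuracy
  rf.cv$cross.validation$cv.oob
  rf.cv$model$model.producers.accuracy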

For regression a "rf.cv", "regression" class object with the following components:

  • fit.var.exp Percent variance explained from specified fit model

  • fit.mse Mean Squared Error from specified fit model

  • y.rmse Root Mean Squared Error (observed vs. predicted) from each Bootstrap iteration (cross-validation)

  • y.mbe Mean Bias Error from each Bootstrapped model

  • y.mae Mean Absolute Error from each Bootstrapped model

  • D Test statistic from the Kolmogorov-Smirnov test between y and the estimate

  • p.val p-value from the Kolmogorov-Smirnov test between y and the estimate (see the illustrative call below)

  • model.mse Mean Squared Error from each Bootstrapped model

  • model.varExp Percent variance explained from each Bootstrapped model
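
The D and p.val components correspond to a two-sample Kolmogorov-Smirnov test between the observed and predicted values. A comparable stand-alone test (illustration only, using simulated data rather than the function's internals) is:

  # Illustration: two-sample KS test between observed and predicted values
  set.seed(42)
  y     <- rnorm(100, mean = 50, sd = 10)
  y.hat <- y + rnorm(100, sd = 5)
  ks <- ks.test(y, y.hat)
  ks$statistic   # comparable to D
  ks$p.value     # comparable to p.val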

Note

Please note that previous versions of this function required ydata, xdata and "..." arguments that are no longer necessary. The model object is now used to obtain the data and arguments from the original model.

Author(s)

Jeffrey S. Evans <jeffrey_evans<at>tnc.org>

References

Evans, J.S. and S.A. Cushman (2009) Gradient Modeling of Conifer Species Using Random Forest. Landscape Ecology 24(5):673-683.

Murphy M.A., J.S. Evans, and A.S. Storfer (2010) Quantifying Bufo boreas connectivity in Yellowstone National Park with landscape genetics. Ecology 91:252-261.

Evans J.S., M.A. Murphy, Z.A. Holden, and S.A. Cushman (2011) Modeling species distribution and change using Random Forest. Ch. 8 in Predictive Modeling in Landscape Ecology, eds. Drew, C.A., F. Huettmann, and Y. Wiersma. Springer.

See Also

randomForest for randomForest details

ranger for ranger details

Examples

## Not run: 
library(randomForest)
library(ranger)

data(airquality)
airquality <- na.omit(airquality)
yclass = as.factor(ifelse(airquality[,1] < 40, 0, 1))

# regression with ranger
rf.mdl <- ranger(x = airquality[,2:6], y = airquality[,1])
  ( rf.cv <- rf.crossValidation(rf.mdl, p=0.10) )

  # plot results
  par(mfrow=c(2,2))
    plot(rf.cv)  
    plot(rf.cv, stat = "mse")
    plot(rf.cv, stat = "var.exp")
    plot(rf.cv, stat = "mae")

# regression with randomForest
rf.mdl <- randomForest(airquality[,2:6], airquality[,1])
  ( rf.cv <- rf.crossValidation(rf.mdl, p=0.10) )

# classification with ranger
rf.mdl <- ranger(x = airquality[,2:6], y = yclass)
  ( rf.cv <- rf.crossValidation(rf.mdl, p=0.10) )

    # Plot cross validation versus model producers accuracy
    par(mfrow=c(1,2)) 
      plot(rf.cv, type = "cv", main = "CV producers accuracy")
      plot(rf.cv, type = "model", main = "Model producers accuracy")
    
    # Plot cross validation versus model oob
    par(mfrow=c(1,2)) 
      plot(rf.cv, type = "cv", stat = "oob", main = "CV oob error")
      plot(rf.cv, type = "model", stat = "oob", main = "Model oob error")	 

# classification with randomForest
rf.mdl <- randomForest(x = airquality[,2:6], y = yclass)
  ( rf.cv <- rf.crossValidation(rf.mdl, p=0.10) )

# multi-class classification
data(iris)
  iris$Species <- as.factor(iris$Species)    	
( rf.mdl <- randomForest(iris[,1:4], iris[,"Species"], ntree=501) )
  ( rf.cv <- rf.crossValidation(rf.mdl) )


## End(Not run)	 
  
