Load files / libraries
library(randomForest)
setwd("./data")
train <- read.csv("sonar_train.csv", header = FALSE)
test <- read.csv("sonar_test.csv", header = FALSE)
setwd("../")
Format training and test data sets and fit model
Note the randomForest defaults:
- Takes training data (and, optionally, test data).
- Grows 500 trees and takes a majority vote across them.
- Each tree differs only in which attributes are available for each split decision (out of the 60 in our set).
- You can control much of the detail about how splits are made (see the sketch after the code below).
y_train <- as.factor(train[,61])
x_train <- train[,1:60]
fit <- randomForest(x_train, y_train)
1 - sum(y_train == predict(fit, x_train)) / length(y_train)
fit
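The defaults above can also be set explicitly. A minimal sketch (the values shown simply restate the defaults for this 60-variable data set; fit_explicit is a name introduced here for illustration):

# ntree = trees grown and voted over; mtry = variables sampled per split
# (the classification default is floor(sqrt(p)) = floor(sqrt(60)) = 7)
fit_explicit <- randomForest(x_train, y_train, ntree = 500, mtry = 7)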
Note:
- 0 class prediction errors on the training data (good, though not surprising given 500 trees).
- When you print fit, you see that randomForest tried 7 variables at each split.
- The out-of-bag (OOB) estimate of the error rate is 17.7%.
- It also provides a confusion matrix with the per-class error calculated.
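Those printed numbers can also be pulled directly from the fitted object; a short sketch using standard randomForest accessors (not part of the original notes):

fit$ntree                       # trees grown (500)
fit$mtry                        # variables tried at each split (7)
fit$confusion                   # confusion matrix with per-class error
fit$err.rate[fit$ntree, "OOB"]  # final out-of-bag error rate (~0.177)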
Run against the test data set and calculate the misclassification error.
y_test <- as.factor(test[,61])
x_test <- test[,1:60]
1 - sum(y_test == predict(fit, x_test)) / length(y_test)
Note: only 12.8% test error. This does much better than prior methods, despite trying only 7 variables per split. That is the power of voting across 500 decision trees.
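You can watch that vote power accumulate by plotting error against the number of trees; this uses randomForest's built-in plot method (the legend call is an optional extra, assuming two classes plus the OOB column):

plot(fit)  # OOB and per-class error versus number of trees
legend("topright", colnames(fit$err.rate), col = 1:3, lty = 1:3)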
So, why doesn't it do better than we would forecast using the pbinom() formula?
# Probability that more than 50 of 60 independent classifiers, each
# correct 70% of the time, are right simultaneously (upper binomial tail)
1 - pbinom(50, 60, 0.7)
The problem is that it's very difficult to generate completely independent classifiers. Trees built using the same methodology, even though splits may be randomized, will still share a common set of errors due to the common approach. So they are not completely independent, as pbinom() expects.
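To make that concrete, here is a small simulation (not from the original write-up; the 0.2 / 0.4 / 0.8 values are invented for illustration) comparing a majority vote over 60 independent classifiers against 60 classifiers whose accuracy rises and falls together:

set.seed(1)
n_sims <- 10000
# Independent: each of 60 classifiers is right 70% of the time;
# how often is the majority (31 of 60) correct?
indep <- replicate(n_sims, sum(rbinom(60, 1, 0.7)) >= 31)
# Correlated: a shared "hard case" drags every classifier down at once
# (marginal accuracy is still about 0.2*0.4 + 0.8*0.8 = 0.72)
corr <- replicate(n_sims, {
  hard  <- runif(1) < 0.2
  p_eff <- if (hard) 0.4 else 0.8
  sum(rbinom(60, 1, p_eff)) >= 31
})
mean(indep)  # near 1 - pbinom(30, 60, 0.7): the vote is almost always right
mean(corr)   # noticeably lower: shared errors sink the whole vote at once

Under independence the majority vote is nearly perfect; shared errors are exactly what keeps a real forest from reaching that forecast.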
If you use more features (a higher % of the total) on each split, you'll likely get higher interdependence between trees. There's a tradeoff here: the fewer features you use per split, the less representative each individual tree can be of your data (see the sweep below).
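A quick way to see the tradeoff empirically is to sweep mtry and compare test error; a rough sketch reusing the objects defined above (mtry = 60 uses every feature at every split, so trees differ only through their bootstrap samples):

set.seed(1)
for (m in c(2, 7, 20, 60)) {
  fit_m <- randomForest(x_train, y_train, mtry = m)
  err   <- 1 - sum(y_test == predict(fit_m, x_test)) / length(y_test)
  cat("mtry =", m, " test error =", round(err, 3), "\n")
}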