Just as a forester uses many tools to manage a forest, we may need multiple tools to manage a random forest. `forestr` is an R package that extends the random forest methodology by including multiple splitting criteria for building the trees. Additionally, the possibility of user-specified splitting criteria is left open by functionalizing the splitting methods.
Splitting criteria (to be) included within `forestr`:
- Gini
- Information
- One-sided extreme
- One-sided purity
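As a rough sketch of the first two criteria (the one-sided formulas are specific to `forestr` and are not reproduced here), the textbook Gini and information (entropy) impurities for a node's class labels look like this — this is illustrative, not `forestr`'s internal implementation:

```r
# Standard Gini and information (entropy) impurity for a vector of class
# labels; textbook formulas, not forestr's internal code.
gini_impurity <- function(y) {
  p <- table(y) / length(y)   # class proportions in the node
  1 - sum(p^2)
}

info_impurity <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]               # drop empty classes to avoid log2(0)
  -sum(p * log2(p))
}

gini_impurity(c("a", "a", "b", "b"))  # 0.5 for a balanced binary node
info_impurity(c("a", "a", "b", "b"))  # 1 bit for a balanced binary node
```

A split's quality is then the weighted decrease in impurity from parent to children; the tree greedily picks the split maximizing that decrease.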
To install `forestr`:

```r
install.packages("devtools")
devtools::install_github("andeek/forestr")
```

You will also need the modified `rpart` with splitting functions:

```r
devtools::install_github("andeek/rpart")
```
This is a project for STAT 503 with Di Cook in Spring 2015. Below I detail the steps that will be taken to create this package and test its use.
The old plan (see below) has been temporarily dropped due to the incomplete port of `randomForest` from Fortran to C. With my lack of Fortran experience, finishing the port myself has proved too difficult. Instead of extending the `randomForest` library, I am building a random forest framework around the `rpart` library, which does include the ability for a user to create splitting functions.
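For reference, `rpart`'s user-written split interface takes a list of three functions (`init`, `eval`, `split`) passed as the `method` argument. The sketch below adapts the regression (anova) example from `rpart`'s "User Written Split Functions" vignette to show the shape of that interface — it is an illustration of the hook `forestr` builds on, not `forestr`'s actual splitting code:

```r
library(rpart)  # rpart ships with standard R distributions

# init: describes the response; numresp/numy = 1 for a univariate response
itemp <- function(y, offset, parms, wt) {
  if (length(offset)) y <- y - offset
  sfun <- function(yval, dev, wt, ylevel, digits)
    paste("  mean=", format(signif(yval, digits)), sep = "")
  environment(sfun) <- .GlobalEnv
  list(y = c(y), parms = NULL, numresp = 1, numy = 1, summary = sfun)
}

# eval: label and deviance for a node (weighted mean and residual SS)
etemp <- function(y, wt, parms) {
  wmean <- sum(y * wt) / sum(wt)
  list(label = wmean, deviance = sum(wt * (y - wmean)^2))
}

# split: goodness of the n-1 candidate split points; rpart passes y and wt
# already sorted by the predictor x when continuous = TRUE
stemp <- function(y, wt, x, parms, continuous) {
  if (!continuous) stop("categorical predictors not handled in this sketch")
  n <- length(y)
  y <- y - sum(y * wt) / sum(wt)          # center y at the node mean
  temp <- cumsum(y * wt)[-n]
  left.wt  <- cumsum(wt)[-n]
  right.wt <- sum(wt) - left.wt
  lmean <- temp / left.wt
  rmean <- -temp / right.wt
  # scaled reduction in sum of squares for each candidate split
  list(goodness = (left.wt * lmean^2 + right.wt * rmean^2) / sum(wt * y^2),
       direction = sign(lmean))
}

fit <- rpart(mpg ~ ., data = mtcars,
             method = list(eval = etemp, split = stemp, init = itemp))
print(fit)
```

Swapping in a classification criterion (Gini, information, or a one-sided measure) amounts to replacing the deviance in `etemp` and the goodness computation in `stemp`.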
I will be extending the current `randomForest` package by updating the C, Fortran, and R files within its source. The main lifting will be to create splitting functions, rather than having the splitting be native within the code. Additionally, parameters will need to be created in the top-level R code to expose the functionality to the user.

Currently I have read through the code files in the `randomForest` package and located where the splitting is done (with Gini). The challenges I foresee are that I do not know Fortran, nor have I ever written an R package. There is a first time for everything, though, and extending a well-written package will be a much more achievable goal than starting from scratch.
The one-sided extreme and one-sided purity methods will be the focus of testing the package for my paper. These methods were created to better handle unbalanced two-class classification tasks where correctly classifying one class matters more than the other (think cancer detection). As such, we will test these splitting criteria with varying levels of unbalanced data in a two-class classification problem.
The data used in this project come from the UCI Machine Learning Repository and are produced using Monte Carlo simulations to resemble properties measured by particle detectors in an accelerator searching for the Higgs boson. I have created smaller datasets from this large dataset by sampling observations to achieve the degree of imbalance I desire. The levels of imbalance I will test are 5%, 10%, and 25%, comparing the performance of the one-sided extreme and one-sided purity criteria to Gini and information on the same datasets.
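The downsampling step can be sketched as follows. This is a hypothetical helper, not code from the project: the function name, column names, and simulated stand-in data are all assumptions for illustration.

```r
# Hypothetical helper (not part of forestr): downsample the positive class
# so that it makes up a target fraction `frac` of the resulting dataset.
make_unbalanced <- function(data, label, positive, frac) {
  pos <- data[data[[label]] == positive, ]
  neg <- data[data[[label]] != positive, ]
  # solve n_pos / (n_pos + nrow(neg)) = frac for n_pos
  n_pos <- round(frac * nrow(neg) / (1 - frac))
  rbind(neg, pos[sample(nrow(pos), n_pos), ])
}

# Example with simulated data standing in for the Higgs samples
set.seed(503)
full <- data.frame(class = rep(c("signal", "background"), c(500, 500)),
                   x = rnorm(1000))
d05 <- make_unbalanced(full, "class", "signal", 0.05)
mean(d05$class == "signal")  # approximately 0.05
```

Repeating this with `frac` set to 0.05, 0.10, and 0.25 yields the three levels of imbalance described above.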