stratified_rf: Stratified Random Forest


View source: R/rf_c50.R

Description

Random Forest that works with groups of predictor variables. When building each tree, a set number of variables is drawn from each group separately. This is useful when rows combine information about different things (e.g. user information and product information) and it is not sensible to make a prediction from only one group of variables, or when one group has far more variables than another and the groups should be represented evenly across trees.

Usage

stratified_rf(df, targetvar, groups, mtry = "auto", ntrees = 500,
  multicore = TRUE, class_quotas = NULL, sample_weights = NULL,
  fulldepth = TRUE, replacement = TRUE, c50_control = NULL,
  na.action = na.pass, drop_threshold = NULL)

Arguments

df

Data to build the model (data.frame only).

targetvar

String indicating the name of the target or outcome variable in the data. Character types will be coerced to factors.

groups

Unnamed list, containing at each entry a group of variables (as a string vector with their names).

mtry

A numeric vector indicating how many variables to take from each group when building each tree. If set to "auto" then, for each group, mtry = round(sqrt(m_total) * m_group / m_total), where m_group is the number of variables in that group and m_total is the number of variables across all groups (with a minimum of 1 for each group).

ntrees

Number of trees to grow. When setting multicore=TRUE, the number of trees should be a multiple of the number of cores; otherwise it is rounded down to the nearest multiple.

multicore

Whether to use multiple CPU cores to parallelize the construction of trees. Parallelization is done with the 'parallel' library's default settings.

class_quotas

How many rows from each class to use in each tree (useful when there is a class imbalance). Must be a numeric vector or a named list with the number of desired rows to sample for each level of the target variable. Ignored when sample_weights is passed. Note that using more rows than the data originally had might result in incorrect out-of-bag error estimates.

sample_weights

Probability of sampling each row when building a tree. Must be a numeric vector. If not defined, all rows have the same probability. Note that, depending on the structure of the data, setting this might result in incorrect out-of-bag error estimates.

fulldepth

Whether to grow the trees to full depth. Ignored when passing c50_control.

replacement

Whether to sample rows with replacement.

c50_control

Custom parameters for growing trees. Must be a C5.0Control object compatible with the 'C50' package.

na.action

A function indicating how to handle NAs. Default is to include missing values when building a tree (see 'C50' documentation).

drop_threshold

Drop a tree whenever its resulting out-of-bag classification accuracy falls below a certain threshold specified here. Must be a number between 0 and 1.
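
As a rough illustration of how drop_threshold filters trees, the rule can be sketched in base R. This is not the package's internal code: keep_trees is a hypothetical helper, and the per-tree accuracies below are made up.

```r
# Hypothetical helper: flag the trees whose out-of-bag accuracy
# reaches the threshold. NULL (the default) keeps every tree.
keep_trees <- function(oob_accuracy, drop_threshold = NULL) {
  if (is.null(drop_threshold)) {
    return(rep(TRUE, length(oob_accuracy)))
  }
  stopifnot(drop_threshold >= 0, drop_threshold <= 1)
  # A tree is dropped when its accuracy falls below the threshold.
  oob_accuracy >= drop_threshold
}

# Made-up out-of-bag accuracies for four trees.
oob_accuracy <- c(0.95, 0.62, 0.88, 0.71)
kept <- keep_trees(oob_accuracy, drop_threshold = 0.7)
```

With these values only the second tree (accuracy 0.62) would be discarded.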

Details

Note that while this algorithm forces each tree to consider possible splits with variables from all groups, it doesn't guarantee that they will end up having splits with variables from different groups.
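
The per-tree variable selection described above can be sketched in base R. This is an illustration of the idea under simple assumptions, not the package's actual implementation; sample_group_vars is a hypothetical helper name.

```r
# Hypothetical helper: draw mtry[i] variable names from the i-th group.
sample_group_vars <- function(groups, mtry) {
  stopifnot(length(groups) == length(mtry))
  picked <- Map(function(vars, k) {
    # Sample positions without replacement, then index into the group.
    vars[sample.int(length(vars), size = min(k, length(vars)))]
  }, groups, mtry)
  unlist(picked, use.names = FALSE)
}

groups <- list(c("Sepal.Length", "Sepal.Width"),
               c("Petal.Length", "Petal.Width"))
set.seed(1)
# One candidate variable from each group for a single tree.
tree_vars <- sample_group_vars(groups, mtry = c(1, 1))
```

Every tree is then offered at least one variable from each group, although the tree-building algorithm may still end up splitting on only some of them.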

The original Random Forest algorithm recommends sampling a total of sqrt(n_features) variables, but this might not work well when the groups contain very different numbers of variables.
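
To see why unequal groups matter, here is a small sketch of the "auto" rule for mtry. It assumes the formula means sqrt(total variables) scaled by each group's share of the variables, with a floor of 1; auto_mtry is a hypothetical helper and the group sizes are made up.

```r
# Hypothetical helper mirroring the documented "auto" rule
# (an interpretation of the formula, not the package's code).
auto_mtry <- function(groups) {
  m_group <- lengths(groups)  # number of variables in each group
  m_total <- sum(m_group)     # number of variables across all groups
  pmax(1, round(sqrt(m_total) * m_group / m_total))
}

# Made-up sizes: 100 user variables versus 4 product variables.
groups <- list(paste0("user_", 1:100), paste0("product_", 1:4))
auto_mtry(groups)
```

Under this reading the large group gets 10 variables per tree and the small group gets the minimum of 1, whereas drawing sqrt(104) variables (about 10) from the pooled set would often leave the small group out entirely.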

Everything outside the tree-building is implemented in native R code and might therefore be slow. Trees are grown using the C5.0 algorithm from the 'C50' library, so the model can be used for classification only (not for regression). Refer to the 'C50' library documentation for details of the tree-building algorithm.

See Also

'C50' library: https://cran.r-project.org/package=C50

Examples

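library(StratifiedRF)
data(iris)
# Two groups of predictors: sepal measurements and petal measurements
groups <- list(c("Sepal.Length", "Sepal.Width"), c("Petal.Length", "Petal.Width"))
# Take one variable from each group when building each tree
mtry <- c(1, 1)
m <- stratified_rf(iris, "Species", groups, mtry, ntrees = 2, multicore = FALSE)
summary(m)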

Example output

Stratified Random Forest object

Out-of-bag prediction error:  5.38%

Confusion Matrix
            pred
real         setosa versicolor virginica
  setosa         29          1         0
  versicolor      0         28         2
  virginica       0          2        31

Class  setosa - Precision: 100% Recall: 96.6% 
Class  versicolor - Precision: 90.3% Recall: 93.3% 
Class  virginica - Precision: 93.9% Recall: 93.9% 

Predictor Variables:
Group 1 :  Sepal.Length, Sepal.Width 
Group 2 :  Petal.Length, Petal.Width 

Target Variable:  Species

Built with 2 trees

StratifiedRF documentation built on May 1, 2019, 10:28 p.m.