rfgv: rfgv


Description

Random Forest for Grouped Variables. Only implemented for binary classification. The function builds a large number of random decision trees based on a variant of the CARTGV method.

Usage

rfgv(data, group, groupImp, ntree = 200,
  mtry_group = floor(sqrt(length(unique(group[!is.na(group)])))),
  maxdepth = 1, replace = T, sampsize = ifelse(replace == T,
  nrow(data), floor(0.632 * nrow(data))), case_min = 1,
  grp.importance = TRUE, test = NULL, keep_forest = F, crit = 1,
  penalty = "No", sampvar = FALSE, mtry_var)

Arguments

data

a data frame containing the response variable and the predictors, used to grow the trees. The response variable must be the first variable of the data frame, must be named "Y", and must be coded with the two levels "0" and "1".
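A minimal sketch of preparing a data frame in the layout described above, using base R only. The data frame `raw` and its columns (`outcome`, `x1`, `x2`) are hypothetical, and storing the response as a factor (rather than a character vector) is an assumption, since the required type is not stated here.

```r
# Hypothetical raw data: a binary outcome plus two predictors.
raw <- data.frame(
  x1 = c(0.5, 1.2, -0.3, 2.1),
  outcome = c("no", "yes", "yes", "no"),
  x2 = c(10, 12, 9, 15),
  stringsAsFactors = FALSE
)

# Recode the response to the two levels "0" and "1" (factor coding is assumed).
Y <- factor(ifelse(raw$outcome == "yes", "1", "0"), levels = c("0", "1"))

# Put the response first and name it "Y"; the predictors follow.
data_prepared <- data.frame(Y = Y, raw[, c("x1", "x2")])

names(data_prepared)     # "Y" "x1" "x2"
levels(data_prepared$Y)  # "0" "1"
```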

group

a vector with the group number of each variable. (WARNING: if there are "p" groups, the groups must be numbered from "1" to "p" in increasing order. The group label of the response variable is missing, i.e. NA.)
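As an illustration, a group vector for a data frame with the response followed by five predictors in three groups could be built as below. The variable names and the assignment of predictors to groups are hypothetical; the checks only restate the constraints given in the warning.

```r
# Hypothetical layout: response Y first, then 5 predictors in 3 groups.
var_names <- c("Y", "x1", "x2", "x3", "x4", "x5")

# One entry per column of the data frame; NA for the response.
group_example <- c(NA, 1, 1, 2, 3, 3)
names(group_example) <- var_names

# Sanity checks mirroring the constraints described above.
labels <- sort(unique(group_example[!is.na(group_example)]))
p <- length(labels)
stopifnot(is.na(group_example[1]))     # the response has no group label
stopifnot(all(labels == seq_len(p)))   # labels are exactly 1, ..., p
```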

groupImp

a vector which indicates the group number of each variable (for the groups used to compute the group importance).

ntree

an integer indicating the number of trees to grow

mtry_group

an integer indicating the number of groups randomly sampled as candidates at each split. The default is the floor of the square root of the number of groups.
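The default shown in Usage can be evaluated by hand. The sketch below uses a hypothetical group vector with five groups; the expression itself is copied from the function signature.

```r
# Hypothetical group vector: NA for the response, then 9 predictors in 5 groups.
group_example <- c(NA, 1, 1, 2, 3, 3, 4, 4, 4, 5)

# Number of distinct groups, ignoring the NA of the response.
n_groups <- length(unique(group_example[!is.na(group_example)]))

# Default value of mtry_group as written in Usage.
mtry_group_default <- floor(sqrt(n_groups))

n_groups            # 5
mtry_group_default  # 2
```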

maxdepth

an integer indicating the maximal depth of a split-tree. The default value is 1 (see Usage).

replace

a boolean indicating whether cases are sampled with replacement (TRUE) or without replacement (FALSE).

sampsize

an integer indicating the size of the bootstrap samples. By default, nrow(data) when replace is TRUE and floor(0.632 * nrow(data)) otherwise.
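The default sample size from Usage can be reproduced as follows. The helper `default_sampsize` and the row count are hypothetical and only mirror the default expression; how rfgv draws the cases internally is not shown here, so the call to `sample()` is an assumption about one reasonable way to do it.

```r
# Hypothetical helper mirroring the default expression in Usage.
default_sampsize <- function(n, replace = TRUE) {
  if (replace) n else floor(0.632 * n)
}

n <- 150  # hypothetical number of rows in `data`
size_with <- default_sampsize(n, replace = TRUE)      # 150
size_without <- default_sampsize(n, replace = FALSE)  # 94

# Drawing case indices without replacement (assumed mechanism).
set.seed(1)
idx <- sample(n, size = size_without, replace = FALSE)
length(unique(idx))  # 94
```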

case_min

an integer indicating the minimum number of cases/non-cases in a terminal node. The default is 1.

grp.importance

a boolean indicating whether the importance of each group needs to be computed.

test

an independent data frame containing the same variables as "data".

keep_forest

a boolean indicating if the forest will be retained in the output object

crit

an integer indicating the impurity function used (1 = Gini index / 2 = Entropy / 3 = Misclassification rate).
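For reference, the three impurity measures named above have the following textbook forms for a two-class node. This is a sketch under stated assumptions, not the package's internal code: the function names are hypothetical, `p` is the proportion of class "1" in the node, and the base-2 logarithm for the entropy is an assumption (the package may use another base or scaling).

```r
# p: proportion of class "1" in a node (binary response).
gini <- function(p) 2 * p * (1 - p)   # equals 1 - p^2 - (1 - p)^2

entropy <- function(p) {
  probs <- c(p, 1 - p)
  probs <- probs[probs > 0]           # treat 0 * log(0) as 0
  -sum(probs * log2(probs))
}

misclass <- function(p) min(p, 1 - p)

# All three are zero for a pure node and maximal at p = 0.5.
gini(0.5)      # 0.5
entropy(0.5)   # 1
misclass(0.5)  # 0.5
```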

penalty

a character string indicating whether the decrease in node impurity must take the group size into account. Four penalties are available: "No", "Size", "Root.size" or "Log".

sampvar

a boolean indicating whether, within each splitting tree, a subset of variables is drawn for each group.

mtry_var

a vector whose length equals the number of groups. It indicates the number of variables drawn for each group. Useful only if sampvar=TRUE.
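One way to build such a vector is to start from the group sizes. In the sketch below the group vector is hypothetical, and capping the number of drawn variables at the group size is an assumption made to avoid requesting more variables than a group contains; the documentation does not say how rfgv handles that case.

```r
# Hypothetical group vector: NA for the response, then groups of size 3, 2 and 1.
group_example <- c(NA, 1, 1, 1, 2, 2, 3)

# table() ignores NA by default, so the response is not counted.
group_sizes <- as.vector(table(group_example))

# Draw at most 2 variables per group, never more than the group holds.
mtry_var_example <- pmin(2, group_sizes)

group_sizes       # 3 2 1
mtry_var_example  # 2 2 1
```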

Value

a list with elements:

Examples

data(rfgv_dataset)
data(group)
data <- rfgv_dataset
# The first column identifies the subset ("train", "test" or "validation").
# Keep the matching rows and drop that first column with the negative index.
train <- data[which(data[, 1] == "train"), -1]
test <- data[which(data[, 1] == "test"), -1]
validation <- data[which(data[, 1] == "validation"), -1]

forest<-rfgv(train,
             group=group,
             groupImp=group,
             ntree=1,
             mtry_group=3,
             sampvar=TRUE,
             maxdepth=2,
             replace=TRUE,
             case_min=1,
             sampsize=nrow(train),
             mtry_var=rep(2,5),
             grp.importance=TRUE,
             test=test,
             keep_forest=FALSE,
             crit=1,
             penalty="No")

dtrfgv/dtrfgv documentation built on May 6, 2019, 8:02 p.m.