Build a Forest of Weighted Subspace Decision Trees

Description

Build weighted subspace C4.5-based decision trees to construct a forest.

Usage

## S3 method for class 'formula'
wsrf(formula, data, ...)
## Default S3 method:
wsrf(x, y, mtry=floor(log2(length(x))+1), ntree=500,
     weights=TRUE, parallel=TRUE, na.action=na.fail,
     importance=FALSE, nodesize=2, clusterlogfile, ...)

Arguments

x, formula

a data frame or a matrix of predictors, or a formula with a response but no interaction terms.

y

a response vector.

data

a data frame in which to interpret the variables named in the formula.

ntree

number of trees to grow. By default, 500.

mtry

number of variables to choose as candidates at each node split. By default, floor(log2(length(x))+1).

weights

logical. TRUE (the default) for weighted subspace selection; FALSE for random subspace selection, in which case the trees are plain C4.5-based trees.

na.action

a function indicating the behaviour when NA values are encountered in the data. By default, na.fail. If NULL, no action is taken.

parallel

whether to build the trees using multiple cores (TRUE, the default), on cluster nodes, or sequentially (FALSE).

importance

should importance of predictors be assessed?

nodesize

minimum size of leaf node, i.e., minimum number of observations a leaf node represents. By default, 2.

clusterlogfile

character. The pathname of the log file used when building the model on a cluster. For debugging.

...

optional parameters to be passed to the low level function wsrf.default.
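
The two interfaces above can be called as sketched below. This is an illustrative example, not part of the package documentation; the ntree value and the use of the built-in iris data are arbitrary choices for demonstration.

```r
library("wsrf")

# Formula interface: response on the left, predictors on the right.
model.f <- wsrf(Species ~ ., data=iris, ntree=50, parallel=FALSE)

# Default interface: a data frame (or matrix) of predictors
# plus a separate response vector.
model.d <- wsrf(iris[, -5], iris[, 5], ntree=50, parallel=FALSE)
```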

Details

See Xu, Huang, Williams, Wang, and Ye (2012) for more details of the algorithm.

Currently, wsrf can only be used for classification. When weights=FALSE, wsrf grows C4.5-based trees (Quinlan (1993)), using binary splits for continuous predictors (variables) and k-way splits for categorical ones. For continuous predictors, each observed value is considered as a candidate split point; no discretization is performed. The only stopping condition is that the size of a node must not fall below nodesize.
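
The distinction between the two subspace-selection modes can be sketched as follows; the iris data and the seed are illustrative assumptions, not part of the package documentation.

```r
library("wsrf")
set.seed(42)

# Weighted subspace selection (the default): candidate variables at
# each split are sampled with information-based weights.
model.weighted <- wsrf(Species ~ ., data=iris, weights=TRUE,  parallel=FALSE)

# Random subspace selection: candidates are sampled uniformly, and
# the individual trees are plain C4.5-based trees.
model.random   <- wsrf(Species ~ ., data=iris, weights=FALSE, parallel=FALSE)
```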

Value

An object of class wsrf, which is a list with the following components:

confusion

the confusion matrix of the prediction (based on OOB data).

oob.times

number of times cases are ‘out-of-bag’ (and thus used in computing the OOB error estimate).

predicted

the predicted values of the input data based on out-of-bag samples.

useweights

logical. Whether weighted subspace selection was used. NULL if the model was obtained by combining multiple wsrf models and one of them has a different value of useweights.

mtry

integer. The number of variables chosen as candidates when splitting a node.
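
The components listed above can be accessed from the fitted object in the usual way for a list; the sketch below assumes a model fitted on the iris data for illustration.

```r
library("wsrf")
model <- wsrf(Species ~ ., data=iris, parallel=FALSE)

model$confusion        # OOB confusion matrix
head(model$oob.times)  # how often each case was out-of-bag
head(model$predicted)  # OOB predictions for the training cases
model$mtry             # subspace size used at each split
```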

Author(s)

He Zhao and Graham Williams (SIAT)

References

Xu B, Huang JZ, Williams G, Wang Q, Ye YM (2012). "Classifying very high-dimensional data with random forests built from small subspaces." International Journal of Data Warehousing and Mining (IJDWM), 8(2), 44-63.

Quinlan J. R. (1993). "C4.5: Programs for Machine Learning". Morgan Kaufmann.

Examples

  library("wsrf")

  # Prepare parameters.
  ds <- rattle::weather
  dim(ds)
  names(ds)
  target <- "RainTomorrow"
  id     <- c("Date", "Location")
  risk   <- "RISK_MM"
  ignore <- c(id, if (exists("risk")) risk) 
  vars   <- setdiff(names(ds), ignore)
  if (sum(is.na(ds[vars]))) ds[vars] <- randomForest::na.roughfix(ds[vars])
  ds[target] <- as.factor(ds[[target]])
  (tt  <- table(ds[target]))
  form <- as.formula(paste(target, "~ ."))
  set.seed(42)
  train <- sample(nrow(ds), 0.7*nrow(ds))
  test  <- setdiff(seq_len(nrow(ds)), train)

  # Build model.  We disable parallelism here, since CRAN Repository
  # Policy (https://cran.r-project.org/web/packages/policies.html)
  # limits the usage of multiple cores to save the limited resource of
  # the check farm.

  model.wsrf <- wsrf(form, data=ds[train, vars], parallel=FALSE)
  
  # View model.
  print(model.wsrf)
  print(model.wsrf, tree=1)

  # Evaluate.
  strength(model.wsrf)
  correlation(model.wsrf)
  res <- predict(model.wsrf, newdata=ds[test, vars], type=c("response", "waprob"))
  actual <- ds[test, target]
  (accuracy.wsrf <- mean(res$response==actual))
  
  # Different type of prediction.
  cl <- apply(res$waprob, 1, which.max)
  cl <- factor(cl, levels=1:ncol(res$waprob), labels=levels(actual))
  (accuracy2.wsrf <- mean(cl==actual))