wsrf: Build a Forest of Weighted Subspace Decision Trees

View source: R/wsrf.R


Description

Build weighted subspace C4.5-based decision trees to construct a forest.

Usage


## S3 method for class 'formula'
wsrf(formula, data, ...)
## Default S3 method:
wsrf(x, y, mtry=floor(log2(length(x))+1), ntree=500,
     weights=TRUE, parallel=TRUE, na.action=na.fail,
     importance=FALSE, nodesize=2, clusterlogfile, ...)

Arguments

x, formula

a data frame or a matrix of predictors, or a formula with a response but no interaction terms.

y

a response vector.

data

a data frame in which to interpret the variables named in the formula.

ntree

number of trees to grow. The default is 500.

mtry

number of variables to choose as candidates at each node split. The default is floor(log2(length(x)) + 1).

weights

logical. TRUE (the default) uses weighted subspace selection when choosing the candidate variables at each node; FALSE uses random selection, in which case the trees grown are plain C4.5 trees.

na.action

a function indicating how to handle NA values in the data. The default is na.fail. If NULL, no action is taken.

parallel

whether to build the trees in parallel: TRUE to use multiple cores on the local machine, a specification of cluster nodes to distribute the work, or FALSE to run sequentially.

importance

logical. Should importance of predictors be assessed? The default is FALSE.

nodesize

minimum size of a leaf node, i.e., the minimum number of observations a leaf node represents. The default is 2.

clusterlogfile

character. The path name of the log file used when building the model on a cluster; intended for debugging.

...

optional parameters to be passed to the low level function wsrf.default; see the call sketch below.
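A minimal sketch covering both calling interfaces, assuming the built-in iris data is available; the non-default argument values are illustrative only:

  # Formula interface: grow 50 trees sequentially.
  m1 <- wsrf(Species ~ ., data=iris, ntree=50, parallel=FALSE)
  # Default interface: predictors and response passed separately.
  m2 <- wsrf(x=iris[, -5], y=iris$Species, ntree=50, parallel=FALSE)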

Details

See Xu, Huang, Williams, Wang, and Ye (2012) for details of the algorithm, and Zhao, Williams, and Huang (2017) for details of the package.

Currently, wsrf can only be used for classification. When weights=FALSE, wsrf grows C4.5-based trees (Quinlan 1993), using binary splits for continuous predictors (variables) and k-way splits for categorical ones. For continuous predictors the observed values themselves serve as candidate split points; no discretization is applied. The only stopping condition for a split is that the node size must not be less than nodesize.
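A minimal sketch of the two modes described above, again using the built-in iris data; the argument values are illustrative only:

  # Default: weighted subspace selection at each node.
  m.weighted <- wsrf(Species ~ ., data=iris, parallel=FALSE)
  # weights=FALSE gives random subspace selection and C4.5-style trees;
  # nodesize=10 requires each leaf to represent at least 10 observations.
  m.c45 <- wsrf(Species ~ ., data=iris, weights=FALSE, nodesize=10, parallel=FALSE)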

Value

An object of class wsrf, which is a list with the following components (a short access sketch follows the list):

confusion

the confusion matrix of the prediction (based on OOB data).

oob.times

number of times cases are ‘out-of-bag’ (and thus used in computing the OOB error estimate).

predicted

the predicted values of the input data based on out-of-bag samples.

useweights

logical. Whether weighted subspace selection was used. NULL if the model was obtained by combining multiple wsrf models that differ in their values of useweights.

mtry

integer. The number of variables to be chosen when splitting a node.
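A short sketch of inspecting these components, assuming a fitted model such as model <- wsrf(Species ~ ., data=iris, parallel=FALSE) and that the components are accessed as ordinary list elements:

  model$confusion           # OOB confusion matrix
  head(model$oob.times)     # how often each case was out-of-bag
  head(model$predicted)     # OOB predictions for the training cases
  model$useweights          # whether weighted subspace selection was used
  model$mtry                # number of candidate variables per split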

Author(s)

He Zhao and Graham Williams (SIAT, CAS)

References

Xu, B. and Huang, J. Z. and Williams, G. J. and Wang, Q. and Ye, Y. 2012 "Classifying very high-dimensional data with random forests built from small subspaces". International Journal of Data Warehousing and Mining (IJDWM), 8(2), 44–63.

Quinlan, J. R. 1993 C4.5: Programs for Machine Learning. Morgan Kaufmann.

Zhao, H. and Williams, G. J. and Huang, J. Z. 2017 "wsrf: An R Package for Classification with Scalable Weighted Subspace Random Forests". Journal of Statistical Software, 77(3), 1–30. doi:10.18637/jss.v077.i03

Examples

  library("wsrf")

  # Prepare parameters.
  ds <- iris
  dim(ds)
  names(ds)
  target <- "Species"
  vars   <- names(ds)
  if (any(is.na(ds[vars]))) ds[vars] <- randomForest::na.roughfix(ds[vars])
  ds[target] <- as.factor(ds[[target]])
  (tt  <- table(ds[target]))
  form <- as.formula(paste(target, "~ ."))
  set.seed(42)
  train <- sample(nrow(ds), 0.7*nrow(ds))
  test  <- setdiff(seq_len(nrow(ds)), train)

  # Build model.  We disable parallelism here, since CRAN Repository
  # Policy (https://cran.r-project.org/web/packages/policies.html)
  # limits the usage of multiple cores to save the limited resource of
  # the check farm.

  model.wsrf <- wsrf(form, data=ds[train, vars], parallel=FALSE)
  
  # View model.
  print(model.wsrf)
  print(model.wsrf, tree=1)

  # Evaluate.
  strength(model.wsrf)
  correlation(model.wsrf)
  res <- predict(model.wsrf, newdata=ds[test, vars], type=c("response", "waprob"))
  actual <- ds[test, target]
  (accuracy.wsrf <- mean(res$response==actual))
  
  # Different type of prediction.
  cl <- apply(res$waprob, 1, which.max)
  cl <- factor(cl, levels=1:ncol(res$waprob), labels=levels(actual))
  (accuracy2.wsrf <- mean(cl==actual))
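
  # A sketch of an out-of-bag (OOB) estimate of accuracy, computed from
  # the documented 'predicted' component of the model (see the Value
  # section); uses base R only.
  oob.pred     <- model.wsrf$predicted
  actual.train <- ds[train, target]
  (accuracy.oob <- mean(oob.pred == actual.train, na.rm=TRUE))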
