Build a Forest of Weighted Subspace Decision Trees
Description
Build weighted subspace C4.5-based decision trees to construct a forest.
Usage
wsrf(x, ...)

## S3 method for class 'formula':
wsrf(formula, data, ...)

## Default S3 method:
wsrf(x, y, ntree=500, mtry, weights=TRUE, na.action, parallel=TRUE,
     importance=FALSE, nodesize=2, clusterlogfile, ...)
Arguments
x, formula 
a data frame or a matrix of predictors, or a formula with a response but no interaction terms. 
y 
a response vector. 
data 
a data frame in which to interpret the variables named in the formula. 
ntree 
number of trees to grow. By default, 500. 
mtry 
number of variables to choose as candidates at each node split. 
weights 
logical. TRUE to use weighted subspace selection when sampling the candidate variables at each node; FALSE to use plain random subspace selection. 
na.action 
a function indicating the behaviour when NA values are encountered in the data. 
parallel 
whether to run on multiple cores (TRUE), on cluster nodes (a character vector of node names), or sequentially (FALSE). 
importance 
should importance of predictors be assessed? 
nodesize 
minimum size of a leaf node, i.e., the minimum number of observations a leaf node represents. By default, 2. 
clusterlogfile 
character. The pathname of the log file used when building the model on a cluster. For debugging. 
... 
optional parameters to be passed to the low-level function. 

Details
See Xu, Huang, Williams, Wang, and Ye (2012) for more details of the algorithm.
Currently, wsrf can only be used for classification. When
weights=FALSE, C4.5-based trees (Quinlan (1993)) are grown by
wsrf, where a binary split is used for continuous predictors
(variables) and a k-way split for categorical ones. For
continuous predictors, each observed value is used as a candidate
split point; no discretization is performed. The only stopping
condition for splitting is that the node size must not fall below
nodesize.
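A minimal sketch of the weights=FALSE behaviour described above (the iris data and the ntree value here are illustrative choices, not taken from this page):

```r
library("wsrf")  # assumes the wsrf package is installed

# Grow plain (unweighted) C4.5-based trees: the candidate variables
# at each node are sampled without weighting.
model.c45 <- wsrf(Species ~ ., data=iris, weights=FALSE,
                  ntree=50, nodesize=2, parallel=FALSE)
print(model.c45)
```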
Value
An object of class wsrf, which is a list with the following components:
confusion 
the confusion matrix of the prediction (based on OOB data). 
oob.times 
number of times cases are 'out-of-bag' (and thus used in computing the OOB error estimate). 
predicted 
the predicted values of the input data based on out-of-bag samples. 
useweights 
logical. Whether weighted subspace selection was used. NULL if the model was obtained by combining multiple wsrf models and one of them has a different value of 'useweights'. 
mtry 
integer. The number of variables to be chosen when splitting a node. 
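The components above can be read directly off the returned list; a minimal sketch (the iris data and parameter values are illustrative, not from this page):

```r
library("wsrf")  # assumes the wsrf package is installed

# Fit a small forest on a built-in dataset for illustration.
model <- wsrf(Species ~ ., data=iris, ntree=20, parallel=FALSE)

model$confusion        # OOB confusion matrix
head(model$oob.times)  # how often each case was out-of-bag
model$useweights       # whether weighted subspace selection was used
model$mtry             # number of variables sampled per node split
```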
Author(s)
He Zhao and Graham Williams (SIAT)
References
Xu B, Huang JZ, Williams G, Wang Q, Ye YM (2012). "Classifying very high-dimensional data with random forests built from small subspaces." International Journal of Data Warehousing and Mining (IJDWM), 8(2), 44-63.
Quinlan J. R. (1993). "C4.5: Programs for Machine Learning". Morgan Kaufmann.
Examples
library("wsrf")

# Prepare parameters.
ds     <- rattle::weather
dim(ds)
names(ds)
target <- "RainTomorrow"
id     <- c("Date", "Location")
risk   <- "RISK_MM"
ignore <- c(id, if (exists("risk")) risk)
vars   <- setdiff(names(ds), ignore)
if (sum(is.na(ds[vars]))) ds[vars] <- randomForest::na.roughfix(ds[vars])
ds[target] <- as.factor(ds[[target]])
(tt  <- table(ds[target]))
form <- as.formula(paste(target, "~ ."))
set.seed(42)
train <- sample(nrow(ds), 0.7*nrow(ds))
test  <- setdiff(seq_len(nrow(ds)), train)

# Build model. We disable parallelism here, since the CRAN Repository
# Policy (https://cran.r-project.org/web/packages/policies.html)
# limits the usage of multiple cores to save the limited resources of
# the check farm.
model.wsrf <- wsrf(form, data=ds[train, vars], parallel=FALSE)

# View model.
print(model.wsrf)
print(model.wsrf, tree=1)

# Evaluate.
strength(model.wsrf)
correlation(model.wsrf)
res    <- predict(model.wsrf, newdata=ds[test, vars], type=c("response", "waprob"))
actual <- ds[test, target]
(accuracy.wsrf <- mean(res$response==actual))

# A different type of prediction: pick the class with the largest
# weighted-average probability.
cl <- apply(res$waprob, 1, which.max)
cl <- factor(cl, levels=1:ncol(res$waprob), labels=levels(actual))
(accuracy2.wsrf <- mean(cl==actual))
