The **wsrf** package is a
parallel implementation of the Weighted Subspace Random Forest
algorithm (wsrf) of @xu2012classifying. A novel variable weighting
method is used for variable subspace selection in place of the
traditional approach of random variable sampling. This new approach
is particularly useful in building models for high dimensional data
--- often consisting of thousands of variables. Parallel computation
is used to take advantage of multi-core machines and clusters of
machines to build random forest models from high dimensional data with
reduced elapsed times.
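To make the contrast with random variable sampling concrete, the following minimal sketch draws a variable subspace both ways using base R only. The variable names and the `relevance` scores are hypothetical and invented for illustration; they are not the weights that **wsrf** computes, which are defined in @xu2012classifying.

```r
# Hypothetical variables and relevance scores (illustration only;
# not the weighting actually computed by wsrf).
vars      <- paste0("x", 1:10)
relevance <- c(5, 4, 3, 1, 1, 1, 1, 1, 1, 1)
mtry      <- 3

set.seed(42)

# Traditional random forest: every variable is equally likely to be drawn.
uniform.subspace <- sample(vars, mtry)

# Weighted subspace: selection probability is proportional to the score.
weighted.subspace <- sample(vars, mtry, prob = relevance / sum(relevance))

print(uniform.subspace)
print(weighted.subspace)
```

With thousands of variables and only a few informative ones, uniform sampling often yields subspaces with no informative variable at all, which is the situation a weighted draw is meant to avoid.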

Currently, **wsrf** requires R (>= 3.3.0) and
**Rcpp** (>= 0.10.2)
[@dirk2011rcpp; @dirk2013seamless]. Multi-threading additionally
requires a C++ compiler that supports the C++11 standard thread
library. To install the latest stable version of the package, run the
following from within R:

    install.packages("wsrf")

or the latest development version:

    devtools::install_github("simonyansenzhao/wsrf")

Versions of R before 3.3.0 do not fully support C++11, so earlier releases of **wsrf** provided alternative installation options. Since version 1.6.0, those options are no longer supported. Their usage is described in the documentation of previous versions.
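The requirements stated above can be checked before installation. The sketch below uses only base R and **utils**; the variable names `r.ok` and `rcpp.ok` are chosen here for illustration.

```r
# Check the stated requirements before installing wsrf.
r.ok <- getRversion() >= "3.3.0"

# Rcpp may not be installed yet, so guard the version lookup.
rcpp.ok <- requireNamespace("Rcpp", quietly = TRUE) &&
  utils::packageVersion("Rcpp") >= "0.10.2"

cat("R >= 3.3.0:    ", r.ok, "\n")
cat("Rcpp >= 0.10.2:", rcpp.ok, "\n")
```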

This section demonstrates how to use **wsrf**, especially on a cluster
of machines.

The example uses a small dataset, *weather*, from
**rattle.data**
[@rattle]. See the help page in R (`?weather`) for
more details of *weather*. Below is some basic information about it.

    library("rattle.data")
    ds <- weather
    dim(ds)
    names(ds)

Before building the model we need to prepare the training dataset. First we note the various roles played by the different variables, including identifying the irrelevant variables.

    target <- "RainTomorrow"
    ignore <- c("Date", "Location", "RISK_MM")
    (vars <- setdiff(names(ds), ignore))
    dim(ds[vars])

Next we deal with missing values, using `na.roughfix()` from
**randomForest** to take care of them.

    library("randomForest")
    if (sum(is.na(ds[vars]))) ds[vars] <- na.roughfix(ds[vars])
    ds[target] <- as.factor(ds[[target]])
    (tt <- table(ds[target]))
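For readers unfamiliar with `na.roughfix()`, it imputes missing numeric values with the column median and missing factor values with the most frequent level. The following base R sketch imitates that behaviour on a toy data frame. The function `rough.fix()` and the data are hypothetical stand-ins written for this illustration, not the **randomForest** implementation.

```r
# Toy data frame with missing values (hypothetical, not the weather data).
toy <- data.frame(
  temp = c(10, NA, 14, 20),
  wind = factor(c("N", "S", NA, "N"))
)

# Median for numeric columns, most frequent level for factor columns.
rough.fix <- function(df) {
  for (v in names(df)) {
    x <- df[[v]]
    if (is.numeric(x)) {
      x[is.na(x)] <- median(x, na.rm = TRUE)
    } else if (is.factor(x)) {
      x[is.na(x)] <- names(which.max(table(x)))
    }
    df[[v]] <- x
  }
  df
}

fixed <- rough.fix(toy)
print(fixed)
```

This kind of imputation is quick but crude, which is why it is only applied when missing values are actually present.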

We construct the formula that describes the model which will predict the target based on all other variables.

    (form <- as.formula(paste(target, "~ .")))

Finally we create the randomly selected training and test datasets, setting a seed so that the results can be exactly replicated.

    seed <- 42
    set.seed(seed)
    length(train <- sample(nrow(ds), 0.7*nrow(ds)))
    length(test <- setdiff(seq_len(nrow(ds)), train))

The function to build a weighted random forest model in **wsrf** is:

    wsrf(formula, data, ...)

and

    wsrf(x, y,
         mtry=floor(log2(length(x))+1),
         ntree=500,
         weights=TRUE,
         parallel=TRUE,
         na.action=na.fail,
         importance=FALSE,
         nodesize=2,
         clusterlogfile,
         ...)
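The default `mtry` in the signature above grows only logarithmically with the number of predictors. The sketch below evaluates the same expression for a few values of `p`, the number of columns in `x`. The helper name `default.mtry` is introduced here for illustration and is not part of **wsrf**.

```r
# Default subspace size from the signature: floor(log2(p) + 1),
# where p is the number of predictor variables (length(x) for a data frame).
default.mtry <- function(p) floor(log2(p) + 1)

p <- c(20, 100, 1000, 10000)
print(data.frame(variables = p, mtry = default.mtry(p)))
```

For 10000 variables the default subspace holds only 14 of them, which illustrates why the way those few variables are selected matters for high dimensional data.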

We use the training dataset to build a random forest model.
Parameters not given in the call keep their default values: `500` for
`ntree` (the number of trees), `TRUE` for `weights` (build a weighted
subspace random forest rather than a traditional random forest), and
so on. The default for `parallel` is `TRUE` (use multi-threading or
other parallel options); here it is set to `FALSE` so that the model
is built without parallel computation.

    library("wsrf")
    model.wsrf.1 <- wsrf(form, data=ds[train, vars], parallel=FALSE)
    print(model.wsrf.1)
    print(model.wsrf.1, 1)  # Print tree 1.

Then, `predict` the classes of the test data.

    cl <- predict(model.wsrf.1, newdata=ds[test, vars], type="class")$class
    actual <- ds[test, target]
    (accuracy.wsrf <- mean(cl == actual, na.rm=TRUE))

Thus, we have built a model that is around `r round(100*accuracy.wsrf, 0)`% accurate on unseen testing data.
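Overall accuracy hides how the errors are split between classes. As a complement, the sketch below builds a confusion matrix with base R. The label vectors `actual.toy` and `cl.toy` are hypothetical values made up for illustration, not output from the model above. With the real objects, the same `table()` call could be applied to `actual` and `cl`.

```r
# Hypothetical actual and predicted labels (not output from the model above).
lv         <- c("No", "Yes")
actual.toy <- factor(c("No", "No",  "Yes", "No", "Yes", "No"), levels = lv)
cl.toy     <- factor(c("No", "Yes", "Yes", "No", "No",  "No"), levels = lv)

# Rows are the true classes, columns the predicted classes.
confusion <- table(actual = actual.toy, predicted = cl.toy)
print(confusion)

# Accuracy is the share of observations on the diagonal.
accuracy.toy <- sum(diag(confusion)) / sum(confusion)

# Recall for "Yes": correctly predicted "Yes" over all actual "Yes".
recall.yes <- confusion["Yes", "Yes"] / sum(confusion["Yes", ])

print(c(accuracy = accuracy.toy, recall.yes = recall.yes))
```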

Using a different random seed, we obtain another model.

    # Here we build another model without weighting.
    model.wsrf.2 <- wsrf(form, data=ds[train, vars], weights=FALSE, parallel=FALSE)
    print(model.wsrf.2)

We can also derive a subset of the forest from the model or a combination of multiple forests.

    submodel.wsrf <- subset.wsrf(model.wsrf.1, 1:150)
    print(submodel.wsrf)
    bigmodel.wsrf <- combine.wsrf(model.wsrf.1, model.wsrf.2)
    print(bigmodel.wsrf)

Next, we build the model on a cluster of servers.

    servers <- paste0("node", 31:40)
    model.wsrf.3 <- wsrf(form, data=ds[train, vars], parallel=servers)

All we need is a character vector specifying the hostnames of the
nodes to use, or a named integer vector whose element values give the
number of threads to use for model building, that is, how many trees
are built simultaneously. More detailed descriptions of
**wsrf** are presented in the
manual.
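The two ways of describing a cluster can be sketched as follows. The hostnames and thread counts are hypothetical, and the sketch assumes that the names of the integer vector are the hostnames; check the manual before relying on this. The call to `wsrf()` is left commented out because it needs reachable machines.

```r
# Form 1: character vector of hostnames (hypothetical node names).
servers.chr <- paste0("node", 31:40)

# Form 2: named integer vector. Assumption: names are hostnames and
# values are the number of threads to use on that host.
servers.int <- c(node31 = 4L, node32 = 4L, node33 = 2L)

# Total number of trees that would be built simultaneously under form 2.
total.threads <- sum(servers.int)
print(total.threads)

# Either vector would then be passed as the 'parallel' argument, e.g.:
# model.wsrf.4 <- wsrf(form, data=ds[train, vars], parallel=servers.int)
```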
