View source: R/synthetic.rfsrc.R
synthetic | R Documentation |
Grows a synthetic random forest (RF) using RF machines as synthetic features. Applies only to regression and classification settings.
## S3 method for class 'rfsrc'
synthetic(formula, data, object, newdata,
ntree = 1000, mtry = NULL, nodesize = 5, nsplit = 10,
mtrySeq = NULL, nodesizeSeq = c(1:10,20,30,50,100),
min.node = 3,
fast = TRUE,
use.org.features = TRUE,
na.action = c("na.omit", "na.impute"),
oob = TRUE,
verbose = TRUE,
...)
formula |
Model to be fit. Must be specified unless |
data |
Data frame containing the y-outcome and x-variables. Must be specified unless |
object |
An object of class |
newdata |
Optional test data for prediction. If omitted, the training data is used. |
ntree |
Number of trees used in each RF machine. |
mtry |
|
nodesize |
|
nsplit |
Number of random splits used in randomized splitting. Increases speed when set to a small positive integer. |
mtrySeq |
Sequence of |
nodesizeSeq |
Sequence of |
min.node |
Minimum forest-averaged number of terminal nodes required for an RF machine to be retained as a synthetic feature. |
fast |
Use |
use.org.features |
Should original features be included alongside synthetic features in the final synthetic forest? |
na.action |
Action to be taken on missing data. The default, |
oob |
Preserve out-of-bag (OOB) estimation for error rates and VIMP? Defaults to |
verbose |
Display detailed output of the fitting process? Defaults to |
... |
Additional arguments passed to |
A collection of random forests are trained using different values of
nodesize
(and optionally mtry
). The out-of-bag (OOB)
predicted values from these forests are then used as synthetic
features (referred to as RF machines) to train a final synthetic random
forest. The original features can optionally be included in the final
model.
This approach is currently implemented for regression and classification settings (both univariate and multivariate).
Synthetic features are generated using OOB predictions to prevent
overfitting. To ensure that performance metrics (such as error rates
and VIMP) remain valid, the same bootstrap samples are reused across
all trees for both the synthetic forest and its constituent RF
machines. This behavior is controlled by the oob=TRUE
option. Disabling this may yield misleading performance estimates and
should be done with caution.
If values for mtrySeq
are provided, RF machines are constructed
for every combination of nodesizeSeq
and mtrySeq
.
A list with the following components:
rfMachines |
RF machines used to construct the synthetic features. |
rfSyn |
The (grow) synthetic RF built over training data. |
rfSynPred |
The predict synthetic RF built over test data (if available). |
synthetic |
List containing the synthetic features. |
opt.machine |
Optimal machine: RF machine with smallest OOB error rate. |
Hemant Ishwaran and Udaya B. Kogalur
Ishwaran H. and Malley J.D. (2014). Synthetic learning machines. BioData Mining, 7:28.
rfsrc
,
rfsrc.fast
## ------------------------------------------------------------
## compare synthetic forests to regular forest (classification)
## ------------------------------------------------------------
## rfsrc and synthetic calls
if (library("mlbench", logical.return = TRUE)) {
## simulate the data
ring <- data.frame(mlbench.ringnorm(250, 20))
## classification forests
ringRF <- rfsrc(classes ~., ring)
## synthetic forests
## 1 = nodesize varied
## 2 = nodesize/mtry varied
ringSyn1 <- synthetic(classes ~., ring)
ringSyn2 <- synthetic(classes ~., ring, mtrySeq = c(1, 10, 20))
## test-set performance
ring.test <- data.frame(mlbench.ringnorm(500, 20))
pred.ringRF <- predict(ringRF, newdata = ring.test)
pred.ringSyn1 <- synthetic(object = ringSyn1, newdata = ring.test)$rfSynPred
pred.ringSyn2 <- synthetic(object = ringSyn2, newdata = ring.test)$rfSynPred
print(pred.ringRF)
print(pred.ringSyn1)
print(pred.ringSyn2)
}
## ------------------------------------------------------------
## compare synthetic forest to regular forest (regression)
## ------------------------------------------------------------
## simulate the data
n <- 250
ntest <- 1000
N <- n + ntest
d <- 50
std <- 0.1
x <- matrix(runif(N * d, -1, 1), ncol = d)
y <- 1 * (x[,1] + x[,4]^3 + x[,9] + sin(x[,12]*x[,18]) + rnorm(n, sd = std)>.38)
dat <- data.frame(x = x, y = y)
test <- (n+1):N
## regression forests
regF <- rfsrc(y ~ ., dat[-test, ], )
pred.regF <- predict(regF, dat[test, ])
## synthetic forests using fast rfsrc
synF1 <- synthetic(y ~ ., dat[-test, ], newdata = dat[test, ])
synF2 <- synthetic(y ~ ., dat[-test, ],
newdata = dat[test, ], mtrySeq = c(1, 10, 20, 30, 40, 50))
## standardized MSE performance
mse <- c(tail(pred.regF$err.rate, 1),
tail(synF1$rfSynPred$err.rate, 1),
tail(synF2$rfSynPred$err.rate, 1)) / var(y[-test])
names(mse) <- c("forest", "synthetic1", "synthetic2")
print(mse)
## ------------------------------------------------------------
## multivariate synthetic forests
## ------------------------------------------------------------
mtcars.new <- mtcars
mtcars.new$cyl <- factor(mtcars.new$cyl)
mtcars.new$carb <- factor(mtcars.new$carb, ordered = TRUE)
trn <- sample(1:nrow(mtcars.new), nrow(mtcars.new)/2)
mvSyn <- synthetic(cbind(carb, mpg, cyl) ~., mtcars.new[trn,])
mvSyn.pred <- synthetic(object = mvSyn, newdata = mtcars.new[-trn,])
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.