View source: R/variable_selection_rfe.R
var.sel.rfe | R Documentation |
Compares random forests based on nested subsets of the variables and selects those variables leading to the forest with the smallest prediction error within a tolerance.
var.sel.rfe(
  x,
  y,
  prop.rm = 0.2,
  recalculate = TRUE,
  tol = 10,
  ntree = 500,
  mtry.prop = 0.2,
  nodesize.prop = 0.1,
  no.threads = 1,
  method = "ranger",
  type = "regression",
  importance = "impurity_corrected",
  case.weights = NULL
)
x |
matrix or data.frame of predictor variables with variables in columns and samples in rows (Note: missing values are not allowed). |
y |
vector with values of phenotype variable (Note: will be converted to factor if classification mode is used). |
prop.rm |
proportion of variables removed at each step (default: 0.2). |
recalculate |
logical stating if importance should be recalculated at each iteration (default: TRUE) |
tol |
acceptable percent loss relative to the optimal performance (the smallest subset size with a percent loss less than tol is selected; default: 10). |
ntree |
number of trees. |
mtry.prop |
proportion of variables that should be used at each split. |
nodesize.prop |
minimal number of samples in terminal nodes, given as a proportion of the number of samples. |
no.threads |
number of threads used for parallel execution. |
method |
implementation to be used ("ranger"). |
type |
mode of prediction ("regression", "classification" or "probability"). |
importance |
Variable importance mode ('none', 'impurity', 'impurity_corrected' or 'permutation'). Default is 'impurity_corrected'. |
case.weights |
Weights for sampling of training observations. Observations with larger weights will be selected with higher probability in the bootstrap (or subsampled) samples for the trees. |
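To make the effect of prop.rm concrete, the following base R sketch lists the nested subset sizes that result when a fixed proportion of the remaining variables is dropped at each step. The helper subset_sizes is hypothetical and not part of the package; rounding down, removing at least one variable per step, and stopping at a single variable are assumptions, so the sizes used by var.sel.rfe may differ.

```r
# Hypothetical helper (not part of the package): sizes of the nested
# variable subsets when a proportion prop.rm is removed at each step.
subset_sizes <- function(no.var, prop.rm = 0.2) {
  sizes <- no.var
  while (no.var > 1) {
    # Assumption: round down, but always remove at least one variable
    no.rm <- max(1, floor(no.var * prop.rm))
    no.var <- no.var - no.rm
    sizes <- c(sizes, no.var)
  }
  sizes
}

sizes <- subset_sizes(no.var = 200, prop.rm = 0.2)
head(sizes, 3)  # 200 160 128
length(sizes)   # number of forests that would be grown under these assumptions
```

Larger values of prop.rm shorten this sequence and therefore reduce the number of forests to fit, at the cost of a coarser search over subset sizes.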
Note: This function differs from the approach implemented in the R package varSelRF because it recalculates importance scores in each step. The tolerance step is based on the pickSizeTolerance function in the R package caret.
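The tolerance rule can be illustrated with a small base R sketch. It is written in the spirit of caret's pickSizeTolerance for an error measure that should be minimized, and is not the package's own code. The helper pick_size_within_tol and the values in runs are made up for illustration; using "less than or equal to" for the comparison is an assumption.

```r
# Hypothetical helper (not part of the package): pick the smallest subset
# size whose error is within tol percent of the lowest error observed.
pick_size_within_tol <- function(runs, tol = 10) {
  best <- min(runs$mse)
  pct.loss <- (runs$mse - best) / best * 100  # percent loss vs. best run
  min(runs$n[pct.loss <= tol])                # assumption: "<=" comparison
}

# Made-up results for five nested subsets (illustrative values only)
runs <- data.frame(
  n   = c(200, 160, 128, 103, 83),
  mse = c(1.00, 0.95, 0.90, 0.97, 1.20)
)

selected.size <- pick_size_within_tol(runs, tol = 10)
selected.size  # 103
```

Here the lowest error occurs with 128 variables, but the subset with 103 variables loses less than 10 percent and is therefore preferred as the smaller model.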
List with the following components:
info
data.frame with information for each variable
included.until.subset = number of smallest subset which contains variable
selected = variable has been selected
var
vector of selected variables
info.runs
data.frame with information for each run
n = number of variables
mse = mean squared error
rsq = R^2
Examples:
# simulate toy data set
data = simulation.data.cor(no.samples = 100, group.size = rep(10, 6), no.var.total = 200)

# select variables
res = var.sel.rfe(x = data[, -1], y = data[, 1], prop.rm = 0.2, recalculate = TRUE)
res$var