var.sel.rfe: Variable selection using recursive feature elimination.

View source: R/variable_selection_rfe.R

var.sel.rfe    R Documentation

Description

Compares random forests based on nested subsets of the variables and selects those variables leading to the forest with the smallest prediction error within a tolerance.

Usage

var.sel.rfe(
  x,
  y,
  prop.rm = 0.2,
  recalculate = TRUE,
  tol = 10,
  ntree = 500,
  mtry.prop = 0.2,
  nodesize.prop = 0.1,
  no.threads = 1,
  method = "ranger",
  type = "regression",
  importance = "impurity_corrected",
  case.weights = NULL
)

Arguments

x

matrix or data.frame of predictor variables with variables in columns and samples in rows (Note: missing values are not allowed).

y

vector with values of phenotype variable (Note: will be converted to factor if classification mode is used).

prop.rm

proportion of variables removed at each step (default: 0.2, the default value used in varSelRF).

recalculate

logical stating if importance should be recalculated at each iteration (default: TRUE)

tol

acceptable loss in performance relative to the optimum, in percent (default: 10). The smallest subset size with a percent loss less than tol is selected.

ntree

number of trees.

mtry.prop

proportion of variables that should be considered as candidates at each split.

nodesize.prop

minimal number of samples in terminal nodes, given as a proportion of the number of samples.

no.threads

number of threads used for parallel execution.

method

implementation to be used ("ranger").

type

mode of prediction ("regression", "classification" or "probability").

importance

Variable importance mode ('none', 'impurity', 'impurity_corrected' or 'permutation'). Default is 'impurity_corrected'.

case.weights

Weights for sampling of training observations. Observations with larger weights will be selected with higher probability in the bootstrap (or subsampled) samples for the trees.
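As a rough illustration of how prop.rm shapes the sequence of nested subsets, the sketch below lists the subset sizes obtained when a fixed proportion of the remaining variables is dropped at each step. The helper subset.sizes is hypothetical and not part of Pomona; the rounding rule and the "remove at least one variable" rule are assumptions, so the sizes used internally by var.sel.rfe may differ.

```r
# Hypothetical helper (not part of Pomona): sizes of the nested subsets
# when a proportion prop.rm of the remaining variables is removed per step.
subset.sizes = function(no.var, prop.rm = 0.2) {
  sizes = no.var
  while (sizes[length(sizes)] > 1) {
    current = sizes[length(sizes)]
    # assumption: round to the nearest integer, remove at least one
    # variable, and always keep at least one variable
    no.rm = min(current - 1, max(1, round(prop.rm * current)))
    sizes = c(sizes, current - no.rm)
  }
  sizes
}

sizes = subset.sizes(no.var = 200, prop.rm = 0.2)
head(sizes, 4)
length(sizes)

# assumption: proportions are converted to counts by rounding down,
# with a minimum of 1 (here for 100 samples and each subset size)
mtry = pmax(1, floor(0.2 * sizes))
nodesize = max(1, floor(0.1 * 100))
```

A larger prop.rm shortens the sequence and therefore the number of forests to be trained, at the price of a coarser search over subset sizes.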

Details

Note: This function differs from the approach implemented in the R package varSelRF because, with recalculate = TRUE (the default), it recalculates importance scores in each step. The tolerance step is based on the pickSizeTolerance function in the R package caret.
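The tolerance step can be sketched in base R as follows. The data frame runs only imitates the layout of info.runs with invented values, and pick.size is a hypothetical helper that follows the idea of caret's pickSizeTolerance for an error measure that should be minimised; whether the boundary at tol is inclusive is an assumption here.

```r
# Invented per-run results in the layout of info.runs (n and mse only).
runs = data.frame(n = c(200, 160, 128, 102, 82),
                  mse = c(1.10, 1.00, 1.05, 1.08, 1.30))

# Hypothetical helper (not part of Pomona or caret): smallest subset size
# whose percent loss relative to the best run does not exceed tol.
pick.size = function(runs, tol = 10) {
  best = min(runs$mse)
  # percent loss of each run compared with the smallest error
  loss = (runs$mse - best) / best * 100
  min(runs$n[loss <= tol])
}

selected.size = pick.size(runs, tol = 10)
selected.size
```

With these values the smallest error is reached with 160 variables, but the run with 102 variables loses only about 8 percent and is therefore preferred; a stricter tol = 1 would return 160.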

Value

List with the following components:

  • info data.frame with information for each variable

    • included.until.subset = number of the smallest subset that contains the variable

    • selected = variable has been selected

  • var vector of selected variables

  • info.runs data.frame with information for each run

    • n = number of variables

    • mse = mean squared error

    • rsq = R^2
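To show how these components relate to each other, the sketch below builds a small mock object by hand and queries it. All values are invented, the variable names v1 to v4 are hypothetical, and the coding of selected as 0/1 is an assumption; an object returned by var.sel.rfe may differ in these details.

```r
# Hand-built mock of the returned list (invented values, not real output).
res = list(
  info = data.frame(
    included.until.subset = c(3, 1, 3, 2),
    selected = c(1, 0, 1, 0),            # assumption: coded as 0/1
    row.names = c("v1", "v2", "v3", "v4")
  ),
  var = c("v1", "v3"),
  info.runs = data.frame(n = c(4, 3, 2),
                         mse = c(0.52, 0.47, 0.49),
                         rsq = c(0.48, 0.53, 0.51))
)

# variables flagged as selected in info; should match res$var
selected.vars = rownames(res$info)[res$info$selected == 1]

# run with the smallest prediction error
best.run = res$info.runs[which.min(res$info.runs$mse), ]
```

In this mock, the run with three variables has the smallest error, while the selected subset is the smaller one with two variables because its error lies within the tolerance.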

Examples

# simulate toy data set
data = simulation.data.cor(no.samples = 100, group.size = rep(10, 6), no.var.total = 200)

# select variables
res = var.sel.rfe(x = data[, -1], y = data[, 1], prop.rm = 0.2, recalculate = TRUE)
res$var


silkeszy/Pomona documentation built on March 31, 2022, 11:13 p.m.