var.select.smd: Variable selection with Surrogate Minimal Depth (SMD) (MAIN...

View source: R/variable_selection_smd.R

var.select.smd    R Documentation

Variable selection with Surrogate Minimal Depth (SMD) (MAIN FUNCTION)

Description

This function performs SMD, using ranger to generate the random forest and a modified version of rpart to find surrogate variables.

Usage

var.select.smd(
  x = NULL,
  y = NULL,
  ntree = 500,
  type = "regression",
  s = NULL,
  mtry = NULL,
  min.node.size = 1,
  num.threads = NULL,
  status = NULL,
  save.ranger = FALSE,
  create.forest = TRUE,
  forest = NULL,
  save.memory = FALSE,
  case.weights = NULL
)

Arguments

x

matrix or data.frame of predictor variables with variables in columns and samples in rows (Note: missing values are not allowed)

y

vector with values of phenotype variable (Note: will be converted to factor if classification mode is used). For survival forests this is the time variable.

ntree

number of trees. Default is 500.

type

mode of prediction ("regression" or "classification"). Default is "regression".

s

predefined number of surrogate splits (it may happen that the actual number of surrogate splits differs in individual nodes). Default is 1 % of no. of variables.

mtry

number of variables to possibly split at in each node. Default is (no. of variables)^(3/4) ("^3/4"), as recommended by Ishwaran (2011). Also possible are "sqrt" and "0.5", which use the square root or half of the no. of variables, respectively.

min.node.size

minimal node size. Default is 1.

num.threads

number of threads used for parallel execution. Default is number of CPUs available.

status

status variable, only applicable to survival data. Use 1 for event and 0 for censoring.

save.ranger

set TRUE if ranger object should be saved. Default is that ranger object is not saved (FALSE).

create.forest

set FALSE if you want to analyze an existing forest. Default is TRUE.

forest

the random forest that should be analyzed if create.forest is set to FALSE. (x and y still have to be given to obtain variable names)

save.memory

Use memory saving (but slower) splitting mode. No effect for survival and GWAS data. Warning: This option slows down the tree growing; use it only if you encounter memory problems. (This parameter is passed on to ranger.)

case.weights

Weights for sampling of training observations. Observations with larger weights will be selected with higher probability in the bootstrap (or subsampled) samples for the trees.
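
The documented defaults for s and mtry depend on the number of predictor variables. The base R sketch below shows one way those settings could translate into numbers. The helpers resolve_mtry and default_s are hypothetical and not part of the package, and the rounding (floor for mtry, ceiling for s) is an assumption that the text above does not specify.

```r
# Hypothetical helper: translate an mtry setting into a number of variables.
# The floor() rounding is an assumption, not taken from the documentation.
resolve_mtry <- function(mtry, n_var) {
  if (is.null(mtry)) mtry <- "^3/4"            # documented default
  if (is.numeric(mtry)) return(as.integer(mtry))
  value <- switch(
    as.character(mtry),
    "^3/4" = n_var^(3 / 4),
    "sqrt" = sqrt(n_var),
    "0.5"  = 0.5 * n_var,
    stop("unknown mtry setting: ", mtry)
  )
  max(1L, as.integer(floor(value)))
}

# Hypothetical helper: default number of surrogate splits (1 % of variables).
# The ceiling() rounding is an assumption.
default_s <- function(n_var) {
  max(1L, as.integer(ceiling(n_var / 100)))
}

n_var <- 200                                   # illustrative number of predictors
mtry_default <- resolve_mtry(NULL, n_var)      # (no. of variables)^(3/4)
mtry_sqrt    <- resolve_mtry("sqrt", n_var)    # square root
mtry_half    <- resolve_mtry("0.5", n_var)     # half of the variables
s_default    <- default_s(n_var)               # 1 % of the variables
```

With few predictors, 1 % of the variables rounds to a single surrogate split, so setting s explicitly (as in the example at the end of this page) may be preferable for small data sets.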

Value

list with the following components:

  • info: list with results from surrmindep function:

    • depth: mean surrogate minimal depth for each variable.

    • selected: whether each variable has been selected (1) or not (0).

    • threshold: the threshold that is used for the selection.

  • var: vector of selected variables.

  • s: list with the results of count.surrogate function:

    • s.a: total average number of surrogate variables.

    • s.l: average number of surrogate variables in the respective layers.

  • forest: a list containing:

    • trees: list of trees that were created by the getTreeranger, addLayer, and addSurrogates functions and that were used for surrogate minimal depth variable importance.

    • allvariables: all variable names of the predictor variables that are present in x.

  • ranger: ranger object.
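
To illustrate how the components listed above fit together, the following sketch works on a mock result list that uses the documented component names. The values are made up, and storing depth and selected as named vectors is an assumption about the layout rather than something stated here.

```r
# Mock result with the documented component names; all numbers are made up
# and the named-vector layout is an assumption.
res <- list(
  info = list(
    depth     = c(X1 = 1.8, X2 = 4.2, X3 = 2.5, X4 = 5.1),
    selected  = c(X1 = 1, X2 = 0, X3 = 1, X4 = 0),
    threshold = 3.0
  ),
  var = c("X1", "X3")
)

# Names of the selected variables, recovered from the 0/1 indicator.
selected_vars <- names(res$info$selected)[res$info$selected == 1]

# Rank all variables by mean surrogate minimal depth
# (a smaller depth means the variable tends to split closer to the root).
ranking <- sort(res$info$depth)

# Combine depth and selection status in one table for reporting.
summary_table <- data.frame(
  variable = names(res$info$depth),
  depth    = unname(res$info$depth),
  selected = unname(res$info$selected) == 1
)
summary_table <- summary_table[order(summary_table$depth), ]
```

In this mock object, selected_vars matches res$var, which is how the var component relates to info$selected.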

References

Examples

# read data
data("SMD_example_data")


# select variables (usually more trees are needed)
set.seed(42)
res = var.select.smd(
  x = SMD_example_data[, 2:ncol(SMD_example_data)],
  y = SMD_example_data[, 1],
  s = 10,
  ntree = 10
)
res$var
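
When working with your own data, it can help to check the documented input requirements before growing a forest. The sketch below uses a synthetic data frame (all names and values are hypothetical) with the outcome in the first column, mirroring the layout of the example above. It uses base R only and does not call the package.

```r
# Synthetic stand-in for a data set whose first column is the phenotype
# (all names and values here are hypothetical).
set.seed(1)
dat <- data.frame(
  y  = rnorm(20),
  X1 = rnorm(20),
  X2 = rnorm(20),
  X3 = rnorm(20)
)

# Split into outcome and predictors, mirroring the example call.
y <- dat[, 1]
x <- dat[, 2:ncol(dat)]

# Check the input requirements before growing a forest.
stopifnot(
  is.matrix(x) || is.data.frame(x),  # x must be a matrix or data.frame
  !any(is.na(x)),                    # missing values are not allowed
  length(y) == nrow(x)               # sanity check: one outcome per sample
)

# In classification mode y is converted to a factor; doing this explicitly
# beforehand makes the class labels visible.
y_class <- factor(ifelse(y > 0, "high", "low"))
```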


StephanSeifert/SurrogateMinimalDepth documentation built on Aug. 7, 2023, 1:59 a.m.