var.select.mir: Variable selection with mutual impurity reduction (MIR)

View source: R/variable_selection_mir.R

var.select.mir    R Documentation

Variable selection with mutual impurity reduction (MIR)

Description

This function performs variable selection with mutual impurity reduction (MIR). It uses ranger to grow the random forest and to compute the actual impurity reduction, and a modified version of rpart to find surrogate variables.

Usage

var.select.mir(
  x = NULL,
  y = NULL,
  ntree = 500,
  type = "regression",
  s = NULL,
  mtry = NULL,
  min.node.size = 1,
  num.threads = NULL,
  status = NULL,
  save.ranger = FALSE,
  save.memory = FALSE,
  min.var.p = 200,
  p.t.sel = 0.01,
  p.t.rel = 0.01,
  select.var = TRUE,
  select.rel = FALSE,
  case.weights = NULL,
  corr.rel = TRUE,
  t = 5,
  method.rel = "janitza",
  method.sel = "janitza",
  num.threads.rel = NULL
)
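
The NULL defaults of s and mtry in the signature above are resolved from the number of predictors, as described under Arguments. The following base R sketch shows one way this resolution could look. The helper resolve_defaults is hypothetical and not part of the package, and the rounding (ceiling for s, floor for mtry) is an assumption that the documentation does not confirm.

```r
# Hypothetical helper (not part of the package): resolve the documented
# defaults for s and mtry from the number of predictors p.
resolve_defaults <- function(p, s = NULL, mtry = NULL) {
  if (is.null(s)) {
    # documented default: 1% of the number of variables
    # (rounding up is an assumption)
    s <- ceiling(0.01 * p)
  }
  if (is.null(mtry)) {
    # documented default option
    mtry <- "^3/4"
  }
  if (is.character(mtry)) {
    # documented string options; rounding down is an assumption
    mtry <- switch(mtry,
      "^3/4" = floor(p^(3 / 4)),
      "sqrt" = floor(sqrt(p)),
      "0.5"  = floor(0.5 * p),
      stop("unknown mtry option: ", mtry)
    )
  }
  list(s = s, mtry = mtry)
}

# defaults for a data set with 250 predictors
resolve_defaults(p = 250)

# square-root rule instead of the default
resolve_defaults(p = 250, mtry = "sqrt")
```

Values passed explicitly, for example s = 10 or a numeric mtry, are returned unchanged by this sketch.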

Arguments

x

matrix or data.frame of predictor variables with variables in columns and samples in rows (Note: missing values are not allowed)

y

vector with values of phenotype variable (Note: will be converted to factor if classification mode is used). For survival forests this is the time variable.

ntree

number of trees. Default is 500.

type

mode of prediction ("regression" or "classification"). Default is "regression".

s

predefined number of surrogate splits (the actual number of surrogate splits may differ in individual nodes). Default is 1% of the number of variables.

mtry

number of variables to possibly split at in each node. Default is (number of variables)^(3/4) (option "^3/4"), as recommended by Ishwaran (2011). Other options are "sqrt" and "0.5", which use the square root or half of the number of variables, respectively.

min.node.size

minimal node size. Default is 1.

num.threads

number of threads used for parallel execution. Default is number of CPUs available.

status

status variable, only applicable to survival data. Use 1 for event and 0 for censoring.

save.ranger

set TRUE if the ranger object should be saved. Default is FALSE (the ranger object is not saved).

save.memory

use memory saving (but slower) splitting mode. No effect for survival and GWAS data. Warning: this option slows down tree growing; use it only if you encounter memory problems. (This parameter is passed on to ranger.)

min.var.p

minimum number of permuted variables used to determine p-value for variable selection of important variables. Default is 200.

p.t.sel

p-value threshold for selection of important variables. Default is 0.01.

p.t.rel

p-value threshold for selection of related variables. Default is 0.01.

select.var

set FALSE if only importance should be calculated and no variables should be selected. Default is TRUE.

select.rel

set FALSE if only relations should be calculated and no variables should be selected. Default is FALSE.

case.weights

Weights for sampling of training observations. Observations with larger weights will be selected with higher probability in the bootstrap (or subsampled) samples for the trees.

corr.rel

set FALSE if non-corrected variable relations should be used for the calculation of MIR. In this case, the method "janitza" should not be used for selection of important variables.

t

variable to calculate threshold for non-corrected relation analysis. Default is 5.

method.rel

Method to compute p-values for selection of related variables with var.relations.corr. Use "janitza" for the method by Janitza et al. (2016) or "permutation" to utilize permuted variables.

method.sel

Method to compute p-values for selection of important variables. Use "janitza" for the method by Janitza et al. (2016) (can only be used when corrected variable relations are utilized) or "permutation" to utilize permuted variables.

num.threads.rel

number of threads used for determination of relations. Default is number of CPUs available. This process can be memory-intensive, so it can be preferable to reduce this number.
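
The arguments min.var.p, p.t.sel and method.sel = "permutation" describe p-values that are determined from permuted variables. The following base R sketch illustrates the general idea of such an empirical p-value: the importances of permuted variables serve as a null distribution against which each observed importance is compared. This is a generic illustration, not the package's internal code; the function name empirical_pvalue, the simulated null importances, and the use of "<=" at the threshold are all assumptions.

```r
# Generic empirical p-value: share of null importances (from permuted
# variables) that are at least as large as the observed importance.
empirical_pvalue <- function(observed, null_importance) {
  vapply(observed,
         function(obs) mean(null_importance >= obs),
         numeric(1))
}

set.seed(1)
# stand-in for the importances of 200 permuted variables (cf. min.var.p)
null_importance <- rnorm(200)

# made-up observed importances for two variables
observed <- c(noise = 0.1, signal = 4.5)

p <- empirical_pvalue(observed, null_importance)

# keep variables whose p-value does not exceed the threshold (cf. p.t.sel)
selected <- names(p)[p <= 0.01]
selected
```

With more permuted variables the null distribution becomes finer, which allows smaller p-values to be resolved; this is one reason a minimum number of permuted variables is useful.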

Value

list with the following components:

  • info: list with results containing:

    • MIR: the calculated variable importance for each variable based on mutual impurity reduction.

    • pvalue: the obtained p-values for each variable.

    • selected: whether each variable has been selected (1) or not (0).

    • relations: a list containing the results of variable relation analysis.

    • parameters: a list that contains the parameters s, type, mtry, p.t.sel, p.t.rel and method.sel that were used.

  • var: vector of selected variables.

  • ranger: ranger object (saved if save.ranger = TRUE).
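
The components listed above can be combined into a single table for inspection. The sketch below uses a mock object with made-up values in place of a real result, so that it runs without the package. It assumes that MIR, pvalue and selected are vectors with one entry per variable in the same order, which the documentation suggests but does not state explicitly.

```r
# Mock object mirroring the documented structure; all values are made up.
vars <- c("X1", "X2", "X3", "X4")
res <- list(
  info = list(
    MIR      = setNames(c(0.80, 0.02, 0.35, -0.01), vars),
    pvalue   = setNames(c(0.000, 0.400, 0.005, 0.700), vars),
    selected = setNames(c(1, 0, 1, 0), vars)
  ),
  var = c("X1", "X3")
)

# Tabulate the per-variable results.
tab <- data.frame(
  variable  = vars,
  MIR       = res$info$MIR,
  pvalue    = res$info$pvalue,
  selected  = res$info$selected == 1,
  row.names = NULL
)

# Sort so that the most important variables come first.
tab <- tab[order(tab$MIR, decreasing = TRUE), ]
tab

# The selected flags and the var component should agree.
stopifnot(setequal(tab$variable[tab$selected], res$var))
```

With a real result, the same table can be built from res$info after calling var.select.mir.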

Examples

# load the package and read the example data
library(SurrogateMinimalDepth)
data("SMD_example_data")

# select variables (usually more trees are needed)
set.seed(42)
res <- var.select.mir(
  x = SMD_example_data[, 2:ncol(SMD_example_data)],
  y = SMD_example_data[, 1],
  s = 10,
  ntree = 10
)
res$var


StephanSeifert/SurrogateMinimalDepth documentation built on Aug. 7, 2023, 1:59 a.m.