View source: R/variable_selection_mir.R
var.select.mir | R Documentation |
This function performs variable selection based on mutual impurity reduction (MIR). It uses ranger to grow the random forest and to compute the actual impurity reduction, and a modified version of rpart to find surrogate variables.
var.select.mir(
x = NULL,
y = NULL,
ntree = 500,
type = "regression",
s = NULL,
mtry = NULL,
min.node.size = 1,
num.threads = NULL,
status = NULL,
save.ranger = FALSE,
save.memory = FALSE,
min.var.p = 200,
p.t.sel = 0.01,
p.t.rel = 0.01,
select.var = TRUE,
select.rel = FALSE,
case.weights = NULL,
corr.rel = TRUE,
t = 5,
method.rel = "janitza",
method.sel = "janitza",
num.threads.rel = NULL
)
x |
matrix or data.frame of predictor variables with variables in columns and samples in rows (Note: missing values are not allowed) |
y |
vector with values of phenotype variable (Note: will be converted to factor if classification mode is used). For survival forests this is the time variable. |
ntree |
number of trees. Default is 500. |
type |
mode of prediction ("regression" or "classification"). Default is "regression". |
s |
predefined number of surrogate splits (it may happen that the actual number of surrogate splits differs in individual nodes). Default is 1 % of no. of variables. |
mtry |
number of variables to possibly split at in each node. Default is no. of variables^(3/4) ("^3/4"), as recommended by Ishwaran (2011). Alternatively, use "sqrt" or "0.5" for the square root or half of the no. of variables. |
min.node.size |
minimal node size. Default is 1. |
num.threads |
number of threads used for parallel execution. Default is number of CPUs available. |
status |
status variable, only applicable to survival data. Use 1 for event and 0 for censoring. |
save.ranger |
set TRUE if the ranger object should be saved. Default is FALSE (the ranger object is not saved). |
save.memory |
Use memory saving (but slower) splitting mode. No effect for survival and GWAS data. Warning: This option slows down the tree growing; use it only if you encounter memory problems. (This parameter is passed on to ranger.) |
min.var.p |
minimum number of permuted variables used to determine p-value for variable selection of important variables. Default is 200. |
p.t.sel |
p.value threshold for selection of important variables. Default is 0.01. |
p.t.rel |
p.value threshold for selection of related variables. Default is 0.01. |
select.var |
set FALSE if only importance should be calculated and no variables should be selected. Default is TRUE. |
select.rel |
set FALSE if only relations should be calculated and no variables should be selected. Default is FALSE. |
case.weights |
Weights for sampling of training observations. Observations with larger weights will be selected with higher probability in the bootstrap (or subsampled) samples for the trees. |
corr.rel |
set FALSE if non-corrected variable relations should be used for the calculation of MIR. In this case, the method "janitza" should not be used for the selection of important variables. Default is TRUE. |
t |
variable to calculate threshold for non-corrected relation analysis. Default is 5. |
method.rel |
Method to compute p-values for selection of related variables with var.relations.corr. Use "janitza" for the method by Janitza et al. (2016) or "permutation" to utilize permuted variables. |
method.sel |
Method to compute p-values for selection of important variables. Use "janitza" for the method by Janitza et al. (2016) (can only be used when corrected variable relations are utilized) or "permutation" to utilize permuted variables. |
num.threads.rel |
number of threads used for determination of relations. Default is number of CPUs available. (This process can be memory-intensive, so it can be preferable to reduce this number.) |
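The documented defaults for s and mtry depend on the number of predictor variables. The following is a minimal sketch of that arithmetic, not the package's internal code: the variable p is a hypothetical number of predictors, and the rounding (ceiling for s, floor for mtry) is an assumption, since the documentation does not state how non-integer values are handled.

```r
# Sketch of the documented defaults; the rounding rules are assumptions.
# p is a hypothetical number of predictor variables (columns of x).
p <- 200

# s: 1 % of the number of variables (rounding up is an assumption)
s_default <- ceiling(p / 100)

# mtry options described above (flooring to an integer is an assumption):
mtry_default <- floor(p^(3/4))   # "^3/4", the documented default
mtry_sqrt    <- floor(sqrt(p))   # "sqrt"
mtry_half    <- floor(0.5 * p)   # "0.5"

c(s = s_default, mtry = mtry_default, sqrt = mtry_sqrt, half = mtry_half)
```

For data sets with many variables, the "^3/4" setting considers far more candidate variables per split than "sqrt", which increases run time per tree.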
list with the following components:
info: list with results containing:
MIR: the calculated variable importance for each variable based on mutual impurity reduction.
pvalue: the obtained p-values for each variable.
selected: whether each variable has been selected (1) or not (0).
relations: a list containing the results of variable relation analysis.
parameters: a list that contains the parameters s, type, mtry, p.t.sel, p.t.rel and method.sel that were used.
var: vector of selected variables.
ranger: ranger object.
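To illustrate how the components listed above fit together, the sketch below builds a mock object with the documented structure and summarises it in one table. All values and the variable names X1 to X3 are made up, and it is an assumption that MIR, pvalue and selected are vectors named by variable; a real result from var.select.mir may differ in detail.

```r
# Mock object mirroring the documented return structure.
# All values and variable names are hypothetical.
res <- list(
  info = list(
    MIR = c(X1 = 0.80, X2 = 0.05, X3 = 0.40),
    pvalue = c(X1 = 0.001, X2 = 0.600, X3 = 0.004),
    selected = c(X1 = 1, X2 = 0, X3 = 1),
    relations = list(),
    parameters = list(s = 10, type = "regression", mtry = 53,
                      p.t.sel = 0.01, p.t.rel = 0.01, method.sel = "janitza")
  ),
  var = c("X1", "X3")
)

# Combine the per-variable results into one table, sorted by importance
tab <- data.frame(
  variable = names(res$info$MIR),
  MIR = unname(res$info$MIR),
  pvalue = unname(res$info$pvalue),
  selected = unname(res$info$selected) == 1,
  stringsAsFactors = FALSE
)
tab <- tab[order(tab$MIR, decreasing = TRUE), ]
tab

# Variables flagged in info$selected should match the var component
setdiff(tab$variable[tab$selected], res$var)
```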
Nembrini, S. et al. (2018) The revival of the Gini importance? Bioinformatics, 34, 3711–3718. https://academic.oup.com/bioinformatics/article/34/21/3711/4994791
Seifert, S. et al. (2019) Surrogate minimal depth as an importance measure for variables in random forests. Bioinformatics, 35, 3663–3671. https://academic.oup.com/bioinformatics/article/35/19/3663/5368013
# read data
data("SMD_example_data")
# select variables (usually more trees are needed)
set.seed(42)
res = var.select.mir(x = SMD_example_data[, 2:ncol(SMD_example_data)], y = SMD_example_data[, 1], s = 10, ntree = 10)
res$var