View source: R/var.relations.R
var.relations | R Documentation |
This function uses the mean adjusted agreement to select variables that are related to a defined variable, based on a threshold T. The parameter t is used to calculate T: t = 1 means that every variable with a higher probability than "by chance" is identified as "important"; t = 2 means the probability has to be twice as high, and so on. Based on the threshold, a vector containing the related variables is created.
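The thresholding idea described above can be sketched in a few lines of base R. This is an illustration only, not the package's implementation: the helper `select_related()`, the chance-level value `p_chance`, and the agreement values are hypothetical, and the way the package derives the chance level is not shown here.

```r
# Hypothetical helper: keep the candidates whose mean adjusted agreement
# exceeds T = t * p_chance (assumed form of the threshold, following the
# description: t = 1 is "by chance", t = 2 is twice that, and so on).
select_related <- function(agreement, p_chance, t = 5) {
  threshold <- t * p_chance
  names(agreement)[!is.na(agreement) & agreement > threshold]
}

# Made-up mean adjusted agreement values for three candidates
agreement <- c(X2 = 0.30, X3 = 0.04, X4 = 0.12)

related_t1 <- select_related(agreement, p_chance = 0.05, t = 1)  # T = 0.05
related_t5 <- select_related(agreement, p_chance = 0.05, t = 5)  # T = 0.25
```

A larger t raises T, so fewer candidates are reported as related.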
var.relations(
x = NULL,
y = NULL,
ntree = 500,
type = "regression",
s = NULL,
mtry = NULL,
min.node.size = 1,
num.threads = NULL,
status = NULL,
save.ranger = FALSE,
create.forest = TRUE,
forest = NULL,
save.memory = FALSE,
case.weights = NULL,
variables,
candidates,
t = 5,
select.rel = TRUE,
num.threads.rel = NULL
)
x |
matrix or data.frame of predictor variables with variables in columns and samples in rows (Note: missing values are not allowed) |
y |
vector with values of phenotype variable (Note: will be converted to factor if classification mode is used). For survival forests this is the time variable. |
ntree |
number of trees. Default is 500. |
type |
mode of prediction ("regression" or "classification"). Default is "regression". |
s |
predefined number of surrogate splits (the actual number of surrogate splits may differ in individual nodes). Default is 1% of the number of variables. |
mtry |
number of variables to possibly split at in each node. Default is the number of variables^(3/4) ("^3/4"), as recommended by Ishwaran (2011). "sqrt" and "0.5" are also possible and use the square root or half of the number of variables, respectively. |
min.node.size |
minimal node size. Default is 1. |
num.threads |
number of threads used for parallel execution. Default is number of CPUs available. |
status |
status variable, only applicable to survival data. Use 1 for event and 0 for censoring. |
save.ranger |
set TRUE if ranger object should be saved. Default is that ranger object is not saved (FALSE). |
create.forest |
set FALSE if you want to analyze an existing forest. Default is TRUE. |
forest |
the random forest that should be analyzed if create.forest is set to FALSE. (x and y still have to be given to obtain variable names) |
save.memory |
Use memory-saving (but slower) splitting mode. No effect for survival and GWAS data. Warning: This option slows down the tree growing; use it only if you encounter memory problems. (This parameter is passed on to ranger.) |
case.weights |
Weights for sampling of training observations. Observations with larger weights will be selected with higher probability in the bootstrap (or subsampled) samples for the trees. |
variables |
variable names (strings) for which related variables should be searched (have to be contained in allvariables) |
candidates |
vector of variable names (strings) that are candidates to be related to the variables (have to be contained in allvariables) |
t |
parameter used to calculate the threshold T. Default is 5. |
select.rel |
set FALSE if only relations should be calculated and no related variables should be selected. |
num.threads.rel |
number of threads used for the determination of relations. Default is number of CPUs available. (This process can be memory-intensive, so it can be preferable to reduce this number.) |
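The documented defaults for s and mtry are simple functions of the number of variables, and can be worked through as below. This is a minimal sketch, not the package's code: the helper `default_params()` is hypothetical, and the rounding (ceiling for s, floor for mtry) is an assumption, since the text above does not state how non-integer values are handled.

```r
# Hypothetical helper illustrating the documented defaults for p variables.
# Rounding direction is an assumption, not taken from the package.
default_params <- function(p, mtry = "^3/4") {
  s <- ceiling(0.01 * p)  # 1% of the number of variables
  mtry_value <- switch(mtry,
    "^3/4" = floor(p^(3 / 4)),  # default option
    "sqrt" = floor(sqrt(p)),    # square root of the number of variables
    "0.5"  = floor(0.5 * p),    # half of the number of variables
    stop("unknown mtry option: ", mtry)
  )
  list(s = s, mtry = mtry_value)
}

defaults_34   <- default_params(100)
defaults_sqrt <- default_params(100, mtry = "sqrt")
defaults_half <- default_params(100, mtry = "0.5")
```

For data sets with few variables, the 1% default for s becomes very small, which is one reason to set s explicitly, as the example call in this page does.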
a list containing:
variables: the variables to which relations are investigated.
surr.res: a matrix with mean adjusted agreement values with variables in rows and candidates in columns.
threshold: the threshold used to select related variables.
var: a list with one vector for each variable containing related variables.
ranger: ranger object.
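To inspect the surr.res matrix beyond the thresholded selection, the candidates can be ranked per variable. The sketch below works on a mock list that only mimics the documented structure; the values, the NA entries for a variable paired with itself, and the helper `top_candidates()` are all assumptions made for illustration.

```r
# Mock object mimicking the documented return structure (made-up values)
res_mock <- list(
  variables = c("X1", "X7"),
  surr.res = matrix(
    c(NA,   0.40, 0.02, 0.10,
      0.05, 0.01, 0.35, NA),
    nrow = 2, byrow = TRUE,
    dimnames = list(c("X1", "X7"), c("X1", "X2", "X3", "X7"))
  ),
  threshold = 0.25
)

# Hypothetical helper: top-k candidates per variable, highest agreement first
top_candidates <- function(res, k = 2) {
  lapply(
    setNames(res$variables, res$variables),
    function(v) {
      values <- sort(res$surr.res[v, ], decreasing = TRUE)  # sort() drops NA
      head(values, k)
    }
  )
}

top <- top_candidates(res_mock, k = 2)
```

Comparing the ranked values with the threshold element shows how far each candidate lies from being selected.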
# read data
data("SMD_example_data")
x = SMD_example_data[,2:ncol(SMD_example_data)]
y = SMD_example_data[,1]
# calculate variable relations
set.seed(42)
res = var.relations(
  x = x, y = y, s = 10, ntree = 100,
  variables = c("X1", "X7"),
  candidates = colnames(x)[1:100],
  t = 5
)
res$var