logicDT | R Documentation
Main function for fitting logicDT models.
## Default S3 method:
logicDT(X, y, max_vars = 3, max_conj = 3, Z = NULL,
  search_algo = "sa", cooling_schedule = cooling.schedule(),
  scoring_rule = "auc", tree_control = tree.control(),
  gamma = 0, simplify = "vars", val_method = "none",
  val_frac = 0.5, val_reps = 10, allow_conj_removal = TRUE,
  conjsize = 1, randomize_greedy = FALSE, greedy_mod = TRUE,
  greedy_rem = FALSE, max_gen = 10000, gp_sigma = 0.15,
  gp_fs_interval = 1, ...)

## S3 method for class 'formula'
logicDT(formula, data, ...)
X: Matrix or data frame of binary predictors coded as 0 or 1.
y: Response vector. 0-1 coding for binary responses. Otherwise, a regression task is assumed.
max_vars: Maximum total number of variables contained in the set of conjunctions. For the set [X_1 ∧ X_2^c, X_1 ∧ X_3], this parameter is equal to 4.
max_conj: Maximum number of input variables (conjunctions) for the decision trees. For the set [X_1 ∧ X_2^c, X_1 ∧ X_3], this parameter is equal to 2.
Z: Optional matrix or data frame of quantitative/continuous covariables. Multiple covariables are allowed for splitting the trees. If four-parameter logistic models shall be fitted in the leaves, only the first covariable given is used.
search_algo: Search algorithm for guiding the global search. This can be "sa" for simulated annealing, "greedy" for a greedy search, or "gp" for genetic programming.
cooling_schedule: Cooling schedule parameters if simulated annealing is used. The required object should be created via the function cooling.schedule().
scoring_rule: Scoring rule for guiding the global search. This can be "auc" for the area under the receiver operating characteristic curve (default for binary responses), "deviance" for the deviance, "nce" for the normalized cross entropy, or "brier" for the Brier score. For regression tasks, the MSE (mean squared error) is automatically chosen.
tree_control: Parameters controlling the fitting of the decision trees. This should be configured via the function tree.control().
gamma: Complexity penalty added to the score. If gamma > 0 is given, gamma * ||m||_0 is added to the score, with ||m||_0 being the total number of variables contained in the current model m. The main purpose of this penalty is fitting logicDT stumps in conjunction with boosting. For regular or bagged logicDT models, the model complexity parameters max_vars and max_conj should be tuned instead.
simplify: Should the final fitted model be simplified? simplify = "conj" removes unnecessary terms as a whole if they cannot improve the score. simplify = "vars" additionally tries to prune individual conjunctions by removing unnecessary variables from them. simplify = "none" leaves the final model unmodified.
val_method: Inner validation method. "rv" leads to a repeated validation, where the original data set is divided val_reps times into val_frac * 100% validation data and (1 - val_frac) * 100% training data. "bootstrap" draws bootstrap samples and uses the out-of-bag data as validation data. "cv" employs cross-validation with val_reps folds.
val_frac: Only used if val_method = "rv". See the description of val_method.
val_reps: Number of inner validation partitionings.
allow_conj_removal: Should the removal of complete terms/conjunctions be allowed during the search? If an exact number of terms is desired, this should be set to FALSE. If extensive hyperparameter optimizations are feasible, allow_conj_removal = FALSE with a proper search over max_vars and max_conj is advised for fitting single models. For bagging or boosting with a greedy search, allow_conj_removal = TRUE together with a small value for max_vars = max_conj is recommended, e.g., 2 or 3.
conjsize: The minimum number of training samples that have to belong to a conjunction. This parameter prevents the inclusion of unnecessarily complex conjunctions that rarely occur.
randomize_greedy: Should the greedy search be randomized by considering only sqrt(number of neighbor states) randomly drawn neighbors at each iteration, similar to random forests? This speeds up the greedy search but can lead to inferior results.
greedy_mod: Should modifications of conjunctions be considered in a greedy search? Disabling this speeds up the greedy search but can lead to inferior results.
greedy_rem: Should the removal of conjunctions be considered in a greedy search? Disabling this speeds up the greedy search but can lead to inferior results.
max_gen: Maximum number of generations for genetic programming.
gp_sigma: Parameter σ for fitness sharing in genetic programming. Very small values (e.g., 0.001) are recommended, leading to penalizing only models that yield exactly the same score.
gp_fs_interval: Interval for performing fitness sharing in genetic programming. The fitness calculation can be computationally expensive if many models exist in one generation. gp_fs_interval = 10 leads to performing fitness sharing only every 10th generation.
...: Arguments passed on to the default method.
formula: An object of type formula describing the model to be fitted.
data: A data frame containing the data for the corresponding formula.
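The gamma penalty described above simply adds a term proportional to the model size to the raw score. A minimal sketch of this idea (the helper penalized_score is hypothetical and not part of the package):

```r
# Hypothetical helper illustrating the complexity penalty:
# gamma * ||m||_0 is added to the raw score, where ||m||_0 is the
# total number of variables in the current model m.
penalized_score <- function(score, gamma, n_vars) {
  score + gamma * n_vars
}

# A model with 3 variables, raw score 0.25, and gamma = 0.01:
penalized_score(0.25, gamma = 0.01, n_vars = 3)  # 0.28
```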
logicDT is a method for finding response-associated interactions between binary predictors. A global search for the best set of predictors and of interactions between predictors is performed, trying to find the globally optimal decision trees. On the one hand, this can be seen as variable selection. On the other hand, Boolean conjunctions between binary predictors can be identified as impactful, which is particularly useful if the corresponding marginal effects are negligible due to the greedy fashion in which decision trees choose their splits.
Three search algorithms are implemented:
Simulated annealing. An exhaustive stochastic optimization procedure. Recommended for single models (without [outer] bagging or boosting).
Greedy search. A very fast search always looking for the best possible improvement. Recommended for ensemble models.
Genetic programming. A more or less intensive search holding several competitive models at each generation. A niche method which is only recommended if multiple (simple) models can explain the variation in the response.
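For intuition, the simulated annealing option is based on the classical Metropolis acceptance rule. The sketch below shows the textbook criterion assuming scores are minimized, not the package's internal implementation:

```r
# Textbook simulated annealing acceptance rule (assumption: smaller
# scores are better): improvements are always accepted, worse states
# only with probability exp(-delta / temperature).
accept_move <- function(delta_score, temperature) {
  if (delta_score <= 0) return(TRUE)
  runif(1) < exp(-delta_score / temperature)
}

accept_move(-0.05, temperature = 1)  # TRUE: an improvement is always taken
```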
Furthermore, the option of a so-called "inner validation" is available. Here, the search is guided using several train-validation splits and the average of the validation performance. This approach is computationally expensive but can lead to more robust single models.
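A single "rv" partitioning can be pictured as follows (a sketch of the assumed splitting behavior, which the package repeats val_reps times):

```r
# Sketch of one "rv" train/validation split: val_frac * 100% of the
# samples form the validation set, the remainder the training set.
set.seed(1)
n <- 100
val_frac <- 0.5
val_idx <- sample(n, floor(val_frac * n))
train_idx <- setdiff(seq_len(n), val_idx)

length(train_idx)  # 50 training samples
length(val_idx)    # 50 validation samples
```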
For minimizing the computation time, two-dimensional hash tables are used that store already evaluated models. This is irrelevant for the greedy search but can heavily improve the fitting times when employing a search with simulated annealing or genetic programming, especially when choosing an inner validation.
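The memoization idea can be sketched with an R environment used as a hash map (only an illustration; the package uses internal two-dimensional hash tables):

```r
# Caching model scores so that identical configurations are only
# evaluated once. An environment with hash = TRUE acts as a hash map.
score_cache <- new.env(hash = TRUE)
evals <- 0
eval_model <- function(key) {       # stand-in for an expensive tree fit
  evals <<- evals + 1
  nchar(key)
}
get_score <- function(key) {
  if (is.null(score_cache[[key]])) score_cache[[key]] <- eval_model(key)
  score_cache[[key]]
}

get_score("X1 & !X2")  # fitted and cached
get_score("X1 & !X2")  # retrieved from the cache
evals                  # 1: the second fit was prevented
```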
An object of class logicDT. This is a list containing

- A matrix of the identified set of predictors and conjunctions of predictors. Each entry corresponds to the column index in X. Negative values indicate negations. Missing values mean that the term does not contain any further variables.
- A human-readable form of this matrix.
- The score of the best model. Smaller values are preferred.
- The decision tree fitted on the best set of input terms. This is a list containing the pointer to the C representation of the tree and R representations of the tree structure such as the splits and predictions.
- A list of decision trees. Only relevant if inner validation was used.
- The total number of search iterations, i.e., tested configurations, each requiring fitting a tree (ensemble) and evaluating it.
- The number of tree fittings prevented by using the two-dimensional hash table.
- The supplied parameters of the function call.
logicDT models can be saved and loaded using save(...) and load(...). The internal C structures will not be saved but are rebuilt from the R representations if necessary.
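For example (a plain list stands in for a fitted logicDT object here; the component names are illustrative):

```r
# Persisting a fitted model with base R's save()/load().
model <- list(score = 0.25, disjs = "X1 & !X2")  # stand-in object
f <- tempfile(fileext = ".RData")
save(model, file = f)
rm(model)
load(f)        # restores the object 'model' into the workspace
model$score    # 0.25
```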
Lau, M., Schikowski, T. & Schwender, H. (2021). logicDT: A Procedure for Identifying Response-Associated Interactions Between Binary Predictors. To be submitted.
Breiman, L., Friedman, J., Stone, C. J. & Olshen, R. A. (1984). Classification and Regression Trees. CRC Press. doi: 10.1201/9781315139470
Kirkpatrick, S., Gelatt, C. D. & Vecchi, M. P. (1983). Optimization by Simulated Annealing. Science 220(4598):671–680. doi: 10.1126/science.220.4598.671
# Generate toy data
set.seed(123)
maf <- 0.25
n.snps <- 50
N <- 2000
X <- matrix(sample(0:2, n.snps * N, replace = TRUE,
                   prob = c((1-maf)^2, 1-(1-maf)^2-maf^2, maf^2)),
            ncol = n.snps)
colnames(X) <- paste("SNP", 1:n.snps, sep="")
X <- splitSNPs(X)

Z <- matrix(rnorm(N, 20, 10), ncol = 1)
colnames(Z) <- "E"
Z[Z < 0] <- 0

y <- -0.75 + log(2) * (X[,"SNP1D"] != 0) +
  log(4) * Z/20 * (X[,"SNP2D"] != 0 & X[,"SNP3D"] == 0) +
  rnorm(N, 0, 1)

# Fit and evaluate single logicDT model
model <- logicDT(X[1:(N/2),], y[1:(N/2)],
                 Z = Z[1:(N/2),,drop=FALSE],
                 max_vars = 3, max_conj = 2,
                 search_algo = "sa",
                 tree_control = tree.control(
                   nodesize = floor(0.05 * nrow(X)/2)),
                 simplify = "vars",
                 allow_conj_removal = FALSE,
                 conjsize = floor(0.05 * nrow(X)/2))
calcNRMSE(predict(model, X[(N/2+1):N,],
                  Z = Z[(N/2+1):N,,drop=FALSE]), y[(N/2+1):N])
plot(model)
print(model)

# Fit and evaluate bagged logicDT model
model.bagged <- logicDT.bagging(X[1:(N/2),], y[1:(N/2)],
                                Z = Z[1:(N/2),,drop=FALSE],
                                bagging.iter = 50,
                                max_vars = 3, max_conj = 3,
                                search_algo = "greedy",
                                tree_control = tree.control(
                                  nodesize = floor(0.05 * nrow(X)/2)),
                                simplify = "vars",
                                conjsize = floor(0.05 * nrow(X)/2))
calcNRMSE(predict(model.bagged, X[(N/2+1):N,],
                  Z = Z[(N/2+1):N,,drop=FALSE]), y[(N/2+1):N])
print(model.bagged)

# Fit and evaluate boosted logicDT model
model.boosted <- logicDT.boosting(X[1:(N/2),], y[1:(N/2)],
                                  Z = Z[1:(N/2),,drop=FALSE],
                                  boosting.iter = 50,
                                  learning.rate = 0.01,
                                  subsample.frac = 0.75,
                                  replace = FALSE,
                                  max_vars = 3, max_conj = 3,
                                  search_algo = "greedy",
                                  tree_control = tree.control(
                                    nodesize = floor(0.05 * nrow(X)/2)),
                                  simplify = "vars",
                                  conjsize = floor(0.05 * nrow(X)/2))
calcNRMSE(predict(model.boosted, X[(N/2+1):N,],
                  Z = Z[(N/2+1):N,,drop=FALSE]), y[(N/2+1):N])
print(model.boosted)

# Calculate VIMs (variable importance measures)
vims <- vim(model.bagged)
plot(vims)
print(vims)

# Single greedy model
model <- logicDT(X[1:(N/2),], y[1:(N/2)],
                 Z = Z[1:(N/2),,drop=FALSE],
                 max_vars = 3, max_conj = 2,
                 search_algo = "greedy",
                 tree_control = tree.control(
                   nodesize = floor(0.05 * nrow(X)/2)),
                 simplify = "vars",
                 allow_conj_removal = FALSE,
                 conjsize = floor(0.05 * nrow(X)/2))
calcNRMSE(predict(model, X[(N/2+1):N,],
                  Z = Z[(N/2+1):N,,drop=FALSE]), y[(N/2+1):N])
plot(model)
print(model)