logicDT: Fitting logic decision trees

View source: R/logicDT.R

logicDT R Documentation

Fitting logic decision trees

Description

Main function for fitting logicDT models.

Usage

## Default S3 method:
logicDT(
  X,
  y,
  max_vars = 3,
  max_conj = 3,
  Z = NULL,
  search_algo = "sa",
  cooling_schedule = cooling.schedule(),
  scoring_rule = "auc",
  tree_control = tree.control(),
  gamma = 0,
  simplify = "vars",
  val_method = "none",
  val_frac = 0.5,
  val_reps = 10,
  allow_conj_removal = TRUE,
  conjsize = 1,
  randomize_greedy = FALSE,
  greedy_mod = TRUE,
  greedy_rem = FALSE,
  max_gen = 10000,
  gp_sigma = 0.15,
  gp_fs_interval = 1,
  ...
)

## S3 method for class 'formula'
logicDT(formula, data, ...)

Arguments

X

Matrix or data frame of binary predictors coded as 0 or 1.

y

Response vector. 0-1 coding for binary responses. Otherwise, a regression task is assumed.

max_vars

Maximum total number of variables contained in the set of conjunctions, counting variables that occur in several conjunctions multiple times. For the set [X_1 ∧ X_2^c, X_1 ∧ X_3], this parameter is equal to 4 (see the counting sketch after this argument list).

max_conj

Maximum number of conjunctions, i.e., of input terms for the decision trees. For the set [X_1 ∧ X_2^c, X_1 ∧ X_3], this parameter is equal to 2.

Z

Optional matrix or data frame of quantitative/continuous covariables. Multiple covariables are allowed for splitting the trees. If four-parameter logistic models shall be fitted in the leaves, only the first covariable given is used.

search_algo

Search algorithm for guiding the global search. This can be "sa" for simulated annealing, "greedy" for a greedy search, or "gp" for genetic programming.

cooling_schedule

Cooling schedule parameters if simulated annealing is used. The required object should be created via the function cooling.schedule.

scoring_rule

Scoring rule for guiding the global search. This can be "auc" for the area under the receiver operating characteristic curve (default for binary responses), "deviance" for the deviance, "nce" for the normalized cross entropy, or "brier" for the Brier score. For regression purposes, the MSE (mean squared error) is chosen automatically.

tree_control

Parameters controlling the fitting of decision trees. This should be configured via the function tree.control.

gamma

Complexity penalty added to the score. If gamma > 0 is given, gamma * ||m||_0 is added to the score, where ||m||_0 is the total number of variables contained in the current model m. The main purpose of this penalty is fitting logicDT stumps in conjunction with boosting. For regular or bagged logicDT models, the model complexity parameters max_vars and max_conj should be tuned instead.

simplify

Should the final fitted model be simplified? simplify = "conj" removes unnecessary terms as a whole if they cannot improve the score. simplify = "vars" additionally tries to prune individual conjunctions by removing unnecessary variables from them. simplify = "none" does not modify the final model.

val_method

Inner validation method. "rv" leads to a repeated validation in which the original data set is divided val_reps times into val_frac * 100% validation data and (1 - val_frac) * 100% training data. "bootstrap" draws bootstrap samples and uses the out-of-bag data as validation data. "cv" employs cross-validation with val_reps folds.

val_frac

Only used if val_method = "rv". See description of val_method.

val_reps

Number of inner validation partitions.

allow_conj_removal

Should the removal of complete terms/conjunctions be allowed during the search? If an exact number of terms is desired, this should be set to FALSE. If extensive hyperparameter optimizations are feasible, allow_conj_removal = FALSE with a proper search over max_vars and max_conj is advised for fitting single models. For bagging or boosting with a greedy search, allow_conj_removal = TRUE together with a small value of max_vars = max_conj, e.g., 2 or 3, is recommended.

conjsize

The minimum number of training samples that have to belong to a conjunction. This parameter prevents the inclusion of unnecessarily complex conjunctions that rarely occur.

randomize_greedy

Should the greedy search be randomized by considering only √(number of neighbor states) neighbors at each iteration, similar to random forests? This speeds up the greedy search but can lead to inferior results.

greedy_mod

Should modifications of conjunctions be considered in a greedy search? Setting greedy_mod = FALSE speeds up the greedy search but can lead to inferior results.

greedy_rem

Should the removal of conjunctions be considered in a greedy search? Keeping greedy_rem = FALSE (the default) speeds up the greedy search but can lead to inferior results.

max_gen

Maximum number of generations for genetic programming.

gp_sigma

Parameter σ for fitness sharing in genetic programming. Very small values (e.g., 0.001) are recommended, such that only models that yield exactly the same score are penalized.

gp_fs_interval

Interval for fitness sharing in genetic programming. The fitness calculation can be computationally expensive if many models exist in one generation. gp_fs_interval = 10 leads to performing fitness sharing only every 10th generation.

...

Arguments passed to logicDT.default

formula

An object of class formula describing the model to be fitted.

data

A data frame containing the data for the corresponding formula object. Must also contain quantitative covariables if they should be included as well.
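
To make the counting behind max_vars and max_conj concrete (see their descriptions above), here is a minimal sketch; the index encoding mirrors the disj format described under Value, where negative values denote negated variables, and the object names are made up for illustration:

# Conjunction set [X_1 AND (NOT X_2), X_1 AND X_3] in index form;
# negative entries denote negated variables (cf. the disj return value)
conjs <- list(c(1, -2), c(1, 3))
length(conjs)        # 2 conjunctions, so max_conj >= 2 is required
sum(lengths(conjs))  # 4 variables in total, so max_vars >= 4 is required

# With gamma > 0, gamma * ||m||_0 is added to the score (cf. gamma above):
gamma <- 0.01
gamma * sum(lengths(conjs))  # complexity penalty of 0.04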

Details

logicDT is a method for identifying response-associated interactions between binary predictors. A global search for the best set of predictors and interactions between predictors is performed, trying to find the globally optimal decision trees. On the one hand, this can be seen as variable selection. On the other hand, Boolean conjunctions between binary predictors can be identified as impactful, which is particularly useful if the corresponding marginal effects are negligible and the interactions would therefore be missed by the greedy fashion in which conventional decision trees choose their splits.

Three search algorithms are implemented:

  • Simulated annealing. An exhaustive stochastic optimization procedure. Recommended for single models (without [outer] bagging or boosting).

  • Greedy search. A very fast search always looking for the best possible improvement. Recommended for ensemble models.

  • Genetic programming. A more or less intensive search maintaining several competitive models at each generation. A niche method that is only recommended if multiple (simple) models explain the variation in the response.
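
As a rough sketch (assuming a binary predictor matrix X and a response vector y are already available; the object names are made up), the algorithm is chosen via search_algo, with algorithm-specific settings passed alongside:

# Simulated annealing with the default cooling schedule
m.sa <- logicDT(X, y, search_algo = "sa",
                cooling_schedule = cooling.schedule())
# Greedy search, e.g., for use within bagging or boosting
m.greedy <- logicDT(X, y, search_algo = "greedy")
# Genetic programming with a reduced number of generations
m.gp <- logicDT(X, y, search_algo = "gp", max_gen = 1000)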

Furthermore, the option of a so-called "inner validation" is available. Here, the search is guided using several training-validation splits and the average of the validation performance. This approach is computationally expensive but can lead to more robust single models.

To minimize the computation time, two-dimensional hash tables are used to store evaluated models. This is irrelevant for the greedy search but can substantially reduce the fitting times of simulated annealing or genetic programming, especially when an inner validation is chosen.
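
For example, an inner 5-fold cross-validation can be requested as follows; this is a minimal sketch on made-up toy data:

set.seed(1)
X <- matrix(rbinom(100 * 5, 1, 0.5), ncol = 5)
colnames(X) <- paste0("X", 1:5)
y <- rbinom(100, 1, 0.3 + 0.4 * (X[, 1] == 1 & X[, 2] == 0))
m <- logicDT(X, y, max_vars = 2, max_conj = 1,
             search_algo = "sa",
             val_method = "cv", val_reps = 5)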

Value

An object of class logicDT. This is a list containing

disj

A matrix of the identified sets of predictors and conjunctions of predictors. Each entry corresponds to a column index in X. Negative values indicate negations. Missing values mean that the term does not contain any further variables.

real_disj

Human-readable version of disj in which the variable names are depicted directly.

score

Score of the best model. Smaller values are preferred.

pet

Decision tree fitted on the best set of input terms. This is a list containing the pointer to the C representation of the tree and R representations of the tree structure such as the splits and predictions.

ensemble

List of decision trees. Only relevant if inner validation was used.

total_iter

The total number of search iterations, i.e., the number of configurations tested by fitting and evaluating a tree (ensemble).

prevented_evals

The number of tree fittings prevented by using the two-dimensional hash table.

...

Supplied parameters of the function call to logicDT.
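
Given a fitted model m as above, these components can be inspected directly; a short sketch using the component names listed above:

m$real_disj        # human-readable conjunctions
m$score            # score of the best model (smaller is better)
m$total_iter       # total number of tested configurations
m$prevented_evals  # tree fits saved by the hash table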

Saving and Loading

logicDT models can be saved and loaded using save(...) and load(...). The internal C structures will not be saved but are rebuilt from the R representations if necessary.
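
For example (the file name is arbitrary):

save(model, file = "logicDT_model.RData")
load("logicDT_model.RData")
# The internal C structures are rebuilt from the R representations
# if necessary, e.g., for prediction.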

References

  • Lau, M., Schikowski, T. & Schwender, H. (2021). logicDT: A Procedure for Identifying Response-Associated Interactions Between Binary Predictors. To be submitted.

  • Breiman, L., Friedman, J., Stone, C. J. & Olshen, R. A. (1984). Classification and Regression Trees. CRC Press. doi: 10.1201/9781315139470

  • Kirkpatrick, S., Gelatt, C. D. & Vecchi, M. P. (1983). Optimization by Simulated Annealing. Science 220(4598):671–680. doi: 10.1126/science.220.4598.671

Examples

# Generate toy data
set.seed(123)
maf <- 0.25
n.snps <- 50
N <- 2000
X <- matrix(sample(0:2, n.snps * N, replace = TRUE,
                   prob = c((1-maf)^2, 1-(1-maf)^2-maf^2, maf^2)), ncol = n.snps)
colnames(X) <- paste("SNP", 1:n.snps, sep="")
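# Split each SNP into two binary variables (dominant and recessive coding)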
X <- splitSNPs(X)
Z <- matrix(rnorm(N, 20, 10), ncol = 1)
colnames(Z) <- "E"
Z[Z < 0] <- 0
y <- -0.75 + log(2) * (X[,"SNP1D"] != 0) +
  log(4) * Z/20 * (X[,"SNP2D"] != 0 & X[,"SNP3D"] == 0) +
  rnorm(N, 0, 1)


# Fit and evaluate single logicDT model
model <- logicDT(X[1:(N/2),], y[1:(N/2)], Z = Z[1:(N/2),,drop=FALSE],
                 max_vars = 3, max_conj = 2,
                 search_algo = "sa",
                 tree_control = tree.control(nodesize = floor(0.05 * nrow(X)/2)),
                 simplify = "vars",
                 allow_conj_removal = FALSE, conjsize = floor(0.05 * nrow(X)/2))
calcNRMSE(predict(model, X[(N/2+1):N,], Z = Z[(N/2+1):N,,drop=FALSE]),
          y[(N/2+1):N])
plot(model)
print(model)

# Fit and evaluate bagged logicDT model
model.bagged <- logicDT.bagging(X[1:(N/2),], y[1:(N/2)], Z = Z[1:(N/2),,drop=FALSE],
                                bagging.iter = 50,
                                max_vars = 3, max_conj = 3,
                                search_algo = "greedy",
                                tree_control = tree.control(nodesize = floor(0.05 * nrow(X)/2)),
                                simplify = "vars",
                                conjsize = floor(0.05 * nrow(X)/2))
calcNRMSE(predict(model.bagged, X[(N/2+1):N,], Z = Z[(N/2+1):N,,drop=FALSE]),
          y[(N/2+1):N])
print(model.bagged)

# Fit and evaluate boosted logicDT model
model.boosted <- logicDT.boosting(X[1:(N/2),], y[1:(N/2)], Z = Z[1:(N/2),,drop=FALSE],
                                  boosting.iter = 50, learning.rate = 0.01,
                                  subsample.frac = 0.75, replace = FALSE,
                                  max_vars = 3, max_conj = 3,
                                  search_algo = "greedy",
                                  tree_control = tree.control(nodesize = floor(0.05 * nrow(X)/2)),
                                  simplify = "vars",
                                  conjsize = floor(0.05 * nrow(X)/2))
calcNRMSE(predict(model.boosted, X[(N/2+1):N,], Z = Z[(N/2+1):N,,drop=FALSE]),
          y[(N/2+1):N])
print(model.boosted)

# Calculate VIMs (variable importance measures)
vims <- vim(model.bagged)
plot(vims)
print(vims)

# Single greedy model
model <- logicDT(X[1:(N/2),], y[1:(N/2)], Z = Z[1:(N/2),,drop=FALSE],
                 max_vars = 3, max_conj = 2,
                 search_algo = "greedy",
                 tree_control = tree.control(nodesize = floor(0.05 * nrow(X)/2)),
                 simplify = "vars",
                 allow_conj_removal = FALSE, conjsize = floor(0.05 * nrow(X)/2))
calcNRMSE(predict(model, X[(N/2+1):N,], Z = Z[(N/2+1):N,,drop=FALSE]),
          y[(N/2+1):N])
plot(model)
print(model)
