SDForest: Spectrally Deconfounded Random Forests
In SDModels: Spectrally Deconfounded Models

Description

Estimate a regression Random Forest using spectral deconfounding. The spectrally deconfounded Random Forest (SDForest) combines SDTrees in the same way as the original Random Forest (Breiman, 2001). The idea is to combine multiple regression trees into an ensemble in order to decrease variance and obtain a smoother function. Ensembles work best if the different models are independent of each other. To decorrelate the regression trees as much as possible, we use two mechanisms. The first is bagging (Breiman, 1996), where we train each regression tree on an independent bootstrap sample of the observations, i.e., we draw a random sample of size n with replacement from the observations. The second mechanism is that only a random subset of the covariates is available for each split. Before each split, we sample \text{mtry} \leq p covariates from all the covariates and choose, only among those, the one that reduces the loss the most.

\widehat{f(X)} = \frac{1}{N_{tree}} \sum_{t = 1}^{N_{tree}} SDTree_t(X)
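
The two decorrelation mechanisms and the averaging step can be illustrated with a minimal base R sketch. It is not the package's implementation: `fit_stump` and `predict_stump` are hypothetical helpers, each "tree" is a single split at the median, and the ordinary squared-error loss is used instead of the spectrally transformed loss.

```r
set.seed(1)
n <- 200
p <- 6
X <- matrix(rnorm(n * p), nrow = n)
y <- 3 * sign(X[, 1]) + rnorm(n)

# Hypothetical helper: fit a one-split tree (stump) on a bootstrap sample,
# considering only `mtry` randomly chosen covariates.
fit_stump <- function(X, y, mtry) {
  idx <- sample(nrow(X), replace = TRUE)  # bagging: n draws with replacement
  cand <- sample(ncol(X), mtry)           # random subset of covariates
  Xb <- X[idx, , drop = FALSE]
  yb <- y[idx]
  best <- list(loss = Inf)
  for (j in cand) {
    s <- median(Xb[, j])                  # single candidate split point
    left <- Xb[, j] <= s
    loss <- sum((yb[left] - mean(yb[left]))^2) +
      sum((yb[!left] - mean(yb[!left]))^2)
    if (loss < best$loss) {
      best <- list(loss = loss, j = j, s = s,
                   left = mean(yb[left]), right = mean(yb[!left]))
    }
  }
  best
}

# Hypothetical helper: predict with a fitted stump.
predict_stump <- function(stump, X) {
  ifelse(X[, stump$j] <= stump$s, stump$left, stump$right)
}

n_tree <- 50
mtry <- floor(p / 2)
forest <- lapply(seq_len(n_tree), function(t) fit_stump(X, y, mtry))

# Ensemble prediction: average of the individual tree predictions.
tree_pred <- sapply(forest, function(s) predict_stump(s, X))
f_hat <- rowMeans(tree_pred)
```

Here `tree_pred` has one column per tree, so `rowMeans` corresponds to the average over `N_{tree}` trees in the formula above.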

Usage

SDForest(
  formula = NULL,
  data = NULL,
  x = NULL,
  y = NULL,
  nTree = 100,
  cp = 0,
  min_sample = 5,
  mtry = NULL,
  mc.cores = 1,
  Q_type = "trim",
  trim_quantile = 0.5,
  q_hat = 0,
  Qf = NULL,
  A = NULL,
  gamma = 7,
  max_size = NULL,
  gpu = FALSE,
  return_data = TRUE,
  mem_size = 1e+07,
  leave_out_ind = NULL,
  envs = NULL,
  nTree_leave_out = NULL,
  nTree_env = NULL,
  max_candidates = 100,
  Q_scale = TRUE,
  verbose = TRUE,
  predictors = NULL
)

Arguments

`formula`	Object of class `formula` describing the model to fit, of the form `y ~ x1 + x2 + ...`, where `y` is a numeric response and `x1, x2, ...` are vectors of covariates. Interactions are not supported.
`data`	Training data of class `data.frame` containing the variables in the model.
`x`	Matrix of covariates, alternative to `formula` and `data`.
`y`	Vector of responses, alternative to `formula` and `data`.
`nTree`	Number of trees to grow.
`cp`	Complexity parameter, minimum loss decrease to split a node. A split is only performed if the loss decrease is larger than `cp * initial_loss`, where `initial_loss` is the loss of the initial estimate using only a stump.
`min_sample`	Minimum number of observations per leaf. A split is only performed if both resulting leaves have at least `min_sample` observations.
`mtry`	Number of randomly selected covariates to consider for a split. If `NULL`, half of the covariates are available for each split: `\text{mtry} = \lfloor \frac{p}{2} \rfloor`.
`mc.cores`	Number of cores to use for parallel processing. If `mc.cores > 1`, the trees are estimated in parallel.
`Q_type`	Type of deconfounding, one of 'trim', 'pca', 'no_deconfounding'. 'trim' corresponds to the Trim transform (Ćevid et al., 2020) as implemented in the Doubly debiased lasso (Guo et al., 2022), 'pca' to the PCA transformation (Paul et al., 2008). See `get_Q`.
`trim_quantile`	Quantile for Trim transform, only needed for trim, see `get_Q`.
`q_hat`	Assumed confounding dimension, only needed for pca, see `get_Q`.
`Qf`	Spectral transformation, if `NULL` it is internally estimated using `get_Q`.
`A`	Numerical Anchor of class `matrix`. See `get_W`.
`gamma`	Strength of distributional robustness, `\gamma \in [0, \infty]`. See `get_W`.
`max_size`	Maximum number of observations used for a bootstrap sample. If `NULL`, n samples are drawn with replacement.
`gpu`	If `TRUE`, the calculations are performed on the GPU, provided it is properly set up.
`return_data`	If `TRUE`, the training data is returned in the output. This is needed for `prune.SDForest`, `regPath.SDForest`, and for `mergeForest`.
`mem_size`	Number of split candidates that can be evaluated at once. This is a trade-off between memory and speed; it can be decreased if either the memory is not sufficient or the GPU is too small.
`leave_out_ind`	Indices of observations that should not be used for training.
`envs`	Vector of environments of class `factor` which can be used for stratified tree fitting.
`nTree_leave_out`	Number of trees that should be estimated while leaving one of the environments out. Results in number of environments times number of trees.
`nTree_env`	Number of trees that should be estimated for each environment. Results in number of environments times number of trees.
`max_candidates`	Maximum number of split points that are proposed at each node for each covariate.
`Q_scale`	Should data be scaled to estimate the spectral transformation? Default is `TRUE` to not reduce the signal of high variance covariates, and we do not know of a scenario where this hurts.
`verbose`	If `TRUE` fitting information is shown.
`predictors`	Subset of colnames(X) or numerical indices of the covariates for which an effect on y should be estimated. All the other covariates are only used for deconfounding.
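
As an illustration of the 'trim' option of `Q_type`, the following base R sketch builds a Trim transform as defined in the cited literature: singular values of the covariate matrix above the `trim_quantile` quantile are shrunk down to that quantile. `trim_transform` and its argument `scale_x` are hypothetical names used only here; `get_Q` is the function the package actually uses, and its details may differ.

```r
set.seed(1)
n <- 50
p <- 20
X <- matrix(rnorm(n * p), nrow = n)

# Hypothetical helper, not the package's get_Q().
trim_transform <- function(X, trim_quantile = 0.5, scale_x = TRUE) {
  Xs <- scale(X, center = TRUE, scale = scale_x)
  sv <- svd(Xs)
  tau <- quantile(sv$d, trim_quantile)  # threshold for the singular values
  d_tilde <- pmin(sv$d, tau)            # cap singular values at tau
  # Q = I - U diag(1 - d_tilde / d) U^T
  diag(nrow(X)) - sv$u %*% diag(1 - d_tilde / sv$d) %*% t(sv$u)
}

Q <- trim_transform(X, trim_quantile = 0.5)

# After the transformation, no singular value of Q %*% Xs exceeds tau.
Xs <- scale(X)
tau <- quantile(svd(Xs)$d, 0.5)
d_after <- svd(Q %*% Xs)$d
max(d_after) <= tau + 1e-8
```

With `trim_quantile = 0.5` the threshold is the median singular value, so the upper half of the spectrum, where dense confounding is expected to concentrate, is shrunk while the lower half is left unchanged.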

Value

Object of class SDForest containing:

`predictions`	Vector of predictions for each observation.
`forest`	List of SDTree objects.
`var_names`	Names of the covariates.
`oob_loss`	Out-of-bag loss (MSE).
`oob_SDloss`	Out-of-bag loss using the spectral transformation.
`var_importance`	Variable importance. The variable importance is calculated as the sum of the decrease in the loss function resulting from all splits that use a covariate for each tree. The mean of the variable importance of all trees results in the variable importance for the forest.
`oob_ind`	List of indices of trees that did not contain the observation in the training set.
`oob_predictions`	Out-of-bag predictions.
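
How `oob_ind`, `oob_predictions`, and `oob_loss` relate can be sketched in base R. The per-tree predictions below are random placeholders rather than fitted trees, and returning `NA` for an observation that appears in every bootstrap sample is an assumption of this sketch, not documented behaviour of the package.

```r
set.seed(1)
n <- 8
n_tree <- 5
y <- rnorm(n)

# Placeholder per-tree predictions (rows = observations, columns = trees).
tree_pred <- matrix(rnorm(n * n_tree), nrow = n)

# Bootstrap indices used to train each tree.
in_bag <- lapply(seq_len(n_tree), function(t) sample(n, replace = TRUE))

# For each observation: indices of the trees that did not see it in training.
oob_ind <- lapply(seq_len(n), function(i) {
  which(vapply(in_bag, function(idx) !(i %in% idx), logical(1)))
})

# Out-of-bag prediction: average over those trees only.
oob_predictions <- vapply(seq_len(n), function(i) {
  trees <- oob_ind[[i]]
  if (length(trees) == 0) NA_real_ else mean(tree_pred[i, trees])
}, numeric(1))

# Out-of-bag loss: mean squared error of the out-of-bag predictions.
oob_loss <- mean((y - oob_predictions)^2, na.rm = TRUE)
```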

If `return_data` is `TRUE`, the following are also returned:

`X`	Matrix of covariates.
`Y`	Vector of responses.
`Q`	Spectral transformation.

If `envs` is provided, the following are also returned:

`envs`	Vector of environments.
`nTree_env`	Number of trees for each environment.
`ooEnv_ind`	List of indices of trees that did not contain the observation or the same environment in the training set for each observation.
`ooEnv_loss`	Out-of-bag loss using only trees that did not contain the observation or the same environment.
`ooEnv_SDloss`	Out-of-bag loss using the spectral transformation and only trees that did not contain the observation or the same environment.
`ooEnv_predictions`	Out-of-bag predictions using only trees that did not contain the observation or the same environment.
`nTree_leave_out`	If environments are left out, the environment that was left out for each tree.
`nTree_env`	If environments are provided, the environment each tree is trained with.

Author(s)

Markus Ulmer

References

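Breiman, L. (1996). Bagging predictors. Machine Learning.

Breiman, L. (2001). Random forests. Machine Learning.

Ćevid, D., Bühlmann, P., and Meinshausen, N. (2020). Spectral deconfounding via perturbed sparse linear models. Journal of Machine Learning Research.

Guo, Z., Ćevid, D., and Bühlmann, P. (2022). Doubly debiased lasso: High-dimensional inference under hidden confounding. The Annals of Statistics.

Paul, D., Bair, E., Hastie, T., and Tibshirani, R. (2008). "Preconditioning" for feature selection and regression in high-dimensional problems. The Annals of Statistics.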

Examples

set.seed(1)
n <- 50
X <- matrix(rnorm(n * 5), nrow = n)
y <- sign(X[, 1]) * 3 + rnorm(n)
model <- SDForest(x = X, y = y, Q_type = 'no_deconfounding', nTree = 5, cp = 0.5)
predict(model, newdata = data.frame(X))

###### subset of predictors
# if we know that only the first covariate has an effect on y,
# we can estimate only its effect and use the others just for deconfounding
model <- SDForest(x = X, y = y, cp = 0.5, nTree = 5, predictors = c(1))


set.seed(42)
# simulation of confounded data
sim_data <- simulate_data_nonlinear(q = 2, p = 150, n = 100, m = 2)
X <- sim_data$X
Y <- sim_data$Y
train_data <- data.frame(X, Y)
# causal parents of y
sim_data$j

# comparison to classical random forest
fit_ranger <- ranger::ranger(Y ~ ., train_data, importance = 'impurity')

fit <- SDForest(x = X, y = Y, nTree = 100, Q_type = 'pca', q_hat = 2)
fit <- SDForest(Y ~ ., data = train_data, nTree = 100)
fit

# we can plot the fit to see whether the number of trees is high enough;
# if the performance stabilizes, we have enough trees, otherwise one can fit
# more and add them
plot(fit)

# a few more might be helpful
fit2 <- SDForest(Y ~ ., data = train_data, nTree = 50)
fit <- mergeForest(fit, fit2)

# comparison of variable importance
imp_ranger <- fit_ranger$variable.importance
imp_sdf <- fit$var_importance
imp_col <- rep('black', length(imp_ranger))
imp_col[sim_data$j] <- 'red'

plot(imp_ranger, imp_sdf, col = imp_col, pch = 20,
     xlab = 'ranger', ylab = 'SDForest', 
     main = 'Variable Importance')

# check regularization path of variable importance
path <- regPath(fit)
# out of bag error for different regularization
plotOOB(path)
plot(path)

# detection of causal parent using stability selection
stablePath <- stabilitySelection(fit)
plot(stablePath)

# pruning of forest according to optimal out-of-bag performance
fit <- prune(fit, cp = path$cp_min)

# partial functional dependence of y on the most important covariate
most_imp <- which.max(fit$var_importance)
dep <- partDependence(fit, most_imp)
plot(dep, n_examples = 100)

SDModels documentation built on June 8, 2025, 11:17 a.m.