bcf: Run the Bayesian Causal Forest (BCF) algorithm for...
In stochtree: Stochastic Tree Ensembles (XBART and BART) for Supervised Learning and Causal Inference

View source: R/bcf.R

bcf	R Documentation

Run the Bayesian Causal Forest (BCF) algorithm for regularized causal effect estimation.

Description

Run the Bayesian Causal Forest (BCF) algorithm for regularized causal effect estimation.

Usage

bcf(
  X_train,
  Z_train,
  y_train,
  propensity_train = NULL,
  rfx_group_ids_train = NULL,
  rfx_basis_train = NULL,
  X_test = NULL,
  Z_test = NULL,
  propensity_test = NULL,
  rfx_group_ids_test = NULL,
  rfx_basis_test = NULL,
  num_gfr = 5,
  num_burnin = 0,
  num_mcmc = 100,
  previous_model_json = NULL,
  previous_model_warmstart_sample_num = NULL,
  general_params = list(),
  prognostic_forest_params = list(),
  treatment_effect_forest_params = list(),
  variance_forest_params = list()
)

Arguments

`X_train`	Covariates used to split trees in the ensemble. May be provided either as a dataframe or a matrix. Matrix covariates will be assumed to be all numeric. Covariates passed as a dataframe will be preprocessed based on the variable types (e.g. categorical columns stored as unordered factors will be one-hot encoded, categorical columns stored as ordered factors will passed as integers to the core algorithm, along with the metadata that the column is ordered categorical).
`Z_train`	Vector of (continuous or binary) treatment assignments.
`y_train`	Outcome to be modeled by the ensemble.
`propensity_train`	(Optional) Vector of propensity scores. If not provided, this will be estimated from the data.
`rfx_group_ids_train`	(Optional) Group labels used for an additive random effects model.
`rfx_basis_train`	(Optional) Basis for "random-slope" regression in an additive random effects model. If `rfx_group_ids_train` is provided with a regression basis, an intercept-only random effects model will be estimated.
`X_test`	(Optional) Test set of covariates used to define "out of sample" evaluation data. May be provided either as a dataframe or a matrix, but the format of `X_test` must be consistent with that of `X_train`.
`Z_test`	(Optional) Test set of (continuous or binary) treatment assignments.
`propensity_test`	(Optional) Vector of propensity scores. If not provided, this will be estimated from the data.
`rfx_group_ids_test`	(Optional) Test set group labels used for an additive random effects model. We do not currently support (but plan to in the near future), test set evaluation for group labels that were not in the training set.
`rfx_basis_test`	(Optional) Test set basis for "random-slope" regression in additive random effects model.
`num_gfr`	Number of "warm-start" iterations run using the grow-from-root algorithm (He and Hahn, 2021). Default: 5.
`num_burnin`	Number of "burn-in" iterations of the MCMC sampler. Default: 0.
`num_mcmc`	Number of "retained" iterations of the MCMC sampler. Default: 100.
`previous_model_json`	(Optional) JSON string containing a previous BCF model. This can be used to "continue" a sampler interactively after inspecting the samples or to run parallel chains "warm-started" from existing forest samples. Default: `NULL`.
`previous_model_warmstart_sample_num`	(Optional) Sample number from `previous_model_json` that will be used to warmstart this BCF sampler. One-indexed (so that the first sample is used for warm-start by setting `previous_model_warmstart_sample_num = 1`). Default: `NULL`.
`general_params`	(Optional) A list of general (non-forest-specific) model parameters, each of which has a default value processed internally, so this argument list is optional. `cutpoint_grid_size` Maximum size of the "grid" of potential cutpoints to consider in the GFR algorithm. Default: `100`. `standardize` Whether or not to standardize the outcome (and store the offset / scale in the model object). Default: `TRUE`. `sample_sigma2_global` Whether or not to update the `sigma^2` global error variance parameter based on `IG(sigma2_global_shape, sigma2_global_scale)`. Default: `TRUE`. `sigma2_global_init` Starting value of global error variance parameter. Calibrated internally as `1.0var((y_train-mean(y_train))/sd(y_train))` if not set. `sigma2_global_shape` Shape parameter in the `IG(sigma2_global_shape, sigma2_global_scale)` global error variance model. Default: `0`. `sigma2_global_scale` Scale parameter in the `IG(sigma2_global_shape, sigma2_global_scale)` global error variance model. Default: `0`. `variable_weights` Numeric weights reflecting the relative probability of splitting on each variable. Does not need to sum to 1 but cannot be negative. Defaults to `rep(1/ncol(X_train), ncol(X_train))` if not set here. Note that if the propensity score is included as a covariate in either forest, its weight will default to `1/ncol(X_train)`. A workaround if you wish to provide a custom weight for the propensity score is to include it as a column in `X_train` and then set `propensity_covariate` to `'none'` adjust `keep_vars` accordingly for the `mu` or `tau` forests. `propensity_covariate` Whether to include the propensity score as a covariate in either or both of the forests. Enter `"none"` for neither, `"mu"` for the prognostic forest, `"tau"` for the treatment forest, and `"both"` for both forests. If this is not `"none"` and a propensity score is not provided, it will be estimated from (`X_train`, `Z_train`) using `stochtree::bart()`. Default: `"mu"`. `adaptive_coding` Whether or not to use an "adaptive coding" scheme in which a binary treatment variable is not coded manually as (0,1) or (-1,1) but learned via parameters `b_0` and `b_1` that attach to the outcome model `⁠[b_0 (1-Z) + b_1 Z] tau(X)⁠`. This is ignored when Z is not binary. Default: `TRUE`. `control_coding_init` Initial value of the "control" group coding parameter. This is ignored when Z is not binary. Default: `-0.5`. `treated_coding_init` Initial value of the "treatment" group coding parameter. This is ignored when Z is not binary. Default: `0.5`. `rfx_prior_var` Prior on the (diagonals of the) covariance of the additive group-level random regression coefficients. Must be a vector of length `ncol(rfx_basis_train)`. Default: `rep(1, ncol(rfx_basis_train))` `random_seed` Integer parameterizing the C++ random number generator. If not specified, the C++ random number generator is seeded according to `std::random_device`. `keep_burnin` Whether or not "burnin" samples should be included in the stored samples of forests and other parameters. Default `FALSE`. Ignored if `num_mcmc = 0`. `keep_gfr` Whether or not "grow-from-root" samples should be included in the stored samples of forests and other parameters. Default `FALSE`. Ignored if `num_mcmc = 0`. `keep_every` How many iterations of the burned-in MCMC sampler should be run before forests and parameters are retained. Default `1`. Setting `keep_every <- k` for some `k > 1` will "thin" the MCMC samples by retaining every `k`-th sample, rather than simply every sample. This can reduce the autocorrelation of the MCMC samples. `num_chains` How many independent MCMC chains should be sampled. If `num_mcmc = 0`, this is ignored. If `num_gfr = 0`, then each chain is run from root for `num_mcmc keep_every + num_burnin` iterations, with `num_mcmc` samples retained. If `num_gfr > 0`, each MCMC chain will be initialized from a separate GFR ensemble, with the requirement that `num_gfr >= num_chains`. Default: `1`. `verbose` Whether or not to print progress during the sampling loops. Default: `FALSE`.
`prognostic_forest_params`	(Optional) A list of prognostic forest model parameters, each of which has a default value processed internally, so this argument list is optional. `num_trees` Number of trees in the ensemble for the prognostic forest. Default: `250`. Must be a positive integer. `alpha` Prior probability of splitting for a tree of depth 0 in the prognostic forest. Tree split prior combines `alpha` and `beta` via `alpha(1+node_depth)^-beta`. Default: `0.95`. `beta` Exponent that decreases split probabilities for nodes of depth > 0 in the prognostic forest. Tree split prior combines `alpha` and `beta` via `alpha(1+node_depth)^-beta`. Default: `2`. `min_samples_leaf` Minimum allowable size of a leaf, in terms of training samples, in the prognostic forest. Default: `5`. `max_depth` Maximum depth of any tree in the ensemble in the prognostic forest. Default: `10`. Can be overridden with `-1` which does not enforce any depth limits on trees. `variable_weights` Numeric weights reflecting the relative probability of splitting on each variable in the prognostic forest. Does not need to sum to 1 but cannot be negative. Defaults to `rep(1/ncol(X_train), ncol(X_train))` if not set here. `sample_sigma2_leaf` Whether or not to update the leaf scale variance parameter based on `IG(sigma2_leaf_shape, sigma2_leaf_scale)`. `sigma2_leaf_init` Starting value of leaf node scale parameter. Calibrated internally as `1/num_trees` if not set here. `sigma2_leaf_shape` Shape parameter in the `IG(sigma2_leaf_shape, sigma2_leaf_scale)` leaf node parameter variance model. Default: `3`. `sigma2_leaf_scale` Scale parameter in the `IG(sigma2_leaf_shape, sigma2_leaf_scale)` leaf node parameter variance model. Calibrated internally as `0.5/num_trees` if not set here. `keep_vars` Vector of variable names or column indices denoting variables that should be included in the forest. Default: `NULL`. `drop_vars` Vector of variable names or column indices denoting variables that should be excluded from the forest. Default: `NULL`. If both `drop_vars` and `keep_vars` are set, `drop_vars` will be ignored.
`treatment_effect_forest_params`	(Optional) A list of treatment effect forest model parameters, each of which has a default value processed internally, so this argument list is optional. `num_trees` Number of trees in the ensemble for the treatment effect forest. Default: `50`. Must be a positive integer. `alpha` Prior probability of splitting for a tree of depth 0 in the treatment effect forest. Tree split prior combines `alpha` and `beta` via `alpha(1+node_depth)^-beta`. Default: `0.25`. `beta` Exponent that decreases split probabilities for nodes of depth > 0 in the treatment effect forest. Tree split prior combines `alpha` and `beta` via `alpha(1+node_depth)^-beta`. Default: `3`. `min_samples_leaf` Minimum allowable size of a leaf, in terms of training samples, in the treatment effect forest. Default: `5`. `max_depth` Maximum depth of any tree in the ensemble in the treatment effect forest. Default: `5`. Can be overridden with `-1` which does not enforce any depth limits on trees. `variable_weights` Numeric weights reflecting the relative probability of splitting on each variable in the treatment effect forest. Does not need to sum to 1 but cannot be negative. Defaults to `rep(1/ncol(X_train), ncol(X_train))` if not set here. `sample_sigma2_leaf` Whether or not to update the leaf scale variance parameter based on `IG(sigma2_leaf_shape, sigma2_leaf_scale)`. Cannot (currently) be set to true if `ncol(Z_train)>1`. Default: `FALSE`. `sigma2_leaf_init` Starting value of leaf node scale parameter. Calibrated internally as `1/num_trees` if not set here. `sigma2_leaf_shape` Shape parameter in the `IG(sigma2_leaf_shape, sigma2_leaf_scale)` leaf node parameter variance model. Default: `3`. `sigma2_leaf_scale` Scale parameter in the `IG(sigma2_leaf_shape, sigma2_leaf_scale)` leaf node parameter variance model. Calibrated internally as `0.5/num_trees` if not set here. `keep_vars` Vector of variable names or column indices denoting variables that should be included in the forest. Default: `NULL`. `drop_vars` Vector of variable names or column indices denoting variables that should be excluded from the forest. Default: `NULL`. If both `drop_vars` and `keep_vars` are set, `drop_vars` will be ignored.
`variance_forest_params`	(Optional) A list of variance forest model parameters, each of which has a default value processed internally, so this argument list is optional. `num_trees` Number of trees in the ensemble for the conditional variance model. Default: `0`. Variance is only modeled using a tree / forest if `num_trees > 0`. `alpha` Prior probability of splitting for a tree of depth 0 in the variance model. Tree split prior combines `alpha` and `beta` via `alpha(1+node_depth)^-beta`. Default: `0.95`. `beta` Exponent that decreases split probabilities for nodes of depth > 0 in the variance model. Tree split prior combines `alpha` and `beta` via `alpha(1+node_depth)^-beta`. Default: `2`. `min_samples_leaf` Minimum allowable size of a leaf, in terms of training samples, in the variance model. Default: `5`. `max_depth` Maximum depth of any tree in the ensemble in the variance model. Default: `10`. Can be overridden with `-1` which does not enforce any depth limits on trees. `leaf_prior_calibration_param` Hyperparameter used to calibrate the `IG(var_forest_prior_shape, var_forest_prior_scale)` conditional error variance model. If `var_forest_prior_shape` and `var_forest_prior_scale` are not set below, this calibration parameter is used to set these values to `num_trees / leaf_prior_calibration_param^2 + 0.5` and `num_trees / leaf_prior_calibration_param^2`, respectively. Default: `1.5`. `variance_forest_init` Starting value of root forest prediction in conditional (heteroskedastic) error variance model. Calibrated internally as `log(0.6*var((y_train-mean(y_train))/sd(y_train)))/num_trees` if not set. `var_forest_prior_shape` Shape parameter in the `IG(var_forest_prior_shape, var_forest_prior_scale)` conditional error variance model (which is only sampled if `num_trees > 0`). Calibrated internally as `num_trees / 1.5^2 + 0.5` if not set. `var_forest_prior_scale` Scale parameter in the `IG(var_forest_prior_shape, var_forest_prior_scale)` conditional error variance model (which is only sampled if `num_trees > 0`). Calibrated internally as `num_trees / 1.5^2` if not set. `keep_vars` Vector of variable names or column indices denoting variables that should be included in the forest. Default: `NULL`. `drop_vars` Vector of variable names or column indices denoting variables that should be excluded from the forest. Default: `NULL`. If both `drop_vars` and `keep_vars` are set, `drop_vars` will be ignored.

Value

List of sampling outputs and a wrapper around the sampled forests (which can be used for in-memory prediction on new data, or serialized to JSON on disk).

Examples

n <- 500
p <- 5
X <- matrix(runif(n*p), ncol = p)
mu_x <- (
    ((0 <= X[,1]) & (0.25 > X[,1])) * (-7.5) + 
    ((0.25 <= X[,1]) & (0.5 > X[,1])) * (-2.5) + 
    ((0.5 <= X[,1]) & (0.75 > X[,1])) * (2.5) + 
    ((0.75 <= X[,1]) & (1 > X[,1])) * (7.5)
)
pi_x <- (
    ((0 <= X[,1]) & (0.25 > X[,1])) * (0.2) + 
    ((0.25 <= X[,1]) & (0.5 > X[,1])) * (0.4) + 
    ((0.5 <= X[,1]) & (0.75 > X[,1])) * (0.6) + 
    ((0.75 <= X[,1]) & (1 > X[,1])) * (0.8)
)
tau_x <- (
    ((0 <= X[,2]) & (0.25 > X[,2])) * (0.5) + 
    ((0.25 <= X[,2]) & (0.5 > X[,2])) * (1.0) + 
    ((0.5 <= X[,2]) & (0.75 > X[,2])) * (1.5) + 
    ((0.75 <= X[,2]) & (1 > X[,2])) * (2.0)
)
Z <- rbinom(n, 1, pi_x)
noise_sd <- 1
y <- mu_x + tau_x*Z + rnorm(n, 0, noise_sd)
test_set_pct <- 0.2
n_test <- round(test_set_pct*n)
n_train <- n - n_test
test_inds <- sort(sample(1:n, n_test, replace = FALSE))
train_inds <- (1:n)[!((1:n) %in% test_inds)]
X_test <- X[test_inds,]
X_train <- X[train_inds,]
pi_test <- pi_x[test_inds]
pi_train <- pi_x[train_inds]
Z_test <- Z[test_inds]
Z_train <- Z[train_inds]
y_test <- y[test_inds]
y_train <- y[train_inds]
mu_test <- mu_x[test_inds]
mu_train <- mu_x[train_inds]
tau_test <- tau_x[test_inds]
tau_train <- tau_x[train_inds]
bcf_model <- bcf(X_train = X_train, Z_train = Z_train, y_train = y_train, 
                 propensity_train = pi_train, X_test = X_test, Z_test = Z_test, 
                 propensity_test = pi_test, num_gfr = 10, 
                 num_burnin = 0, num_mcmc = 10)

stochtree documentation built on April 4, 2025, 2:11 a.m.

stochtree index

Package overview README.md

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

stochtree
Stochastic Tree Ensembles (XBART and BART) for Supervised Learning and Causal Inference

bcf: Run the Bayesian Causal Forest (BCF) algorithm for...
In stochtree: Stochastic Tree Ensembles (XBART and BART) for Supervised Learning and Causal Inference

Run the Bayesian Causal Forest (BCF) algorithm for regularized causal effect estimation.

Description

Usage

Arguments

Value

Examples

Related to bcf in stochtree...

R Package Documentation

Browse R Packages

We want your feedback!

stochtree Stochastic Tree Ensembles (XBART and BART) for Supervised Learning and Causal Inference

bcf: Run the Bayesian Causal Forest (BCF) algorithm for... In stochtree: Stochastic Tree Ensembles (XBART and BART) for Supervised Learning and Causal Inference

Run the Bayesian Causal Forest (BCF) algorithm for regularized causal effect estimation.

Description

Usage

Arguments

Value

Examples

Related to bcf in stochtree...

R Package Documentation

Browse R Packages

We want your feedback!

stochtree
Stochastic Tree Ensembles (XBART and BART) for Supervised Learning and Causal Inference

bcf: Run the Bayesian Causal Forest (BCF) algorithm for...
In stochtree: Stochastic Tree Ensembles (XBART and BART) for Supervised Learning and Causal Inference