```r
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
```
The MRTAnalysis package now supports analysis of the distal causal excursion effect (DCEE) of a continuous distal outcome in micro-randomized trials (MRTs), using the function `dcee()`.

Distal outcomes are measured once at the end of the study (e.g., weight loss, cognitive score), in contrast to proximal outcomes, which are measured repeatedly after each treatment decision point.

This vignette introduces the `dcee()` function to estimate the DCEE for an MRT with a continuous distal outcome.

In a distal-outcome MRT, each participant $i$ is observed at decision points $t = 1, \ldots, T$, with time-varying covariates $X_{it}$, an availability indicator $I_{it}$, a treatment assignment $A_{it}$ randomized with probability $p_{it}$, and a single continuous distal outcome $Y_i$ measured at the end of the study.
Thus, each row in the long-format data corresponds to $(X_{it}, A_{it}, I_{it}, p_{it})$, with $Y_i$ constant within each participant.
The distal causal excursion effects are defined using potential outcomes in @qian2025distal. Roughly speaking, the DCEE at decision point $t$ is the difference in the outcome $Y_i$ due to assigning treatment $A_{it}=1$ versus $A_{it}=0$ at time $t$, while keeping the past and future treatment assignments according to the randomization probabilities in the MRT (i.e., the MRT policy), and averaging over the covariate history and availability at $t$.
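Schematically, translating the verbal definition above into notation (the precise potential-outcome definition, including how availability enters, is given in @qian2025distal), the DCEE at decision point $t$ can be written as

$$
\mathrm{DCEE}(t) \;=\; E\!\left[\, Y_i\big(\bar{A}_{i,t-1},\, 1,\, \underline{A}_{i,t+1}\big) \;-\; Y_i\big(\bar{A}_{i,t-1},\, 0,\, \underline{A}_{i,t+1}\big) \right],
$$

where $\bar{A}_{i,t-1}$ denotes the treatments before decision point $t$ and $\underline{A}_{i,t+1}$ the treatments after $t$ (notation introduced here for illustration), both assigned according to the MRT randomization probabilities, and the expectation averages over the covariate history and availability at $t$.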
This package provides `data_distal_continuous`, a synthetic dataset with:

- `userid`: participant id.
- `dp`: decision point index.
- `X`: continuous endogenous covariate.
- `Z`: binary endogenous covariate.
- `avail`: availability indicator.
- `A`: treatment indicator.
- `prob_A`: randomization probability.
- `A_lag1`: lag-1 treatment.
- `Y`: continuous distal outcome, identical across rows for the same `userid`.

```r
library(MRTAnalysis)
current_options <- options(digits = 3) # save current options for restoring later
head(data_distal_continuous, 10)
```
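As a quick sanity check of the structure described above (an illustrative snippet using only the columns listed, not part of the package workflow), one can verify that `Y` is constant within each participant:

```r
# Check that the distal outcome Y takes a single value within each userid.
all(tapply(
    data_distal_continuous$Y,
    data_distal_continuous$userid,
    function(y) length(unique(y)) == 1
))
```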
In the following function call of `dcee()`, we specify the distal outcome variable by `outcome = "Y"`. We specify the treatment variable by `treatment = "A"`. We specify the time-varying randomization probability by `rand_prob = "prob_A"`. We specify the fully marginal effect as the quantity to be estimated by setting `moderator_formula = ~1`. We use `X` and `Z` as control variables by setting `control_formula = ~ X + Z`. We specify the availability variable by `availability = "avail"`. We use linear regression for the control regression model (i.e., the Stage-1 nuisance models in the two-stage estimation procedure in @qian2025distal) by setting `control_reg_method = "lm"`.
Note that the estimator for the distal causal excursion effect is consistent even if the control regression model is mis-specified, as long as the treatment randomization probabilities are correctly specified (which will be the case for MRTs). Different control regression methods can be used to improve efficiency.
```r
fit_lm <- dcee(
    data = data_distal_continuous,
    id = "userid",
    outcome = "Y",
    treatment = "A",
    rand_prob = "prob_A",
    moderator_formula = ~1,
    control_formula = ~ X + Z,
    availability = "avail",
    control_reg_method = "lm"
)
summary(fit_lm)
```
The `summary()` function provides the estimated distal causal excursion effect as well as the 95% confidence interval, standard error, and p-value. The only row in the output section `Distal causal excursion effect (beta)` is named `Intercept`, indicating that this is the fully marginal effect (like an intercept in the causal effect model). In particular, the estimated marginal distal excursion effect is 0.404, with 95% confidence interval (-0.771, 1.579) and p-value 0.49. The confidence interval and the p-value are based on t-quantiles.
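As a back-of-the-envelope illustration of a t-quantile interval, the interval is estimate $\pm\; t_{0.975,\,\mathrm{df}} \times \mathrm{SE}$; the standard error and degrees of freedom below are hypothetical placeholders, not values taken from the fit:

```r
# Sketch only: how a 95% t-quantile confidence interval is assembled.
# se and df are hypothetical placeholders; the actual values are reported
# by summary(fit_lm).
est <- 0.404
se <- 0.59
df <- 28
est + c(-1, 1) * qt(0.975, df = df) * se
```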
The following code uses `dcee()` to estimate the distal causal excursion effect moderated by the time-varying covariate `Z`. This is achieved by setting `moderator_formula = ~ Z`.
```r
fit_mod <- dcee(
    data = data_distal_continuous,
    id = "userid",
    outcome = "Y",
    treatment = "A",
    rand_prob = "prob_A",
    moderator_formula = ~Z,
    control_formula = ~ Z + X,
    availability = "avail",
    control_reg_method = "lm"
)
summary(fit_mod, lincomb = c(1, 1)) # beta0 + beta1
```
In the above, we asked `summary()` to calculate and print the estimate for $\beta_0 + \beta_1$, the distal causal excursion effect when the binary variable $Z$ takes value 1, by using the optional `lincomb` argument. Setting `lincomb = c(1, 1)` asks `summary()` to print out $[1, 1] \times (\beta_0, \beta_1)^T = \beta_0 + \beta_1$. The table under `Linear combinations (L * beta)` is the fitted result for this $\beta_0 + \beta_1$ coefficient combination.
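For intuition, the reported quantity is simply $L\hat{\beta}$. The snippet below shows the arithmetic with hypothetical coefficient values (the actual estimates, standard errors, and inference come from `summary(fit_mod)`):

```r
# Illustration of the L * beta arithmetic with hypothetical coefficients;
# these numbers are placeholders, not the fitted estimates.
beta_hat <- c(intercept = 0.2, Z = 0.5) # hypothetical (beta0, beta1)
L <- c(1, 1) # picks out beta0 + beta1, i.e., the effect when Z = 1
drop(L %*% beta_hat)
```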
One can use generalized additive models (GAM) for the control regression models by setting `control_reg_method = "gam"`. This may improve efficiency if the relationship between the distal outcome and the covariates is non-linear. One can use `s()` to specify non-linear terms in the `control_formula`. For example, here we use a smooth term for the continuous covariate `X` by setting `control_formula = ~ s(X) + Z`.
```r
fit_gam <- dcee(
    data = data_distal_continuous,
    id = "userid",
    outcome = "Y",
    treatment = "A",
    rand_prob = "prob_A",
    moderator_formula = ~Z,
    control_formula = ~ s(X) + Z,
    availability = "avail",
    control_reg_method = "gam"
)
summary(fit_gam)
```
One can also use tree-based methods for the control regression models by setting `control_reg_method = "rf"` (random forest via the randomForest package) or `control_reg_method = "ranger"` (faster random forest via the ranger package). This may improve efficiency if the relationship between the distal outcome and the covariates is complex. Note that tree-based methods do not allow specification of smooth terms like `s(X)`; the `control_formula` has to be specified using main terms only. Additional optional arguments can be passed to the underlying random forest function via the `...` argument of `dcee()`; a sketch is given after the example below.
```r
fit_rf <- dcee(
    data = data_distal_continuous,
    id = "userid",
    outcome = "Y",
    treatment = "A",
    rand_prob = "prob_A",
    moderator_formula = ~1,
    control_formula = ~ X + Z,
    availability = "avail",
    control_reg_method = "rf" # can replace "rf" with "ranger" for faster implementation
)
summary(fit_rf)
```
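As a sketch of passing an extra argument through `...` (relying on the statement above that such arguments are forwarded to the underlying random forest function; `ntree` is a standard randomForest argument, used here purely for illustration):

```r
# Sketch: pass a tuning argument to the underlying random forest fit via `...`.
# ntree is a standard randomForest() argument; it is used here only to
# illustrate the forwarding described above.
fit_rf_tuned <- dcee(
    data = data_distal_continuous,
    id = "userid",
    outcome = "Y",
    treatment = "A",
    rand_prob = "prob_A",
    moderator_formula = ~1,
    control_formula = ~ X + Z,
    availability = "avail",
    control_reg_method = "rf",
    ntree = 200 # forwarded to randomForest() via `...`
)
summary(fit_rf_tuned)
```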
The `dcee()` function also supports cross-fitting, which may lead to improved finite-sample performance when using complex machine learning methods for the control regression models. This is done by setting `cross_fit = TRUE` and specifying the number of folds via `cf_fold`. Here we use 5-fold cross-fitting with generalized additive models for the control regression models as an example. The particular cross-fitting algorithm follows Section 4 in the Web Appendix of @zhong2021aipw.
```r
fit_cf <- dcee(
    data = data_distal_continuous,
    id = "userid",
    outcome = "Y",
    treatment = "A",
    rand_prob = "prob_A",
    moderator_formula = ~1,
    control_formula = ~ X + Z,
    availability = "avail",
    control_reg_method = "gam",
    cross_fit = TRUE,
    cf_fold = 5
)
summary(fit_cf)
```
We can set `show_control_fit = TRUE` in the `summary()` function to inspect the control regression (i.e., Stage-1 nuisance) model fits. This is useful for diagnosing the fit of the control regression models. For `lm`/`gam`, these include regression summaries; for tree-based or SuperLearner fits, the original learner output is shown. To further inspect the control regression model fits, one can manually inspect the `$fit$regfit_a0` and `$fit$regfit_a1` components of the fitted object.
```r
summary(fit_lm, show_control_fit = TRUE)
```
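A minimal sketch of the manual inspection mentioned above, assuming the two Stage-1 fits are stored under `$fit$regfit_a0` and `$fit$regfit_a1` of the returned object (here for the `lm`-based fit):

```r
# Sketch: directly inspect the two Stage-1 control regression fits,
# assuming they are stored as $fit$regfit_a0 and $fit$regfit_a1.
summary(fit_lm$fit$regfit_a0)
summary(fit_lm$fit$regfit_a1)
```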