knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(stdReg2)
Suppose $X$ denotes an exposure of interest that takes values 0 or 1. This could represent two different medical treatments, environmental exposures, economic policies, or genetic variants. We will most often use biomedical examples because we are biostatisticians.
We consider the setting where it is of interest to quantify the effect of intervening with $X$ on outcome that we will denote $Y$. The outcome could represent some numeric value, it could be the presence or absence of a condition, or it could be the time between two events, such as time from cancer diagnosis to death. Let $Y(X = x)$ denote the potential outcome if all subjects in the population would hypothetically be exposed to $X = x$.
To quantify the effect of $X$, we must summarize the distribution of $Y(X = x)$ with some statistic. If $Y$ is dichotomous, it is natural to use $p{Y(X = x) = 1}$, called the risk. If $Y$ is continuous, the mean is a natural summary statistic $E{Y(X = x)}$. If $Y$ is a continuous time-to-event, the probability of exceeding a particular value $t$ is a reasonable statistic: $p{Y(X = x) > t}$. In general, denote the summary statistic of choice as $T{Y(X = x)}$. The summary statistic can also be applied to conditional distributions, which we will denote, e.g., $T{Y | X = x}$.
To quantify the effect of $X$, we must also decide on a contrast to measure the causal effect. The point of the contrast is to compare the chosen summary statistic between the $X = 1$ and $X = 0$ interventions. Typical choices would be the difference $T{Y(X = 1)} - T{Y(X = 0)}$ or the ratio $T{Y(X = 1)} / T{Y(X = 0)}$. It may also be of interest to quantify and report the summary statistics within each group $(T{Y(X = 1)}, T{Y(X = 0)})$.
In observational studies, the relationship between $X$ and $Y$ is likely confounded by a set of other variables $\boldsymbol{Z}$. This means that the values of the outcome $Y$ are determined by at least a subset of $\boldsymbol{Z}$ and the values of the exposure $X$ are determined by a subset of $\boldsymbol{Z}$. Naively estimating the contrast would lead to biased estimates of the causal effect.
See @sjolander2016regression and @sjolander2018estimation for more details. Suppose that the covariates $\boldsymbol{Z}$ are sufficient for confounding control. For more information on what constitutes a sufficient adjustment set, see @witte2019covariate. For a given summary statistic $T$, then [ T{Y(X = x)} = E_{\boldsymbol{Z}}[T{Y | X = x, \boldsymbol{Z}}], ] where the expectation is taken with respect to the population distribution of $\boldsymbol{Z}$. This is also known as the g-formula or adjustment formula.
In order to estimate this quantity based on an independent and identically distributed sample $(X_1, Y_1, \boldsymbol{Z}_1), \ldots, (X_n, Y_n, \boldsymbol{Z}_n)$, we proceed by
One can do this for each level of $X = 0, 1$ and compute the desired contrast. Under the assumptions that 1) $\boldsymbol{Z}$ is sufficient for confounding control, and 2) the regression model in step 1 is correctly specified, then this estimator is consistent and asymptotically normal.
A doubly-robust estimator is an estimator that is consistent for a given estimand when one or more of the models used in the forming of the estimator is correctly specified for confounding. So far, we only have one model used in our estimator: the outcome model. We will now introduce a model for the exposure, and see how we can combine them. First, some terminology:
Misspecified model - The true generating mechanism is not contained in the possible mechanisms that are possible under the selected model.
Correctly Specified - A model is correctly specified if it is not misspecified.
Correctly specified for confounding - A correctly specified model that contains a sufficient set of confounders.
If we can model $P(X=1|\boldsymbol{Z})$, and it is correctly specified and contains all confounders, then we can use that to estimate the probability that each individual $i$ received the treatment that they did $W_i = \frac{X_i}{P(X_i=1|\boldsymbol{Z}_i)} + \frac{1-X_i}{1-P(X_i=1|\boldsymbol{Z}_i)}$. Let $\hat{p}_i$ denote the estimated probability that subject $i$ received treatment $1$.
Any consistent estimation method can be used for the outcome and exposure models. As long as either the outcome model or the propensity score model is correctly specified for confounding, then a doubly robust estimator is consistent for the ATE. In this package, for generalized linear models, we use the estimator as described by @gabriel2023inverse.
Recently, there seems to be a general misconception that combining an adjusted outcome model and a propensity score model always gives a doubly robust estimator. This is not true -- it matters how you combine them!
Here we will use regression standardization to estimate the average causal effect of the exposure (quitting smoking) in the variable qsmk
on the weight gain outcome in the variable wt82_71
in the nhefs_complete
dataset that comes with the causaldata
package. The data was collected as part of a project by the National Center for Health Statistics and the National Institute on Aging in collaboration with other agencies of the United States Public Health Service. It was designed to investigate the relationships between clinical, nutritional, and behavioral factors and subsequent morbidity, mortality, and hospital utilization and changes in risk factors, functional limitation, and institutionalization. The dataset includes 1566 individuals and contains among others the following variables:
nhefs_dat <- causaldata::nhefs_complete summary(nhefs_dat)
We will assume that the set of confounders in $\boldsymbol{Z}$ includes sex, race, age, education, number of cigarettes smoked per year, the number of years smoked, level of physical activity, and baseline weight in 1971. This equivalent to assuming that the counterfactual weight gain is independent of the exposure conditional on these variables. In other words, we are assuming the following directed acyclic graph:
library(survival) states <- c("X", "Z", "Y") connect <- matrix(0, 3, 3, dimnames = list(states, states)) connect[cbind(c(1, 2, 2), c(3, 1, 3))] <- 1 statefig(cbind(c(.2, .5, .8), c(.3, .6, .3)), connect)
For the specific forms of the conditional expectations required in the outcome we assume a linear regression model with both linear and quadratic forms of the continuous covariates. We can fit this as
m <- glm(wt82_71 ~ qsmk + sex + race + age + I(age^2) + as.factor(education) + smokeintensity + I(smokeintensity^2) + smokeyrs + I(smokeyrs^2) + as.factor(exercise) + as.factor(active) + wt71 + I(wt71^2), data = nhefs_dat) summary(m)
To perform regression standardization to estimate the causal effect we use standardize_glm
. We must specify the same outcome regression model as a formula, provide the data, describe which values of the exposure we wish to estimate the counterfactual means, specify which contrasts we want, and specify the reference level for the contrasts. The following command estimates that model and we obtain the group-wise estimates, the difference, and the ratio.
m2 <- standardize_glm(wt82_71 ~ qsmk + sex + race + age + I(age^2) + as.factor(education) + smokeintensity + I(smokeintensity^2) + smokeyrs + I(smokeyrs^2) + as.factor(exercise) + as.factor(active) + wt71 + I(wt71^2), data = nhefs_dat, values = list(qsmk = c(0,1)), contrasts = c("difference", "ratio"), reference = 0) m2 plot(m2) plot(m2, contrast = "difference", reference = 0)
The output from the model shows the estimated potential outcome means in each exposure level, the difference and ratio thereof, and the associated standard error estimates, confidence intervals, and p-values. Inference is done using the sandwich method for variance calculation. Under the assumptions that the outcome model is correctly specified and contains all confounders, these are consistent estimates of the causal effects of interest. We can also get plots of the effects by using the plot function.
To obtain the estimates and inference for the different contrasts as a tidy data frame, suitable for saving as a data table or using in downstream analyses or reports, we provide the tidy
function:
(tidy(m2) -> tidy_m2) |> print()
To obtain doubly robust inference we use the following command. Note that we now specify a model for the exposure, the propensity score model.
m2_dr <- standardize_glm_dr(formula_outcome = wt82_71 ~ qsmk + sex + race + age + I(age^2) + as.factor(education) + smokeintensity + I(smokeintensity^2) + smokeyrs + I(smokeyrs^2) + as.factor(exercise) + as.factor(active) + wt71 + I(wt71^2), formula_exposure = qsmk ~ sex + race + age + I(age^2) + as.factor(education) + smokeintensity + I(smokeintensity^2) + smokeyrs + I(smokeyrs^2) + as.factor(exercise) + as.factor(active) + wt71 + I(wt71^2), data = nhefs_dat, values = list(qsmk = c(0,1)), contrast = c("difference", "ratio"), reference = 0) m2_dr (tidy(m2_dr) -> tidy_m2_dr) |> print()
Based on these results, we can report that the estimated effect of smoking on average weight change as measured by the difference in potential outcome means is r with(subset(tidy_m2, contrast == "difference" & qsmk == 1), sprintf("%.2f (%.2f to %.2f)", Estimate, lower.0.95, upper.0.95))
and as measured by the ratio of potential outcome means is r with(subset(tidy_m2, contrast == "ratio" & qsmk == 1), sprintf("%.2f (%.2f to %.2f)", Estimate, lower.0.95, upper.0.95))
. Using the doubly-robust estimation method, we obtain similarly r with(subset(tidy_m2_dr, contrast == "difference" & qsmk == 1), sprintf("%.2f (%.2f to %.2f)", Estimate, lower.0.95, upper.0.95))
and as measured by the ratio of potential outcome means is r with(subset(tidy_m2_dr, contrast == "ratio" & qsmk == 1), sprintf("%.2f (%.2f to %.2f)", Estimate, lower.0.95, upper.0.95))
. In particular, if we believe that the necessary assumptions are valid, we can conclude that smoking causes an increase in weight which the numeric summaries indicate that smoking increases the weight change over 11 years by about 3.5 pounds (additive scale) or smoking increases the weight change over 11 years by a multiplicative factor of about 3.
The doubly-robust method requires specification of an exposure model in addition to the specification of the outcome model. Both of these estimated models can be accessed and inspected by looking at the fit_outcome
and fit_exposure
elements of the return object:
m2_dr$res$fit_outcome m2_dr$res$fit_exposure
This is useful when we want to inspect the distribution of the propensity scores by exposure status. This is an important thing to check in practice, we are looking for any practical violations of the positivity assumption.
hist(m2_dr$res$fit_exposure$fitted[nhefs_dat$qsmk == 0], xlim = c(0, 1), main = "qsmk = 0", xlab = "estimated propensity") hist(m2_dr$res$fit_exposure$fitted[nhefs_dat$qsmk == 1], xlim = c(0, 1), main = "qsmk = 1", xlab = "estimated propensity")
The function standardize_glm
and its doubly-robust counterpart standardize_glm_dr
support any outcome variable type the can be used with glm
. This includes binary, count, and more. Just like in glm
, we specify the model we would like to use for the outcome by specifying the family
argument of standardize_glm
or the family_outcome
argument of standardize_glm_dr
. By default this is "gaussian"
which corresponds to linear regression. For a binary outcome we might like to use "binomial"
which corresponds to logistic regression. Let us see how it works using the a binary outcome that we create by dichotomizing the weight change at 0.
nhefs_dat$gained_weight <- 1.0 * (nhefs_dat$wt82_71 > 0) m3 <- standardize_glm(gained_weight ~ qsmk + sex + race + age + I(age^2) + as.factor(education) + smokeintensity + I(smokeintensity^2) + smokeyrs + I(smokeyrs^2) + as.factor(exercise) + as.factor(active) + wt71 + I(wt71^2), data = nhefs_dat, family = "binomial", values = list(qsmk = c(0,1)), contrasts = c("difference", "ratio"), reference = 0) m3
Here we interpret the estimates in terms of probabilities, or the risk of gaining weight over 11 years. In particular, smoking causes a 13% additive increase in the risk of gaining weight, or a 1.2 fold multiplicative increase in the risk of gaining weight. With binary outcomes, we can use transformations to get different contrasts, and also to improve the performance of the ratio contrast. By using the log transformation combined with the difference contrast, our estimates become the log risk ratios. We can exponential them to get the risk ratios, but often it is better to construct confidence intervals of ratios on the log scale and back transform.
m3_log <- standardize_glm(gained_weight ~ qsmk + sex + race + age + I(age^2) + as.factor(education) + smokeintensity + I(smokeintensity^2) + smokeyrs + I(smokeyrs^2) + as.factor(exercise) + as.factor(active) + wt71 + I(wt71^2), data = nhefs_dat, family = "binomial", values = list(qsmk = c(0,1)), contrasts = c("difference"), transforms = c("log"), reference = 0) m3_log
The estimates reported in the table are log risk ratios, which we now back transform by exponentiating and compare to the ratios estimated previously.
tm3_log <- tidy(m3_log) tm3_log$rr <- exp(tm3_log$Estimate) tm3_log$rr.lower <- exp(tm3_log$lower.0.95) tm3_log$rr.upper <- exp(tm3_log$upper.0.95) subset(tm3_log, contrast == "difference" & qsmk == 1) m3
In this case, the estimates and inference using transformation are nearly identical to those without transformation.
Other available transformations include the odds, and logit (log odds). These can be used to estimate causal odds ratios, untransformed and transformed, respectively.
m3_odds <- standardize_glm(gained_weight ~ qsmk + sex + race + age + I(age^2) + as.factor(education) + smokeintensity + I(smokeintensity^2) + smokeyrs + I(smokeyrs^2) + as.factor(exercise) + as.factor(active) + wt71 + I(wt71^2), data = nhefs_dat, family = "binomial", values = list(qsmk = c(0,1)), contrasts = c("ratio"), transforms = c("odds"), reference = 0) m3_odds m3_logit <- standardize_glm(gained_weight ~ qsmk + sex + race + age + I(age^2) + as.factor(education) + smokeintensity + I(smokeintensity^2) + smokeyrs + I(smokeyrs^2) + as.factor(exercise) + as.factor(active) + wt71 + I(wt71^2), data = nhefs_dat, family = "binomial", values = list(qsmk = c(0,1)), contrasts = c("difference"), transforms = c("logit"), reference = 0) m3_logit m3_logOR <- tidy(m3_logit) |> subset(contrast == "difference" & qsmk == 1) sprintf("%.2f (%.2f to %.2f)", exp(m3_logOR$Estimate), exp(m3_logOR$lower.0.95), exp(m3_logOR$upper.0.95))
In this case the estimate is identical, but the confidence interval is slightly different because it is symmetric on the odds ratio scale in the first case, and symmetric on the log odds ratio scale in the transformed case.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.