knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = F )
Spatial-temporal aggregated predictor (STAP) models can be used in a longitudinal setting to model within and across subject differences by including subject specific means and subject differences from means. These difference in differences models can have a causal interpretation if certain assumptions are met. In a standard regression modeling framework, this would require a simple transformation of the time-varying confounders. For STAP models, since the variables being centered are random, they must be manipulated within each iteration of the sampler, requiring a different function and modeling syntax in rstap
(as of v.1.1.0), which will be demonstrated in this vignette.
We'll begin by loading the neccessary libraries
library(tidyverse) library(rstap)
The data generation is similar to standard longitudinal simulation so we will omit the code here. It can be found on the Github site in the vignettes folder.
Note that we'll generate the outcome under the difference in difference parameterization (see formula below) to make assessment of the parameters easy, and we'll only include a subject specific intercept and slope, which rstap
is currently limited to.
set.seed(24) num_subj <- 200 num_bef <- 15 Z <- sapply(1:num_subj, function(x) rep(rbinom(n = 1, size = 1, prob = .5),3)) Z <- tibble(time_1 = t(Z)[,1], time_2 = t(Z)[,2], time_3 = t(Z)[,3]) %>% mutate(subj_id = 1:n()) %>% gather(contains("time"), key="time",value="Sex") %>% mutate(time = as.integer(stringr::str_replace(time,"time_","") )) %>% rename(measurement=time) delta <- -.5 theta_s <- 0.25 theta_t <- 1.1 shape <- 4 alpha <- 22 beta <- 3.2 beta_bar <- 2.1 sigma <- 1 sigma_b <- 1 dists_1 <- matrix(rexp(n = num_subj*num_bef, rate = 1), nrow = num_subj, ncol = num_bef) dists_2 <- matrix(rexp(n = num_subj*num_bef, rate = 1), nrow = num_subj, ncol = num_bef) dists_3 <- matrix(rexp(n= num_subj*num_bef, rate = .5), nrow = num_subj, ncol = num_bef) dists <- rbind(dists_1, dists_2, dists_3) times <- dists*6 dists <- as_tibble(dists) times <- as_tibble(times) grd <- expand.grid(subj=1:num_subj,time = 1:3) dists <- dists %>% mutate(id= grd$subj, measurement = grd$time) %>% gather(contains("V"),key = "BEF_id", value="Distance") %>% mutate(BEF_id = stringr::str_replace(BEF_id,"V",""), BEF = "Coffee_Shop") %>% dplyr::filter(Distance<=5) times <- times %>% mutate(id=grd$subj, measurement=grd$time) %>% gather(contains("V"),key="BEF_id",value="Time") %>% mutate(BEF_id = stringr::str_replace(BEF_id,"V","")) %>% dplyr::filter(Time<=(30)) %>% mutate(BEF = "Coffee_Shop") dt_df <- dists %>% select(-BEF) %>% left_join(times %>% select(-BEF), by=c("id","measurement","BEF_id")) X <- dt_df %>% group_by(id,measurement) %>% summarise(Exposure = sum(pracma::erfc(Distance/theta_s) * pracma::erf(Time/theta_t) )) X_bar <- X %>% group_by(id) %>% summarise(MN_Exposure = mean(Exposure)) %>% mutate(subj_int = rnorm(n = num_subj, mean = 0, sd = sigma_b), subj_slope = rnorm(n = num_subj, mean = 0,sd = sigma_b)) X_diff <- X %>% left_join(X_bar,by='id') %>% mutate(X_diff = Exposure - MN_Exposure) %>% left_join(Z,by=c("id"="subj_id","measurement")) %>% arrange(id,measurement) epsilon_ij <- rnorm(n = num_subj * 3, mean = 0, sd = sigma) W <- cbind(rep(1,3*num_subj),rep(1:3,num_subj)) eta <- alpha + X_diff$Sex * delta + beta*X_diff$X_diff + beta_bar*X_diff$MN_Exposure + rowSums(W * X_diff[,c("subj_int","subj_slope")]) y <- eta + epsilon_ij subj_mat1 <- as.matrix(Matrix::fac2sparse(as.factor(X_diff$id))) subj_n <- rep(1/3,num_subj) subj_df <- X_diff %>% ungroup() %>% mutate(BMI=y)
Consider the following model examining the average of some continuous outcome as a function of coffee shop exposure and confounders, conditional on latent subject covariates, $b_i$
$$ E[Y_{ij}|b_i] = \alpha + \Delta X_{ij}(\theta_s)\beta_{i,Coffee} + \bar{X}{ij}(\theta_s)\bar{\beta}{i,Coffee} + + Z_{1}\delta_{i,sex} + b_i + b_i*w_{ij} $$ where $$ \Delta X(\theta_s) = \overbrace{\sum_{d\in\mathcal{D}} w_d(\frac{d}{\theta_s}) }^{X_{ij}(\theta_s)} - \underbrace{\sum_{j=1}^{n_i}\sum_{d\in\mathcal{D}} w_d(\frac{d}{\theta_s})}{\bar{X}{ij}(\theta_s)} $$
The $\bar{X}{ij}$ serves as an estimate of a given subject's average exposure to coffee shops across all measurements, so that $\Delta X{ij}(\theta_s)$ represents the difference in "usual" exposure implying that $\beta_{i,Coffee}$ represents the within subject difference of increased Coffee exposure on the outcome, while $\bar{\beta}$ represents the difference across or between subjects for one unit increase in $\bar{X}_{ij}(\theta_s)$, the average subject exposure.
We combine these two main effect estimates with standard, non time-varying confounders - only sex is used here - as well as random intercepts and slopes for each subject to account for correlation within subjects across measurements. Although one will also have to include other built environment features that may be often colocated with one another to truly remove all sources of confounding, requiring a lot of data, this model represents the best attempt one can make at estimating the causal effect of a built environment feature.
We'll fit the model with priors that reflect our relative uncertainty about the spatial and temporal scales and place a fairly uninformative prior on the subject-specific variance. We'll also use a Weibull survival/cdf function to model the spatial/temporal relationship. This is more flexible than we need, but should still allow us to capture the true information. For information on how priors are set on the subject specific covariance, see the rstanarm
vignette's write up here. The functional form will be the same as the model under which it was simulated since this vignette is for exposition.
fit <- stapdnd_lmer(BMI ~ Sex + stap_dnd_bar(Coffee_Shop) + (measurement|id), subject_data = subj_df, ## names of datasets distance_data = dists,## simulated above time_data = times, subject_ID = 'id', group_ID = "measurement", prior_intercept = normal(location = 25, scale = 4, autoscale = F), prior = normal(location = 0, scale = 3,autoscale=F), prior_stap = normal(location = 0, scale = 3), prior_theta = log_normal(location = .5,scale = .5), max_distance = max(dists$Distance), prior_covariance = decov(regularization = 1, concentration = 1, shape = 1, scale = 1), chains = 2, iter = 1000, cores = 2)
We can now check our model estimates and see how they compare to the true values.
fit
as_tibble(fit) %>% select(Sex,contains("Coffee_Shop"),contains("Sigma")) %>% gather(everything(),key="Parameter",value="Sample") %>% group_by(Parameter) %>% summarise_all(list(lower = ~quantile(.,0.05), median = ~median(.), upper = ~quantile(.,0.95))) %>% mutate(Truth = (Parameter=="(Intercept)")*alpha + (Parameter=="Coffee_Shop_dnd")*beta + (Parameter=="Coffee_Shop_spatial_scale")*theta_s + (Parameter=="Coffee_Shop_temporal_scale")*theta_t+ (Parameter=="Coffee_Shop_spatial_shape")*shape + (Parameter=="Sex")*delta+ (Parameter=="Coffee_Shop_bar")*beta_bar+ (Parameter=="sigma")*sigma + (Parameter=="Sigma[id:(Intercept),(Intercept)]")*sigma_b + (Parameter=="Sigma[id:measurement,measurement]")*sigma_b + (Parameter=="Sigma[id:measurement,(Intercept)]")*0, ranef = grepl("Sigma",Parameter)) %>% ggplot(aes(x=Parameter,y=median),alpha=.5) + geom_point(aes(x=Parameter,y=Truth),color='red',size=2.5)+ geom_pointrange(aes(ymin=lower,ymax=upper),size=.25) + coord_flip() + ggthemes::theme_hc() + facet_wrap(~ranef,scales="free_y") + ggtitle("Parameter Estimates") + theme(strip.background = element_blank(), strip.text = element_blank()) + xlab("Estimate")
as_tibble(fit) %>% select(contains("b[")) %>% gather(everything(),key="Parameter",value="Sample") %>% group_by(Parameter) %>% summarise_all(list(lower = ~quantile(.,0.025), med = ~ median(.), upper = ~ quantile(.,0.975))) %>% mutate(id = stringr::str_split(Parameter,"id:"), id = map_int(id,function(x) as.integer(stringr::str_replace(x[2],"]","") ) ) ) %>% arrange(id) %>% left_join(X_bar,by=c("id")) %>% rename(subject_intercept = subj_int, subject_slope = subj_slope) %>% gather(subject_intercept,subject_slope,key="Term",value="Truth") %>% ggplot(aes(x=med,y=Truth)) + geom_point() + facet_wrap(~Term) + ggthemes::theme_hc() + theme(strip.background = element_blank()) + geom_abline(aes(intercept = 0, slope = 1)) + labs(x= "Median",title = "Random Effect Estimates")
# draws <- sample(1:1000,size=20,replace=F) # as_tibble(fit) %>% select(contains("_spatial")) %>% # mutate(sim_id = 1:n()) %>% # dplyr::filter(sim_id %in% draws) %>% # mutate(Distance = list(seq(from = 0, to = 1.2, by = 0.01))) %>% # unnest() %>% # mutate(Exposure = pweibull(Distance,shape = Coffee_Shop_spatial_shape, # scale = Coffee_Shop_spatial_scale, # lower.tail = FALSE)) %>% # ggplot() + geom_line(aes(x=Distance,y=Exposure,group=sim_id),alpha=0.3) + # geom_line(aes(x=Distance,y=Truth), # data=tibble(Distance = seq(from = 0, to = 1.2, by =0.01), # Truth = pweibull(q = Distance, # shape = shape, # scale = theta_s,lower.tail = F)),color='red') + # ggthemes::theme_hc() + # labs(title = "Spatial Exposure Function - Draws and Truth", # subtitle = "Draws from Posterior in Gray, Truth in Red", # x = "Distance", y = "Exposure")
More information on the parameterization of this model from an academic and technical perspective can be found in:
while the underlying fundamentals of the difference in differences model and its causal interpretation can be found in more lay-accessible book, "Mastering 'Metrics", whose citation is below.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.