STAP Introduction

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = F
)

Motivation

The purpose of this document is to show how to use the rstap package, including its modeling assumptions and the kind of data it uses. We'll begin by loading the appropriate libraries: dplyr and tidyr for data manipulation, ggplot2 for plotting, and rstap for model fitting.

library(dplyr)
library(ggplot2)
library(tidyr)
library(rstap)

A typical data structure will involve subjects and features at different locations and/or times. For the purposes of this example, we simulate 950 subjects and 55 features placed uniformly at random in a 1x1 and a 3x3 square, respectively.

set.seed(1214)
num_subj <- 9.5E2
num_bef <- 55
sub_df <- data_frame(x = runif(n = num_subj, min = 1, max = 2.0),
                     y = runif(n = num_subj, min = 1, max = 2.0),
                     class = "Subject")
bef_df <- data_frame(x = runif(n = num_bef, min = 0, max = 3.0),
                     y = runif(n = num_bef, min = 0, max = 3.0),
                     class = "BEF")
rbind(sub_df,bef_df) %>% ggplot(aes(x = x, y = y, color = class )) +
    geom_point() + theme_bw() + ggtitle("Independence Configuration")

rstap models data assuming the relationship between the mean and the covariates is the following:

$$ E[Y_i] = \alpha + Z_i \delta + X_i(\theta)\beta, \qquad X_i(\theta) = \sum_{d \in \mathcal{D}_i} \mathcal{K}_s\!\left(\frac{d}{\theta}\right) $$

Here $\mathcal{D}_i$ is the set of distances (or times) between subject $i$ and the features, and $\mathcal{K}_s$ is a real-valued function with $\mathcal{K}_s: [0,\infty) \to [0,1]$, taken to be monotonically decreasing or increasing depending on whether one is modeling a spatial or a temporal relationship between the features and subjects, respectively. The default weight function for spatial decay in the rstap package is the complementary error function, though others are available; for examples of their use in the rstap package, see the Mis-Specified Weight Function vignette.
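
As a quick, hedged illustration of the default spatial weight (evaluated with pracma::erfc, the same function used in the simulation code below), features close to a subject receive weights near 1 while distant features contribute almost nothing:

## sketch: the complementary-error-function weight decays monotonically from 1 toward 0
d_demo <- c(0, 0.25, 0.5, 1, 2)        # a few example distances
round(pracma::erfc(d_demo / 0.5), 3)   # weights under a spatial scale of 0.5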

For this example we use the following fixed parameters to simulate our dataset:

alpha <- 22.5   # intercept
Z <- rbinom(n = num_subj, prob = .45, size = 1)  # binary covariate (sex)
delta <- -.8    # covariate (sex) effect
theta <- .5     # true spatial scale
theta_2 <- 3    # alternate, larger spatial scale used in a later section
beta <- 1.2     # spatial exposure effect
sigma <- 2.3    # residual standard deviation

Supposing we use all the features simulated in the space, this results in the following dataset, with the following marginal distribution of the outcome, $y$.

dists <- fields::rdist(as.matrix(sub_df[,1:2]),
                       as.matrix(bef_df[,1:2]))
X <- apply(dists,1,function(x) sum(pracma::erfc(x/theta)))
y <- alpha + Z*delta + X*beta + rnorm(n = num_subj, mean = 0, sd = sigma)
data_frame(BMI = y) %>% ggplot(aes(x=BMI)) + geom_density() + theme_bw() +
    ggtitle("Marginal Distribution of Outcome")

The exposure decay function and a histogram of the resulting exposure are shown below.

d <- seq(from = 0, to = max(dists), by = 0.01)
X_theta_one <- pracma::erfc(d/theta)
par(mfrow = c(1,2))
plot(d,X_theta_one,type='l',main = "Exposure Decay Function", xlab = "Distance",
     ylab = "Exposure")
hist(X, main = "Exposure Dist.")

We then set up the two data frames needed for rstap to model these data. The first is a fairly typical data frame with the outcome, covariates, and subject ID; this is the same kind of dataset that could be used with a function like lm().

subject_data <- data_frame(subj_id = 1:num_subj,
                           y = y,
                           sex = factor(Z,labels=c("M","F")))
subject_data %>% head() %>% knitr::kable()

The distance data frame pairs each subject ID with each built environment feature in the chosen space, along with their associated distance. Note that the built environment features are all labeled "Fast_Food" for this example; this label is important syntactically for the stap_glm() function.

distance_data <- dists %>% as_data_frame() %>%
    mutate(subj_id = 1:num_subj) %>% 
    gather(contains("V"),key = 'BEF',value = 'Distance') %>% 
    mutate(BEF = 'Fast_Food')

distance_data %>% head() %>% knitr::kable()

These data are then modeled with rstap in the following manner. Note the placement of "Fast_Food" inside sap(), which designates the rows labeled Fast_Food in distance_data as the ones to model as a spatial aggregated predictor. Also note that we sample the posterior with 4 chains for 2000 iterations here; this is fairly conservative, and fewer iterations/chains could be used when first fitting a model to make sure the sampler is functioning appropriately.

fit <- stap_glm(formula = y ~ sex + sap(Fast_Food),
                subject_data = subject_data,
                distance_data = distance_data,
                family = gaussian(link = 'identity'),
                subject_ID = 'subj_id',
                prior = normal(location = 0, scale = 5,autoscale = F),
                prior_intercept = normal(location = 25, scale = 5, autoscale = F),
                prior_stap = normal(location = 0, scale = 3, autoscale = F),
                prior_theta = log_normal(location = 1, scale = 1), 
                prior_aux = cauchy(location = 0,scale = 5),
                max_distance = max(dists), 
                chains = 4, iter = 2E3, cores = 4) ## include all data

We'll first look at the quick summary contained in the model's print-out

fit

Further model details, including diagnostics concerning convergence, can be found by using the summary function:

summary(fit, waic=T)

Checking our estimates, we see that the model captures the true values very well.

fd_df <- cbind(posterior_interval(fit)[1:4,1],coef(fit),posterior_interval(fit)[1:4,2],
      c(alpha,delta,beta,theta)) %>% 
    as_data_frame() %>% transmute(lower= V1, 
                                  mid = V2, 
                                  upper = V3,
                                  Truth = V4,
                                  model = "Full Data") %>%
    mutate(parameter = c("alpha","sex","Fast_Food","Fast_Food_spatial_scale"),
           is_alpha = (parameter=="alpha") )
fd_df %>% 
    ggplot(aes(x=parameter,y=mid)) + geom_point() + 
    geom_linerange(aes(ymin=lower,ymax=upper)) + 
    geom_hline(aes(yintercept = Truth),linetype = 2) + coord_flip() + theme_bw() + 
    facet_wrap(~is_alpha,scales = "free") + 
    theme( strip.background = element_blank(),
           strip.text.x = element_blank()) + 
    xlab("") + ylab("Estimate")

One typical way to check for model goodness-of-fit is via posterior predictive checks. These are available here via the posterior_predict function and the ppc_dens_overlay function from the bayesplot package.

pps <- posterior_predict(fit,draws = 50,seed = 34234)
bayesplot::ppc_dens_overlay(y = subject_data$y,
                            yrep = pps)

We'll take this as sufficient evidence (for this vignette) that the model works. Further model evaluation can be found in a different vignette, to be added later. Let's now examine the cases in which we mis-specify the inclusion distance or include superfluous data.

Mis-Specified Inclusion Distance

While there are tools, such as the dlm package, that can give an investigator a sense of what an appropriate inclusion distance might be, it is still worth considering what may occur if the inclusion distance is mis-specified. The following section considers this for two cases of data exclusion: non-informative and informative.

Non-Informative data excluded

Note that the histogram of distances includes many over 1 unit, which from the above graph we know add only "negligible" information.

hist(dists)
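
As a rough check on that claim (a quick sketch, not part of the original analysis), we can look at how common such distances are and how little a single feature at distance 1 contributes under the true spatial scale:

mean(dists > 1)          # proportion of subject-feature pairs farther than 1 unit
pracma::erfc(1 / theta)  # weight contributed at distance 1 when theta = .5 (about 0.005)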

Keeping this in mind, let's see what happens to our estimates when we set the inclusion distance (max_distance) to 1.25.

fit_125 <- stap_glm(y ~ sex + sap(Fast_Food), subject_data = subject_data,
                distance_data = distance_data,
                family = gaussian(link = 'identity'),
                subject_ID = 'subj_id',
                prior = normal(location = 0,scale = 5,autoscale = F),
                prior_intercept = normal(location = 25, scale = 5, autoscale = F),
                prior_stap = normal(location = 0, scale = 3, autoscale = F),
                prior_theta = log_normal(location = 1, scale = 1), 
                prior_aux = cauchy(location = 0,scale = 5),
                max_distance = 1.25, chains = 4, iter = 2E3, cores = 4)

We have excluded about half of the total distance data (not exactly evenly across individuals)

sum(dists<=1.25)/sum(dists>=0) 

and yet the estimates are still very good

d125 <- cbind(posterior_interval(fit_125)[1:4,1],coef(fit_125),posterior_interval(fit_125)[1:4,2],
      c(alpha,delta,beta,theta)) %>% 
    as_data_frame() %>% transmute(lower= V1, 
                                  mid = V2, 
                                  upper = V3,
                                  Truth = V4,
                                  model = "Non-Informative Exclusion") %>%
    mutate(parameter = c("alpha","sex","Fast_Food","Fast_Food_spatial_scale"),
           is_alpha = (parameter=="alpha") )

d125 %>% rbind(.,fd_df) %>% 
    ggplot() +
    geom_pointrange(aes(x = parameter, ymin=lower,y = mid, ymax=upper, color = model),
                    position = position_dodge(width = .2)) +
    geom_hline(aes(yintercept = Truth),linetype = 2) + coord_flip() + theme_bw() + 
    facet_wrap(~is_alpha,scales = "free") + 
    theme( strip.background = element_blank(),
           strip.text.x = element_blank()) + 
    xlab("") + ylab("Estimate")

Informative data excluded

Now let's see what happens when we mis-specify the maximum distance and this results in a loss of "meaningful" Built Environment Features.

fit_25 <- stap_glm(y ~ sex + sap(Fast_Food), subject_data = subject_data,
                distance_data = distance_data,
                family = gaussian(link = 'identity'),
                subject_ID = 'subj_id',
                prior = normal(location = 0,scale = 5,autoscale = F),
                prior_intercept = normal(location = 25, scale = 5, autoscale = F),
                prior_stap = normal(location = 0, scale = 3, autoscale = F),
                prior_theta = log_normal(location = 1, scale = 1), 
                prior_aux = cauchy(location = 0,scale = 5),
                max_distance = .25, chains = 1, iter = 6E2) 

We can now see that our spatial scale and the corresponding effect are biased

d25 <- cbind(posterior_interval(fit_25)[1:4,1],coef(fit_25),posterior_interval(fit_25)[1:4,2],
      c(alpha,delta,beta,theta)) %>% 
    as_data_frame() %>% transmute(lower= V1, 
                                  mid = V2, 
                                  upper = V3,
                                  Truth = V4,
                                  model = "Informative_Exclusion") %>%
    mutate(parameter = c("alpha","sex","Fast_Food","Fast_Food_spatial_scale"),
           is_alpha = (parameter=="alpha") )

d25 %>% rbind(.,fd_df) %>% 
    ggplot() +
    geom_pointrange(aes(x = parameter, ymin=lower,y = mid, ymax=upper, color = model),
                    position = position_dodge(width = .2)) +
    geom_hline(aes(yintercept = Truth),linetype = 2) + coord_flip() + theme_bw() + 
    facet_wrap(~is_alpha,scales = "free") + 
    theme( strip.background = element_blank(),
           strip.text.x = element_blank()) + 
    xlab("") + ylab("Estimate")

The shift in the estimates, in addition to the increased noise, is to be expected, since we are now missing informative data for our spatial scale parameter.
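
To get a sense of how much information was lost (a rough sketch, not part of the original analysis), we can compute the fraction of each subject's true exposure that is captured by features within 0.25 units:

## fraction of each subject's true exposure (under theta = .5) captured within 0.25 units
X_trunc <- apply(dists, 1, function(x) sum(pracma::erfc(x[x <= 0.25] / theta)))
summary(X_trunc / X)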

Extreme Missing data

To show the consequences of using an extreme inclusion distance (extreme with respect to the distance at which the true exposure effectively decays to zero), we simulate the data under an alternate spatial scale. Note that the following exposure curve does not even reach zero at the maximum distance between a "Fast Food" store and a subject.

d <- seq(from = 0, to = max(dists), by = 0.01)
X_theta_one <- pracma::erfc(d/theta_2)
plot(d,X_theta_one,type='l',main = "Exposure Decay Function", xlab = "Distance",
     ylab = "Exposure",ylim = c(0,1))

Fitting the model on a reduced domain of distances results in a highly inflated estimate of the spatial scale, as the posterior will attempt to approximate the exposure curve shown below: an exposure relationship that is nearly uniform on the domain of distances "observed".

plot(d,X_theta_one,type='l',main = "Exposure Decay Function - reduced domain", xlab = "Distance",
     ylab = "Exposure",ylim = c(0,1), xlim = c(0,.5))

Simulating a dataset under this model and fitting an rstap model to it with the restricted domain...

dists <- fields::rdist(as.matrix(sub_df[,1:2]),
                       as.matrix(bef_df[,1:2]))
X <- apply(dists,1,function(x) sum(pracma::erfc(x/theta_2)))
## center and scale the exposure effect for ease of exposition/numerical stability
X_tilde <- (X-mean(X))/sd(X) 
## use new names here so that the original y and subject_data (simulated under
## theta = .5) are preserved for the superfluous-data section below
y_2 <- alpha + Z*delta + X_tilde*beta + rnorm(n = num_subj, mean = 0, sd = sigma)

subject_data_2 <- data_frame(subj_id = 1:num_subj,
                             y = y_2,
                             sex = factor(Z,labels=c("M","F")))

fit_unif <- stap_glm(y ~ sex + sap(Fast_Food), subject_data = subject_data_2,
                distance_data = distance_data,
                family = gaussian(link = 'identity'),
                subject_ID = 'subj_id',
                prior = normal(location = 0,scale = 5,autoscale = F),
                prior_intercept = normal(location = 25, scale = 5, autoscale = F),
                prior_stap = normal(location = 0, scale = 3, autoscale = F),
                prior_theta = log_normal(location = 1, scale = 1), 
                prior_aux = cauchy(location = 0,scale = 5),
                max_distance = .25, chains = 4, iter = 2E3,cores = 4) 

We can see a much wider estimate of the scale parameter, as well as the corresponding effect.

as.matrix(fit_unif) %>% as_data_frame() %>% 
    gather(everything(),key= "parameter",value = "sample") %>% 
    group_by(parameter) %>% 
    summarise(lower = quantile(sample,0.025),
              med  = median(sample),
              upper = quantile(sample, 0.975)) %>% 
    mutate(Truth = c(alpha,beta,theta_2,delta,sigma)) %>% 
    filter(parameter!="(Intercept)") %>% 
    ggplot() + geom_pointrange(aes(x=parameter,y=med,ymin=lower,ymax=upper)) +
    geom_point(aes(x=parameter,y=Truth),color='red') +
    coord_flip() + theme_bw()

Here is a plot of the estimate on the full domain of the data - we can see that the exposure function approximation becomes less accurate on the set of distances not included in the model.

scale_df <-  as.matrix(fit_unif) %>% as_data_frame() %>% 
    gather(everything(), key = "parameter", value = "sample") %>% 
    group_by(parameter) %>% 
    summarise(lower = quantile(sample,0.025),
              med = median(sample),
              upper = quantile(sample, 0.975)) %>% 
    filter(parameter=="Fast_Food_spatial_scale")

data_frame(lower = pracma::erfc(d/scale_df$lower),
           median = pracma::erfc(d/scale_df$med),
           upper = pracma::erfc(d/scale_df$upper),
           truth = X_theta_one,
           distance = d) %>% 
    ggplot(aes(x=distance,y=median)) + geom_line(linetype = 2) +
    geom_ribbon(aes(ymin=lower,ymax=upper),alpha=.3) +
    geom_line(aes(x=distance,y=truth),color='red') + 
    theme_bw() + ggtitle("Extreme Missing Information")

As can be seen below, the approximation is quite good on the reduced domain, but the reduced number of distances results in a wider uncertainty interval, and the constrained domain results in an inflated estimate.

data_frame(lower = pracma::erfc(d/scale_df$lower),
           median = pracma::erfc(d/scale_df$med),
           upper = pracma::erfc(d/scale_df$upper),
           truth = X_theta_one,
           distance = d) %>% 
    ggplot(aes(x=distance,y=median)) + geom_line(linetype = 2) +
    geom_ribbon(aes(ymin=lower,ymax=upper),alpha=.3) +
    geom_line(aes(x=distance,y=truth),color='red') + xlim(0,.5) + 
    theme_bw() + ggtitle("Extreme Missing Information")

Inclusion of superfluous data

If additional fast food restaurants were included beyond those that were truly "needed", what would happen to our model estimates? Is this equivalent to the non-informative data exclusion?

extra_befs <- data_frame(x = c(runif(min = -1, max = 0, n = 20),
                               runif(min = 0, max = 4, n = 20),
                               runif(min = 3, max = 4, n = 20),
                               runif(min = -1, max = 4, n = 20)),
                         y = c(runif(min = -1, max = 4, n = 20),
                               runif(min = -1, max = 0, n =20 ),
                               runif(min = -1, max = 4, n = 20),
                               runif(min = 3, max = 4, n = 20)),
                         class = 'extra_BEFs')
rbind(sub_df,bef_df,extra_befs) %>% ggplot(aes(x=x,y=y,colour = class)) + 
    geom_point() + theme_bw() + ggtitle("Independence Configuration - Extra BEFs")
dists <- fields::rdist(as.matrix(sub_df[,1:2]),
                       as.matrix(rbind(bef_df[,1:2],extra_befs[,1:2])))

distance_data <- dists %>% as_data_frame() %>%
    mutate(subj_id = 1:num_subj) %>% 
    tidyr::gather(contains("V"),key = 'BEF',value = 'Distance') %>% 
    mutate(BEF = 'Fast_Food')


fit <- stap_glm(formula = y ~ sex + sap(Fast_Food),
                subject_data = subject_data,
                distance_data = distance_data,
                family = gaussian(link = 'identity'),
                subject_ID = 'subj_id',
                prior = normal(location = 0, scale = 5,autoscale = F),
                prior_intercept = normal(location = 25, scale = 5, autoscale = F),
                prior_stap = normal(location = 0, scale = 3, autoscale = F),
                prior_theta = log_normal(location = 0, scale = 1), 
                prior_aux = cauchy(location = 0,scale = 5),
                max_distance = max(dists), chains = 4, cores = 4, iter = 2E3)  ## including unnecessary data
edf <- cbind(posterior_interval(fit)[1:4,1],coef(fit),posterior_interval(fit)[1:4,2],
      c(alpha,delta,beta,theta)) %>% 
    as_data_frame() %>% transmute(lower= V1, 
                                  mid = V2, 
                                  upper = V3,
                                  Truth = V4,
                                  model = "Extra data") %>%
    mutate(parameter = c("alpha","sex","Fast_Food","Fast_Food_spatial_scale"),
           is_alpha = (parameter=="alpha") )

edf %>% rbind(.,fd_df) %>% 
    ggplot() +
    geom_pointrange(aes(x = parameter, ymin=lower,y = mid, ymax=upper, color = model),
                    position = position_dodge(width = .2)) +
    geom_hline(aes(yintercept = Truth),linetype = 2) + coord_flip() + theme_bw() + 
    facet_wrap(~is_alpha,scales = "free") + 
    theme( strip.background = element_blank(),
           strip.text.x = element_blank()) + 
    xlab("") + ylab("Estimate")

Our model is still able to recover the estimates correctly, about as well as it did in the typical model setting! This is because the additional features only contribute an increasingly small amount of exposure information as a function of distance, given reasonable a priori choices for the spatial scale $\theta$.
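
To make this concrete (a rough sketch, not part of the original analysis), we can compare the exposure contributed by the extra features to the total exposure under the true spatial scale:

## exposure from the original 55 BEFs vs. the extra BEFs, under theta = .5
orig_w  <- apply(dists[, 1:num_bef], 1, function(x) sum(pracma::erfc(x / theta)))
extra_w <- apply(dists[, -(1:num_bef)], 1, function(x) sum(pracma::erfc(x / theta)))
summary(extra_w / (orig_w + extra_w))  # relative contribution of the extra BEFs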

Caveats

It is important to note the limitations of these simulations. To begin with, the exposure covariate $\tilde{X}$ is an unknown function of both the distances between the subjects and the businesses and the spatial scale. Thus, these simulations do not represent the arguably more realistic scenarios in which:

  1. The businesses and subjects' locations are correlated within and between features, resulting in a more highly skewed $\tilde{X}$ distribution and possible collinearity issues or

  2. The decay function is mis-specified or

  3. The prior does not place high probability on the true spatial scale in these settings.

These areas for potential problems will be the subject of future vignettes.


