knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.height = 3, fig.width = 5
)
library(gasr)
library(dplyr)
library(ggplot2)
library(purrr)
library(tidyr)
library(stringr)

With this package, we employ the latent variable approach to generating goal attainment scaling (GAS) data introduced by @Urach2019. Here, we walk through some of the theory behind this model and show how it translates to the various gasr functions, but we highly recommend that users read the paper themselves to better understand the model.

Theory

First, some notation (which we alter slightly from that in @Urach2019). Let $n_i \sim D$, where $D$ denotes the distribution of the number of goals chosen by subject $i$. The observed goal score for subject $i$'s goal $j$ is denoted $x_{ij}$, and comes from the discretization of the unobserved continuous latent goal score $y_{ij}$. For goals $j = 1, \dots, n_i$:

$$ \tag{1} y_{ij} = b_0 + u_i + g_i b_{ij} + \epsilon_{ij} $$

where

- $b_0$ is a fixed intercept (the mean control group response),
- $u_i$ is a subject-level random effect with variance $\sigma_u^2$,
- $g_i \in \{0, 1\}$ indicates whether subject $i$ is in the control ($g_i = 0$) or treatment ($g_i = 1$) group,
- $b_{ij}$ is the treatment effect on goal $j$, with variance $\sigma_B^2$, and
- $\epsilon_{ij}$ is a goal-level error term with variance $\sigma_{\epsilon}^2$.

The within-subject correlation of two goal scores $y_{ij}$ and $y_{ij'}$ ($j \neq j'$) is given by:

$$ \tag{2} \rho_g = \frac{\sigma_u^2}{\sigma_u^2 + g \sigma_B^2 + \sigma_{\epsilon}^2} $$

which implies that, if $\sigma_B > 0$, the within-subject correlation in the treatment group ($g = 1$) is smaller than that in the control group ($g = 0$).
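As a quick numeric illustration of equation (2) (all variance components here are hypothetical, with $\sigma_B^2 = 1/3$ corresponding to a $\text{Uniform}(0, 2)$ treatment effect):

rho_g <- function(g, var_u = 0.3, var_B = 1 / 3, var_e = 0.7) {
  var_u / (var_u + g * var_B + var_e)
}
rho_g(g = 0) # control: 0.3
rho_g(g = 1) # treatment: 0.3 / (0.3 + 1/3 + 0.7) = 0.225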

Proof

The correlation between two random variables $X$ and $Y$ is:

$$ \rho = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X) \text{Var}(Y)}} $$

So if we consider the correlation between subject $i$'s two goals $j$ and $j'$:

$$ \tag{3} \rho_g = \frac{\text{Cov}(y_{ij}, y_{ij'})}{\sqrt{\text{Var}(y_{ij}) \text{Var}(y_{ij'})}} $$

Consider the variance of the latent goal score $\text{Var}(y_{ij})$.

$$ \text{Var}(y_{ij}) = \text{Var}(b_0 + u_i + g_i b_{ij} + \epsilon_{ij}) $$

To decompose the variance, we use the following theorem from @Casella2002:

Theorem 4.5.6 (p. 171) If $X$ and $Y$ are any two random variables and $a$ and $b$ are any two constants, then $$ \text{Var}(aX + bY) = a^2 \text{Var} X + b^2 \text{Var} Y + 2ab \text{Cov}(X, Y) $$

By construction, the four terms $b_0$, $u_i$, $b_{ij}$, and $\epsilon_{ij}$ are mutually independent, and so $\text{Cov}(X, Y) = 0$ for each pair (Theorem 4.5.5, p. 171):

$$ \begin{align} \text{Var}(y_{ij}) &= \text{Var}(b_0) + \text{Var}(u_i) + g_i^2\text{Var}(b_{ij}) + \text{Var}(\epsilon_{ij}) \\ &= 0 + \sigma_u^2 + g_i^2 \sigma_B^2 + \sigma_\epsilon^2 \\ &= \sigma_u^2 + g \sigma_B^2 + \sigma_\epsilon^2 \end{align} $$

(where $g_i^2 = g_i = g$, since $g_i \in \{0, 1\}$)

And the same is true for $\text{Var}(y_{ij'})$, of course. So the denominator of equation (3) is:

$$ \tag{4} \sqrt{\text{Var}(y_{ij}) \text{Var}(y_{ij'})} = \sqrt{[\sigma_u^2 + g \sigma_B^2 + \sigma_{\epsilon}^2]^2} = \sigma_u^2 + g \sigma_B^2 + \sigma_\epsilon^2 $$

Now for the numerator of (3), the covariance between goals $j$ and $j'$. From @Casella2002:

Definition 4.5.1 (p. 169) The covariance of $X$ and $Y$ is the number defined by $$ \text{Cov}(X, Y) = E((X - \mu_X)(Y - \mu_Y)) $$

Applying this to $y_{ij}$ and $y_{ij'}$:

$$ \tag{5} \text{Cov}(y_{ij}, y_{ij'}) = E\left((y_{ij} - E(y_{ij}))(y_{ij'} - E(y_{ij'}))\right) $$

Consider just the first term of the product:

$$ \begin{align} y_{ij} - E(y_{ij}) &= (b_0 + u_i + g_i b_{ij} + \epsilon_{ij}) - E(b_0 + u_i + g_i b_{ij} + \epsilon_{ij}) \\ &= (b_0 - b_0) + (u_i - E(u_i)) + g_i(b_{ij} - E(b_{ij})) + (\epsilon_{ij} - E(\epsilon_{ij})) \end{align} $$

And likewise for the term $y_{ij'} - E(y_{ij'})$. Then the product in equation (5) is:

$$ \begin{align} [y_{ij} - E(y_{ij})] [y_{ij'} - E(y_{ij'})] &= [(u_i - E(u_i)) + g_i(b_{ij} - E(b_{ij})) + (\epsilon_{ij} - E(\epsilon_{ij}))] [(u_i - E(u_i)) + g_i(b_{ij'} - E(b_{ij'})) + (\epsilon_{ij'} - E(\epsilon_{ij'}))] \\ &= (u_i - E(u_i))^2 + g_i^2(b_{ij} - E(b_{ij}))(b_{ij'} - E(b_{ij'})) + (\epsilon_{ij} - E(\epsilon_{ij}))(\epsilon_{ij'} - E(\epsilon_{ij'})) \\ &\quad + g_i(u_i - E(u_i))(b_{ij'} - E(b_{ij'})) + (u_i - E(u_i))(\epsilon_{ij'} - E(\epsilon_{ij'})) \\ &\quad + g_i(b_{ij} - E(b_{ij}))(u_i - E(u_i)) + g_i(b_{ij} - E(b_{ij}))(\epsilon_{ij'} - E(\epsilon_{ij'})) \\ &\quad + (\epsilon_{ij} - E(\epsilon_{ij}))(u_i - E(u_i)) + g_i (\epsilon_{ij} - E(\epsilon_{ij}))(b_{ij'} - E(b_{ij'})) \end{align} $$

Taking the expected values of the above gives us a sum of variance and covariance terms (by Definition 4.5.1 above):

$$ \begin{align} E\left([y_{ij} - E(y_{ij})] [y_{ij'} - E(y_{ij'})]\right) &= \text{Var}(u_i) + g_i^2 \text{Cov}(b_{ij}, b_{ij'}) + \text{Cov}(\epsilon_{ij}, \epsilon_{ij'}) \\ &\quad + g_i \text{Cov}(u_i, b_{ij'}) + \text{Cov}(u_i, \epsilon_{ij'}) \\ &\quad + g_i \text{Cov}(u_i, b_{ij}) + g_i\text{Cov}(b_{ij}, \epsilon_{ij'}) \\ &\quad + \text{Cov}(u_i, \epsilon_{ij}) + g_i \text{Cov}(b_{ij'}, \epsilon_{ij}) \end{align} $$

As already established, we have independence between every pair of $b_0$, $u_i$, $b_{ij}$, $b_{ij'}$, $\epsilon_{ij}$, and $\epsilon_{ij'}$, and so all of the covariance terms are zero:

$$ \tag{6} E\left([y_{ij} - E(y_{ij})] [y_{ij'} - E(y_{ij'})]\right) = \text{Var}(u_i) = \sigma_u^2 $$

Combining the denominator from equation (4) and numerator from equation (6) into equation (3):

$$ \rho_g = \frac{\sigma_u^2}{\sigma_u^2 + g \sigma_B^2 + \sigma_{\epsilon}^2} $$

gives us the expected expression of equation (2).
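We can also check equation (2) empirically by simulating the latent model directly in base R (a sketch using the same hypothetical variance components as above):

# Pairs of latent goal scores share u_i, but have independent b_ij and
#  epsilon_ij (sigma_u^2 = 0.3, sigma_e^2 = 0.7, b_ij ~ Uniform(0, 2))
set.seed(1)
n <- 1e5
u <- rnorm(n, mean = 0, sd = sqrt(0.3))
sim_y <- function(g) u + g * runif(n, 0, 2) + rnorm(n, mean = 0, sd = sqrt(0.7))
cor(sim_y(0), sim_y(0)) # control: ~0.3
cor(sim_y(1), sim_y(1)) # treatment: ~0.225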

The discrete goal attainment levels $x_{ij}$ are defined by discretizing $y_{ij}$ via a set of thresholds:

$$ \tag{7} -\infty = c_0 < c_1 < \dots < c_{2l + 1} = \infty $$

such that $x_{ij} = t - l - 1$ if $c_{t-1} < y_{ij} \leq c_t$ for $t = 0, \dots, 2l+1$. This means that $x_{ij}$ will range from $-l$ to $l$.

Traditionally, a 5-point attainment scale is employed in GAS. This corresponds to $l = 2$, and $t = 1, \dots, 5$:

$$ \tag{8} \begin{align} t &= (0, \dots, 2l + 1): c_{t-1} < y_{ij} \leq c_t \rightarrow x_{ij} = t - l - 1 \\ t &= 0: \text{undefined } c_{-1} \\ t &= 1: c_0 < y_{ij} \leq c_1 \rightarrow x_{ij} = 1 - 2 - 1 = -2 \\ t &= 2: c_1 < y_{ij} \leq c_2 \rightarrow x_{ij} = 2 - 2 - 1 = -1 \\ t &= 3: c_2 < y_{ij} \leq c_3 \rightarrow x_{ij} = 3 - 2 - 1 = 0 \\ t &= 4: c_3 < y_{ij} \leq c_4 \rightarrow x_{ij} = 4 - 2 - 1 = +1 \\ t &= 5: c_4 < y_{ij} \leq c_5 \rightarrow x_{ij} = 5 - 2 - 1 = +2 \end{align} $$

(I'm not sure why $t = 0$ is considered, given that $c_{t-1}$ is undefined).
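A minimal base-R sketch of this discretization rule for $t = 1, \dots, 2l+1$ (gasr's discretize_from_thresholds(), used later in this vignette, is the packaged equivalent):

# Map latent scores to discrete levels: x = t - l - 1 where c_{t-1} < y <= c_t.
#  findInterval() with left.open = TRUE returns the index t of the left-open
#  interval (c_{t-1}, c_t] containing each y
discretize_sketch <- function(y, thresholds) {
  t <- findInterval(y, thresholds, left.open = TRUE)
  l <- (length(thresholds) - 2) / 2 # thresholds are c_0, ..., c_{2l+1}
  t - l - 1
}
c_t <- qnorm(seq(0, 1, 0.2)) # c_0 = -Inf, ..., c_5 = Inf (l = 2)
discretize_sketch(c(-1.5, -0.3, 0, 0.5, 2), c_t)
#> [1] -2 -1  0  1  2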

Simulation scenarios

In @Urach2019, the following simulation reference scenario is defined to compare hypothesis testing power:

- no control group response, $b_0 = 0$,
- $u_i \sim \text{Normal}(0, \sigma_u^2)$ with $\sigma_u^2 = \rho_0$,
- $\epsilon_{ij} \sim \text{Normal}(0, \sigma_{\epsilon}^2)$ with $\sigma_{\epsilon}^2 = 1 - \rho_0$, so that $\sigma_u^2 + \sigma_{\epsilon}^2 = 1$,
- treatment effects $b_{ij} \sim \text{Uniform}(0, 2\delta)$, so that $E(b_{ij}) = \delta$ and $\sigma_B^2 = \delta^2/3$,
- thresholds $c_t = \Phi^{-1}(0.2t)$ for $t = 0, \dots, 5$ (a 5-point scale, $l = 2$).

Thresholds

The cumulative standard normal probability distribution $\Phi$ can be computed with stats::pnorm():

d <- tibble(x = seq(-3, 3, 0.1)) %>%
  mutate(
    p = stats::pnorm(x, mean = 0, sd = 1),
  )

d %>%
  ggplot(aes(x, p)) +
  geom_line() +
  labs(title = "Cumulative standard normal distribution (mean 0, SD 1)")

The inverse cumulative probability function $\Phi^{-1}$ is also called the quantile function. It returns the value of a random variable corresponding to a given percentile. For instance, the quantile function for a $\text{Normal}(0, 1)$ distribution:

qnorm(p = c(0.025, 0.5, 0.975), mean = 0, sd = 1)

The 2.5th and 97.5th percentiles return the familiar $\pm 1.96$ values from a typical 95% confidence interval. Plotting the quantile function over the full range of probabilities for the standard normal distribution:

d <- d %>%
  mutate(
    q = stats::qnorm(p, mean = 0, sd = 1),
  )

d %>%
  ggplot(aes(p, q)) +
  geom_line() +
  scale_x_continuous(breaks = seq(0, 1.0, 0.2)) +
  geom_vline(xintercept = seq(0, 1, 0.2), lty = 2) +
  labs(title = "Inverse cumulative standard normal distribution (mean 0, SD 1)")

In the simulation scenario, six thresholds are defined: $c_t = \Phi^{-1} (0.2t)$, $t = 0, \dots, 5$. These are marked in the above plot with the dashed lines. The values of $c_t$ for those values of $t$:

thresh <- tibble(
  t = 0:5,
  percentile = 0.2 * t,
  c_t = stats::qnorm(percentile, mean = 0, sd = 1) %>% round(2)
)
knitr::kable(thresh)

Six thresholds correspond to five discretized attainment levels; $l = 2$ in equation (7). To show how this works, consider these randomly sampled values from a standard normal distribution:

y <- rnorm(100, mean = 0, sd = 1)
ggplot(data = tibble(y), aes(y)) +
  geom_density() +
  geom_rug() +
  geom_vline(xintercept = thresh$c_t, lty = 2) +
  scale_x_continuous(limits = c(-5, 5))

The minimum and maximum thresholds at $\pm \infty$ are drawn at the edges of the plot so that they are visible.

We then bucket those scores according to equation (8), for which gasr has a convenience function:

x <- gasr::discretize_from_thresholds(y, thresholds = thresh$c_t)

ggplot(data = tibble(x), aes(x)) +
  geom_bar()

Note that this results in a uniform distribution of goal scores, which isn't the traditional distribution from GAS theory. See the simulation-calibration vignette for more discussion about this topic.
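Since the thresholds were placed at the quintiles of the same standard normal distribution that generated y, each of the five levels should capture roughly 20% of the scores:

prop.table(table(x))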

Mean, variance, and covariance of the discretized goal scores

Figure 2(a) of @Urach2019 shows the relationship of the expected value of the discretized goal attainment scores in the treatment group, $\mu_1 = E(x_{ij} | g_i = 1)$, and the standard deviation $\sigma$, versus the treatment effect size $\delta$ for the reference scenario. The expected value for the continuous scores is:

$$ \begin{align} E(y_{ij}|g_i = 1) &= E(b_0) + E(u_i) + E(b_{ij}) + E(\epsilon_{ij}) \\ &= 0 + 0 + \delta + 0 \\ &= \delta \end{align} $$

and the variance of $y_{ij}$ can be decomposed into:

$$ \begin{align} \text{Var}(y_{ij}) = \sigma^2 &= \text{Var}(b_0) + \text{Var}(u_i) + \text{Var}(b_{ij}) + \text{Var}(\epsilon_{ij}) \\ &= 0 + \sigma_u^2 + \frac{\delta^2}{3} + \sigma_{\epsilon}^2 \\ &= \rho_0 + \frac{\delta^2}{3} + (1 - \rho_0) \\ &= \frac{\delta^2}{3} + 1 \end{align} $$

where $\text{Var}(b_{ij}) = (2\delta)^2/12 = \delta^2/3$ for the $\text{Uniform}(0, 2\delta)$ treatment effect.

# We can show this empirically with 100 simulations
set.seed(4)
d <- tibble(
  sim = 1:100,
  data = map(
    sim,
    ~sim_trial(n_subjects = 40, delta = 1, mean_control_response = 0,
               # sigma_u^2 = 0.7 and sigma_e^2 = 0.3, i.e., rho_0 = 0.7
               sigma_u = sqrt(0.7), sigma_e = sqrt(0.3))
  )
)
d %>%
  unnest(data) %>%
  group_by(group, sim) %>%
  summarise(
    mu = mean(score_continuous), sd = sd(score_continuous),
    .groups = "drop_last"
  ) %>%
  group_by(group) %>%
  summarise(
    n_sim = n(),
    # Compute the SEs from the between-simulation SDs before overwriting
    #  mu and sd with their means
    mu_se = sd(mu) / sqrt(n_sim), mu = mean(mu),
    sd_se = sd(sd) / sqrt(n_sim), sd = mean(sd),
    .groups = "drop"
  )
# mu = 0.0 and 1.0 for control and treatment, as expected
# control: sd = 0.98 -> var = 0.96 ~ 1
# treatment: sd = 1.14 -> var = 1.30 (~4/3)

The expected values of the discrete scores are more complicated. Derivations are provided by @Urach2019 in Appendix 1, which we will attempt to work through here.

Using the cumulative normal distribution $\Phi$ defining the thresholds, they show that the marginal probability distribution of $x_{ij}$ is given by:

$$ \tag{9} P(x_{ij} = t | g_i = g, b_{ij} = b) = \Phi\left(\frac{c_t - gb - b_0}{\sqrt{\sigma_u^2 + \sigma_{\epsilon}^2}}\right) - \Phi \left(\frac{c_{t-1} - gb - b_0}{\sqrt{\sigma_u^2 + \sigma_{\epsilon}^2}} \right) $$
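To make equation (9) concrete, here is a direct sketch using the reference-scenario values ($b_0 = 0$, $\sigma_u^2 = 0.3$, $\sigma_{\epsilon}^2 = 0.7$); the function p_x() and its level indexing are our own:

# R index i of the c_t vector holds threshold c_{i-1}, so c_t[t + 1] is c_t
c_t <- stats::qnorm(seq(0, 1, 0.2))
p_x <- function(t, g, b, b0 = 0, var_u = 0.3, var_e = 0.7) {
  s <- sqrt(var_u + var_e)
  pnorm((c_t[t + 1] - g * b - b0) / s) - pnorm((c_t[t] - g * b - b0) / s)
}
# The probabilities over the five levels (t = 1, ..., 5) sum to 1 for any b
sum(p_x(1:5, g = 1, b = 0.5))
#> [1] 1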

The moments of a distribution are defined in @Casella2002 as:

Definition 2.3.1 (p. 59) For each integer $n$, the $n$th moment of $X$, $\mu_n'$ is $$ \mu_n' = E(X^n). $$ The $n$th central moment of $X$, $\mu_n$ is $$ \mu_n = E(X - \mu)^n, $$ where $\mu = \mu_1' = EX$.

For a random variable $g(X)$ with probability distribution function $f_X(x)$, the integral definition of the expected value is:

Definition 2.2.1 (p. 55) The expected value or mean of a random variable $g(X)$ is $$ E(g(X)) = \begin{cases} \int_{-\infty}^{\infty} g(x) f_X(x) \, dx & \text{if } X \text{ is continuous} \\ \sum_{x \in \mathcal{X}} g(x) f_X(x) & \text{if } X \text{ is discrete} \end{cases} $$

Let $f_b$ represent the density of $b_{ij}$. For the uniform treatment effect distribution in the reference scenario, the density function is $\frac{1}{2\delta - 0} = \frac{1}{2\delta}$ for $b \in [0, 2\delta]$.
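For example, with a hypothetical $\delta = 0.3$, dunif() confirms the constant density $1/(2\delta)$ on $[0, 2\delta]$:

delta <- 0.3
dunif(c(0.1, 0.5), min = 0, max = 2 * delta) # both values inside [0, 0.6]
#> [1] 1.666667 1.666667
1 / (2 * delta)
#> [1] 1.666667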

The non-central moments of $x_{ij}$ for $a >0$ are:

$$ \tag{10} \begin{align} E(x_{ij}^a | g_i = g) &= \int_B E(x_{ij}^a|g_i = g, b_{ij} = b) f_b (b) \, db \\ &= \sum_{t \in \mathcal{T}} t^a \int_B P(x_{ij} = t | g_i = g, b_{ij} = b) f_b(b) \, db \end{align} $$

The central moments are then given by:

$$ \begin{align} \mu_g &= E(x_{ij}|g_i = g) \\ \sigma^2 = \text{Var}(x_{ij}|g_i = g) &= E(x_{ij}^2|g_i = g) - E(x_{ij}|g_i = g)^2 \end{align} $$

(unless I'm misunderstanding, $\mu_g$ is not a central moment; it is the first moment as per the @Casella2002 definition above).

Plugging equation (9) into (10) for $a = 1$, and using $f_b = 1/(2\delta)$ for the uniform treatment effect distribution:

$$ \tag{11} \begin{align} E(x_{ij} | g_i = g) &= \sum_{t \in \mathcal{T}} t \int_B P(x_{ij} = t | g_i = g, b_{ij} = b) f_b(b) \, db \\ &= \sum_{t \in \mathcal{T}} t \int_B \left[\Phi\left(\frac{c_t - gb - b_0}{\sqrt{\sigma_u^2 + \sigma_{\epsilon}^2}}\right) - \Phi \left(\frac{c_{t-1} - gb - b_0}{\sqrt{\sigma_u^2 + \sigma_{\epsilon}^2}} \right) \right] \frac{1}{2\delta} \, db \end{align} $$

We can simplify the cumulative distribution functions using the other assumptions from the reference scenario:

$$ \Phi \left(\frac{c_t - gb - b_0}{\sqrt{\sigma_u^2 + \sigma_{\epsilon}^2}}\right) = \Phi \left(\frac{c_t - gb - 0}{\sqrt{\rho_0 + 1 - \rho_0}} \right) = \Phi (c_t - gb) $$

Then applying the formula for the standard normal CDF from equation (3.3.13) on p. 102 of @Casella2002:

$$ \Phi(c_t - gb) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{c_t - gb} e^{-u^2/2} \, du $$
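As a quick numerical check (with a hypothetical value $z = c_t - gb = 0.84$), integrating the standard normal density recovers pnorm():

z <- 0.84
stats::integrate(function(u) exp(-u^2 / 2) / sqrt(2 * pi),
                 lower = -Inf, upper = z)$value
#> [1] 0.7995458
pnorm(z)
#> [1] 0.7995458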

Note that this integration cannot be performed directly (see p. 103 of @Casella2002 for details), so values like the check above must be computed numerically. Below, $\Phi(c_t - gb)$ is plotted for different values of $t$ and $g$:

c_t <- stats::qnorm(seq(0, 1, 0.2), mean = 0, sd = 1)
d <- tibble(
  t = 1:5,
  t_label = str_c("t = ", t),
  c_t = c_t[2:6]
) %>%
  crossing(b = seq(-3, 3, 0.01), g = c(0, 1)) %>%
  mutate(phi = pnorm(c_t - g * b, mean = 0, sd = 1))
d %>%
  ggplot(aes(x = b, y = phi)) +
  geom_line(aes(color = factor(g))) +
  facet_wrap(~t_label) +
  theme(legend.position = c(0.9, 0.2))

And then we take the difference in adjacent distributions to get the probability of different values of $x_{ij}$, which are shown below for the treatment group ($g = 1$):

c_t <- stats::qnorm(seq(0, 1, 0.2), mean = 0, sd = 1)
calc_phi_diff <- function(b, t) {
  pnorm(c_t[t] - b, mean = 0, sd = 1) -
    pnorm(c_t[t - 1] - b, mean = 0, sd = 1)
}

tibble(t = 2:6) %>%
  crossing(b = seq(-3, 3, 0.01)) %>%
  mutate(
    prob = calc_phi_diff(b, t),
    t_label = str_c("phi(t = ", t, ") - phi(t = ", t - 1, ")")
  ) %>%
  ggplot(aes(x = b, y = prob)) +
  geom_line(aes(color = t_label)) +
  scale_color_viridis_d()

We can then use integrate() on equation (11) for different values of $t$ and $\delta$ to get the expected values:

integrate_phi <- function(t, delta, subdivisions = 1000) {
  stats::integrate(
    f = function(b) calc_phi_diff(b, t),
    lower = 0, upper = 2 * delta, subdivisions = subdivisions
  )$value
}
phi_integral <- tibble(t = as.integer(2:6)) %>%
  crossing(delta = seq(0.01, 5, 0.01)) %>%
  rowwise() %>%
  mutate(
    integral = integrate_phi(t, delta)
  ) %>%
  ungroup()
# Compute moments of the discrete goal scales by summing over t
x_moments <- phi_integral %>%
  group_by(delta) %>%
  summarise(
    integral_sum1 = sum((t - 4) * integral),
    integral_sum2 = sum((t - 4)^2 * integral),
    .groups = "drop"
  ) %>%
  mutate(
    mu = integral_sum1 / (2 * delta),
    var = (integral_sum2 / (2 * delta)) - mu^2
  )
x_moments %>%
  transmute(delta, mu, sigma = sqrt(var)) %>%
  pivot_longer(c(mu, sigma), names_to = "moment", values_to = "value") %>%
  ggplot(aes(x = delta, y = value, color = moment)) +
  geom_line() +
  scale_y_continuous(limits = c(0, 2))

The shape of the curves looks close to Figure 2(a), but I haven't quite gotten the scale right.

# Test this empirically
set.seed(74)

rho_0 <- 0
var_u <- rho_0
var_e <- 1 - rho_0

sim_data <-
  crossing(
    delta = seq(0, 5, 0.5),
    sim = 1:10
  ) %>%
  mutate(
    data = map(
      delta,
      ~sim_trial(
        n_subjects = 100, sigma_u = sqrt(var_u),
        group_allocation = list(treatment = 1.0),
        random_allocation = FALSE, score_dist = "unif",
        n_goals = 1, sigma_e = sqrt(var_e),
        delta = .x
      )
    )
  ) %>%
  mutate(
    mu = map_dbl(
      data,
      ~mean(.$score_discrete)
    ),
    sigma = map_dbl(
      data,
      ~sd(.$score_discrete)
    )
  ) %>%
  pivot_longer(c(mu, sigma), names_to = "moment", values_to = "value")
x_moments %>%
  transmute(delta, mu, sigma = sqrt(var)) %>%
  pivot_longer(c(mu, sigma), names_to = "moment", values_to = "value") %>%
  ggplot(aes(x = delta, y = value, color = moment)) +
  geom_line() +
  scale_y_continuous(limits = c(0, 2.5)) +
  geom_point(data = sim_data)

Likewise, the joint probability of $x_{ij}$ and $x_{ij'}$ for $j \neq j'$:

$$ \tag{12} P(x_{ij} = t, x_{ij'} = t'|g_i = g, b_{ij} = b, b_{ij'} = b') = P(z_{ij} \in (c_{t-1} - gb, c_t - gb], \; z_{ij'} \in (c_{t'-1} - gb', c_{t'} - gb']) $$

where $z_{ij} = u_i + \epsilon_{ij}$, such that $(z_{ij})_{j=1}^{n_i}$ is multivariate normal with mean 0, $\text{Var}(z_{ij}) = \sigma_u^2 + \sigma_{\epsilon}^2$, and $\text{Cov}(z_{ij}, z_{ij'}) = \sigma_u^2$.
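As a quick sanity check of this distribution, we can simulate $z_{ij}$ directly (a sketch with $\sigma_u^2 = 0.3$ and $\sigma_{\epsilon}^2 = 0.7$):

# z_ij = u_i + epsilon_ij: the shared u_i induces the covariance sigma_u^2
set.seed(2)
u <- rnorm(1e5, mean = 0, sd = sqrt(0.3))
z1 <- u + rnorm(1e5, mean = 0, sd = sqrt(0.7))
z2 <- u + rnorm(1e5, mean = 0, sd = sqrt(0.7))
c(var = var(z1), cov = cov(z1, z2))
#> var ~ 1, cov ~ 0.3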

Computing these probabilities involves computing a bivariate cumulative normal distribution:

rho_0 <- 0.3
var_u <- rho_0
var_e <- 1 - rho_0

library(mvtnorm)
calc_bivariate_integrand <- function(b, t1, t2) {
  pmvnorm(
    lower = c(c_t[t1 - 1] - b,
              c_t[t2 - 1] - b),
    upper = c(c_t[t1] - b,
              c_t[t2] - b),
    mean = c(0, 0),
    # Since var_u + var_e = 1 here, this covariance matrix is also a
    #  valid correlation matrix
    corr = matrix(c(var_u + var_e, var_u,
                    var_u, var_u + var_e),
                  nrow = 2, ncol = 2),
    keepAttr = FALSE
  )
}

prob_x1_x2 <- crossing(
  t1 = 2:6, t2 = 2:6,
  b = seq(-3, 3, 0.1)
) %>%
  mutate(
    prob = pmap_dbl(
      list(b, t1, t2),
      calc_bivariate_integrand
    )
  ) %>%
  mutate(x1 = forcats::fct_inorder(str_c("x1 = ", t1 - 4)),
         x2 = forcats::fct_inorder(str_c("x2 = ", t2 - 4)))

prob_x1_x2 %>%
  ggplot(aes(x = b, y = prob)) +
  geom_line() +
  facet_grid(x1 ~ x2) +
  scale_y_continuous(breaks = c(0, 0.5, 1))

This shows that $x_{ij}$ (x1 above) and $x_{ij'}$ (x2) are unlikely to take the same value (the diagonal panels) except at extreme values of $b$.

This probability is used to compute the expected value of the product of $x_{ij}$ and $x_{ij'}$:

$$ E(x_{ij} x_{ij'}| g_i= g) = \int_B \int_B \sum_{t \in \mathcal{T}} \sum_{t' \in \mathcal{T}} t t' P(x_{ij} = t, x_{ij'} = t'|g_i = g, b_{ij} = b, b_{ij'} = b') f_b(b) f_b(b') \, db \, db' $$

# Re-define the previous integrand so that b1 can vary separately from b2
calc_bivariate_integrand <- function(b1, b2, t1, t2) {
  pmvnorm(
    lower = c(c_t[t1 - 1] - b1,
              c_t[t2 - 1] - b2),
    upper = c(c_t[t1] - b1,
              c_t[t2] - b2),
    mean = c(0, 0),
    corr = matrix(c(var_u + var_e, var_u,
                    var_u, var_u + var_e),
                  nrow = 2, ncol = 2),
    keepAttr = FALSE
  )
}

# Function to compute the double integral
integrate_prob_x1_x2 <- function(t1, t2, delta) {
  stats::integrate(
    f = function(b1) {
      # Need to map values here because integrate() passes multiple b values
      #  at once, but pmvnorm only takes one value at a time
      map_dbl(
        b1,
        function(b1) {
          stats::integrate(
            f = function(b2) {
              map_dbl(
                b2,
                function(b2) {
                  calc_bivariate_integrand(b1, b2, t1, t2)
                }
              )
            },
            lower = 0, upper = 2 * delta, subdivisions = 100
          )$value
        }
      )
    },
    lower = 0, upper = 2 * delta, subdivisions = 100
  )$value
}

e_x1_x2 <- crossing(
  t1 = 2:6, t2 = 2:6,
  delta = seq(0.01, 5, 0.1)
) %>%
  mutate(
    integral = pmap_dbl(
      list(t1, t2, delta),
      integrate_prob_x1_x2
    )
  ) %>%
  group_by(delta) %>%
  summarise(
    value = sum((t1 - 4) * (t2 - 4) * integral),
    .groups = "drop"
  ) %>%
  mutate(value = value / (2 * delta)^2)
#readr::write_rds(e_x1_x2, "inst/extdata/e-x1-x2.rds")

# To save time while building the vignette, the results of this integration
#  were saved separately (took about 80 seconds)
e_x1_x2 <- readr::read_rds(system.file("extdata/e-x1-x2.rds",
                                       package = "gasr"))
e_x1_x2 %>%
  ggplot(aes(delta, value)) +
  geom_line()

Finally, the expected covariance of two discrete goal scores (from which the correlation $\rho_{x,g}$ follows after dividing by the variance) is given by:

$$ \text{Cov}(x_{ij}, x_{ij'}| g_i = g) = E(x_{ij} x_{ij'} | g_i = g) - \mu_g^2 $$

e_x1_x2 %>%
  mutate(delta = round(delta, 2)) %>%
  left_join(
    x_moments %>% transmute(delta = round(delta, 2), mu),
    by = "delta"
  ) %>%
  mutate(rho_xg = value - mu^2) %>%
  ggplot(aes(delta, rho_xg)) +
  geom_line()

Once again, looks like the right shape, but not the right scale.

Simulation examples

gasr provides modularized functions for running simulations with the parameters specified in the reference scenario. For instance, letting $\rho_0 = 0$, we can simulate 40 subjects equally allocated to control and treatment groups:

set.seed(77)

rho_0 <- 0
var_u <- rho_0
var_e <- 1 - rho_0

gas_data <- sim_subjects(
  n_subjects = 40, sigma_u = sqrt(var_u),
  group_allocation = list(control = 0.5, treatment = 0.5),
  random_allocation = FALSE
)
gas_data %>% glimpse()

Give each subject between 1 and 5 (=$n_{\text{max}}$) goals:

gas_data <- gas_data %>%
  sim_goals(n_goals_range = c(1, 5), sigma_e = sqrt(var_e)) %>%
  unnest(goals)
glimpse(gas_data)

Assign a uniformly distributed treatment effect with $\delta = 0.3$:

gas_data <- gas_data %>%
  sim_treatment_effect(delta = 0.3)
glimpse(gas_data)

The goal weights could be added here with sim_goal_weights(), but we'll leave them unweighted for this example.

Calculate the latent goal scores by summing up the random effects (subject_re = $u_i$, goal_re = $\epsilon_{ij}$) and fixed effects (treatment_fe = $b_{ij}$, and a negative control group response $b_0$ = -0.3):

gas_data <- gas_data %>%
  mutate(
    score_continuous = subject_re + goal_re +
      ifelse(group == "treatment", treatment_fe, -0.3)
  )

The distribution of continuous scores by treatment group:

gas_data %>%
  ggplot(aes(x = score_continuous, fill = group, color = group)) +
  geom_density(alpha = 0.5) +
  geom_rug()

Now to discretize the scores, we define the thresholds using the inverse cumulative normal distribution:

thresh <- create_thresholds(n_levels = 5, score_dist = "unif")
thresh

And apply to the continuous scores:

gas_data <- gas_data %>%
  mutate(
    score_discrete = discretize_from_thresholds(score_continuous, thresh)
  )
gas_data %>%
  ggplot(aes(x = score_discrete, fill = group)) +
  geom_bar(position = position_dodge())

It is helpful to run these functions step-by-step to understand the simulations, but it is not very efficient. The sim_trial() function performs all of the steps in one call:

set.seed(12)

gas_data <- sim_trial(
  n_subjects = 40, sigma_u = sqrt(var_u),
  group_allocation = list("control" = 0.5, "treatment" = 0.5),
  random_allocation = FALSE,
  n_goals_range = c(1, 5), sigma_e = sqrt(var_e),
  delta = 0.3, mean_control_response = -0.3,
  n_levels = 5
)
glimpse(gas_data)

(Note that this function also returns goal_weights, all equal to 1, and subject-level summary tscores).

gas_data %>%
  ggplot(aes(x = score_discrete, fill = group)) +
  geom_bar(position = position_dodge())

A simple hypothesis test would be to compare the subject-level mean goal attainment $\bar{x}_i = \frac{1}{n_i} \sum_j x_{ij}$ in the two groups via Welch's $t$-test (allowing for unequal variance):

gas_data_mean <- gas_data %>%
  group_by(subject_id, group) %>%
  summarise(
    mean_score_discrete = mean(score_discrete),
    .groups = "drop"
  )
t.test(
  mean_score_discrete ~ group,
  data = gas_data_mean,
  alternative = "two.sided", var.equal = FALSE
)

In this case, the difference was statistically significant at $\alpha = 0.05$.

We can repeat this, say, 10000 times, run $t$-tests, and get the $p$ values:

set.seed(21)

gas_sims <-
  tibble(
    sim = 1:10000,
  ) %>%
  mutate(
    data = map(
      sim,
      ~sim_trial(
        n_subjects = 40, sigma_u = sqrt(var_u),
        group_allocation = list("control" = 0.5, "treatment" = 0.5),
        random_allocation = FALSE,
        n_goals_range = c(1, 5), sigma_e = sqrt(var_e),
        delta = 0.3, mean_control_response = -0.3,
        n_levels = 5
      ) %>%
        group_by(subject_id, group) %>%
        summarise(mean_score_discrete = mean(score_discrete),
                  .groups = "drop")
    )
  )
# Run t-tests on every simulation
gas_sims_ttest <- gas_sims %>%
  transmute(
    sim,
    ttest = map(
      data,
      ~t.test(mean_score_discrete ~ group,
              data = .,
              alternative = "two.sided", var.equal = FALSE)
    )
  ) %>%
  mutate(ttest = map(ttest, broom::tidy)) %>%
  unnest(ttest)

(This took about 8 minutes to run, for reference).

# readr::write_rds(
#   gas_sims_ttest,
#   here::here("inst", "extdata", "gas-sims-ttest.rds")
# )

# To save time while building the vignette, the results of the simulation
#  were saved separately
gas_sims_ttest <- readr::read_rds(
  system.file("extdata/gas-sims-ttest.rds", package = "gasr")
)
glimpse(gas_sims_ttest)

Count the $p$ values according to significance level $\alpha = 0.05$:

gas_sims_ttest %>% count(p.value < 0.05)

A total of `r sum(gas_sims_ttest$p.value < 0.05)` of these simulations (`r scales::percent(mean(gas_sims_ttest$p.value < 0.05))`) returned a $p$ value < 0.05. And so the power to detect a significant treatment effect in this scenario is `r scales::percent(mean(gas_sims_ttest$p.value < 0.05))` when using a $t$-test on per-subject means.
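As a rough gauge of the Monte Carlo precision of this power estimate, we can compute a binomial standard error:

# Binomial standard error of the estimated power across the 10,000 simulations
p_hat <- mean(gas_sims_ttest$p.value < 0.05)
c(power = p_hat, se = sqrt(p_hat * (1 - p_hat) / nrow(gas_sims_ttest)))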

References


