```r
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```
```r
library(AscRtain)
library(ggplot2)
library(ggnewscale)
library(dplyr) # provides the %>% pipe used throughout
```
Following https://osf.io/vrcuf/, we can infer the biased odds ratio (OR) for a binary exposure ($A$) on a binary outcome ($Y$) when both traits influence the probability of being present in the sample ($S$).
Assume that the probability of being present in the sample is simply:
$$ \mathbb{P}(S = 1 | A,Y) = \beta_0 + \beta_A A + \beta_Y Y + \beta_{AY} AY $$
where $A \in \{0,1\}$ and $Y \in \{0,1\}$. The expected odds ratio under this scenario is then:
$$ \mathbb{E}\left[\widehat{\operatorname{OR}}_{S=1}\right] = \frac{\beta_0(\beta_0 + \beta_A + \beta_Y + \beta_{AY})}{(\beta_0 + \beta_A)(\beta_0 + \beta_Y)} $$
Suppose that we know the fraction of the population that is present in our sample ($p_{S}$). We are only interested in the $\beta_*$ parameter values that give rise to a value of $p_{S}$ that is within the bounds of expectation:
$$ p_{S} = \beta_0 + \beta_A p_A + \beta_Y p_Y + \beta_{AY} p_{AY} $$
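As a quick hand-check of these two expressions, here is a minimal base-R sketch; the $\beta_*$ values and population frequencies below are arbitrary illustrative numbers, not estimates from any dataset:

```r
# Hand-check of the expected OR and sampling-fraction formulas above
# (hypothetical values chosen only for illustration)
b0 <- 0.05; ba <- -0.02; by <- 0.05; bay <- 0
pA <- 0.1; pY <- 0.1; pAY <- pA * pY  # assuming A and Y independent in the population
or_expected <- (b0 * (b0 + ba + by + bay)) / ((b0 + ba) * (b0 + by))
pS_expected <- b0 + ba * pA + by * pY + bay * pAY
c(OR = or_expected, pS = pS_expected)
```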
Scenario: we find an association between $A$ and $Y$ in our ascertained sample. Our question is what effects must $A$ and $Y$ have on sample ascertainment in order to induce the observed odds ratio $\operatorname{OR}$, assuming that the true odds ratio is 1.
Initialise a new VBB (V-structure, Binary exposure, Binary outcome) class:
```r
x <- VBB$new()
x
```
Example of how to calculate the odds ratio for given $\beta_*$ parameters:
```r
x$or_calc(b0 = 0.1, ba = 0.2, by = 0.3, bay = 0.4)
```
Search over a parameter space of possible values to identify whether some target odds ratio could be explained by sample ascertainment
```r
x$parameter_space(
  target_or = 2,
  pS = 0.0275,
  pA = 0.15,
  pY = 0.1,
  pAY = 0,
  b0_range = c(0, 0.1),
  ba_range = c(-0.2, 0.2),
  by_range = c(-0.2, 0.2),
  bay_range = c(0, 0),
  granularity = 200
)
```
The parameter values that meet the target OR due to collider bias
```r
x$param
```
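To get a rough sense of how much of the searched grid is compatible with the target, we can summarise the retained parameter sets (a sketch assuming, as above, that `x$param` holds one row per retained combination):

```r
# Number of retained parameter combinations and the range of baseline
# inclusion probabilities among them
nrow(x$param)
summary(x$param$b0)
```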
Visualise the distribution of odds ratios found across the range of parameters
```r
x$histogram()
```
Visualise the $\beta_*$ parameter ranges that meet the target odds ratio
```r
x$scatter()
```
Or in 3D:
```r
x$scatter3d()
```
Or make the 3D plot interactive:
```r
plot3Drgl::plotrgl()
```
As a worked example, suppose $A$ is ACE inhibitor (ACE-i) use, $Y$ is Covid-19 status, and $S$ is presence in the first release of the COVID Symptom Tracker dataset.
The observational association of ACE-i use with Covid-19 status gives OR $\approx 2$. Assume 7.8% of the population take ACE-i, 10% are infected with coronavirus at the time of sampling, and 1.9 million of an adult population of 54 million are present in the sample (3.5%). What influences of ACE-i use and Covid-19 status on sample inclusion would be required to induce a collider bias of $\operatorname{OR}=2$?
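The 3.5% sampling fraction used below follows directly from these numbers:

```r
# Fraction of the adult population present in the sample
1.9e6 / 54e6
```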
Enter parameters without interactions to reduce the parameter space for now:
```r
x <- VBB$new()
x$parameter_space(
  target_or = 2.07,
  pS = 0.035,
  pA = 0.078,
  pY = 0.1,
  pAY = 0.03,
  b0_range = c(0, 0.1),
  ba_range = c(-0.1, 0.1),
  by_range = c(-0.2, 0.2),
  bay_range = c(0, 0),
  granularity = 200
)
x$scatter()
```
Within our sample the prevalence of ACEi use is 4.2%. In the general population it is 7.8%. This means we can narrow down the range of our simulations to only look at parameter values that would give that level of selection (reduced likelihood of selection when on ACEi).
We need to figure out what value of $\beta_A$ could give rise to this difference in prevalence between the population and the sample.
```r
set.seed(20200420)
n <- 1000000
Y <- rbinom(n, 1, 0.1)    # assume 10% infected
A <- rbinom(n, 1, 0.078)  # assume 7.8% on ACE-i in the UK population
pS <- 0.03 - 0.015 * A + 0.1 * Y
S <- rbinom(n, 1, pS)
# target: 4.2% ACE-i prevalence in the sample
dat <- dplyr::tibble(A, Y, S)
subset(dat, S == 1) %>% {table(.$A) / sum(.$S)}
```
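The same calibration can be checked analytically rather than by simulation. This is a sketch, assuming $A$ and $Y$ are independent in the population and re-using the hypothetical $\beta_0 = 0.03$ and $\beta_Y = 0.1$ from the simulation above: by Bayes' theorem, $\mathbb{P}(A=1 \mid S=1) = p_A(\beta_0 + \beta_A + \beta_Y p_Y) / (\beta_0 + \beta_A p_A + \beta_Y p_Y)$, so we can scan $\beta_A$ values and see which brings the sample prevalence close to 4.2%.

```r
# Analytical check of the sample prevalence of A for a grid of beta_A values
# (assumes A and Y independent; b0 and by as in the simulation above)
pA <- 0.078; pY <- 0.1
b0 <- 0.03; by <- 0.1
ba_grid <- seq(-0.025, 0, by = 0.005)
pS_grid <- b0 + ba_grid * pA + by * pY
pA_sample <- pA * (b0 + ba_grid + by * pY) / pS_grid
round(cbind(ba = ba_grid, pA_sample = pA_sample), 3)
```

Values of $\beta_A$ between roughly $-0.015$ and $-0.02$ bring the sample prevalence close to 4.2%, consistent with the conclusion below.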
Likely somewhere between $-0.01$ and $-0.02$.
```r
x$scatter() +
  ggplot2::labs(
    x = expression(paste("Effect of ACE-i on inclusion probability (", beta[A], ")")),
    y = expression(paste("Effect of Covid-19 on inclusion probability (", beta[Y], ")")),
    colour = expression(paste("Baseline inc. prob. ", (beta[0])))
  ) +
  ggplot2::annotate("rect", xmin = -0.01, xmax = -0.02, ymin = -Inf, ymax = Inf, alpha = 0.4)
```
Now that we can fix the $\beta_A$ parameter to $-0.015$, we can introduce the interaction effect: if somebody is both a Covid-19 case and an ACEi user, do they have a different probability of selection than the marginal effects alone would imply?
We can also ask whether our association is more likely to have been attenuated down to OR = 2 (selection biasing the estimate downwards from a larger true effect) rather than inflated up from a true OR of 1. To check this, let's search the parameter space for parameter sets that would induce an OR of 0.8 under a true OR of 1, and superimpose these on the figure.
```r
x <- VBB$new()
x$parameter_space(
  target_or = 1.95,
  pS = 0.035,
  pA = 0.078,
  pY = 0.1,
  pAY = 0.078 * 0.1,
  b0_range = c(0, 0.1),
  ba_range = c(-0.015, -0.015),
  by_range = c(-0.1, 0.1),
  bay_range = c(-0.1, 0.1),
  granularity = 200
)

x1 <- VBB$new()
x1$parameter_space(
  target_or = 0.8,
  pS = 0.035,
  pA = 0.078,
  pY = 0.1,
  pAY = 0.078 * 0.1,
  b0_range = c(0, 0.1),
  ba_range = c(-0.015, -0.015),
  by_range = c(-0.1, 0.1),
  bay_range = c(-0.1, 0.1),
  granularity = 200
)

p <- x$param %>%
  ggplot2::ggplot(., aes(x = bay, y = by)) +
  ggplot2::geom_point(aes(colour = b0)) +
  ggplot2::labs(colour = expression(paste("Baseline inc. prob. ", (beta[0]), " OR > 2"))) +
  ggplot2::scale_colour_gradient(low = "white", high = "blue") +
  ggnewscale::new_scale_colour() +
  ggplot2::geom_point(data = x1$param, aes(colour = b0)) +
  ggplot2::scale_colour_gradient(low = "white", high = "red") +
  ggplot2::labs(
    x = expression(paste("Effect of ACE-i AND Covid-19 on inclusion probability (", beta[AY], ")")),
    y = expression(paste("Effect of Covid-19 on inclusion probability (", beta[Y], ")")),
    colour = expression(paste("Baseline inc. prob. ", (beta[0]), " OR < 0.8"))
  )
p
```
So the bias could go in either direction: much of the figure is shaded in both blue and red, and we do not know what the actual selection effect of Covid-19 status is. This indicates that it is really not reliable to make claims about causality from this type of sampling.
Note that some areas of the figure show both negative and positive bias; this is because the direction of bias changes with different levels of background participation ($\beta_0$), a parameter whose true value is particularly difficult to estimate.
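To put a rough number on this overlap, we can count how many $(\beta_Y, \beta_{AY})$ grid points appear in both searches (a sketch assuming `x$param` and `x1$param` were generated over the same grid, as above):

```r
# (by, bay) grid points retained by both the OR = 1.95 and the OR = 0.8 searches,
# i.e. points where the direction of bias depends on b0
both <- dplyr::inner_join(
  dplyr::distinct(x$param, by, bay),
  dplyr::distinct(x1$param, by, bay),
  by = c("by", "bay")
)
nrow(both)
```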
Simulate individual-level data according to the first result, where $A$ has no influence on $Y$. Note that the population size of 1 million is arbitrary; only the proportion sampled matters, not the absolute sample size.
Parameters:
```r
# Pick a parameter set with ba in the plausible range and bay as close to zero as possible
a <- subset(x$param, ba < -0.01 & ba > -0.02 & bay == min(abs(bay)))[1, ]
a
```
Simulate
```r
set.seed(31415)
n <- 1000000
Y <- rbinom(n, 1, a$pY)
A <- rbinom(n, 1, a$pA)
pS <- a$b0 + a$ba * A + a$by * Y
S <- rbinom(n, 1, pS)
```
What proportion of the population are present in the sample?
```r
sum(S) / length(S)
```
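As a further sanity check on the simulation (a sketch using the variables just created), we can compare the prevalence of $A$ among selected individuals with its population value; it should be depleted relative to the population, in line with the reasoning above:

```r
# ACE-i prevalence in the full population versus the ascertained sample
c(population = mean(A), sample = mean(A[S == 1]))
```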
Estimate the association between $A$ and $Y$ in the ascertained sample:
```r
summary(glm(Y ~ A, family = "binomial", subset = S == 1))$coef[2, 1] %>% exp
```
Compare to expected biased association
```r
a$or
```
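The same expectation can be recovered by plugging the chosen parameters into the closed-form expression given at the start of this document (a sketch; it should agree with `a$or` if the package computes the OR from that expression):

```r
# Expected biased OR from the closed-form expression, using the chosen parameters
with(a, (b0 * (b0 + ba + by + bay)) / ((b0 + ba) * (b0 + by)))
```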
If a logistic regression in the complete population could be performed to estimate the effects of $A$ and $Y$ on $S$, then inverse probability weighting could be achieved using the fitted probabilities of inclusion. The inverses of these probabilities are the weights to be used in the logistic regression in the ascertained sample:
```r
probs <- glm(S ~ A + Y + A:Y, family = "binomial") %>% fitted.values()
table(probs)
```
Associations in ascertained sample (biased):
```r
summary(glm(Y ~ A, family = "binomial", subset = S == 1))$coef[2] %>% exp
```
Associations in total population (unbiased):
```r
summary(glm(Y ~ A, family = "binomial"))$coef[2] %>% exp
```
Weighted regression in ascertained sample (unbiased):
```r
summary(glm(Y ~ A, family = "binomial", subset = S == 1, weights = 1/probs))$coef[2] %>% exp
```
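Note that `glm` will typically warn about non-integer successes when given these fractional weights; the point estimate is still usable here. As an aside, here is a sketch of the same weighted fit using the survey package (an extra dependency, not used elsewhere in this document), which handles the weights in a design-based way:

```r
# Same IPW analysis via the survey package (assumes the survey package is installed)
dat_s <- data.frame(A = A, Y = Y, w = 1 / probs)[S == 1, ]
des <- survey::svydesign(ids = ~1, weights = ~w, data = dat_s)
exp(coef(survey::svyglm(Y ~ A, design = des, family = quasibinomial()))[2])
```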
However, without the complete population data we cannot actually estimate the effects of $A$ and $Y$ on $S$ directly. If we did know those effects (on the log OR scale), then the weights could easily be generated:
```r
coef <- glm(S ~ A + Y + A:Y, family = "binomial")$coef
probs2 <- model.matrix(~ A + Y + A:Y, subset = S == 1) %*% coef %>% plogis
summary(glm(Y ~ A, family = "binomial", subset = S == 1, weights = 1/probs2))$coef[2] %>% exp
```
Instead, we have a set of parameters that we know give rise to bias. There is a slight complication: those parameters are on the risk difference scale, not the log odds ratio scale, so we need to convert them. A risk difference coefficient $\beta$ corresponds approximately to a log odds ratio of $\beta / (\mu(1-\mu))$, where $\mu$ is the overall inclusion probability:
```r
# Overall inclusion (sampling) fraction
mu <- sum(S) / length(S)
# Convert the risk-difference parameters to the log odds scale
coef2 <- c(
  qlogis(a$b0),
  a$ba / (mu * (1 - mu)),
  a$by / (mu * (1 - mu)),
  a$bay / (mu * (1 - mu))
)
probs3 <- model.matrix(~ A + Y + A:Y, subset = S == 1) %*% coef2 %>% plogis
table(probs3)
table(probs2)
summary(glm(Y ~ A, family = "binomial", subset = S == 1, weights = 1/probs3))$coef[2] %>% exp
summary(glm(Y ~ A, family = "binomial", subset = S == 1))$coef[2] %>% exp
```
This goes some way towards attenuating the effect, but it is clearly still some way off. This is probably because the result is very sensitive to the interaction term, which we do not seem to be capturing well. For example, here is the weighted estimate with the interaction term included and omitted:
```r
probs_int <- glm(S ~ A + Y + A:Y, family = "binomial") %>% fitted.values()
probs_noint <- glm(S ~ A + Y, family = "binomial") %>% fitted.values()
summary(glm(Y ~ A, family = "binomial", subset = S == 1, weights = 1/probs_int))$coef[2] %>% exp
summary(glm(Y ~ A, family = "binomial", subset = S == 1, weights = 1/probs_noint))$coef[2] %>% exp
```
Omitting the interaction term, even though it is small, makes a big difference. Also, the estimate of the interaction effect on the risk difference scale is quite different from the simulated value, even though the other terms are still estimated quite well:
```r
coef3 <- lm(S ~ A + Y + A:Y)$coef
coef3[1] <- qlogis(coef3[1])
coef3[2:4] <- coef3[2:4] / (mu * (1 - mu))
coef3
coef
```
The difference in the weighted regressions is large:
```r
probs4 <- model.matrix(~ A + Y + A:Y, subset = S == 1) %*% coef3 %>% plogis
summary(glm(Y ~ A, family = "binomial", subset = S == 1, weights = 1/probs4))$coef[2] %>% exp
summary(glm(Y ~ A, family = "binomial", subset = S == 1, weights = 1/probs3))$coef[2] %>% exp
summary(glm(Y ~ A, family = "binomial", subset = S == 1, weights = 1/probs2))$coef[2] %>% exp
```