knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
suppressWarnings(suppressPackageStartupMessages({
  library(knitr)
  library(ggplot2)
  library(dplyr)
  library(pwr)
}))
knitr::opts_chunk$set(echo=TRUE, message=FALSE, warning=FALSE, dev="png")
\(
\newcommand{\var}[1]{\operatorname{var}\left(#1\right)}
\newcommand{\E}[1]{\mathbb{E}\left[#1\right]}
\newcommand{\logit}[1]{\operatorname{logit}\left(#1\right)}
\DeclareMathOperator{\cov}{cov}
\)
The purpose of expanding sample sizes for genome-wide association studies (GWAS), now happening at a rapid rate with the advent of population-scale biobanks, is to detect more and more genetic variants associated with complex traits of interest. With 5 million samples, the detectable effect sizes are extremely small. While this provides exquisite power, it also opens up new avenues of subtle bias. One such problem arises because the recruited individuals are not representative of the general population. If they are ascertained, for example, on higher educational attainment, collider bias will induce correlations amongst all the causes of higher educational attainment within the sample. This problem is typically ignored because the magnitude of the collider bias is relatively small and unlikely to be detectable at genome-wide significance thresholds. However, with extreme statistical power that safety may disappear.
The UK has a population of 50 million adults. In the next major genotyping effort, 5 million will be sampled and genotyped. In principle this is to improve statistical power for GWAS. In many cases the GWAS signals of interest are only those that are very strongly associated, e.g. with $p < 5\times10^{-8}$. This likely guards against a lot of the potential problems that can arise due to subtle sampling issues such as collider bias in ascertained samples.
Question: When the power to detect very small effects becomes very large, is collider bias arising from ascertainment in GWAS now a realistic problem?
Question: As more of the population is sampled, how do the magnitude of this bias, and the power to detect it, change?
Use educational attainment ($S$) as the variable that influences ascertainment, where ascertainment is a binary variable $Z$ indicating whether the individual is in the sample. Suppose that educational attainment influences the probability of being recruited into the sample, or at least of having non-missing data in the sample. The logistic function can be used to model the influence of educational attainment on being sampled:
$$ \logit{\mu_i} = \alpha + \delta S_i $$
where $\mu_i$ is the probability that the $i$th individual is included in the sample based on their value of $S$, and the intercept $\alpha$ determines the proportion of individuals in the population who are sampled, $p$, via:
$$ p = \frac{1}{1 + \exp(-\alpha)} $$
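As a quick numerical sketch of this relationship (the sampling proportion used here is arbitrary), the intercept can be recovered from $p$ and vice versa:

```r
# Sketch: the intercept alpha is the logit of the baseline sampling proportion p
p <- 0.1                    # arbitrary sampling proportion for illustration
alpha <- log(p / (1 - p))   # logit(p)
1 / (1 + exp(-alpha))       # recovers p = 0.1
```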
The (biased) estimate of the effect of a genetic variant ($G$) on some predictor of education ($X$), when estimated only amongst recruited samples ($Z = 1$), is given by
$$ \widehat{\beta}_{GX} = \frac{\sigma_{GX}/\sigma_{G} - \sigma_{GZ}\sigma_{XZ} / \sigma_{Z}^2}{1 - (\sigma_{GZ}/\sigma_{Z})^2} $$
Given the relationship between education and recruitment, we can parameterise the bias in terms of the effect of $S$ on recruitment. The value of $Z$ indicates whether an individual is present in the sample, and it is a function of $S$, so that
$$ \sigma_{GZ} = \cov\left(G,\frac{p}{1-p}\delta S\right) $$
$$ \widehat{\beta}_{GX} = \frac{\sigma_{GX}/2m(1-m) - \cov\left(G, \frac{p}{1-p}\delta S\right)\cov\left(X, \frac{p}{1-p}\delta S\right) / \sigma_{Z}^2}{1 - \left(\cov\left(G, \frac{p}{1-p}\delta S\right)/\sigma_{Z}\right)^2} $$
There are two different methods for estimating the collider effect.
Groenwold, Palmer and Tilling (2020, https://osf.io/vrcuf/) give:
$$ \widehat{\beta}_{xy|s} = \frac{\beta_{xy} \sigma^2_{s} \sigma^2_{x} - \beta_{xs} \beta_{ys} \sigma^2_x \sigma^2_y}{\beta^2_{ys} \sigma^2_x \sigma^2_y + \sigma^2_s \sigma^2_x} $$
Pearl (2013, https://ftp.cs.ucla.edu/pub/stat_ser/r409_submitted_corrected.pdf) gives:
$$ \widehat{\beta}_{xy|s} = \frac{\sigma_{xy} - \sigma_{sy} \sigma_{sx}}{1 - \sigma^2_{sx}} $$
I think this assumes the error terms are equal.
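A quick check (my own sketch, not from either reference): with $x$, $y$ and $s$ standardised, Pearl's expression should match the coefficient on $x$ from regressing $y$ on $x$ and $s$.

```r
# Sketch: Pearl's expression with standardised variables vs the fitted partial coefficient
set.seed(1)
n <- 100000
x <- rnorm(n)
y <- 0.3 * x + rnorm(n)
s <- 2 * x + 3 * y + rnorm(n)   # collider of x and y, arbitrary effect sizes
xs <- as.numeric(scale(x))
ys <- as.numeric(scale(y))
ss <- as.numeric(scale(s))
(cov(xs, ys) - cov(ss, ys) * cov(ss, xs)) / (1 - cov(ss, xs)^2)
lm(ys ~ xs + ss)$coef[2]
```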
I get differing results between these solutions when the variances of $x$ and $y$ change. An alternative to the first one is:
$$ \widehat{\beta}_{xy|s} = \frac{\beta_{xy} \sigma^2_{s} \sigma^2_{x} - \beta_{xs} \beta_{ys} \sigma^2_x \sigma^2_y}{\beta^2_{ys} \sigma^2_x \sigma^2_y + \sigma^2_x} $$
Simulate a continuous collider that influences a continuous conditioning variable.
set.seed(20200420)
n <- 100000
x <- rnorm(n)
y <- rnorm(n)
b_xy <- 0
b_xs <- 6
b_ys <- 8
b_sz <- 5
s <- x * b_xs + y * b_ys + rnorm(n)
z <- s * b_sz + rnorm(n)
Bias in x-y assoc when conditioning on s
(b_xy * var(s) * var(x) - b_xs * b_ys * var(x) * var(y)) / (b_ys^2 * var(x) * var(y) + var(x))
lm(y ~ x + s)$coef[2]
Bias in x-y assoc when conditioning on z
b_xz <- lm(z ~ x)$coef[2]
b_yz <- lm(z ~ y)$coef[2]
(b_xy * var(z) * var(x) - b_xz * b_yz * var(x) * var(y)) / (b_yz^2 * var(x) * var(y) + var(x))
lm(y ~ x + z)$coef[2]
It's the same... why?
Now make x binary. This also changes its mean and variance. Change the variance of y as well.
set.seed(12345)
n <- 100000
maf <- 0.3
x <- rbinom(n, 2, maf)
y <- rnorm(n, sd=3)
s <- x * 6 + y * 8 + rnorm(n)
z <- s * 5 + rnorm(n)
Bias in x-y assoc when conditioning on s
(b_xy * var(s) * var(x) - b_xs * b_ys * var(x) * var(y)) / (b_ys^2 * var(x) * var(y) + var(x))
lm(y ~ x + s)$coef[2]
Bias in x-y assoc when conditioning on z
b_xz <- lm(z ~ x)$coef[2]
b_yz <- lm(z ~ y)$coef[2]
(b_xy * var(z) * var(x) - b_xz * b_yz * var(x) * var(y)) / (b_yz^2 * var(x) * var(y) + var(x))
lm(y ~ x + z)$coef[2]
Now make z binary
x <- rbinom(n, 2, maf)
y <- rnorm(n)
s <- x * b_xs + y * b_ys
l <- b_sz * s
pr <- 1 / (1 + exp(-l))
z <- rbinom(n, 1, pr)
table(z)
Condition on z
b_xz <- lm(z ~ x)$coef[2]
b_yz <- lm(z ~ y)$coef[2]
(b_xy * var(z) * var(x) - b_xz * b_yz * var(x) * var(y)) / (b_yz^2 * var(x) * var(y) + var(x) * var(z))
(b_xy * var(z) * var(x) - b_xz * b_yz * var(x) * var(y)) / (b_yz^2 * var(x) * var(y) + var(x))
lm(y ~ x + z)$coef[2]
lm(y[z==1] ~ x[z==1])$coef[2]
Now it's odd: it seems to need the var(z) term in the denominator. Compare the different methods:
library(dplyr)
param <- expand.grid(
  n = 100000,
  maf = c(0.1, 0.5, 0.7),
  b_xy = c(0, 0.3),
  b_xs = c(0.1, 0.6),
  b_ys = c(0.4, 0.8),
  b_sz = c(0.4, 0.8),
  sdy = c(1, 3),
  alpha = c(0, 1, 2),
  sim = 1
)
out <- list()
for(i in 1:nrow(param))
{
  message(i)
  pa <- param[i, ]
  x <- rbinom(pa$n, 2, pa$maf)
  y <- rnorm(pa$n, sd=pa$sdy)
  s <- x * pa$b_xs + y * pa$b_ys
  l <- pa$alpha + pa$b_sz * s
  pr <- 1 / (1 + exp(-l))
  z <- rbinom(pa$n, 1, pr)
  b_xz <- cov(x, z) / var(x)
  b_yz <- cov(y, z) / var(y)
  # note: b_xy below is the global value defined earlier (0); y is simulated with no direct effect of x
  pa$est1 <- (b_xy * var(z) * var(x) - b_xz * b_yz * var(x) * var(y)) / (b_yz^2 * var(x) * var(y) + var(x) * var(z))
  pa$est2 <- (b_xy * var(z) * var(x) - b_xz * b_yz * var(x) * var(y)) / (b_yz^2 * var(x) * var(y) + var(x))
  pa$est3 <- (cov(x,y)/sd(x) - cov(x,z) * cov(y,z) / var(z)) / (1 - cov(x,z)^2/var(z))
  pa$est4 <- lm(y ~ x + z)$coef[2]
  pa$est5 <- cov(y[z==1], x[z==1]) / var(x[z==1])
  out[[i]] <- pa
}
out <- bind_rows(out)
subset(out, select=c(est1, est2, est3, est4, est5)) %>% pairs
subset(out, select=c(est1, est2, est3, est4, est5)) %>% cor
OK, just use the original version from Groenwold, Palmer and Tilling (2020) (est1 above); it has the best correlation across parameters with the actual conditioning method.
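For convenience, that expression could be wrapped in a small helper (a hypothetical sketch; the chunks below continue to write the formula out inline):

```r
# Hypothetical helper (not used below): expected x-y association after conditioning
# on a variable z, using the Groenwold-style expression selected above
collider_bias_z <- function(x, y, z, b_xy = 0) {
  b_xz <- cov(x, z) / var(x)
  b_yz <- cov(y, z) / var(y)
  (b_xy * var(z) * var(x) - b_xz * b_yz * var(x) * var(y)) /
    (b_yz^2 * var(x) * var(y) + var(x) * var(z))
}
```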
Can we present this in terms of $s$ instead of $z$? What is cov(x,z) in terms of cov(x,s)?
$$ \cov(X,Z) = \cov(X,\beta S) $$
$$ \beta = \frac{\cov(S,Z)}{\var{S}} $$
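A rough numerical check of this substitution (my own sketch, with arbitrary moderate effect sizes rather than the values used above):

```r
# Sketch: cov(x, z) vs cov(s, z) / var(s) * cov(x, s) when a binary z is generated from s
set.seed(54321)
n <- 100000
x <- rbinom(n, 2, 0.3)
y <- rnorm(n)
s <- 0.6 * x + 0.8 * y
pr <- 1 / (1 + exp(-0.5 * s))
z <- rbinom(n, 1, pr)
cov(x, z)
cov(s, z) / var(s) * cov(x, s)   # approximately equal
```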
x <- rbinom(n, 2, maf)
y <- rnorm(n)
s <- x * b_xs + y * b_ys
l <- b_sz * s
pr <- 1 / (1 + exp(-l))
z <- rbinom(n, 1, pr)
table(z)
b_xz <- lm(z ~ x)$coef[2]
b_yz <- lm(z ~ y)$coef[2]
(b_xy * var(z) * var(x) - b_xz * b_yz * var(x) * var(y)) / (b_yz^2 * var(x) * var(y) + var(x) * var(z))
(b_xy * var(z) * var(x) - b_xz * b_yz * var(x) * var(y)) / (b_yz^2 * var(x) * var(y) + var(x))
lm(y ~ x + z)$coef[2]
lm(y[z==1] ~ x[z==1])$coef[2]
Obtain $\beta_{yz}$ in terms of $\beta_{ys}$ and $\beta_{sz}$
b_xs <- cov(x,s) / var(x)
b_ys <- cov(y,s) / var(y)
b_sz <- cov(s,z) / var(s)
(b_xy * var(z) * var(x) - b_xz * b_yz * var(x) * var(y)) / (b_yz^2 * var(x) * var(y) + var(x) * var(z))
(b_xy * var(z) * var(x) - b_xs * b_sz * b_ys * b_sz * var(x) * var(y)) / ((b_ys * b_sz)^2 * var(x) * var(y) + var(x) * var(z))
Present b_sz in terms of the OR from a logistic regression model.
x <- rbinom(n, 2, maf)
y <- rnorm(n)
s <- x * b_xs + y * b_ys
l <- b_sz * s
pr <- 1 / (1 + exp(-l))
z <- rbinom(n, 1, pr)
b_xz <- lm(z ~ x)$coef[2]
b_yz <- lm(z ~ y)$coef[2]
logor <- glm(z ~ s, family="binomial")$coef[2]
# estimate lm
bhat_sz <- lm(z ~ s)$coef[2]
# convert beta to logor
prop <- sum(z == 1) / length(z)
mu <- prop * (1 - prop)
# get beta from logOR
logor * mu
cov(s,z) / var(s)
lm(y[z==1] ~ x[z==1])$coef[2]
b_xs <- cov(x,s) / var(x)
b_ys <- cov(y,s) / var(y)
b_sz <- cov(s,z) / var(s)
(b_xy * var(z) * var(x) - b_xz * b_yz * var(x) * var(y)) / (b_yz^2 * var(x) * var(y) + var(x) * var(z))
(b_xy * var(z) * var(x) - b_xs * b_sz * b_ys * b_sz * var(x) * var(y)) / ((b_ys * b_sz)^2 * var(x) * var(y) + var(x) * var(z))
(b_xy * var(z) * var(x) - b_xs * logor * mu * b_ys * logor * mu * var(x) * var(y)) / ((b_ys * logor * mu)^2 * var(x) * var(y) + var(x) * var(z))
Quick recap:
$$
\begin{aligned}
\widehat{\beta}_{xy|s} &= \frac{\beta_{xy} \sigma^2_{s} \sigma^2_{x} - \beta_{xs} \beta_{ys} \sigma^2_x \sigma^2_y}{\beta^2_{ys} \sigma^2_x \sigma^2_y + \sigma^2_s \sigma^2_x} \\
\widehat{\beta}_{xy|z} &= \frac{\beta_{xy} \, p(1-p) \, 2m(1-m) - \beta_{xs} \beta_{ys} \left(\log \text{OR}_{sz}\right)^2 p^2(1-p)^2 \, 2m(1-m) \, \sigma^2_y}{\beta^2_{ys} \left(\log \text{OR}_{sz}\right)^2 p^2(1-p)^2 \, 2m(1-m) \, \sigma^2_y + p(1-p) \, 2m(1-m)}
\end{aligned}
$$
So now we need to present betas in terms of variances. What explains $S$?
$$
\begin{aligned}
S_i &= \beta_{SX} X_i + \beta_{SY} Y_i + e_i \\
R^2_{S} &= \frac{2 \beta^2_{SX} m(1-m) + \beta^2_{SY} \var{Y}}{\var{S}} \\
\var{S} &= \frac{2\beta^2_{SX} m(1-m) + \beta^2_{SY}\var{Y}}{R^2_S}
\end{aligned}
$$
So Vx = 2m(1-m), Vy is fixed, and Vs will depend on Vx and the effects.
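A one-line sanity check of the genotype variance term (my own sketch, with an arbitrary allele frequency):

```r
# Sketch: for a SNP coded 0/1/2 with allele frequency m, var(G) should be close to 2m(1-m)
m <- 0.3
g <- rbinom(1e6, 2, m)
var(g)
2 * m * (1 - m)
```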
set.seed(12345)
n <- 10000
maf <- 0.3
rsq_gs <- 0.1
rsq_xs <- 0.2
vx <- 1
vs <- 1
rsq_s <- rsq_gs + rsq_xs
g <- rbinom(n, 2, maf)
x <- rnorm(n)
b_gs <- sqrt(rsq_gs * vs / (2*maf*(1-maf)))
b_xs <- sqrt(rsq_xs * vs / vx)
ve <- vs - b_gs^2 * 2*maf*(1-maf) - b_xs^2 * vx
s <- g * b_gs + x * b_xs + rnorm(n, sd=sqrt(ve))
var(s)
cor(g, s)^2
cor(x, s)^2
$$ \beta_{GS} = \sqrt{\frac{R^2_{GS} \var{S}}{2m(1-m)}} $$ and
$$ \beta_{XS} = \sqrt{\frac{R^2_{XS} \var{S}}{\var{X}}} $$
and
$$ e_i \sim N\left(0, \var{S} - \beta_{XS}^2 \var{X} - 2 \beta_{GS}^2 m(1-m) \right) $$
Now remove the data and use just the parameters.
# input parameters
n <- 100000
prop <- 0.3
or_sz <- 2.5
b_xy <- 0
rsq_xs <- 0.05
rsq_ys <- 0.1
maf <- 0.3
vy <- 1
vs <- 1

# infer
logor_sz <- log(or_sz)
vz <- prop * (1-prop)
vx <- 2 * maf * (1-maf)
b_xs <- sqrt(rsq_xs * vs / (2*maf*(1-maf)))
b_ys <- sqrt(rsq_ys * vs / vy)
ve <- vs - b_xs^2 * 2*maf*(1-maf) - b_ys^2 * vy
alpha <- log((1 - prop)/prop)

# generate
set.seed(678910)
y <- rnorm(n)
x <- rbinom(n, 2, maf)
s <- x * b_xs + y * b_ys + rnorm(n, sd=sqrt(ve))
cor(s,x)^2
cor(s,y)^2
l <- alpha + logor_sz * s
pr <- 1 / (1 + exp(-l))
z <- rbinom(n, 1, pr)

# infer
lm(y[z==1] ~ x[z==1])$coef[2]
cor(y[z==1], x[z==1])

# derive
bh_xs <- cov(x,s) / var(x)
bh_xz <- cov(x,z) / var(x)
bh_ys <- cov(y,s) / var(y)
bh_yz <- cov(y,z) / var(y)
bh_sz <- cov(s,z) / var(s)
(b_xy * var(z) * var(x) - bh_xz * bh_yz * var(x) * var(y)) / (bh_yz^2 * var(x) * var(y) + var(x) * var(z))
(b_xy * var(z) * var(x) - cov(x,z)/var(x) * cov(y,z)/var(y) * var(x) * var(y)) / ((cov(y,z)/var(y))^2 * var(x) * var(y) + var(x) * var(z))
(b_xy * vz * vx - b_xs * logor_sz * vz * b_ys * logor_sz * vz * vx * vy) / ((b_ys * logor_sz * vz)^2 * vx * vy + vx * vz)
How to test power: two different ways. Just use the native (base R) version.
param <- expand.grid(
  rsq = seq(0.001, 0.4, by=0.001),
  n = c(100, 1000, 10000)
)
for(i in 1:nrow(param))
{
  param$p1[i] <- power.anova.test(groups=2, n=param$n[i], between.var=param$rsq[i], within.var=1-param$rsq[i], sig.level=0.05)$power
  param$p2[i] <- pwr.r.test(n=param$n[i], r=sqrt(param$rsq[i]), sig.level=0.05)$power
}
plot(param$p1, param$p2)
Now create a function.
simulate_y_structure <- function(n, prop, or_sz, b_xy, rsq_xs, rsq_ys, maf, vy=1, vs=1, sig.level=5e-8)
{
  # infer
  logor_sz <- log(or_sz)
  vz <- prop * (1-prop)
  vx <- 2 * maf * (1-maf)
  b_xs <- sqrt(rsq_xs * vs / (2*maf*(1-maf)))
  b_ys <- sqrt(rsq_ys * vs / vy)
  ve <- vs - b_xs^2 * 2*maf*(1-maf) - b_ys^2 * vy
  alpha <- log((1 - prop)/prop)
  bh_xy <- (b_xy * vz * vx - b_xs * logor_sz * vz * b_ys * logor_sz * vz * vx * vy) / ((b_ys * logor_sz * vz)^2 * vx * vy + vx * vz)
  cor_xy <- sqrt(bh_xy^2 * vx / vy) * sign(bh_xy)
  pow <- power.anova.test(groups=2, n=n * prop, between.var=cor_xy^2, within.var=1 - cor_xy^2, sig.level=sig.level)$power
  ns <- n * prop
  se <- sqrt((1 - cor_xy^2) * vs / (ns * vx))
  pval <- pnorm(abs(bh_xy / se), lower.tail=FALSE)
  return(list(bh_xy=bh_xy, cor_xy=cor_xy, pow=pow, se=se, pval=pval, ns=ns))
}
simulate_y_structure(n=50000000, prop=0.8, 2, 0, 0.001, 0.1, 0.5)
simulate_y_structure(n=50000000, prop=0.2, 2, 0, 0.001, 0.1, 0.5)
Run some simulations
param <- expand.grid(
  n = 50000000,
  prop = seq(0.01, 0.99, by=0.01),
  or_sz = c(1, 1.25, 1.5, 1.75, 2, 3),
  b_xy = c(0),
  rsq_xs = c(0.0001, 0.0005, 0.001),
  rsq_ys = c(0.01, 0.05, 0.1),
  maf = 0.3
)
dim(param)
res <- lapply(1:nrow(param), function(x)
{
  P <- param[x, ]
  o <- simulate_y_structure(n=P$n, prop=P$prop, P$or_sz, P$b_xy, P$rsq_xs, P$rsq_ys, P$maf)
  as_tibble(c(P, o))
}) %>% bind_rows()
ggplot(res, aes(x=prop*n, y=cor_xy, group=as.factor(or_sz))) +
  geom_point(aes(colour=as.factor(or_sz))) +
  geom_line(aes(colour=as.factor(or_sz))) +
  facet_grid(rsq_xs ~ rsq_ys) +
  scale_colour_brewer() +
  theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))
ggplot(res, aes(x=prop*n, y=-log10(pval), group=as.factor(or_sz))) +
  geom_point(aes(colour=as.factor(or_sz))) +
  geom_line(aes(colour=as.factor(or_sz))) +
  facet_grid(rsq_xs ~ rsq_ys) +
  scale_colour_brewer() +
  theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))
ggplot(res, aes(x=prop*n, y=pow, group=as.factor(or_sz))) +
  geom_point(aes(colour=as.factor(or_sz))) +
  geom_line(aes(colour=as.factor(or_sz))) +
  facet_grid(rsq_xs ~ rsq_ys) +
  geom_vline(xintercept=5000000) +
  scale_colour_brewer() +
  theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))
# simulate SNP
x1 <- rbinom(10000, 2, 0.4)
# liability
z <- 1 + 0.2 * x1
mean(z)
# probability of case
pr <- 1 / (1 + exp(-z))
# sample cases
y <- rbinom(10000, 1, pr)
table(y)
# now feed it to glm:
(logor <- glm(y ~ x1, family="binomial")$coef[2])
# estimate lm
cov(x1, y) / var(x1)
(beta <- lm(y ~ x1)$coef[2])
# convert beta to logor
prop <- sum(y == 1) / length(y)
mu <- prop * (1 - prop)
beta / mu
# Is it different when you use the known intercept?
propt <- 1 / (1 + exp(-1))
mut <- propt * (1 - propt)
beta / mut
$$ \cov(s,z) = p(1-p) \left(\E{s | z = 1} - \E{s | z = 0}\right) $$
Here $p$ is the proportion of individuals in the population recruited into the sample. For simplicity, call $\delta = \E{s | z = 1} - \E{s | z = 0}$ the difference in mean educational attainment between individuals recruited and not recruited into the study.
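This identity is easy to verify numerically (a quick sketch; the sample versions agree up to a factor of $n/(n-1)$):

```r
# Sketch: cov(s, z) vs p(1-p) times the difference in mean s between sampled and unsampled
set.seed(42)
n <- 100000
s <- rnorm(n)
pr <- 1 / (1 + exp(-(-1 + 0.5 * s)))   # arbitrary logistic sampling model
z <- rbinom(n, 1, pr)
p <- mean(z)
cov(s, z)
p * (1 - p) * (mean(s[z == 1]) - mean(s[z == 0]))
```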