In stephenslab/susieR: Sum of Single Effects Linear Regression

In this vignette, we briefly illustrate the different ways susie_rss can be called, and draw connections between running susie_rss on summary data, and running susie on individual-level data.

knitr::opts_chunk$set(collapse = TRUE,comment = "#",fig.width = 5,
                      fig.height = 3,fig.align = "center",
                      dpi = 120)

library(susieR)

Simulate a data set with 200 samples and 1,000 variables, in which the only first 4 variables affect the outcome.

set.seed(1)
n <- 200
p <- 1000
beta <- rep(0,p)
beta[1:4] <- 1
X <- matrix(rnorm(n*p),nrow = n,ncol = p)
X <- scale(X,center = TRUE,scale = FALSE)
y <- drop(X %*% beta + rnorm(n))

Compute summary statistics $\hat{b}_j, \hat{s}_j$ and the correlation matrix, ${\bf R}$. These quantities will be provided as input to susie_rss.

ss  <- univariate_regression(X,y)
dat <- compute_suff_stat(X,y,standardize = FALSE)
R   <- cov2cor(dat$XtX)

The susie and susie_rss analyses produce the exact same results when the summary statistics bhat, shat, var_y and n are provided to susie_rss (and when R is an "in sample" correlation estimate---that is, when it was computed from the same matrix X that was used to obtain the other statistics). If the covariate effects are removed from the genotypes in univariate regression, the in-sample LD matrix should compute from the genotype residuals where the covariate effects have been removed.

res1 <- susie(X,y,L = 10)
res2 <- susie_rss(bhat = ss$betahat,shat = ss$sebetahat,R = R,n = n,
                  var_y = var(y),L = 10,estimate_residual_variance = TRUE)
plot(coef(res1),coef(res2),pch = 20,xlab = "susie",ylab = "susie_rss")
abline(a = 0,b = 1,col = "skyblue",lty = "dashed")

When some but not all of these statistics are provided, the results should be similar, but not exactly the same.

Next let's compare the susie and susie_rss outputs when ${\bf X}, y$ are standardized before computing the summary statistics (by "standardize", we mean that $y$ and the columns of ${\bf X}$ are each divided by the sample standard deviation so that they each have the same standard deviation).

ss  <- univariate_regression(scale(X),scale(y))
dat <- compute_suff_stat(X,y,standardize = TRUE)
R   <- cov2cor(dat$XtX)

Then we compute the z-scores:

zhat <- ss$betahat/ss$sebetahat

When standardizing, providing susie_rss with summary data z (or bhat, shat), R and n is sufficient for susie_rss to recover the same results as susie:

res1 <- susie(scale(X),scale(y),L = 10)
res2 <- susie_rss(bhat = ss$betahat,shat = ss$sebetahat,R = R,n = n,
                  L = 10,estimate_residual_variance = TRUE)
res3 <- susie_rss(zhat,R,n = n,L = 10,estimate_residual_variance = TRUE)
layout(matrix(1:2,1,2))
plot(coef(res1),coef(res2),pch = 20,xlab = "susie",
          ylab = "susie_rss(bhat,shat)")
abline(a = 0,b = 1,col = "skyblue",lty = "dashed")
plot(coef(res1),coef(res3),pch = 20,xlab = "susie",ylab = "susie_rss(z)")
abline(a = 0,b = 1,col = "skyblue",lty = "dashed")

When the residual variance is not estimated in susie_rss, the susie_rss results may be close to susie, but may no longer be exactly the same:

res4 <- susie_rss(zhat,R,n = n,L = 10)
plot(coef(res1),coef(res4),pch = 20,xlab = "susie",ylab = "susie_rss")
abline(a = 0,b = 1,col = "skyblue",lty = "dashed")

Whenever R is an "in sample" correlation matrix, we recommend estimating the residual variance.

Without providing the sample size, n, the coefficients are interpreted as the "noncentrality parameters" (NCPs), and are (roughly) related to the susie parameters by a factor of $\sqrt{n}$:

res5 <- susie_rss(zhat,R,L = 10)
plot(coef(res1),coef(res5)/sqrt(n),pch = 20,xlab = "susie",
ylab = "susie_rss/sqrt(n)")
abline(a = 0,b = 1,col = "skyblue",lty = "dashed")