knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
subgxe
is an R package for combining summary data from multiple association
studies (or multiple phenotypes in a single study) by incorporating potential
gene-environment (G-E) interactions into the testing procedure. It is an
implementation of the p value-assisted subset testing for associations (pASTA)
framework proposed by Yu et al(2019). pASTA aims to identify a subset of
studies or traits that yields the strongest evidence of associations and
calculates a meta-analytic p-value from that subset. This vignette offers a
brief introduction to the basic use of subgxe
. For more details on how pASTA
works, please refer to the paper.
We simulated $K=5$ independent case-control studies that are included in the
package (subgxe::studies
) to illustrate the basic use of subgxe
. In each
data set, G
, E
, and D
denote the genetic variant, environmental factor,
and disease status (binary outcome) respectively. In this case, G
is coded as
binary (under a dominant or recessive susceptibility model). It can also be
coded as allele count (under the additive model). Two of the 5 studies have
non-null genetic associations with the true marginal genetic odds ratio being
1.09. Each study has 6,000 cases and 6,000 controls, with the total sample size
$n_k$ being 12000. For the specific underlying parameters of the data generating
model, please refer to the original article
(Table A4, Scenario 2).
# install.packages("subgxe") library(subgxe)
We first obtain a vector of input p-values of length $K$ by conducting an association test for each study. For each study $k$, the joint model with G-E interaction taken into account is $$ \mathrm{logit}[E(D_{ki}|G_{ki},E_{ki})] = \beta_0^{(k)} + \beta_G^{(k)}G_{ki} + \beta_E^{(k)}E_{ki} + \beta_{GE}^{(k)} G_{ki}E_{ki} $$ where $i = 1, \cdots, n_k$. This model can be further adjusted for potential confounders, which we drop from the formula for simplicity of notation.
To detect the genetic association while accounting for the G-E interaction, we
test the null hypothesis
$$
\beta_{G}^{(k)}=\beta_{GE}^{(k)}=0
$$
based on the joint model for each study. The coefficients can be estimated
by maximum likelihood using the glm
function. For alternative null hypotheses
and methods for estimation of coefficients, refer to the paper.
A common choice for testing the null hypothesis is the likelihood ratio test
(LRT) with 2 degrees of freedom, which can be carried out with the lrtest
function in the package lmtest
. We will use the results of the LRT in our
example. For comparative purposes, we also look at the p-values of the
marginal genetic associations obtained by Wald test, i.e., the p-values of
$\hat{\alpha}_G^{(k)}$ in the model
$$\text{logit}[E(D_{ki}|G_{ki})]=\alpha_0^{(k)}+\alpha_G^{(k)}G_{ki}$$
library(lmtest) # number of studies K <- 5 study.pvals.marg <- NULL study.pvals.joint <- NULL for (i in 1:K) { joint.model <- glm(D ~ G + E + I(G*E), data = studies[[i]], family = "binomial") null.model <- glm(D ~ E, data = studies[[i]], family = "binomial") marg.model <- glm(D ~ G, data = studies[[i]], family = "binomial") study.pvals.marg[i] <- summary(marg.model)$coef[2, 4] study.pvals.joint[i] <- lmtest::lrtest(null.model, joint.model)[2, 5] }
Now, we just need to specify the cor
parameter -- the correlation matrix of
the study-specific p-values. In this example, since the studies are independent,
the p-values are also independent, and therefore the cor
should be the
identity matrix. In a multiple-phenotype analysis where the phenotypes are
measured on the same set of subjects, one way to approximate the correlations
among p-values is to use the phenotypic correlations.
Now, lets use the pasta
function to perform subset analysis and obtain a
meta-analytic p-value for the genetic association.
study.sizes <- c(nrow(studies[[1]]), nrow(studies[[2]]), nrow(studies[[3]]), nrow(studies[[4]]), nrow(studies[[5]])) cor.matrix <- diag(1, K) pasta.joint <- pasta(p.values = study.pvals.joint, study.sizes = study.sizes, cor = cor.matrix ) pasta.marg <- pasta(p.values = study.pvals.marg, study.sizes = study.sizes, cor = cor.matrix ) pasta.joint$p.pasta pasta.joint$test.statistic$selected.subset pasta.marg$p.pasta pasta.marg$test.statistic$selected.subset
pasta
yields a meta-analytic p-value of
r formatC(pasta.joint$p.pasta, format = "f", digits = 3)
when the G-E interaction is taken into account, and identifies
the first two studies as non-null. On the other hand, if we only consider the
marginal associations, the meta-analytic p-value becomes much larger
(p=r formatC(pasta.marg$p.pasta, format = "f", digits = 3)
) and the
first three studies are identified as having significant associations.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.