knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
subgxe
is an R package for combining summary data from multiple association
studies or multiple phenotypes in a single study by incorporating potential
gene-environment (G-E) interactions into the testing procedure. It is an
implementation of the p value-assisted subset testing for associations (pASTA)
framework proposed by Yu et al(2019). The goal is to identify a subset of
studies or traits that yields the strongest evidence of associations and give a
meta-analytic p-value. This vignette offers a brief introduction to the basic
use of subgxe
. For more details on the algorithms used by subgxe
please refer
to the paper.
We use simulated data of $K=5$ independent case-control studies that come along
with the package to illustrate the basic use of subgxe
. In each data set,
G
, E
, and D
denote the genetic variant, environmental factor, and
disease status (binary outcome), respectively. In this case, G
is coded as
binary (under a dominant or recessive susceptibility model). It can also be
coded as allele count (under the additive model). Two of the 5 studies have
non-null genetic associations with the true marginal genetic odds ratio
being 1.09. Each study has 6,000 cases and 6,000 controls, with the total
sample size $n_k$ being 12,000. For the specific underlying parameters of the
data generating model, please refer to the original
article (Table A4, Scenario 2).
# library(devtools) # install_github("umich-cphds/subgxe", build_opts = c()) library(subgxe)
We first obtain a $K\times1$ vector of input p-values by conducting association test for each study. For study $k$, $k=1,\cdots,K$, the joint model with G-E interaction is $$\text{logit}[E(D_{ki}|G_{ki},E_{ki})]=\beta_0^{(k)}+\beta_G^{(k)}G_{ki}+\beta_E^{(k)}E_{ki}+\beta_{GE}^{(k)}G_{ki}E_{ki}$$ where $i=1,\cdots,n_k$. The model can be further adjusted for potential confounders, which we drop from the presentation for the simplicity of notation.
To detect the genetic association while accounting for the G-E interaction,
one can test the null hypothesis $$\beta_{G}^{(k)}=\beta_{GE}^{(k)}=0$$ based
on the joint model for each study $k$. The coefficients can be estimated by
maximum likelihood using the glm
function. For alternative null hypotheses and
methods for estimation of coefficients, see the
reference mentioned above.
A common choice for testing the null hypothesis
$\beta_{G}^{(k)}=\beta_{GE}^{(k)}=0$ is the likelihood ratio test (LRT) with 2
degrees of freedom, which can be carried out with the lrtest
function in the
package lmtest
. We use the results of LRT as an example to demonstrate the
use of subgxe
. For comparative purposes, we also look at the p-values of the
marginal genetic associations obtained by Wald test, i.e. the p-values of
$\hat{\alpha}_G^{(k)}$ in the model
$$\text{logit}[E(D_{ki}|G_{ki})]=\alpha_0^{(k)}+\alpha_G^{(k)}G_{ki}.$$
library(lmtest) K <- 5 # number of studies study.pvals.marg <- NULL study.pvals.joint <- NULL for(i in 1:K){ joint.model <- glm(D ~ G + E + I(G*E), data=studies[[i]], family="binomial") null.model <- glm(D ~ E, data=studies[[i]], family="binomial") marg.model <- glm(D ~ G, data=studies[[i]], family="binomial") study.pvals.marg[i] <- summary(marg.model)$coef[2,4] study.pvals.joint[i] <- lmtest::lrtest(null.model, joint.model)[2,5] }
Then we use the pasta()
function in the subgxe
package to conduct subset
analysis and obtain a meta-analytic p-value for the genetic association.
cor
parameter is a correlation matrix of the study-specific p-values.
In this example, since the studies are independent, the p-values are
independent as well, and therefore the cor
should be an identity matrix. In
a multiple-phenotype analysis where the phenotypes are measured on the same
set of subjects, one way to approximate the correlations among p-values is to
use the phenotypic correlations.study.sizes <- c(nrow(studies[[1]]), nrow(studies[[2]]), nrow(studies[[3]]), nrow(studies[[4]]), nrow(studies[[5]])) cor.matrix <- diag(1, K) pasta.joint <- pasta(p.values=study.pvals.joint, study.sizes=study.sizes, cor=cor.matrix) pasta.marg <- pasta(p.values=study.pvals.marg, study.sizes=study.sizes, cor=cor.matrix) pasta.joint$p.pasta # delete 'joint' pasta.joint$test.statistic$selected.subset pasta.marg$p.pasta # delete 'joint' pasta.marg$test.statistic$selected.subset
From the output we observe that when the G-E interaction is taken into account,
pasta
yields a meta-analytic p-value of
r formatC(pasta.joint$p.pasta, format = "f", digits = 3)
and identifies
the first two studies as non-null. On the other hand, if we only consider the
marginal associations, the meta-analytic p-value becomes much larger
(p=r formatC(pasta.marg$p.pasta, format = "f", digits = 3)
) and the
first three studies are identified as having significant associations.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.