This package provides functions for testing hypothesis of genetic association with a continously measurable trait when data are from the extreme phenotype sampling (EPS) design.
library(extremesampling)
The assumed linear regression model is of the form $$\mathbf{y} = X \beta + G \beta_g + \varepsilon,$$ where $y$ is a phenotype vector (continuosly measurable trait), $X$ is a matrix of non-genetic/environmental covariates (including intercept) and $G$ is a matrix of genetic covariates. The vector $\varepsilon$ is multivariate normal with mean $0$ and covariance $I\sigma^2$ ($I$ is the identity matrix).
We have implemented score tests for testing $H_0: \beta_g = 0$ against $H_0: \beta_g \neq 0$, for two special situations
We consider two types of data analysis
Complete case analysis. The response $y_i$, non-genetic covariates $\mathbf{x}_i$ and genetic covarirates $\mathbf{g}_i$ are observed only for extreme phenotype individuals, i.e. individuals who satisfy $y_i < l_i$ or $y_i > u_i$, where thresholds $l_i$ and $u_i$ are known, but might be different for each individual in the sample.
All case analysis. The response $y_i$ and non-genetic covariates $\mathbf{x}_i$ are observed for a full sample (randomly drawn from infinite population) while the genetic covarirates $\mathbf{g}_i$ are observed only for extreme phenotype individuals.
For complete case analysis, the thresholds must be supplied by the user.
For all case analysis, the genotype matrix (or vector) must be of the same length as $\mathbf{y}$ and coded as NA where genotypes are not observed (non-extreme individuals). Further, it is assumed that the genotype is a discrete variable. The genotype distribution can be group-specific, and a matrix (dummy coded categorical covariates) that indicates these groups must be supplied.
We simulate data to illustrate score test functions. Extreme phenotype thresholds are set accoriding to the quantiles of the residual phenotype distribution ($z$-extreme sampling).
N = 5000 # Number of individuals in a population xe1 = rnorm(n = N, mean = 2, sd = 1) # Environmental covariate xe2 = rbinom(n = N, size = 1, prob = 0.3) # Environmental covariate Gmat = cbind(rbinom(N,2,0.4),rbinom(N,2,0.1),rbinom(N,2,0.25), rbinom(N,2,0.01),rbinom(N,2,0.01),rbinom(N,2,0.01)) # Genetic variants # Generate phenotype betag = c(0.1,-0.1,0.3,0.5,-0.2,0) y = rnorm(N, mean = 10+5*xe1+1*xe2+Gmat%*%betag, sd = 4) # Identify extremes, here upper and lower 25% of residual phenotype distribution z = lm(y~xe1+xe2)$residuals uz = quantile(z,probs = 3/4,na.rm=TRUE) lz = quantile(z,probs = 1/4,na.rm=TRUE) extreme = (z < lz) | (z >= uz) li = lz + (y[extreme]-z[extreme]); ui = uz + (y[extreme]-z[extreme]) # Create the EPS complete case data set yCC = y[extreme]; zCC = z[extreme] xe1CC = xe1[extreme]; xe2CC = xe2[extreme] GmatCC = Gmat[extreme,] # Create the EPS all case data set yAC = y; zAC = z xe1AC = xe1; xe2AC = xe2 GmatAC = Gmat; GmatAC[!extreme,] = NA
Each of the six columns of the genetic covariate matrix $G$ are tested one at a time against the null model $\mathbf{y} = X\beta + \varepsilon$.
cc = epsCC.test(yCC~xe1CC + xe2CC, xg = GmatCC, l = li, u = ui)
print(cc)
ac = epsAC.test(yAC~xe1AC + xe2AC, xg = GmatAC)
print(ac)
# Direct test of all 6 variants cc_d = epsCC.rv.test(yCC~xe1CC + xe2CC, xg = GmatCC, l = li, u = ui)$p.value # Collapsing test (weights = 1 for all variants) cc_c = epsCC.rv.test(yCC~xe1CC + xe2CC, xg = GmatCC, l = li, u = ui, method = "collapse")$p.value # Variance component test (weights = 1 for all variants) cc_v = epsCC.rv.test(yCC~xe1CC + xe2CC, xg = GmatCC, l = li, u = ui, method = "varcomp")$p.value ccall = data.frame(cbind(cc_d, cc_c, cc_v)) names(ccall) = c("CC", "CC collapse", "CC varcomp")
print(ccall)
# Direct test of all 6 variants ac_d = epsAC.rv.test(yAC~xe1AC + xe2AC, xg = GmatAC)$p.value # Collapsing test (weights = 1 for all variants) ac_c = epsAC.rv.test(yAC~xe1AC + xe2AC, xg = GmatAC, method = "collapse")$p.value # Variance component test (weights = 1 for all variants) ac_v = epsAC.rv.test(yAC~xe1AC + xe2AC, xg = GmatAC, method = "varcomp")$p.value acall = data.frame(cbind(ac_d, ac_c, ac_v)) names(acall) = c("AC", "AC collapse", "AC varcomp")
print(acall)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.