Association analysis for HapAlleles and HapBlocks

Description

Given a GHap.blmm object (as supplied by the ghap.blmm function), this function returns association statistics for HapAllele or HapBlocks.

Usage

1
2
ghap.assoc(blmm, haplo, type = "HapAllele", gc = TRUE, 
           only.active.alleles = TRUE, ncores = 1)

Arguments

blmm

A GHap.blmm object, such as supplied by the ghap.blmm function.

haplo

A GHap.haplo object.

type

A character value specifying whether association analysis should be carried on alleles or blocks. If type="HapAlleles" (default), association results are based on chi-squared test computed from least squares regression. If type="HapBlock", HapAlleles within each block are fitted together and an F test is reported.

only.active.alleles

A logical value specifying whether calculations should be reported only for active haplotype alleles (default = TRUE).

gc

A logical value specifying whether genomic control should be performed (default = TRUE). Currently, this option does not take effect on HapBlocks.

ncores

A numeric value specifying the number of cores to be used in parallel computations (default = 1).

Details

Consider the residuals from the linear mixed model analysis:

\mathbf{\hat{e}} = \mathbf{y} - \mathbf{X\hat{b}} - \mathbf{Z\hat{u}}

These residuals can be viewed as adjusted records accounting for covariates and polygenic effects, i.e. \mathbf{y}_{adj} = \mathbf{\hat{e}}. We can then conduct least squares regression to test each HapAllele at a time for association with phenotypes. The fixed effect, error variance and test statistic of a given HapAllele are estimated as:

\hat{a}_i = (\mathbf{h}_i'\mathbf{h}_i)^{-1}\mathbf{h}_i'\mathbf{y}_{adj}

VAR(\hat{a}_i) = (\mathbf{h}_i'\mathbf{h}_i)^{-1}\hat{σ}_e^2

t_i^2 = \frac{\hat{a}_i^2}{VAR(\hat{a}_i)}

Under the null hypothesis that the regression coefficient is zero t_i^2 \sim χ^2(ν = 1). Given a GHap.blmm object, the function regresses adjusted records on each HapAllele. As the linear mixed model admits repeated measures through \mathbf{Z}, the adjusted records are dully mapped against the vector of HapGenotypes.

The user must be aware of two known caveats associated with this approach. First, by pre-adjusting records instead of estimating HapAllele effects based on generalized least squares equations we ignore covariance structure and therefore bias the estimates downwards (Svishcheva et al., 2012). Second, each HapAllele being tested is also included in the kinship matrix, such that the HapAllele is included twice in the model: as fixed and random effect. This problem is known as proximal contamination (Listgarten et al., 2012). In the first case, we can use genomic control to recover p-values to an unbiased scale (Devlin and Roeder, 1999; Amin et al., 2007). However, not much can be done regarding the estimates of the effects. As a general recommendation, if the user is only interested in the p-values, the ghap.assoc analysis should be sufficient. When effect estimates are of interest, the user can include the candidate HapAllele as a fixed effect in the full model in ghap.blmm. For the second case, a leave-one-chromosome-out (LOCO analysis) procedure can mitigate proximal contamination (Yang et al., 2014).

If the argument type is supplied with the option "HapBlock" instead of the default "HapAllele", the following HapBlock test is applied: Let HapAlleles 1, 2, ..., m have frequencies p_1, p_2, ..., p_m. Also, let α_{1j} be the average effect of substituting HapAllele 1 by HapAllele j, such that there are m - 1 independent substitution effects to be estimated. We define as HapAllele 1 the one presenting the smallest frequency. Under this framework, we can obtain least squares estimates as:

\hat{\mathbf{α}} = (\mathbf{M}'\mathbf{M})^{-1}\mathbf{M}'\mathbf{y}_{adj}

where \mathbf{M} is a n x m - 1 sub-HapGenotypes matrix for the HapAlleles within the block. Each entry m_{ij} of this matrix take values 2p_j, -(1-2p_j) and -2(1-p_j) for HapGenotypes 0, 1 and 2, respectively. This parameterization follows Da (2015).

Then, given the explained (ESS) and the residual (RSS) sum of squares, an F test is computed instead of the chi-squared test:

F = \frac{ESS/ν_1}{RSS/ν_2}

ESS = (\mathbf{M}\hat{\mathbf{α}} - \hat{μ}_y)'(\mathbf{M}\hat{\mathbf{α}} - \hat{μ}_y)

RSS = (\mathbf{y}_{adj} - \mathbf{M}\hat{\mathbf{α}})'(\mathbf{y}_{adj} - \mathbf{M}\hat{\mathbf{α}})

\hat{μ}_y = \frac{1}{N} ∑_{i=1}^{n} y_i

ν_1 = m - 1

ν_2 = n - m - 1

Under the null hypothesis VAR(\mathbf{M}α) = 0 (i.e, variance due to the HapBlock is zero) F \sim F(ν_1,ν_2).

Value

If type="HapAllele", the function returns a data.frame with the following columns:

BLOCK

Block alias.

CHR

Chromosome name.

BP1

Block start position.

BP2

Block end position.

ALLELE

Haplotype allele identity.

BETA

BLUE for the fixed effect of the haplotype allele.

SE

Standard error for the fixed effect.

FREQ

Frequency of the haplotype allele.

CHISQ.OBS

Observed value for the test statistics. If gc = TRUE (default), these values are scaled by the inflation factor. Inflation is computed through regression of observed quantiles onto expected quantiles. In order to avoid overestimation, only HapAlleles with test statistics within three standard deviations from the mean are used to compute the inflation factor.

CHISQ.EXP

Expected values for the test statistics.

logP

log10(1/P) or -log10(P) for the fixed effect.

If type="HapBlock", the function returns a data.frame with the following columns:

BLOCK

Block alias.

CHR

Chromosome name.

BP1

Block start position.

BP2

Block end position.

N.ALLELES

Number of HapAlleles per block.

F.TEST

F statistic for HapBlock association.

logP

log10(1/P) or -log10(P) for the F statistic.

Author(s)

Yuri Tani Utsunomiya <ytutsunomiya@gmail.com>

References

N. Amin et al. A Genomic Background Based Method for Association Analysis in Related Individuals. PLoS ONE. 2007. 2:e1274.

Y. Da. Multi-allelic haplotype model based on genetic partition for genomic prediction and variance component estimation using SNP markers. BMC Genet. 2015. 16:144.

B. Devlin and K. Roeder. Genomic control for association studies. Biometrics. 1999. 55:997-1004.

J. Listgarten et al. Improved linear mixed models for genome-wide association studies. Nat. Methods. 2012. 9:525-526.

G. R. Svishcheva et al. Rapid variance components-based method for whole-genome association analysis. Nat Genet. 2012. 44:1166-1170.

J. Yang et al. Advantages and pitfalls in the application of mixed-model association methods. Nat. Genet. 2014. 46: 100-106.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
# #### DO NOT RUN IF NOT NECESSARY ###
# 
# # Copy the example data in the current working directory
# ghap.makefile()
# 
# # Load data
# phase <- ghap.loadphase("human.samples", "human.markers", "human.phase")
# 
# # Subset data - randomly select 3000 markers with maf > 0.02
# maf <- ghap.maf(phase, ncores = 2)
# set.seed(1988)
# markers <- sample(phase$marker[maf > 0.02], 3000, replace = FALSE)
# phase <- ghap.subsetphase(phase, unique(phase$id), markers)
# rm(maf,markers)
# 
# # Generate block coordinates based on windows of 10 markers, sliding 5 marker at a time
# blocks <- ghap.blockgen(phase, 10, 5, "marker")
# 
# # Generate matrix of haplotype genotypes
# ghap.haplotyping(phase, blocks, batchsize = 100, ncores = 2, freq = 0.05, outfile = "example")
# 
# # Load haplotype genotypes
# haplo <- ghap.loadhaplo("example.hapsamples", "example.hapalleles", "example.hapgenotypes")
# 
# # Compute kinship matrix
# K <- ghap.kinship(haplo, batchsize = 100)
# 
# # Quantitative trait with 50% heritability
# # One major haplotype accounting for 30% of the genetic variance
# sim <- ghap.simpheno(haplo = haplo, K = K, h2 = 0.5, g2 = 0.3, major = 1000,seed=1988)
# 
# #Continuous model
# model <- ghap.mme(fixed = phenotype ~ 1, random = "individual", data = sim$data, K = K,
#                   varcomp = c(sim$vare,sim$varu))
# 
# ### RUN ###
# 
# #HapAllele GWAS
# gwas.allele <- ghap.assoc(blmm = model, haplo = haplo, type = "HapAllele", gc = TRUE, ncores=2)
# gwas.allele$POS <- (gwas.allele$BP1+gwas.allele$BP2)/2e+6
# plot(gwas.allele$POS,gwas.allele$logP, xlab="Chromosome 2 position (in Mb)", 
#      ylab=expression(-log[10](p)), pch=20, col="#471FAA99")
# abline(v=(haplo$bp1[1000]+haplo$bp2[1000])/2e+6,lty=3)
# 
# #HapBlock GWAS
# gwas.block <- ghap.assoc(blmm = model, haplo = haplo, type = "HapBlock", ncores=2)
# gwas.block$POS <- (gwas.block$BP1+gwas.block$BP2)/2e+6
# plot(gwas.block$POS,gwas.block$logP, xlab="Chromosome 2 position (in Mb)",
#      ylab=expression(-log[10](p)), pch=20, col="#471FAA99")
# abline(v=(haplo$bp1[1000]+haplo$bp2[1000])/2e+6,lty=3)