# Association analysis for HapAlleles and HapBlocks

### Description

Given a GHap.blmm object (as supplied by the `ghap.blmm`

function), this function returns association statistics for HapAllele or HapBlocks.

### Usage

1 2 | ```
ghap.assoc(blmm, haplo, type = "HapAllele", gc = TRUE,
only.active.alleles = TRUE, ncores = 1)
``` |

### Arguments

`blmm` |
A GHap.blmm object, such as supplied by the |

`haplo` |
A GHap.haplo object. |

`type` |
A character value specifying whether association analysis should be carried on alleles or blocks. If type="HapAlleles" (default), association results are based on chi-squared test computed from least squares regression. If type="HapBlock", HapAlleles within each block are fitted together and an F test is reported. |

`only.active.alleles` |
A logical value specifying whether calculations should be reported only for active haplotype alleles (default = TRUE). |

`gc` |
A logical value specifying whether genomic control should be performed (default = TRUE). Currently, this option does not take effect on HapBlocks. |

`ncores` |
A numeric value specifying the number of cores to be used in parallel computations (default = 1). |

### Details

Consider the residuals from the linear mixed model analysis:

*\mathbf{\hat{e}} = \mathbf{y} - \mathbf{X\hat{b}} - \mathbf{Z\hat{u}}*

These residuals can be viewed as adjusted records accounting for covariates and polygenic effects, i.e. *\mathbf{y}_{adj} = \mathbf{\hat{e}}*. We can then conduct least squares regression to test each HapAllele at a time for association with phenotypes. The fixed effect, error variance and test statistic of a given HapAllele are estimated as:

*\hat{a}_i = (\mathbf{h}_i'\mathbf{h}_i)^{-1}\mathbf{h}_i'\mathbf{y}_{adj}*

*VAR(\hat{a}_i) = (\mathbf{h}_i'\mathbf{h}_i)^{-1}\hat{σ}_e^2*

*t_i^2 = \frac{\hat{a}_i^2}{VAR(\hat{a}_i)}*

Under the null hypothesis that the regression coefficient is zero *t_i^2 \sim χ^2(ν = 1)*. Given a GHap.blmm object, the function regresses adjusted records on each HapAllele. As the linear mixed model admits repeated measures through *\mathbf{Z}*, the adjusted records are dully mapped against the vector of HapGenotypes.

The user must be aware of two known caveats associated with this approach. First, by pre-adjusting records instead of estimating HapAllele effects based on generalized least squares equations we ignore covariance structure and therefore bias the estimates downwards (Svishcheva et al., 2012). Second, each HapAllele being tested is also included in the kinship matrix, such that the HapAllele is included twice in the model: as fixed and random effect. This problem is known as proximal contamination (Listgarten et al., 2012). In the first case, we can use genomic control to recover p-values to an unbiased scale (Devlin and Roeder, 1999; Amin et al., 2007). However, not much can be done regarding the estimates of the effects. As a general recommendation, if the user is only interested in the p-values, the ghap.assoc analysis should be sufficient. When effect estimates are of interest, the user can include the candidate HapAllele as a fixed effect in the full model in ghap.blmm. For the second case, a leave-one-chromosome-out (LOCO analysis) procedure can mitigate proximal contamination (Yang et al., 2014).

If the argument type is supplied with the option "HapBlock" instead of the default "HapAllele", the following HapBlock test is applied: Let HapAlleles *1*, *2*, ..., *m* have frequencies *p_1*, *p_2*, ..., *p_m*. Also, let *α_{1j}* be the average effect of substituting HapAllele *1* by HapAllele *j*, such that there are *m - 1* independent substitution effects to be estimated. We define as HapAllele *1* the one presenting the smallest frequency. Under this framework, we can obtain least squares estimates as:

*\hat{\mathbf{α}} = (\mathbf{M}'\mathbf{M})^{-1}\mathbf{M}'\mathbf{y}_{adj}*

where *\mathbf{M}* is a *n* x *m - 1* sub-HapGenotypes matrix for the HapAlleles within the block. Each
entry *m_{ij}* of this matrix take values *2p_j*, *-(1-2p_j)* and *-2(1-p_j)* for HapGenotypes *0*, *1* and *2*, respectively. This parameterization follows Da (2015).

Then, given the explained (*ESS*) and the residual (*RSS*) sum of squares, an F test is computed instead of the chi-squared test:

*F = \frac{ESS/ν_1}{RSS/ν_2}*

*ESS = (\mathbf{M}\hat{\mathbf{α}} - \hat{μ}_y)'(\mathbf{M}\hat{\mathbf{α}} - \hat{μ}_y)*

*RSS = (\mathbf{y}_{adj} - \mathbf{M}\hat{\mathbf{α}})'(\mathbf{y}_{adj} - \mathbf{M}\hat{\mathbf{α}})*

*\hat{μ}_y = \frac{1}{N} ∑_{i=1}^{n} y_i*

*ν_1 = m - 1*

*ν_2 = n - m - 1*

Under the null hypothesis *VAR(\mathbf{M}α) = 0* (i.e, variance due to the HapBlock is zero) *F \sim F(ν_1,ν_2)*.

### Value

If type="HapAllele", the function returns a data.frame with the following columns:

`BLOCK` |
Block alias. |

`CHR` |
Chromosome name. |

`BP1` |
Block start position. |

`BP2` |
Block end position. |

`ALLELE` |
Haplotype allele identity. |

`BETA` |
BLUE for the fixed effect of the haplotype allele. |

`SE` |
Standard error for the fixed effect. |

`FREQ` |
Frequency of the haplotype allele. |

`CHISQ.OBS` |
Observed value for the test statistics. If gc = TRUE (default), these values are scaled by the inflation factor. Inflation is computed through regression of observed quantiles onto expected quantiles. In order to avoid overestimation, only HapAlleles with test statistics within three standard deviations from the mean are used to compute the inflation factor. |

`CHISQ.EXP` |
Expected values for the test statistics. |

`logP` |
log10(1/P) or -log10(P) for the fixed effect. |

If type="HapBlock", the function returns a data.frame with the following columns:

`BLOCK` |
Block alias. |

`CHR` |
Chromosome name. |

`BP1` |
Block start position. |

`BP2` |
Block end position. |

`N.ALLELES` |
Number of HapAlleles per block. |

`F.TEST` |
F statistic for HapBlock association. |

`logP` |
log10(1/P) or -log10(P) for the F statistic. |

### Author(s)

Yuri Tani Utsunomiya <ytutsunomiya@gmail.com>

### References

N. Amin et al. A Genomic Background Based Method for Association Analysis in Related Individuals. PLoS ONE. 2007. 2:e1274.

Y. Da. Multi-allelic haplotype model based on genetic partition for genomic prediction and variance component estimation using SNP markers. BMC Genet. 2015. 16:144.

B. Devlin and K. Roeder. Genomic control for association studies. Biometrics. 1999. 55:997-1004.

J. Listgarten et al. Improved linear mixed models for genome-wide association studies. Nat. Methods. 2012. 9:525-526.

G. R. Svishcheva et al. Rapid variance components-based method for whole-genome association analysis. Nat Genet. 2012. 44:1166-1170.

J. Yang et al. Advantages and pitfalls in the application of mixed-model association methods. Nat. Genet. 2014. 46: 100-106.

### Examples

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 | ```
# #### DO NOT RUN IF NOT NECESSARY ###
#
# # Copy the example data in the current working directory
# ghap.makefile()
#
# # Load data
# phase <- ghap.loadphase("human.samples", "human.markers", "human.phase")
#
# # Subset data - randomly select 3000 markers with maf > 0.02
# maf <- ghap.maf(phase, ncores = 2)
# set.seed(1988)
# markers <- sample(phase$marker[maf > 0.02], 3000, replace = FALSE)
# phase <- ghap.subsetphase(phase, unique(phase$id), markers)
# rm(maf,markers)
#
# # Generate block coordinates based on windows of 10 markers, sliding 5 marker at a time
# blocks <- ghap.blockgen(phase, 10, 5, "marker")
#
# # Generate matrix of haplotype genotypes
# ghap.haplotyping(phase, blocks, batchsize = 100, ncores = 2, freq = 0.05, outfile = "example")
#
# # Load haplotype genotypes
# haplo <- ghap.loadhaplo("example.hapsamples", "example.hapalleles", "example.hapgenotypes")
#
# # Compute kinship matrix
# K <- ghap.kinship(haplo, batchsize = 100)
#
# # Quantitative trait with 50% heritability
# # One major haplotype accounting for 30% of the genetic variance
# sim <- ghap.simpheno(haplo = haplo, K = K, h2 = 0.5, g2 = 0.3, major = 1000,seed=1988)
#
# #Continuous model
# model <- ghap.mme(fixed = phenotype ~ 1, random = "individual", data = sim$data, K = K,
# varcomp = c(sim$vare,sim$varu))
#
# ### RUN ###
#
# #HapAllele GWAS
# gwas.allele <- ghap.assoc(blmm = model, haplo = haplo, type = "HapAllele", gc = TRUE, ncores=2)
# gwas.allele$POS <- (gwas.allele$BP1+gwas.allele$BP2)/2e+6
# plot(gwas.allele$POS,gwas.allele$logP, xlab="Chromosome 2 position (in Mb)",
# ylab=expression(-log[10](p)), pch=20, col="#471FAA99")
# abline(v=(haplo$bp1[1000]+haplo$bp2[1000])/2e+6,lty=3)
#
# #HapBlock GWAS
# gwas.block <- ghap.assoc(blmm = model, haplo = haplo, type = "HapBlock", ncores=2)
# gwas.block$POS <- (gwas.block$BP1+gwas.block$BP2)/2e+6
# plot(gwas.block$POS,gwas.block$logP, xlab="Chromosome 2 position (in Mb)",
# ylab=expression(-log[10](p)), pch=20, col="#471FAA99")
# abline(v=(haplo$bp1[1000]+haplo$bp2[1000])/2e+6,lty=3)
``` |