In knausb/GAPIT3documentation: Documentation for GAPIT3

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

#library(GAPIT3documentation)
knitr::opts_chunk$set(fig.align = "center")

Petit et al. 2020 notation

https://doi.org/10.3389/fgene.2020.566314

$$ \gamma = X\alpha + K\beta + \epsilon $$

$\gamma$ is a vector of phenotypes
$X$ Is a vector of samples where each sample has been scored for a marker
$\alpha$ is a vector of effect size for the markers
$K$ is a matrix of kinships
$\beta$
$\epsilon_{ \mathcal{N}(0,\,\sigma^{2}) }$ is normally distributed error

Fixed effects: $X, \alpha$
Random effects: $K, \beta$, and $\epsilon$

T-test

$$ y_{i} = s_{i} + \epsilon_{ \mathcal{N}(0,\,\sigma^{2}) } $$

$y$ is a vector of phenotypic measurements
$s_{i}$ genotypes
$\epsilon_{ \mathcal{N}(0,\,\sigma^{2}) }$ is normally distributed error

The t-test is not implemented in GAPIT3, but is provided as a conceptual starting point for the more complex models that are implemented in GAPIT3. The t-test is a good starting point because the models implemented in GAPIT3 can be seen as models the build on the t-test by adding parameters.

GLM

The General Linear Model [@lipka2012gapit] (not to be confused with the generalized linear model)

$$ y = s + Q + \epsilon_{ \mathcal{N}(0,\,\sigma^{2}) } $$

$y$ is a vector of phenotypic measurements
$s$ genotypes
$Q$ indicates population structure (a categorical indicator)
$\epsilon_{dnorm}$ is normally distributed error

The $Q$ matrix may be obtained from clustering algorithms such as 'admixture', 'DAPC', 'kmeans', 'STRUCTURE', or 'fastSTRUCTURE.'

MLM

The Mixed Linear Model [@flint2005maize; @lipka2012gapit]

$$ y = s_{i} + Q + K + \epsilon_{ \mathcal{N}(0,\,\sigma^{2}) } \ $$

$y$ is a vector of phenotypic measurements
$s_{i}$ genotypes
$Q$ indicates population structure
$K$ is a kinship matrix, a sample by sample matrix indicating relatedness among samples (a random effect or cofactor)
$\epsilon_{dnorm}$ is normally distributed error

The MLM model is 'mixed' in that it incorporates 'fixed effects' (population structure, $Q$) and 'random effects' (kinship, $K$) into the model.

MLMM

Multiple Loci Mixed Model [@segura2012efficient]

$$ y = s_{i} + S + Q + K + \epsilon_{ \mathcal{N}(0,\,\sigma^{2}) } $$

$y$ is a vector of phenotypic measurements
$s_{i}$ genotypes
$S$ 'pseudo QTN' - better get more info here
$Q$ indicates population structure
$K$ is a kinship matrix, a sample by sample matrix indicating relatedness among samples (a random effect or cofactor)
$\epsilon_{dnorm}$ is normally distributed error

Uses forward and backward stepwise regression

Iterative!

SUPER

Settlement of MLM Under Progressively Exclusive Relationship [@tang2016gapit]. SUPER is an advanced version of FaST-Select [@wang2014super].

$$ y = s_{i} + Q + K + \epsilon_{ \mathcal{N}(0,\,\sigma^{2}) } $$

$y$ is a vector of phenotypic measurements
$s_{i}$ genotypes
$Q$ indicates population structure
$K$ is a kinship matrix, a sample by sample matrix indicating relatedness among samples (a random effect or cofactor)
$\epsilon_{dnorm}$ is normally distributed error

$K$ is derived from associated markers, The 'association' is determined by selecting markers that exhibit low linkage disequilibrium (a form of correlation) with markers of interest. The value of 'low' is specified by the user.

Iterative!
Linkage disequilibrium

SUPER is an advanced version of FaST-Select, The major difference between SUPER and FaST-Select is that SUPER uses bin approach to select associated markers.

FarmCPU

Fixed and random model Circulating Probability Unification [@liu2016iterative,]

$$ y = s_{i} + S + \epsilon_{ \mathcal{N}(0,\,\sigma^{2}) } \ y = K + \epsilon_{dnorm} $$

FarmCPU completely removes the confounding among kinship and markers by using a fixed effect model without a kinship derived either from all markers, or associated markers. Instead, kinship is derived from associated markers is used to select more associated markers using a maximum likelihood method.

This process overcome the model overfitting problems of stepwise regression. FarmCPU uses both fixed effect model and the random effect model iteratively

Iterative!
Linkage disequilibrium

BLINK

Bayesian-information and Linkage-disequilibrium Iteratively Nested Keyway [@huang2019blink]

$$ y = s_{i} + S + \epsilon_{ \mathcal{N}(0,\,\sigma^{2}) } \ y = S + \epsilon_{ \mathcal{N}(0,\,\sigma^{2}) } $$

The random effect model in FarmCPU to select associated markers using maximum likelihood method remains a high computing cost for large number of individuals. BLINK approximates the maximum likelihood using Bayesian Information Content (BIC) [BJK: Bayesian Information Criterion] in a fixed effect model to decrease the computational burden.

Iterative!
Linkage disequilibrium

Other major options

Compressed MLM (CMLM) individuals are replaced by groups determined by statistical clustering [@zhang2010mixed].

The optimal compression level (group number) is determined by a series of mixed models. The user specifies the search range and interval (group.from, group.to and group.by). Otherwise, MLM.

The mixed linear model (MLM) is an extreme case of CMLM where each individual is considered it's own group. The general linear model (GLM) is another extreme case of CMLM where all individuals are considered to be part of one group.

Enriched CMLM (ECMLM),

P3D/EMMAx In addition to implementing compression, GAPIT uses EMMAx/P3D1,17 to reduce computing time for MLM, CMLM, ECMLM, and SUPER. If specified, the additive genetic (σ2a) and residual (σ2e) variance components will be estimated prior to conducting GWAS. These estimates are then used for each SNP where a mixed model is fitted.

EMMA [@kang2008efficient].

Genomic selection

Genomic best linear unbiased prediction (gBLUP)
Compressed best linear unbiased prediction (cBLUP)
SUPER genomic best linear unbiased prediction (sBLUP)