Description Introducing gaston Genotype matrices Crossproducts of standardized matrices Linear Mixed Models Author(s)

Manipulation of genetic data (SNPs), computation of Genetic Relationship Matrix, Linkage Disequilibrium, etc. Efficient algorithms for Linear Mixed Model (AIREML, diagonalisation trick).

Gaston offers functions for efficient manipulation of large genotype (SNP) matrices, and state-of-the-art implementation of algorithms to fit Linear Mixed Models, that can be used to compute heritability estimates or to perform association tests.

Thanks to the packages `Rcpp`

,
`RcppParallel`

,
`RcppEigen`

, gaston
functions are mainly written in C++.

Many functions are multithreaded;
the number of threads can be setted through `RcppParallel`

function `setThreadOptions`

.
It is advised to try several values for the number of threads, as
using too many threads might be conterproductive due to an important
overhead.

Some functions have a `verbose`

argument, which controls the
function verbosity. To mute all functions at once you can use
`options(gaston.verbose = FALSE)`

.

An S4 class for genotype matrices is defined, named `bed.matrix`

.
Each row corresponds to an individual, and each column to a SNP. They can
be read from files using `read.bed.matrix`

and saved using `write.bed.matrix`

. The function `read.vcf`

reads
VCF files.

In first approach, a bed.matrix behaves as a "read-only" matrix containing only
0, 1, 2 and NAs, unless the genotypes are standardized (use `standardize<-`

).
They are stored in a compact form, each genotype being coded on 2 bits (hence
4 genotypes per byte).

Bed.matrices can be converted to numerical matrices with `as.matrix`

,
and multiplied with numeric vectors or matrices with `%*%`

(this
feature can be used e.g. to simulate quantitative phenotypes, see a basic example in the example
section of `association.test`

).

It is possible to subset bed.matrices just as base matrices, writing e.g.
`x[1:100,]`

to extract the first 100 individuals, or `x[1:100,1000:1999]`

for extract the SNPs 1000 to 1999 for these 100 individuals. The use of logical
vectors for subsetting is allowed too. The functions
`select.inds`

and `select.snps`

can also be used for
subsetting with a nice syntax.

Some basic descriptive statistics can be added to a bed.matrix with `set.stats`

(since
`gaston 1.4`

, this function is called by default by all functions that create a bed.matrix, unless
`options(gaston.auto.set.stats = FALSE)`

was set.
Hardy-Weinberg Equilibrium can be tested for all SNPs with `set.hwe`

.

If *X* is a standardized *n x q* genotype matrix, a Genetic Relationship Matrix
(GRM) of the individuals can be computed as

*GRM = XX’/(q-1)*

where *q* is the number of SNPs.
This computation is done by the function `GRM`

. The eigen decomposition of the GRM produces
the Principal Components (PC) of the data set. If needed, the loadings
corresponding to the PCs can be retrieved using `bed.loadings`

.

Doing the above crossproduct in the reverse order produces a moment estimate of the Linkage Disequilibrium:

*LD = X’X/(n-1)*

where *n* is the number of individuals. This computation is done by the function
`LD`

(usually, only parts of the whole LD matrix is computed). This method is
also used by `LD.thin`

to extract a set of SNPs in low linkage disequilibrium
(it is often recommended to perform this operation before computing the GRM).

`lmm.aireml`

is a function for linear mixed models parameter estimation
and BLUP computations.

The model considered is of the form

* Y = X beta + omega_1 + ... + omega_k + epsilon *

with *omega_i ~ N(0, tau_i K_i)* for *i = 1, ..., k* and
*epsilon ~ N(0, sigma^2 I_n)*.

Note that very often in genetics a mixed model is written as

* Y = X beta + Zu + epsilon *

with *Z* a standardized genotype matrix, and *u ~ N(0, tau I_q)*. In that case,
denoting *omega = Zu*, *omega ~ N(0,tau ZZ')*
and letting *K=ZZ'* we get a mixed model of the previous form.

When *k=1* in the above general model (only one random term *omega*), the likelihood
can be computed very efficiently using the eigen decomposition of
*K = var(omega)*. This "diagonalization trick"
is used in `lmm.diago.likelihood`

and `lmm.diago`

, to compute
the likelihood and for parameter estimation, respectively.

Two small functions complete this set of functions: `lmm.simu`

, to
simulate data under a linear mixed model, and `random.pm`

, to generate
random positive matrices. Both are used in examples and can be useful for data simulation.

Hervé Perdry and Claire Dandine-Roulland

Maintainer: Hervé Perdry

Embedding an R snippet on your website

Add the following code to your website.

For more information on customizing the embed code, read Embedding Snippets.