gaston-package: gaston
In gaston: Genetic Data Handling (QC, GRM, LD, PCA) & Linear Mixed Models

gaston-package

R Documentation

gaston

Description

Manipulation of genetic data (SNPs), computation of Genetic Relationship Matrix, Linkage Disequilibrium, etc. Efficient algorithms for Linear Mixed Model (AIREML, diagonalisation trick).

Introducing gaston

Gaston offers functions for efficient manipulation of large genotype (SNP) matrices, and state-of-the-art implementation of algorithms to fit Linear Mixed Models, that can be used to compute heritability estimates or to perform association tests.

Thanks to the packages Rcpp, RcppParallel, RcppEigen, gaston functions are mainly written in C++.

Many functions are multithreaded; the number of threads can be setted through RcppParallel function setThreadOptions. It is advised to try several values for the number of threads, as using too many threads might be conterproductive due to an important overhead.

Some functions have a verbose argument, which controls the function verbosity. To mute all functions at once you can use options(gaston.verbose = FALSE).

Genotype matrices

An S4 class for genotype matrices is defined, named bed.matrix. Each row corresponds to an individual, and each column to a SNP. They can be read from files using read.bed.matrix and saved using write.bed.matrix. The function read.vcf reads VCF files.

In first approach, a bed.matrix behaves as a "read-only" matrix containing only 0, 1, 2 and NAs, unless the genotypes are standardized (use standardize<-). They are stored in a compact form, each genotype being coded on 2 bits (hence 4 genotypes per byte).

Bed.matrices can be converted to numerical matrices with as.matrix, and multiplied with numeric vectors or matrices with %*% (this feature can be used e.g. to simulate quantitative phenotypes, see a basic example in the example section of association.test).

It is possible to subset bed.matrices just as base matrices, writing e.g. x[1:100,] to extract the first 100 individuals, or x[1:100,1000:1999] for extract the SNPs 1000 to 1999 for these 100 individuals. The use of logical vectors for subsetting is allowed too. The functions select.inds and select.snps can also be used for subsetting with a nice syntax.

Some basic descriptive statistics can be added to a bed.matrix with set.stats (since gaston 1.4, this function is called by default by all functions that create a bed.matrix, unless options(gaston.auto.set.stats = FALSE) was set. Hardy-Weinberg Equilibrium can be tested for all SNPs with set.hwe.

Crossproducts of standardized matrices

If X is a standardized n\times q genotype matrix, a Genetic Relationship Matrix (GRM) of the individuals can be computed as

GRM = {1\over q-1} XX’

where q is the number of SNPs. This computation is done by the function GRM. The eigen decomposition of the GRM produces the Principal Components (PC) of the data set. If needed, the loadings corresponding to the PCs can be retrieved using bed.loadings.

Doing the above crossproduct in the reverse order produces a moment estimate of the Linkage Disequilibrium:

LD = {1\over n-1} X’X

where n is the number of individuals. This computation is done by the function LD (usually, only parts of the whole LD matrix is computed). This method is also used by LD.thin to extract a set of SNPs in low linkage disequilibrium (it is often recommended to perform this operation before computing the GRM).

Linear Mixed Models

lmm.aireml is a function for linear mixed models parameter estimation and BLUP computations.

The model considered is of the form

Y = X\beta + \omega_1 + \ldots + \omega_k + \varepsilon

with \omega_i \sim N(0,\tau_i K_i) for i \in 1, \dots,k and \varepsilon \sim N(0,\sigma^2 I_n) .

Note that very often in genetics a mixed model is written as

Y = X\beta + Zu + \varepsilon

with Z a standardized genotype matrix, and u\sim N(0, \tau I_q). In that case, denoting \omega = Zu, \omega \sim N(0, \tau ZZ') and letting K=ZZ' we get a mixed model of the previous form.

When k=1 in the above general model (only one random term \omega), the likelihood can be computed very efficiently using the eigen decomposition of K = \mathrm{var}(\omega). This "diagonalization trick" is used in lmm.diago.likelihood and lmm.diago, to compute the likelihood and for parameter estimation, respectively.

Two small functions complete this set of functions: lmm.simu, to simulate data under a linear mixed model, and random.pm, to generate random positive matrices. Both are used in examples and can be useful for data simulation.