gaston-package | R Documentation |
Manipulation of genetic data (SNPs), computation of Genetic Relationship Matrix, Linkage Disequilibrium, etc. Efficient algorithms for Linear Mixed Model (AIREML, diagonalisation trick).
Gaston offers functions for efficient manipulation of large genotype (SNP) matrices, and state-of-the-art implementation of algorithms to fit Linear Mixed Models, that can be used to compute heritability estimates or to perform association tests.
Thanks to the packages Rcpp
,
RcppParallel
,
RcppEigen
, gaston
functions are mainly written in C++.
Many functions are multithreaded;
the number of threads can be setted through RcppParallel
function setThreadOptions
.
It is advised to try several values for the number of threads, as
using too many threads might be conterproductive due to an important
overhead.
Some functions have a verbose
argument, which controls the
function verbosity. To mute all functions at once you can use
options(gaston.verbose = FALSE)
.
An S4 class for genotype matrices is defined, named bed.matrix
.
Each row corresponds to an individual, and each column to a SNP. They can
be read from files using read.bed.matrix
and saved using write.bed.matrix
. The function read.vcf
reads
VCF files.
In first approach, a bed.matrix behaves as a "read-only" matrix containing only
0, 1, 2 and NAs, unless the genotypes are standardized (use standardize<-
).
They are stored in a compact form, each genotype being coded on 2 bits (hence
4 genotypes per byte).
Bed.matrices can be converted to numerical matrices with as.matrix
,
and multiplied with numeric vectors or matrices with %*%
(this
feature can be used e.g. to simulate quantitative phenotypes, see a basic example in the example
section of association.test
).
It is possible to subset bed.matrices just as base matrices, writing e.g.
x[1:100,]
to extract the first 100 individuals, or x[1:100,1000:1999]
for extract the SNPs 1000 to 1999 for these 100 individuals. The use of logical
vectors for subsetting is allowed too. The functions
select.inds
and select.snps
can also be used for
subsetting with a nice syntax.
Some basic descriptive statistics can be added to a bed.matrix with set.stats
(since
gaston 1.4
, this function is called by default by all functions that create a bed.matrix, unless
options(gaston.auto.set.stats = FALSE)
was set.
Hardy-Weinberg Equilibrium can be tested for all SNPs with set.hwe
.
If X
is a standardized n\times q
genotype matrix, a Genetic Relationship Matrix
(GRM) of the individuals can be computed as
GRM = {1\over q-1} XX’
where q
is the number of SNPs.
This computation is done by the function GRM
. The eigen decomposition of the GRM produces
the Principal Components (PC) of the data set. If needed, the loadings
corresponding to the PCs can be retrieved using bed.loadings
.
Doing the above crossproduct in the reverse order produces a moment estimate of the Linkage Disequilibrium:
LD = {1\over n-1} X’X
where n
is the number of individuals. This computation is done by the function
LD
(usually, only parts of the whole LD matrix is computed). This method is
also used by LD.thin
to extract a set of SNPs in low linkage disequilibrium
(it is often recommended to perform this operation before computing the GRM).
lmm.aireml
is a function for linear mixed models parameter estimation
and BLUP computations.
The model considered is of the form
Y = X\beta + \omega_1 + \ldots + \omega_k + \varepsilon
with \omega_i \sim N(0,\tau_i K_i)
for i \in 1, \dots,k
and
\varepsilon \sim N(0,\sigma^2 I_n)
.
Note that very often in genetics a mixed model is written as
Y = X\beta + Zu + \varepsilon
with Z
a standardized genotype matrix, and u\sim N(0, \tau I_q)
. In that case,
denoting \omega = Zu
, \omega \sim N(0, \tau ZZ')
and letting K=ZZ'
we get a mixed model of the previous form.
When k=1
in the above general model (only one random term \omega
), the likelihood
can be computed very efficiently using the eigen decomposition of
K = \mathrm{var}(\omega)
. This "diagonalization trick"
is used in lmm.diago.likelihood
and lmm.diago
, to compute
the likelihood and for parameter estimation, respectively.
Two small functions complete this set of functions: lmm.simu
, to
simulate data under a linear mixed model, and random.pm
, to generate
random positive matrices. Both are used in examples and can be useful for data simulation.
Hervé Perdry and Claire Dandine-Roulland
Maintainer: Hervé Perdry
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.