options(width=88) library(kableExtra)
r paste0("[![", Sys.Date(),"]","(",paste0("https://img.shields.io/badge/last%20git%20commit-", gsub('-', '--', Sys.Date()),"-turquoise.svg"), ")]","(",'https://github.com/myles-lewis/glmmSeq/blob/master/NEWS.md',")")
The aim of this package is to model gene expression with a general linear mixed
model (GLMM) as described in the R4RA study [1]. The most widely used mainstream
differential gene expression analysis tools (e.g
Limma,
DESeq2,
edgeR) are all unable to fit
mixed effects linear models. This package however fits negative binomial mixed
effects models at individual gene level using the negative.binomial
function
from MASS
and the glmer
function in
lme4
which enables random effect,
as as well as mixed effects, to be modelled.
install.packages("glmmSeq")
devtools::install_github("myles-lewis/glmmSeq")
Or you can source the functions individually:
functions = list.files("./R", full.names = TRUE) invisible(lapply(functions, source))
But you will need to load in the additional libraries:
# Install CRAN packages invisible(lapply(c("MASS", "car", "ggplot2", "ggpubr", "lme4","lmerTest", "methods", "parallel", "plotly", "pbapply", "pbmcapply"), function(p){ if(! p %in% rownames(installed.packages())) { install.packages(p) } library(p, character.only=TRUE) })) # Install BioConductor packages if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") invisible(lapply(c("qvalue"), function(p){ if(! p %in% rownames(installed.packages())) BiocManager::install(p) library(p, character.only=TRUE) }))
To get started, first we load in the package:
library(glmmSeq) set.seed(1234)
This vignette will demonstrate the power of this package using a minimal example from the PEAC data set. Here we focus on the synovial biopsy RNA-Seq data from this cohort of patients with early rheumatoid arthritis.
data(PEAC_minimal_load)
This data contains:
These are outlined in the following subsections.
metadata$EULAR_binary = NA metadata$EULAR_binary[metadata$EULAR_6m %in% c("Good", "Moderate" )] = "responder" metadata$EULAR_binary[metadata$EULAR_6m %in% c("Non-response")] = "non_responder" kable(head(metadata), row.names = F) %>% kable_styling()
kable(head(tpm)) %>% kable_styling() %>% scroll_box(width = "100%")
Using negative binomial models requires gene dispersion estimates to be made. This can be achieved in a number of ways. A common way to calculate this for gene i is to use the equation:
Dispersioni = (variancei - meani)/meani2
This can be calculated using:
disp <- apply(tpm, 1, function(x){ (var(x, na.rm=TRUE)-mean(x, na.rm=TRUE))/(mean(x, na.rm=TRUE)**2) }) head(disp)
Alternatively, we recommend using edgeR to estimate of the common, trended and tagwise dispersions across all tags:
disp <- setNames(edgeR::estimateDisp(tpm)$tagwise.dispersion, rownames(tpm)) head(disp)
or with DESeq2 using the raw counts:
dds <- DESeqDataSetFromTximport(txi = txi, colData = metadata, design = ~ 1) dds <- DESeq(dds) dispersions <- setNames(dispersions(dds), rownames(txi$counts))
There is also an option to include size factors for each gene. Again this can be estimated using:
sizeFactors <- colSums(tpm) sizeFactors <- sizeFactors / mean(sizeFactors) # normalise to mean = 1 head(sizeFactors)
Or using edgeR these can be calculated from the raw read counts:
sizeFactors <- calcNormFactors(counts, method="TMM")
Similarly, with DESeq2:
sizeFactors <- estimateSizeFactorsForMatrix(counts)
Note the sizeFactors
vector needs to be centred around 1, since it used
directly as an offset of form log(sizeFactors)
in the GLMM model.
To fit a model for one gene over time we use a formula such as:
gene expression ~ fixed effects + random effects
In R the formula is defined by both the fixed-effects and random-effects part
of the model, with the response on the left of a ~ operator and the
terms, separated by + operators, on the right. Random-effects terms are
distinguished by vertical bars ("|") separating expressions for design matrices
from grouping factors. For more information see the ?lme4::glmer
.
In this case study we want to use time and response as fixed effects and the patients as random effects:
To fit this model for all genes we can use the glmmSeq
function. Note that
this analysis can take some time, with 4 cores:
results <- glmmSeq(~ Timepoint * EULAR_6m + (1 | PATID), countdata = tpm, metadata = metadata, dispersion = disp, progress = TRUE)
or alternatively using two-factor classification with EULAR_binary:
results2 <- glmmSeq(~ Timepoint * EULAR_binary + (1 | PATID), countdata = tpm, metadata = metadata, dispersion = disp)
This creates a GlmmSeq object which contains the following slots:
names(attributes(results))
The variables used by the model are in the @modeldata
:
kable(results@modelData) %>% kable_styling()
The model fit statistics can be viewed in the @stats
slot which is a list of
items including fitted model coefficients, their standard errors and the results
of statistical tests on terms within the model using Wald type 2 Chi square. To
see the most significant interactions we can order pvals
by
Timepoint:EULAR_6m
:
stats <- summary(results) kable(stats[order(stats[, 'P_Timepoint:EULAR_6m']), ]) %>% kable_styling() %>% scroll_box(width = "100%", height = "400px")
Summary statistics on a single gene can be shown specifying the argument gene
.
summary(results, gene = "MS4A1")
Estimated means based on each gene's fitted model to show fixed effects and
their 95% confidence intervals can be seen in the @predict
slot:
predict = data.frame(results@predict) kable(predict) %>% kable_styling() %>% scroll_box(width = "100%", height = "400px")
The q-values from each of the p-value columns can be calculated using
glmmQvals
. This will print a significance table based on the cut-off (default
p=0.05) and add a matrix named qvals
to the @stats
slot:
results <- glmmQvals(results)
The glmmTMB
package is an alternative to lme4
for fitting negative binomial
GLMM models. Note, that this package optimises the dispersion parameter for each
GLMM model. This has the advantage that a predefined list of dispersions for
each gene is not required. But it also means that model fitting is significantly
slower than a negative binomial GLMM fit by lme4::glmer
with family function
MASS::negative.binomial
for which the dispersion parameter must be
pre-specified. Each glmmTMB
model takes about 0.8 seconds, so that even with 8
cores, a typical transcriptome of 20,000 genes would take about 30 minutes,
although this will vary depending on the complexity of the design formula. In
comparison, the standard glmer
pipeline with known dispersions supplied takes
about 2 minutes on 8 cores.
Although glmmTMB
is slower than lme::glmer
with known dispersion, glmmTMB
is still faster than the equivalent lme4
function glmer.nb
when the
dispersion is unknown.
Setting method = "glmmTMB"
when calling glmmSeq()
switches from using
lme4::glmer
(the default) to the glmmTMB
package. The dispersion
argument
is then no longer required. The family
argument can be used to specify
different family functions and glmmTMB
provides nbinom1
and nbinom2
(glmmSeq
uses the latter by default). Other family functions e.g. poisson
can be used.
An alternative to fitting negative binomial GLMM is to use transformed data and
fit gaussian LMM. The lmmSeq
function will fit many LMM across genes (or other
data) where random effects/mixed effects models are useful. It is almost 3x
faster than glmmSeq
. For example, gaussian normalised DNA methylation data can
be analysed using this method. In the example below the RNA-Seq gene expression
data is log transformed so that it is approximately Gaussian. Alternatively
variance stabilising transformation (VST) can be applied to count data through
DESeq2 or the voom transformation in limma voom.
logtpm <- log2(tpm + 1) lmmres <- lmmSeq(~ Timepoint * EULAR_6m + (1 | PATID), maindata = logtpm, metadata = metadata, progress = TRUE) summary(lmmres, "MS4A1")
By default, p-values for each model term are computed using Wald type 2
Chi-squared test as per car::Anova()
. The underlying code for this has been
optimised for speed.
For LMM via lmmSeq()
, an alternative to the Wald type 2 Chi-squared test is an
F-test with Satterthwaite denominator degrees of freedom, which has been
implemented using the lmerTest
package and is enabled using test.stat = "F"
.
This is not available for GLMM.
However, if a reduced model formula is specified by setting reduced
, then a
likelihood ratio test is performed instead using anova
. This will double
computation time since two (G)LMM have to be fitted for each gene. For further
information on inference testing on GLMM, we recommend
Ben Bolker's GLMM FAQ page.
glmmLRT <- glmmSeq(~ Timepoint * EULAR_6m + (1 | PATID), reduced = ~ Timepoint + EULAR_6m + (1 | PATID), countdata = tpm, metadata = metadata, dispersion = disp, verbose = FALSE) summary(glmmLRT, "MS4A1")
Similarly you can run the script for an individual gene (make sure you use
drop = FALSE
to maintain countdata
as a matrix).
MS4A1glmm <- glmmSeq(~ Timepoint * EULAR_6m + (1 | PATID), countdata = tpm["MS4A1", , drop = FALSE], metadata = metadata, dispersion = disp, verbose = FALSE)
The (g)lmer or glmmTMB fitted object for a single gene can be obtained using the
function glmmRefit()
:
fit <- glmmRefit(results, gene = "MS4A1") fit
These (g)lmer objects can be passed to other packages notably emmeans
for
visualising estimated marginal means. Note these are based on the model, not
directly on the data.
library(emmeans) emmeans(fit, ~ Timepoint | EULAR_6m) emmip(fit, ~ Timepoint | EULAR_6m)
glmmRefit()
allows a different model to be fitted using the original data. The
example below shows how to refit the model without the interaction term and then
perform a likelihood ratio test using anova
. Note for glmmSeq()
objects the
LHS of the reduced formula must be "count"
, while for lmmSeq()
objects the
LHS must be "gene"
. For glmmTMB analyses, the GLM family can also be changed.
fit2 <- glmmRefit(results, gene = "MS4A1", formula = count ~ Timepoint + EULAR_6m + (1 | PATID)) anova(fit, fit2)
For variables such as time, which are matched according to an ID (the random effect), we can examine the fitted model using plots which show estimated means and confidence intervals based on coefficients for the fitted regression model, overlaid upon the underlying data. In this case the samples are matched longitudinally over time.
Plots can be viewed using either ggplot or base graphics. We can start looking at the gene with the most significant interaction IGHV3-23:
plotColours <- c("skyblue", "goldenrod1", "mediumseagreen") modColours <- c("dodgerblue3", "goldenrod3", "seagreen4") shapes <- c(17, 19, 18) ggmodelPlot(results, geneName = "IGHV3-23", x1var = "Timepoint", x2var="EULAR_6m", xlab="Time", colours = plotColours, shapes = shapes, lineColours = plotColours, modelColours = modColours, modelSize = 10)
Alternatively plots can be generated using base graphics, here with or without
the model fit overlaid. By default p-value labels are taken from the column
names of the pvals
object in the stats
slot of the S4 result object. These
can be relabelled using the plab
argument.
oldpar <- par(mfrow=c(1, 2)) modelPlot(results2, geneName = "FGF14", x1var = "Timepoint", x2var="EULAR_binary", fontSize=0.65, colours=c("coral", "mediumseagreen"), modelColours = c("coral", "mediumseagreen"), modelLineColours = "black", modelSize = 2) modelPlot(results, geneName = "EMILIN3", x1var = "Timepoint", x2var = "EULAR_6m", colours = plotColours, plab = c("time", "response", "time:response"), addModel = FALSE) par(oldpar)
To plot the model fits alone set addPoints = FALSE
.
library(ggpubr) p1 <- ggmodelPlot(results, "ADAM12", x1var="Timepoint", x2var="EULAR_6m", xlab="Time", addPoints = FALSE, colours = plotColours) p2 <- ggmodelPlot(results, "EMILIN3", x1var="Timepoint", x2var="EULAR_6m", xlab="Time", fontSize=8, x2Offset=1, addPoints = FALSE, colours = plotColours) ggarrange(p1, p2, ncol=2, common.legend = T, legend="bottom")
The comparative fold change (for x1var
variables) between conditions (x2var
and x2Values
variables) can be plotted using an fcPlot
for all genes to
highlight significance. This type of plot is most suited to look for interaction
between time (x1var
) and a two-level factor (x2var
), looking at change
between two timepoints. In the example below from the R4RA study [1], gene
expression pre- and post-drug treatment is compared between two drugs (rituximab
& tocilizumab), using the design formula gene_counts ~ time * drug
. By setting
graphics = "plotly"
this can be viewed interactively.
r4ra_glmm <- glmmSeq(~ time * drug + (1 | Patient_ID), countdata = tpmdata, metadata, dispersion = dispersions, cores = 8, removeSingles = T) r4ra_glmm <- glmmQvals(r4ra_glmm) labels = c(..) # Genes to label fcPlot(r4ra_glmm, x1var = "time", x2var = "drug", graphics = "plotly", pCutoff = 0.05, useAdjusted = TRUE, labels = labels, colours = c('grey', 'green3', 'gold3', 'blue'))
knitr::include_graphics("r4ra_glmm_fcplot.png")
Log2 fold change between the two time points for individuals treated with rituximab on the x axis and individuals treated with tocilizumab on the y axis with each point representing a gene. Genes showing an interaction between time and drug are coloured blue or gold depending on whether their fold change is greater post-rituximab (blue) or post-tocilizumab (gold). Genes without interaction, but changing significantly over time are coloured green and tend to lie along the line of identity. See the Longitudinal tab in https://r4ra.hpc.qmul.ac.uk for an interactive version of the above plot.
An MA plot is an application of a Bland–Altman plot. The plot visualizes the differences between measurements taken in two samples, by transforming the data onto M (log ratio) and A (mean average) scales, then plotting these values.
labels = c('MS4A1', 'FGF14', 'IL2RG', 'IGHV3-23', 'ADAM12', 'IL36G', 'BLK', 'SAA1', 'CILP', 'EMILIN3', 'EMILIN2', 'IGHJ6', 'CXCL9', 'CXCL13') maPlots <- maPlot(results, x1var="Timepoint", x2var="EULAR_6m", x2Values=c("Good", "Non-response"), colours=c('grey', 'midnightblue', 'mediumseagreen', 'goldenrod'), labels=labels, graphics="ggplot") maPlots$combined
Mixed effects models are tricky to fit and lme4::glmer
sometimes returns
errors. In fact, a significant amount of code in glmmSeq
is devoted to
catching and handling errors to allow parallelisation to continue. Errors in
genes are stored in the errors
slot.
results@errors[1] # first gene error
Sometimes glmmSeq
returns errors in all genes. This usually means a problem
with not enough samples in each timepoint or a mistake in the formula. Since
version 0.5.5 glmmSeq
now returns a vector of the error messages for all
genes, which can be useful for debugging. In the example below, only timepoint 0
is specified.
results3 <- glmmSeq(~ Timepoint * EULAR_6m + (1 | PATID), countdata = tpm[, metadata$Timepoint == 0], metadata = metadata[metadata$Timepoint == 0, ], dispersion = disp)
If there are errors which are not caught by the error checking core mechanism,
this can lead to problems with grouping results after the core models have been
fit. Setting returnList=TRUE
when calling glmmSeq
returns the list output
direct from mclapply
(or parLapply
on windows). This can be helpful for
debugging unforeseeen problems in the core loop.
glmmSeq was developed by the bioinformatics team at the Experimental Medicine & Rheumatology department and Centre for Translational Bioinformatics at Queen Mary University London.
If you use this package please cite as:
citation("glmmSeq")
Statistical software used in this package:
lme4: Douglas Bates, Martin Maechler, Ben Bolker, Steve Walker (2015). Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software, 67(1), 1-48. doi: 10.18637/jss.v067.i01.
car: John Fox and Sanford Weisberg (2019). An {R} Companion to Applied Regression, Third Edition. Thousand Oaks CA: Sage. URL: https://socialsciences.mcmaster.ca/jfox/Books/Companion/
MASS: Venables, W. N. & Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth Edition. Springer, New York. ISBN 0-387-95457-0
glmmTMB: Mollie Brooks, Ben Bolker, Kasper Kristensen, Martin Maechler, Arni Magnusson (2022). glmmTMB: Generalized Linear Mixed Models using Template Model Builder
qvalue: John D. Storey, Andrew J. Bass, Alan Dabney and David Robinson (2020). qvalue: Q-value estimation for false discovery rate control. R package version 2.22.0. https://github.com/StoreyLab/qvalue
lmerTest: Alexandra Kuznetsova, Per Brockhoff, Rune Christensen, Sofie Jensen. lmerTest: Tests in Linear Mixed Effects Models
emmeans: Russell V. Lenth, Paul Buerkner, Maxime Herve, Jonathon Love, Fernando Miguez, Hannes Riebl, Henrik Singmann. emmeans: Estimated Marginal Means, aka Least-Squares Means
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.