options(width=88)
library(kableExtra)

Lifecycle: Maturing License: MIT CRAN status Downloads r paste0("[![", Sys.Date(),"]","(",paste0("https://img.shields.io/badge/last%20git%20commit-", gsub('-', '--', Sys.Date()),"-turquoise.svg"), ")]","(",'https://github.com/myles-lewis/glmmSeq/blob/master/NEWS.md',")") GitHub issues Travis

glmmSeq

glmmSeq logo

The aim of this package is to model gene expression with a general linear mixed model (GLMM) as described in the R4RA study [1]. The most widely used mainstream differential gene expression analysis tools (e.g Limma, DESeq2, edgeR) are all unable to fit mixed effects linear models. This package however fits negative binomial mixed effects models at individual gene level using the negative.binomial function from MASS and the glmer function in lme4 which enables random effect, as as well as mixed effects, to be modelled.

Installing from CRAN

install.packages("glmmSeq")

Installing from Github

devtools::install_github("myles-lewis/glmmSeq")

Installing Locally

Or you can source the functions individually:

functions = list.files("./R", full.names = TRUE)
invisible(lapply(functions, source))

But you will need to load in the additional libraries:

# Install CRAN packages
invisible(lapply(c("MASS", "car", "ggplot2", "ggpubr", "lme4","lmerTest",
                   "methods", "parallel", "plotly", "pbapply", "pbmcapply"),
                 function(p){
                   if(! p %in% rownames(installed.packages())) {
                     install.packages(p)
                   }
                   library(p, character.only=TRUE)
                 }))

# Install BioConductor packages
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
invisible(lapply(c("qvalue"), function(p){
  if(! p %in% rownames(installed.packages())) BiocManager::install(p)
  library(p, character.only=TRUE)
}))

Overview

To get started, first we load in the package:

library(glmmSeq)
set.seed(1234)

This vignette will demonstrate the power of this package using a minimal example from the PEAC data set. Here we focus on the synovial biopsy RNA-Seq data from this cohort of patients with early rheumatoid arthritis.

data(PEAC_minimal_load)

This data contains:

These are outlined in the following subsections.

Metadata

metadata$EULAR_binary  = NA
metadata$EULAR_binary[metadata$EULAR_6m %in%
                        c("Good", "Moderate" )] = "responder"
metadata$EULAR_binary[metadata$EULAR_6m %in% c("Non-response")] = "non_responder"

kable(head(metadata), row.names = F) %>% kable_styling()

Count data

kable(head(tpm)) %>% kable_styling() %>%
  scroll_box(width = "100%")

Dispersion

Using negative binomial models requires gene dispersion estimates to be made. This can be achieved in a number of ways. A common way to calculate this for gene i is to use the equation:

Dispersioni = (variancei - meani)/meani2

This can be calculated using:

disp <- apply(tpm, 1, function(x){
  (var(x, na.rm=TRUE)-mean(x, na.rm=TRUE))/(mean(x, na.rm=TRUE)**2)
  })

head(disp)

Alternatively, we recommend using edgeR to estimate of the common, trended and tagwise dispersions across all tags:

disp  <- setNames(edgeR::estimateDisp(tpm)$tagwise.dispersion, rownames(tpm))

head(disp)

or with DESeq2 using the raw counts:

dds <- DESeqDataSetFromTximport(txi = txi, colData = metadata, design = ~ 1)
dds <- DESeq(dds)
dispersions <- setNames(dispersions(dds), rownames(txi$counts))

Size Factors

There is also an option to include size factors for each gene. Again this can be estimated using:

sizeFactors <- colSums(tpm)  
sizeFactors <- sizeFactors / mean(sizeFactors)  # normalise to mean = 1

head(sizeFactors)

Or using edgeR these can be calculated from the raw read counts:

sizeFactors <- calcNormFactors(counts, method="TMM")

Similarly, with DESeq2:

sizeFactors <- estimateSizeFactorsForMatrix(counts)

Note the sizeFactors vector needs to be centred around 1, since it used directly as an offset of form log(sizeFactors) in the GLMM model.

Fitting Models

To fit a model for one gene over time we use a formula such as:

gene expression ~ fixed effects + random effects

In R the formula is defined by both the fixed-effects and random-effects part of the model, with the response on the left of a ~ operator and the terms, separated by + operators, on the right. Random-effects terms are distinguished by vertical bars ("|") separating expressions for design matrices from grouping factors. For more information see the ?lme4::glmer.

In this case study we want to use time and response as fixed effects and the patients as random effects:

gene expression ~ time + response + (1 | patient)

To fit this model for all genes we can use the glmmSeq function. Note that this analysis can take some time, with 4 cores:

results <- glmmSeq(~ Timepoint * EULAR_6m + (1 | PATID),
                   countdata = tpm,
                   metadata = metadata,
                   dispersion = disp,
                   progress = TRUE)

or alternatively using two-factor classification with EULAR_binary:

results2 <- glmmSeq(~ Timepoint * EULAR_binary + (1 | PATID),
                    countdata = tpm,
                    metadata = metadata,
                    dispersion = disp)

Outputs

This creates a GlmmSeq object which contains the following slots:

names(attributes(results))

The variables used by the model are in the @modeldata:

kable(results@modelData) %>% kable_styling()

The model fit statistics can be viewed in the @stats slot which is a list of items including fitted model coefficients, their standard errors and the results of statistical tests on terms within the model using Wald type 2 Chi square. To see the most significant interactions we can order pvals by Timepoint:EULAR_6m:

stats <- summary(results)

kable(stats[order(stats[, 'P_Timepoint:EULAR_6m']), ]) %>%
  kable_styling() %>%
  scroll_box(width = "100%", height = "400px")

Summary statistics on a single gene can be shown specifying the argument gene.

summary(results, gene = "MS4A1")

Estimated means based on each gene's fitted model to show fixed effects and their 95% confidence intervals can be seen in the @predict slot:

predict = data.frame(results@predict)
kable(predict) %>%
  kable_styling() %>%
  scroll_box(width = "100%", height = "400px")

Q-values

The q-values from each of the p-value columns can be calculated using glmmQvals. This will print a significance table based on the cut-off (default p=0.05) and add a matrix named qvals to the @stats slot:

results <- glmmQvals(results)

glmmTMB models

The glmmTMB package is an alternative to lme4 for fitting negative binomial GLMM models. Note, that this package optimises the dispersion parameter for each GLMM model. This has the advantage that a predefined list of dispersions for each gene is not required. But it also means that model fitting is significantly slower than a negative binomial GLMM fit by lme4::glmer with family function MASS::negative.binomial for which the dispersion parameter must be pre-specified. Each glmmTMB model takes about 0.8 seconds, so that even with 8 cores, a typical transcriptome of 20,000 genes would take about 30 minutes, although this will vary depending on the complexity of the design formula. In comparison, the standard glmer pipeline with known dispersions supplied takes about 2 minutes on 8 cores.

Although glmmTMB is slower than lme::glmer with known dispersion, glmmTMB is still faster than the equivalent lme4 function glmer.nb when the dispersion is unknown.

Setting method = "glmmTMB" when calling glmmSeq() switches from using lme4::glmer (the default) to the glmmTMB package. The dispersion argument is then no longer required. The family argument can be used to specify different family functions and glmmTMB provides nbinom1 and nbinom2 (glmmSeq uses the latter by default). Other family functions e.g. poisson can be used.

Gaussian mixed effects models

An alternative to fitting negative binomial GLMM is to use transformed data and fit gaussian LMM. The lmmSeq function will fit many LMM across genes (or other data) where random effects/mixed effects models are useful. It is almost 3x faster than glmmSeq. For example, gaussian normalised DNA methylation data can be analysed using this method. In the example below the RNA-Seq gene expression data is log transformed so that it is approximately Gaussian. Alternatively variance stabilising transformation (VST) can be applied to count data through DESeq2 or the voom transformation in limma voom.

logtpm <- log2(tpm + 1)
lmmres <- lmmSeq(~ Timepoint * EULAR_6m + (1 | PATID),
                   maindata = logtpm,
                   metadata = metadata,
                   progress = TRUE)
summary(lmmres, "MS4A1")

Hypothesis testing

By default, p-values for each model term are computed using Wald type 2 Chi-squared test as per car::Anova(). The underlying code for this has been optimised for speed.

For LMM via lmmSeq(), an alternative to the Wald type 2 Chi-squared test is an F-test with Satterthwaite denominator degrees of freedom, which has been implemented using the lmerTest package and is enabled using test.stat = "F". This is not available for GLMM.

However, if a reduced model formula is specified by setting reduced, then a likelihood ratio test is performed instead using anova. This will double computation time since two (G)LMM have to be fitted for each gene. For further information on inference testing on GLMM, we recommend Ben Bolker's GLMM FAQ page.

glmmLRT <- glmmSeq(~ Timepoint * EULAR_6m + (1 | PATID),
                   reduced = ~ Timepoint + EULAR_6m + (1 | PATID),
                   countdata = tpm,
                   metadata = metadata,
                   dispersion = disp, verbose = FALSE)

summary(glmmLRT, "MS4A1")

Extracting individual Genes

Similarly you can run the script for an individual gene (make sure you use drop = FALSE to maintain countdata as a matrix).

MS4A1glmm <- glmmSeq(~ Timepoint * EULAR_6m + (1 | PATID),
                     countdata = tpm["MS4A1", , drop = FALSE],
                     metadata = metadata,
                     dispersion = disp,
                     verbose = FALSE)

The (g)lmer or glmmTMB fitted object for a single gene can be obtained using the function glmmRefit():

fit <- glmmRefit(results, gene = "MS4A1")
fit

These (g)lmer objects can be passed to other packages notably emmeans for visualising estimated marginal means. Note these are based on the model, not directly on the data.

library(emmeans)

emmeans(fit, ~ Timepoint | EULAR_6m)
emmip(fit, ~ Timepoint | EULAR_6m)

Refitting a different model

glmmRefit() allows a different model to be fitted using the original data. The example below shows how to refit the model without the interaction term and then perform a likelihood ratio test using anova. Note for glmmSeq() objects the LHS of the reduced formula must be "count", while for lmmSeq() objects the LHS must be "gene". For glmmTMB analyses, the GLM family can also be changed.

fit2 <- glmmRefit(results, gene = "MS4A1",
                  formula = count ~ Timepoint + EULAR_6m + (1 | PATID))

anova(fit, fit2)

Model Plots

For variables such as time, which are matched according to an ID (the random effect), we can examine the fitted model using plots which show estimated means and confidence intervals based on coefficients for the fitted regression model, overlaid upon the underlying data. In this case the samples are matched longitudinally over time.

Plots can be viewed using either ggplot or base graphics. We can start looking at the gene with the most significant interaction IGHV3-23:

plotColours <- c("skyblue", "goldenrod1", "mediumseagreen")
modColours <- c("dodgerblue3", "goldenrod3", "seagreen4")
shapes <- c(17, 19, 18)

ggmodelPlot(results,
            geneName = "IGHV3-23",
            x1var = "Timepoint",
            x2var="EULAR_6m",
            xlab="Time",
            colours = plotColours,
            shapes = shapes,
            lineColours = plotColours, 
            modelColours = modColours,
            modelSize = 10)

Alternatively plots can be generated using base graphics, here with or without the model fit overlaid. By default p-value labels are taken from the column names of the pvals object in the stats slot of the S4 result object. These can be relabelled using the plab argument.

oldpar <- par(mfrow=c(1, 2))

modelPlot(results2,
          geneName = "FGF14",
          x1var = "Timepoint",
          x2var="EULAR_binary",
          fontSize=0.65,
          colours=c("coral", "mediumseagreen"),
          modelColours = c("coral", "mediumseagreen"),
          modelLineColours = "black",
          modelSize = 2)

modelPlot(results,
          geneName = "EMILIN3",
          x1var = "Timepoint",
          x2var = "EULAR_6m",
          colours = plotColours,
          plab = c("time", "response", "time:response"),
          addModel = FALSE)

par(oldpar)

To plot the model fits alone set addPoints = FALSE.

library(ggpubr)

p1 <- ggmodelPlot(results,
                  "ADAM12",
                  x1var="Timepoint",
                  x2var="EULAR_6m",
                  xlab="Time",
                  addPoints = FALSE,
                  colours = plotColours)

p2 <- ggmodelPlot(results,
                  "EMILIN3",
                  x1var="Timepoint",
                  x2var="EULAR_6m",
                  xlab="Time",
                  fontSize=8,
                  x2Offset=1,
                  addPoints = FALSE,
                  colours = plotColours)

ggarrange(p1, p2, ncol=2, common.legend = T, legend="bottom")

Fold change plots

The comparative fold change (for x1var variables) between conditions (x2var and x2Values variables) can be plotted using an fcPlot for all genes to highlight significance. This type of plot is most suited to look for interaction between time (x1var) and a two-level factor (x2var), looking at change between two timepoints. In the example below from the R4RA study [1], gene expression pre- and post-drug treatment is compared between two drugs (rituximab & tocilizumab), using the design formula gene_counts ~ time * drug. By setting graphics = "plotly" this can be viewed interactively.

r4ra_glmm <- glmmSeq(~ time * drug + (1 | Patient_ID), 
                       countdata = tpmdata, metadata,
                       dispersion = dispersions, cores = 8, removeSingles = T)
r4ra_glmm <- glmmQvals(r4ra_glmm)
labels = c(..)  # Genes to label
fcPlot(r4ra_glmm, x1var = "time", x2var = "drug", graphics = "plotly",
       pCutoff = 0.05, useAdjusted = TRUE,
       labels = labels,
       colours = c('grey', 'green3', 'gold3', 'blue'))
knitr::include_graphics("r4ra_glmm_fcplot.png")

Log2 fold change between the two time points for individuals treated with rituximab on the x axis and individuals treated with tocilizumab on the y axis with each point representing a gene. Genes showing an interaction between time and drug are coloured blue or gold depending on whether their fold change is greater post-rituximab (blue) or post-tocilizumab (gold). Genes without interaction, but changing significantly over time are coloured green and tend to lie along the line of identity. See the Longitudinal tab in https://r4ra.hpc.qmul.ac.uk for an interactive version of the above plot.

MA plots

An MA plot is an application of a Bland–Altman plot. The plot visualizes the differences between measurements taken in two samples, by transforming the data onto M (log ratio) and A (mean average) scales, then plotting these values.

labels = c('MS4A1', 'FGF14', 'IL2RG', 'IGHV3-23', 'ADAM12', 'IL36G', 
           'BLK', 'SAA1', 'CILP', 'EMILIN3', 'EMILIN2', 'IGHJ6', 
           'CXCL9', 'CXCL13')
maPlots <- maPlot(results,
                  x1var="Timepoint",
                  x2var="EULAR_6m",
                  x2Values=c("Good", "Non-response"),
                  colours=c('grey', 'midnightblue',
                             'mediumseagreen', 'goldenrod'),
                  labels=labels,
                  graphics="ggplot")

maPlots$combined

Troubleshooting

Mixed effects models are tricky to fit and lme4::glmer sometimes returns errors. In fact, a significant amount of code in glmmSeq is devoted to catching and handling errors to allow parallelisation to continue. Errors in genes are stored in the errors slot.

results@errors[1]   # first gene error

Sometimes glmmSeq returns errors in all genes. This usually means a problem with not enough samples in each timepoint or a mistake in the formula. Since version 0.5.5 glmmSeq now returns a vector of the error messages for all genes, which can be useful for debugging. In the example below, only timepoint 0 is specified.

results3 <- glmmSeq(~ Timepoint * EULAR_6m + (1 | PATID),
                    countdata = tpm[, metadata$Timepoint == 0],
                    metadata = metadata[metadata$Timepoint == 0, ],
                    dispersion = disp)

If there are errors which are not caught by the error checking core mechanism, this can lead to problems with grouping results after the core models have been fit. Setting returnList=TRUE when calling glmmSeq returns the list output direct from mclapply (or parLapply on windows). This can be helpful for debugging unforeseeen problems in the core loop.

Citing glmmSeq

glmmSeq was developed by the bioinformatics team at the Experimental Medicine & Rheumatology department and Centre for Translational Bioinformatics at Queen Mary University London.

If you use this package please cite as:

citation("glmmSeq")

References

  1. Felice Rivellese, Anna Surace, Katriona Goldmann, Elisabetta Sciacca, Cankut Cubuk, Giovanni Giorli, ... Michael Barnes, Myles J. Lewis, Costantino Pitzalis, R4RA collaborative group. Rituximab versus tocilizumab in rheumatoid arthritis: synovial biopsy-based biomarker analysis of the phase 4 R4RA randomized trial. Nature medicine 2022; 28(6): 1256-68. doi:10.1038/s41591-022-01789-0

Statistical software used in this package:

  1. lme4: Douglas Bates, Martin Maechler, Ben Bolker, Steve Walker (2015). Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software, 67(1), 1-48. doi: 10.18637/jss.v067.i01.

  2. car: John Fox and Sanford Weisberg (2019). An {R} Companion to Applied Regression, Third Edition. Thousand Oaks CA: Sage. URL: https://socialsciences.mcmaster.ca/jfox/Books/Companion/

  3. MASS: Venables, W. N. & Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth Edition. Springer, New York. ISBN 0-387-95457-0

  4. glmmTMB: Mollie Brooks, Ben Bolker, Kasper Kristensen, Martin Maechler, Arni Magnusson (2022). glmmTMB: Generalized Linear Mixed Models using Template Model Builder

  5. qvalue: John D. Storey, Andrew J. Bass, Alan Dabney and David Robinson (2020). qvalue: Q-value estimation for false discovery rate control. R package version 2.22.0. https://github.com/StoreyLab/qvalue

  6. lmerTest: Alexandra Kuznetsova, Per Brockhoff, Rune Christensen, Sofie Jensen. lmerTest: Tests in Linear Mixed Effects Models

  7. emmeans: Russell V. Lenth, Paul Buerkner, Maxime Herve, Jonathon Love, Fernando Miguez, Hannes Riebl, Henrik Singmann. emmeans: Estimated Marginal Means, aka Least-Squares Means



Try the glmmSeq package in your browser

Any scripts or data that you put into this service are public.

glmmSeq documentation built on Oct. 8, 2022, 5:05 p.m.