knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
As the cost of sequencing decreases, the complexity and size of RNA-Seq experiments are rapidly growing. In particular, paired, longitudinal and other correlated study designs are becoming commonplace. However, the tools available to correctly analyze these more complex study designs have been limited. In Vestal et al. 2019, we developed a Bayesian hierarchical negative binomial mixed model framework (called MCMSeq) that could directly handle count the data from RNA-Seq experiments with repeated measures. Though MCMSeq performed best in terms of achieving good sensitivity while maintaining nominal type 1 error and false discovery rates (FDR) among the methods explored in that paper, the only other alternative that came close to matching its performance was to transform the counts, and then fit regular linear mixed models. In this R package, entitled lmerSeq, we provide some convenient wrapper functions to fit these linear mixed models (using the lmerTest R package) and obtain results tables in a similar spirit to the edgeR
or DESeq2
analysis pipelines. In the following vignette, we provide an example analysis on simulated RNA-Seq data to show how to fit the models, summarize the results for a single coefficient, and then test a linear contrast of coefficients.
To model the observed RNA-Seq count data, we first transform the counts using a normalizing transformation such as the variance stabilizing transformation offered in the vst
function in DESeq2
. We model the transformed data using a linear mixed model. To account for correlation, we allow a random effects to be included in the model. Let $Y_{gij}$ be the transformed value of the expression of gene $g$ from subject $i$ at time point or observation $j$, then
\begin{eqnarray}
\label{eqn:glmm}
Y_{gij} &\sim& \mathcal{N}(\mu_{gij}, \sigma_g^2) \
\mu_{gij} &=& X_{ij}\boldsymbol{\beta_g} + \mathbf{Z}{ij}\mathbf{b}{gi} \
{b_{gi}} &\sim& \mathcal{MVN}(\mathbf{0}, \boldsymbol{\Sigma}{gb})
\end{eqnarray}
where $\mu{gij}$ is the mean transformed expression of gene $g$ in subject $i$ at observation $j$ and $\sigma_g^2$ is the variance parameter for gene $g$. $\boldsymbol{\beta_g}$ is a $p$ by 1 vector of fixed effect regression coefficients for gene $g$, $X_{ij}$ is a row vector of fixed effects covariates for subject $i$ at observation $j$, and $\mathbf{b}{gi}$ is a vector of random intercept for gene $g$ and subject $i$. We assume that for each gene $g$, the subject-specific random effects have a multivariate normal distribution with mean $\mathbf{0}$ and variance-covariance matrix $\boldsymbol{\Sigma}{bg}$.
The three main functions in the lmerSeq
package are lmerSeq.fit
, which is used to fit lmerSeq models, lmerSeq.summary
, which is used to summarize individual regression coefficients, and lmerSeq.contrast
, which is used to summarize individual linear contrasts coefficients. Each function has several required arguments and options which are described below.
| Argument | Description |
| :-------------- | :---------------------------------------------------|
| form | One-sided linear formula describing both the fixed-effects and random-effects parts of the model using the syntax of the lme4
package. |
| expr_mat | (G x N) numeric matrix or data frame of transformed RNA-seq counts (e.g. using vst
from DESeq2
), with genes in rows and samples in columns. G = number of genes. N = number of samples. |
| sample_data | Data frame with N rows containing the fixed- and random-effects terms included in the formula. The rows of the data frame must correspond (and be in the same order as) the columns of the expression matrix.|
| Option | Description | Default | | :---------------------- | :--------------------------------------------------- | :----------------| | gene_names | An optional character vector of gene names (length G). If unspecified, row names from the expression matrix will be used. | NULL | | REML | Should the models be fit with REML or regular ML?| TRUE | | parallel |If on Mac or Linux, use forking (via mclapply) to parallelize fits.| FALSE | | cores | Number of cores to use if using parallelizing fits. | 2 |
| Argument | Description |
| :-------------- | :---------------------------------------------------|
| lmerSeq_results | Results object from running lmerSeq.fit. |
| coefficient | Character string or numeric indicator of which coefficient to summarize.|
| Option | Description | Default | | :---------------------- | :--------------------------------------------------- | :----------------| | p_adj_method | Method for adjusting for multiple comparisons (default is Benjamini-Hochberg). | "BH"| | ddf | Method for computing degrees of freedom and t-statistics. Options are "Satterthwaite" and "Kenward-Roger".| "Satterthwaite" | | sort_results |Should the results table be sorted by adjusted p-value? | TRUE|
| Argument | Description |
| :-------------- | :---------------------------------------------------|
| lmerSeq_results | Results object from running lmerSeq.fit. |
| contrast_mat | Numeric matrix representing the contrast to be tested. Matrix must have the same number of columns as the number of coefficients in the model. If the matrix has multiple rows, a simultaneous F-test will be done. |
| Option | Description | Default | | :---------------------- | :--------------------------------------------------- | :----------------| | p_adj_method | Method for adjusting for multiple comparisons (default is Benjamini-Hochberg). | "BH"| | ddf | Method for computing degrees of freedom and t-statistics. Options are "Satterthwaite" and "Kenward-Roger".| "Satterthwaite" | | sort_results |Should the results table be sorted by adjusted p-value? | TRUE|
We will perform an analysis using data simulated from a negative binomial distribution and then transformed using the vst
function in the DESeq2
package. This dataset is included in the package and contains transformed expression data for 10 subjects (ids 1 to 10): 5 control subjects (group = 0) and 5 treatment subjects (group = 1). All subjects are measured at 2 timepoints, baseline (time = 0) and follow-up (time = 1). The dataset contains 14,916 genes, around 20\% of which were simulated to have changes in expression from baseline to follow up in the treatment group only. For this example, we will subset the dataset to 500 genes for this analysis. First load the data:
# Load the library library(lmerSeq) # Load the data data("expr_data") names(expr_example) # sample_meta_data has information about the study design # each row is a sample and corresponds to a column # in the expression matrix (vst_expr) sample_meta_data <- expr_example$sample_meta_data sample_meta_data # The expression matrix (vst_expr) has the same number of columns as # rows in metadata. The columns of the counts matrix # are in the same order as the rows of the metadata. vst_expr <- expr_example$vst_expr[1:500,] head(vst_expr)
The goal of our analysis will be to compare changes over time between the control and treatment groups. For this example, the model will use a random intercept and no other random effects. Thus, we will fit the following model to the data:
\begin{eqnarray} \label{eq:sim_data} Y_{gij} &\sim& \mathcal{N}(\mu_{gij}, \sigma_g^2) \ \mu_{gij} &=& \beta_{g0} + \beta_{g1} I_{T_i} + \beta_{g2} I_{F_{ij}} + \beta_{g3}I_{T_i} I_{F_{ij}} + b_{gi} \nonumber \ b_{gi} &\sim& \mathcal{N}(0, \sigma_{gb}^2) \nonumber \end{eqnarray} where $I_{T_i}$ is an indicator function that equals 1 if subject $i$ is in the treatment group, $I_{F_{ij}}$ is a similar indicator for if observation $j$ is the follow-up, and $b_{gi}$ is the random intercept for gene $g$ and subject $i$.
To fit the model, we use the lmerSeq.fit function. The expr_mat
argument takes a data frame or matrix of transformed RNA-seq counts, the columns of which must be in the same order as the rows of sample_data
data frame, which contains metadata on each sample. Fixed and random effects are passed to the form
argument using the syntax of the lme4
package and must be included in the sample_data
data frame. By default, the lmerSeq.fit will use rownames of the counts matrix to identify genes. If you would like to use a different set of gene names, you can supply a character vector of names in the same order as the rows of the counts matrix to the gene_names option. If you are using Mac or Linux and would like to use forking (via mclapply
) to parallelize fits, the parallel
argument should be set to TRUE
and the number of cores to be used should be passed to the cores
argument. Finally, if you would like to use regular maximum likelihood (ML) instead of restricted maximum likelihood (REML), set the REML
argument to FALSE
.
## Fit the Model fit.lmerSeq <- lmerSeq.fit(form = ~ group * time + (1|ids), expr_mat = vst_expr, sample_data = sample_meta_data, REML = T) # Look at lmerSeq.fit object class(fit.lmerSeq) #Length is number of genes length(fit.lmerSeq) # Names for individual element names(fit.lmerSeq[[1]])
The resulting lmerSeq object is a list with elements for every gene in the expression matrix. Each gene element contains two items. The first item, fit
is an object of class lmerModLmerTest
(see lmerTest
package) which contains information concerning the model fit. The item gene
contains the genename (or rowname from the expression matrix if gene names were not provided).
Model coefficients can be more easily summarized with the lmerSeq.summary
function, which calculates and tabulates adjusted p-values, as well as other results. The coefficient of interest can be passed to the function either as a numeric indicator or character string. In calculating adjusted p-values, you can choose between various adjustment methods (see p.adjust.methods)
and two degree of freedom methods (Satterthwaite or Kenward-Roger). The resulting table can also be sorted by adjusted p-value by setting the sort_results
option to TRUE
.
## Summarize the group and time coefficient model_sum_time <- lmerSeq.summary(lmerSeq_results = fit.lmerSeq, coefficient = 3, p_adj_method = 'BH', ddf = 'Satterthwaite', sort_results = F) model_sum_group <- lmerSeq.summary(lmerSeq_results = fit.lmerSeq, coefficient = "group", p_adj_method = 'BH', ddf = 'Satterthwaite', sort_results = T) #Look at summary object names(model_sum_time) head(model_sum_time$summary_table)
Along with information about the function call, the resulting summary object contains a list of genes whose corresponding models were singular, and a summary table. For each gene, the summary table displays the model estimate, standard error, degree of freedom, t-value, and raw and adjusted p-values. Values for genes with singular models or models that did not converge are marked NA
.
We are also interested in the following contrasts or linear combinations of regression parameters:
$\beta_{g1} + \beta_{g3} = 0$ tests for differences in expression between the treatment and the control group at follow-up
$\beta_{g1}=0 \text{ or }\beta_{g2} = 0 \text{ or }\beta_{g3} = 0$ simultaneously tests all of the coefficients
The lmerSeq.contrast
function allows users to specify custom contrasts that are either one dimensional or multi-dimensional. If the contrast matrix has only one row, a t-test is executed while if the contrast matrix has more than 1 row, an F-test is used. The contrast matrix must have the same number of columns as fixed effects and if more than one row is supplied, each row should correspond to a contrast to be simultaneously tested with the other supplied contrasts.
The function calculates adjusted p-values according to the adjustment method you choose (see p.adjust.methods)
. You can also specify the method for the degree of freedom calculation (Satterthwaite or Kenward-Roger). The resulting contrast summary table can also be sorted by adjusted p-value by setting the sort_results
option to TRUE
.
The fixed effects in the contrast matrix must be in the same order specified by the model formula. If you are unsure of the ordering, you can look at the design matrix for any gene model fit in the lmerSeq fit object. To see the design matrix:
# Look at the fixed effects design matrix to develop the contrast_mat model.matrix(fit.lmerSeq[[1]]$fit) ## Contrast 1: t-test for difference in expression between groups at follow-up cont_mat1 <- rbind(c(0, 1, 0, 1)) contrast_summary1 <- lmerSeq.contrast(lmerSeq_results = fit.lmerSeq, contrast_mat = cont_mat1, p_adj_method = 'BH', ddf = 'Satterthwaite', sort_results = T) #Look at summary object names(contrast_summary1) head(contrast_summary1$summary_table) ## Contrast 2: F-test for for all coefficients cont_mat2 <- rbind(c(0, 1, 0, 0), c(0, 0, 1, 0), c(0, 0, 0, 1)) contrast_summary2 <- lmerSeq.contrast(lmerSeq_results = fit.lmerSeq, contrast_mat = cont_mat2, p_adj_method = 'BH', ddf = 'Satterthwaite', sort_results = T) #Look at summary object names(contrast_summary1) head(contrast_summary2$summary_table)
Like with the lmerSeq.summary
function, the object produced by the lmerSeq.contrast
function contains information about the function call, a list of genes whose corresponding models were singular, and a summary table. For one dimensional contrasts, the summary table displays the contrast estimate and 95\% confidence limits, standard error, degrees of freedom, t-value, and raw and adjusted p-values. For multi-contrasts tests, the summary table contains the sum of squares and mean sum of squares, numerator and denominator degrees of freedom, the F-statistic, and raw and adjusted p-values. In both tables values for genes with singular models or models that did not converge are marked NA
.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.