knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

As the cost of sequencing decreases, the complexity and size of RNA-Seq experiments are rapidly growing. In particular, paired, longitudinal and other correlated study designs are becoming commonplace. However, the tools available to correctly analyze these more complex study designs have been limited. In Vestal et al. 2019, we developed a Bayesian hierarchical negative binomial mixed model framework (called MCMSeq) that could directly handle count the data from RNA-Seq experiments with repeated measures. Though MCMSeq performed best in terms of achieving good sensitivity while maintaining nominal type 1 error and false discovery rates (FDR) among the methods explored in that paper, the only other alternative that came close to matching its performance was to transform the counts, and then fit regular linear mixed models. In this R package, entitled lmerSeq, we provide some convenient wrapper functions to fit these linear mixed models (using the lmerTest R package) and obtain results tables in a similar spirit to the edgeR or DESeq2 analysis pipelines. In the following vignette, we provide an example analysis on simulated RNA-Seq data to show how to fit the models, summarize the results for a single coefficient, and then test a linear contrast of coefficients.

Model Specification

To model the observed RNA-Seq count data, we first transform the counts using a normalizing transformation such as the variance stabilizing transformation offered in the vst function in DESeq2. We model the transformed data using a linear mixed model. To account for correlation, we allow a random effects to be included in the model. Let $Y_{gij}$ be the transformed value of the expression of gene $g$ from subject $i$ at time point or observation $j$, then \begin{eqnarray} \label{eqn:glmm} Y_{gij} &\sim& \mathcal{N}(\mu_{gij}, \sigma_g^2) \ \mu_{gij} &=& X_{ij}\boldsymbol{\beta_g} + \mathbf{Z}{ij}\mathbf{b}{gi} \ {b_{gi}} &\sim& \mathcal{MVN}(\mathbf{0}, \boldsymbol{\Sigma}{gb}) \end{eqnarray} where $\mu{gij}$ is the mean transformed expression of gene $g$ in subject $i$ at observation $j$ and $\sigma_g^2$ is the variance parameter for gene $g$. $\boldsymbol{\beta_g}$ is a $p$ by 1 vector of fixed effect regression coefficients for gene $g$, $X_{ij}$ is a row vector of fixed effects covariates for subject $i$ at observation $j$, and $\mathbf{b}{gi}$ is a vector of random intercept for gene $g$ and subject $i$. We assume that for each gene $g$, the subject-specific random effects have a multivariate normal distribution with mean $\mathbf{0}$ and variance-covariance matrix $\boldsymbol{\Sigma}{bg}$.

Overview of Functions in the lmerSeq Package

The three main functions in the lmerSeq package are lmerSeq.fit, which is used to fit lmerSeq models, lmerSeq.summary, which is used to summarize individual regression coefficients, and lmerSeq.contrast, which is used to summarize individual linear contrasts coefficients. Each function has several required arguments and options which are described below.

lmerSeq.fit

Required Arguments

| Argument | Description | | :-------------- | :---------------------------------------------------| | form | One-sided linear formula describing both the fixed-effects and random-effects parts of the model using the syntax of the lme4 package. |
| expr_mat | (G x N) numeric matrix or data frame of transformed RNA-seq counts (e.g. using vst from DESeq2), with genes in rows and samples in columns. G = number of genes. N = number of samples. | | sample_data | Data frame with N rows containing the fixed- and random-effects terms included in the formula. The rows of the data frame must correspond (and be in the same order as) the columns of the expression matrix.|

Additional Arguments

| Option | Description | Default | | :---------------------- | :--------------------------------------------------- | :----------------| | gene_names | An optional character vector of gene names (length G). If unspecified, row names from the expression matrix will be used. | NULL | | REML | Should the models be fit with REML or regular ML?| TRUE | | parallel |If on Mac or Linux, use forking (via mclapply) to parallelize fits.| FALSE | | cores | Number of cores to use if using parallelizing fits. | 2 |

lmerSeq.summary

Required Arguments

| Argument | Description | | :-------------- | :---------------------------------------------------| | lmerSeq_results | Results object from running lmerSeq.fit. |
| coefficient | Character string or numeric indicator of which coefficient to summarize.|

Additional Arguments

| Option | Description | Default | | :---------------------- | :--------------------------------------------------- | :----------------| | p_adj_method | Method for adjusting for multiple comparisons (default is Benjamini-Hochberg). | "BH"| | ddf | Method for computing degrees of freedom and t-statistics. Options are "Satterthwaite" and "Kenward-Roger".| "Satterthwaite" | | sort_results |Should the results table be sorted by adjusted p-value? | TRUE|

lmerSeq.contrast

Required Arguments

| Argument | Description | | :-------------- | :---------------------------------------------------| | lmerSeq_results | Results object from running lmerSeq.fit. |
| contrast_mat | Numeric matrix representing the contrast to be tested. Matrix must have the same number of columns as the number of coefficients in the model. If the matrix has multiple rows, a simultaneous F-test will be done. |

Additional Arguments

| Option | Description | Default | | :---------------------- | :--------------------------------------------------- | :----------------| | p_adj_method | Method for adjusting for multiple comparisons (default is Benjamini-Hochberg). | "BH"| | ddf | Method for computing degrees of freedom and t-statistics. Options are "Satterthwaite" and "Kenward-Roger".| "Satterthwaite" | | sort_results |Should the results table be sorted by adjusted p-value? | TRUE|

An Example lmerSeq Analysis

We will perform an analysis using data simulated from a negative binomial distribution and then transformed using the vst function in the DESeq2 package. This dataset is included in the package and contains transformed expression data for 10 subjects (ids 1 to 10): 5 control subjects (group = 0) and 5 treatment subjects (group = 1). All subjects are measured at 2 timepoints, baseline (time = 0) and follow-up (time = 1). The dataset contains 14,916 genes, around 20\% of which were simulated to have changes in expression from baseline to follow up in the treatment group only. For this example, we will subset the dataset to 500 genes for this analysis. First load the data:

# Load the library
library(lmerSeq)

# Load the data
data("expr_data")

names(expr_example)

# sample_meta_data has information about the study design
# each row is a sample and corresponds to a column
# in the expression matrix (vst_expr)
sample_meta_data <- expr_example$sample_meta_data
sample_meta_data

# The expression matrix (vst_expr) has the same number of columns as
# rows in metadata.  The columns of the counts matrix
# are in the same order as the rows of the metadata.
vst_expr <- expr_example$vst_expr[1:500,]
head(vst_expr)

The goal of our analysis will be to compare changes over time between the control and treatment groups. For this example, the model will use a random intercept and no other random effects. Thus, we will fit the following model to the data:

\begin{eqnarray} \label{eq:sim_data} Y_{gij} &\sim& \mathcal{N}(\mu_{gij}, \sigma_g^2) \ \mu_{gij} &=& \beta_{g0} + \beta_{g1} I_{T_i} + \beta_{g2} I_{F_{ij}} + \beta_{g3}I_{T_i} I_{F_{ij}} + b_{gi} \nonumber \ b_{gi} &\sim& \mathcal{N}(0, \sigma_{gb}^2) \nonumber \end{eqnarray} where $I_{T_i}$ is an indicator function that equals 1 if subject $i$ is in the treatment group, $I_{F_{ij}}$ is a similar indicator for if observation $j$ is the follow-up, and $b_{gi}$ is the random intercept for gene $g$ and subject $i$.

Fitting the Model

To fit the model, we use the lmerSeq.fit function. The expr_mat argument takes a data frame or matrix of transformed RNA-seq counts, the columns of which must be in the same order as the rows of sample_data data frame, which contains metadata on each sample. Fixed and random effects are passed to the form argument using the syntax of the lme4 package and must be included in the sample_data data frame. By default, the lmerSeq.fit will use rownames of the counts matrix to identify genes. If you would like to use a different set of gene names, you can supply a character vector of names in the same order as the rows of the counts matrix to the gene_names option. If you are using Mac or Linux and would like to use forking (via mclapply) to parallelize fits, the parallel argument should be set to TRUE and the number of cores to be used should be passed to the cores argument. Finally, if you would like to use regular maximum likelihood (ML) instead of restricted maximum likelihood (REML), set the REML argument to FALSE.

 ##  Fit the Model
 fit.lmerSeq <- lmerSeq.fit(form = ~ group * time + (1|ids),
                            expr_mat = vst_expr,
                            sample_data = sample_meta_data,
                            REML = T)

# Look at lmerSeq.fit object
class(fit.lmerSeq)

#Length is number of genes
length(fit.lmerSeq) 

# Names for individual element
names(fit.lmerSeq[[1]])

The resulting lmerSeq object is a list with elements for every gene in the expression matrix. Each gene element contains two items. The first item, fit is an object of class lmerModLmerTest (see lmerTest package) which contains information concerning the model fit. The item gene contains the genename (or rowname from the expression matrix if gene names were not provided).

Summarizing Results

Model coefficients can be more easily summarized with the lmerSeq.summary function, which calculates and tabulates adjusted p-values, as well as other results. The coefficient of interest can be passed to the function either as a numeric indicator or character string. In calculating adjusted p-values, you can choose between various adjustment methods (see p.adjust.methods) and two degree of freedom methods (Satterthwaite or Kenward-Roger). The resulting table can also be sorted by adjusted p-value by setting the sort_results option to TRUE.

 ##  Summarize the group and time coefficient
 model_sum_time <- lmerSeq.summary(lmerSeq_results = fit.lmerSeq,
                              coefficient = 3,
                              p_adj_method = 'BH',
                              ddf = 'Satterthwaite',
                              sort_results = F)
 model_sum_group <- lmerSeq.summary(lmerSeq_results = fit.lmerSeq,
                              coefficient = "group",
                              p_adj_method = 'BH',
                              ddf = 'Satterthwaite',
                              sort_results = T)

#Look at summary object
names(model_sum_time)

head(model_sum_time$summary_table)

Along with information about the function call, the resulting summary object contains a list of genes whose corresponding models were singular, and a summary table. For each gene, the summary table displays the model estimate, standard error, degree of freedom, t-value, and raw and adjusted p-values. Values for genes with singular models or models that did not converge are marked NA.

Summarizing individual linear contrasts

We are also interested in the following contrasts or linear combinations of regression parameters:

The lmerSeq.contrast function allows users to specify custom contrasts that are either one dimensional or multi-dimensional. If the contrast matrix has only one row, a t-test is executed while if the contrast matrix has more than 1 row, an F-test is used. The contrast matrix must have the same number of columns as fixed effects and if more than one row is supplied, each row should correspond to a contrast to be simultaneously tested with the other supplied contrasts.

The function calculates adjusted p-values according to the adjustment method you choose (see p.adjust.methods). You can also specify the method for the degree of freedom calculation (Satterthwaite or Kenward-Roger). The resulting contrast summary table can also be sorted by adjusted p-value by setting the sort_results option to TRUE.

The fixed effects in the contrast matrix must be in the same order specified by the model formula. If you are unsure of the ordering, you can look at the design matrix for any gene model fit in the lmerSeq fit object. To see the design matrix:

# Look at the fixed effects design matrix to develop the contrast_mat
model.matrix(fit.lmerSeq[[1]]$fit)

 ##  Contrast 1: t-test for difference in expression between groups at follow-up
 cont_mat1 <- rbind(c(0, 1, 0, 1)) 
 contrast_summary1 <- lmerSeq.contrast(lmerSeq_results = fit.lmerSeq,
                                       contrast_mat = cont_mat1,
                                       p_adj_method = 'BH',
                                       ddf = 'Satterthwaite',
                                       sort_results = T)
#Look at summary object
names(contrast_summary1)

head(contrast_summary1$summary_table)

 ##  Contrast 2: F-test for for all coefficients
 cont_mat2 <- rbind(c(0, 1, 0, 0),
                    c(0, 0, 1, 0),
                    c(0, 0, 0, 1)) 
 contrast_summary2 <- lmerSeq.contrast(lmerSeq_results = fit.lmerSeq,
                                       contrast_mat = cont_mat2,
                                       p_adj_method = 'BH',
                                       ddf = 'Satterthwaite',
                                       sort_results = T)
#Look at summary object
names(contrast_summary1)

head(contrast_summary2$summary_table)

Like with the lmerSeq.summary function, the object produced by the lmerSeq.contrast function contains information about the function call, a list of genes whose corresponding models were singular, and a summary table. For one dimensional contrasts, the summary table displays the contrast estimate and 95\% confidence limits, standard error, degrees of freedom, t-value, and raw and adjusted p-values. For multi-contrasts tests, the summary table contains the sum of squares and mean sum of squares, numerator and denominator degrees of freedom, the F-statistic, and raw and adjusted p-values. In both tables values for genes with singular models or models that did not converge are marked NA.



stop-pre16/lmerSeq documentation built on July 27, 2022, 9:08 a.m.