knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
As the cost of RNA-sequencing (RNA-seq) decreases, complex study designs, including paired, longitudinal, and other correlated designs, become increasingly feasible. Because many commonly used RNA-seq analysis tools do not allow for correlation between observations, analyzing correlated RNA-seq experiments with packages such as edgeR
or DESeq2
can lead to misleading inference. Instead, common statistical methods that can account for correlation, such as mixed effects models or generalized estimating equation (GEE). Because RNA-seq data are overdispersed counts, these analysis methods must also account for the non-normality of the data. Five methods that account for correlated data and may also be suited for overdispersed count data are: generalized estimating equations (GEE), negative binomial mixed models using a pseudo-likelihood approach (NBMM-PL), a maximum likelihood approach with a Laplace approximation (NBMM-LP), or a maximum likelihood approach with adaptive Gaussian quadrature (NBMM-AGQ), and linear mixed models (LMM) after taking a normalizing transformation of the count data. In this R package, we provide some convenient wrapper functions to fit these five types of models and obtain results tables in a similar spirit to the edgeR
or DESeq2
analysis pipelines.
Below is a brief description of five different methods that can be used to fit models to RNA-seq data (or other overdispersed count data) in this package.
Generalized estimating equations (GEE) are a semi-parametric extension of GLM that can account for correlation between observations. In these models, the marginal mean of the response (i.e. gene counts for RNA-seq data) is modeled using specified covariates (ex. treatment group), using a link function. In the case of RNA-seq data, we use a Poisson distribution with an extra scale parameter in the variance to account for overdispersion to model the data. A working correlation structure (ex. exchangeable, AR1, etc.), which describes the relationship between observations within a subject, must also be specified and robust (Sandwich) estimators are used so that that estimates are robust to misspecification of the correlation structure. For more information about GEE models, see @Liang1986.
GEE sandwich estimators can have poor performance at small sample sizes. To combat this, we include the three small sample adjustment options when fitting GEE models:
This package contains a custom function, geeglm_small_samp
, which combines the model fitting capability of the geeglm
function from the geepack
package with the small sample adjustment method capabilities of the geesmv
package. This function is used within wrapper functions to fit GEE models.
Negative binomial mixed models are a class of generalized linear mixed models that use random effects to account for correlation in the data and a negative binomial distribution to model overdispersed counts. This package allows for NBMMs to be fit using three different methods:
glmmADMB
package.GLMMadaptive
package. More details on this approximation technique and its application to RNA-seq data can be found in @Tsonaka2021.glmm_nb_lmer
, available in this package, can be used to fit individual NBMM-PL models. This function is utilized in wrapper functions to fit models for entire RNA-seq datasets. Linear Mixed Models (LMM) employ random effects to model correlated data, but assume a normal distribution. This method can be used for correlated RNA-seq data if we first apply a normalizing transformation, such as the variance-stabilizing transformation (VST) in the DESeq2
package. To apply this method to RNA-seq data we utilize the lmerSeq
package, available at https://github.com/stop-pre16/lmerSeq, which provides wrappers and summary functions for fitting LMMs to normalized RNA-seq data.
Through this package, users can perform hypothesis tests for single coefficients or single row contrasts (single degree of freedom tests) or multi-line contrasts (multiple degree of freedom tests). Each method utilizes different types of both single DF and multiple DF tests.
Methods that use t- or F-tests require the user to specify the degrees of freedom approximation methods to be used. This package offers four different options for computing the degrees of freedom.
Residual: Let $n$ be the total number of subjects, $m$ be the number of observations per subject and $\mathbf{X}$ be the fixed effects design matrix. Then $nm$ is equal to the total number of observations in the dataset and, the DF are approximated using $nm-Rank(\mathbf{X})$.
Containment: The containment method is a more conservative DF approximation than that of the residual method. Let $n$, $m$, and $\mathbf{X}$ have the same definitions as above and $\mathbf{Z}$ represents the random effects design matrix. Then, the DF approximation using the containment method is $nm-Rank(\mathbf{X},\mathbf{Z})$.
Satterthwaite: This method uses Satterthwaite's method of moments technique to approximate the degrees of freedom. In the context of RNA-seq experiments, unlike the residual and containment methods, a different degree of freedom value is estimated for each model (gene). This method is only available for data fit using LMM (on transformed data) and NBMM-PL and is implemented using the lmerTest
package. See @Giesbrecht1985 for more information.
Kenward-Roger: The Kenward-Roger approach builds off Satterthwaite's method, but includes modifications to make it more robust to small sample sizes. This method is only available for data fit using LMM (on transformed data) and NBMM-PL and is implemented using the pbkrtest
package (with slight modification for NBMM-PL models). See @Kenward1997 for more information.
The table below shows the type of hypothesis tests used and the DF methods available for each method.
| Method | Single DF Test | Multiple DF Test | DF Methods | :-------------- | :-----------------| :-----------------| :----------------------------| |gee | t-test (Wald test if contrast provided) | Wald test |containment or residual; Only applicable for t-tests| |lmm | t-test | F-test | containment, residual, ,Kenward-Roger or Satterthwaite; For multiple DF tests only Kenward-Roger or Satterthwaite| |nbmm_agq| z-test (LRT if reduced model is provided and Wald test if contrast is provided) | LRT or Wald test| Not applicable | |nbmm_lp| t-test (LRT if reduced model is provided) | LRT | containment or residual; Only applicable for t-tests | |nbmm_pl| t-test | F-test | containment, residual, ,Kenward-Roger or Satterthwaite; For multiple DF tests only Kenward-Roger or Satterthwaite |
The two main functions in the corrRNASeq
package are corrSeq_fit
, which is used to fit models using any of the methods previously outlined, and corrSeq_summary
, which is used to perform hypothesis testing and summarize the results. These functions have several required arguments and options which are described below.
| Argument | Description |
| :-------------- | :---------------------------------------------------|
| formula | A one-sided linear formula describing the model variables. For models that include them, random effects should be included in the formula using the syntax of the lme4 package.|
| expr_mat | A (G x N) numeric matrix RNA-seq expression data with genes in rows and samples in columns. For method="gee"
, method="nbmm_pl"
, method="nbmm_lp"
, and method="nbmm_agq"
, the matrix should contain raw counts and for method="lmm"
the matrix should contain transformed counts (e.g. using VST from DESeq2
). G = number of genes. N = number of samples. |
| sample_data | Data frame with N rows containing the fixed- and random-effects terms included in the formula. The rows of the data frame must correspond (and be in the same order as) the columns of the expression matrix.|
| method | Method to use to fit the models. Possible options are "lmm"
, "gee"
, "nbmm_pl"
, "nbmm_lp"
and nbmm_agq
.|
|id | Only applicable for models fit using the method="gee"
method. A vector or data column name which identifies the clusters. The length of ‘id’ should be the same as the number of observations. Data are assumed to be sorted so that observations on each cluster appear as contiguous rows in data. If data is not sorted this way, the function will not identify the clusters correctly. If sort=TRUE
(default), the dataframe from the data
argument is sorted by the id column to avoid this issue.|
| Option | Description | Default |
| :---------------------- | :--------------------------------------------------- | :----------------|
| gene_names | An optional character vector of gene names (length G). | NULL
|
| small.samp.method| Only applicable for models fit using the method="gee"
method. A character string specifying the small sample method. The following are permitted: "pan"
for the @Pan2001 method, "md"
for the @Mancl2001 method, and "wl
" for the @Wang2011 method. If small.samp.method
is NULL
, small sample variance estimates are not computed.| NULL
| parallel |A logical variable indicating whether forking (via mclapply) should be used to parallelize fits. Only available on Mac or linux machines.| FALSE
|
| cores | Number of cores to use if using parallelizing fits. | 2 |
|...|additional arguments passed on to lmer
(method="lmm"
), gee_small_sample
, (method="lmm"
),glmm_nb_lmer
(method="nbmm_pl"
) , mixed_model
(method="nbmm_agq
) or glmmadmb
(method="nbmm_lp"
)| NULL
|
| Argument | Description |
| :-------------- | :---------------------------------------------------|
| corrSeq_results | Results object from running corrSeq_fit. |
| coefficient | Character string or numeric indicator of which coefficient to summarize.|
| Option | Description | Default |
| :---------------------- | :--------------------------------------------------- | :----------------|
|corrSeq_results_reduced | Results object from corrSeq_fit containing reduced models for LRT tests. Only applicable when performing LRT tests using nbmm_agq
or nbmm_lp
model fits. |
| p_adj_method | Method for adjusting for multiple comparisons. See p.adjust.methods
from the stats
package. | "BH"
|
| df | Method for computing degrees of freedom for t- and F-tests. The options "Satterthwaite"
and "Kenward-Roger"
can only be used for models fit using nbmm_pl
or lmm
. Options "containment"
and "residual"
can be used for models fit using any method except nbmm_agq
(which does not use degrees of freedom, so use df=NA
). Alternatively, a single numeric value representing the df for all tests can also be given. If testing a multi-row contrast for nbmm_lp
, nbmm_agq
, or gee
, use df=NA
since these tests do not use degrees of freedom. If using a multi-row contrast for nbmm_pl
or lmm
, only df="Satterthwaite"
and df="Kenward-Roger"
are available.| "residual"
|
| sort_results |Should the results table be sorted by adjusted p-value? | TRUE
|
We will perform an analysis using data simulated from a negative binomial distribution. This dataset is included in the package and contains expression data for 6 subjects (ids 1 to 6): 3 control subjects (group = 0) and 3 treatment subjects (group = 1). All subjects are measured at 4 timepoints, (time=0 to 3). A total of 14,561 genes are included in this dataset, around 20\% of which were simulated to have changes in expression across time in the treatment group only. For this analysis, we will subset the data to the first 50 genes. First load the data:
# Load the library library(corrRNASeq) # Load the data data("simdata") names(simdata) # metadata has information about the the study design # each row is a sample and corresponds to a column # in the expression matrix (counts) sample_meta_data <- simdata$metadata sample_meta_data # The expression matrix (counts) has the same number of columns as # rows in metadata. The columns of the count matrix # are in the same order as the rows of the metadata. # Subset to include 50 genes counts <- simdata$counts[1:50,] head(counts) # The vst_expr object is the transformed expression matrix # It was created using the vst function from DESeq2 # We will use this transformed data for the LMM method # Subset to contain 50 genes vst_expr<-simdata$vst_expr[1:50,] head(vst_expr)
The goal of our analysis will be to compare changes over time between the control and treatment groups. We will do this using all five modelling methods. We will fit models where both group and time are categorical and include fixed effects for group, time, and an interaction between group and time. For the mixed models, we will use a random intercept and no other random effects.
To fit models using each of the methods, we use the corrSeq_fit
function. The method
argument is used to specify one of the five model fitting methods ("gee"
, "nbmm_lp"
, "nbmm_pl"
, nbmm_agq
or "lmm"
) The expr_mat
argument takes a data frame or matrix of RNA-seq expression data, the columns of which must be in the same order as the rows of the sample_data
data frame, which contains metadata on each sample. The model formula is passed to the formula
argument as a one sided formula with the model effects, which must be included in the sample_data
data frame. Offset terms may also be included in the model formula. By default, corrSeq_fit
will use rownames of the count matrix to identify genes. If you would like to use a different set of gene names, you can supply a character vector of names in the same order as the rows of the count matrix to the gene_names option. For models fit using method="gee"
, the id
argument must be specified either with a vector identifying the cluster/subject for each sample, or a column name for a column of ids in the sample_data
data frame. Additionally for method="gee"
, a small sample adjustment method can be specified using the small.samp.method
option. If you are using Mac or Linux and would like to use use forking (via mclapply
) to parallelize fits, the parallel
argument should be set to TRUE
and the number of cores to be used should be passed to the cores
argument. Finally, additional arguments to be passed to the functions used to fit individual models ( lmer
(method="lmm"), gee_small_sample
, (method="lmm"), glmm_nb_lmer
(method="nbmm_pl") or glmmadmb
(method="nbmm_lp")) can also be specified.
######### Fit the Models for each method ############### ## GEE models using Wang-Long small sample size estimator gee_fit <- corrSeq_fit(formula = ~ group * time+offset(log_offset), expr_mat = counts, sample_data = sample_meta_data, method="gee", id=ids, small.samp.method="wl") ## Fit NBMM-PL models nbmm_pl_fit <- corrSeq_fit(formula = ~ group *time+ offset(log_offset)+(1|ids), expr_mat = counts, sample_data = sample_meta_data, method="nbmm_pl") ## Fit NBMM-AGQ models nbmm_agq_fit <- corrSeq_fit(formula = ~ group * time+(1|ids)+offset(log_offset), expr_mat = counts, sample_data = sample_meta_data, method="nbmm_agq") ## Fit NBMM-LP models ## Random effects must be factors sample_meta_data$ids<-factor(sample_meta_data$ids) nbmm_lp_fit <- corrSeq_fit(formula = ~ group * time+(1|ids)+offset(log_offset), expr_mat = counts, sample_data = sample_meta_data, method="nbmm_lp") ## Fit LMM Models ## Use VST expression data instead of raw counts lmm_fit<- corrSeq_fit(formula = ~ group * time + (1|ids), expr_mat = vst_expr, sample_data = sample_meta_data, method="lmm") ###################Inspect Model objects############ # Look at class of one object class(gee_fit) #Length is number of genes length(gee_fit) # Class for an individual element from each object class(gee_fit$gene_1) class(nbmm_pl_fit$gene_1) class(nbmm_agq_fit$gene_1) class(nbmm_lp_fit$gene_1) class(lmm_fit$`gene_ 1`$fit)
The resulting object for each method is a list with elements for every gene in the expression matrix. Each gene element contains a model object whose class is dependent on the model fitting method. If the gee
method was used, each gene object is of the geeglm
class with an additional item for the small sample variance, if a small sample method was specified (see geepack
package and geeglm_small_samp
function documentation). NBMM-PL model objects are of the class glmm_nb_mod
, NBMM-LP models are of the class (see glmmADMB
package), LMM models are of the class lmerModLmerTest
(see lmerTest
package), and NBMM-AGQ model objects are of the class MixMod
(see GLMMadaptive
package).
Hypothesis tests can be run and summarized with the corrSeq_summary
function, which calculates and tabulates degrees of freedom and adjusted p-values, as well as other results. A coefficient of interest can be passed to the function either as a numeric indicator or character string. Alternatively, a contrast can be specified as a vector (single DF test) or matrix (multi-DF test) with the number of columns equal to the number of coefficients in the model. For methods using LRT tests, an object containing reduced models for each gene can be supplied. In calculating adjusted p-values, you can choose between various adjustment methods (see p.adjust.methods)
and four degrees of freedom methods (residual, containment, Satterthwaite, or Kenward-Roger; not used for LRT or Wald tests). The resulting table can also be sorted by adjusted p-value by setting the sort_results
option to TRUE
.
## Test for differences in treatment group at baseline using GEE models and residual DF gee_sum <-corrSeq_summary(corrSeq_results = gee_fit, coefficient = "group1", p_adj_method = 'BH', df = 'residual', sort_results = T) ## Test for differences between baseline and time 1 in control group with NBMM-LP model ## Use coefficient number instead of name nbmm_lp_sum <-corrSeq_summary(corrSeq_results = nbmm_lp_fit, coefficient = 2, p_adj_method = 'BH', df = "residual", sort_results = T) ## Test for differential expression between groups at any timepoint with NBMM-PL model ## using Satterthwaite DF contrast_mat<-rbind(c(0, 1, 0, 0, 0, 0, 0, 0), c(0, 1, 0, 0, 0, 1, 0, 0), c(0, 1, 0, 0, 0, 0, 1, 0), c(0, 1, 0, 0, 0, 0, 0, 1) ) nbmm_pl_sum<-corrSeq_summary(corrSeq_results = nbmm_pl_fit, contrast = contrast_mat, p_adj_method = 'BH', df = 'Satterthwaite', sort_results = T) ## Test for interaction effect between any group or timepoint with NBMM-AGQ model ## using LRT test reduced_mods<-corrSeq_fit(formula = ~ group+time+(1|ids)+offset(log_offset), expr_mat = counts, sample_data = sample_meta_data, method="nbmm_agq") nbmm_agq_sum<-corrSeq_summary(corrSeq_results = nbmm_agq_fit, corrSeq_results_reduced = reduced_mods, p_adj_method = 'BH', df = NA, sort_results = T) ## Test if there is differential expression between any timepoints in the treatment group ## Use LMM model and Kenward-Roger DF contrast_mat=rbind(c(0, 0, 1, 0, 0, 1, 0, 0), c(0, 0, 0, 1, 0, 0, 1, 0), c(0, 0, 0, 0, 1, 0, 0, 1)) lmm_sum <-corrSeq_summary(corrSeq_results = lmm_fit, contrast = contrast_mat, p_adj_method = 'BH', df = 'Kenward-Roger', sort_results = T) #Look at one of the summary objects names(nbmm_pl_sum) #Look at singular fits item nbmm_pl_sum$singular_fits #Look at summary table head(nbmm_pl_sum$summary_table)
Along with information about the function call, the resulting summary object contains a list of genes whose corresponding models were singular, and a summary table. For each gene, the summary table displays the model estimates and testing information as well as raw and adjusted p-values. Values for genes with singular models or models that did not converge are marked NA
.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.