knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(ggplot2)
library(knitr)
library(kableExtra)
library(dplyr)
library(HTSet)

introduction

High through-put experiments such as RNA-seq, proteomics, and metaboliomics estimate the abundance/concentration of hundrands to millions of features. Very commonly, researchers would like to answer questions such as which gene (or protein, metabolites) were differentially expressed or have differential abundance in treatment groups. Linear model and generalized linear model are widely used because they are highly flaxible to be adopted to different study designs. The HTSet package has a wrapper function model_fit() of three popular bioconductor packages, including limma, edgeR, and DESeq2, that performs differnetial analysis on a high through-put experiment dataset, and it returns the result in an organized structure.

The basic usage of model_fit() is below.

model_fit(object, design, coef, engine, args)

object must be a HTSet object. design is the study design matrix, usually returned by the model.matrix function. coef is the coefficient to perform statistical test. engine must be one of the three backend statistical package mentioned above. args is a list of additional parameters to be parsed to the backend statistical engine.

model_fit returns a list-like object. The elemenets that it contains are listed below.

The HTSet package has two dataset included. They both can be loaded using the data() function.

data(exrna)
data(lipidome)

In this vignette, we will go though some examples of using model_fit() to perform statistical analysis with the two datasets.

Backend engines

count data

High through-put experimental data generated by nucleiic acid sequencing are were usually presented as counts per gene (RNA-seq) or counts per ASV/OTU (16S-seq). Counts data by nature do not fit gaussian distribution, thus linear model is not directly applicable. Thus, the major approches are (1) transforming counts data to continous data and normalized it, and (2) use negative binomial generalized linear model (glm). For the former approch, counts data are often transformed into logCPM followed by linear modeling. Another or a better approach is using limma's voom function, which estimates the mean-variance relationship of the log-counts, and generates a precision weight for each observation. As for negative binomial glm, edgeR and DESeq2 two widely recognized packages.

exrna

The dataset exrna is the small RNA-seq result of isolated HDL from human plasma of 3 patients with systemic lupus erythematosus and 3 healthy control. The raw sequence fastq files were downloaded from GEO (GSE121865), and processed using the exceRNApipeline, using STAR agaist hg38. Gene counts were generated using htseq-count agaist gencode's annotation.

exrna
kable(exrna$pdata[,c(1,2,4,7)], format = "html") %>%
    kable_styling()%>%
    column_spec(5, background = "orange") %>%
    scroll_box(width = "100%")

limma voom

We first create a study design matrix using the model.matrix function.

# create the design matrix
design = model.matrix( ~ Condition, data = exrna$pdata)
kable(design)

We are going to test whether each gene is expressed differently between patinets and control, thus the cofficient that we are going to test should be ConditionSystemic Lupus Erythematosus (second column name). We can than call model_fit. Setting voom = TRUE tells model_fit to run voom before calling limma's lmFit function.

coef = "ConditionSystemic Lupus Erythematosus"
fit1 = model_fit(exrna, design = design, coef = coef, engine = "limma", args = list(voom = TRUE))

When args has an element named voom, model_fit calls the three functions voom, lmFit and eBayes in order. Additional arguments can be parsed to each of the three functions by assigning a list as value to the key of the corresponding step. For example, voom's mean-variance trend plot was disabled by default. To enable the plot, parse it in a list to voom, as shown below.

fit1_2 = model_fit(exrna, design = design, coef = coef, engine = "limma", 
                   args = list(voom = list(plot = TRUE)))

Draw a volcano plot

volcanoplot(fit1, hline = 0.05)

P value histogram

hist(fit1, vline = 0.05)

limma (but not voom)

It is also possible to transform count data into logCPM, and use limma directly.

exrna_cpm = exrna
exrna_cpm$edata = edgeR::cpm(exrna_cpm$edata, prior.count = 1, log = TRUE)
fit2 = model_fit(exrna_cpm, design = design, coef = coef, engine = "limma")
fit2$results[order(fit2$results$pval),] %>% head %>% kable()

Drawing a volcano plot

volcanoplot(fit2, hline = 0.05)

P value histogram

hist(fit2, vline = 0.05)

DESeq2

To use DESeq2 to perform negative binomial glm, simply set engine = "DESeq2"

fit3 = model_fit(exrna, design = design, coef = coef, engine = "DESeq2")

Additional parameters can be parsed to the args as a list named as DESeq. The parameters will be used internally by the DESeq function. Use ?DESeq to see all parameters that can be used. For example, to use t distribution to estimate p-values:

fit3_2 = model_fit(
    exrna, design = design, coef = coef, engin = "DESeq2",
    args = list(
        DESeq = list(useT = TRUE)
    )
)

DESeq2 also provides a likelihood ratio test. That can be done in a very similar way, by parsing additional arguments to the args parameter.

reduced = model.matrix(~1, data = exrna$pdata)
fit3_2 = model_fit(
    exrna, design = design, coef = coef, engin = "DESeq2",
    args = list(
        DESeq = list(test = "LRT", reduced = reduced)
    )
)
fit3$results[order(fit3$results$pval),] %>% head %>% kable()

Drawing a volcano plot

volcanoplot(fit3, hline = 0.05)

P value histogram

hist(fit3, vline = 0.05)

edgeR

Similar to DESeq2, edgeR also performes negative binomial glm test. Two models are provided by edgeR, the quasi-likelihood test (QLF) and likelihood ratio test (LRT). model_fit uses QLF by default.

fit4 = model_fit(exrna, design = design, coef = coef, engine = "edgeR")
fit4$results[order(fit4$results$pval),] %>% head %>% kable()

Drawing a volcano plot

volcanoplot(fit4, hline = 0.05)

P value histogram

hist(fit4, vline = 0.05)

To use LRT, you need to specify it in the args.

fit5 = model_fit(exrna, design = design, coef = coef, engine = "edgeR",
                 args = list(model = "lrt"))
fit5$results[order(fit5$results$pval),] %>% head %>% kable()

Drawing a volcano plot

volcanoplot(fit5, hline = 0.05)

P value histogram

hist(fit5, vline = 0.05)

Complex models

lipidome

model_fit() also supports more complex study design models. The dataset lipidome is the result of a lipidomics analysis on isolated HDL from 10 human subjects in a 2x2 randomized cross-over study. The study was published on Metabolomics, 2019 (doi: 10.1007/s11306-019-1579-1).

lipidome
summary(lipidome$pdata[,c("Treatment", "Timepoint")])

In the examples above, we were simply comparing disease patients to control. However in this study, we are asking a slightly more complex question. We would like to see how each lipid species change before and after the two treatments FF and Med. And since each subject was sampled four times, the model that we are using would be as blew. And $\beta_3$ is the coefficient that we would like to test.

$$y = \beta_0 + \beta_1 Treatment + \beta_2 Timepoint + \beta_3 Timepoint*Treatment + \beta_4 Subject + e$$

The design matrix should be as below.

# in R's equation, ~ Treatment * Timepoint is a shorthand for 
# ~ Treatment + Timepoint + Treatment * Timepoint
design = model.matrix(~ Treatment * Timepoint + Subject + 1, data = lipidome$pdata)
coef = "TreatmentMed:TimepointPre"

Because the lipidome is continous data, the limma can be directly applied.

lipidome2 = transform_by_sample(lipidome, function(x) log(x/sum(x)))
fit6 = model_fit(lipidome2, design = design, coef = coef, engine = "limma")
fit6$results[order(fit6$results$pval),] %>% head %>% kable
volcanoplot(fit6, hline = 0.05)
hist(fit6, vline = 0.05)

The End

model_fit() is simply a wrapper for the statistical pipelines from limma, edgeR, and DESeq2 with a clean interface and output format. You can certainly use limma, edgeR or DESeq2 directly and it is also not too complicated. Sometimes different models may come up with slightly different gene/feature list, however, as long as the model is approciate for the type of data to answer the specific question, your conclusion should be affected much. On thing to bear in mind that although most high through-put experiments are exploratory and hypothesis generating, running all different kinds of models and keep the one with the gene list that you like the most is probably a bad practice and may lead to inappropriate conclusion.



zhuchcn/HTSet documentation built on April 10, 2020, 4:51 p.m.