In zhuchcn/HTSet: High through-put experiment dataset container

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(ggplot2)
library(knitr)
library(kableExtra)
library(dplyr)
library(HTSet)

introduction

High through-put experiments such as RNA-seq, proteomics, and metaboliomics estimate the abundance/concentration of hundrands to millions of features. Very commonly, researchers would like to answer questions such as which gene (or protein, metabolites) were differentially expressed or have differential abundance in treatment groups. Linear model and generalized linear model are widely used because they are highly flaxible to be adopted to different study designs. The HTSet package has a wrapper function model_fit() of three popular bioconductor packages, including limma, edgeR, and DESeq2, that performs differnetial analysis on a high through-put experiment dataset, and it returns the result in an organized structure.

The basic usage of model_fit() is below.

model_fit(object, design, coef, engine, args)

object must be a HTSet object. design is the study design matrix, usually returned by the model.matrix function. coef is the coefficient to perform statistical test. engine must be one of the three backend statistical package mentioned above. args is a list of additional parameters to be parsed to the backend statistical engine.

model_fit returns a list-like object. The elemenets that it contains are listed below.

results: A data.frame of the statistical test results for each gene/feature in each row. It has 5 columns. logFC has the estimate of the log2-fold-change corresponding to the effect. mean has the average lo2-expression for the gene/feature accross all samples. This is the AveExpr column in limma's topTable or baseMean in DESeq2's results. stat is the statistic value of the corresponding test. When using limma, this is the t-statistic value, same as the t column in the result of topTable. For edgeR, this is the f-statistic value for Quansi-likelihood test or the likelihood ratio statistic value for likelihood ratio test. Same as the column F or RT in the result of edgeR's topTags. As for DESeq2, this is the Wald statistic value for Wald test or the difference in deviance between the reduced model and the full model for likelihood ratio test. same as the stat column in the output of DESeq2's results. pval is the raw o-value and padj is the p-value after correction for multiple testing.
adjust.method: The method used to correct for multiple testing.
design: The design matrix
df: Degree of freedom for each gene/feature.
distribution: The distribution which the stat column in results is compared to generate the pval. This is t for t distribution for limma or DESeq2 with Wald test when useT eqauls to TRUE, f for f distribution for Quansi-likelihood test with edgeR, chisq for chi-squared distribution for likelihood ratio test with edgeR or DESeq2, and norm for DESeq2 with Wald test when useT equals to FALSE.
engin: The statistical package used as backend engine.
coef: The coefficient tested
params: Additional parameters parsed to model_fit

The HTSet package has two dataset included. They both can be loaded using the data() function.

data(exrna)
data(lipidome)

In this vignette, we will go though some examples of using model_fit() to perform statistical analysis with the two datasets.

Backend engines

count data

High through-put experimental data generated by nucleiic acid sequencing are were usually presented as counts per gene (RNA-seq) or counts per ASV/OTU (16S-seq). Counts data by nature do not fit gaussian distribution, thus linear model is not directly applicable. Thus, the major approches are (1) transforming counts data to continous data and normalized it, and (2) use negative binomial generalized linear model (glm). For the former approch, counts data are often transformed into logCPM followed by linear modeling. Another or a better approach is using limma's voom function, which estimates the mean-variance relationship of the log-counts, and generates a precision weight for each observation. As for negative binomial glm, edgeR and DESeq2 two widely recognized packages.

exrna

The dataset exrna is the small RNA-seq result of isolated HDL from human plasma of 3 patients with systemic lupus erythematosus and 3 healthy control. The raw sequence fastq files were downloaded from GEO (GSE121865), and processed using the exceRNApipeline, using STAR agaist hg38. Gene counts were generated using htseq-count agaist gencode's annotation.

exrna

kable(exrna$pdata[,c(1,2,4,7)], format = "html") %>%
    kable_styling()%>%
    column_spec(5, background = "orange") %>%
    scroll_box(width = "100%")

limma voom

We first create a study design matrix using the model.matrix function.

# create the design matrix
design = model.matrix( ~ Condition, data = exrna$pdata)
kable(design)

We are going to test whether each gene is expressed differently between patinets and control, thus the cofficient that we are going to test should be ConditionSystemic Lupus Erythematosus (second column name). We can than call model_fit. Setting voom = TRUE tells model_fit to run voom before calling limma's lmFit function.

coef = "ConditionSystemic Lupus Erythematosus"
fit1 = model_fit(exrna, design = design, coef = coef, engine = "limma", args = list(voom = TRUE))

When args has an element named voom, model_fit calls the three functions voom, lmFit and eBayes in order. Additional arguments can be parsed to each of the three functions by assigning a list as value to the key of the corresponding step. For example, voom's mean-variance trend plot was disabled by default. To enable the plot, parse it in a list to voom, as shown below.

fit1_2 = model_fit(exrna, design = design, coef = coef, engine = "limma", 
                   args = list(voom = list(plot = TRUE)))

Draw a volcano plot

volcanoplot(fit1, hline = 0.05)

P value histogram

hist(fit1, vline = 0.05)

limma (but not voom)

It is also possible to transform count data into logCPM, and use limma directly.

exrna_cpm = exrna
exrna_cpm$edata = edgeR::cpm(exrna_cpm$edata, prior.count = 1, log = TRUE)
fit2 = model_fit(exrna_cpm, design = design, coef = coef, engine = "limma")

fit2$results[order(fit2$results$pval),] %>% head %>% kable()

Drawing a volcano plot

volcanoplot(fit2, hline = 0.05)

P value histogram

hist(fit2, vline = 0.05)

DESeq2

To use DESeq2 to perform negative binomial glm, simply set engine = "DESeq2"

fit3 = model_fit(exrna, design = design, coef = coef, engine = "DESeq2")

Additional parameters can be parsed to the args as a list named as DESeq. The parameters will be used internally by the DESeq function. Use ?DESeq to see all parameters that can be used. For example, to use t distribution to estimate p-values:

fit3_2 = model_fit(
    exrna, design = design, coef = coef, engin = "DESeq2",
    args = list(
        DESeq = list(useT = TRUE)
    )
)

DESeq2 also provides a likelihood ratio test. That can be done in a very similar way, by parsing additional arguments to the args parameter.

reduced = model.matrix(~1, data = exrna$pdata)
fit3_2 = model_fit(
    exrna, design = design, coef = coef, engin = "DESeq2",
    args = list(
        DESeq = list(test = "LRT", reduced = reduced)
    )
)

fit3$results[order(fit3$results$pval),] %>% head %>% kable()

Drawing a volcano plot

volcanoplot(fit3, hline = 0.05)

P value histogram

hist(fit3, vline = 0.05)

edgeR

Similar to DESeq2, edgeR also performes negative binomial glm test. Two models are provided by edgeR, the quasi-likelihood test (QLF) and likelihood ratio test (LRT). model_fit uses QLF by default.

fit4 = model_fit(exrna, design = design, coef = coef, engine = "edgeR")
fit4$results[order(fit4$results$pval),] %>% head %>% kable()

Drawing a volcano plot

volcanoplot(fit4, hline = 0.05)

P value histogram

hist(fit4, vline = 0.05)

To use LRT, you need to specify it in the args.

fit5 = model_fit(exrna, design = design, coef = coef, engine = "edgeR",
                 args = list(model = "lrt"))
fit5$results[order(fit5$results$pval),] %>% head %>% kable()

Drawing a volcano plot

volcanoplot(fit5, hline = 0.05)

P value histogram

hist(fit5, vline = 0.05)

Complex models

lipidome

model_fit() also supports more complex study design models. The dataset lipidome is the result of a lipidomics analysis on isolated HDL from 10 human subjects in a 2x2 randomized cross-over study. The study was published on Metabolomics, 2019 (doi: 10.1007/s11306-019-1579-1).

lipidome

summary(lipidome$pdata[,c("Treatment", "Timepoint")])

In the examples above, we were simply comparing disease patients to control. However in this study, we are asking a slightly more complex question. We would like to see how each lipid species change before and after the two treatments FF and Med. And since each subject was sampled four times, the model that we are using would be as blew. And $\beta_3$ is the coefficient that we would like to test.

$$y = \beta_0 + \beta_1 Treatment + \beta_2 Timepoint + \beta_3 Timepoint*Treatment + \beta_4 Subject + e$$

The design matrix should be as below.

# in R's equation, ~ Treatment * Timepoint is a shorthand for 
# ~ Treatment + Timepoint + Treatment * Timepoint
design = model.matrix(~ Treatment * Timepoint + Subject + 1, data = lipidome$pdata)
coef = "TreatmentMed:TimepointPre"

Because the lipidome is continous data, the limma can be directly applied.

lipidome2 = transform_by_sample(lipidome, function(x) log(x/sum(x)))
fit6 = model_fit(lipidome2, design = design, coef = coef, engine = "limma")
fit6$results[order(fit6$results$pval),] %>% head %>% kable

volcanoplot(fit6, hline = 0.05)

hist(fit6, vline = 0.05)

The End

model_fit() is simply a wrapper for the statistical pipelines from limma, edgeR, and DESeq2 with a clean interface and output format. You can certainly use limma, edgeR or DESeq2 directly and it is also not too complicated. Sometimes different models may come up with slightly different gene/feature list, however, as long as the model is approciate for the type of data to answer the specific question, your conclusion should be affected much. On thing to bear in mind that although most high through-put experiments are exploratory and hypothesis generating, running all different kinds of models and keep the one with the gene list that you like the most is probably a bad practice and may lead to inappropriate conclusion.

zhuchcn/HTSet documentation built on April 10, 2020, 4:51 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com