In order to demonstrate the ability of satuRn to perform a differential exon usage (DEU) analysis, as opposed to a differential transcript usage (DTU) analysis, we perform the DEU analysis described in the vignette of DEXSeq. Note that for satuRn there is no real distinction between performing a transcript-level or exon-level analysis. Once a proper input object is provided, with each row corresponding to a sub gene-level feature, satuRn will perform a differential usage analysis regardless of the specific feature type.

In this script, we perform a DEU analysis on the same dataset as in the DEXSeq vignette, i.e. a subset of the pasilla bulk RNA-Seq dataset by (Brooks et al., 2011), which can be obtained with the Bioconductor experiment package pasilla. Brooks et al. investigated the effect of siRNA knock-down of the gene pasilla on the transcriptome of fly S2-DRSC cells. The RNA-binding protein pasilla protein is thought to be involved in the regulation of splicing. (Its mammalian orthologs, NOVA1 and NOVA2, are well-studied examples of splicing factors.)

library(knitr)

Load and wrangle data

library(satuRn)
library(DEXSeq)
library(pasilla)
library(ggplot2)

First, we load the data files of the pasilla dataset as processed by the authors of the DEXSeq vignette.

inDir <- system.file("extdata", package="pasilla")
countFiles <- list.files(inDir, pattern="fb.txt$", full.names=TRUE)
flattenedFile <- list.files(inDir, pattern="gff$", full.names=TRUE)

sampleTable <- data.frame(row.names = c( "treated1", "treated2", "treated3", 
                                         "untreated1", "untreated2", "untreated3", 
                                         "untreated4" ),
                          condition = c("knockdown", "knockdown", "knockdown",  
                                        "control", "control", "control", "control"),
                          libType = c("single-end", "paired-end", "paired-end", 
                                      "single-end", "single-end", "paired-end", "paired-end"))

Next, we use the wrapper DEXSeqDataSetFromHTSeq function of the DEXSeq package to create a DEXSeqDataSet object from the raw data files. In addition, we subset the data to a selected set of genes created by the authors of the DEXSeq vignette, with the purpose of limiting the vignette runtime.

dxd <- DEXSeqDataSetFromHTSeq(countFiles,
                              sampleData=sampleTable,
                              design= ~ sample + exon + libType:exon + condition:exon,
                              flattenedfile=flattenedFile)
genesForSubset <- read.table(file.path(inDir, "geneIDsinsubset.txt"), 
                             stringsAsFactors=FALSE)[[1]]
dxd <- dxd[geneIDs(dxd) %in% genesForSubset,]
dxd # only 498 out of 70463 exons retained

Next, we remove exons with zero counts in all samples, and exon that are the only exon within a gene. This filtering is optional, but recommended. Not filter lowly abundant exons may lead to fit errors. Retaining exons that are the only exon in a gene is nonsensical for a differential usage analysis, because all the usages will be 100% by definition. satuRn will handle fit errors and "lonely transcripts" internally, setting their results to NA. However, it is good practice to remove them up front.

# remove exons with zero expression
dxd <- dxd[rowSums(featureCounts(dxd)) != 0,]

# remove exons that are the only exon for a gene
remove <- which(table(rowData(dxd)$groupID) == 1)
dxd <- dxd[rowData(dxd)$groupID != names(remove),]

Wrangle the data into a SummarizedExperiment object. satuRn can also handle RangedSummarizedExperiment and SingleCellExperiment objects, but is not compatible with DEXSeqDataSet objects.

exonInfo <- rowData(dxd)
colnames(exonInfo)[1:2] <- c("isoform_id", "gene_id")
exonInfo$isoform_id <- rownames(exonInfo)
sumExp <- SummarizedExperiment::SummarizedExperiment(assays = list(counts=featureCounts(dxd)), 
                                                     colData = sampleAnnotation(dxd), 
                                                     rowData = exonInfo)

satuRn analysis

We here perform a canonical satuRn analysis with exons as feature type. For a more elaborate description of the different steps, we refer to the main vignette of the satuRn package.

Fit quasi-binomial generalized linear models

sumExp <- satuRn::fitDTU(object = sumExp,
                         formula = ~ 0 + condition + libType,
                         parallel = FALSE,
                         BPPARAM = BiocParallel::bpparam(),
                         verbose = TRUE)

Test for differential exon usage

Create contrast matrix

design <- model.matrix(~0 + sampleAnnotation(dxd)$condition + sampleAnnotation(dxd)$libType)
colnames(design)[1:2] <- levels(as.factor(sampleAnnotation(dxd)$condition))

L <- matrix(0, ncol = 1, nrow = ncol(design))
rownames(L) <- colnames(design)
colnames(L) <- "C1"
L[c("control", "knockdown"), 1] <- c(1,-1)
L

Perform the test

sumExp <- satuRn::testDTU(object = sumExp, 
                          contrasts = L, 
                          diagplot1 = TRUE,
                          diagplot2 = TRUE,
                          sort = FALSE,
                          forceEmpirical = TRUE)

Visualize DTU

Visualize the statistically significant differentially used exons.

# get all (3) statistically significant differentially used exons
DEU <- rownames(rowData(sumExp)[["fitDTUResult_C1"]][which(rowData(sumExp)[["fitDTUResult_C1"]]$empirical_FDR < 0.05),])

group1 <- rownames(colData(sumExp))[colData(sumExp)$condition == "knockdown"]
group2 <- rownames(colData(sumExp))[colData(sumExp)$condition == "control"]

plots <- satuRn::plotDTU(object = sumExp,
                         contrast = "C1",
                         groups = list(group1,group2),
                         coefficients = list(c(0,1,0),c(1,0,0)),
                         summaryStat = "model",
                         transcripts = DEU)

for (i in seq_along(plots)) {
    current_plot <- plots[[i]] +
        scale_fill_manual(labels = c("knockdown","control"),
                          values=c("royalblue4", "firebrick")) + 
        scale_x_discrete(labels= c("knockdown","control")) +
        theme(axis.text.x = element_text(angle = 0, vjust = 1, hjust = 0.5, size = 9)) +
        theme(strip.text = element_text(size = 9, face = "bold"))

    print(current_plot)
}

Comparison with DEXSeq

In this document, we display the ability of satuRn to perform a differential exon usage (DEU) analysis. In our publication [@Gilis2021], we have used this analysis to demonstrate satuRn's ability to perform a DEU analysis, as well as to compare its result to those of DEXSeq. The main conclusion was that when the DEU results are ranked in terms of statistical significance, satuRn and DEXSeq results display a very strong concordance. This is in line with these methods having a very similar performance on small bulk RNA-seq datasets when performing analyses on the transcript level. However, as the datasets grow, e.g. for single-cell data, satuRn was much more scalable, and its empirical correction of p-values additionally improved the type 1 error control. We did not extensively benchmark if these methods have the same behavior on exon-level data, but at least for this small bulk analysis, this seems to be the case.

Acknowledgements

We would like to specifically acknowledge the original authors of the DEXSeq vignette, Alejandro Reyes, Simon Anders and Wolfgang Huber. This vignette essentially uses their processed data files and has simply replaced their DEXSeq analysis with a satuRn analysis.

Session info

sessionInfo()

References



statOmics/satuRn documentation built on March 9, 2023, 3:43 p.m.