In this vignette, we demonstrate how to perform a differential equivalence
class (DECU) analysis with satuRn
, as opposed to the more common
differential transcript usage analysis. We will work on a subset of the
single-cell transcriptomics (10X Genomics Chromium) dataset from Hagai et al.
For a more detailed description on the biological context of the experiment, we
refer to the paragraph Import equivalence classes
of this vignette and the
original publication.
The dataset by Hagai et al. was generated using the 10X Genomics Chromium sequencing protocol. This protocol generates 75 basepair, paired-end reads, one of which harbors the unique molecular identifier (UMI). As a consequence, only one read is used for mapping the against a reference transcriptome or genome. This is typically considered insufficient for obtaining transcript-level quantifications.
However, it is possible to still obtain sub gene-level counts by quantifying
equivalence classes. An equivalence class is the set of transcripts that a
sequencing read is compatible with. As such, reads that are aligned to the same
set of transcripts are part of the same equivalence class. A transcript
compatibility count is the number of reads assigned to each of the different
equivalence classes. For a more detailed biological interpretation of an
equivalence class, we refer to the visualize DTU
paragraph in this vignette,
or to our satuRn
publication. Note that we are by no means the first to
suggest performing differential expression analyses on equivalence classes; see
ADD REFERENCES for relevant examples.
Note that for satuRn
there is no real distinction between performing a
transcript-level or equivalence class-level analysis. Once a proper input object
is provided, with each row corresponding to a sub gene-level feature, satuRn
will perform a differential usage analysis regardless of the specific feature
type.
library(tximportData) library(SingleCellExperiment) library(satuRn) library(ggplot2) library(edgeR) library(limma) library(fishpond)
For this vignette, we will perform a toy analysis on a subset of the dataset from Hagai et al. The authors performed bulk and single-cell transcriptomics experiments that aimed to characterize the innate immune response’s transcriptional divergence between different species and the variability in expression among cells. Here, we focus on a small subset of one of the single-cell experiments (10X Genomics sequencing protocol), which characterized mononuclear phagocytes in mice under two treatment regimes: unstimulated (unst) and after two hours of stimulation with lipopolysaccharide (LPS2). To reduce the runtime for this vignette, we only quantified the expression in 50 cells of each treatment group, based on the original whitelist (essentially a list of high quality cells as determined by the CellRanger software that is often used to analyze 10X Genomics data) provided by the original authors.
To obtain equivalence class-level counts, we quantified the fastq files from
Hagai et al. using the Alevin software. Alevin can take a predetermined
whitelist of high quality cells as input, which allows for quantifying only a
subset of the cells. More importantly, Alevin allows for directly obtaining
equivalence class-level counts by using the -dumpBfh
flag. For each data
sample, the equivalence-class counts are outputted in a raw format in a
bfh.txt
file. These files can we read and wrangled into a R-friendly format
using the alevinEC
function of the fishpond
R package.
Note that other pseudo-alignment tools such as Salmon
(which is internally
used by Alevin) and kallisto
are also able to output equivalence class counts.
Here, we read and wrangle the equivalence class counts from the bfh.txt
files
of the subset of samples from the dataset by Hagai et al. that is included in
the tximportData
package.
# set path to the bfh.txt files generated by Alevin with the `-dumpBfh` flag # that are present in the `tximportData` package dir <- system.file("extdata", package="tximportData") files <- c(file.path(dir,"alevin/mouse1_unst_50/alevin/bfh.txt"), file.path(dir,"alevin/mouse1_LPS2_50/alevin/bfh.txt")) # Additionally read a dataframe that links transcript identifiers to gene # identifiers tx2gene <- read.table(file.path(dir, "tx2gene_alevin.tsv"), sep = "\t", header = FALSE) colnames(tx2gene) <- c("isoform_id", "gene_id") # use `alevinEC` to read and wrangle the equivalence class counts (~10 seconds) EC_list <- alevinEC(paths = files, tx2gene = tx2gene, multigene = FALSE, ignoreTxVersion = FALSE, ignoreAfterBar = FALSE, quiet = TRUE)
The AlevinEC
function outputs a list. The counts element of the list is
a sparse matrix, where each row corresponds to an equivalence class and each
column corresponds to a cell identifier. The _integer at the end of the cell
identifier corresponds to the sample (more specifically, the index of the
element in the files vector) from which the cell originates.
TCC_mat <- EC_list$counts dim(TCC_mat) TCC_mat[1:3,1:3]
Note that the equivalence class names are in the form of integers separated by
"|" signs. These integers allow for linking back equivalence classes to their
corresponding genes and gene isoforms (transcripts) by using the
tx2gene_matched element of the AlevinEC
output. For example the first
equivalence class, 2183|2187|2188|2190 corresponds to the genes and
transcripts in rows 2183, 2187, 2188 and 2190 of the tx2gene_matched element.
tx2gene_matched <- EC_list$tx2gene_matched dim(tx2gene_matched) head(tx2gene_matched) tx2gene_matched[c(2183,2187,2188,2190),]
Next, we construct a dataframe that allows for linking each equivalence class (feature) to its corresponding gene and transcript(s). This can be achieved using the following code.
eccs <- strsplit(rownames(TCC_mat),"\\|",fixed=FALSE) # link ECs to transcripts txForEachEC <- lapply(eccs, function(ecc){ tx2gene_matched$isoform_id[as.integer(ecc)] }) # link ECs to genes eccs <- sapply(eccs, function(x) x[1]) # we can simply take the first transcript # identifier, since we already made sure all transcripts in the equivalence # class come from the same gene by by setting multigene = FALSE in the alevin_EC # function. See the help files of AlevinEC for more information on the function # arguments geneForEachEC <- tx2gene_matched$gene_id[as.integer(eccs)] rd <- data.frame(isoform_id = rownames(TCC_mat), gene_id = geneForEachEC, row.names = rownames(TCC_mat)) rd$tx_id <- txForEachEC head(rd)
In the isoform_id
column, we have a copy of the equivalence class names. Note
that the target features (here the equivalence class) must always be in a column
named isoform_id
for downstream analysis with satuRn
.
In the gene_id
column, we have the name of the gene to which the equivalence
class corresponds. Note that because we imported the data with the AlevinEC
function with the multigene
option set to FALSE. As such, each equivalence
class will be a set of reads that is only compatible with a single gene. In the
context of a differential usage analysis, this is the most sensible thing to do;
if we would retain reads compatible with multiple genes, it would not be
possible to compute their usage within their gene (this would result in multiple
usages per feature). Hence, we remove such classes, which in our hands removes
approximately 15% of the total reads, which is in line with previous reports.
Finally, the tx_id
column displays the set of transcripts with which each
equivalence class is compatible. This column is useful for downstream biological
interpretation of the equivalence classes.
Here we provide the cell-level information. Cells originated either from the unstimulated or LPS2 treatment group.
cd <- data.frame(cellID = colnames(EC_list$counts), mouse = factor(rep("mouse_1",times=100)), treatment = factor(rep(c("unst", "LPS2"),each=50), levels = c("unst", "LPS2")), row.names = colnames(EC_list$counts)) head(cd)
We combine the counts, feature-level information and cell-level information into
a SingleCellExperiment
object.
sce <- SingleCellExperiment(assays = list(counts = TCC_mat), colData = cd, rowData = rd) sce
Here we perform some feature-level filtering. For this task, we adopt
the filtering criterion that is implemented in the R package edgeR
.
Alternatively, one could adopt the dmFilter
criterion from the DRIMSeq
R
package, which provides a more stringent filtering when both methods are run
in default settings. After filtering, we again remove transcripts that are
the only isoform expressed of a certain gene.
# Remove lowly expressed equivalence classes filter_edgeR <- filterByExpr(assay(sce), min.count = 1, min.total.count = 0, large.n = 20, min.prop = 0) table(filter_edgeR) sce <- sce[filter_edgeR,] # Remove EC that are the only EC expressed of a certain gene uniqueGene <- which(isUnique(rowData(sce)$gene_id)) sce <- sce[-uniqueGene,] dim(sce)
We here perform a canonical satuRn
analysis with equivalence classes as
feature type. For amore elaborate description of the different steps, we refer
to the main
vignette
of the satuRn
package.
sce <- satuRn::fitDTU( object = sce, formula = ~ 0 + treatment, parallel = FALSE, BPPARAM = BiocParallel::bpparam(), verbose = TRUE )
Here we test for differential equivalence class usage between the two treatment groups.
First, we set up a contrast matrix, which specifies which combination of model
parameters corresponds to our research question, i.e., testing differential
usage between the treatment groups. This can be done automatically with the
makeContrasts
function of the limma
R package.
group <- sce$treatment design <- model.matrix(~ 0 + group) # construct design matrix colnames(design) <- levels(group) L <- limma::makeContrasts( Contrast1 = unst - LPS2, levels = design) L # contrast matrix
Next we can perform differential usage testing using testDTU
. We again adopt
default settings. For more information on the parameter settings, please consult
the help file of the testDTU
function.
sce <- satuRn::testDTU( object = sce, contrasts = L, diagplot1 = TRUE, diagplot2 = TRUE, sort = FALSE, forceEmpirical = FALSE )
When set to TRUE, the diagplot1
and diagplot2
arguments generate diagnostic
plots. See our main
vignette
for a detailed explanation on their interpretation.
The test results are now saved into the rowData
of our SummarizedExperiment
object under the name fitDTUResult_
followed by the name of the contrast
of interest (i.e. the column names of the contrast matrix).
The results can be accessed as follows:
head(rowData(sce)[["fitDTUResult_Contrast1"]]) # first and only contrast
See our main vignette for a detailed explanation on the interpretation of the different columns.
Finally, we may visualize the usage of the top 3 differentially used equivalence
classes in the different treatment groups. By the setting the transcripts
and
genes
arguments to NULL
and specifying top.n = 3
, the 3 features with the
smallest (empirically correct) false discovery rates are displayed.
group1 <- colnames(sce)[sce$treatment == "unst"] group2 <- colnames(sce)[sce$treatment == "LPS2"] plots <- satuRn::plotDTU( object = sce, contrast = "Contrast1", groups = list(group1, group2), coefficients = list(c(1, 0), c(0, 1)), summaryStat = "model", transcripts = NULL, genes = NULL, top.n = 3) # to have same layout as in our publication for (i in seq_along(plots)) { current_plot <- plots[[i]] + scale_fill_manual(labels = c("unst", "LPS2"), values = c("royalblue4","firebrick")) + scale_x_discrete(labels = c("unst", "LPS2")) print(current_plot) }
Below, we discuss how differential usage between treatment groups of equivalence classes can be interpreted biologically. We discuss this for two genes as examples.
We here plot the differential usage of all the equivalence classes (retained
after feature-level filtering) of the gene ENSMUSG00000030560.18. We plot
each equivalence class of this gene by specifying
genes = "ENSMUSG00000030560.18"
.
plots <- satuRn::plotDTU( object = sce, contrast = "Contrast1", groups = list(group1, group2), coefficients = list(c(1, 0), c(0, 1)), summaryStat = "model", transcripts = NULL, genes = "ENSMUSG00000030560.18", top.n = 3) # to have same layout as in our publication for (i in seq_along(plots)) { current_plot <- plots[[i]] + scale_fill_manual(labels = c("unst", "LPS2"), values = c("royalblue4","firebrick")) + scale_x_discrete(labels = c("unst", "LPS2")) print(current_plot) }
We see that there are only two equivalence classes for this gene, 58069
and
58070
. In addition, these equivalence classes are of length 1. This means
that each equivalence class corresponds only to a single transcript (isoform).
rowData(sce)[c("58069","58070"),"tx_id"]
As such, even though we performed an analysis on the equivalence class level, we here can attribute the observed change in usage to specific transcripts! I.e., we can say that both transcript "ENSMUST00000032779.12" and transcript "ENSMUST00000131108.9" are beeing differentially used between unstimulated and LPS (2 hours) stimulated cells.
Below, we show an example of a geene for which the biological interpretation will be more ambiguous.
plots <- satuRn::plotDTU( object = sce, contrast = "Contrast1", groups = list(group1, group2), coefficients = list(c(1, 0), c(0, 1)), summaryStat = "model", transcripts = NULL, genes = "ENSMUSG00000006818.6", top.n = 3) # to have same layout as in our publication for (i in seq_along(plots)) { current_plot <- plots[[i]] + scale_fill_manual(labels = c("unst", "LPS2"), values = c("royalblue4","firebrick")) + scale_x_discrete(labels = c("unst", "LPS2")) print(current_plot) }
Gene "ENSMUSG00000058427.11" has 3 equivalence classes: 119653|119657
,
119653|119656|119657
and 119653
. Only equivalence class
119653|119656|119657
is flagged as significantly differentially used between
the two treatment groups. However, when we look at the feature-level
information,
rowData(sce)[c("119653", "119653|119657", "119653|119656|119657"),"tx_id"]
we see that this equivalence class contains reads that are compatible with both transcript "ENSMUST00000007012.6" and "ENSMUST00000233897.2". As such, we cannot attribute the observed difference in usage to a specific transcript.
Furthermore, transcript "ENSMUST00000007012.6" is involved in all equivalence classes, and transcript "ENSMUST00000233897.2" is involved in both the 2nd and third equivalence class. Equivalence class 1 and 3 are both not significantly differentially used. As such, among the three transcripts involved in this gene, we did not get a clear picture which of the specific isoforms are differentially used.
In conclusion, we here displayed how to perform a equivalence class-level
differential transcript usage analysis using satuRn
. The biggest advantage of
equivalence classes is that they can even be obtained from sequencing protocols
that do not allow for transcript level quantification, which includes many of
the droplet-based protocols such as 10X Genomics Chromium. Their biggest
drawback is their downstream biological interpretation. In some cases, it will
be possible to uniquely link a equivalence class to a specific isoform. However,
in most cases, multiple transcripts can be compatible with the equivalence
class. In those cases, the interpretation is often limited to recognizing that
something is changing internally in the gene, but its unclear which exact
isoforms are involved. This, however, still provides sub gene-level information,
and may prompt follow-up experiments with a different methodolgy that allows for
pinpointing the exact transcript-level change.
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.