In this vignette, we demonstrate how to perform a differential equivalence class (DECU) analysis with satuRn, as opposed to the more common differential transcript usage analysis. We will work on a subset of the single-cell transcriptomics (10X Genomics Chromium) dataset from Hagai et al. For a more detailed description on the biological context of the experiment, we refer to the paragraph Import equivalence classes of this vignette and the original publication.

The dataset by Hagai et al. was generated using the 10X Genomics Chromium sequencing protocol. This protocol generates 75 basepair, paired-end reads, one of which harbors the unique molecular identifier (UMI). As a consequence, only one read is used for mapping the against a reference transcriptome or genome. This is typically considered insufficient for obtaining transcript-level quantifications.

However, it is possible to still obtain sub gene-level counts by quantifying equivalence classes. An equivalence class is the set of transcripts that a sequencing read is compatible with. As such, reads that are aligned to the same set of transcripts are part of the same equivalence class. A transcript compatibility count is the number of reads assigned to each of the different equivalence classes. For a more detailed biological interpretation of an equivalence class, we refer to the visualize DTU paragraph in this vignette, or to our satuRn publication. Note that we are by no means the first to suggest performing differential expression analyses on equivalence classes; see ADD REFERENCES for relevant examples.

Note that for satuRn there is no real distinction between performing a transcript-level or equivalence class-level analysis. Once a proper input object is provided, with each row corresponding to a sub gene-level feature, satuRn will perform a differential usage analysis regardless of the specific feature type.

Load and wrangle data

Load libraries

library(tximportData)
library(SingleCellExperiment)
library(satuRn)
library(ggplot2)
library(edgeR)
library(limma)
library(fishpond)

Import equivalence classes

For this vignette, we will perform a toy analysis on a subset of the dataset from Hagai et al. The authors performed bulk and single-cell transcriptomics experiments that aimed to characterize the innate immune response’s transcriptional divergence between different species and the variability in expression among cells. Here, we focus on a small subset of one of the single-cell experiments (10X Genomics sequencing protocol), which characterized mononuclear phagocytes in mice under two treatment regimes: unstimulated (unst) and after two hours of stimulation with lipopolysaccharide (LPS2). To reduce the runtime for this vignette, we only quantified the expression in 50 cells of each treatment group, based on the original whitelist (essentially a list of high quality cells as determined by the CellRanger software that is often used to analyze 10X Genomics data) provided by the original authors.

To obtain equivalence class-level counts, we quantified the fastq files from Hagai et al. using the Alevin software. Alevin can take a predetermined whitelist of high quality cells as input, which allows for quantifying only a subset of the cells. More importantly, Alevin allows for directly obtaining equivalence class-level counts by using the -dumpBfh flag. For each data sample, the equivalence-class counts are outputted in a raw format in a bfh.txt file. These files can we read and wrangled into a R-friendly format using the alevinEC function of the fishpond R package.

Note that other pseudo-alignment tools such as Salmon (which is internally used by Alevin) and kallisto are also able to output equivalence class counts.

Here, we read and wrangle the equivalence class counts from the bfh.txt files of the subset of samples from the dataset by Hagai et al. that is included in the tximportData package.

# set path to the bfh.txt files generated by Alevin with the `-dumpBfh` flag
# that are present in the `tximportData` package
dir <- system.file("extdata", package="tximportData")
files <- c(file.path(dir,"alevin/mouse1_unst_50/alevin/bfh.txt"),
           file.path(dir,"alevin/mouse1_LPS2_50/alevin/bfh.txt"))

# Additionally read a dataframe that links transcript identifiers to gene 
# identifiers
tx2gene <- read.table(file.path(dir, "tx2gene_alevin.tsv"),
                      sep = "\t",
                      header = FALSE)
colnames(tx2gene) <- c("isoform_id", "gene_id")

# use `alevinEC` to read and wrangle the equivalence class counts (~10 seconds)
EC_list <- alevinEC(paths = files, 
                    tx2gene = tx2gene, 
                    multigene = FALSE,
                    ignoreTxVersion = FALSE,
                    ignoreAfterBar = FALSE,
                    quiet = TRUE)

The AlevinEC function outputs a list. The counts element of the list is a sparse matrix, where each row corresponds to an equivalence class and each column corresponds to a cell identifier. The _integer at the end of the cell identifier corresponds to the sample (more specifically, the index of the element in the files vector) from which the cell originates.

TCC_mat <- EC_list$counts
dim(TCC_mat)
TCC_mat[1:3,1:3]

Note that the equivalence class names are in the form of integers separated by "|" signs. These integers allow for linking back equivalence classes to their corresponding genes and gene isoforms (transcripts) by using the tx2gene_matched element of the AlevinEC output. For example the first equivalence class, 2183|2187|2188|2190 corresponds to the genes and transcripts in rows 2183, 2187, 2188 and 2190 of the tx2gene_matched element.

tx2gene_matched <- EC_list$tx2gene_matched
dim(tx2gene_matched)
head(tx2gene_matched)

tx2gene_matched[c(2183,2187,2188,2190),]

Construct feature-level information

Next, we construct a dataframe that allows for linking each equivalence class (feature) to its corresponding gene and transcript(s). This can be achieved using the following code.

eccs <- strsplit(rownames(TCC_mat),"\\|",fixed=FALSE)

# link ECs to transcripts
txForEachEC <- lapply(eccs, function(ecc){
  tx2gene_matched$isoform_id[as.integer(ecc)]
})

# link ECs to genes
eccs <- sapply(eccs, function(x) x[1]) # we can simply take the first transcript
# identifier, since we already made sure all transcripts in the equivalence 
# class come from the same gene by by setting multigene = FALSE in the alevin_EC
# function. See the help files of AlevinEC for more information on the function
# arguments
geneForEachEC <- tx2gene_matched$gene_id[as.integer(eccs)]

rd <- data.frame(isoform_id = rownames(TCC_mat),
                 gene_id = geneForEachEC,
                 row.names = rownames(TCC_mat))
rd$tx_id <- txForEachEC
head(rd)

In the isoform_id column, we have a copy of the equivalence class names. Note that the target features (here the equivalence class) must always be in a column named isoform_id for downstream analysis with satuRn.

In the gene_id column, we have the name of the gene to which the equivalence class corresponds. Note that because we imported the data with the AlevinEC function with the multigene option set to FALSE. As such, each equivalence class will be a set of reads that is only compatible with a single gene. In the context of a differential usage analysis, this is the most sensible thing to do; if we would retain reads compatible with multiple genes, it would not be possible to compute their usage within their gene (this would result in multiple usages per feature). Hence, we remove such classes, which in our hands removes approximately 15% of the total reads, which is in line with previous reports.

Finally, the tx_id column displays the set of transcripts with which each equivalence class is compatible. This column is useful for downstream biological interpretation of the equivalence classes.

Construct cell-level information

Here we provide the cell-level information. Cells originated either from the unstimulated or LPS2 treatment group.

cd <- data.frame(cellID = colnames(EC_list$counts),
                 mouse = factor(rep("mouse_1",times=100)),
                 treatment = factor(rep(c("unst", "LPS2"),each=50),
                                    levels = c("unst", "LPS2")),
                 row.names = colnames(EC_list$counts))
head(cd)

Construct SingleCellExperiment object

We combine the counts, feature-level information and cell-level information into a SingleCellExperiment object.

sce <- SingleCellExperiment(assays = list(counts = TCC_mat), 
                            colData = cd, 
                            rowData = rd)
sce

Feature-level filtering

Here we perform some feature-level filtering. For this task, we adopt the filtering criterion that is implemented in the R package edgeR. Alternatively, one could adopt the dmFilter criterion from the DRIMSeq R package, which provides a more stringent filtering when both methods are run in default settings. After filtering, we again remove transcripts that are the only isoform expressed of a certain gene.

# Remove lowly expressed equivalence classes
filter_edgeR <- filterByExpr(assay(sce),
                             min.count = 1,
                             min.total.count = 0,
                             large.n = 20,
                             min.prop = 0)
table(filter_edgeR)

sce <- sce[filter_edgeR,]

# Remove EC that are the only EC expressed of a certain gene
uniqueGene <- which(isUnique(rowData(sce)$gene_id))
sce <- sce[-uniqueGene,]

dim(sce)

satuRn analysis

We here perform a canonical satuRn analysis with equivalence classes as feature type. For amore elaborate description of the different steps, we refer to the main vignette of the satuRn package.

Fit quasi-binomial generalized linear models

sce <- satuRn::fitDTU(
    object = sce,
    formula = ~ 0 + treatment,
    parallel = FALSE,
    BPPARAM = BiocParallel::bpparam(),
    verbose = TRUE
)

Test for differential equivalence class usage

Here we test for differential equivalence class usage between the two treatment groups.

Set up contrast matrix

First, we set up a contrast matrix, which specifies which combination of model parameters corresponds to our research question, i.e., testing differential usage between the treatment groups. This can be done automatically with the makeContrasts function of the limma R package.

group <- sce$treatment
design <- model.matrix(~ 0 + group) # construct design matrix
colnames(design) <- levels(group)

L <- limma::makeContrasts(
    Contrast1 = unst - LPS2,
    levels = design)
L # contrast matrix

Perform the test

Next we can perform differential usage testing using testDTU. We again adopt default settings. For more information on the parameter settings, please consult the help file of the testDTU function.

sce <- satuRn::testDTU(
    object = sce,
    contrasts = L,
    diagplot1 = TRUE,
    diagplot2 = TRUE,
    sort = FALSE,
    forceEmpirical = FALSE
)

When set to TRUE, the diagplot1 and diagplot2 arguments generate diagnostic plots. See our main vignette for a detailed explanation on their interpretation.

The test results are now saved into the rowData of our SummarizedExperiment object under the name fitDTUResult_ followed by the name of the contrast of interest (i.e. the column names of the contrast matrix). The results can be accessed as follows:

head(rowData(sce)[["fitDTUResult_Contrast1"]]) # first and only contrast

See our main vignette for a detailed explanation on the interpretation of the different columns.

Visualize DTU

Top 3 features

Finally, we may visualize the usage of the top 3 differentially used equivalence classes in the different treatment groups. By the setting the transcripts and genes arguments to NULL and specifying top.n = 3, the 3 features with the smallest (empirically correct) false discovery rates are displayed.

group1 <- colnames(sce)[sce$treatment == "unst"]
group2 <- colnames(sce)[sce$treatment == "LPS2"]

plots <- satuRn::plotDTU(
    object = sce,
    contrast = "Contrast1",
    groups = list(group1, group2),
    coefficients = list(c(1, 0), c(0, 1)),
    summaryStat = "model",
    transcripts = NULL,
    genes = NULL,
    top.n = 3)

# to have same layout as in our publication
for (i in seq_along(plots)) {
    current_plot <- plots[[i]] +
        scale_fill_manual(labels = c("unst", "LPS2"), 
                          values = c("royalblue4","firebrick")) +
        scale_x_discrete(labels = c("unst", "LPS2"))
    print(current_plot)
}

Below, we discuss how differential usage between treatment groups of equivalence classes can be interpreted biologically. We discuss this for two genes as examples.

Biological interpretation for gene ENSMUSG00000030560.18

We here plot the differential usage of all the equivalence classes (retained after feature-level filtering) of the gene ENSMUSG00000030560.18. We plot each equivalence class of this gene by specifying genes = "ENSMUSG00000030560.18".

plots <- satuRn::plotDTU(
    object = sce,
    contrast = "Contrast1",
    groups = list(group1, group2),
    coefficients = list(c(1, 0), c(0, 1)),
    summaryStat = "model",
    transcripts = NULL,
    genes = "ENSMUSG00000030560.18",
    top.n = 3)

# to have same layout as in our publication
for (i in seq_along(plots)) {
    current_plot <- plots[[i]] +
        scale_fill_manual(labels = c("unst", "LPS2"), 
                          values = c("royalblue4","firebrick")) +
        scale_x_discrete(labels = c("unst", "LPS2"))
    print(current_plot)
}

We see that there are only two equivalence classes for this gene, 58069 and 58070. In addition, these equivalence classes are of length 1. This means that each equivalence class corresponds only to a single transcript (isoform).

rowData(sce)[c("58069","58070"),"tx_id"]

As such, even though we performed an analysis on the equivalence class level, we here can attribute the observed change in usage to specific transcripts! I.e., we can say that both transcript "ENSMUST00000032779.12" and transcript "ENSMUST00000131108.9" are beeing differentially used between unstimulated and LPS (2 hours) stimulated cells.

Below, we show an example of a geene for which the biological interpretation will be more ambiguous.

Biological interpretation for gene ENSMUSG00000058427.11

plots <- satuRn::plotDTU(
    object = sce,
    contrast = "Contrast1",
    groups = list(group1, group2),
    coefficients = list(c(1, 0), c(0, 1)),
    summaryStat = "model",
    transcripts = NULL,
    genes = "ENSMUSG00000006818.6",
    top.n = 3)

# to have same layout as in our publication
for (i in seq_along(plots)) {
    current_plot <- plots[[i]] +
        scale_fill_manual(labels = c("unst", "LPS2"), 
                          values = c("royalblue4","firebrick")) +
        scale_x_discrete(labels = c("unst", "LPS2"))
    print(current_plot)
}

Gene "ENSMUSG00000058427.11" has 3 equivalence classes: 119653|119657, 119653|119656|119657 and 119653. Only equivalence class 119653|119656|119657 is flagged as significantly differentially used between the two treatment groups. However, when we look at the feature-level information,

rowData(sce)[c("119653", "119653|119657", "119653|119656|119657"),"tx_id"]

we see that this equivalence class contains reads that are compatible with both transcript "ENSMUST00000007012.6" and "ENSMUST00000233897.2". As such, we cannot attribute the observed difference in usage to a specific transcript.

Furthermore, transcript "ENSMUST00000007012.6" is involved in all equivalence classes, and transcript "ENSMUST00000233897.2" is involved in both the 2nd and third equivalence class. Equivalence class 1 and 3 are both not significantly differentially used. As such, among the three transcripts involved in this gene, we did not get a clear picture which of the specific isoforms are differentially used.

Conclusions

In conclusion, we here displayed how to perform a equivalence class-level differential transcript usage analysis using satuRn. The biggest advantage of equivalence classes is that they can even be obtained from sequencing protocols that do not allow for transcript level quantification, which includes many of the droplet-based protocols such as 10X Genomics Chromium. Their biggest drawback is their downstream biological interpretation. In some cases, it will be possible to uniquely link a equivalence class to a specific isoform. However, in most cases, multiple transcripts can be compatible with the equivalence class. In those cases, the interpretation is often limited to recognizing that something is changing internally in the gene, but its unclear which exact isoforms are involved. This, however, still provides sub gene-level information, and may prompt follow-up experiments with a different methodolgy that allows for pinpointing the exact transcript-level change.

Session info

sessionInfo()

References



statOmics/satuRn documentation built on March 9, 2023, 3:43 p.m.