In tangchao7498/ICAS:

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE, fig.align = 'center')
knitr::opts_knit$set(root.dir = "/mnt/raid61/Personal_data/tangchao/40TCellsQTL4")

\pagebreak

For post-transcriptional splicing analysis, an important question is the quantification and statistical inference of systematic changes between conditions, as compared to within-condition variability. We have provide an R package named ICAS to test for differential splicing by use of beta binomial generalized linear models; the estimates of dispersion and delta changes incorporate data-driven prior distributions. In this tutorial, we have shown how to use ICAS to quantification intron-centric PSI (percentage of splice in) of each alternative spliced intron from STAR outputed SJ.out.tab files and user prepared read count matrix of splice junctions. In addition, we have also shown that how to use ICAS to get differential spliced introns from user-defined groups. Finally, we have provided many visualization functions in ICAS.

Citing the ICAS package

The ICAS package contains many differential splicing analysis and genomic features visualization methods which are described across multiple manuscripts, by different non-overlapping authors. This makes citing the package a bit difficult, so here is a guide:

If you are using bimod (Likelihood-ratio test) for differential splicing analysis, please cite [@McDavid_2012].
If you are using tobit (Tobit-test) for differential splicing analysis, please cite[@Trapnell_2014], and it would be considerate also to cite [@Tobin_1958], since this publication described Tobit-test first time.
If you are using MAST, you should cite the original publication [@Finak_2015].
If you are using BB (betabinomial nodel) for differential splicing analysis, please cite [@Chen_2014].
If you are using wilcox, roc, t, poisson, or negbinom for differential splicing analysis, you should cite [@Stuart_2019].
If you are using function plotgene() or GroupHeatSpot() for gene structure visualization, please cite [@Yin_2012].

If you're using ICAS in a publication, you can get the citation by

citation("ICAS")

Installation

Your can install ICAS from GitHub https://github.com/tangchao7498/ICAS.

library(devtools)
devtools::install_github("tangchao7498/ICAS")

Other packages needed for this guideline:

library(ICAS)
library(data.table)
library(ggplot2)
library(ggpubr)

Data preparation

We need meta information to describe the samples used.

meta <- read.table(file = "./analysis/12.ICASTutorial/Meta.txt", 
                   header = TRUE, 
                   sep = "\t", 
                   stringsAsFactors = FALSE)
knitr::kable(meta)

In this tutorial we want to select the differential spliced introns between megakaryocytes (MK) and erythroblasts (EB), we first need to convert the CELL_TYPE to facotrs.

meta$CELL_TYPE <- factor(meta$CELL_TYPE)

Make ICAS object

Make ICAS object from STAR SJ.out.tab files

ICAS has a function ICASDataSetFromSJFile that is fully compatible with STAR outputed SJ.out.tab files.

SJFiles <- paste0("./analysis/11.DifferentialSplicing/data/", 
                  row.names(meta), 
                  ".SJ.out.tab")
stopifnot(all(file.exists(SJFiles)))
SJFiles

MyObj <- ICASDataSetFromSJFile(SJFiles = SJFiles, 
                               postfix = ".SJ.out.tab", 
                               colData = meta, 
                               uniqMapOnly = TRUE,
                               annotationOnly = FALSE,
                               minSJRowSum = 50,
                               minSJRead = 2,
                               minOverhang = 3,
                               design = "CELL_TYPE")
MyObj

Make ICAS object from matrix

Alternatively, the function ICASDataSetFromMatrix can be used if you already have a matrix of read counts prepared from another source. To use ICASDataSetFromMatrix, the user should provide the counts matrix, the information about the samples (the columns of the count matrix) as a DataFrame or data.frame, and the design factor.

CountTab <- counts(MyObj)
CountTab[1:4,1:4]

MyObj <- ICASDataSetFromMatrix(countData = CountTab, 
                               colData = meta, 
                               design = "CELL_TYPE")
MyObj

Calculate PSI

For complex splice sites (same start/end junction > 2), we need the minimum minor-junction's fraction not less than 0.01. And we only calculate PSI of introns with at least 10 junction reads of a same splice site.

MyObj <- PSICalculater(object = MyObj, 
                       MMJF = 0.01, 
                       MinSumSpliceSite = 10)
psi(MyObj)[1:4, 1:5]

Mean variance plot

The figure below plots the standard deviation of the PSI, across samples, against the mean.

vsn::meanSdPlot(psi(MyObj))

Heatmap of the sample-to-sample distances

One of the use of the PSI data is sample clustering. Here, we apply the get_dist() function from package factoextra to the transpose of the PSI matrix to get sample-to-sample distances.
A heatmap of this distance matrix gives us an overview over similarities and dissimilarities between samples. We have to provide a hierarchical clustering hc to the heatmap function based on the sample distances, or else the heatmap function would calculate a clustering based on the distances between the rows/columns of the distance matrix.

library(RColorBrewer)
library(pheatmap)
sampleDists <- factoextra::get_dist(t(psi(MyObj)))
sampleDistMatrix <- as.matrix(sampleDists)
rownames(sampleDistMatrix) <- paste0(MyObj@colData$CELL_TYPE, "-", MyObj@colData$MOLECULE)
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette(rev(brewer.pal(9, "Blues")))(255)
pheatmap(sampleDistMatrix,
         clustering_distance_rows = sampleDists,
         clustering_distance_cols = sampleDists,
         col = colors)

For confounder confused dataset, we can use batch correction algorithm ComBat to remove the batch effect.

library(sva)
batch <- as.factor(MyObj@colData$MOLECULE)
ExpMat <- sva::ComBat(dat = as.matrix(na.omit(psi(MyObj))), 
                      batch = batch, 
                      prior.plots = FALSE, 
                      par.prior = TRUE, 
                      mean.only = FALSE)

sampleDists <- factoextra::get_dist(t(ExpMat))
sampleDistMatrix <- as.matrix(sampleDists)
rownames(sampleDistMatrix) <- paste0(MyObj@colData$CELL_TYPE, "-", MyObj@colData$MOLECULE)
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette(rev(brewer.pal(9, "Blues")))(255)
pheatmap(sampleDistMatrix,
         clustering_distance_rows = sampleDists,
         clustering_distance_cols = sampleDists,
         col = colors)

Host gene of all alternative splicing splice junction

Identify host gene of alternative splicing splice junction (ASSJ), but if you run HostGeneIdentify before the PSICalculater, you can identify host gene of all SJs.

load(file = "/mnt/raid61/Personal_data/tangchao/40TCellsQTL4/analysis/12.ICASTutorial/MyObj.RData")

gtfFile <- list.files("/mnt/raid61/Science_2014_chen/align/STAR_index/", 
                      "Homo_sapiens.GRCh38.87.sorted.gtf$", 
                      full.names = TRUE)
stopifnot(file.exists(gtfFile))

MyObj <- HostGeneIdentify(object = MyObj, 
                          gtfFile = gtfFile, 
                          NT = 5)

MyObj@HostGene

Alternative splicing type

According GTF file, we can assign all ASSJ to different AS types.

MyObj <- ASTypeOfASSJ(object = MyObj, 
                      gtfFile = gtfFile, 
                      NT = 5)

MyObj@ASType

ASType_Freq <- as.data.frame.table(table(MyObj@ASType$AS))
ggplot(ASType_Freq, aes(x = Var1, y = Freq)) + 
  geom_col() + 
  geom_text(aes(y = Freq + max(Freq) * .05, label = Freq)) +
  labs(x = "Category", y = "Number") + 
  theme_classic()

Differential splicing

We have provided 11 different methods for differential splicing analysis. The detail can be found by ?FindMarkers

DE1 <- FindAllMarkers(object = MyObj, 
                      only.pos = T, 
                      delta.threshold = 0, 
                      return.thresh = 1, 
                      test.use = "tobit", 
                      print.bar = FALSE, 
                      NT = 20)
setDT(DE1)
head(DE1)

Histogram of differentia splicing test P value

hist(DE1$p_val, main = "Histogram of P value", xlab = "P value")

PSI of differentia splicing intron

toplot1 <- DE1[deltaPSI > 0.15 & p_val < 0.01 & pct.1 == 1 & pct.2 == 1, gene]
PSIHeatmap(object = MyObj, SJs = toplot1, main = paste0("DS introns ",  length(toplot1)))

Differential splicing with confounder

We can use a beta-binomial generalized linear model for differential splicing analysis with confounder.

DE2 <- FindAllMarkers(object = MyObj, 
                      only.pos = T, 
                      delta.threshold = 0.15, 
                      return.thresh = 1, 
                      test.use = "BB", 
                      print.bar = FALSE, 
                      NT = 20, 
                      confounder = "MOLECULE")
setDT(DE2)
head(DE2)

toplot2 <- DE2[deltaPSI > 0.15 & p_val < 0.01 & pct.1 == 1 & pct.2 == 1, gene]
PSIHeatmap(object = MyObj, SJs = toplot2, main = paste0("DS introns ",  length(toplot2)))

library(ggvenn)
ggvenn::ggvenn(data = list(DE1 = toplot1, DE2 = toplot2))

AS-type of differential splicing intron

We can use function SJEvent to get the AS-type of ASSJ.

DSEvent <- SJEvent(object = MyObj, SJ = toplot2, HostGene = TRUE)
table(DSEvent$AS)

head(subset(DSEvent, AS == "A3SS"))

Box plot of differential splicing intron

GroupBoxplot(object = MyObj, SJ = toplot2[7])

Visualization of splice event

In this section, we demonstrate the visualization for given splice event.
First we need corresponding bam files for each sample and a TxDb or GRanges object from corresponding GTF file.

library(GenomicFeatures)
BamFiles <- gsub("SJ.out.tab", "Aligned.sortedByCoord.out.bam", SJFiles)
txdb <- suppressWarnings(
  suppressMessages(
    GenomicFeatures::makeTxDbFromGFF(file = gtfFile, format = "gtf")
  )
)

Using function plotgene, user can plot the relative event, host gene, AS-type and the annotated genetic features around the splice event.

plotgene(object = MyObj, SJ = toplot2[7], flank = 1000, geneModul = txdb)

The function GroupHeatSpot can reveal the relative abundance and the P value of statistical tests (Wald-test) of mapped reads over the splice event and flanking regions at base resolution.

GroupHeatSpot(object = MyObj, 
              SJ = toplot2[7], 
              low.col = "white", 
              BamFiles = BamFiles, 
              geneModul = txdb, 
              logTrans = FALSE, 
              ignore.SEorPE = TRUE, 
              BamPostfix = ".Aligned.sortedByCoord.out.bam", 
              flank = 50, 
              reduce = FALSE, 
              rel_heights = c(1, 1.5, 1.2, 2), 
              MinimumFractionForP = 0.1, 
              curvature = - 0.05,
              shrinkage = 1.1)

And, user can selectively display certain panels.

GroupHeatSpot(object = MyObj, 
              SJ = toplot2[7], 
              low.col = "white", 
              BamFiles = BamFiles, 
              geneModul = txdb, 
              logTrans = FALSE, 
              ignore.SEorPE = TRUE, 
              BamPostfix = ".Aligned.sortedByCoord.out.bam", 
              flank = 50, 
              reduce = FALSE, 
              rel_heights = c(0, 1.5, 1.2, 2), 
              MinimumFractionForP = 0.1, 
              curvature = - 0.05,
              shrinkage = 1.1)