knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE, fig.align = 'center') knitr::opts_knit$set(root.dir = "/mnt/raid61/Personal_data/tangchao/40TCellsQTL4")
\pagebreak
For post-transcriptional splicing analysis, an important question is the quantification and statistical inference of systematic changes between conditions, as compared to within-condition variability. We have provide an R package named ICAS to test for differential splicing by use of beta binomial generalized linear models; the estimates of dispersion and delta changes incorporate data-driven prior distributions. In this tutorial, we have shown how to use ICAS
to quantification intron-centric PSI (percentage of splice in) of each alternative spliced intron from STAR outputed SJ.out.tab
files and user prepared read count matrix of splice junctions. In addition, we have also shown that how to use ICAS
to get differential spliced introns from user-defined groups. Finally, we have provided many visualization functions in ICAS.
The ICAS package contains many differential splicing analysis and genomic features visualization methods which are described across multiple manuscripts, by different non-overlapping authors. This makes citing the package a bit difficult, so here is a guide:
bimod
(Likelihood-ratio test) for differential splicing analysis, please cite [@McDavid_2012]. tobit
(Tobit-test) for differential splicing analysis, please cite[@Trapnell_2014], and it would be considerate also to cite [@Tobin_1958], since this publication described Tobit-test first time.MAST
, you should cite the original publication [@Finak_2015].BB
(betabinomial nodel) for differential splicing analysis, please cite [@Chen_2014].wilcox
, roc
, t
, poisson
, or negbinom
for differential splicing analysis, you should cite [@Stuart_2019].plotgene()
or GroupHeatSpot()
for gene structure visualization, please cite [@Yin_2012].If you're using ICAS in a publication, you can get the citation by
citation("ICAS")
Your can install ICAS
from GitHub https://github.com/tangchao7498/ICAS.
library(devtools) devtools::install_github("tangchao7498/ICAS")
Other packages needed for this guideline:
library(ICAS) library(data.table) library(ggplot2) library(ggpubr)
We need meta information to describe the samples used.
meta <- read.table(file = "./analysis/12.ICASTutorial/Meta.txt", header = TRUE, sep = "\t", stringsAsFactors = FALSE) knitr::kable(meta)
In this tutorial we want to select the differential spliced introns between megakaryocytes (MK) and erythroblasts (EB), we first need to convert the CELL_TYPE to facotrs.
meta$CELL_TYPE <- factor(meta$CELL_TYPE)
ICAS
has a function ICASDataSetFromSJFile
that is fully compatible with STAR outputed SJ.out.tab
files.
SJFiles <- paste0("./analysis/11.DifferentialSplicing/data/", row.names(meta), ".SJ.out.tab") stopifnot(all(file.exists(SJFiles))) SJFiles
MyObj <- ICASDataSetFromSJFile(SJFiles = SJFiles, postfix = ".SJ.out.tab", colData = meta, uniqMapOnly = TRUE, annotationOnly = FALSE, minSJRowSum = 50, minSJRead = 2, minOverhang = 3, design = "CELL_TYPE") MyObj
Alternatively, the function ICASDataSetFromMatrix
can be used if you already have a matrix of read counts prepared from another source. To use ICASDataSetFromMatrix
, the user should provide the counts matrix, the information about the samples (the columns of the count matrix) as a DataFrame or data.frame, and the design factor.
CountTab <- counts(MyObj) CountTab[1:4,1:4]
MyObj <- ICASDataSetFromMatrix(countData = CountTab, colData = meta, design = "CELL_TYPE") MyObj
For complex splice sites (same start/end junction > 2), we need the minimum minor-junction's fraction not less than 0.01. And we only calculate PSI of introns with at least 10 junction reads of a same splice site.
MyObj <- PSICalculater(object = MyObj, MMJF = 0.01, MinSumSpliceSite = 10) psi(MyObj)[1:4, 1:5]
The figure below plots the standard deviation of the PSI, across samples, against the mean.
vsn::meanSdPlot(psi(MyObj))
One of the use of the PSI data is sample clustering. Here, we apply the get_dist()
function from package factoextra
to the transpose of the PSI matrix to get sample-to-sample distances.
A heatmap of this distance matrix gives us an overview over similarities and dissimilarities between samples. We have to provide a hierarchical clustering hc
to the heatmap function based on the sample distances, or else the heatmap function would calculate a clustering based on the distances between the rows/columns of the distance matrix.
library(RColorBrewer) library(pheatmap) sampleDists <- factoextra::get_dist(t(psi(MyObj))) sampleDistMatrix <- as.matrix(sampleDists) rownames(sampleDistMatrix) <- paste0(MyObj@colData$CELL_TYPE, "-", MyObj@colData$MOLECULE) colnames(sampleDistMatrix) <- NULL colors <- colorRampPalette(rev(brewer.pal(9, "Blues")))(255) pheatmap(sampleDistMatrix, clustering_distance_rows = sampleDists, clustering_distance_cols = sampleDists, col = colors)
For confounder confused dataset, we can use batch correction algorithm ComBat
to remove the batch effect.
library(sva) batch <- as.factor(MyObj@colData$MOLECULE) ExpMat <- sva::ComBat(dat = as.matrix(na.omit(psi(MyObj))), batch = batch, prior.plots = FALSE, par.prior = TRUE, mean.only = FALSE)
sampleDists <- factoextra::get_dist(t(ExpMat)) sampleDistMatrix <- as.matrix(sampleDists) rownames(sampleDistMatrix) <- paste0(MyObj@colData$CELL_TYPE, "-", MyObj@colData$MOLECULE) colnames(sampleDistMatrix) <- NULL colors <- colorRampPalette(rev(brewer.pal(9, "Blues")))(255) pheatmap(sampleDistMatrix, clustering_distance_rows = sampleDists, clustering_distance_cols = sampleDists, col = colors)
Identify host gene of alternative splicing splice junction (ASSJ), but if you run HostGeneIdentify before the PSICalculater, you can identify host gene of all SJs.
load(file = "/mnt/raid61/Personal_data/tangchao/40TCellsQTL4/analysis/12.ICASTutorial/MyObj.RData")
gtfFile <- list.files("/mnt/raid61/Science_2014_chen/align/STAR_index/", "Homo_sapiens.GRCh38.87.sorted.gtf$", full.names = TRUE) stopifnot(file.exists(gtfFile))
MyObj <- HostGeneIdentify(object = MyObj, gtfFile = gtfFile, NT = 5)
MyObj@HostGene
According GTF file, we can assign all ASSJ to different AS types.
MyObj <- ASTypeOfASSJ(object = MyObj, gtfFile = gtfFile, NT = 5)
MyObj@ASType
ASType_Freq <- as.data.frame.table(table(MyObj@ASType$AS)) ggplot(ASType_Freq, aes(x = Var1, y = Freq)) + geom_col() + geom_text(aes(y = Freq + max(Freq) * .05, label = Freq)) + labs(x = "Category", y = "Number") + theme_classic()
We have provided 11 different methods for differential splicing analysis. The detail can be found by ?FindMarkers
DE1 <- FindAllMarkers(object = MyObj, only.pos = T, delta.threshold = 0, return.thresh = 1, test.use = "tobit", print.bar = FALSE, NT = 20) setDT(DE1) head(DE1)
hist(DE1$p_val, main = "Histogram of P value", xlab = "P value")
toplot1 <- DE1[deltaPSI > 0.15 & p_val < 0.01 & pct.1 == 1 & pct.2 == 1, gene] PSIHeatmap(object = MyObj, SJs = toplot1, main = paste0("DS introns ", length(toplot1)))
We can use a beta-binomial generalized linear model for differential splicing analysis with confounder.
DE2 <- FindAllMarkers(object = MyObj, only.pos = T, delta.threshold = 0.15, return.thresh = 1, test.use = "BB", print.bar = FALSE, NT = 20, confounder = "MOLECULE") setDT(DE2) head(DE2)
toplot2 <- DE2[deltaPSI > 0.15 & p_val < 0.01 & pct.1 == 1 & pct.2 == 1, gene] PSIHeatmap(object = MyObj, SJs = toplot2, main = paste0("DS introns ", length(toplot2)))
library(ggvenn) ggvenn::ggvenn(data = list(DE1 = toplot1, DE2 = toplot2))
We can use function SJEvent
to get the AS-type of ASSJ.
DSEvent <- SJEvent(object = MyObj, SJ = toplot2, HostGene = TRUE) table(DSEvent$AS)
head(subset(DSEvent, AS == "A3SS"))
GroupBoxplot(object = MyObj, SJ = toplot2[7])
In this section, we demonstrate the visualization for given splice event.
First we need corresponding bam files for each sample and a TxDb or GRanges object from corresponding GTF file.
library(GenomicFeatures) BamFiles <- gsub("SJ.out.tab", "Aligned.sortedByCoord.out.bam", SJFiles) txdb <- suppressWarnings( suppressMessages( GenomicFeatures::makeTxDbFromGFF(file = gtfFile, format = "gtf") ) )
Using function plotgene
, user can plot the relative event, host gene, AS-type and the annotated genetic features around the splice event.
plotgene(object = MyObj, SJ = toplot2[7], flank = 1000, geneModul = txdb)
The function GroupHeatSpot
can reveal the relative abundance and the P value of statistical tests (Wald-test) of mapped reads over the splice event and flanking regions at base resolution.
GroupHeatSpot(object = MyObj, SJ = toplot2[7], low.col = "white", BamFiles = BamFiles, geneModul = txdb, logTrans = FALSE, ignore.SEorPE = TRUE, BamPostfix = ".Aligned.sortedByCoord.out.bam", flank = 50, reduce = FALSE, rel_heights = c(1, 1.5, 1.2, 2), MinimumFractionForP = 0.1, curvature = - 0.05, shrinkage = 1.1)
And, user can selectively display certain panels.
GroupHeatSpot(object = MyObj, SJ = toplot2[7], low.col = "white", BamFiles = BamFiles, geneModul = txdb, logTrans = FALSE, ignore.SEorPE = TRUE, BamPostfix = ".Aligned.sortedByCoord.out.bam", flank = 50, reduce = FALSE, rel_heights = c(0, 1.5, 1.2, 2), MinimumFractionForP = 0.1, curvature = - 0.05, shrinkage = 1.1)
toLatex(sessionInfo())
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.