We start by loading the libraries and functions we will need:
library("tidyverse") library("knitr") source("~/Dropbox (Partners HealthCare)/replicates_ASE/code/ASE/R/ASE_functions.R") source("~/Dropbox (Partners HealthCare)/replicates_ASE/code/ASE/R/PerformAIAnalysis_CC.R")
To demonstrate how the pipeline works, we are going to use data from RNA-seq experiment aimed to study effect of 5aza treatment on expression and allele-specific expression. Let's look at the data first:
5 samples of 4.11 Abelson clones:
name | description | number of repicates ------------- | ------------- | ------------- control | Untreated Day 0 | 2 DMSO | DMSO control Day 2 | 2 low | low 5aza treatment of 0.2 µM | 2 medium | medium 5aza treatment of 2 µM | 5 high | high 5aza treatment of 10 µM | 2
The first step is to create design matrix describing the experiment:
experimentNames <- c("control","DMSO","low","medium","high") techReps <- c(2,2,2,5,2) designMatrix <- BuildDesign(experimentNames, techReps)
Now we can look at the data. Here is the geneCountTab dataframe with counts for all genes, all replicates and conditions:
inTabs <- "../../../data/5aza/pr_20180714_ISEKI_processed_gene_extended2.txt" geneCountTab <- GetGatkPipelineTabs(inTabs, designMatrix$techReps, multiple = F) head(geneCountTab)
We can calculate allelic imbalances for all genes for all experimental conditions pooling technical replicates: $$AI=\frac{maternal\ counts}{maternal\ counts + paternal\ counts}$$
aiTable <- do.call(cbind, lapply(1:length(designMatrix$techReps), function(x){ round(CountsToAI(geneCountTab, reps = unlist(designMatrix$replicateNums[x]),thr=10)$AI,3) })) aiTable <- data.frame(geneCountTab$ensembl_gene_id, aiTable) colnames(aiTable) <- c("ensembl_gene_id", designMatrix$experimentNames) head(aiTable)
Let's visualize AI correlations, for example between AI estimates for DMSO (experiment 2) and high 5aza treatement(experiment 5):
ggplot(aiTable, aes_string(x=as.name(designMatrix$experimentNames[2]), y=as.name(designMatrix$experimentNames[5]))) + geom_point(size=0.5) + theme_bw() + coord_fixed()
We can also make MA plots if we load log2Fold changes from DESeq2 analysis:
DE_SG3_SG2 <- read_csv("~/Dropbox (Partners HealthCare)/MAE screen/DE_analysis/DE_SG3_SG2.txt") ai_plus_DE_low_DMSO <- merge(aiTable[,c(1,3,4)], DE_SG3_SG2[,c(1:3)], by.x="ensembl_gene_id", by.y="gene_id") ai_plus_DE_low_DMSO$aiDiff <- ai_plus_DE_low_DMSO$low - ai_plus_DE_low_DMSO$DMSO ai_plus_DE_low_DMSO <- ai_plus_DE_low_DMSO[!is.na(ai_plus_DE_low_DMSO$DMSO),] ai_plus_DE_low_DMSO <- ai_plus_DE_low_DMSO[!is.na(ai_plus_DE_low_DMSO$low),] ApplyQuintiles <- function(x) { cut(x, breaks=seq(0, 0.6, by = 0.10), labels=c("<0.1","0.1-0.2","0.2-0.3","0.3-0.4","0.4-0.5", ">0.5"), include.lowest=TRUE) } ai_plus_DE_low_DMSO$aiDiff_q <- sapply(abs(ai_plus_DE_low_DMSO$aiDiff), ApplyQuintiles) table(ai_plus_DE_low_DMSO$aiDiff_q) ai_thr <- 0.1 ai_plus_DE_low_DMSO$ai_col <- ifelse(ai_plus_DE_low_DMSO$aiDiff > ai_thr, "red", "gray") ggplot(ai_plus_DE_low_DMSO, aes(x=aiDiff, y=log2FoldChange)) + geom_point(size=0.5) + xlim(c(-0.5, 0.5)) ggplot(ai_plus_DE_low_DMSO, aes(x=baseMean, y=log2FoldChange, col=ai_col)) + geom_point(size=0.5) + scale_x_continuous(trans="log10") + scale_color_manual(values=ai_plus_DE_low_DMSO$ai_col) + theme_bw()
Let's compare conditions DMSO (experiment 2) and high 5aza treatment (experiment 5). We will construct 95%-CIs around AIs and get the resulting classification finding the genes demonstrating difference in AI between conditions (based on non-overlapping CIs).
First we get 95% CIs:
CIs_DMSO <- PerformBinTestAIAnalysisForConditionNPoint(geneCountTab, unlist(designMatrix$replicateNums[2]), Q=0.95, EPS=1.3, thr=0) CIs_high <- PerformBinTestAIAnalysisForConditionNPoint(geneCountTab, unlist(designMatrix$replicateNums[5]), Q=0.95, EPS=1.3, thr=0) CIs <- merge(CIs_DMSO$Output[,c("ID", "AI", "BT_CIleft_CC","BT_CIright_CC")], CIs_high$Output[,c("ID", "AI", "BT_CIleft_CC","BT_CIright_CC")], by="ID") colnames(CIs) <- c("ID", "AI_DMSO", "CI_left_DMSO", "CI_right_DMSO", "AI_high", "CI_left_high", "CI_right_high") head(CIs)
Now let's run the differential analysis, using coverage threshold=40 and Bonferroni p-value correction. We will define a gene as demonstrating differential allelic imbalance between two conditions if this gene has non-intersecting CIs of allelic imbalance estimates in these conditions and if the difference between these allelic imbalance estimates is bigger than some cutoff (set to 0.1 here). The cutoff value is set arbitary depending on the biological question we are asking.
thr <- 40 minDifference <- 0.1 DiffAI_DMSO_high <- PerformBinTestAIAnalysisForTwoConditions_knownCC(geneCountTab, vect1CondReps = unlist(designMatrix$replicateNums[2]), vect2CondReps = unlist(designMatrix$replicateNums[5]), vect1CondRepsCombsCC=CIs_DMSO$CC, vect2CondRepsCombsCC=CIs_high$CC, Q=0.95, thr=thr, minDifference = minDifference ) table(DiffAI_DMSO_high$BT_CC_thrDiff)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.