DE analysis (by DESeq2)

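# Load the experimental design and the RSEM counts, keeping only non-coding transcripts
design.table <- fread(type.list$Design, header = TRUE)
des.count <- fread(type.list$RSEM)[Type!="protein_coding",]
# The first two columns hold the transcript ID and type; the rest are per-sample counts
rna.id <- des.count[,1]
rna.type <- des.count[,2]
# DESeq2 expects integer counts, so round the values before building the dataset
des.count <- as.data.frame(des.count[, round(.SD), .SDcols = -(1:2), with = TRUE])
rownames(des.count) <- rna.id[[1]]
# The second column of the design table is assumed to be named "condition"
coldata <- as.data.frame(design.table[,2])
rownames(coldata) <- names(des.count)
dds <- suppressWarnings(DESeq2::DESeqDataSetFromMatrix(countData = des.count,
                              colData = coldata,
                              design = ~ condition))
# Drop transcripts with fewer than 10 reads in total across all samples
keep <- rowSums(DESeq2::counts(dds)) >= 10
dds <- dds[keep,]
dds <- DESeq2::DESeq(dds)
# Shrink log2 fold changes for the second model coefficient (the condition effect)
resLFC <- DESeq2::lfcShrink(dds, coef=2)
resOrdered <- resLFC[order(resLFC$pvalue),]
# Transformed counts used by the heatmaps and the PCA
vsd <- DESeq2::vst(dds, blind=FALSE)
ntd <- DESeq2::normTransform(dds)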

Column {.tabset}

MA-plot

DESeq2::plotMA(resLFC, ylim=c(-2,2))

Heatmap

select <- order(rowMeans(DESeq2::counts(dds,normalized=TRUE)),
                decreasing=TRUE)[1:20]
# df <- as.data.frame(SummarizedExperiment::colData(dds)[,"condition"])
heatmaply::heatmaply(SummarizedExperiment::assay(ntd)[select,], scale = 'row', xlab = "Sample", margins = c(60,100,40,20), row_text_angle = 45, column_text_angle = 60) %>% layout(margin = list(l = 100, b = 50, r = 0))

Correlation heatmap

heatmaply::heatmaply(cor(SummarizedExperiment::assay(ntd)[select,]), margins = c(40, 40),
          k_col = 2, k_row = 2,
          limits = c(-1,1)) %>% layout(margin = list(l = 50, b = 50))

Principal Component Analysis

pcaData <- DESeq2::plotPCA(vsd, intgroup="condition", returnData=TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
p <- ggplot(pcaData, aes(PC1, PC2, color=condition)) +
  geom_point(size=3) +
  xlab(paste0("PC1: ",percentVar[1],"% variance")) +
  ylab(paste0("PC2: ",percentVar[2],"% variance")) + 
  coord_fixed()
save_plot('pca.tiff', p, base_height = 8.5, base_width = 11, dpi = 300, compression = 'lzw')
save_plot('pca.pdf', p, base_height = 8.5, base_width = 11, dpi = 300)
ggplotly(p)

Column

Description

Title: Differential expression analysis of novel and known lncRNA from LncPipe using DESeq2

DESeq pools information across transcripts with similar expression values. To estimate the over-dispersion parameter in each condition, DESeq assumes that over-dispersion depends on the mean read count. Differentially expressed genes are then identified by an exact test that compares the fitted distribution parameters between the two conditions. This project uses the updated version of DESeq, DESeq2 (1.6.2).

What LncPipeReporter does: The Design.matrix file and the read count matrix are fed into DESeq2, an R package that tests for differential expression using a model based on the negative binomial distribution. DESeq2 analysis helps identify differentially expressed genes from high-throughput sequencing assays. Before testing, the raw read count matrix is filtered according to the user-defined parameter min.expressed.sample: raw read counts are first normalized into a CPM (Counts Per Million reads) matrix, and genes with CPM > 1 in more than min.expressed.sample samples are retained for further analysis. Analyses with other experimental designs can be performed separately using the kallisto.count.txt file generated by LncPipe. This file can also be imported into IDEA, a separate tool for comprehensive differential expression analysis from an expression matrix.
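The CPM filter described above can be sketched in base R as follows. This is only an illustration under stated assumptions: the toy count matrix and the threshold value are made up, and the variable names (`counts`, `cpm`, `keep`, `filtered`) are hypothetical rather than taken from the LncPipeReporter source.

```r
# Toy raw count matrix (made-up values): rows are transcripts, columns are samples.
counts <- matrix(c( 500, 600,  550, 480,
                      0,   1,    0,   2,
                     30,   0,   45,   0,
                   1000, 900, 1100, 950),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("lnc", 1:4), paste0("S", 1:4)))

# Assumed value for the user-defined parameter.
min.expressed.sample <- 2

# Counts per million: divide each column by its library size, then scale by 1e6.
lib.size <- colSums(counts)
cpm <- t(t(counts) / lib.size) * 1e6

# Keep transcripts with CPM > 1 in more than min.expressed.sample samples.
keep <- rowSums(cpm > 1) > min.expressed.sample
filtered <- counts[keep, , drop = FALSE]
print(rownames(filtered))
```

With such small toy libraries almost any nonzero count exceeds 1 CPM, so the filter here effectively counts samples with nonzero reads; on real libraries with millions of reads the threshold is more selective.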

Note: The current version of LncPipeReporter supports only standard two-condition comparisons. Experiments without replicates are not supported at the moment. The R scripts for differential expression analysis can be freely checked here

Description of the plots:

MA plot: An MA plot gives a quick overview of the distribution of the data. The log2-transformed fold change is plotted on the y-axis and the average count (normalized by size factors) on the x-axis. Genes with an adjusted p-value below the FDR threshold are colored red, while all other genes are colored black.
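To make the two axes concrete, here is a minimal base R sketch that computes median-of-ratios size factors, the mean of normalized counts, and a plain log2 fold change on made-up data. It is an assumption-laden illustration, not the report's code: the report plots shrunken fold changes returned by `DESeq2::lfcShrink`, whereas this sketch uses unshrunken ratios, and all variable names are hypothetical.

```r
# Toy counts (made-up): two samples per condition, columns A1, A2, B1, B2.
counts <- matrix(c( 10,  20,  10,  20,
                   100, 200, 100, 200,
                    50, 100, 200, 400,
                    30,  60,  30,  60),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("g", 1:4), c("A1", "A2", "B1", "B2")))
condition <- c("A", "A", "B", "B")

# Median-of-ratios size factors: compare each sample to the per-gene geometric mean.
log.geo.mean <- rowMeans(log(counts))
size.factor <- apply(counts, 2, function(k) {
  ok <- is.finite(log.geo.mean) & k > 0
  exp(median(log(k[ok]) - log.geo.mean[ok]))
})

# Normalize by size factor, then compute the two MA-plot coordinates.
norm.counts <- t(t(counts) / size.factor)
base.mean <- rowMeans(norm.counts)                      # x-axis
lfc <- log2(rowMeans(norm.counts[, condition == "B"]) /
            rowMeans(norm.counts[, condition == "A"]))  # y-axis (unshrunken)

plot(base.mean, lfc, log = "x",
     xlab = "mean of normalized counts", ylab = "log2 fold change")
abline(h = 0, col = "grey")
```

In this toy matrix only `g3` differs between conditions, so it is the only point that lies away from the zero line.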

PCA analysis: Principal Component Analysis (PCA) of all expressed lncRNAs. The analysis is performed on all expressed lncRNAs, and the first two principal components are plotted on the X and Y axes of a scatter plot. Points are colored according to the conditions in the experimental design. In theory, samples from the same library should lie closer to each other than to the other samples. This plot can be used to check the variance introduced by biological/technical replicates or treatment conditions. It can also be regarded as an unsupervised clustering method.
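The idea behind the PCA plot can be sketched with base R's `prcomp`. This is a simplified illustration on simulated data, not the report's implementation (the report calls `DESeq2::plotPCA` on variance-stabilized counts); the matrix `expr`, the sample names, and the size of the simulated condition effect are all assumptions.

```r
set.seed(1)  # toy data only

# Made-up transformed expression matrix: 50 transcripts x 6 samples.
expr <- matrix(rnorm(50 * 6), nrow = 50,
               dimnames = list(paste0("lnc", 1:50),
                               c(paste0("ctrl", 1:3), paste0("treat", 1:3))))
condition <- rep(c("control", "treated"), each = 3)

# Shift the first ten transcripts in the treated samples to mimic a condition effect.
expr[1:10, condition == "treated"] <- expr[1:10, condition == "treated"] + 3

# PCA treats samples as observations, so transpose to samples x transcripts.
pca <- prcomp(t(expr))
percent.var <- pca$sdev^2 / sum(pca$sdev^2)

pca.data <- data.frame(PC1 = pca$x[, 1], PC2 = pca$x[, 2], condition = condition)

plot(pca.data$PC1, pca.data$PC2,
     col = as.integer(factor(pca.data$condition)), pch = 19,
     xlab = paste0("PC1: ", round(100 * percent.var[1]), "% variance"),
     ylab = paste0("PC2: ", round(100 * percent.var[2]), "% variance"))
```

The axis labels report the share of total variance captured by each component, which is the same quantity the report shows in its PCA axis titles.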

Heatmap: Heatmap of the correlation matrix. All expressed lncRNAs are included, and unsupervised hierarchical clustering is applied to group samples based on the correlation matrix. The clustering is displayed as a dendrogram.
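One common way to cluster samples from a correlation matrix is to convert correlation into a distance (here `1 - r`) and pass it to `hclust`. The base R sketch below shows this on made-up profiles. The distance definition, the linkage method, and all variable names are assumptions for illustration; `heatmaply` applies its own clustering defaults, which may differ.

```r
# Made-up expression profiles (rows: transcripts, columns: samples).
up   <- c(1, 2, 3, 4, 5, 6)
down <- rev(up)
expr <- cbind(A1 = up   + c(0.1, 0, -0.1, 0, 0.1, 0),
              A2 = up   + c(0, 0.1, 0, -0.1, 0, 0.1),
              B1 = down + c(0.1, 0, -0.1, 0, 0.1, 0),
              B2 = down + c(0, 0.1, 0, -0.1, 0, 0.1))

# Sample-by-sample Pearson correlation matrix.
sample.cor <- cor(expr)

# Turn correlation into a dissimilarity and cluster hierarchically.
sample.dist <- as.dist(1 - sample.cor)
tree <- hclust(sample.dist, method = "average")

# Cut the dendrogram into two groups and draw it.
groups <- cutree(tree, k = 2)
print(groups)
plot(tree, main = "Sample clustering", xlab = "", sub = "")
```

In this toy example the two `A` samples fall into one cluster and the two `B` samples into the other, mirroring what the dendrogram on the correlation heatmap is meant to reveal for real samples.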

Reference

Michael I Love, Wolfgang Huber and Simon Anders. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology. 2014. 15:550

Differentially expressed lncRNAs table

```r
resSig <- BiocGenerics::subset(resOrdered, padj < 0.1)
fwrite(as.data.table(resSig), 'DE.csv', row.names = TRUE)
DT::datatable(head(as.data.frame(resSig), n = 80L)) %>%
  DT::formatRound(c('baseMean', 'log2FoldChange', 'lfcSE', "stat", 'pvalue', 'padj'), digits = 2)
```



bioinformatist/LncPipeReporter documentation built on May 28, 2019, 7:11 p.m.