knitr::opts_chunk$set( fig.width = 9, fig.height = 9, dpi = 72)

This is an R Markdown document for the SAMseq analysis module of IDEA: Interactive Differential Expression Analyzer. The plots in the SAMseq analysis module are drawn in R [1], using pheatmap [2] for the heat map, samr.plot from SAMseq (samr) [3] for the Q-Q plot, and ggplot2 [4] for the FDR distribution plot, and are presented in an HTML file via rmarkdown [5]. For higher-resolution figures, please download them directly from the website.

Citation: This work is in the process of being published; citation information will be posted here as soon as possible. Check out the IDEA website above.

#setwd(tempdir())
load("SAMseqAnalysis.RData")
#p1 basic information
      #experimental design plist[[1]][1]
      #selected pair plist[[1]][[2]]
      #SAMseq resample number plist[[1]][3]
      #fdr cutoff plist[[1]][4]
    # #heatmap gene number plist[[1]][5]
library(ggplot2)
library(gplots)
library(RColorBrewer)
library(scales)
library(pheatmap)
library(plyr)
library(labeling)
library(stringr)
library(rmarkdown)
#library(S4Vectors)
library(samr)

Introduction

Count data generated by high-throughput sequencing methods such as RNA-Seq [6, 7], Tag-Seq [8, 9], and ChIP-Seq [10] are increasingly used to represent the abundance of genes/features at the RNA/DNA level, since read count and abundance are linearly related [7]. In RNA-Seq, variation between replicates is also low, which makes RNA-Seq count data well suited to discovering differentially expressed genes [11]. Differential expression (DE) analysis typically involves the following choices: the normalization and noise control method [7, 9]; the data distribution, given the number of replicates [9]; and the method for assessing the statistical significance of DE detection [12].

SAMseq [3] is a computational method for differential expression analysis of count data, implemented in the R package "samr". It uses a Wilcoxon statistic together with a multiple-resampling strategy to assess the significance of DE, and estimates the FDR with the permutation plug-in method.
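
As a rough illustration of the idea described above, the following base-R sketch computes a centered Wilcoxon rank-sum statistic per feature on simulated counts and estimates a permutation plug-in FDR at a single cutoff. It is not the samr implementation: the simulated data and the helper names `wilcox_stat` and `plugin_fdr` are hypothetical, and the sketch leaves out SAMseq's resampling to a common sequencing depth and any estimate of the proportion of null features.

```r
# Simplified sketch only; simulated data and helper names are assumptions.
set.seed(1)
n_feat <- 200                                  # hypothetical number of features
y <- rep(c(1, 2), each = 5)                    # two conditions, five replicates each
x <- matrix(rpois(n_feat * length(y), lambda = 50), nrow = n_feat)
x[1:20, y == 2] <- rpois(20 * 5, lambda = 150) # first 20 features are truly up

# Centered Wilcoxon rank-sum statistic of condition 2, one value per feature.
wilcox_stat <- function(x, y) {
  r <- t(apply(x, 1, rank))                    # rank samples within each feature
  rowSums(r[, y == 2, drop = FALSE]) - sum(y == 2) * (length(y) + 1) / 2
}

obs  <- wilcox_stat(x, y)                          # observed statistics
perm <- replicate(100, wilcox_stat(x, sample(y)))  # null statistics, n_feat x 100

# Plug-in FDR at a cutoff: average number of permutation calls / observed calls.
plugin_fdr <- function(cutoff) {
  called <- sum(abs(obs) >= cutoff)
  false  <- mean(colSums(abs(perm) >= cutoff))
  if (called == 0) return(0)
  min(1, false / called)
}

fdr_at_max <- plugin_fdr(max(abs(obs)))
```

Lowering the cutoff passed to `plugin_fdr` calls more features at the price of a higher estimated FDR, which is the trade-off the FDR cutoff option controls.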

In IDEA, SAMseq version 2.0 is employed for DE analysis. For more information on SAMseq, please refer to reference [3] and the package manual.

Basic Information

Experimental Design

In IDEA, a raw count table and an experimental design table are required as input. The experimental design can be one of Standard Comparison, Multi-factors Design, and Without Replicates (not recommended). A pair of conditions is then selected for DE analysis.
Note that SAMseq (samr) is applicable only to Standard Comparison.
In this case, condition `r as.character(plist[[1]][[2]])[1]` and condition `r as.character(plist[[1]][[2]])[2]` were selected for differential expression analysis.
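
To make the inputs concrete, here is a small sketch of a count table, an experimental design table, and the selection of one pair of conditions. The table contents, the column names `sample` and `condition`, and the 1/2 label coding are illustrative assumptions, not IDEA's actual data structures.

```r
# Hypothetical raw count table: features in rows, samples in columns.
counts <- matrix(
  c(10, 12, 35, 40, 11,  9,
    50, 55, 52, 48, 20, 18,
     0,  1,  3,  2, 30, 28),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("gene1", "gene2", "gene3"),
                  c("s1", "s2", "s3", "s4", "s5", "s6"))
)

# Hypothetical experimental design table: one row per sample.
design <- data.frame(
  sample    = colnames(counts),
  condition = c("A", "A", "B", "B", "C", "C"),
  stringsAsFactors = FALSE
)

# Select one pair of conditions for the comparison.
pair <- c("A", "C")
keep <- design$condition %in% pair

x <- counts[, design$sample[keep], drop = FALSE]      # counts for the chosen pair
y <- ifelse(design$condition[keep] == pair[1], 1, 2)  # class labels: 1 or 2
```

Here `x` and `y` have the shape that a two-class comparison expects: one column of counts per retained sample and one class label per column.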

Advanced Options

Several advanced options are available in the SAMseq analysis module, including the number of resamplings used to compute the test statistics and the false discovery rate (FDR) cutoff.
In this case, the number of resamplings was set to `r as.character(plist[[1]][3])`, and the FDR cutoff was set to `r as.character(plist[[1]][4])`.
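
The sketch below shows how the stored options might be read and collected into arguments for `samr::SAMseq`. The `plist` object here is a hypothetical stand-in that follows the layout noted in the comments at the top of this document, and the mapping of the FDR cutoff to the `fdr.output` argument is an assumption, since the actual call made by IDEA is not shown here.

```r
# Hypothetical stand-in for the plist object loaded from SAMseqAnalysis.RData.
plist <- list(list(
  "Standard Comparison",    # plist[[1]][1]: experimental design
  c("Control", "Treated"),  # plist[[1]][[2]]: selected pair of conditions
  20,                       # plist[[1]][3]: SAMseq resample number
  0.05,                     # plist[[1]][4]: FDR cutoff
  50                        # plist[[1]][5]: number of features in the heat map
))

nresamp    <- as.numeric(as.character(plist[[1]][3]))
fdr_cutoff <- as.numeric(as.character(plist[[1]][4]))

# Arguments that could be passed to samr::SAMseq (mapping is an assumption).
samseq_args <- list(
  resp.type  = "Two class unpaired",
  nresamp    = nresamp,
  fdr.output = fdr_cutoff
)
```

With a count matrix `x` and class labels `y`, the call could then be written as `do.call(samr::SAMseq, c(list(x = x, y = y), samseq_args))`.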

Analysis Result

Differential Expression Table

A table containing information on all differentially expressed genes is presented with interactive options. The interpretation of each header is explained in Table 1.
Note that the same header can have different meanings in different packages. For example, p-values in DESeq are obtained by Wald test, but in edgeR they are obtained by an exact test analogous to Fisher's exact test.

htmltools::HTML('  
<div align="center">
Table 1: Interpretation of headers of differential expression table in SAMseq (samr)<br/>
<table cellpadding="5" cellspacing="0" border="1" frame=hsides rules=all style="border-color: #000000">
        <tr>
            <td style="border-width: medium thin medium 0">&nbsp;Headers</td>
            <td style="border-width: medium thin medium 0">&nbsp;Interpretation</td>
        </tr>
         <tr>
            <td style="border-width: 0 thin thin 0">&nbsp;FeatureID</td>
            <td style="border-width: 0 thin thin 0">&nbsp;Feature identifier</td>
        </tr>
        <tr>
            <td style="border-width: 0 thin thin 0">&nbsp;Score.d</td>
            <td style="border-width: 0 thin thin 0">&nbsp;The test statistic* value</td>
        </tr>
        <tr>
            <td style="border-width: 0 thin thin 0">&nbsp;Fold.Change</td>
            <td style="border-width: 0 thin thin 0">&nbsp;The ratio of the two compared values; fold change is defined as counts of Condition2 divided by counts of Condition1</td>
        </tr>
        <tr>
            <td style="border-width: 0 thin medium 0">&nbsp;q.value</td>
            <td style="border-width: 0 thin medium 0">&nbsp;The lowest FDR at which that gene is called significant</td>
        </tr>
</table>
*Test statistic: in SAMseq, a Wilcoxon statistic computed with a multiple-resampling strategy [3].
</div>   
')
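
The Fold.Change and q.value columns can be illustrated with a short sketch. The counts and FDR values below are made up, fold change is taken here as a ratio of mean counts (samr may compute it differently, for example on normalized data), and the q-value is derived as the smallest FDR over all cutoffs at which a feature would still be called.

```r
# Hypothetical counts: two features, two replicates per condition.
counts <- matrix(
  c(10, 12, 40, 44,
    30, 28, 15, 13),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("geneA", "geneB"),
                  c("c1_r1", "c1_r2", "c2_r1", "c2_r2"))
)
cond <- c(1, 1, 2, 2)

# Fold change: Condition2 divided by Condition1 (ratio of means is an assumption).
fold_change <- rowMeans(counts[, cond == 2]) / rowMeans(counts[, cond == 1])

# Hypothetical FDR estimates for features ranked from most to least extreme score.
fdr_by_rank <- c(0.02, 0.01, 0.08, 0.05)

# q-value: lowest FDR among all cutoffs that still include the feature.
q_value <- rev(cummin(rev(fdr_by_rank)))
```

Because a looser cutoff can sometimes give a lower estimated FDR, the running minimum makes the q-values non-decreasing along the ranking.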

Heat Map of Differentially Expressed Genes

A heat map displays the differential expression table graphically: each square (pixel) represents the value of a feature in a sample and is colored accordingly. Here, the heat map of differentially expressed features is plotted with the R package pheatmap. Samples are arranged in columns and features in rows, as in the original data matrix. Up-regulated features are colored red in the heat map, and down-regulated features green. Hierarchical clustering results for features and samples are shown as dendrograms on the left and upper sides of the heat map, respectively.
The number of features displayed as rows, the display of the dendrograms on the left and upper sides, and the display of the color key can all be changed interactively. The data scaling of the heat map can be one of "none", "row", and "column", as chosen by the user. The color is scaled by $\log_{10}(\text{Normalized Read Count} + 1)$.

In this case, data are centered and scaled in the `r as.character(plist[[3]][[3]])` direction. For more information on parameter settings, please refer to the manual of the pheatmap package (reference [2]).
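
A minimal sketch of the transformation and scaling described above is given below. The matrix values and the green-black-red palette are illustrative assumptions, not the settings used by IDEA; the `pheatmap` arguments `scale`, `color`, `cluster_rows`, `cluster_cols`, and `legend` are documented options of that function.

```r
# Hypothetical normalized counts: four features (rows) by four samples (columns).
norm_counts <- matrix(
  c(  5,   8, 120, 150,
    300, 280,  20,  25,
      0,   3,  60,  90,
    400, 390,  44,  40),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("feature", 1:4), paste0("sample", 1:4))
)

# Color scale input: log10(normalized read count + 1).
log_mat <- log10(norm_counts + 1)

# What scaling in the "row" direction does: center and scale each feature.
row_scaled <- t(scale(t(log_mat)))

# Draw the heat map only in an interactive session with pheatmap installed.
if (interactive() && requireNamespace("pheatmap", quietly = TRUE)) {
  pheatmap::pheatmap(
    log_mat,
    scale = "row",                  # one of "none", "row", "column"
    color = colorRampPalette(c("green", "black", "red"))(100),
    cluster_rows = TRUE,            # dendrogram on the left
    cluster_cols = TRUE,            # dendrogram on top
    legend = TRUE                   # color key
  )
}
```

After row scaling every feature has mean 0 and standard deviation 1, so colors reflect each feature's pattern across samples instead of its absolute abundance.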

Figure 1: Heat map of differentially expressed genes; the top `r plist[[1]][5]` DE features with the lowest false discovery rate (FDR) are displayed

Q-Q Plot

A Q-Q plot is a probability plot that compares two distributions by plotting their quantiles ("Q") against each other. In SAMseq (samr), a Q-Q plot can be generated by the function samr.plot once a samr.obj has been obtained from a call to samr, for example via the function SAMseq. The expected score of each feature is plotted on the x-axis and the observed score on the y-axis. A solid 45-degree line is drawn in the plot, intercepting the y-axis at the minimal fold change value. The upper and lower parallel dashed lines are drawn at a vertical distance of delta from it and define the SAM threshold rule. Features that pass the minimal fold change and fall outside the dashed lines are thus identified as differentially expressed. Up-regulated features, whose observed scores are greater than their expected scores, are colored red, and down-regulated features green.
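
The threshold rule can be sketched in base R as follows. This is a simplified illustration and not the code behind samr.plot: the expected scores are averaged order statistics from simulated null draws, the observed scores are synthetic (expected scores with a few shifted values), and the fold change filter is omitted.

```r
# Simplified sketch of the delta threshold rule; all data are synthetic.
set.seed(2)
n <- 100

# Expected scores: average of sorted null scores over 200 simulated permutations.
perm_sorted <- replicate(200, sort(rnorm(n)))
expected <- rowMeans(perm_sorted)

# Synthetic observed scores: null everywhere except a few shifted features.
observed <- expected
observed[96:100] <- observed[96:100] + 3   # five up-regulated features
observed[1:3]    <- observed[1:3] - 3      # three down-regulated features

delta <- 1
gap <- observed - expected                 # vertical distance from the 45-degree line

# Moving up from the origin, everything past the first point above the band is called.
up_idx <- which(expected >= 0 & gap > delta)
up <- if (length(up_idx) > 0) seq(min(up_idx), n) else integer(0)

# Moving down from the origin, everything past the first point below the band is called.
down_idx <- which(expected < 0 & -gap > delta)
down <- if (length(down_idx) > 0) seq_len(max(down_idx)) else integer(0)
```

A larger `delta` widens the band around the 45-degree line, so fewer features are called and the estimated FDR drops.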

Figure 2: Q-Q plot in SAMseq

FDR Distribution Plot

The false discovery rate (FDR) distribution plot visualizes the distribution of FDR values in the DE test. In SAMseq, a Wilcoxon test is used for the differential expression test, and a permutation plug-in method for multiple-testing correction. The plot uses FDR on the x-axis and the percentage of features in each group of FDR values on the y-axis, and colors the significant and non-significant groups differently.
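
One way such a plot could be assembled is sketched below. The FDR values are simulated, and the bin width of 0.05, the cutoff, and the axis labels are illustrative assumptions; the percentages are computed in base R, and the ggplot2 object is only built when that package is available.

```r
# Simulated FDR values for 200 hypothetical features.
set.seed(3)
fdr <- c(runif(30, 0, 0.05), runif(170, 0.05, 1))
fdr_cutoff <- 0.05

df <- data.frame(fdr = fdr, significant = fdr <= fdr_cutoff)

# Group the FDR values into bins of width 0.05.
df$bin <- cut(df$fdr, breaks = seq(0, 1, by = 0.05), include.lowest = TRUE)

# Percentage of all features per bin and significance group.
binned <- as.data.frame(table(bin = df$bin, significant = df$significant))
binned$percent <- 100 * binned$Freq / nrow(df)

# Build the bar chart, colored by significance, if ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(binned,
                       ggplot2::aes(x = bin, y = percent, fill = significant)) +
    ggplot2::geom_col() +
    ggplot2::labs(x = "FDR", y = "Percentage of features (%)")
}
```

Printing `p` renders the plot; the percentages across all bars sum to 100.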

Figure 3: FDR distribution plot in SAMseq

References

1. R Core Team, R: A language and environment for statistical computing, 2014, R Foundation for Statistical Computing: Vienna, Austria.
2. Kolde, R., pheatmap: Pretty Heatmaps, 2013.
3. Li, J. and R. Tibshirani, Finding consistent patterns: A nonparametric approach for identifying differential expression in RNA-Seq data. Statistical Methods in Medical Research, 2013. 22(5): p. 519-536.
4. Wickham, H., ggplot2: elegant graphics for data analysis, 2009, Springer New York.
5. Allaire, J.J., J. McPherson, Y. Xie, H. Wickham, J. Cheng, and J. Allen, rmarkdown: Dynamic Documents for R, 2014.
6. Nagalakshmi, U., et al., The transcriptional landscape of the yeast genome defined by RNA sequencing. Science, 2008. 320(5881): p. 1344-9.
7. Mortazavi, A., et al., Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods, 2008. 5(7): p. 621-8.
8. Morrissy, A.S., et al., Next-generation tag sequencing for cancer gene expression profiling. Genome Res, 2009. 19(10): p. 1825-35.
9. Anders, S. and W. Huber, Differential expression analysis for sequence count data. Genome Biol, 2010. 11(10): p. R106.
10. Robertson, G., et al., Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat Methods, 2007. 4(8): p. 651-7.
11. Marioni, J.C., et al., RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res, 2008. 18(9): p. 1509-17.
12. Robinson, M.D., D.J. McCarthy, and G.K. Smyth, edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 2010. 26(1): p. 139-40.


likelet/IDEA documentation built on Sept. 8, 2020, 2:56 p.m.