This is an R Markdown document for Data Exploration of IDEA: Interactive Differential Expression Analyzer. Plots in Data Exploration module (plotted in R[1] with ggplot2[2] (for all plots but heat map) and pheatmap[3] (for heat map)) is presented in HTML file via rmarkdown [4]. For figures of higher resolution, please download from website directly.
Citation: This work is in process of publishing, citation method will be post here as soon as possible. Check out the IDEA website above.

knitr::opts_chunk$set( fig.width = 9, fig.height = 9, dpi = 72)
setwd(tempdir())
load("myfile.RData")
library(ggplot2)
library(RColorBrewer)
library(scales)
library(pheatmap)
library(plyr)
library(labeling)
library(stringr)
library(rmarkdown)

Introdution

Count data, as generated by various high-throughput sequencing methods such as RNA-Seq[5, 6], Tag-Seq[7, 8], and ChIP-Seq[9], has been more and more used to represent the abundance of genes/features at RNA/DNA level since read count and abundance are linearly related [6]. Also in RNA-Seq, variation caused by replicate is low, which makes RNA-Seq count data advantageous for differential expressed gene discovery[10]. Differential expression (DE) analysis typically works with following questions: choice of normalization and noise control method [6, 8]; choice of data distribution given numbers of replicates[8]; choice of assessment of statistical significance of DE detection[11].

Counts data depends not only on expression level of features, but also on several factors including transcript length and sequence depth [6, 12]. Therefore, it is necessary to normalize counts data before identifying differential expressed features.

Basic Information

Inputted Data

Raw read count data should be uploaded in the "New" module, with experimental design type related parameters declared.
Note that due to an upper limits for number of groups in color palettes, only data that has no more than 11 conditions/samples is supported. To plot data with more conditions/samples, please refer to open-sourced code of IDEA and run locally.

Experimental Design

In data exploration, experimental design is set in the "New" module before starting analysis. Three experiment types are available, including Standard Comparison, Multi-factor Design and Without Replicates. Selection of paired conditions is available for all three experiment types. For Multi-factor Design, factor of interst is defined as first column of inputted data.
In this case, experimental design type was stated as r as.character(plist[[10]]).

  #if multi-factor then
  #output: And the factor of interest was XXX

Normalization

if(as.character(plist[[10]]) == "uqua"){
  plist[[10]] = "UQ"
}

For normalization, four methods are provided, including the Reads per Kilobase per Million Reads (RPKM) [6], the Upper Quartile (UQ) [13], and the Trimmed Mean of M values (TMM) [14]. User can also perform data exploration without any normalization method by choosing "None" in the drop-down menu in Normalization Method Panel. Summarized details of each normalization methods are listed in Table 1. In this report the "normalized counts" is taken as the expression unit, but inputted count data can be normalized, or not, as decided by users.
Note that if user did not upload a gene length file for normalization, a default length of 1000bp/gene is used.
In this case, normalization was set as r plist[[10]].

htmltools::HTML('     
<div align="center">
Table 1: Summarized details for normalization methods provided in Data Exploration module<br/>
<table cellpadding="5" cellspacing="0" border="1" frame=hsides rules=all style="border-color: #000000">
        <tr>
          <td style="border-width: medium thin medium 0">&nbsp;Method</td>
          <td style="border-width: medium thin medium 0">&nbsp;Abbreviation</td>
          <td style="border-width: medium thin medium 0">&nbsp;Summary</td>
        </tr>
        <tr>
          <td style="border-width: 0 thin thin 0">&nbsp;The Reads per Kilobase per Million Reads[<a href="#ref6">6</a>]</td>
          <td style="border-width: 0 thin thin 0">&nbsp;RPKM</td>
          <td style="border-width: 0 thin thin 0">&nbsp;Counts per kilobase per million mapped reads or total number of reads in library calculated</td>
        </tr>
        <tr>
          <td style="border-width: 0 thin thin 0">&nbsp;The Upper Quartile (default)[<a href="#ref13">13</a>]</td>
          <td style="border-width: 0 thin thin 0">&nbsp;UQ</td>
          <td style="border-width: 0 thin thin 0">&nbsp;Features that are zero in all library removed, scale factor calculated from a upper quartile of counts for each library</td>
        </tr>
        <tr>
          <td style="border-width: 0 thin thin 0">&nbsp;The Trimmed Mean of M values[<a href="#ref14">14</a>]</td>
          <td style="border-width: 0 thin thin 0">&nbsp;TMM</td>
          <td style="border-width: 0 thin thin 0">&nbsp;Weight taken from delta method on binomial data, then trimmed weighted means calculated</td>
        </tr>
        <tr>
          <td style="border-width: 0 thin medium 0">&nbsp;None</td>
          <td style="border-width: 0 thin medium 0">&nbsp;None</td>
          <td style="border-width: 0 thin medium 0">&nbsp;All scaling factors set to 1</td>
        </tr>
</table>
</div>
')

Data Exploration Plots

Samples Boxplot

Samples boxplot visualizes count distribution for all samples, showing reads distribution in each sample repectively. It uses expression values as y-axis, and samples as x-axis.

print(plist[[1]])


Figure 1: Expression distribution of in all samples.

Stacked Density Plot

Stacked density plot visualizes density distribution of genes with different reads count, showing overall condition of normailzed counts. It also enables interactive comparison of density distribution between groups. The x-axis is defined as logarithmic scale (base 10) of (normalized reads count + 1) to avoid NA values, and the y-axis is defined as density of calculated logarithm.

print(plist[[2]])


Figure 2: Normalized expression level distribution for genes in selected samples.

Ratio Bar Plot

Ratio bar plot visualize distribution of normalized counts in each sample using stacked bar. Low counts may introduce noise and interfere extraction of differential expression of features [13]. It shows features in each sample by feature of having non-zero normalized counts, or more than 1, 2, 5, 10, and uses samples as x-axis, sensitivity, that is, proportion of groups in each sample, as y-axis.

wzxhzdk:7
Figure 3: Ratio of genes with low counts for each sample.

Heat Map of Sample Distance

Heat map of sample distance visualizes the distance between samples by calculating sample distance from features in two samples and plotting heatmap and hierarchical clustering. Sample distances is defined as Euclidean distance between samples, calculated as:
$$Sample\ Distance = d(q,p) = \sqrt{\sum_{i=1}^{n} (q_{i}-p_{i})^{2}}$$
, where p and q represent the two samples, pi and qi their features’ normalized counts.

wzxhzdk:8
Figure4 Heat map of sample distance.

PCA

Principal Component Analysis (PCA) result is visualized by scatter plot showing samples as dots colored by their groups/conditions. Basically, PCA is to reduce dimension of data. Here a 2-dimension eigenvector is generated using package FactoMineR [15]. In IDEA, the first two principal components were extracted to generate this two-dimension plot. X- and y-axis represent s first two principal components.

wzxhzdk:9
Figure5 PCA result.

Correlation Analysis

Correlation analysis visualizes feature expression correlation between two selected samples using scatter plot. Generally, samples from biological/technical replicates are likely display high correlation, and samples from different conditions in experimental design usually gives a poor correlation [16]. Thus examining correlation between samples can give a hint of their relationship. To avoid NA, x-axis and y-axis are defined as:
$$log_{10} (Normalized Read Counts + 1)$$
, where "Normalized Reads Count" stands for normalized counts. Spearman’s correlation is calculated by function cor in package stats [1]. Each feature is visualized as a single black dot in figure. For selected two samples, Spearman correlation coefficient is calculated as:
$$Spearman\ Correlation\ Coefficient = \frac{\Sigma_{i}(P_{i}-\overline{P})(Q_{i}-\overline{Q})}{\sqrt{\Sigma_{i}(P_{i}-\overline{P})(Q_{i}-\overline{Q})}}$$
, where $p$ and $q$ stand for the two samples, $P_{i}$ and $Q_{i}$ represent converted ranks of $p_{i}$ and $q_{i}$, and $\overline{P}$ and $\overline{Q}$ the average value of $P_{i}$ and $Q_{i}$.
Spearman correlation coefficient is shown as digits on webpage in module of correlation analysis. In this case, Spearman correlation coefficient is r plist[[11]].

print(plist[[6]])
#


Figure 6: Scatter plot and correlation analysis between gene expression levels in r plist[[8]][1] and r plist[[8]][2].

Feature Query - Expression Level Comparison

A histogram with error bar visualizes the comparison result of expression level of a certain feature, selected by user, in different conditions, where x-axis stands for applicable conditions and y-axis stands for normalized counts.

print(plist[[7]])


Figure 7: Expression level comparison of feature r plist[[9]] in each condition.

References

1. R Core Team, R: A language and environment for statistical computing, 2014, R Foundation for Statistical Computing: Vienna, Austria.
2. Wickham, H., ggplot2: elegant graphics for data analysis, 2009, Springer New York.
3. Kolde, R., pheatmap: Pretty Heatmaps, 2013.
4. JJ Allaire, J.M., Yihui Xie, Hadley Wickham, Joe Cheng and Jeff Allen, rmarkdown: Dynamic Documents for R, 2014.
5. Nagalakshmi, U., et al., The transcriptional landscape of the yeast genome defined by RNA sequencing. Science, 2008. 320(5881): p. 1344-9.
6. Mortazavi, A., et al., Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods, 2008. 5(7): p. 621-8.
7. Morrissy, A.S., et al., Next-generation tag sequencing for cancer gene expression profiling. Genome Res, 2009. 19(10): p. 1825-35.
8. Anders, S. and W. Huber, Differential expression analysis for sequence count data. Genome biol, 2010. 11(10): p. R106.
9. Robertson, G., et al., Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat Methods, 2007. 4(8): p. 651-7.
10. Marioni, J.C., et al., RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res, 2008. 18(9): p. 1509-17.
11. Robinson, M.D., D.J. McCarthy, and G.K. Smyth, edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 2010. 26(1): p. 139-40.
12. Li, J., et al., Normalization, testing, and false discovery rate estimation for RNA-sequencing data. Biostatistics, 2011: p. kxr031.
13. Bullard, J.H., et al., Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics, 2010. 11: p. 94.
14. Robinson, M.D. and A. Oshlack, A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol, 2010. 11(3): p. R25.
15. Francois Husson, J.J., Sebastien Le and Jeremy Mazet, FactoMineR: Multivariate Exploratory Data Analysis and Data Mining with R, 2014.
16. Fu, X., et al., Estimating accuracy of RNA-Seq and microarrays with proteomics. BMC Genomics, 2009. 10: p. 161.


likelet/IDEA documentation built on Sept. 8, 2020, 2:56 p.m.