inst/markdown/documents.md

QRseq: an R Shiny application for analyzing and visualizing RNA sequencing data

Shixue Gou, Liangxue Lai Lab, Guangzhou Institutes of Biomedicine and Health,Chinese Academy of Sciences

June 10th, 2021

[TOC]

1. Introduction

Here we developed QRseq, an R shiny application that can be launched easily from a local web browser for analyzing sequenced or published RNA-seq data. QRseq allows users to analysis RNA-seq data from local or to analysis published datasets from GEO fastly and interactively. This application is started from data input, followed by preprocessing data by filtering low expressed genes and poorly reproducible samples, correcting batch effects, normalizing and transforming data, identifying differential expressed genes and some biological patterns, exploring the enrichment of functions, analyzing and visualizing the protein to protein networks or gene regulation networks. QRseq provides a clear analysis flow and a user-friendly GUI interface but keeps the most important parameters of involved functions, which suit both non-programming experience researchers and expert bioinformatics researchers. User can accomplish a standard RNA-seq analysis in hours depending on the size of their dataset and requires using QRseq.

2. Installation and Launch

2.1 Run using docker (Recommended):

Make sure docker is installed.

docker run -p 3838:3838 goushixue/qrseq-shiny

The default port is 3838, to run on a different port (e.g. 7788):

docker run -p 7788:3838 goushixue/qrseq-shiny

Then access via web-browser on http://localhost/:3838 or http://localhost:7788 depending on port number used.

2.2 Run using R and Rstudio:

2.2.1 Installation QRseq in R

# you may need install devtools and BiocManager package first

## try http:// if https:// URLs are not supported
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")

if (!requireNamespace("devtools", quietly=TRUE))
        install.packages("devtools") 

## Install QRseq
devtools::install_github("goushixue/QRseq")

2.2.2 Lanch QRseq in R

To run PIVOT, in Rstudio console, use command:

library(QRseq)
startQRseq()

3. Getting Start

After starting QRSeq, the application will be launched to an interactive web interface, as shown in the figure below. We provide two data input approaches for users to choose from, one is to start the analysis of the local data which can be activated by the Get Started Local button, and the other is to start the analysis of public data stored in the GEO database which can be launched with the Get started GEO button.

4. Data Input

For people with no programming experience or background in bioinformatics, it is difficult to re-analyze the sequencing data when it is delivered to them. While sequencing companies can provide analytics, the data used for publication is much more personalized and needs to be tailored by users to suit their own needs. Therefore, we provide users with the option to perform sequencing data analysis on a personal computer.

GEO is a public functional genomics data repository supporting MIAME-compliant data submissions. Array- and sequence-based data are accepted. Tools are provided to help users query and download experiments and curated gene expression profiles. We provide users with the option to search, download, and analyze data in the GEO database without having to jump from site to site to do the tedious data manipulation.

4.1 upload data from local

4.2 download data from GEO

4.3 pre-filtering data

After clicking the "Upload" button in Local Module or the "GO NEXT" button of GEO Module, the following analysis steps will be consistent from this section.

5. Data Normalizing and transforming

DESeq2 offers two transformations for count data that stabilize the variance across the mean: the variance stabilizing transformation (VST) for negative binomial data with a dispersion-mean trend (Anders and Huber 2010), implemented in the vst function, and the regularized-logarithm transformation or rlog (Love, Huber, and Anders 2014).For genes with high counts, both the VST and the rlog will give similar result to the ordinary log2 transformation of normalized counts. For genes with lower counts, however, the values are shrunken towards a middle value. The VST or rlog-transformed data then become approximately homoskedastic (more flat trend in the meanSdPlot), and can be used directly for computing distances between samples, making PCA plots, or as input to downstream methods which perform best with homoskedastic data.

Which transformation to choose? The VST is much faster to compute and is less sensitive to high count outliers than the rlog. The rlog tends to work well on small datasets (n < 30), potentially outperforming the VST when there is a wide range of sequencing depth across samples (an order of magnitude difference). We therefore recommend the VST for medium-to-large datasets (n > 30).

5.1 Design the experiment condition table

>>

5.2 Batch factor correction

>>

5.3 Normalization and transformation

6. Data Quality Assessment

Data quality assessment and quality control (i.e. the removal of insufficiently good data) are essential steps of any data analysis. These steps should typically be performed very early in the analysis of a new data set, preceding or in parallel to the differential expression testing.

We define the term quality as fitness for purpose. Our purpose is the detection of differentially expressed genes, and we are looking in particular for samples whose experimental treatment suffered from an anormality that renders the data points obtained from these particular samples detrimental to our purpose.

6.1 Principal Component Analysis (PCA)

>>

6.2 Hierarchical Clustering Heatmap

In the hierarchical clustering heatmap the dendrogram at the side shows us a hierarchical clustering of the samples and genes. Since the clustering is only relevant for genes that actually carry a signal, one usually would only cluster a subset of the most highly variable genes. Here, we select the top 500 (default) highest variance genes across samples to generate the hierarchical clustered heatmap.

6.3 Sample to Sample Distance

A heatmap of this distance matrix gives us an overview over similarities and dissimilarities between samples. We have to provide a hierarchical clustering hc to the heatmap function based on the sample distances, or else the heatmap function would calculate a clustering based on the distances between the rows/columns of the distance matrix.

6.4 Sample correlation coefficients

The sample correlation coefficient, r, estimates the population correlation coefficient, ρ. It indicates how closely a scattergram of x,y points cluster about a 45° straight line. A tight cluster implies a high degree of association. The coefficient of determination R2 indicates the proportion of ability to predict y that can be attributed to the model using the independent (predictor) variables. In the case of a single predictor x in a straight-line relationship with y, R2 is just the square of r.

6.4.1 pairwise scatter plot

You can generate pairwise scatterplot bettween any two condition groups. Rlog data, VST data or log2(normalized_counts + 1) was supported to perform this analysis. Ggplot2 codes also were supported to modify features of the figure.

6.4.2 sample correlation heatmap

The sample correlation heatmap provides a more intuitive way of visualizing the correlation between your samples. And you can also adjust multiple aesthetics of the plot.

6.4.3 pairwise scatter panels

Pairs.panels shows a scatter plot of matrices (SPLOM), with bivariate scatter plots below the diagonal, histograms on the diagonal, and the Pearson correlation above the diagonal. Useful for descriptive statistics of small data sets. DO NOT run this module if you have too many samples/cells, you should instead use sample correlation heatmap.

7. Differential Expression Analysis

The results tables of differentially expressed genes will be generated in this section, which extracts a results table with log2 fold changes, p values and adjusted p values. To extract the results table of differentially expressed genes, user need to specify the control group, treatment group, p value threshold and log2 fold changes threshold. After that, DEG tables will be save to DEG directory in the current working environment. User can perform the analysis on DEG tables by select different group of DEG tables.

7.1 Volcano plot

Volcano plots are increasingly popular in ‘omics’ type experiments (e.g., genomics, proteomics, and metabolomics) that typically compare two conditions (e.g., wild-type vs. mutant or healthy vs. disease) and involve many thousands of replicate data points. By separating these data by the magnitude of the difference between the two conditions (on the x-axis) and the statistical significance of that difference (on the y-axis), it’s possible to quickly pick out those data points (e.g., genes or proteins) that display a large magnitude change but are also statistically significant.

7.2 DEG heatmap

Heatmaps are a great way of displaying three-dimensional data in only two dimensions. It can help find differences, similarities, variability between groups of genes or conditions in terms of direction of change in gene expression, its magnitude, its statistical significance, and baseline expression level.

7.3 DEG VennDiagram

This function can compute Venn intersects for large numbers of sample sets and plots 2-5 way Venn diagrams. A useful feature is the possiblity to combine the counts from several Venn comparisons with the same number of sample sets in a single Venn diagram.

7.4 DEG Number Barplot

Filter and plot DEG results for up and down regulated genes.

7.5 DEG Detail Table

Results tables are generated using the function results, which extracts a results table with log2 fold changes, p values and adjusted p values. A typical RNA-seq experiment may have many LFCs between -1 and 1, padj < 0.05.

8. Differentially Expressed Gene Patterns

Many methods for time-course experiments aim to identify differentially expressed genes between multi-series time-courses (e.g. two treatments monitored over time). Alternatively, single-series time-course experiments, those where a single treatment is monitored over time, are also of biological interest. In these experiments, genes with dynamic expression patterns over time are identified, which can provide insight on regulatory genes and reveal key transitional periods. In this section, we show how to detect expression patterns of differentilly expressed genes across samples. Mainly useful when data is a time course experiment. This part of the function will help users calculate the differentially expressed genes with the same expression trend and display them in the form of line map or heat map.

8.1 Run parameters

8.2 Visualize the expression patterns

9. Gene Expression Visualization

Another basic function of RNA-seq is to visualize the level of gene expression. Here we offer different data sources (transformed data, normalized data, log2foldchange, eg.) and multiple ways to plotting the results. This will give users more options to better present their data.

9.1 Expression BarPlot

A barplot or barplots of expression values with confidence intervals for a given gene, set of genes.

9.2 Expression BoxPlot

A boxplot or boxplots of expression values with confidence intervals for a given gene, set of genes.

9.3 Expression HeatMap

Heat map is a well-received approach to illustrate gene expression data. It is an impressive visual exhibit that addresses explosive amounts of NGS data. It’s packed with closely set patches in shades of colors, pomping the gene expression data of multifarious high-throughput tryouts.

9.4 Log2FoldChage BarPlot

The Log2 fold-change (L2FC) is an estimate of the log2 ratio of expression in a cluster to that in all other cells. A value of 1.0 indicates 2-fold greater expression in the cluster of interest.

9.5 Log2FoldChage HeatMap

Heat map of gene expression log2foldchanges across different groups.

9.6 Log2FoldChage DotPlot

Dot plot of gene expression log2foldchanges across different groups.

10. WGCNA analysis

Correlation networks are increasingly being used in bioinformatics applications. For example, weighted gene co-expression network analysis is a systems biology method for describing the correlation patterns among genes across microarray samples. Weighted correlation network analysis (WGCNA) can be used for finding clusters (modules) of highly correlated genes, for summarizing such clusters using the module eigengene or an intramodular hub gene, for relating modules to one another and to external sample traits (using eigengene network methodology), and for calculating module membership measures. Correlation networks facilitate network based gene screening methods that can be used to identify candidate biomarkers or therapeutic targets.

10.1 Data Preparation

This is the first step of any network analysis. We show here how to load typical expression data, pre-process them into a format suitable for network analysis, and clean the data by removing obvious outlier samples as well as genes and samples with excessive numbers of missing entries. And then load in the trait data and match the samples for which they were measured to the expression samples.

10.2 Modules Detection

This step is the bedrock of all network analyses using the WGCNA methodology.We using a convenient 1-step network construction and module detection function. Automatically choose soft-thresholding power for network construction and module detection.

10.3 Gene Significance and Module Membership

We quantify associations of individual genes with our trait of interest by defining Gene Significance GS as (the absolute value of) the correlation between the gene and the trait. For each module, we also define a quantitative measure of module membership MM as the correlation of the module eigengene and the gene expression profile. This allows us to quantify the similarity of all genes on the array to every module.

11. Functional Enrichment Analysis

Enrichment analysis is a means to characterize biological attributes in a given gene set. Over Representation Analysis (ORA) is a widely used approach to determine whether known biological functions or processes are over-represented (= enriched) in an experimentally-derived gene list, e.g. a list of differentially expressed genes (DEGs). While all genes can be used in Gene Set Enrichment Analysis (GSEA); GSEA aggregates the per gene statistics across genes within a gene set, therefore making it possible to detect situations where all genes in a predefined set change in a small but coordinated way. Since it is likely that many relevant phenotypic differences are manifested by small but consistent changes in a set of genes.

11.1 ORA by gprofiler2

gprofiler2 provides an R interface to the widely used web toolset g:Profiler (https://biit.cs.ut.ee/gprofiler). The toolset performs functional enrichment analysis and visualization of gene lists, converts gene/protein/SNP identifiers to numerous namespaces, and maps orthologous genes across species. g:Profiler relies on Ensembl databases as the primary data source and follows their release cycle for updates.

11.1.1 Running Parameters

gprofiler2 enables to perform functional profiling of gene lists. The function performs statistical enrichment analysis to find over-representation of functions from Gene Ontology, biological pathways like KEGG and Reactome, human disease annotations, etc. This is done with the hypergeometric test followed by correction for multiple testing.

11.1.2 Visualization by gostplot

The enrichment results are visualized with a Manhattan-like-plot using the function gostplot and the previously found enrichment results.The x-axis represents the functional terms that are grouped and color-coded according to data sources and positioned according to the fixed "source_order". This section enables to highlight a selection of interesting terms from the results with numbers and table of results. These can be set with parameter highlight_terms listing the term IDs in a vector or as a data.frame with column "term_id" such as a subset of the result data.frame.

11.1.3 Visualization by gosttable

The enrichment results can also be visualized with a table. The gosttable will create a nice-looking table with the result statistics for the highlight_terms from the result data.frame. The Columns will be showed parameter is used to list the names of additional columns to show in the table in addition to the "term_id" and "p_value".

11.1.4 Visualization by dotplot

For comparing different enrichment results, the x-axis represent different gene clusters while for a single enrichment result, the x-axis can be gene count or gene ratio. This is actually similar to traditional barplot, with dot position as bar height and dot color as bar color. But dotplot can represent one more feature nicely by dot size and it can be a good alternative to barplot.

11.1.5 Visualization by heatmap

To facilitate the exploration of gene expression in the specific enrichment term, we support users to use heat maps to visualize the expression of genes in any term they are interested in.

11.2 ORA By clusterProfiler

Over Representation Analysis (ORA) is a widely used approach to determine whether known biological functions or processes are over-represented (= enriched) in an experimentally-derived gene list, e.g. a list of differentially expressed genes (DEGs).

11.2.1 Running Parameters

11.2.2 Visualization by barplot

Bar plot is the most widely used method to visualize enriched terms. It depicts the enrichment scores (e.g. p values) and gene count or ratio as bar height and color.

11.2.3 Visualization by dotplot

Dot plot is similar to bar plot with the capability to encode another score as dot size.

11.2.4 Visualization by ggtable

The enrichment results can also be visualized with a table. This function will create a nice-looking table with the result statistics for the interested terms from the result data frame.

11.2.5 Visualization by cnetplot

Both the barplot and dotplot only displayed most significant enriched terms, while users may want to know which genes are involved in these significant terms. The cnetplot depicts the linkages of genes and biological concepts (e.g. GO terms or KEGG pathways) as a network to extract the complex association.

11.2.6 Visualization by emapplot

Enrichment map organizes enriched terms into a network with edges connecting overlapping gene sets. In this way, mutually overlapping gene sets are tend to cluster together, making it easy to identify functional module.

11.2.7 Visualization by heatmap

In addition to knowing which biological functions (e.g. GO terms or KEGG pathways) the gene set is enriched in, users may also want to know the gene expression in a particular enriched biological functions. Therefore, we support drawing gene expression heat maps of specific biological functions directly from the enrichment results, so as to facilitate users to explore the target genes they are interested in.

11.3 GSEA by clusterProfiler

A common approach in analyzing gene expression profiles was identifying differential expressed genes that are deemed interesting. This approach will find genes where the difference is large, but it will not detect a situation where the difference is small, but evidenced in coordinated way in a set of related genes. Gene Set Enrichment Analysis (GSEA) directly addresses this limitation. All genes can be used in GSEA; GSEA aggregates the per gene statistics across genes within a gene set, therefore making it possible to detect situations where all genes in a predefined set change in a small but coordinated way. Since it is likely that many relevant phenotypic differences are manifested by small but consistent changes in a set of genes.

11.3.1 Running Parameters

11.3.2 Visualization by gseaplot2

Running score and preranked list are traditional methods for visualizing GSEA result. Both of them were supported to visualize the distribution of the gene set and the enrichment score. We also support multile gene sets to be displayed on the same figure and display the pvalue table on the plot.

11.3.3 Visualization by ridgeplot

The ridgeplot will visualize expression distributions of core enriched genes for GSEA enriched categories. It helps users to interpret up/down-regulated pathways.

11.3.4 Visualization by heatmap

Similarly, heat maps of gene expression in specific biological functions allow users to explore target genes of interest.

12. Network Analysis

Studies have pointed out that the expression of genes are highly regulated, which result in a cascade of distinct patterns of coexpression forming a network. Identifying and understanding such patterns is crucial in deciphering the mechanism of gene expression regulation and expression heterogeneity. We provide three methods to perform the network analysis: KEGG pathview, Protein-to-protein network and Gene regulatory networks.

12.1 KEGG pathview

Pathview is a tool set for pathway based data integration and visualization. It maps and renders a wide variety of biological data on relevant pathway graphs. All users need is to supply their data and specify the target pathway. Pathview automatically downloads the pathway graph data, parses the data file, maps user data to the pathway, and render pathway graph with the mapped data. In addition, Pathview also seamlessly integrates with pathway and gene set (enrichment) analysis tools for large-scale and fully automated analysis.

12.2 Protein-to-protein network

STRING (https://www.string-db.org) is a database of known and predicted protein-protein interac- tions. The interactions include direct (physical) and indirect (functional) associations. The database contains information from numerous sources, including experimental repositories, computational pre- diction methods and public text collections. Each interaction is associated with a combined confidence score that integrates the various evidences.

12.3 Gene regulatory networks

One of the urgent problems in computational systems biology is to clarify the genetic regulatory networks (GRN) using high-throughput genomic data. GENIE3 is a machine learning-based algorithm based on regression trees for inferring gene regulatory networks from expression data [28]. Finally, a matrix containing the putative regulatory links will be generated. R packages edgebundleR and visNetwork were used in this section to visualize the regulatory networks.

13. Summarize genes and functions

Because we provide three methods to detect genes: DEGs, WGCNA and DEG Patterns, And three functional enrichment analysis methods are also provided: clusterprofiler ORA, clusterprofiler GSEA and gprofiler2 ORA. This may leave users unsure how to choose a appropriate result. Therefore, we provide a summary module that uses a Venn diagram to visualize the intersection between the three methods, as well as print out tabular details of the results detected simultaneously in the three methods.

13.1 Summarize genes

Before this part, the analysis of DEGs, WGCNA and DEG Patterns should be completed. Then select DEGs grouping, Pattern Gene module and WGCNA module to complete the analysis process.

13.2 Summarize functions

Before this part, the analysis of clusterprofiler ORA, clusterprofiler GSEA and gprofiler2 ORA should be completed. Then select proper data sources (e.g. GO:BP or KEGG) to complete the analysis process.



goushixue/QRseq documentation built on July 9, 2023, 9:28 a.m.