
Data Visualization

Yu-Jui Ho and Toby Aicher, M Hammell Lab

Following identification of NMF clusters and sample assignments, SAKE provides several options for interactive data visualization. Users can explore their NMF clusters through t-SNE and PCA projection plots. Users can also create standard gene expression heatmaps to, for example, evaluate gene expression patterns across samples in NMF clusters.

t-SNE

t-SNE is a non-linear dimensionality reduction method that assigns each sample a location in a two- or three-dimensional map. Early success in separating single cells of distinct origin has made t-SNE maps a popular choice for displaying single-cell RNA-seq data. The user can filter the genes used for t-SNE with four different ranking metrics: mean expression, median expression, MAD (median absolute deviation), and variance. As with NMF, we recommend using the top 1500 - 3000 MAD genes for bulk RNA-Seq data and the top 5000 - 8000 MAD genes for single-cell RNA-Seq data.
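The MAD-based gene filter described above can be sketched in a few lines of Python. This is only an illustration of the ranking idea, not SAKE's own code: the function names `mad` and `top_mad_genes` and the toy values are hypothetical, and the MAD is left unscaled (some statistics packages multiply it by a constant, which does not change the ranking).

```python
from statistics import median

def mad(values):
    """Median absolute deviation (unscaled) of a sequence of numbers."""
    center = median(values)
    return median(abs(v - center) for v in values)

def top_mad_genes(expression, n_top):
    """Return the n_top gene names with the largest MAD across samples.

    `expression` maps a gene name to its expression values, one per sample.
    """
    ranked = sorted(expression, key=lambda gene: mad(expression[gene]), reverse=True)
    return ranked[:n_top]

# Toy data: 4 genes x 5 samples (hypothetical values, for illustration only).
toy = {
    "GENE_A": [1.0, 1.0, 1.0, 1.0, 1.0],     # constant across samples, MAD = 0
    "GENE_B": [0.0, 5.0, 10.0, 15.0, 20.0],  # broadly variable, MAD = 5
    "GENE_C": [1.0, 3.0, 2.0, 4.0, 2.0],     # mildly variable, MAD = 1
    "GENE_D": [0.0, 0.0, 0.0, 0.0, 100.0],   # one extreme sample, MAD = 0
}
selected = top_mad_genes(toy, 2)
```

In this toy example `GENE_D` has by far the largest variance but a MAD of zero, because a single extreme sample barely moves the median. This is one reason a MAD ranking can be preferable to a variance ranking when a few outlier samples are present.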

Under more options, the user can further modify t-SNE:

As mentioned in the earlier section on NMF, concordance between NMF groups and t-SNE clusters indicates the robustness of both methods for identifying expression clusters in RNA-seq datasets. It is important to use NMF clustering in addition to t-SNE visualization maps because NMF can quantitatively assign data points to clusters that occupy distinct but closely connected t-SNE groupings.

SAKE provides t-SNE plots in both 2-D and 3-D to help users better understand the clustering results.

PCA

Principal component analysis (PCA) is a dimensionality reduction technique that finds inter-related variables within data and reduces them to a smaller set of uncorrelated variables, the principal components, that explain most of the variance in the data. The principal components are ordered by the amount of variance they explain (i.e. the first principal component explains the most variance in the data). The first two or three principal components can be used to visualize data by plotting data points using the principal components as axes.
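To make the ordering of principal components by explained variance concrete, here is a minimal sketch in plain Python. It assumes samples are rows and genes are columns, and it uses power iteration with deflation on the covariance matrix as a simple stand-in for the eigendecomposition or SVD routines that real PCA implementations use. All function names and the toy matrix are hypothetical, and nothing here is taken from SAKE's implementation.

```python
import math

def center_columns(rows):
    """Subtract each column (gene) mean so every gene has mean zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(centered):
    """Sample covariance matrix (genes x genes) of mean-centered rows."""
    n, p = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]

def mat_vec(matrix, vec):
    """Multiply a square matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]

def dominant_eigenpair(matrix, iterations=200):
    """Power iteration: largest eigenvalue and unit eigenvector of a symmetric matrix."""
    p = len(matrix)
    # Arbitrary non-uniform start; it must not be orthogonal to the target vector.
    vec = [float(i + 1) for i in range(p)]
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    for _ in range(iterations):
        nxt = mat_vec(matrix, vec)
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    eigenvalue = sum(a * b for a, b in zip(vec, mat_vec(matrix, vec)))
    return eigenvalue, vec

def pca(rows, n_components):
    """Return sample scores, explained-variance ratios, and component vectors."""
    centered = center_columns(rows)
    cov = covariance(centered)
    total_variance = sum(cov[i][i] for i in range(len(cov)))
    components, ratios = [], []
    for _ in range(n_components):
        value, vec = dominant_eigenpair(cov)
        components.append(vec)
        ratios.append(value / total_variance)
        # Deflation: remove the variance already explained by this component.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(len(vec))]
               for i in range(len(vec))]
    # Project each centered sample onto the components to get plot coordinates.
    scores = [[sum(a * b for a, b in zip(row, comp)) for comp in components]
              for row in centered]
    return scores, ratios, components

# Toy matrix: 5 samples (rows) x 3 genes (columns), hypothetical values.
samples = [
    [2.0, 0.0, 1.0],
    [4.0, 1.0, 1.0],
    [6.0, 1.0, 2.0],
    [8.0, 3.0, 1.0],
    [10.0, 2.0, 2.0],
]
scores, ratios, components = pca(samples, n_components=2)
```

Each row of `scores` gives one sample's coordinates on the first and second principal components, which is what a 2D PCA plot displays. `ratios` holds the fraction of total variance explained by each component, in decreasing order.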

As with t-SNE and NMF, the user has the option to filter the number of genes used to calculate the principal components with four different ranking metrics: mean expression, median expression, MAD, and variance. We recommend using the top 1500 - 3000 MAD genes for bulk RNA-Seq data and the top 5000 - 8000 MAD genes for single-cell RNA-Seq data.

The user can choose which principal components to use as axes to visualize their data. The default is to use the first and second principal components for 2D PCA, and the first, second, and third principal components for 3D PCA.

The user can also designate the size of each sample dot, whether to display its label, the size of the label, and the alpha value (the transparency of each dot).

Heatmap

Heatmaps help with visualizing patterns in gene expression across multiple samples. Each column is a different sample and each row is a different gene.

There are five options for selecting sets of genes to analyze:

An example gene list file should look like this:

| Gene |
| ------------ |
| AHNAK |
| BMP1 |
| CALD1 |
| CAMK2N1 |
| CDH2 |
| COL1A2 |
| COL3A1 |
| COL5A2 |
| FN1 |

The first row should be the character string Gene. The following rows should be the names/IDs of your genes of interest.
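As a quick way to check a gene list before uploading it, the layout described above can be validated with a short script. This sketch assumes a plain-text file with one entry per line (the exact file formats SAKE accepts are not stated here), and `read_gene_list` is a hypothetical helper, not part of SAKE.

```python
import csv
import io

def read_gene_list(handle):
    """Read a one-column gene list whose first row is the header 'Gene'."""
    reader = csv.reader(handle)
    # Keep the first field of every non-empty row, trimming stray whitespace.
    rows = [row[0].strip() for row in reader if row and row[0].strip()]
    if not rows or rows[0] != "Gene":
        raise ValueError("first row must be the header 'Gene'")
    return rows[1:]

# In-memory stand-in for a gene list file, using names from the example above.
example = io.StringIO("Gene\nAHNAK\nBMP1\nCALD1\nCAMK2N1\nCDH2\n")
genes = read_gene_list(example)
```

For a file on disk, the same function can be called as `read_gene_list(open(path, newline=""))`, where `path` is wherever the list was saved.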

An example heatmap using genes from NMF selected features is shown below. The color bar on the top of the heatmap indicates which NMF group each sample is assigned to.

More options

Under more options, the user can change the parameters of the heatmap.

Summary Stats

Several options are available to display summary statistics within NMF-assigned groups. These include transcriptome variance distributions, histograms of the number of expressed genes in each group, and boxplots of mean intra-group correlation coefficients. Each of these distributions is calculated for each individual NMF group in order to assess the level of within- and between-cluster heterogeneity. Groups with high levels of intragroup heterogeneity are more likely to have high transcriptome variance and a low mean correlation coefficient. This may be due to sub-clusters present within a given group. Alternatively, such clusters may represent outlier samples, which could include low-quality samples that often have fewer expressed genes overall relative to other groups.
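Two of these statistics, the mean intra-group correlation and the number of expressed genes, can be illustrated with a small Python sketch. It assumes Pearson correlation between sample profiles and a simple "greater than a threshold" definition of an expressed gene. Both choices, the function names, and the toy profiles are assumptions made for illustration and may differ from what SAKE computes.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles with non-zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def mean_intragroup_correlation(profiles):
    """Average Pearson correlation over all pairs of samples in one group."""
    pairs = list(combinations(profiles, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)

def expressed_gene_count(profile, threshold=0.0):
    """Number of genes whose expression exceeds the threshold in one sample."""
    return sum(1 for value in profile if value > threshold)

# Toy groups: each inner list is one sample's expression across 4 genes.
groups = {
    "group1": [[1, 2, 3, 4], [2, 4, 6, 8], [1.5, 3, 4.5, 6]],  # homogeneous
    "group2": [[1, 2, 3, 4], [4, 3, 2, 1], [0, 5, 0, 5]],      # heterogeneous
}
mean_corr = {name: mean_intragroup_correlation(p) for name, p in groups.items()}
n_expressed = {name: [expressed_gene_count(s) for s in p] for name, p in groups.items()}
```

Here the homogeneous group has a mean correlation close to 1, while the heterogeneous group has a much lower value and contains a sample with only two expressed genes, mirroring the pattern described above for groups with sub-clusters or low-quality samples.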

Based on these criteria for the samples displayed below, NMF group2 and group3 contain samples with higher levels of heterogeneity. A majority of the cells in NMF group2 were identified as deriving from a single cell type in the original authors' publication, whereas cells in NMF group3 were identified as deriving from two distinct cell types (Ting et al., 2014).

Continue to the next section: Differential Expression and Enrichment Analysis



naikai/sake documentation built on Feb. 15, 2023, 11 p.m.