knitr::opts_chunk$set(echo = TRUE, eval = TRUE, message = FALSE, warning = FALSE)
library(singleCellTK) sce <- readRDS("tutorial_pbmc3k_qc.rds")
```{=html}
## Introduction Single Cell Toolkit (singleCellTK, SCTK) is a package that works on single-cell RNA-seq (scRNAseq) dataset. SCTK allows users to import multiple datasets, perform quality control and a series of preprocessing, get clustering on cells and markers of clusters, and run various downstream analysis. Meanwhile, SCTK also wraps curated workflows for [celda](https://www.camplab.net/celda/) and [Seurat](https://satijalab.org/seurat/). This tutorial takes the real-world scRNAseq dataset as an example, which consists of 2,700 Peripheral Blood Mononuclear Cells (PBMCs) collected from a healthy donor, namingly PBMC3K. This dataset is available from 10X Genomics and can be found on the 10X website. Before heading into the next steps, we assume that users have already loaded SCTK, imported and QC'ed the PBMC3K data, following the [Import and QC Tutorial](01_import_and_qc_tutorial.html). ## Normalization {#normalization} After removing cells of low quality, we next need to normalize the feature expression matrix. In this tutorial, we apply a global-scaling normalization method from [scater](https://bioconductor.org/packages/release/bioc/html/scater.html), `"logNormCounts"`. It normalizes the feature expression for each cell by the total expression, multiplies this by a scale factor, and log-transforms the result. For detail about all normalization methods, please refer to the [Normalization Documentation](normalization.html) ```{=html} <!--Remember to change tabset id--> <div class="tabset" id="tabset1"> <div class="tab"> <!--Remember to change tabset id argument of `openTab`--> <button class="tablinks ia" onclick="openTab(event, 'ia', 'tabset1')">Interactive Analysis</button> <button class="tablinks console" onclick="openTab(event, 'console', 'tabset1')">Console Analysis</button> </div> <div class="tabcontent ia">
\
"Scater - LogNormCounts"
from the left panel.```{=html}
SCTK has a generic wrapper function for various types of normalization, called `runNormalization()`. ```{R norm, results='hold'} sce <- runNormalization(sce, useAssay = "counts", outAssayName = "logcounts", normalizationMethod = "logNormCounts")
```{=html}
SCTK also supports a number of batch correction methods. Here we don't present the workflow because PBMC3k dataset only has one batch. For examples, please refer to the [document of batch correction](batch_correction.html). We also recommend performing the scaling method to standardize the variance, and this results in a better PCA. The UI and console function allow users to perform scaling on the expression assay. However, we don't recommend directly apply this to the full-sized expression matrix because the memory usage will be large. To accomplish this, we suggest finding the highly variable genes first in the next section, and only scaling the subset of the expression matrix as instructed. ## Variable Feature Selection Selecting highly variable genes (HVG) for downstream analyses is recommended, since a subset of variable features enhances the biological signal for community detection and reduces amount of data being used which hence speeds up the computation. SCTK wraps the methods used in [Seurat](https://satijalab.org/seurat/) and [Scran](https://bioconductor.org/packages/release/bioc/html/scran.html). We recommend keeping at least 2,000-5,000 HVGs. In this tutorial, we use the "modelGeneVar" method from Scarn, and select the top 2,000 genes. SCTK requires at least two steps to create the HVG subset. The first step performs variability calculation and stores the metrics, and the second step selects top genes basing on the ranking of the metrics. ```{=html} <!--Remember to change tabset id--> <div class="tabset" id="tabset2"> <div class="tab"> <!--Remember to change 2 tabset id arguments of `openTab`--> <button class="tablinks ia" onclick="openTab(event, 'ia', 'tabset2')">Interactive Analysis</button> <button class="tablinks console" onclick="openTab(event, 'console', 'tabset2')">Console Analysis</button> </div> <div class="tabcontent ia">
\
After calculation, the panel for selecting the top features, "2. Select number of variable features" becomes enabled. A scatter plot of features, with axes of variance and mean, will also be shown on the right panel "Plot".
\
After selection, the plot will be again updated, with highlighted red dots representing the selected features, texts labeling a smaller number (adjustable in the dropdown) of features with the top variability among the selected ones. The names of highlighted features are also updated in a text box at the bottom of the plotting region.
Note that:
```{=html}
For calculating the feature variablity, users can use either the wrapper function of each method (`runSeuratFindHVG()`, `runModelGeneVar()`) or using the generic wrapper `runFeatureSelection()` with argument `method` specified. ```{R hvg} sce <- runModelGeneVar(sce, useAssay = "logcounts") sce <- setTopHVG(sce, method = "modelGeneVar", hvgNumber = 2000, featureSubsetName = "hvg2000")
Users can also get the character vector of gene names:
```{R getTopHVG} hvg <- getTopHVG(sce, useFeatureSubset = "hvg2000") print(hvg[1:10])
The variability metrics and the top variable feature selection can also be visualized: ```{R plotTopHVG} plotTopHVG(sce, method = "modelGeneVar", hvgNumber = 2000, labelsCount = 10)
```{=html}
## Dimension Reduction The next step is performing a principal component analysis (PCA), which creates new uncorrelated variables that successively maximize variance and finally increases the data interpretability and at the same time minimizes the information loss. With the HVG selected from the previous step, we will perform PCA using the method from [Scater](http://bioconductor.org/packages/release/bioc/html/scater.html). As mentioned in the [Normalization section](#normalization), we will also have the selected HVG scaled in this step. SCTK also supports Seurat's PCA and ICA method, which do a similar job. Please see: [Seurat Curated Workflow](seurat_curated_workflow.html), and [Dimension Reduction Documentation](dimensionality_reduction.html). ```{=html} <!--Remember to change tabset id--> <div class="tabset" id="tabset3"> <div class="tab"> <!--Remember to change tabset id argument of `openTab`--> <button class="tablinks ia" onclick="openTab(event, 'ia', 'tabset3')">Interactive Analysis</button> <button class="tablinks console" onclick="openTab(event, 'console', 'tabset3')">Console Analysis</button> </div> <div class="tabcontent ia">
\
\
What we actually set for the parameters:
Besides the scatter plot of the top two components for all cells, the switches also add an elbow plot and heatmap for visualization
\
After running, the plots will be updated on the right side. The elbow plot and heatmaps are internally calculated by Seurat functions.
The elbow plot is a heuristic method for determining the dimensionality to be used for downstream analysis. It is a ranking of principle components based on the percentage of variance explained by each one. In this example, we can observe an "elbow" around PC9-10, suggesting that the majority of true signal is captured in the first 10 PCs.
The heatmap allows for easy exploration of the primary sources of heterogeneity in a dataset, and can be useful when trying to decide which PCs to include for further downstream analyses. Both cells and features are ordered according to their PCA scores. Please refer to Seurat's tutorial for more explanation on these two visualization approaches.
```{=html}
Here we use the wrapper function `scaterPCA()`. The name of transformed normalized expression matrix can be passed to `useAssay` (by default `useAssay = "logcounts"`), and the name of the HVG subset should be passed to `useFeatureSubset`. The scaling will be done by Scater by setting `scale = TRUE` (default). The output PCs will be updated in the `reducedDims` slot of the SCE object, named by `"PCA"` by default. ```{R pca} sce <- scaterPCA(sce, useFeatureSubset = "hvg2000", seed = 12345)
Afterwards, users can make scatter plots to visualize the top dimensions that explains the most variance of the original data:
```{R plotPCA} plotDimRed(sce, useReduction = "PCA")
```{=html} </div> </div>
Uniform Manifold Approximation and Projection (UMAP) and t-distributed Stochastic Neighbor Embedding (tSNE) are the most commonly used methods to visualize the relationships between the cells in a 2-D embedding. UMAP gives a better presentation of the global structure and is more efficient, while tSNE performs better in preserving the local structure.
In this tutorial, we calculate a UMAP embedding from the principal components generated in the previous step. We will use only 10 dimensions out of the 50 dimensions generated.
``{=html}
<!--Remember to change tabset id-->
<div class="tabset" id="tabset4">
<div class="tab">
<!--Remember to change tabset id argument of
openTab`-->
## Clustering {#clustering} SCTK supports clustering cells with Shared Nearest Neighbors (SNN) graph based methods and K-Means algorithm. For SNN graph based methods, we wrap the graph constructor from `scran` library, and have options of algorithms from `igraph` that can perform clustering on the graph. Alternatively, in the [Seurat Curated Workflow](seurat_curated_workflow.html), the Seurat's SNN graph construction and community detection method are also wrapped as one method. In this tutorial, we use the scran SNN graph approach, with the community detection algorithm set as Louvain. We will use the top 10 components for identifying the clusters. Though we have generated the UMAP representation in the previous section (also using 10 PCs), users are always allowed to test clustering with different numbers of PCs. In this case, we strongly suggest regenerating the UMAP with the consistent number of PCs, so that the UMAP can best reflect the information included in the PCs. For the scran SNN graph approach, the parameter `k` can always be tweaked as it affects the connectivity of the graph. When `k` is larger, stronger connection will be constructed across the cells, thus larger and fewer clusters will be identified. Parameter `weightType` allows for different shared neighbor weighting schemes. In this tutorial, we use `"jaccard"` as it provides consistency with Seurat methods. ```{=html} <!--Remember to change tabset id--> <div class="tabset" id="tabset5"> <div class="tab"> <!--Remember to change tabset id argument of `openTab`--> <button class="tablinks ia" onclick="openTab(event, 'ia', 'tabset5')">Interactive Analysis</button> <button class="tablinks console" onclick="openTab(event, 'console', 'tabset5')">Console Analysis</button> </div> <div class="tabcontent ia">
\
\
Upon completing clustering, a scatter plot of cells on a low-dimensional representation with cluster labeling colored on each cell will be shown on the right. However, we use the dimensions used for calculation as the default. For the best visualization, we suggest users change the dimensions to the 2D embedding obtained from the same components (of the same PCA matrix) as used for clustering.
```{=html}
With `runScranSNN()` function, the dimension reduction matrix can be specified through argument `useReducedDim`, and the cluster label assignment will be updated in the `colData` slot, and the variable will be named by argument `clusterName`. The default graph clustering algorithm is Louvain, while we also provide options including Leiden, Walktrap and etc. (from `igraph`). ```{R cluster, results="hold"} sce <- runScranSNN(sce, "PCA", clusterName = "cluster", nComp = 10, weightType = "jaccard", k = 14, seed = 12345)
After getting the cluster labeling, we can then visualize it:
```{R plotCluster} plotSCEDimReduceColData(sce, "cluster", "UMAP")
```{=html} </div> </div>
In the other curated workflow that SCTK wraps (Celda Curated Workflow), we use a Bayesian hierarchical model that can bi-cluster features into modules and observations into subpopulations. For detail of this method, please refer to the Celda Documentation.
saveRDS(sce, "tutorial_pbmc3k_clustered.rds")
Users can next perform marker detection process with the cluster labels, or any other types of reasonable grouping of similar biological state of cells stored in the cell metadata. SCTK iteratively finds markers for each cluster by running differential expression (DE) analysis on the cluster against all the other clusters.
Although we selected the variable features and reduced the dimensionality of the original expression matrix, here we will get back to the transformed normalized counts of all features.
``{=html}
<!--Remember to change tabset id-->
<div class="tabset" id="tabset6">
<div class="tab">
<!--Remember to change tabset id argument of
openTab`-->
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.