knitr::opts_chunk$set(echo = TRUE, eval = TRUE, message = FALSE, warning = FALSE)
library(singleCellTK)
sce <- readRDS("tutorial_pbmc3k_qc.rds")

```{=html}

## Introduction

Single Cell Toolkit (singleCellTK, SCTK) is a package that works on single-cell RNA-seq (scRNAseq) dataset. SCTK allows users to import multiple datasets, perform quality control and a series of preprocessing, get clustering on cells and markers of clusters, and run various downstream analysis. Meanwhile, SCTK also wraps curated workflows for [celda](https://www.camplab.net/celda/) and [Seurat](https://satijalab.org/seurat/).

This tutorial takes the real-world scRNAseq dataset as an example, which consists of 2,700 Peripheral Blood Mononuclear Cells (PBMCs) collected from a healthy donor, namingly PBMC3K. This dataset is available from 10X Genomics and can be found on the 10X website.

Before heading into the next steps, we assume that users have already loaded SCTK, imported and QC'ed the PBMC3K data, following the [Import and QC Tutorial](01_import_and_qc_tutorial.html).

## Normalization {#normalization}

After removing cells of low quality, we next need to normalize the feature expression matrix. In this tutorial, we apply a global-scaling normalization method from [scater](https://bioconductor.org/packages/release/bioc/html/scater.html), `"logNormCounts"`. It normalizes the feature expression for each cell by the total expression, multiplies this by a scale factor, and log-transforms the result.

For detail about all normalization methods, please refer to the [Normalization Documentation](normalization.html)

```{=html}
<!--Remember to change tabset id-->
<div class="tabset" id="tabset1">
<div class="tab">
  <!--Remember to change tabset id argument of `openTab`-->
  <button class="tablinks ia" onclick="openTab(event, 'ia', 'tabset1')">Interactive Analysis</button>
  <button class="tablinks console" onclick="openTab(event, 'console', 'tabset1')">Console Analysis</button>
</div>

<div class="tabcontent ia">

norm\

  1. Users should enter the section for normalization from the "Normalization & Batch Correction" tab at the top navigation panel.
  2. In the "Normalization" sub-tab, choose the method "Scater - LogNormCounts" from the left panel.
  3. Users can directly press the "Run" button at the bottom right without taking any more action.

```{=html}

SCTK has a generic wrapper function for various types of normalization, called `runNormalization()`.

```{R norm, results='hold'}
sce <- runNormalization(sce, useAssay = "counts", outAssayName = "logcounts",
                        normalizationMethod = "logNormCounts")

```{=html}

SCTK also supports a number of batch correction methods. Here we don't present the workflow because PBMC3k dataset only has one batch. For examples, please refer to the [document of batch correction](batch_correction.html).

We also recommend performing the scaling method to standardize the variance, and this results in a better PCA. The UI and console function allow users to perform scaling on the expression assay. However, we don't recommend directly apply this to the full-sized expression matrix because the memory usage will be large. To accomplish this, we suggest finding the highly variable genes first in the next section, and only scaling the subset of the expression matrix as instructed.

## Variable Feature Selection

Selecting highly variable genes (HVG) for downstream analyses is recommended, since a subset of variable features enhances the biological signal for community detection and reduces amount of data being used which hence speeds up the computation. SCTK wraps the methods used in [Seurat](https://satijalab.org/seurat/) and [Scran](https://bioconductor.org/packages/release/bioc/html/scran.html). We recommend keeping at least 2,000-5,000 HVGs. In this tutorial, we use the "modelGeneVar" method from Scarn, and select the top 2,000 genes.

SCTK requires at least two steps to create the HVG subset. The first step performs variability calculation and stores the metrics, and the second step selects top genes basing on the ranking of the metrics.

```{=html}
<!--Remember to change tabset id-->
<div class="tabset" id="tabset2">
<div class="tab">
  <!--Remember to change 2 tabset id arguments of `openTab`-->
  <button class="tablinks ia" onclick="openTab(event, 'ia', 'tabset2')">Interactive Analysis</button>
  <button class="tablinks console" onclick="openTab(event, 'console', 'tabset2')">Console Analysis</button>
</div>

<div class="tabcontent ia">

hvg1\

  1. Users should enter the feature selection tab from the top navigation bar as instructed in the screenshot. And then the "Feature Selection" sub-tab.
  2. For the first step, in the top-left panel "1. Compute variability metric", select the method via "Select method for calculating variability of features". The assay to be used will be automatically updated on selection, since different method has different recommended assay type.
  3. Click on button "Run" to calculate the variability.

After calculation, the panel for selecting the top features, "2. Select number of variable features" becomes enabled. A scatter plot of features, with axes of variance and mean, will also be shown on the right panel "Plot".

hvg2\

  1. In the bottom-left panel "2. Select number of variable features", users now only need to click on the button "Run" to finish the selection for this tutorial.

After selection, the plot will be again updated, with highlighted red dots representing the selected features, texts labeling a smaller number (adjustable in the dropdown) of features with the top variability among the selected ones. The names of highlighted features are also updated in a text box at the bottom of the plotting region.

Note that:

```{=html}

For calculating the feature variablity, users can use either the wrapper function of each method (`runSeuratFindHVG()`, `runModelGeneVar()`) or using the generic wrapper `runFeatureSelection()` with argument `method` specified.

```{R hvg}
sce <- runModelGeneVar(sce, useAssay = "logcounts")
sce <- setTopHVG(sce, method = "modelGeneVar", hvgNumber = 2000, featureSubsetName = "hvg2000")

Users can also get the character vector of gene names:

```{R getTopHVG} hvg <- getTopHVG(sce, useFeatureSubset = "hvg2000") print(hvg[1:10])

The variability metrics and the top variable feature selection can also be visualized:

```{R plotTopHVG}
plotTopHVG(sce, method = "modelGeneVar", hvgNumber = 2000, labelsCount = 10)

```{=html}

## Dimension Reduction

The next step is performing a principal component analysis (PCA), which creates new uncorrelated variables that successively maximize variance and finally increases the data interpretability and at the same time minimizes the information loss.

With the HVG selected from the previous step, we will perform PCA using the method from [Scater](http://bioconductor.org/packages/release/bioc/html/scater.html). As mentioned in the [Normalization section](#normalization), we will also have the selected HVG scaled in this step.

SCTK also supports Seurat's PCA and ICA method, which do a similar job. Please see: [Seurat Curated Workflow](seurat_curated_workflow.html), and [Dimension Reduction Documentation](dimensionality_reduction.html).

```{=html}
<!--Remember to change tabset id-->
<div class="tabset" id="tabset3">
<div class="tab">
  <!--Remember to change tabset id argument of `openTab`-->
  <button class="tablinks ia" onclick="openTab(event, 'ia', 'tabset3')">Interactive Analysis</button>
  <button class="tablinks console" onclick="openTab(event, 'console', 'tabset3')">Console Analysis</button>
</div>

<div class="tabcontent ia">

pca_entry\

  1. Enter the main tab as instructed in the screenshot above. If users are following the previous steps in this tutorial, then we are already here.
  2. Enter the sub-tab for dimension reduction via "Dimensionality Reduction"

pca_entry\

  1. All of the parameters have been automatically updated for the most commonly seen scenario. Users can directly click on button "Run" for this tutorial.

What we actually set for the parameters:

Besides the scatter plot of the top two components for all cells, the switches also add an elbow plot and heatmap for visualization

pca_entry\

After running, the plots will be updated on the right side. The elbow plot and heatmaps are internally calculated by Seurat functions.

The elbow plot is a heuristic method for determining the dimensionality to be used for downstream analysis. It is a ranking of principle components based on the percentage of variance explained by each one. In this example, we can observe an "elbow" around PC9-10, suggesting that the majority of true signal is captured in the first 10 PCs.

The heatmap allows for easy exploration of the primary sources of heterogeneity in a dataset, and can be useful when trying to decide which PCs to include for further downstream analyses. Both cells and features are ordered according to their PCA scores. Please refer to Seurat's tutorial for more explanation on these two visualization approaches.

```{=html}

Here we use the wrapper function `scaterPCA()`. The name of transformed normalized expression matrix can be passed to `useAssay` (by default `useAssay = "logcounts"`), and the name of the HVG subset should be passed to `useFeatureSubset`. The scaling will be done by Scater by setting `scale = TRUE` (default). The output PCs will be updated in the `reducedDims` slot of the SCE object, named by `"PCA"` by default.

```{R pca}
sce <- scaterPCA(sce, useFeatureSubset = "hvg2000", seed = 12345)

Afterwards, users can make scatter plots to visualize the top dimensions that explains the most variance of the original data:

```{R plotPCA} plotDimRed(sce, useReduction = "PCA")

```{=html}
</div>
</div>

2D Embedding

Uniform Manifold Approximation and Projection (UMAP) and t-distributed Stochastic Neighbor Embedding (tSNE) are the most commonly used methods to visualize the relationships between the cells in a 2-D embedding. UMAP gives a better presentation of the global structure and is more efficient, while tSNE performs better in preserving the local structure.

In this tutorial, we calculate a UMAP embedding from the principal components generated in the previous step. We will use only 10 dimensions out of the 50 dimensions generated.

``{=html} <!--Remember to change tabset id--> <div class="tabset" id="tabset4"> <div class="tab"> <!--Remember to change tabset id argument ofopenTab`-->

wzxhzdk:10 A UMAP can be generated with the function `runUMAP()`, which is a wrapper of `calculateUMAP()` method in the `scater` package. We will now use `useReducedDim` argument to pass the PCA embedding just generated to the function. The result will be saved in `reducedDim` slot. Users can also create the tSNE embedding with `runTSNE()` function, which works in a similar way as `runUMAP()` ```{R UMAP, results="hold"} sce <- runUMAP(sce, initialDims = 10, seed = 12345) wzxhzdk:11 Note that we have another two functions `runQuickUMAP()` and `runQuickTSNE()`, which wrap the two functions mentioned above, respectively, and run directly from raw count matrix with all steps in the workflow automated internally. They can help with fast visualization at the beginning of an analysis. ```{=html}
## Clustering {#clustering}

SCTK supports clustering cells with Shared Nearest Neighbors (SNN) graph based methods and K-Means algorithm. For SNN graph based methods, we wrap the graph constructor from `scran` library, and have options of algorithms from `igraph` that can perform clustering on the graph. Alternatively, in the [Seurat Curated Workflow](seurat_curated_workflow.html), the Seurat's SNN graph construction and community detection method are also wrapped as one method.

In this tutorial, we use the scran SNN graph approach, with the community detection algorithm set as Louvain. We will use the top 10 components for identifying the clusters. Though we have generated the UMAP representation in the previous section (also using 10 PCs), users are always allowed to test clustering with different numbers of PCs. In this case, we strongly suggest regenerating the UMAP with the consistent number of PCs, so that the UMAP can best reflect the information included in the PCs.

For the scran SNN graph approach, the parameter `k` can always be tweaked as it affects the connectivity of the graph. When `k` is larger, stronger connection will be constructed across the cells, thus larger and fewer clusters will be identified. Parameter `weightType` allows for different shared neighbor weighting schemes. In this tutorial, we use `"jaccard"` as it provides consistency with Seurat methods.

```{=html}
<!--Remember to change tabset id-->
<div class="tabset" id="tabset5">
<div class="tab">
  <!--Remember to change tabset id argument of `openTab`-->
  <button class="tablinks ia" onclick="openTab(event, 'ia', 'tabset5')">Interactive Analysis</button>
  <button class="tablinks console" onclick="openTab(event, 'console', 'tabset5')">Console Analysis</button>
</div>

<div class="tabcontent ia">

cluster_entry\

  1. Users should enter the clustering tab via the top navigation bar as instructed in the screenshot above.

cluster_run\

  1. Select the same PCA matrix as we've done in the 2D Embedding section, at the selection input "Select Input Matrix"
  2. Tweak the parameters as we described in the introduction of this section.
  3. Click on button "Run" to start clustering the data.

Upon completing clustering, a scatter plot of cells on a low-dimensional representation with cluster labeling colored on each cell will be shown on the right. However, we use the dimensions used for calculation as the default. For the best visualization, we suggest users change the dimensions to the 2D embedding obtained from the same components (of the same PCA matrix) as used for clustering.

  1. In the dropdown panel of the plotting region, select the UMAP we previously generated from "Use Reduction", and then click on button "Update".

```{=html}

With `runScranSNN()` function, the dimension reduction matrix can be specified through argument `useReducedDim`, and the cluster label assignment will be updated in the `colData` slot, and the variable will be named by argument `clusterName`. The default graph clustering algorithm is Louvain, while we also provide options including Leiden, Walktrap and etc. (from `igraph`).

```{R cluster, results="hold"}
sce <- runScranSNN(sce, "PCA", clusterName = "cluster", nComp = 10,
                   weightType = "jaccard", k = 14, seed = 12345)

After getting the cluster labeling, we can then visualize it:

```{R plotCluster} plotSCEDimReduceColData(sce, "cluster", "UMAP")

```{=html}
</div>
</div>

In the other curated workflow that SCTK wraps (Celda Curated Workflow), we use a Bayesian hierarchical model that can bi-cluster features into modules and observations into subpopulations. For detail of this method, please refer to the Celda Documentation.

saveRDS(sce, "tutorial_pbmc3k_clustered.rds")

Marker Detection

Users can next perform marker detection process with the cluster labels, or any other types of reasonable grouping of similar biological state of cells stored in the cell metadata. SCTK iteratively finds markers for each cluster by running differential expression (DE) analysis on the cluster against all the other clusters.

Although we selected the variable features and reduced the dimensionality of the original expression matrix, here we will get back to the transformed normalized counts of all features.

``{=html} <!--Remember to change tabset id--> <div class="tabset" id="tabset6"> <div class="tab"> <!--Remember to change tabset id argument ofopenTab`-->

wzxhzdk:16 ```{R findMarker, message=FALSE, warning=FALSE, results="hold"} sce <- runFindMarker(sce, useAssay = "logcounts", method = "wilcox", cluster = "cluster") wzxhzdk:17 For visualization, SCTK has a specific heatmap plotting function for detected markers. The heatmap will take the log-normalized feature expression matrix for the values (scaled when plotting), and subset the features to only the top markers of each cluster. For the organization of the heatmap, the function groups the markers by the belonging to the clusters, and groups the cells, obviously, by clusters. ```{R markerHeatmap, results="hold", fig.height= 8, fig.width=9} plotFindMarkerHeatmap(sce, topN = 5, log2fcThreshold = 0.5, fdrThreshold = 0.05, minClustExprPerc = 0.5, maxCtrlExprPerc = 0.5, minMeanExpr = 0, rowLabel = TRUE) wzxhzdk:18 As we can observe from the marker heatmaps above, most of the clusters can be easily differentiated by the top marker genes of each cluster. However, "cluster 1" and "cluster 2"are seemingly having less difference between each other than compared to other cell clusters, in terms of the marker genes presented in the heatmap (top 5 markers for each cluster). To further explore the differential expression between them, we could do a cluster-by-cluster DE analysis in the next section. ## Differential Expression Analysis {#differential-expression-analysis} Beyond finding the markers for each cluster, usually people are also interested in the differential expression between specific clusters and even groups of clusters. SCTK provides a relatively flexible way for conducting such analysis. There are multiple ways to specify the groups of cells to be compared: by either directly specifying indices of cells, or selecting groups of cells with a categorical variable annotated in the cell metadata. However, when running DE analysis, SCTK requires a relatively sophisticated inputs for naming the analyses, including the names of the two groups, and the name of one analysis, though these are not necessary for calling the algorithms alone. By requiring these, the results of the analyses can be stored with human readable identifiers and within one single data container (SCE object), and can be more manageable. See the [DE documentation](differential_expression.html) for the detail. Continued from the previous section, we will now set "cluster 1" and "cluster 2" as the two groups to perform the analysis. ```{=html}
wzxhzdk:19 `runDEAnalysis()` is a generic function for performing differential expression (DE) analysis between two selected groups of cells. This is also used by `runFindMarker()`. By running DE analysis, the function will take the expression assay with all features and perform statistical analysis to discover quantitative changes in expression levels between selected groups. ```{R de, results="hold"} sce <- runDEAnalysis(inSCE = sce, method = "wilcox", useAssay = "logcounts", class = "cluster", classGroup1 = c(1), classGroup2 = c(2), groupName1 = "cluster1", groupName2 = "cluster2", analysisName = "cluster1_VS_2") wzxhzdk:20 ```{=html}
wzxhzdk:21 ![celltype_entry](ui_screenshots/ui_tutorial/celltype_entry.png)\ ![celltype_run](ui_screenshots/ui_tutorial/celltype_run.png)\ 1. Click on the **"Differential Expression & Cell Type Labeling"** tab at the top navigation bar, and then click on **"Cell Type Labeling"** to enter the tab. 2. Though this is optional, users could change the labeling level from "main" to "fine" at selection **"Labeling level"**. This option refers to what cell type annotation in the reference dataset the user dataset should be mapped to. It is also important to check the option **"Feature type"**, which asks for the default feature ID type. 3. Users can click on **"Run"** to start the labeling at this point. Users can visualize the labeling via a UMAP, by going back to the [clustering tab](#clustering). ![celltype_vis](ui_screenshots/ui_tutorial/celltype_vis.png)\ 4. In the right main visualization panel of the clustering tab, choose **"Select from All Present Annotation"**. 5. Select the variable that SingleR just produced. There will be three types of labeling, ending with `first.labels`, `labels` or `pruned.labels`. 6. Click on **"Update"** to show the SingleR labels on the UMAP ```{=html}
wzxhzdk:22 The result of labeling will again be saved to `colData` slot of the SCE object. There will be multiple types of information being stored, including a scoring matrix named by `"SingleR_{ref}_{level}_scores"`, where `{ref}` refers to the reference selected, `{level}` refers to the `level` argument being passed to the function. And three `"SingleR_{ref}_{level}_{typeOfLabels}"` columns, where `{typeOfLabels}` includes `first.labels`, `labels`, `pruned.labels`. For the difference, please refer to `?SingleR::classifySingleR`. We can also plot the labeling on to the UMAP we got previously. ```{R singleRUMAP, message=FALSE, warning=FALSE, fig.width=12} plotSCEDimReduceColData(sce, colorBy = "SingleR_hpca_fine_pruned.labels", reducedDimName = "UMAP") wzxhzdk:23 ## Differential Abundance The goal of the differential abundance analysis is to determine if categorical variables such as phenotype or sample (e.g. treatment or control) are enriched in different cell clusters using the Fisher's exact test (FET). Any categorical variables in `colData(sce)` can be used by passing the variable name to argument `cluster` or `variable`. Categories in the phenotype variable (used for `variable`) can be passed to `case` or `control`. For this example, we try to examine whether there are any [cell clusters](#clustering) enriched in an annotated cell type [using SingleR](#cell-type-labeling). ```{R diffAbundance, fig.width=10} # Returns a vector where TRUE for a type of NK cell isNKCell <- startsWith(colData(sce)$SingleR_hpca_fine_pruned.labels, "NK_cell") # Returns a vector where TRUE for a type of T cell isTCell <- startsWith(colData(sce)$SingleR_hpca_fine_pruned.labels, "T_cell") # Make up a variable with only categories of "NK_cell", "T_cell" and "Other". colData(sce)$cell_type <- ifelse(isNKCell, "NK_cell", "Other") colData(sce)$cell_type[isTCell] <- "T_cell" colData(sce)$cell_type[is.na(colData(sce)$cell_type)] <- "Other" # Run differential abundance test sce <- diffAbundanceFET(sce, cluster = "cluster", variable = "cell_type", case = "NK_cell", control = "T_cell", analysisName = "NK_VS_T") wzxhzdk:24 ```{R getDiffAbundanceShow, echo=FALSE} res <- getDiffAbundanceResults(sce, analysisName = "NK_VS_T") knitr::kable(format(res, digits = 3)) wzxhzdk:25 ## Pathway Analysis For the purpose of gene set enrichment (GSE) analysis, SCTK imports VAM and GSVA as options. Both of them work by loading some user-defined gene sets ([more about loading gene sets](import_genesets.html)), and calculating the enrichment score of each gene set in each cell. Here we will go through calling VAM to inspect the enrichment of the hallmark gene sets available in [MSigDB database](https://www.gsea-msigdb.org/gsea/msigdb/). Please refer to the [document](pathwayAnalysis.html) for the detail of running pathway analysis with VAM or GSVA. SCTK also supports GSEA method from EnrichR, which queries a given gene list and finds the significant overlap against the curated gene sets from multiple gene set libraries. For detail, please refer to [EnrichR Documentation](enrichR.html). ```{=html}
wzxhzdk:26 ```{R VAM} sce <- importGeneSetsFromMSigDB(sce, categoryIDs = "H", species = "Homo sapiens") sce <- runVAM(sce, geneSetCollectionName = "H", useAssay = "logcounts") wzxhzdk:27 Note that available options for argument `resultName` and `geneset` can be printed with getter functions below. ```{R pathwayGetter, eval=FALSE} getPathwayResultNames(sce) getGenesetNamesFromCollection(sce, geneSetCollectionName = "H") wzxhzdk:28 ### Performing DE Analysis on Pathway Analysis Scores {#performing-de-analysis-on-pathway-analysis-scores} SCTK allows users to select the score matrix from pathway analysis to perform differential expression analysis. In this process, each geneset will be treated as the feature, and comparison groups can still be set in the same way as introduced in [DE analysis section](#differential-expression-analysis). ```{=html}
wzxhzdk:29 To run DE analysis with matrix in the `reducedDims` slot, users only need to simply specify the matrix name to argument `useReducedDim` (`useAssay` will be ignored in this case). ```{R pathwayDE} sce <- runDEAnalysis(inSCE = sce, method = "wilcox", useReducedDim = "VAM_H_CDF", class = "cluster", classGroup1 = c(1, 2, 5), classGroup2 = c(3, 6), groupName1 = "T_cells", groupName2 = "monocytes", analysisName = "T_cells_VS_monocytes_VAM_H_CDF", overwrite=T) wzxhzdk:30 ```{=html}
wzxhzdk:31

compbiomed/singleCellTK documentation built on May 8, 2024, 6:58 p.m.