Started on r format(Sys.time(), "%Y-%m-%d %H:%M:%S")

# input for this report: sce
library(clustree)
library(kableExtra)
library(NMF)
library(pheatmap)
library(viridis)
library(tidyverse)
library(cowplot)
library(scran)
library(RColorBrewer)
library(plotly)
library(SingleR)
library(SingleCellExperiment)
library(scater)
library(Seurat)
library(AUCell)
library(HDF5Array)
library(ezRun)
knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE, knitr.table.format = "html")
pvalue_allMarkers <- 0.05
sce <- loadHDF5SummarizedExperiment("sce_h5")
param = metadata(sce)$param
output <- metadata(sce)$output
species <- getSpecies(param$refBuild)
var_height = 1 #This ensures that chunks that use the length of this variable to set the fig size, don't fail.

Analysis results {.tabset}

Quality control

Selected QC metrics

We use several common QC metrics to identify low-quality cells based on their expression profiles. The metrics that were chosen are described below.

  1. The library size is defined as the total sum of counts across all relevant features for each cell. Cells with small library sizes are of low quality as the RNA has been lost at some point during library preparation.
  2. The number of expressed features in each cell is defined as the number of genes with non-zero counts for that cell. Any cell with very few expressed genes is likely to be of poor quality as the diverse transcript population has not been successfully captured.
  3. The proportions of mitochondrial and ribosomal genes per cell. High proportions are indicative of poor-quality cells (Islam et al. 2014; Ilicic et al. 2016), presumably because of the loss of cytoplasmic RNA from perforated cells.


Diagnostic plots

A key assumption here is that the QC metrics are independent of the biological state of each cell. Poor values (e.g., low library sizes, high mitochondrial or ribosomal proportions) are presumed to be driven by technical factors rather than biological processes, meaning that the subsequent removal of cells will not misrepresent the biology in downstream analyses. Major violations of this assumption would potentially result in the loss of cell types that have, say, systematically low RNA content or high numbers of mitochondria. We can check for such violations using some diagnostics plots. In the most ideal case, we would see normal distributions that would justify the thresholds used in outlier detection. A large proportion of cells in another mode suggests that the QC metrics might be correlated with some biological state, potentially leading to the loss of distinct cell types during filtering.

plotColData(sce, x="Batch", y="nCount_RNA") + scale_y_log10() + ggtitle("Number of UMIs")
plotColData(sce, x="Batch", y="nFeature_RNA") + scale_y_log10() + ggtitle("Detected genes")
#plotColData(sce, x="Batch", y="percent_mito") + ggtitle("Mito percent")
#plotColData(sce, x="Batch", y="percent_ribo") + ggtitle("Ribo percent")

It is also worth to plot the proportion of mitochondrial counts against the library size for each cell. The aim is to confirm that there are no cells with both large total counts and large mitochondrial counts, to ensure that we are not inadvertently removing high-quality cells that happen to be highly metabolically active.

#plotColData(sce, x="nCount_RNA", y="percent_mito", colour_by="discard")


Dimensionality reduction

Dimensionality reduction aims to reduce the number of separate dimensions in the data. This is possible because different genes are correlated if they are affected by the same biological process. Thus, we do not need to store separate information for individual genes, but can instead compress multiple features into a single dimension. This reduces computational work in downstream analyses, as calculations only need to be performed for a few dimensions rather than thousands of genes; reduces noise by averaging across multiple genes to obtain a more precise representation of the patterns in the data, and enables effective plotting of the data.


Transforming the data and feature selection

We used the SCtransform method from the Seurat package for normalizing, estimating the variance of the raw filtered data, and identifying the most variable genes. By default, SCtransform accounts for cellular sequencing depth, or nUMIs.
r if(ezIsSpecified(param$SCT.regress.CellCycle) && param$SCT.regress.CellCycle) { "We already checked cell cycle and decided that it does represent a major source of variation in our data, and this may influence clustering. Therefore, we regressed out variation due to cell cycle." } As a result, SCTransform ranked the genes by residual variance and returned the 3000 most variant genes.


Principal component analysis

Next, we perform PCA on the scaled data. By default, only the previously determined variable features are used as input but can be defined using the pcGenes argument if you wish to choose a different subset. Seurat clusters cells based on their PCA scores. The top principal components, therefore, represent a robust compression of the dataset. The numbers of PCs that should be retained for downstream analyses typically range from 10 to 50. However, identifying the true dimensionality of a dataset can be challenging, that's why we recommend considering the ‘Elbow plot’ approach. a ranking of principal components based on the percentage of variance explained by each one. The assumption is that each of the top PCs capturing biological signal should explain much more variance than the remaining PCs. Thus, there should be a sharp drop in the percentage of variance explained when we move past the last “biological” PC. This manifests as an elbow in the scree plot, the location of which serves as a natural choice for a number of PCs.

#plot(metadata(sce)$PCA_stdev, xlab="PC", ylab="Standard Deviation", pch=16)


Visualization

Another application of dimensionality reduction is to compress the data into 2 (sometimes 3) dimensions for plotting. The simplest visualization approach is to plot the top 2 PCs in a PCA.

plotReducedDim(sce, dimred="PCA", colour_by="cellType")

The problem is that PCA is a linear technique, i.e., only variation along a line in high-dimensional space is captured by each PC. As such, it cannot efficiently pack differences in many dimensions into the first 2 PCs. The de facto standard for visualization of scRNA-seq data is the t-stochastic neighbor embedding (t-SNE) method (Van der Maaten and Hinton 2008). Unlike PCA, it is not restricted to linear transformations, nor is it obliged to accurately represent distances between distance populations. This means that it has much more freedom in how it arranges cells in low-dimensional space, enabling it to separate many distinct clusters in a complex population. The uniform manifold approximation and projection (UMAP) method (McInnes, Healy, and Melville 2018) is an alternative to t-SNE for non-linear dimensionality reduction. Compared to t-SNE, UMAP visualization tends to have more compact visual clusters with more empty space between them. It also attempts to preserve more of the global structure than t-SNE. It is arguable whether the UMAP or t-SNE visualizations are more useful or aesthetically pleasing. UMAP aims to preserve more global structure but this necessarily reduces resolution within each visual cluster. However, UMAP is unarguably much faster, and for that reason alone, it is increasingly displacing t-SNE as the method of choice for visualizing large scRNA-seq data sets. We will use both UMAP and TSNE to visualize the clustering output. Another application of dimensionality reduction is to compress the data into 2 (sometimes 3) dimensions for plotting. The simplest visualization approach is to plot the top 2 PCs in a PCA.



Clustering

In order to find clusters of cells we first built a graph called K-nearest neighbor (KNN), where each node is a cell that is connected to its nearest neighbors in the high-dimensional space. Edges are weighted based on the similarity between the cells involved, with higher weight given to cells that are more closely related. This step takes as input the previously defined dimensionality of the dataset (first r param$npcs PCs). We then applied algorithms to identify “communities” of cells that are more connected to cells in the same community than they are to cells of different communities. Each community represents a cluster that we can use for downstream interpretation.

We can visualize the distribution of clusters in the TSNE and UMAP plots. However, we should not perform downstream analyses directly on their coordinates. These plots are most useful for checking whether two clusters are actually neighboring subclusters or whether a cluster can be split into further subclusters.

plotReducedDim(sce, dimred="TSNE", colour_by="cellType", text_by="cellType", 
               text_size=5, add_legend=FALSE)
plotReducedDim(sce, dimred="UMAP", colour_by="cellType", text_by="cellType", 
               text_size=5, add_legend=FALSE)



The number of cells in each cluster and sample is represented in this barplot.


cellIdents_perSample <- as.data.frame(colData(sce)[,c('cellType', 'orig.ident')])
barplot = ggplot(data=cellIdents_perSample, aes(x=cellType, fill=orig.ident)) + geom_bar(stat="Count")
barplot + labs(x="", y = "Number of cells", fill = "Sample") + theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1))

cells_prop = cellsProportion(sce, groupVar1 = "ident", groupVar2 = "Batch")
kable(cells_prop,row.names=FALSE, format="html",caption="Cell proportions") %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "float_right")

Cluster assessment

Segregation of clusters by various sources of uninteresting variation.

Once we have created the clusters we need to asses if the clustering was driven by technical artifacts or uninteresting biological variability, such as cell cycle, mitochondrial gene expression. We can explore whether the cells cluster by the different cell cycle phases. In such a case, we would have clusters where most of the cells would be in one specific phase. This bias could be taken into account when normalizing and transforming the data prior to clustering. We can also look at the total number of reads, genes detected and mitochondrial gene expression. The clusters should be more or less even but if we observe big differences among some of them for these metrics, we will keep an eye on them and see if the cell types we identify later can explain the differences.

sce$logRNA = log10(sce$nCount_RNA)
sce$logGenes = log10(sce$nFeature_RNA)

if (!is.null(sce$CellCycle)){
  p1 = plotColData(sce, x="ident", y="CellCycle", colour_by="ident") + ggtitle("Cell cycle vs ident")
  p2 = plotReducedDim(sce, dimred="UMAP", colour_by="CellCycle", text_by="cellType", text_size=5, add_legend=TRUE)
  print(p1)
  print(p2)
}
plotColData(sce, x="ident", y="nCount_RNA", colour_by="ident", add_legend=FALSE) + ggtitle("Number of UMIs vs ident")
plotReducedDim(sce, dimred="UMAP", colour_by="logRNA", text_by="cellType", text_size=5)
plotColData(sce, x="ident", y="nFeature_RNA", colour_by="ident", add_legend=FALSE) + ggtitle("Detected genes vs ident")
plotReducedDim(sce, dimred="UMAP", colour_by="logGenes", text_by="cellType", text_size=5)
# plotColData(sce, x="ident", y="percent_mito", colour_by="ident", add_legend=FALSE) + ggtitle("Mito percentage vs ident")
# plotReducedDim(sce, dimred="UMAP", colour_by="percent_mito", text_by="cellType", text_size=5)
# plotColData(sce, x="ident", y="percent_ribo", colour_by="ident", add_legend=FALSE) + ggtitle("Ribo percentage vs ident")
# plotReducedDim(sce, dimred="UMAP", colour_by="percent_ribo", text_by="cellType", text_size=5)
if (ezIsSpecified(param$controlSeqs)) {
  genesToPlot <- gsub("_", "-", param$controlSeqs)
  genesToPlot <- intersect(genesToPlot, rownames(sce))
  plotExpression(sce, genesToPlot, x="cellType", colour_by = "cellType", add_legend=FALSE)
  #plotReducedDim(sce, dimred="UMAP", colour_by=genesToPlot[1], text_by="cellType", text_size=5)
}

Cluster markers

cat("We found positive markers that defined clusters compared to all other cells via differential expression. The test we used was the Wilcoxon Rank Sum test. Genes with an average, at least 0.25-fold difference (log-scale) between the cells in the tested cluster and the rest of the cells and an adjusted p-value < 0.05 were declared as significant.")
cat("We found positive markers that defined clusters compared to all other cells via differential expression using a logistic regression test and including in the model the cell cycle as the batch effect. Genes with an average, at least 0.25-fold difference (log-scale) between the cells in the tested cluster and the rest of the cells and an adjusted p-value < 0.05 were declared as significant.")

Expression differences of cluster marker genes

posMarkers = read_tsv("pos_markers.tsv")
posMarkers$cluster = as.factor(posMarkers$cluster)
posMarkers$gene = as.factor(posMarkers$gene)
ezInteractiveTableRmd(posMarkers, digits=4)

Marker plots

Here, we use a heatmap and a dotplot to visualize simultaneously the top 5 markers in each cluster. Be aware that some genes may be in the top markers for different clusters.


top5 <- posMarkers %>% group_by(cluster) 
top5 <- slice_max(top5, n = 5, order_by = avg_log2FC)
genesToPlot <- c(gsub("_", "-", param$controlSeqs), unique(as.character(top5$gene)))
genesToPlot <- intersect(genesToPlot, rownames(sce))

plotHeatmap(sce, features=unique(genesToPlot), center=TRUE, zlim=c(-2, 2),
            colour_columns_by = c("cellType"), 
            order_columns_by = c("cellType"),
            cluster_rows=TRUE,
            fontsize_row=8)

plotDots(sce, features=genesToPlot, group="cellType") + theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1)) +
  labs(x="", y="")

Cell annotation

The most challenging task in scRNA-seq data analysis is the interpretation of the results. Once we have obtained a set of clusters we want to determine what biological state is represented by each of them. To do this, we have implemmented 3 different approaches which are explained below. However, these methods should be complemented with the biological knowledge from the researcher.


Using cluster markers

This approach consists in performing a gene set enrichment analysis on the marker genes defining each cluster. This identifies the pathways and processes that are (relatively) active in each cluster based on the upregulation of the associated genes compared to other clusters. For this, we use the tool Enrichr.

genesPerCluster <- split(posMarkers$gene, posMarkers$cluster)
jsCall = paste0('enrich({list: "', sapply(genesPerCluster, paste, collapse="\\n"), '", popup: true});')
enrichrCalls <- paste0("<a href='javascript:void(0)' onClick='", jsCall, 
                         "'>Analyse at Enrichr website</a>")
enrichrTable <- tibble(Cluster=names(genesPerCluster),
                         "# of posMarkers"=lengths(genesPerCluster),
                         "Enrichr link"=enrichrCalls)
kable(enrichrTable, format="html", escape=FALSE,
        caption=paste0("GeneSet enrichment: genes with pvalue ", pvalue_allMarkers)) %>%
kable_styling("striped", full_width = F, position = "left")


Interactive explorer


The iSEE (Interactive SummarizedExperiment Explorer) explorer provides a general visual interface for exploring single cell data. iSEE allows users to simultaneously visualize multiple aspects of a given data set, including experimental data, metadata, and analysis results. Dynamic linking and point selection facilitate the flexible exploration of interactions between different data aspects. [Rue-Albrecht K, Marini F, Soneson C, Lun ATL (2018). “iSEE: Interactive SummarizedExperiment Explorer.” F1000Research, 7, 741. doi: 10.12688/f1000research.14966.1.]

The iSEE shiny app can be accessed through this link iSEE explorer{target="_blank"}

Data availability

Mean expression of every gene across the cells in each cluster

geneMeans

Positive markers of each cluster

posMarkers

The final Single Cell Experiment Object is here

Parameters

param[c( "DE.method")]

SessionInfo

ezSessionInfo()


uzh/ezRun documentation built on May 4, 2024, 3:23 p.m.