In uzh/ezRun: An R meta-package for the analysis of Next Generation Sequencing Data

Started on r format(Sys.time(), "%Y-%m-%d %H:%M:%S")

library(SummarizedExperiment)
library(ggridges)
library(cowplot)
library(kableExtra)
library(pheatmap)
library(scater)
library(RColorBrewer)
library(plotly)
library(tidyverse)
knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE, knitr.table.format = "html")

#to test
# library(HDF5Array)
# library(SingleR)
# library(AUCell)
# library(ezRun, lib.loc = "~/myRpackages")
# sce <- loadHDF5SummarizedExperiment("sce_h5")
# param = metadata(sce)$param
# pvalue_allMarkers <- 0.05

output <- metadata(sce)$output

jsFile = system.file("extdata/enrichr.js", package="ezRun", mustWork=TRUE)
invisible(file.copy(from=jsFile, to=basename(jsFile), overwrite=TRUE))
cat(paste0("<SCRIPT language=\"JavaScript\" SRC=\"", basename(jsFile), "\"></SCRIPT>"))

Analysis results {.tabset}

Unwanted sources of variation

The goal of the clustering analysis is to keep the major sources of variation in the dataset that should define the cell types, while restricting the variation due to uninteresting sources (sequencing depth, cell cycle differences, mitochondrial expression, batch effects, etc.). A common practice is to regress out these covariates prior to downstream analyses. Here, we will only asses cell cycle and batch effects as the main potential sources of variation since the method we will later use for normalization and variance stabilization removes the variation due to sequencing depth (total nUMIs per cell). While regressing out the other sources (i.e. mitochondrial abundances) has proved to have a negative impact, leading to increased correlation with covariates and decreased clustering accuracy (Pierre-Luc Germain, Anthony Sonrel, Mark D. Robinson: pipeComp, a general framework for the evaluation of computational pipelines, reveals performant single-cell RNA-seq preprocessing tools. bioRxiv doi: https://doi.org/10.1101/2020.02.02.930578).

1. Cell cycle

In most applications, the cell cycle is a minor factor of variation, secondary to differences between cell types. Any attempt at removal would also need to assume that the cell cycle effect is orthogonal to other biological processes. For example, regression would potentially remove interesting signal if cell cycle activity varied across clusters or conditions, with a prime example being the increased proliferation of activated T cells (Richard, A. C., A. T. L. Lun, W. W. Y. Lau, B. Gottgens, J. C. Marioni, and G. M. Griffiths. 2018. “T cell cytolytic capacity is independent of initial stimulation strength.” Nat. Immunol. 19 (8): 849–58.). We suggest only performing cell cycle adjustment on an as-needed basis in populations with clear cell cycle effects. A TSNE can help us to determine whether cell cycle is a major source of variation in our dataset.

plotReducedDim(sce, dimred="TSNE_NOCORRECTED", colour_by="CellCycle", label_format = c("TSNE_1","TSNE_2"))

2. Batch effects

On the other hand, we always look at our cells before deciding whether we need to perform integration. Te main goal of dataset integration is to identify shared cell states that are present across different datasets, in case they were collected from different individuals, experimental conditions, technologies, or even species. Large single-cell RNA sequencing projects usually need to generate data across multiple batches due to logistical constraints. This results in systematic differences in the observed expression in cells from different batches, which we refer to as “batch effects”. Batch effects are problematic as they can be major drivers of heterogeneity in the data, masking the relevant biological differences and complicating interpretation of the results. The TSNEs and the barplot below can help us to visualize if the clusters are balanced and composed by cells from the different batches. If we see clusters that are comprised of cells from a single batch, this indicates that cells of the same type are artificially separated due to technical differences between batches. In this case, we may also consider that there are cell types that are unique to each batch. If a cluster only contains cells from a single batch, one can always debate whether that is caused by technical differences or if there is truly a batch-specific subpopulation.

plotReducedDim(sce, dimred =  "TSNE_NOCORRECTED", colour_by = "Plate", label_format = c("TSNE_1","TSNE_2"))

cellIdents_perSample = data.frame(colData(sce)[,c("ident_noCorrected", "Plate")])
barplot = ggplot(data=cellIdents_perSample, aes(x=cellIdents_perSample[,1], fill=Plate)) + geom_bar(stat="Count")
barplot + labs(x="Cluster", y = "Number of cells", fill = "Plate")

Clustering

cat("We started by merging all the samples in one dataset and then used the SCtransform method from the Seurat package for normalizing, estimating the variance of the raw filtered data, and identifying the most variable genes. By default, SCtransform accounts for cellular sequencing depth, or nUMIs.")
if(identical(param$SCT.regress,"CellCycle")) {
  cat("We already checked cell cycle and decided that it does represent a major source of variation in our data, and this may influence clustering. Therefore, we regressed out variation due to cell cycle")
}
cat("As a result, SCTransform ranked the genes by residual variance and returned the 3000 most variant genes. Next, we performed PCA on the scaled data using the previously determined variable features. Taking as a distance metric the previously identified PCs, the cells clusters were then identified using a graph-based clustering approach where the cells are embedded in a graph structure - for example a K-nearest neighbor (KNN) graph, with edges drawn between cells with similar feature expression patterns, and then attempt to partition this graph into highly interconnected ‘communities’. The resolution is an important argument that sets the \"granularity\" of the downstream clustering and will need to be optimized for every individual experiment. Increased resolution values lead to a greater number of clusters, which is often required for larger datasets.\n")
cat("\n")
cat("The UMAPs below place similar cells together in low-dimensional space. The first UMAP represents cells according to the condition and the second one shows the graph-based common clusters that were found among the datasets.\n")

cat("After inspecting the datasets and observing that cells clustered by sample we decided to integrate samples using shared highly variable genes. Oftentimes, when clustering cells from multiple conditions there are condition-specific clusters and integration can help ensure the same cell types cluster together.\n We started by normalizing, estimating the variance of the raw filtered data, and identifying the most variable genes in each sample separately. For this, we used the SCtransform method from the Seurat package which accounts for cellular sequencing depth, or nUMIs by default.")
cat("\n")
if(identical(param$SCT.regress,"CellCycle")) {
  cat("We already checked cell cycle and decided that it does represent a major source of variation in our data, and this may influence clustering. Therefore, we regressed out variation due to cell cycle")
}
cat("\n")
cat("To integrate, we applied the following steps:")
cat("\n\n")
cat("**1. Perform canonical correlation analysis (CCA):** CCA identifies shared sources of variation between the conditions/groups. It is a form of PCA, in that it identifies the greatest sources of variation in the data, but only if it is shared or conserved across the conditions/groups (using the 3000 most variant genes from each sample).")
cat("\n\n")
cat("**2. Identify anchors or mutual nearest neighbors (MNNs) across datasets (sometimes incorrect anchors are identified):** MNNs are like good friends. For each cell in one sample, the cell's closest neighbor in the other sample is identified based on gene expression values as it's best neighbor.The reciprical analysis is performed, and if the two cells are 'best friends' in both directions, then those cells will be marked as anchors to 'anchor' the two datasets together.")
cat("\n\n")
cat("**3. Filter anchors to remove incorrect anchors:** Assess the similarity between anchor pairs by the overlap in their local neighborhoods (incorrect anchors will have low scores)")
cat("\n\n")
cat("**4. Integrate the conditions/datasets:** Use anchors and corresponding scores to transform the cell expression values, allowing for the integration of the different samples.")
cat("\n\n")
cat("If cell types are present in one dataset, but not the other, then the cells will still appear as a separate sample-specific cluster.")
cat("\n\n")
cat("Finally, the cells clusters were identified using a graph-based clustering approach where the cells are embedded in a graph structure - for example a K-nearest neighbor (KNN) graph, with edges drawn between cells with similar feature expression patterns, and then attempt to partition this graph into highly interconnected ‘communities’. The resolution is an important argument that sets the \"granularity\" of the downstream clustering and will need to be optimized for every individual experiment. Increased resolution values lead to a greater number of clusters, which is often required for larger datasets.")
cat("\n")
cat("\n")
cat("The TSNEs below place similar cells together in low-dimensional space after the integration of the datasets. The first TSNE represents cells according to the condition and the second TSNE shows the graph-based common clusters that were found among the datasets. The presence of visual clusters containing cells from the different datasets provides a comforting illusion that the integration was successful.")

plotReducedDim(sce, dimred="TSNE", colour_by="Condition")
plotReducedDim(sce, dimred="TSNE", colour_by="ident", text_by="ident", text_size=5)

The number of cells in each cluster and condition is represented on this barplot.

cellIdents_perSample = data.frame(colData(sce)[,c("ident", "Condition")])
barplot = ggplot(data=cellIdents_perSample, aes(x=cellIdents_perSample[,1], fill=Condition)) + geom_bar(stat="Count")
barplot + labs(x="Cluster", y = "Number of cells", fill = "Condition")

cells_prop = cellsProportion(sce, groupVar1 = "ident", groupVar2="Condition")
kable(cells_prop,row.names=FALSE, format="html",caption="Cell proportions") %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "float_right")

Cluster markers

posMarkers = read_tsv("pos_markers.tsv")

cat("We found positive markers that defined clusters compared to all other cells via differential expression. The test we used was the Wilcoxon Rank Sum test. Genes with an average, at least 0.25-fold difference (log-scale) between the cells in the tested cluster and the rest of the cells and an adjusted p-value < 0.05 were declared as significant.")

cat(paste0("We found positive markers that defined clusters compared to all other cells via differential expression using a logistic regression test and including in the model the ", param$DE.regress, " as the batch effect. Genes with an average, at least 0.25-fold difference (log-scale) between the cells in the tested cluster and the rest of the cells and an adjusted p-value < 0.05 were declared as significant."))

caption ="Expression differences of cluster marker genes"
ezInteractiveTableRmd(posMarkers, digits=4, title=caption)

Markers plots

Here, we use a heatmap and a dotplot to visualize simultaneously the top 5 markers in each cluster. Be aware that some genes may be in the top markers for different clusters.

top5 <- posMarkers %>% group_by(cluster) %>% top_n(n = 5, wt = diff_pct)

tr_cnts <- expm1(logcounts(sce))
geneMeans <- rowsum(DelayedArray::t(tr_cnts), group=sce$seurat_clusters)
geneMeans <- sweep(geneMeans, 1, STATS=table(sce$seurat_clusters)[rownames(geneMeans)], FUN="/")
geneMeans <- log1p(t(geneMeans))
colnames(geneMeans) <- paste("cluster", colnames(geneMeans), sep="_")
pheatmap(geneMeans[unique(top5$gene), ],fontsize_row = 8)

plotDots(sce, features=top5$gene, group="seurat_clusters")

Conserved markers

Identify cell type marker genes that are conserved across conditions. Differential gene expression tests are performed for each group and then, the p-values are combined using meta-analysis methods.

conservedMarkers <- read_tsv("conserved_markers.tsv")
# p_val_adj_fields <- paste0(unique(sce$Condition), "_p_val_adj")
# conservedMarkers <- conservedMarkers[conservedMarkers[,p_val_adj_fields[1]] < 0.05 & conservedMarkers[,p_val_adj_fields[2]] < 0.05,]

caption ="Conserved cell type markers"
ezInteractiveTableRmd(conservedMarkers, title=caption)

Conserved markers plots

Here, we use a heatmap and a dotplot to visualize simultaneously the top 5 conserved markers in each cluster (with highest average fold change across the two groups). Be aware that some genes may be in the top markers for different clusters.

# pct_cols_con1 = paste0(unique(sce$Condition)[1],c("_pct.1", "_pct.2"))
# pct_cols_con2 = paste0(unique(sce$Condition)[1],c("_pct.1", "_pct.2"))
# diff_pct_cond1 = abs(conservedMarkers[, pct_cols_con1[1]]-conservedMarkers[, pct_cols_con1[2]])
# diff_pct_cond2 = abs(conservedMarkers[, pct_cols_con2[1]]-conservedMarkers[, pct_cols_con2[2]])
# conservedMarkers <- conservedMarkers[order(diff_pct_cond1, diff_pct_cond2, decreasing = TRUE),]
# top5 <- conservedMarkers %>% group_by(cluster) %>% slice_head(n = 5)

top5 <- conservedMarkers %>% group_by(cluster) %>% top_n(n = 5, wt = avg_avg_fc)

pheatmap(geneMeans[unique(top5$gene), ],fontsize_row = 8)
plotDots(sce, features=top5$gene, group="seurat_clusters")

Differential expressed genes

After identifying common cell types across conditions, we can look for genes that change in different conditions for cells of the same type. The comparison made was r paste(unique(sce$Condition), collapse = " vs ")

diffGenes <- read_tsv("differential_genes.tsv")
diffGenes <- diffGenes[diffGenes$p_val_adj<0.05,]
caption ="Differential expressed genes per cluster"
ezInteractiveTableRmd(diffGenes, title=caption)

Differential genes plots

The 50 dysregulated genes with the highest difference in expression between the two conditions are represented in the heatmap and the violin plots. The heatmap shows global differences of these genes across the conditions, while the violin plots show cluster specific changes across conditions.

most_diffGenes <- head(unique(diffGenes$gene), 50)
plotHeatmap(sce, order_columns_by = "Condition", features = most_diffGenes, fontsize=8)

# for (gene in most_diffGenes) {
#   print(plotExpression(sce, gene, x="seurat_clusters", colour_by="Condition"))
# }
plotExpression(sce, most_diffGenes, x="seurat_clusters", colour_by="Condition", ncol=3)

# for(cluster in unique(diffGenes$cluster)) {
#  diffGenes_cluster = diffGenes[diffGenes$cluster == cluster,]
#  unique_dges = setdiff(diffGenes_cluster$gene, diffGenes[!grepl(paste0("^",cluster, "$"), diffGenes$cluster),]$gene)
#  if(length(unique_dges) == 0)
#    next
# 
#  cat("\n")
#  cat(paste0("#### Unique DEGs in cluster "), cluster, "\n")
#  print(plotExpression(logNormCounts(summed.filt), features=unique_dges, x="Condition", colour_by="Condition",other_fields="ident") + facet_wrap(~ident))
#   cat("\n\n")
# }

Cells annotation

The most challenging task in scRNA-seq data analysis is the interpretation of the results. Once we have obtained a set of clusters we want to determine what biological state is represented by each of them. To do this, we have implemmented 2 different approaches which are explained below. However, these methods should be complemented with the biological knowledge from the researcher.

Using cluster markers

This approach consists in performing a gene set enrichment analysis on the marker genes defining each cluster. This identifies the pathways and processes that are (relatively) active in each cluster based on the upregulation of the associated genes compared to other clusters. For this, we use the tool Enrichr.

genesPerCluster <- split(posMarkers$gene, posMarkers$cluster)
jsCall = paste0('enrich({list: "', sapply(genesPerCluster, paste, collapse="\\n"), '", popup: true});')
enrichrCalls <- paste0("<a href='javascript:void(0)' onClick='", jsCall, 
                         "'>Analyse at Enrichr website</a>")
enrichrTable <- tibble(Cluster=names(genesPerCluster),
                         "# of posMarkers"=lengths(genesPerCluster),
                         "Enrichr link"=enrichrCalls)
kable(enrichrTable, format="html", escape=FALSE,
        caption=paste0("GeneSet enrichment: genes with pvalue ", pvalue_allMarkers)) %>%
kable_styling("striped", full_width = F, position = "left")

Using reference data

Another strategy for annotation is to compare the single-cell expression profiles with previously annotated reference datasets. Labels can then be assigned to each cell in our uncharacterized test dataset based on the most similar reference sample(s). This annotation can be performed on single cells or instead, it may be aggregated into cluster-level profiles prior to annotation. To do this, we use the SingleR method (Aran et al. 2019) for cell type annotation. This method assigns labels to cells based on the reference samples with the highest Spearman rank correlations and thus can be considered a rank-based variant of k-nearest-neighbor classification. To reduce noise, SingleR identifies marker genes between pairs of labels and computes the correlation using only those markers. It also performs a fine-tuning step for each cell where the calculation of the correlations is repeated with just the marker genes for the top-scoring labels. This aims to resolve any ambiguity between those labels by removing noise from irrelevant markers for other labels. SingleR contains several built-in reference datasets, mostly assembled from bulk RNA-seq or microarray data of sorted cell types. These built-in references are often good enough for most applications, provided that they contain the cell types that are expected in the test population.

singler.results.single <- metadata(sce)$singler.results$singler.results.single
singler.results.cluster <- metadata(sce)$singler.results$singler.results.cluster
singler.single.labels <- singler.results.single$labels
singler.cluster.labels<- singler.results.cluster$labels[match(colData(sce)[,"ident"], rownames(singler.results.cluster))]
all.markers <- metadata(singler.results.cluster)$de.genes

cat("The two heatmaps below display the scores for all individual cells (left) and each original cluster (right) across all reference labels, which allows users to inspect the confidence of the predicted labels across the dataset. The bar Labels on the top shows the actual assigned label.\n")
cat("\n")
#while (dev.cur()>1) dev.off()
plotScoreHeatmap(singler.results.single)
plotScoreHeatmap(singler.results.cluster, clusters=rownames(singler.results.cluster))

cat("The single cell (top) and cluster (bottom) annotations are also shown on the TSNEs. Place the mouse over the cells to get information such as their TSNE coordinates, original cluster, the cells name and the label assigned by SingleR. You can also zoom in specific areas of the TSNE by dragging and drop with the mouse.\n")

cellInfo <- tibble(Cells=colnames(sce), Cluster=colData(sce)[,"ident"],
                     SingleR.labels.cluster=singler.cluster.labels, SingleR.labels.single=singler.single.labels)  %>%
    left_join(as_tibble(reducedDim(sce, "TSNE"), rownames="Cells"))

nrOfLabels_cluster <- length(unique(cellInfo$SingleR.labels.cluster))
nrOfLabels_single <- length(unique(cellInfo$SingleR.labels.single))

if(nrOfLabels_single <= 9){
  colsLabels <- brewer.pal(nrOfLabels_single, "Set1")
}else{
  colsLabels <- colorRampPalette(brewer.pal(9, "Set1"))(nrOfLabels_single)
}

x <- list(title="tSNE_1", zeroline=FALSE)
y <- list(title="tSNE_1", zeroline=FALSE)

p1 <- plot_ly(cellInfo, x = ~tSNE_1, y = ~tSNE_2, color=~SingleR.labels.single,
        text = ~paste("Cluster: ", Cluster, 
                      "\nCell: ", Cells,
                      "\nSingleR.labels.cluster: ", SingleR.labels.single),
        type = 'scatter', mode = "markers", marker=list(size=5, opacity=0.5),
        colors=colsLabels) %>%layout(xaxis=x, yaxis=y)
p1
p2 <- plot_ly(cellInfo, x = ~tSNE_1, y = ~tSNE_2, color=~SingleR.labels.cluster,
        text = ~paste("Cluster: ", Cluster, 
                      "\nCell: ", Cells,
                      "\nSingleR.labels.cluster: ", SingleR.labels.cluster),
        type = 'scatter', mode = "markers", marker=list(size=5, opacity=0.5),
        colors=colsLabels) %>%layout(xaxis=x, yaxis=y)

p2

Interactive explorer

The iSEE (Interactive SummarizedExperiment Explorer) explorer provides a general visual interface for exploring single cell data. iSEE allows users to simultaneously visualize multiple aspects of a given data set, including experimental data, metadata, and analysis results. Dynamic linking and point selection facilitate the flexible exploration of interactions between different data aspects. [Rue-Albrecht K, Marini F, Soneson C, Lun ATL (2018). “iSEE: Interactive SummarizedExperiment Explorer.” F1000Research, 7, 741. doi: 10.12688/f1000research.14966.1.]

The iSEE shiny app can be accessed through this link iSEE explorer{target="_blank"}

Data availability

The final Single Cell Experiment Object is here

Parameters

param[c("npcs", "resolution", "batchCorrection")]

SessionInfo

sessionInfo()

uzh/ezRun documentation built on May 4, 2024, 3:23 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

uzh/ezRun
An R meta-package for the analysis of Next Generation Sequencing Data

In uzh/ezRun: An R meta-package for the analysis of Next Generation Sequencing Data

Analysis results {.tabset}

Unwanted sources of variation

1. Cell cycle

2. Batch effects

Clustering

The number of cells in each cluster and condition is represented on this barplot.

Cluster markers

Markers plots

Conserved markers

Conserved markers plots

Differential expressed genes

Differential genes plots

Cells annotation

Using cluster markers

Using reference data

Interactive explorer

Data availability

Mean expression of every gene across the cells in each cluster

Mean expression of every gene across all the cells

Positive markers of each cluster

Coordinates of every cell on the TSNE

The final Single Cell Experiment Object is here

Parameters

SessionInfo

R Package Documentation

Browse R Packages

We want your feedback!

uzh/ezRun An R meta-package for the analysis of Next Generation Sequencing Data

In uzh/ezRun: An R meta-package for the analysis of Next Generation Sequencing Data

Analysis results {.tabset}

Unwanted sources of variation

1. Cell cycle

2. Batch effects

Clustering

The number of cells in each cluster and condition is represented on this barplot.

Cluster markers

Markers plots

Conserved markers

Conserved markers plots

Differential expressed genes

Differential genes plots

Cells annotation

Using cluster markers

Using reference data

Interactive explorer

Data availability

Mean expression of every gene across the cells in each cluster

Mean expression of every gene across all the cells

Positive markers of each cluster

Coordinates of every cell on the TSNE

The final Single Cell Experiment Object is here

Parameters

SessionInfo

R Package Documentation

Browse R Packages

We want your feedback!

uzh/ezRun
An R meta-package for the analysis of Next Generation Sequencing Data