In uzh/ezRun: An R meta-package for the analysis of Next Generation Sequencing Data

Started on r format(Sys.time(), "%Y-%m-%d %H:%M:%S")

library(Seurat)
library(ggplot2)
library(patchwork)
library(dplyr)
library(RColorBrewer)
library(kableExtra)
library(clustree)
library(ezRun)
library(cowplot)
knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE, knitr.table.format = "html")

param = readRDS("param.rds")
scData <- readRDS("scData.rds")
DefaultAssay(scData) = "SCT"

Analysis results {.tabset}

Quality control

We can compare the different slides based on different QC metrics. Here we plot the number of reads and genes, mitochondrial and ribosomal content on violin plots and on the tissue sections.

VlnPlot(scData, features = c("nCount_Spatial", "nFeature_Spatial", "percent_mito", "percent_ribo"), group.by = "Batch", pt.size = 0.1) + NoLegend() + plot_layout(ncol=4)

SpatialFeaturePlot(scData, features = c("nCount_Spatial", "nFeature_Spatial","percent_mito","percent_ribo"))+ plot_layout(nrow=4, ncol=length(unique(scData$Batch))) & theme(legend.text = element_text(size = 6))

Batch effects

We always look at our spots before deciding whether we need to perform integration. Te main goal of dataset integration is to identify shared cell states that are present across different datasets, in case they were collected from different individuals, experimental conditions, technologies, or even species. Large single-cell RNA sequencing projects usually need to generate data across multiple batches due to logistical constraints. This results in systematic differences in the observed expression in spots from different batches, which we refer to as “batch effects”. Batch effects are problematic as they can be major drivers of heterogeneity in the data, masking the relevant biological differences and complicating interpretation of the results. The UMAPs and the barplot below can help us visualize if the clusters are balanced and composed by spots from the different batches. If we see clusters that are comprised of spots from a single batch, this indicates that spots of the same cell type type are artificially separated due to technical differences between batches. In this case, we may also consider that there are cell types that are unique to each batch. If a cluster only contains spots from a single batch, one can always debate whether that is caused by technical differences or if there is truly a batch-specific subpopulation.

DimPlot(scData, reduction =  "umap_noCorrected", group.by = "Batch") + xlab("UMAP 1") + ylab("UMAP 2")
cellIdents_perBatch = data.frame(scData@meta.data[,c("ident_noCorrected", "Batch")])
barplot = ggplot(data=cellIdents_perBatch, aes(x=cellIdents_perBatch[,1], fill=Batch)) + geom_bar(stat="Count")
barplot + labs(x="Cluster", y = "Number of spots", fill = "Batch")
cells_prop = cellsProportion(scData, groupVar1 = "ident_noCorrected", groupVar2 = "Batch")
kable(cells_prop, digits = 4, row.names=FALSE, format="html",caption="Cell proportions per batch") %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "float_right")

#if there are only two samples these plots are the same as the previous ones
DimPlot(scData, reduction =  "umap_noCorrected", group.by = "Condition") + xlab("UMAP 1") + ylab("UMAP 2")
cellIdents_perCondition = data.frame(scData@meta.data[,c("ident_noCorrected", "Condition")])
barplot = ggplot(data=cellIdents_perCondition, aes(x=cellIdents_perCondition[,1], fill=Condition)) + geom_bar(stat="Count")
barplot + labs(x="Cluster", y = "Number of spots", fill = "Condition")
cells_prop = cellsProportion(scData, groupVar1 = "ident_noCorrected", groupVar2 = "Condition")
kable(cells_prop,row.names=FALSE, format="html",caption="Cell proportions per condition") %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "float_right")

Clustering

cat("We started by merging all the samples in one dataset and then used the SCtransform method from the Seurat package for normalizing, estimating the variance of the raw filtered data, and identifying the most variable genes. By default, SCtransform accounts for cellular sequencing depth, or nUMIs.")
if(ezIsSpecified(param$SCT.regress.CellCycle) && param$SCT.regress.CellCycle) {
  cat("We already checked cell cycle and decided that it does represent a major source of variation in our data, and this may influence clustering. Therefore, we regressed out variation due to cell cycle")
}
cat("As a result, SCTransform ranked the genes by residual variance and returned the 3000 most variant genes. Next, we performed PCA on the scaled data using the previously determined variable features. Taking as a distance metric the previously identified PCs, the spots clusters were then identified using a graph-based clustering approach where the spots are embedded in a graph structure - for example a K-nearest neighbor (KNN) graph, with edges drawn between spots with similar feature expression patterns, and then attempt to partition this graph into highly interconnected ‘communities’. The resolution is an important argument that sets the \"granularity\" of the downstream clustering and will need to be optimized for every individual experiment. Increased resolution values lead to a greater number of clusters, which is often required for larger datasets.\n")
cat("\n")
cat("The UMAPs below place similar spots together in low-dimensional space. The first UMAP represents spots according to the condition and the second one shows the graph-based common clusters that were found among the datasets.\n")

cat("After inspecting the datasets and observing that spots clustered by sample we decided to integrate samples using shared highly variable genes. Oftentimes, when clustering spots from multiple conditions there are condition-specific clusters and integration can help ensure the same cell types cluster together.\n We started by normalizing, estimating the variance of the raw filtered data, and identifying the most variable genes in each sample separately. For this, we used the SCtransform method from the Seurat package which accounts for cellular sequencing depth, or nUMIs by default.")
cat("\n")
if(ezIsSpecified(param$SCT.regress.CellCycle) && param$SCT.regress.CellCycle) {
  cat("We already checked cell cycle and decided that it does represent a major source of variation in our data, and this may influence clustering. Therefore, we regressed out variation due to cell cycle")
}
cat("\n")
cat("To integrate, we applied the following steps:")
cat("\n\n")
cat("**1. Perform canonical correlation analysis (CCA):** CCA identifies shared sources of variation between the conditions/groups. It is a form of PCA, in that it identifies the greatest sources of variation in the data, but only if it is shared or conserved across the conditions/groups (using the 3000 most variant genes from each sample).")
cat("\n\n")
cat("**2. Identify anchors or mutual nearest neighbors (MNNs) across datasets (sometimes incorrect anchors are identified):** MNNs are like good friends. For each cell in one sample, the cell's closest neighbor in the other sample is identified based on gene expression values as it's best neighbor.The reciprical analysis is performed, and if the two spots are 'best friends' in both directions, then those spots will be marked as anchors to 'anchor' the two datasets together.")
cat("\n\n")
cat("**3. Filter anchors to remove incorrect anchors:** Assess the similarity between anchor pairs by the overlap in their local neighborhoods (incorrect anchors will have low scores)")
cat("\n\n")
cat("**4. Integrate the conditions/datasets:** Use anchors and corresponding scores to transform the cell expression values, allowing for the integration of the different samples.")
cat("\n\n")
cat("If cell types are present in one dataset, but not the other, then the spots will still appear as a separate sample-specific cluster.")
cat("\n\n")
cat("Finally, the spots clusters were identified using a graph-based clustering approach where the spots are embedded in a graph structure - for example a K-nearest neighbor (KNN) graph, with edges drawn between spots with similar feature expression patterns, and then attempt to partition this graph into highly interconnected ‘communities’. The resolution is an important argument that sets the \"granularity\" of the downstream clustering and will need to be optimized for every individual experiment. Increased resolution values lead to a greater number of clusters, which is often required for larger datasets.")
cat("\n")
cat("\n")
cat("The UMAPs below place similar spots together in low-dimensional space after the integration of the datasets. The first UMAP represents spots according to the condition and the second UMAP shows the graph-based common clusters that were found among the datasets. The presence of visual clusters containing spots from the different datasets provides a comforting illusion that the integration was successful.")

DimPlot(scData, reduction="umap", group.by ="seurat_clusters")
DimPlot(scData, reduction="umap", group.by ="Batch")
cellIdents_perBatch = data.frame(scData@meta.data[,c("seurat_clusters", "Batch")])
barplot = ggplot(data=cellIdents_perBatch, aes(x=cellIdents_perBatch[,1], fill=Batch)) + geom_bar(stat="Count")
barplot + labs(x="Cluster", y = "Number of spots", fill = "Batch")
cells_prop = cellsProportion(scData, groupVar1 = "seurat_clusters", groupVar2="Batch")
kable(cells_prop,row.names=FALSE, format="html",caption="Cell proportions") %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "float_right")

DimPlot(scData,  reduction="umap", group.by ="Condition")
cellIdents_perCondition = data.frame(scData@meta.data[,c("seurat_clusters", "Condition")])
barplot = ggplot(data=cellIdents_perCondition, aes(x=cellIdents_perCondition[,1], fill=Condition)) + geom_bar(stat="Count")
barplot + labs(x="Cluster", y = "Number of spots", fill = "Condition")
cells_prop = cellsProportion(scData, groupVar1 = "seurat_clusters", groupVar2="Condition")
kable(cells_prop,row.names=FALSE, format="html",caption="Cell proportions") %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "float_right")

SpatialDimPlot(scData, label = TRUE, label.size = 3) & NoLegend()

We can also plot each cluster separately

SpatialDimPlot(scData, cells.highlight = CellsByIdentities(scData), facet.highlight = TRUE)

Cluster markers

cat("We found positive markers that defined clusters compared to all other spots via differential expression. The test we used was the Wilcoxon Rank Sum test. Genes with an average, at least 0.25-fold difference (log-scale) between the spots in the tested cluster and the rest of the spots and an adjusted p-value < 0.05 were declared as significant.")

cat("We found positive markers that defined clusters compared to all other spots via differential expression using a logistic regression test and including in the model the cell cycle as the batch effect. Genes with an average, at least 0.25-fold difference (log-scale) between the spots in the tested cluster and the rest of the spots and an adjusted p-value < 0.05 were declared as significant.")

Expression differences of cluster marker genes

posMarkers = read_tsv("pos_markers.tsv")
posMarkers$cluster = as.factor(posMarkers$cluster)
posMarkers$gene = as.factor(posMarkers$gene)
ezInteractiveTableRmd(posMarkers, digits=4)

Markers plots

We plot on the tissue the top 5 markers in each cluster. Be aware that some genes may be in the top markers for different clusters.

eachCluster <- 0
for (eachCluster in levels(posMarkers$cluster)) {
  cat("\n")
  cat("#### Cluster ", eachCluster, "\n")
  cat("\n")
  markersPerCluster <- dplyr::filter(posMarkers, cluster == eachCluster) %>%
    dplyr::arrange(desc(avg_log2FC)) %>%
    select(gene) %>%
    pull()
  markersPerCluster <- head(markersPerCluster, 5)
  eachMarker <- markersPerCluster[1]
  for (eachMarker in markersPerCluster) {
    sp = SpatialFeaturePlot(object = scData, features = eachMarker, alpha = c(0.1, 1))
    print(sp)
  }
  cat("\n")
}

Spatially Variable Genes plots

We also identify molecular features that correlate with spatial location within a tissue. Here we show the expression of the top 10 features identified by this measure. The full list can be found in the Data availability section.

spatialMarkers = read_tsv("spatial_markers.tsv")
top.features <- head(unique(spatialMarkers[order(spatialMarkers$MeanRank),'GeneSymbol']),10)
for(gene in top.features) {
   sp = SpatialFeaturePlot(object = scData, features = gene, alpha = c(0.1, 1))
   print(sp)
}

Data availability

Positive markers of each cluster

posMarkers

The final Seurat Object is here

Spatially variable genes

spatialMarkers

Parameters

param[c("npcs", "resolution", "batchCorrection", "SCT.regress.CellCycle", "DE.method", "DE.regress")]

SessionInfo

ezSessionInfo()

uzh/ezRun documentation built on May 4, 2024, 3:23 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

uzh/ezRun
An R meta-package for the analysis of Next Generation Sequencing Data

In uzh/ezRun: An R meta-package for the analysis of Next Generation Sequencing Data

Analysis results {.tabset}

Quality control

Batch effects

Clustering

Cluster markers

Expression differences of cluster marker genes

Markers plots

Spatially Variable Genes plots

Data availability

Positive markers of each cluster

The final Seurat Object is here

Spatially variable genes

Parameters

SessionInfo

R Package Documentation

Browse R Packages

We want your feedback!

uzh/ezRun An R meta-package for the analysis of Next Generation Sequencing Data

In uzh/ezRun: An R meta-package for the analysis of Next Generation Sequencing Data

Analysis results {.tabset}

Quality control

Batch effects

Clustering

Cluster markers

Expression differences of cluster marker genes

Markers plots

Spatially Variable Genes plots

Data availability

Positive markers of each cluster

The final Seurat Object is here

Spatially variable genes

Parameters

SessionInfo

R Package Documentation

Browse R Packages

We want your feedback!

uzh/ezRun
An R meta-package for the analysis of Next Generation Sequencing Data