In uzh/ezRun: An R meta-package for the analysis of Next Generation Sequencing Data

Started on r format(Sys.time(), "%Y-%m-%d %H:%M:%S")

library(clustree)
library(kableExtra)
library(NMF)
library(pheatmap)
library(viridis)
library(cowplot)
library(scran)
library(RColorBrewer)
library(plotly)
library(tidyverse)
library(SingleR)
library(scater)
library(Seurat)
library(AUCell)
library(ezRun)
library(HDF5Array)
library(enrichR)
knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE, knitr.table.format = "html")

scData <- readRDS("scData.rds")
param = readRDS("param.rds")
output = readRDS("output.rds")
species <- getSpecies(param$refBuild)
if (file.exists("singler.results.rds")){
  singler.results <- readRDS("singler.results.rds")
} else {
  singler.results <- NULL
}

if ("Azimuth.celltype.l1" %in% colnames(scData@meta.data)){
    aziResults <- TRUE    
} else {
    aziResults <- NULL
}

if (file.exists("cells.AUC.rds")){
  cells.AUC <- readRDS("cells.AUC.rds")
} else {
  cells.AUC <- NULL
}

if (file.exists("enrichRout.rds")){
  enrichRout <- readRDS("enrichRout.rds")
  if(length(enrichRout) == 1){ ##enrichR call creates empty file
     enrichRout <- NULL 
  }
} else {
  enrichRout <- NULL
}


posMarkers <- readxl::read_xlsx("posMarkers.xlsx")
# Process the posMarkers
posMarkers$gene = as.factor(posMarkers$gene)
# make sure the loaded cluster uses proper integer ordering if all levels are integer!
clusterSet <- posMarkers$cluster %>% as.integer() %>% unique() %>% sort(na.last=TRUE)
if (any(is.na(clusterSet))){
  posMarkers$cluster = as.factor(posMarkers$cluster)
} else {
  posMarkers$cluster = factor(posMarkers$cluster, levels=clusterSet)
}


var_height = 1 #This ensures that chunks that use the length of this variable to set the fig size, don't fail.
pvalue_allMarkers <- 0.01

Analysis results {.tabset}

Batch effects

We always look at our cells before deciding whether we need to perform integration. The main goal of dataset integration is to identify shared cell states that are present across different datasets, in case they were collected from different individuals, experimental conditions, technologies, or even species. Large single-cell RNA sequencing projects usually need to generate data across multiple batches due to logistical constraints. This results in systematic differences in the observed expression in cells from different batches, which we refer to as “batch effects”. Batch effects are problematic as they can be major drivers of heterogeneity in the data, masking the relevant biological differences and complicating interpretation of the results. The UMAPs and the barplot below can help us to visualize if the clusters are balanced and composed by cells from the different batches. If we see clusters that are comprised of cells from a single batch, this indicates that cells of the same type are artificially separated due to technical differences between batches. In this case, we may also consider that there are cell types that are unique to each batch. If a cluster only contains cells from a single batch, one can always debate whether that is caused by technical differences or if there is truly a batch-specific subpopulation.

if(!('Batch' %in% colnames(scData@meta.data)))
    scData@meta.data[['Batch']] <- scData@meta.data[['Sample']]
DimPlot(scData, reduction =  "umap_noCorrected", group.by = "Batch")
DimPlot(scData, reduction="umap_noCorrected", group.by="ident_noCorrected", label = TRUE)
cellIdents_perBatch = data.frame(scData@meta.data[,c("ident_noCorrected", "Batch")])
barplot = ggplot(data=cellIdents_perBatch, aes(x=cellIdents_perBatch[,1], fill=Batch)) + geom_bar(stat="Count")
barplot + labs(x="Cluster", y = "Number of cells", fill = "Batch")
cells_prop = cellsProportion(scData, groupVar1 = "ident_noCorrected", groupVar2 = "Batch")
kable(cells_prop,row.names=FALSE, format="html",caption="Cell proportions per batch") %>% kable_styling(bootstrap_options = "striped") %>% scroll_box(width = "100%", height = "800px")
#bootstrap_options = "striped", full_width = F, position = "float_right"

#if there are only two samples these plots are the same as the previous ones
DimPlot(scData, reduction =  "umap_noCorrected", group.by = "Condition")
cellIdents_perCondition = data.frame(scData@meta.data[,c("ident_noCorrected", "Condition")])
barplot = ggplot(data=cellIdents_perCondition, aes(x=cellIdents_perCondition[,1], fill=Condition)) + geom_bar(stat="Count")
barplot + labs(x="Cluster", y = "Number of cells", fill = "Condition")
cells_prop = cellsProportion(scData, groupVar1 = "ident_noCorrected", groupVar2 = "Condition")
kable(cells_prop,row.names=FALSE, format="html",caption="Cell proportions per condition") %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "float_right")

Clustering

cat("We started by merging all the samples in one dataset and then used the SCtransform method from the Seurat package for normalizing, estimating the variance of the raw filtered data, and identifying the most variable genes. By default, SCtransform accounts for cellular sequencing depth, or nUMIs.")
if(ezIsSpecified(param$SCT.regress.CellCycle) && 
      param$SCT.regress.CellCycle) {
  cat("We already checked cell cycle and decided that it does represent a major source of variation in our data, and this may influence clustering. Therefore, we regressed out variation due to cell cycle")
}
cat("As a result, SCTransform ranked the genes by residual variance and returned the 3000 most variant genes. Next, we performed PCA on the scaled data using the previously determined variable features. Taking as a distance metric the previously identified PCs, the cells clusters were then identified using a graph-based clustering approach where the cells are embedded in a graph structure - for example a K-nearest neighbor (KNN) graph, with edges drawn between cells with similar feature expression patterns, and then attempt to partition this graph into highly interconnected ‘communities’. The resolution is an important argument that sets the \"granularity\" of the downstream clustering and will need to be optimized for every individual experiment. Increased resolution values lead to a greater number of clusters, which is often required for larger datasets.\n")
cat("\n")
cat("The UMAPs below place similar cells together in low-dimensional space. The first UMAP represents cells according to the condition and the second one shows the graph-based common clusters that were found among the datasets.\n")

cat("After inspecting the datasets and observing that cells clustered by sample we decided to integrate samples using shared highly variable genes. Oftentimes, when clustering cells from multiple conditions there are condition-specific clusters and integration can help ensure the same cell types cluster together.\n We started by normalizing, estimating the variance of the raw filtered data, and identifying the most variable genes in each sample separately. For this, we used the SCtransform method from the Seurat package which accounts for cellular sequencing depth, or nUMIs by default.")
cat("\n")
if(ezIsSpecified(param$SCT.regress.CellCycle) && 
      param$SCT.regress.CellCycle) {
  cat("We already checked cell cycle and decided that it does represent a major source of variation in our data, and this may influence clustering. Therefore, we regressed out variation due to cell cycle")
}
cat("\n")
cat("To integrate, we applied the following steps:")
cat("\n\n")
cat(paste("**1. Perform integration method",param$integrationMethod, "**")) 
    #CCA identifies shared sources of variation between the conditions/groups. It is a form of PCA, in that it identifies the greatest sources of variation in the data, #but only if it is shared or conserved across the conditions/groups (using the 3000 most variant genes from each sample).")
cat("\n\n")
cat("**2. Identify anchors or mutual nearest neighbors (MNNs) across datasets (sometimes incorrect anchors are identified):** MNNs are like good friends. For each cell in one sample, the cell's closest neighbor in the other sample is identified based on gene expression values as it's best neighbor.The reciprical analysis is performed, and if the two cells are 'best friends' in both directions, then those cells will be marked as anchors to 'anchor' the two datasets together.")
cat("\n\n")
cat("**3. Filter anchors to remove incorrect anchors:** Assess the similarity between anchor pairs by the overlap in their local neighborhoods (incorrect anchors will have low scores)")
cat("\n\n")
cat("**4. Integrate the conditions/datasets:** Use anchors and corresponding scores to transform the cell expression values, allowing for the integration of the different samples.")
cat("\n\n")
cat("If cell types are present in one dataset, but not the other, then the cells will still appear as a separate sample-specific cluster.")
cat("\n\n")
cat("Finally, the cells clusters were identified using a graph-based clustering approach where the cells are embedded in a graph structure - for example a K-nearest neighbor (KNN) graph, with edges drawn between cells with similar feature expression patterns, and then attempt to partition this graph into highly interconnected ‘communities’. The resolution is an important argument that sets the \"granularity\" of the downstream clustering and will need to be optimized for every individual experiment. Increased resolution values lead to a greater number of clusters, which is often required for larger datasets.")
cat("\n")
cat("\n")
cat("The UMAPs below place similar cells together in low-dimensional space after the integration of the datasets. The first UMAP represents cells according to the condition and the second UMAP shows the graph-based common clusters that were found among the datasets. The presence of visual clusters containing cells from the different datasets provides a comforting illusion that the integration was successful.")

DimPlot(scData, reduction="umap", group.by="Batch")
p2 <- DimPlot(scData, reduction="umap", label = FALSE, pt.size = 1.5) + labs(color = "ident")
LabelClusters(p2, id = "ident",  fontface = "bold", color = "black", size = 6)

cellIdents_perBatch = data.frame(scData@meta.data[,c("ident", "Batch")])
barplot = ggplot(data=cellIdents_perBatch, aes(x=cellIdents_perBatch[,1], fill=Batch)) + geom_bar(stat="Count")
barplot + labs(x="Cluster", y = "Number of cells", fill = "Batch")
cells_prop = cellsProportion(scData, groupVar1 = "ident", groupVar2="Batch")
cluster_prop <- cellsProportion(scData, groupVar1 = "Condition", groupVar2="ident")
cluster_prop <- cluster_prop[,grep('Condition|_fraction', colnames(cluster_prop))]

kable(cells_prop,row.names=FALSE, format="html",caption="Cell proportions per batch") %>% kable_styling(bootstrap_options = "striped") %>% scroll_box(width = "100%", height = "800px")

kable(cluster_prop,row.names=FALSE, format="html",caption="Cell proportions per cluster") %>% kable_styling(bootstrap_options = "striped") %>% scroll_box(width = "100%", height = "100%")

DimPlot(scData, reduction="umap", group.by="Condition")
cellIdents_perCondition = data.frame(scData@meta.data[,c("ident", "Condition")])
barplot = ggplot(data=cellIdents_perCondition, aes(x=cellIdents_perCondition[,1], fill=Condition)) + geom_bar(stat="Count")
barplot + labs(x="Cluster", y = "Number of cells", fill = "Condition")
cells_prop = cellsProportion(scData, groupVar1 = "ident", groupVar2="Condition")

kable(cells_prop,row.names=FALSE, format="html",caption="Cell proportions") %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "float_right")

pct <- scData[["pca"]]@stdev / sum(scData[["pca"]]@stdev) * 100
cumu <- cumsum(pct)
plot_df <- data.frame(pct = pct, cumu = cumu, rank = 1:length(pct))

# Elbow plot
p <- ggplot(plot_df, aes(cumu, pct, label = rank, color = rank > param$npcs))
p <- p + geom_vline(xintercept = cumu[param$npcs], color = "grey") + geom_text() 
p <- p + xlab(paste("Variance based on 50 PCs of", param$nfeatures, "features")) +  ylab("Cumulative variance in % explained per PC") 
p <- p + theme_bw() + labs(color = paste("PC >", param$npcs)) + theme(legend.position = "bottom")
p <- p + annotate(geom="text", y=max(plot_df$pct), x=max(plot_df$cumu[param$npcs]), label=paste('Variance used:', round(plot_df$cumu[param$npcs], digits = 2), '%'), color="red")
p

Cluster assessment

Segregation of clusters by various sources of uninteresting variation.

Once we have created the clusters we need to asses if the clustering was driven by technical artifacts or uninteresting biological variability, such as cell cycle, mitochondrial gene expression. We can explore whether the cells cluster by the different cell cycle phases. In such a case, we would have clusters where most of the cells would be in one specific phase. This bias could be taken into account when normalizing and transforming the data prior to clustering. We can also look at the total number of reads, genes detected and mitochondrial gene expression. The clusters should be more or less even but if we observe big differences among some of them for these metrics, we will keep an eye on them and see if the cell types we identify later can explain the differences.

if (any(!is.na(scData$CellCycle))){
    DimPlot(scData, label=TRUE, split.by = "CellCycle", reduction="umap")
}

VlnPlot(scData, features ="nCount_RNA") + ggtitle("Number of UMIs vs cluster") + NoLegend()
FeaturePlot(scData, features="nCount_RNA", reduction="umap", label = TRUE)
VlnPlot(scData, features ="nFeature_RNA") + ggtitle("Number of genes vs cluster") + NoLegend()
FeaturePlot(scData, features="nFeature_RNA", reduction="umap", label = TRUE)
VlnPlot(scData, features ="percent_mito") + ggtitle("Mitochondrial percentage vs cluster") + NoLegend()
FeaturePlot(scData, features="percent_mito", reduction="umap", label = TRUE)
VlnPlot(scData, features ="percent_riboprot") + ggtitle("Ribosomal percentage vs cluster") + NoLegend()
FeaturePlot(scData, features="percent_riboprot", reduction="umap", label = TRUE)
if (ezIsSpecified(param$controlSeqs)) {
  genesToPlot <- gsub("_", "-", param$controlSeqs)
  genesToPlot <- intersect(genesToPlot, rownames(scData))
  FeaturePlot(scData, genesToPlot, reduction="umap") + NoLegend()
}

Cluster resolution

One of the most important parameters when clustering is k, the number of nearest neighbors used to construct the graph. This controls the resolution of the clustering where higher k yields a more inter-connected graph and broader clusters. Users can experiment with different values of k to obtain a satisfactory resolution. We recommend increasing the resolution when a rare population is expected. Below, it is shown a clustering tree that helps us to visualize the relationships between clusters at a range of resolutions. Each cluster forms a node in the tree and edges are constructed by considering the cells in a cluster at a lower resolution that end up in a cluster at the next highest resolution. By connecting clusters in this way, we can see how clusters are related to each other, which are clearly distinct and which are unstable. The size of each node is related to the number of cells in each cluster and the color indicates the clustering resolution. Edges are colored according to the number of cells they represent and the transparency shows the incoming node proportion, the number of cells in the edge divided by the number of samples in the node it points to.

if(!(param$integrationMethod %in% c("none","Harmony"))) {
   clustree::clustree(scData, prefix = "integrated_snn_res.")
} else {
  clustree::clustree(scData, prefix = "SCT_snn_res.")
}

Cluster markers

cat("We found positive markers that defined clusters compared to all other cells via differential expression. The test we used was the Wilcoxon Rank Sum test. Genes with an average, at least 0.25-fold difference (log-scale) between the cells in the tested cluster and the rest of the cells and an adjusted p-value < 0.05 were declared as significant.")

cat("We found positive markers that defined clusters compared to all other cells via differential expression using a logistic regression test and including in the model the cell cycle as the batch effect. Genes with an average, at least 0.25-fold difference (log-scale) between the cells in the tested cluster and the rest of the cells and an adjusted p-value < 0.05 were declared as significant.")

Expression differences of cluster marker genes

ezInteractiveTableRmd(posMarkers, digits=4)

Marker plots

Here, we use a heatmap and a dotplot to visualize simultaneously the top 5 markers in each cluster. Be aware that some genes may be in the top markers for different clusters.

# prepare top 5 markers for each cluster
top5 <- posMarkers %>%
  mutate(cluster = as.character(cluster))
if (all(varhandle::check.numeric(top5$cluster))) {
  top5 <- top5 %>% mutate(cluster = as.numeric(cluster))  # Convert back to numeric if possible
}
top5 <- top5 %>%
  group_by(cluster) %>%
  slice_max(n = 5, order_by = tibble(diff_pct, avg_log2FC))

# specifiy genes to plot
genesToPlot <- c(gsub("_", "-", param$controlSeqs), unique(as.character(top5$gene)))
genesToPlot <- intersect(genesToPlot, rownames(scData))
if (ncol(scData) > 3e4) {
  # Heatmap and Dotplot may not work above 30'000 cells, so we downsample
  cat("We first downsample to a representative 30'000 cells.")
  scDataToPlot <- scData[, sample(colnames(scData), size=3e4, replace=F)]
} else {
  scDataToPlot <- scData
}

DoHeatmap(scDataToPlot, features=unique(genesToPlot))

DotPlot(scDataToPlot, features=genesToPlot) + coord_flip()

EnrichR

Cell type scoring using the EnrichR tool. This approach consists in performing a gene set enrichment analysis on the marker genes defining each cluster. This identifies the pathways and processes that are (relatively) active in each cluster based on the upregulation of the associated genes compared to other clusters.

genesPerCluster <- split(posMarkers$gene, posMarkers$cluster)
jsCall = paste0('enrich({list: "', sapply(genesPerCluster, paste, collapse="\\n"), '", popup: true});')
enrichrCalls <- paste0("<a href='javascript:void(0)' onClick='", jsCall, 
                         "'>Analyse at Enrichr website</a>")

enrichrTable <- tibble(Cluster=names(genesPerCluster),
                         "# of posMarkers"=lengths(genesPerCluster),
                         "Enrichr link"=enrichrCalls)
if (!is.null(enrichRout)){
  enrichRTerm <- as.data.frame(do.call(rbind, lapply(enrichRout, as.vector)))
  enrichRTerm <- map_df(enrichRTerm, ~ map_df(.x, ~ replace(.x, is.null(.x), NA)), .id = "database")
  enrichRTerm <- enrichRTerm %>%
    group_by(., Cluster, database) %>%
    summarise(topTerms = paste(Term, collapse = "; ")) %>%
    as.data.frame()
  enrichRTerm <- dcast(enrichRTerm, ... ~ database)
  enrichrTable <- merge(enrichrTable, enrichRTerm, by = "Cluster")
}
kable(enrichrTable, format="html", escape=FALSE,
        caption=paste0("GeneSet enrichment")) %>%
kable_styling("striped", full_width = F, position = "left")

for (cluster in names(enrichRout)) {
  enrichRDataFrame <- as.data.frame(do.call(rbind, enrichRout[[cluster]]))
  enrichRDataFrame$database <- sapply(rownames(enrichRDataFrame), function(x) str_split(x, "[.]")[[1]][1])
  print(ggplot(as.data.table(enrichRDataFrame), aes(database, Term))
  +
    geom_point(aes(color = OverlapGenesN, size = Combined.Score))
    +
    scale_color_viridis()
    +
    scale_fill_viridis()
    +
    labs(y = "Term", x = "Cluster", title = paste0("Cluster: ", cluster))
    +
    theme_bw()
    +
    theme(
      axis.title.x = element_blank(),
      axis.text.x = element_text(angle = 45, hjust = 1)
    ))
}

SingleR

Matching of clusters to annotated reference datasets using SingleR.

clusterInfosList <- list()
plotlyPlots <- list()
nTopMarkers <- 10
for (r in names(singler.results)) {
  cat("##### Heatmap: ", r, "\n\n")
  print(plotScoreHeatmap(singler.results[[r]]$single.fine))
  print(plotScoreHeatmap(singler.results[[r]]$cluster.fine, show_colnames = TRUE))
  cat("\n\n")
  singler.single.labels <- singler.results[[r]]$single.fine$labels
  singler.cluster.labels <- singler.results[[r]]$cluster.fine$labels[match(Idents(scData), rownames(singler.results[[r]]$cluster.fine))]

  cellInfo <- tibble(
    Cells = colnames(scData), 
    Cluster = Idents(scData), 
    SingleR.labels.cluster = singler.cluster.labels, 
    SingleR.labels.single = singler.single.labels)  %>%
      left_join(as_tibble(scData@reductions$umap@cell.embeddings, rownames="Cells"))

  nrOfLabels_cluster <- length(unique(cellInfo$SingleR.labels.cluster))
  nrOfLabels_single <- length(unique(cellInfo$SingleR.labels.single))

  if(nrOfLabels_single <= 9){
    colsLabels <- brewer.pal(nrOfLabels_single, "Set1")
  }else{
    colsLabels <- colorRampPalette(brewer.pal(9, "Set1"))(nrOfLabels_single)
  }

  x <- list(title="umap_1", zeroline=FALSE)
  y <- list(title="umap_2", zeroline=FALSE)

  p1 <- plot_ly(cellInfo, x = ~umap_1, y = ~umap_2, color=~SingleR.labels.single,
                text = ~paste("Cluster: ", Cluster,
                              "\nCell: ", Cells,
                              "\nSingleR.labels.cluster: ", 
                              SingleR.labels.single),
                type = 'scatter', mode = "markers", marker=list(size=5, opacity=0.5),
                colors=colsLabels, width=800, height=500) %>%
    layout(xaxis=x, yaxis=y, title=paste0(r, ": Single"))
  p2 <- plot_ly(cellInfo, x = ~umap_1, y = ~umap_2, color=~SingleR.labels.cluster,
                text = ~paste("Cluster: ", Cluster,
                              "\nCell: ", Cells,
                              "\nSingleR.labels.cluster: ", SingleR.labels.cluster),
                type = 'scatter', mode = "markers", marker=list(size=5, opacity=0.5),
                colors=colsLabels, width=800, height=500) %>%
    layout(xaxis=x, yaxis=y, title=paste0(r, ": Cluster"))
  plotlyPlots <- append(plotlyPlots, list(p1, p2))
  cat("\n\n")
}

cat("##### UMAPs with SingleR Labels\n\n")
cat("The single cell (top) and cluster (bottom) annotations are also shown on the UMAPs. Place the mouse over the cells to get information such as their UMAP coordinates, original cluster, the cells name and the label assigned by SingleR. You can also zoom in specific areas of the UMAP by dragging and drop with the mouse.\n")
tl <- htmltools::tagList(plotlyPlots)
tl
cat("\n\n")

cat("##### Table of SingleR Labels\n\n")
cat(sprintf("The following table summarises the cluster-level labels for each reference (%s) and displays the top %s genes for that cluster in terms of their average log2-fold change. Note the top markers for every cluster will be identical regardless of reference used since this information is specific to the cluster and not the annotation reference.", paste(names(singler.results), collapse=", "), nTopMarkers))
ezInteractiveTableRmd(bind_rows(clusterInfosList))
cat("\n\n")

Azimuth

p1 <- DimPlot(scData, reduction="umap", label = FALSE, pt.size = 1.5) + labs(color = "ident")
LabelClusters(p1, id = "ident",  fontface = "bold", color = "black", size = 6)


p2 <- DimPlot(scData, label=FALSE, group.by = "Azimuth.celltype.l1", repel = TRUE, reduction = "umap", pt.size = 1.5) 
if(length(unique(scData@meta.data$Azimuth.celltype.l1))>10){
    LabelClusters(p2, id = "Azimuth.celltype.l1",  fontface = "bold", color = "black", size = 5) + NoLegend()
} else {
        LabelClusters(p2, id = "Azimuth.celltype.l1",  fontface = "bold", color = "black", size = 5)
    }

if ("Azimuth.celltype.l2" %in% colnames(scData@meta.data)){
    p3 <- DimPlot(scData, label=FALSE, group.by = "Azimuth.celltype.l2", repel = TRUE, reduction = "umap", pt.size = 1.5) 
    if(length(unique(scData@meta.data$Azimuth.celltype.l2))>10){
    LabelClusters(p3, id = "Azimuth.celltype.l2",  fontface = "bold", color = "black", size = 5) + NoLegend()
    } else {
        LabelClusters(p3, id = "Azimuth.celltype.l2",  fontface = "bold", color = "black", size = 5)
    }
}
if ("Azimuth.celltype.l3" %in% colnames(scData@meta.data)){
  p4 <- DimPlot(scData, label=FALSE, group.by = "Azimuth.celltype.l3", repel = TRUE, reduction = "umap", pt.size = 1.5)
  if(length(unique(scData@meta.data$Azimuth.celltype.l3))>10){
    LabelClusters(p4, id = "Azimuth.celltype.l3",  fontface = "bold", color = "black", size = 5) + NoLegend()
  } else {
      LabelClusters(p4, id = "Azimuth.celltype.l3",  fontface = "bold", color = "black", size = 5)
  }

  }
if ("Azimuth.celltype.l4" %in% colnames(scData@meta.data)){
  p5 <- DimPlot(scData, label=FALSE, group.by = "Azimuth.celltype.l4", repel = TRUE, reduction = "umap", pt.size = 1.5)
    if(length(unique(scData@meta.data$Azimuth.celltype.l4))>10){
    LabelClusters(p5, id = "Azimuth.celltype.l4",  fontface = "bold", color = "black", size = 5) + NoLegend()
    } else {
        LabelClusters(p5, id = "Azimuth.celltype.l4",  fontface = "bold", color = "black", size = 5)
    }
}

AUCell

Matching the cells to marker gene sets that are known to be characteristic for cell types. We use marker gene sets from the CellMarker database (Zhang X., Lan Y., Xu J., Quan F., Zhao E., Deng C., et al. (2019). CellMarker: a manually curated resource of cell markers in human and mouse. Nucleic Acids Res. 47, D721–d728. 10.1093/nar/gky900).

tab <- NULL
try({
  ## in some extreme situations the default value of smallesPopPercent crashes 
  cells_assignment <- AUCell_exploreThresholds(cells.AUC, plotHist=FALSE, assign=TRUE, nCores = param$cores) 
  cellsAssigned <- lapply(cells_assignment, function(x) x$assignment)
  assignmentTable <- reshape2::melt(cellsAssigned, value.name="cell")
  colnames(assignmentTable)[2] <- "geneSet"
  tras_cells.AUC <- t(assay(cells.AUC))
  full.labels <- colnames(tras_cells.AUC)[max.col(tras_cells.AUC)]
  tab <- table(full.labels, Idents(scData))
  var_height <- shrinkToRange(nrow(tab)*0.5, c(6,16))
})

cat("We can explore the cell assignment results using different plots. Below, we show a heatmap that represents the number of cells (in log scale) from each cluster that were assigned to the different cell types. After calculating an AUC score for each cell and cell type, we assign cell type identity by taking the cell type with the top AUC as the label for that cell. Some cell types may be missing because no cells obtained their top AUC score for it.")
cat("\n\n")

if (!is.null(tab) && nrow(tab) > 2)
   pheatmap(log10(tab+10), color=viridis::viridis(100))

cat("The plots below show for every cell type:\n")
cat('\n')
cat("1) The distribution of the AUC values in the cells. The ideal situation will be a bi-modal distribution, in which most cells in the dataset have a low “AUC” compared to a population of cells with a higher value.  The size of the gene-set will also affect the results. With smaller gene-genes (fewer genes), it is more likely to get cells with AUC = 0. While this is the case of the “perfect markers” it is also easier to get it by chance with small datasets. The vertical bars correspond to several thresholds that could be used to consider a gene-set ‘active’. The thickest vertical line indicates the threshold selected by default: the highest value to reduce the false positives.\n")
cat('\n')
cat("2) The t-SNE can be colored based on the AUC scores. To highlight the cluster of cells that are more likely of the cell type according to the signatures, we split the cells into cells that passed the assignment threshold (colored in blue), and cells that didn’t (colored in gray).\n")
cat('\n')
cat("3) The last TSNE represents the AUC scores values. The darker a cell is the higher AUC score it obtained, i.e. the cell is more enriched in that cell type.")
cat('\n')
cellsUmap <- scData@reductions$umap@cell.embeddings
minAucThresh = 0.2
filtered_cells.AUC <- cells.AUC[rowSums(assay(cells.AUC) >= minAucThresh)>0, ]
maxAucScore = rowMax(assay(filtered_cells.AUC))
scoreOrder = order(maxAucScore, decreasing = TRUE)
nPlots = 50

AUCell_plotTSNE(tSNE=cellsUmap,
                cellsAUC=filtered_cells.AUC[head(scoreOrder, nPlots), ],
                 plots = c("histogram", "binaryAUC"))

Interactive explorer

The link below will take you to an interactive Shiny app which can be used to make production-ready plots and figures. You can also use this app to explore your own markers and create corresponding feature plots.

single cell explorer{target="_blank"}

Previous Shiny apps (no longer recommended)

These apps are no longer maintained. Use at your own risk, or for legacy datasets.

simple explorer{target="_blank"}

iSEE explorer{target="_blank"}

Data availability

clusterInfos.xlsx -- use this xls file to manually relabel the clusters

Positive markers of each cluster

posMarkers

The final Seurat object is here

Parameters

param[c("npcs","pcGenes","resolution", "SCT.regress.CellCycle", "DE.method", "integrationMethod")]

SessionInfo

ezSessionInfo()

uzh/ezRun documentation built on May 4, 2024, 3:23 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

uzh/ezRun
An R meta-package for the analysis of Next Generation Sequencing Data

In uzh/ezRun: An R meta-package for the analysis of Next Generation Sequencing Data

Analysis results {.tabset}

Batch effects

Clustering

Cluster assessment

Segregation of clusters by various sources of uninteresting variation.

Cluster resolution

Cluster markers

Expression differences of cluster marker genes

Marker plots

EnrichR

SingleR

Azimuth

AUCell

Interactive explorer

Previous Shiny apps (no longer recommended)

Data availability

Positive markers of each cluster

The final Seurat object is here

Parameters

SessionInfo

R Package Documentation

Browse R Packages

We want your feedback!

uzh/ezRun An R meta-package for the analysis of Next Generation Sequencing Data

In uzh/ezRun: An R meta-package for the analysis of Next Generation Sequencing Data

Analysis results {.tabset}

Batch effects

Clustering

Cluster assessment

Segregation of clusters by various sources of uninteresting variation.

Cluster resolution

Cluster markers

Expression differences of cluster marker genes

Marker plots

EnrichR

SingleR

Azimuth

AUCell

Interactive explorer

Previous Shiny apps (no longer recommended)

Data availability

Positive markers of each cluster

The final Seurat object is here

Parameters

SessionInfo

R Package Documentation

Browse R Packages

We want your feedback!

uzh/ezRun
An R meta-package for the analysis of Next Generation Sequencing Data