In uzh/ezRun: An R meta-package for the analysis of Next Generation Sequencing Data

Started on r format(Sys.time(), "%Y-%m-%d %H:%M:%S")

# input for this report: sce
library(ezRun)
library(cowplot)
library(kableExtra)
library(fastICA)
library(Matrix)
library(RColorBrewer)
library(SingleCellExperiment)
library(scater)
library(Seurat)
library(SingleR)
library(plotly)
library(tidyverse)
## ----------------------------------------------
## debug
# sce <- readRDS("/scratch/SCSeurat_57583--n_Feature_50-5000--mt_percent_10_2021-05-19--14-44-54_o24771_1_1-14_Glia_Skin_temp32153/sce.rds")
# sce <- sce[ ,sample.int(ncol(sce), round(ncol(sce)/10))]
## end of debug
debug <- FALSE

## Prepare data
param <- metadata(sce)$param
input <- metadata(sce)$input
output <- metadata(sce)$output
rownames(sce) <- uniquifyFeatureNames(
  rowData(sce)$ID,
  rowData(sce)$Symbol
)

scData <- CreateSeuratObject(
  counts = assay(sce, "counts"),
  project = paste(input$getNames(), collapse = ", "),
  min.cells = param$minCellsPerGene,
  min.features = 0,
  meta.data = as.data.frame(colData(sce))
)
pvalue_allMarkers <- 0.01
pvalue_all2allMarkers <- 0.01

jsFile <- system.file("extdata/enrichr.js", package = "ezRun", mustWork = TRUE)
invisible(file.copy(from = jsFile, to = basename(jsFile), overwrite = TRUE))
cat(paste0("<SCRIPT language=\"JavaScript\" SRC=\"", basename(jsFile), "\"></SCRIPT>"))

This clustering report is largely based on R package Seurat.

Clustering Results {.tabset}

Preprocessing

Quality control on the cells

There are r ncol(sce) cells in total.

Seurat allows you to easily explore QC metrics and filter cells based on any user-defined criteria. A few QC metrics commonly used by the community include

The number of unique genes detected in each cell. Low-quality cells or empty droplets will often have very few genes Cell doublets or multiplets may exhibit an aberrantly high gene count
Similarly, the total number of molecules detected within a cell (correlates strongly with unique genes)
The percentage of reads that map to the mitochondrial genome
- Low-quality / dying cells often exhibit extensive mitochondrial contamination

mitoPattern <- ifelse(length(grep("^MT-", rownames(scData))) > 0, "^MT-", "^mt-")
scData[["percent.mt"]] <- PercentageFeatureSet(scData, pattern = mitoPattern)
p <- VlnPlot(scData, features = "nFeature_RNA") + theme(legend.position = "none") +
  geom_hline(yintercept = param$minGenesPerCell, colour = "red") +
  geom_hline(yintercept = param$maxGenesPerCell, colour = "red")
p

p <- VlnPlot(scData, features = "nCount_RNA") + theme(legend.position = "none")
p

p <- VlnPlot(scData, features = "percent.mt") + theme(legend.position = "none") +
  geom_hline(yintercept = param$maxMitoPercent, colour = "red")
p

cellsToKeep <- scData$nFeature_RNA > param$minGenesPerCell &
  scData$nFeature_RNA < param$maxGenesPerCell &
  scData$percent.mt < param$maxMitoPercent
scData <- scData[, cellsToKeep]

Here, we filtered cells that have unique feature counts over r param$maxGenesPerCell or less than r param$minGenesPerCell. We filtered cells that have more than r param$maxMitoPercent% mitochondrial counts. In total, r ncol(sce) - ncol(scData) cells were removed and r ncol(scData) cells were kept.

Normalizing the data

After removing unwanted cells from the dataset, the next step is to normalize the data. By default, we employ a global-scaling normalization method "LogNormalize" that normalizes the feature expression measurements for each cell by the total expression, multiplies this by a scale factor (r getSeuratScalingFactor(param$scProtocol) here), and log-transforms the result with natural log.

scData <- NormalizeData(scData,
  normalization.method = "LogNormalize",
  scale.factor = getSeuratScalingFactor(param$scProtocol)
)

Identification of highly variable features (feature selection)

We next calculate a subset of features that exhibit high cell-to-cell variation in the dataset (i.e, they are highly expressed in some cells, and lowly expressed in others). Focusing on these genes in downstream analysis helps to highlight biological signal in single-cell datasets.

The procedure in Seurat3 is described in detail here, and improves on previous versions by directly modeling the mean-variance relationship inherent in single-cell data, and is implemented in the FindVariableFeatures function. By default, we return 2,000 features per dataset. These will be used in downstream analysis, like PCA.

scData <- FindVariableFeatures(scData,
  selection.method = "vst",
  nfeatures = 2000
)
top10 <- head(VariableFeatures(scData), 10)
p1 <- VariableFeaturePlot(scData)
p2 <- LabelPoints(plot = p1, points = top10, repel = TRUE)
p <- plot_grid(p1, p2, labels = "AUTO")
p

Scaling the data

Next, we apply a linear transformation ("scaling") that is a standard pre-processing step prior to dimensional reduction techniques like PCA. The ScaleData function:

Shifts the expression of each gene, so that the mean expression across cells is 0
Scales the expression of each gene, so that the variance across cells is 1
- This step gives equal weight in downstream analyses, so that highly-expressed genes do not dominate

if ("CellCycle" %in% param$vars.to.regress) {
  ## TODO: change gene names for mouse
  scData <- CellCycleScoring(scData,
    s.features = cc.genes$s.genes,
    g2m.features = cc.genes$g2m.genes
  )
  param$vars.to.regress <- setdiff(
    append(
      param$vars.to.regress,
      c("S.Score", "G2M.Score")
    ),
    "CellCycle"
  )
}

if (ezIsSpecified(param$vars.to.regress)) {
  scData <- ScaleData(scData, vars.to.regress = param$vars.to.regress)
} else {
  scData <- ScaleData(scData)
}

Perform linear dimensional reduction

Next we perform PCA on the scaled data. By default, only the previously determined variable features are used as input.

scData <- RunPCA(scData)

Determine the 'dimensionality' of the dataset

To overcome the extensive technical noise in any single feature for scRNA-seq data, Seurat clusters cells based on their PCA scores, with each PC essentially representing a 'metafeature' that combines information across a correlated feature set. The top principal components therefore represent a robust compression of the dataset.

In Macosko et al, we implemented a resampling test inspired by the JackStraw procedure. We randomly permute a subset of the data (1% by default) and rerun PCA, constructing a 'null distribution' of feature scores, and repeat this procedure. We identify 'significant' PCs as those who have a strong enrichment of low p-value features.

The JackStrawPlot function provides a visualization tool for comparing the distribution of p-values for each PC with a uniform distribution (dashed line). 'Significant' PCs will show a strong enrichment of features with low p-values (solid curve above the dashed line).

scData <- JackStraw(scData, num.replicate = 100)
scData <- ScoreJackStraw(scData, dims = 1:20)
p <- JackStrawPlot(scData, dims = 1:20)
p

An alternative heuristic method generates an 'Elbow plot': a ranking of principle components based on the percentage of variance explained by each one (ElbowPlot function).

p <- ElbowPlot(scData, ndims = 50)
p

Cluster the cells

Seurat v3 applies a graph-based clustering approach, building upon initial strategies in Macosko et al. Importantly, the distance metric which drives the clustering analysis (based on previously identified PCs) remains the same. However, our approach to partioning the cellular distance matrix into clusters has dramatically improved. Our approach was heavily inspired by recent manuscripts which applied graph-based clustering approaches to scRNA-seq data SNN-Cliq, Xu and Su, Bioinformatics, 2015 and CyTOF data PhenoGraph, Levine et al., Cell, 2015. Briefly, these methods embed cells in a graph structure - for example a K-nearest neighbor (KNN) graph, with edges drawn between cells with similar feature expression patterns, and then attempt to partition this graph into highly interconnected 'quasi-cliques' or 'communities'.

As in PhenoGraph, we first construct a KNN graph based on the euclidean distance in PCA space, and refine the edge weights between any two cells based on the shared overlap in their local neighborhoods (Jaccard similarity). This step is performed using the FindNeighbors function, and takes as input the previously defined dimensionality of the dataset (first r param$pcs PCs).

scData <- FindNeighbors(scData, dims = 1:param$pcs)
scData <- FindClusters(scData, resolution = param$resolution)

Run non-linear dimensional reduction (UMAP/tSNE)

Seurat offers several non-linear dimensional reduction techniques, such as tSNE and UMAP, to visualize and explore these datasets. The goal of these algorithms is to learn the underlying manifold of the data in order to place similar cells together in low-dimensional space. Cells within the graph-based clusters determined above should co-localize on these dimension reduction plots. As input to the UMAP and tSNE, we suggest using the same PCs as input to the clustering analysis.

scData <- RunUMAP(scData,
  dims = 1:param$pcs,
  n.neighbors = getPerplexity(ncol(scData))
)
scData <- RunTSNE(scData,
  dims = 1:param$pcs,
  perplexity = getPerplexity(ncol(scData))
)

coordinates <- left_join(
  tibble(
    Cells = colnames(scData),
    Cluster = Idents(scData)
  ),
  as_tibble(Embeddings(scData, "umap"), rownames = "Cells")
) %>%
  left_join(as_tibble(Embeddings(scData, "tsne"), rownames = "Cells"))

QC on low-dimensional space

UMAP

FeaturePlot(
  object = scData, features = "nFeature_RNA",
  reduction = "umap"
)
FeaturePlot(
  object = scData, features = "nCount_RNA",
  reduction = "umap"
)
FeaturePlot(
  object = scData, features = "percent.mt",
  reduction = "umap"
)
if('S.Score' %in% colnames(scData[[]])){
  DimPlot(scData, reduction = "umap", group.by = "Phase") +
  ggtitle("Cell cycle")
}
FeaturePlot(scData,
  features = "scDblFinder.score",
  reduction = "umap"
)
DimPlot(scData, reduction = "umap", group.by = "scDblFinder.class") +
  ggtitle("scDblFinder.class")

TSNE

FeaturePlot(
  object = scData, features = "nFeature_RNA",
  reduction = "tsne"
)
FeaturePlot(
  object = scData, features = "nCount_RNA",
  reduction = "tsne"
)
FeaturePlot(
  object = scData, features = "percent.mt",
  reduction = "tsne"
)
if('S.Score' %in% colnames(scData[[]])){
  DimPlot(scData, reduction = "tsne", group.by = "Phase") +
  ggtitle("Cell cycle")
}
FeaturePlot(scData,
  features = "scDblFinder.score",
  reduction = "tsne"
)
DimPlot(scData, reduction = "tsne", group.by = "scDblFinder.class") +
  ggtitle("scDblFinder.class")

Clustering

p1 <- DimPlot(scData, reduction = "umap", label = TRUE)
p2 <- DimPlot(scData, reduction = "tsne", label = TRUE)
p <- plot_grid(p1, p2, labels = "AUTO", ncol = 2)
p

kable(tibble(
  Cluster = names(summary(Idents(scData))),
  "# of cells" = summary(Idents(scData))
),
row.names = FALSE, format = "html",
caption = "Number of cells in each cluster"
) %>%
  kable_styling(
    bootstrap_options = "striped", full_width = F,
    position = "float_right"
  )

Cells annotation

We use Bioconductor package SingleR to annotate cells from a set of reference data. The available references are described here https://bioconductor.org/packages/release/bioc/vignettes/SingleR/inst/doc/SingleR.html#5_available_references.

hpca.se <- switch(param$singleRReference,
  None = "None",
  HumanPrimaryCellAtlasData = HumanPrimaryCellAtlasData(),
  BlueprintEncodeData = BlueprintEncodeData(),
  MonacoImmuneData = MonacoImmuneData(),
  DatabaseImmuneCellExpressionData = DatabaseImmuneCellExpressionData(),
  NovershternHematopoieticData = NovershternHematopoieticData(),

  MouseRNAseqData = MouseRNAseqData(),
  ImmGenData = ImmGenData()
)
if (param$singleRReference == "None") {
  doSingleR <- FALSE
} else {
  doSingleR <- TRUE
}

if (doSingleR) {
  singler.results <- SingleR(
    test = GetAssayData(scData), ref = hpca.se,
    labels = hpca.se$label.fine
  )
  scData[["SingleR.labels"]] <- singler.results$labels
  cellInfo <- tibble(
    Cells = colnames(scData), Cluster = Idents(scData),
    SingleR.labels = scData$SingleR.labels
  ) %>%
    left_join(as_tibble(Embeddings(scData, "umap"), rownames = "Cells")) %>%
    left_join(as_tibble(Embeddings(scData, "tsne"), rownames = "Cells"))
}

nrOfLabels <- length(unique(cellInfo$SingleR.labels))
if (nrOfLabels <= 9) {
  colsLabels <- brewer.pal(nrOfLabels, "Set1")
} else {
  colsLabels <- colorRampPalette(brewer.pal(9, "Set1"))(nrOfLabels)
}
x <- list(title = "UMAP_1", zeroline = FALSE)
y <- list(title = "UMAP_2", zeroline = FALSE)
p1 <- plot_ly(cellInfo,
  x = ~UMAP_1, y = ~UMAP_2, color = ~SingleR.labels,
  text = ~ paste(
    "Cluster: ", Cluster,
    "\nCell: ", Cells,
    "\nSingleR.labels: ", SingleR.labels
  ),
  type = "scatter", mode = "markers", marker = list(size = 5, opacity = 0.5),
  colors = colsLabels
) %>% layout(xaxis = x, yaxis = y)
p1

x <- list(title = "tSNE_1", zeroline = FALSE)
y <- list(title = "tSNE_1", zeroline = FALSE)
p2 <- plot_ly(cellInfo,
  x = ~tSNE_1, y = ~tSNE_2, color = ~SingleR.labels,
  text = ~ paste(
    "Cluster: ", Cluster,
    "\nCell: ", Cells,
    "\nSingleR.labels: ", SingleR.labels
  ),
  type = "scatter", mode = "markers", marker = list(size = 5, opacity = 0.5),
  colors = colsLabels
) %>% layout(xaxis = x, yaxis = y)
p2

Marker genes

We only look at upregulated genes in each cluster, as these are more useful for positive identification of cell types in a heterogeneous population. The minimal fraction in either two populations is 0.25. The minimal logFC is 0.25. The minimal pvalue is r pvalue_allMarkers.

markers <- FindAllMarkers(scData,
  only.pos = TRUE, min.pct = 0.25,
  logfc.threshold = 0.25, return.thresh = pvalue_allMarkers
)
posMarkersFn <- "pos_markers.tsv"
write_tsv(as_tibble(markers), file = posMarkersFn)

if (doEnrichr(param)) {
  genesPerCluster <- split(markers$gene, markers$cluster)
  jsCall <- paste0('enrich({list: "', sapply(genesPerCluster, paste, collapse = "\\n"), '", popup: true});')
  enrichrCalls <- paste0(
    "<a href='javascript:void(0)' onClick='", jsCall,
    "'>Analyse at Enrichr website</a>"
  )
  enrichrTable <- tibble(
    Cluster = names(genesPerCluster),
    "# of markers" = lengths(genesPerCluster),
    "Enrichr link" = enrichrCalls
  )
  kable(enrichrTable,
    format = "html", escape = FALSE,
    caption = paste0("GeneSet enrichment: genes with pvalue ", pvalue_allMarkers)
  ) %>%
    kable_styling("striped", full_width = F, position = "left")
}

caption <- "Expression differences of cluster marker genes"
ezInteractiveTableRmd(markers, digits = 4, title = caption)

Top cluster markers on low-dimensional space

The top 10 (ranked by logFC) markers from each cluster.

eachCluster <- 0
for (eachCluster in unique(markers$cluster)) {
  cat("\n")
  cat("#### Cluster ", eachCluster, "\n")
  cat("\n")
  markersPerCluster <- dplyr::filter(markers, cluster == eachCluster) %>%
    dplyr::arrange(desc(avg_log2FC)) %>%
    select(gene) %>%
    pull()
  markersPerCluster <- head(markersPerCluster, 10)
  eachMarker <- markersPerCluster[1]
  for (eachMarker in markersPerCluster) {
    p <- VlnPlot(object = scData, features = eachMarker) + theme(legend.position = "none")
    print(p)
    p <- FeaturePlot(object = scData, features = eachMarker, reduction = "umap")
    print(p)
    p <- FeaturePlot(object = scData, features = eachMarker, reduction = "tsne")
    print(p)
  }
  cat("\n")
}

Heatmap of cluster marker genes

Max 10 marker genes are shown for each cluster.

top10 <- markers %>%
  group_by(cluster) %>%
  top_n(10, avg_log2FC)
DoHeatmap(scData, features = top10$gene) + NoLegend()

Expression of selected genes

Each dot is a cell. On the right-hand plot, the color corresponds to the gene expression levels.

if (length(param$knownMarkers) > 0) {
  for (i in 1:length(param$knownMarkers)) {
    cat("\n")
    cat(paste("####", names(param$knownMarkers)[i]))
    cat("\n")
    check_markers <- param$knownMarkers[[i]]
    check_markers <- rownames(scData)[toupper(rownames(scData)) %in%
      toupper(check_markers)]
    if (length(check_markers) > 0) {
      for (j in 1:length(check_markers)) {
        p <- VlnPlot(object = scData, features = check_markers[j]) +
          theme(legend.position = "none")
        print(p)

        p <- FeaturePlot(object = scData, features = check_markers[j], reduction = "umap")
        print(p)

        p <- FeaturePlot(object = scData, features = check_markers[j], reduction = "tsne")
        print(p)
      }
    }
    cat("\n")
  }
}

Trajectory reconstruction and pseudotime inference

fastICA

The fastICA performs ndependent Component Analysis (ICA) and Projection Pursuit. The algorithm can be used to find the projection pursuit directions. Projection pursuit is a technique for finding 'interesting' directions in multi-dimensional datasets. These projections and are useful for visualizing the dataset and in density estimation and regression. Interesting directions are those which show the least Gaussian distribution, which is what the FastICA algorithm does.

if (param$runPseudoTime) {
  cd <- as.matrix(GetAssayData(scData))
  set.seed(10)
  a <- fastICA(cd, 2,
    alg.typ = "parallel", fun = "logcosh",
    alpha = 1, method = "C", row.norm = FALSE, maxit = 200,
    tol = 0.0001, verbose = FALSE
  )
  scData$fastICA_pseudotime <- a$A[1, ]
}

if (param$runPseudoTime) {
  coordinates <- left_join(
    coordinates,
    enframe(scData$fastICA_pseudotime,
      name = "Cells",
      value = "fastICA_pseudotime"
    )
  )
  p <- ggplot(coordinates) +
    aes(UMAP_1, UMAP_2) +
    geom_point(
      shape = 21, size = 3, colour = "black",
      aes(fill = fastICA_pseudotime), alpha = 0.7
    ) +
    scale_fill_gradient(low = "yellow", high = "blue") +
    xlab("UMAP_1") +
    ylab("UMAP_2") +
    theme_minimal_grid()
  print(p)

  p <- ggplot(coordinates) +
    aes(tSNE_1, tSNE_2) +
    geom_point(
      shape = 21, size = 3, colour = "black",
      aes(fill = fastICA_pseudotime), alpha = 0.7
    ) +
    scale_fill_gradient(low = "yellow", high = "blue") +
    xlab("tSNE_1") +
    ylab("tSNE_2") +
    theme_minimal_grid()
  print(p)
}

Interactive explorer

qs::qsave(x = scData, file = basename(output$getColumn("Live Report")), nthreads = 4)

Shiny explorer

Data availability

tr_cnts <- expm1(GetAssayData(scData))
geneMeans <- rowsum(t(as.matrix(tr_cnts)), group = Idents(scData))
geneMeans <- sweep(geneMeans, 1,
  STATS = table(Idents(scData))[rownames(geneMeans)], FUN = "/"
)
geneMeans <- log1p(t(geneMeans))
colnames(geneMeans) <- paste("cluster", colnames(geneMeans), sep = "_")
geneMeanPerClusterFn <- "gene_means_per_cluster.txt"
write_tsv(as_tibble(geneMeans, rownames = "gene_name"),
  file = geneMeanPerClusterFn
)

geneMeans <- Matrix::rowMeans(tr_cnts)
geneMeans <- log1p(geneMeans)
geneMeansFn <- "gene_means.txt"
write_tsv(enframe(geneMeans, name = "gene_name", value = "gene_expression"),
  file = geneMeansFn
)

coordinatesFn <- "coordinates_data.tsv"
write_tsv(coordinates, file = coordinatesFn)

Average gene expression per cluster: r geneMeanPerClusterFn
Average gene expression: r geneMeansFn
Positive markers of all clusters: r posMarkersFn
Coordinates of cells in low dimensional space: r coordinatesFn

Parameters

param[c(
  "minCellsPerGene", "minGenesPerCell", "maxGenesPerCell", "maxMitoPercent",
  "pcs", "vars.to.regress", "resolution"
)]

SessionInfo

sessionInfo()

uzh/ezRun documentation built on June 14, 2025, 1:29 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

uzh/ezRun
An R meta-package for the analysis of Next Generation Sequencing Data

In uzh/ezRun: An R meta-package for the analysis of Next Generation Sequencing Data

Clustering Results {.tabset}

Preprocessing

Quality control on the cells

Normalizing the data

Identification of highly variable features (feature selection)

Scaling the data

Perform linear dimensional reduction

Determine the 'dimensionality' of the dataset

Cluster the cells

Run non-linear dimensional reduction (UMAP/tSNE)

QC on low-dimensional space

UMAP

TSNE

Clustering

Cells annotation

Marker genes

Top cluster markers on low-dimensional space

Heatmap of cluster marker genes

Expression of selected genes

Trajectory reconstruction and pseudotime inference

fastICA

Interactive explorer

Data availability

Parameters

SessionInfo

R Package Documentation

Browse R Packages

We want your feedback!

uzh/ezRun An R meta-package for the analysis of Next Generation Sequencing Data

In uzh/ezRun: An R meta-package for the analysis of Next Generation Sequencing Data

Clustering Results {.tabset}

Preprocessing

Quality control on the cells

Normalizing the data

Identification of highly variable features (feature selection)

Scaling the data

Perform linear dimensional reduction

Determine the 'dimensionality' of the dataset

Cluster the cells

Run non-linear dimensional reduction (UMAP/tSNE)

QC on low-dimensional space

UMAP

TSNE

Clustering

Cells annotation

Marker genes

Top cluster markers on low-dimensional space

Heatmap of cluster marker genes

Expression of selected genes

Trajectory reconstruction and pseudotime inference

fastICA

Interactive explorer

Data availability

Parameters

SessionInfo

R Package Documentation

Browse R Packages

We want your feedback!

uzh/ezRun
An R meta-package for the analysis of Next Generation Sequencing Data