In bioinfocz/scdrake: A pipeline for droplet-based single-cell RNA-seq data secondary analysis implemented in the drake Make-like toolkit for R language

suppressPackageStartupMessages(library(magrittr))
if (rlang::is_true(getOption("knitr.in.progress"))) {
  params_ <- scdrake::scdrake_list(params)
}
drake_cache_dir <- params_$drake_cache_dir

drake::loadd(
  config_main, config_norm_clustering, doublet_score,
  dimred_plots_clustering_files, dimred_plots_clustering_files_out,
  dimred_plots_clustering_united_files, dimred_plots_clustering_united_files_out,
  cluster_graph_louvain_clustree_file, cluster_graph_leiden_clustree_file,
  cluster_sc3_clustree_file, cluster_sc3_cluster_stability_plots_file,
  cluster_kmeans_kbest_k, cluster_kmeans_k_clustree_file, cluster_kmeans_kbest_gaps_plot_file,
  dimred_plots_other_vars_files, dimred_plots_other_vars_files_out,
  selected_markers_plots_files, selected_markers_plots_files_out,
  dimred_plots_cell_annotation_files, dimred_plots_cell_annotation_files_out,
  cell_annotation_diagnostic_plots, cell_annotation_diagnostic_plots_files,
  dimred_plots_cell_annotation_files, dimred_plots_cell_annotation_files,

  path = drake_cache_dir
)

cfg <- config_norm_clustering

sce <- drake::readd(sce_final_norm_clustering, path = drake_cache_dir)
sce_metadata <- S4Vectors::metadata(sce)
sce_colData <- SingleCellExperiment::colData(sce)
sce_rowData <- SingleCellExperiment::rowData(sce)
cc_genes_valid <- all(!is.na(sce_colData$phase))
normalization_type_sce <- sce_metadata$normalization_type
hvg_metric <- sce_metadata$hvg_metric
hvg_selection <- sce_metadata$hvg_selection
hvg_selection_value <- sce_metadata$hvg_selection_value
report_html_file <- cfg$NORM_CLUSTERING_REPORT_HTML_FILE

any_clustering_enabled <- any(
  cfg$CLUSTER_GRAPH_LOUVAIN_ENABLED, cfg$CLUSTER_GRAPH_WALKTRAP_ENABLED, cfg$CLUSTER_GRAPH_LEIDEN_ENABLED,
  cfg$CLUSTER_KMEANS_K_ENABLED, cfg$CLUSTER_KMEANS_KBEST_ENABLED,
  cfg$CLUSTER_SC3_ENABLED
)

Input data overview

Just to review data from the preceding pipeline step (01 - quality control):

cat(drake::readd(sce_final_input_qc_info, path = drake_cache_dir)$str)

Cell cycle phase assignment

Assign each cell a score, based on its expression of G2/M and S phase markers. These marker sets should be anticorrelated in their expression levels, and cells expressing neither are likely not cycling and in G1 phase. In some cases, cell cycle can be the primary factor determining the cell heterogeneity and effectively masking the differences between cell subpopulations we want to study.

You can view the assigned cell cycle phases in the Dimensionality_reduction_plots section below.

if (!cc_genes_valid) {
  scdrake::catn("**Cell cycle score and phases could not be computed for your dataset.** It is possible that it doesn't have any expressed cell-cycle genes.")
}

r scdrake::format_used_functions(c("Seurat::cc.genes.updated.2019", "Seurat::CellCycleScoring()"))

Normalization

Systematic differences in sequencing coverage between libraries are often observed in single-cell RNA sequencing data. They typically arise from technical differences in cDNA capture or PCR amplification efficiency across cells, attributable to the difficulty of achieving consistent library preparation with minimal starting material. Normalization aims to remove these differences such that they do not interfere with comparisons of the expression profiles between cells. This ensures that any observed heterogeneity or differential expression within the cell population are driven by biology and not technical biases.

More information in OSCA

Used normalization method: "r normalization_type_sce"

if (cfg$NORMALIZATION_TYPE == "none") {
  cat("(Normalization was performed in a previous pipeline run.)\n\n")
}

if (normalization_type_sce == "scran") {
  cat("**`scran`: normalization by deconvolution**\n\n")
  cat(
    "Cell-specific biases are normalized using the `scuttle::computePooledFactors()` method,",
    "which implements the deconvolution strategy for scaling normalization",
    "([*A. T. Lun, Bach, and Marioni 2016*](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0947-7)).",
    "This computes size factors that are used to scale the counts in each cell.",
    "The assumption is that most genes are not differentially expressed (DE) between cells,",
    "such that any differences in expression across the majority of genes represents some technical bias that should be removed."
  )
  used_functions <- c("scuttle::computePooledFactors()", "scuttle::logNormCounts()")
  if (cfg$SCRAN_USE_QUICKCLUSTER) {
    used_functions <- c("scran::quickCluster()", used_functions)
  }
  scdrake::format_used_functions(used_functions, do_cat = TRUE)
} else if (normalization_type_sce == "sctransform") {
  cat("**`sctransform`: regularized negative binomial regression to normalize UMI count data.**\n\n")
  cat("[*Hafemeister & Satija 2019*](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1874-1)\n\n")
  scdrake::format_used_functions("Seurat::SCTransform()", do_cat = TRUE)
}

Highly variable genes (HVGs) selection

We often use scRNA-seq data in exploratory analyses to characterize heterogeneity across cells. Procedures like clustering and dimensionality reduction compare cells based on their gene expression profiles, which involves aggregating per-gene differences into a single (dis)similarity metric between a pair of cells. The choice of genes to use in this calculation has a major impact on the behavior of the metric and the performance of downstream methods. We want to select genes that contain useful information about the biology of the system while removing genes that contain random noise. This aims to preserve interesting biological structure without the variance that obscures that structure, and to reduce the size of the data to improve computational efficiency of later steps.

In STR, we can identify spatially variable genes (SVGs). We define SVGs as genes with spatially correlated patterns of expression across the tissue area. Based on paper from Li et al 2021 we decided to generate a combined set of HVGs and Spatialy variable genes (SVGs).

More information in OSCA

scdrake::catg0('**HVG metric: "{hvg_metric}"**\n\n')

if (hvg_metric == "gene_var") {
  var_field <- "bio"

  cat(
    "`scran::modelGeneVar()` models the variance of the log-expression profiles for each gene,",
    "decomposing it into technical and biological components based on a fitted mean-variance trend.\n\n"
  )

  scdrake::plot_hvg_fit(sce, "var")
  hvg_used_functions <- "scran::modelGeneVar()"
} else if (hvg_metric == "gene_cv2") {
  var_field <- "ratio"

  cat(
    "`scran::modelGeneCV2()` models the squared coefficient of variation (CV2) of the normalized expression profiles for each gene,",
    "fitting a trend to account for the mean-variance relationship across genes.\n\n"
  )

  scdrake::plot_hvg_fit(sce, "cv2")
  hvg_used_functions <- "scran::modelGeneCV2()"
} else if (hvg_metric == "sctransform") {
  hvg_used_functions <- "Seurat::SCTransform()"
  scdrake::catg0("HVGs (n = {cfg$SCT_N_HVG}) were selected by the `sctransform` method.")
}

if (cc_genes_valid && !rlang::is_null(sce_metadata$hvg_rm_cc_genes) && sce_metadata$hvg_rm_cc_genes) {
  scdrake::catg0(
    "Using the percentage of variance explained by the cell cycle phase in the expression profile for each gene, ",
    "we removed {length(sce_metadata$hvg_rm_cc_genes_ids)} genes with percentage > ",
    "{sce_metadata$hvg_cc_genes_var_expl_threshold} prior to HVG selection.\n\n",
    "This strategy is further explained in [OSCA](https://bioconductor.org/books/3.15/OSCA.advanced/cell-cycle-assignment.html#removing-cell-cycle-related-genes)"
  )

  phase_variance_explained <- sce_rowData[sce_rowData$is_cc_related, c("ENSEMBL", "SYMBOL", "phase_variance_explained")] %>%
    as.data.frame() %>%
    dplyr::arrange(-phase_variance_explained) %>%
    scdrake::render_bootstrap_table(row.names = FALSE) %>%
    as.character()

  cat("<details>\n  <summary class='used-functions'>Show cell cycle-related genes \u25be</summary>\n\n")
  cat(phase_variance_explained)
  cat("\n\n</details>")

  p_phase_var_explained <- ggplot2::ggplot(sce_rowData %>% as.data.frame()) +
    ggplot2::geom_histogram(ggplot2::aes(x = phase_variance_explained), binwidth = 0.5) +
    ggplot2::geom_vline(xintercept = sce_metadata$hvg_cc_genes_var_expl_threshold, color = "red") +
    ggplot2::scale_x_continuous(breaks = seq_len(ceiling(max(sce_rowData$phase_variance_explained)))) +
    ggplot2::ggtitle(
      "Histogram of variance explained by cell cycle phase",
      subtitle = stringr::str_wrap("Genes on the right of the red line are marked as cell cycle-related and removed from HVGs.")
    ) +
    ggplot2::theme_bw()
  print(p_phase_var_explained)

  hvg_used_functions <- c(hvg_used_functions, "scater::getVarianceExplained()")
}

if (hvg_metric %in% c("gene_var", "gene_cv2")) {
  scdrake::catg0('\n\nBased on "{hvg_metric}", HVGs were selected by: ')

  if (hvg_selection == "top") {
    scdrake::catg0("top {hvg_selection_value} HVGs.\n\n")
  } else if (hvg_selection == "significance") {
    scdrake::catg0("FDR < {hvg_selection_value}\n\n")
  } else if (hvg_selection == "threshold") {
    scdrake::catg0("variance or CV2 > {hvg_selection_value}\n\n")
  }

  scdrake::catg0("**Found {length(sce_metadata$hvg_ids)} HVGs.**\n\n")
}

Plot of HVGs:

drake::readd(hvg_plot, path = drake_cache_dir)

r scdrake::format_used_functions(hvg_used_functions)

Doublet score assignment

The scran::doubletCluster() function identifes clusters with expression profiles lying between two other clusters. Considering every possible triplet of clusters, the method uses the number of DE genes, the median library size, and the proporion of cells in the cluster to mark clusters as possible doublets.

if (normalization_type_sce == "scran" && cfg$SCRAN_USE_QUICKCLUSTER) {
  cat("Prior to normalization, quick clustering was performed. We can use those clusters to look at doublet score within them:\n\n")
  boxplot(doublet_score ~ cluster_quickcluster, data = sce_colData)
}

if (rlang::is_true(sce_metadata$has_filtered_doublets)) {
  n_doublets <- sum(sce_colData$is_doublet)
  doublets_pct <- (n_doublets / ncol(sce)) * 100
  scdrake::catn(
    glue::glue("**Discarded {n_doublets} cells ({doublets_pct} % of all cells) with doublet score above {sce_metadata$max_doublet_score}**")
  )
} else {
  scdrake::catn(glue::glue("**Cells were not filtered by doublet score.**"))
}

r scdrake::format_used_functions("scDblFinder::computeDoubletDensity()")

Dimensionality reduction

As the name suggests, dimensionality reduction aims to reduce the number of separate dimensions in the data. See this chapter in OSCA that provides an intuitive explanation of the motivation behind, along with basic introduction to PCA, t-SNE and UMAP.

PCA

Principal components analysis (PCA) discovers axes in high-dimensional space that capture the largest amount of variation. In case of scRNA-seq, we basically compress multiple features into several dimensions. This reduces computational work in downstream analyses like clustering and other DR methods (UMAP and t-SNE), as calculations only need to be performed for a few dimensions rather than thousands of genes. It also reduces noise by averaging across multiple genes to obtain a more precise representation of the patterns in the data.

PCs selection

By definition, the top PCs capture the dominant factors of heterogeneity in the data set. In the context of scRNA-seq, our assumption is that biological processes affect multiple genes in a coordinated manner. This means that the earlier PCs are likely to represent biological structure as more variation can be captured by considering the correlated behavior of many genes. We use the earlier PCs in our downstream analyses, which concentrates the biological signal to simultaneously reduce computational work and remove noise.

There are several methods how to select the first PCs:

Elbow point method: a simple heuristic for choosing PCs involves identifying the elbow point in the percentage of variance explained by successive PCs. This refers to the "elbow" in the curve of a scree plot as shown.
Technical variance method: use the technical component estimates to determine the proportion of variance that should be retained. This is implemented in scran::denoisePCA(), which takes the estimates returned by scran::modelGeneVar().
Forced: use a predefined number of PCs.

drake::readd(pca_selected_pcs_plot, path = drake_cache_dir)

r sce_metadata$pca_selected_pcs PCs were selected using the "r sce_metadata$pca_selection_method" method

r scdrake::format_used_functions(c("scater::runPCA()", "PCAtools::findElbowPoint()", "scran::getDenoisedPCs()"))

if (any_clustering_enabled) {
  cat(knitr::knit_child(here::here("Rmd/common/clustering/clustering.Rmd"), quiet = TRUE))
  cat("\n\n#\n\n***\n\n")
}

if (!is.null(cfg$NORM_CLUSTERING_REPORT_DIMRED_PLOTS_OTHER)) {
  res <- scdrake::generate_dimred_plots_section(
    dimred_plots_other_vars_files = dimred_plots_other_vars_files,
    selected_markers_plots_files = selected_markers_plots_files,
    dimred_plots_rel_start = fs::path_dir(cfg$NORM_CLUSTERING_REPORT_HTML_FILE),
    selected_markers_files_rel_start = fs::path_dir(cfg$NORM_CLUSTERING_REPORT_HTML_FILE),
    main_header = "Dimensionality reduction plots"
  )

  cat("\n\n#\n\n***\n\n")
}

if (!is.null(cfg$CELL_ANNOTATION_SOURCES)) {
  cell_annotation_text <- str_space(
    "We used the [SingleR](https://bioconductor.org/packages/3.15/bioc/html/SingleR.html) package to predict cell types in the dataset.",
    "Given a reference dataset of samples (single-cell or bulk) with known labels, `SinglerR` assigns those labels to",
    "new cells from a test dataset based on similarities in their expression profiles.",
    "You can find more information in the [SingleR book](https://bioconductor.org/books/3.15/SingleRBook/).\n\n",
    "The used references are shown below in the tabs. Each have several diagnostic plots:\n\n",
    "- Score heatmaps show distribution of predicted cell types in computed clusters (if any), along with per-cell annotation scores\n",
    "- Marker heatmaps show genes that are markers for a given cell type in both the reference and current datasets,",
    "i.e. those markers have driven the decision to label cells by the chosen cell type\n",
    "- Delta scores show poor-quality or ambiguous assignments based on the per-cell 'delta', i.e., the difference between",
    "the score for the assigned label and the median across all labels for each cell.",
    "See [OSCA](https://bioconductor.org/books/3.15/SingleRBook/annotation-diagnostics.html#based-on-the-deltas-across-cells) for more details"
  )

  res <- scdrake::generate_cell_annotation_plots_section(
    dimred_plots_cell_annotation_files = dimred_plots_cell_annotation_files,
    cell_annotation_diagnostic_plots = cell_annotation_diagnostic_plots,
    dimred_plots_rel_start = fs::path_dir(cfg$NORM_CLUSTERING_REPORT_HTML_FILE),
    cell_annotation_diagnostic_plots_rel_start = fs::path_dir(cfg$NORM_CLUSTERING_REPORT_HTML_FILE),
    main_header = "Cell annotation",
    text = cell_annotation_text
  )

  cat("\n\n#\n\n***\n\n")
}

if (isTRUE(cfg$MANUAL_ANNOTATION)) {
 scdrake::md_header("Marker-based cell annotation", 1, extra = "{.tabset}")
   scdrake::catn(
    glue::glue("**Annotation was done for {cfg$ANNOTATION_CLUSTERING}**"))
   cat("\n\n")
  cat("For marker-based annotation we modified an implemented function from the Giotto package. The enrichment Z score is calculated by using method (PAGE) from Kim SY et al., BMC bioinformatics, 2005 as $$ Z = \frac{((Sm – mu)*m^\frac{1}{2})}{delta} $$. \n
 For each gene in each spot/cell, mu is the fold change values versus the mean expression
 and delta is the standard deviation. Sm is the mean fold change value of a specific marker gene set
 and  m is the size of a given marker gene set.")
}
  cat("\n\n")
if (isTRUE(cfg$SPATIAL)) {
scdrake::create_a_link(href = drake::readd(plot_annotation)$anot_plot_out_pdf_file[3],text = "**Cell annotation plot**", href_rel_start = fs::path_dir(cfg$NORM_CLUSTERING_REPORT_HTML_FILE),do_cat = TRUE)  
 cat("\n\n")
 scdrake::create_a_link(href = drake::readd(plot_annotation)$anot_plot_out_pdf_file[4],text = "**Enrichment cells plot**", href_rel_start = fs::path_dir(cfg$NORM_CLUSTERING_REPORT_HTML_FILE),do_cat = TRUE)  
 cat("\n\n")
} 
scdrake::create_img_link(drake::readd(plot_annotation)$anot_plot_out_pdf_file[1],img_src = drake::readd(plot_annotation)$anot_plot_out_png_file[1],img_width = "450px",href_rel_start = fs::path_dir(cfg$NORM_CLUSTERING_REPORT_HTML_FILE),do_cat = TRUE)


scdrake::create_img_link(drake::readd(plot_annotation)$anot_plot_out_pdf_file[2],img_src = drake::readd(plot_annotation)$anot_plot_out_png_file[2],img_width = "450px",href_rel_start = fs::path_dir(cfg$NORM_CLUSTERING_REPORT_HTML_FILE),do_cat = TRUE)

Show input parameters

Main config

print(config_main)

Normalization and clustering config

print(cfg)

bioinfocz/scdrake documentation built on Sept. 19, 2024, 4:43 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

bioinfocz/scdrake
A pipeline for droplet-based single-cell RNA-seq data secondary analysis implemented in the drake Make-like toolkit for R language

In bioinfocz/scdrake: A pipeline for droplet-based single-cell RNA-seq data secondary analysis implemented in the drake Make-like toolkit for R language

Input data overview

Cell cycle phase assignment

Normalization

Highly variable genes (HVGs) selection

Doublet score assignment

Dimensionality reduction

PCA

PCs selection

Main config

Normalization and clustering config

R Package Documentation

Browse R Packages

We want your feedback!

bioinfocz/scdrake A pipeline for droplet-based single-cell RNA-seq data secondary analysis implemented in the drake Make-like toolkit for R language

In bioinfocz/scdrake: A pipeline for droplet-based single-cell RNA-seq data secondary analysis implemented in the drake Make-like toolkit for R language

Input data overview

Cell cycle phase assignment

Normalization

Highly variable genes (HVGs) selection

Doublet score assignment

Dimensionality reduction

PCA

PCs selection

Main config

Normalization and clustering config

R Package Documentation

Browse R Packages

We want your feedback!

bioinfocz/scdrake
A pipeline for droplet-based single-cell RNA-seq data secondary analysis implemented in the drake Make-like toolkit for R language