In bioinfocz/scdrake: A pipeline for droplet-based single-cell RNA-seq data secondary analysis implemented in the drake Make-like toolkit for R language

suppressPackageStartupMessages(library(magrittr))
if (rlang::is_true(getOption("knitr.in.progress"))) {
  params_ <- scdrake::scdrake_list(params)
}
drake_cache_dir <- params_$drake_cache_dir

drake::loadd(
  config_main, config_input_qc, empty_droplets, sce_valid_cells_info, barcode_ranks,
  qc_filter, custom_filter, sce_qc_filter_rowSums, sce_custom_filter_rowSums,
  path = drake_cache_dir
)

cfg <- config_input_qc
empty_droplets_enabled <- cfg$EMPTY_DROPLETS_ENABLED
cell_filtering_enabled <- cfg$ENABLE_CELL_FILTERING
gene_filtering_enabled <- cfg$ENABLE_GENE_FILTERING

input_type <- cfg$INPUT_DATA$type
filtering_type <- ifelse(cfg$SAVE_DATASET_SENSITIVE_FILTERING, "dataset-sensitive", "custom")

if (input_type == "cellranger") {
  scdrake::md_header("Input data: 10x Genomics Cell Ranger data", 1)
  cat(scdrake::str_space(
    "The feature-barcode matrix was imported from",
    "[Cell Ranger](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger)",
    "output (the official quantification tool from 10x Genomics)."
  ))
} else if (input_type == "table") {
  scdrake::md_header("Input data: delimited text (table)", 1)
  cat("The feature-barcode matrix was imported from a delimited file.")
} else if (input_type == "sce") {
  scdrake::md_header("Input data: `SingleCellExperiment` object", 1)
  cat("The object holding experimental data (feature-barcode matrix, gene annotation etc.) was imported from a Rds file.")
}

Each row of the feature-barcode matrix corresponds to a gene, while each column corresponds to a cell barcode. Summary of imported data:

cat(drake::readd(sce_raw_info, path = drake_cache_dir)$str)

`r scdrake::format_used_functions("DropletUtils::read10xCounts()")`

# Empty droplets

Empty droplets often contain RNA from the ambient solution, resulting in non-zero counts after debarcoding. Such droplets should be discarded.

## Barcode rank plot

A useful diagnostic for droplet-based data is the barcode rank plot, which shows the total UMI (log-)count for each barcode on the y-axis and the (log-)rank on the x-axis. This is effectively a transposed empirical cumulative density plot with log-transformed axes. It is useful because it lets us examine the distribution of total UMI counts across barcodes, focusing on those with the largest counts.

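# Keep one point per distinct rank so that tied barcodes are not overplotted.
uniq <- !duplicated(barcode_ranks$rank)
plot(barcode_ranks$rank[uniq], barcode_ranks$total[uniq], log = "xy", xlab = "Rank", ylab = "Total")
# Overlay the fitted values, ordered by rank.
o <- order(barcode_ranks$rank)
lines(barcode_ranks$rank[o], barcode_ranks$fitted[o], col = "red")

# Mark the knee and inflection points stored in the object's metadata.
abline(h = S4Vectors::metadata(barcode_ranks)$knee, col = "dodgerblue", lty = 2)
abline(h = S4Vectors::metadata(barcode_ranks)$inflection, col = "forestgreen", lty = 2)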
if (empty_droplets_enabled) {
  abline(h = cfg$EMPTY_DROPLETS_LOWER, col = "firebrick", lty = 2)
  legend(
    "bottomleft",
    lty = 2,
    col = c("dodgerblue", "forestgreen", "firebrick"),
    legend = c("knee", "inflection", "emptyDroplets lower bound")
  )
} else {
  legend(
    "bottomleft",
    lty = 2,
    col = c("dodgerblue", "forestgreen"),
    legend = c("knee", "inflection")
  )
}

The knee and inflection points on the curve mark the transition between two components of the total UMI count distribution. This is assumed to represent the difference between empty droplets with little RNA and cell-containing droplets with much more RNA.
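The two quantities on the plot axes can be reproduced on toy data with base R. This is only a sketch: the barcode names and totals are made up, and the average-rank handling of ties is an assumption about how the `barcode_ranks` object was computed.

```r
# Hypothetical total UMI counts for five barcodes.
totals <- c(AAAC = 5000, AAAG = 12, AACC = 4800, AACG = 3, AAGT = 12)

# Rank barcodes from the largest total to the smallest; tied barcodes share
# an average rank (assumed tie handling).
ranks <- rank(-totals, ties.method = "average")

# One point per distinct rank is enough for plotting.
uniq <- !duplicated(ranks)
rank_table <- data.frame(rank = ranks[uniq], total = totals[uniq])
print(rank_table)
```

Plotting `rank_table$total` against `rank_table$rank` with `log = "xy"` gives the same kind of curve as above, only with far fewer points.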

if (empty_droplets_enabled) {
  cat(
    "The emptyDroplets lower bound is the total UMI count at or below which all barcodes",
    "are assumed to correspond to empty droplets."
  )
} else {
  cat("Removal of empty droplets was disabled. You can enable it by setting the `EMPTY_DROPLETS_ENABLED` parameter to `TRUE`.")
}
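
As a small illustration of what the lower bound means, the sketch below flags barcodes whose total UMI count is at or below a bound. The totals and the bound of 100 are hypothetical values, not taken from `cfg$EMPTY_DROPLETS_LOWER`.

```r
# Hypothetical total UMI counts per barcode and a hypothetical lower bound.
totals <- c(bc1 = 5000, bc2 = 100, bc3 = 42, bc4 = 101)
lower <- 100

# Barcodes at or below the bound are assumed to be empty droplets.
assumed_empty <- totals <= lower
print(names(totals)[assumed_empty])
```

Note the inclusive comparison: a barcode with exactly `lower` UMIs is treated as empty.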


***

# Gene + Cell quality filtering

## Pre-filtering QC

Given sets of mitochondrial and ribosomal genes in the data, the `scuttle` package automatically calculates
several per-cell QC metrics:

- Number of UMIs.
- Number of detected genes (non-zero UMI count).
- Percentage of UMIs assigned to mitochondrial genes ($\frac{UMI_{mitochondrial}}{UMI_{sum}} \times 100$).
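
To make the three metrics concrete, here is a base R sketch on a toy count matrix. The gene names, the counts, and the `^MT-` prefix rule for spotting mitochondrial genes are assumptions for illustration; the column names are illustrative too.

```r
# Toy count matrix: genes in rows, cell barcodes in columns (hypothetical values).
counts <- matrix(
  c(10, 0, 5,
     0, 0, 2,
     5, 4, 0,
     5, 0, 3),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("MT-CO1", "ACTB", "GAPDH", "MT-ND1"), c("cell1", "cell2", "cell3"))
)

# Assumption: mitochondrial genes are recognised by the "MT-" symbol prefix.
is_mito <- grepl("^MT-", rownames(counts))

qc <- data.frame(
  sum = colSums(counts),           # number of UMIs per cell
  detected = colSums(counts > 0),  # genes with a non-zero UMI count
  mito_percent = colSums(counts[is_mito, , drop = FALSE]) / colSums(counts) * 100
)
print(qc)
```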

Then we can use two different methods to filter cells based on the metrics above:

- **Custom filtering**: a standard approach is to remove cells with a low number of reads and to keep only genes that are
detected in at least a certain number of cells, using fixed thresholds. While simple, fixed thresholds require
knowledge of the experiment and of the experimental protocol.
- **Dataset-sensitive filtering**: an alternative approach is to use adaptive, data-driven thresholds to identify
outlying cells, based on the set of QC metrics just calculated. We identify cells that are outliers for the various
QC metrics, based on the median absolute deviation (MAD) from the median value of each QC metric across all cells.
Specifically, a value is considered an outlier if it is more than `r cfg$MAD_THRESHOLD` MADs from the median in
the "problematic" direction.
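
The two approaches can be compared on toy data. The sketch below is not the pipeline's implementation: `is_outlier()` is a hypothetical helper, the QC values and fixed cut-offs are made up, `nmads` stands in for `cfg$MAD_THRESHOLD`, and the log-transform of total UMI counts is a common but assumed choice.

```r
# Hypothetical per-cell QC values; the last cell is deliberately extreme.
mito_percent <- c(2.1, 3.0, 2.5, 1.8, 2.7, 3.2, 2.2, 25.0)
total_umi <- c(5200, 4800, 6100, 5500, 5000, 4700, 5900, 150)
nmads <- 3  # stands in for cfg$MAD_THRESHOLD

# Hypothetical helper: flag values more than `nmads` MADs away from the
# median, but only in the "problematic" direction.
is_outlier <- function(x, nmads, type = c("higher", "lower")) {
  type <- match.arg(type)
  med <- stats::median(x)
  dev <- nmads * stats::mad(x)  # mad() scales by 1.4826 by default
  if (type == "higher") x > med + dev else x < med - dev
}

# Dataset-sensitive filtering: high mitochondrial percentage or low
# (log-transformed) total UMI count.
discard_qc <- is_outlier(mito_percent, nmads, "higher") |
  is_outlier(log10(total_umi), nmads, "lower")

# Custom filtering with fixed, hypothetical thresholds.
discard_custom <- mito_percent > 10 | total_umi < 1000

print(data.frame(mito_percent, total_umi, discard_qc, discard_custom))
```

On this toy data both methods flag only the last cell. On real data they can disagree, because the MAD-based thresholds move with the distribution of each metric.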

Additionally, an extremely high number of detected genes could indicate doublets (more sensitive doublet detection is
done after library normalization). However, depending on the cell type composition of your sample,
one cell type may legitimately have a higher number of detected genes (and also higher counts).

Now we can plot some of the QC features. Cells are colored by `discard_qc`, which indicates whether a cell would be
discarded by MAD thresholding on a QC metric.

```r
cowplot::plot_grid(plotlist = drake::readd(sce_unfiltered_plotlist, path = drake_cache_dir), ncol = 2)
```

`r scdrake::format_used_functions("scuttle::perCellQCMetrics()")`

## Filtering {.tabset}

### Dataset-sensitive filters

#### Cell filtering

```r
if (!cell_filtering_enabled) cat("Cell filtering was disabled.")
```

#### Gene filtering

```r
if (!gene_filtering_enabled) cat("Gene filtering was disabled.")
```

### Custom filters

#### Cell filtering

```r
if (!cell_filtering_enabled) cat("Cell filtering was disabled.")
```

#### Gene filtering

```r
if (!gene_filtering_enabled) cat("Gene filtering was disabled.")
```

## Post-filtering QC

Final filtering selection: using `r filtering_type` filtering.

cat(drake::readd(sce_final_input_qc_info, path = drake_cache_dir)$str)

### Cell and gene number history

scdrake::render_bootstrap_table(drake::readd(sce_history, path = drake_cache_dir), full_width = FALSE, position = "left")

print(drake::readd(sce_history_plot, path = drake_cache_dir))

### Dataset-sensitive filtering

Plots of QC metrics after dataset-sensitive filtering. `discard_custom` indicates whether a given cell was discarded by custom filtering.

cowplot::plot_grid(plotlist = drake::readd(sce_qc_filter_genes_plotlist, path = drake_cache_dir), ncol = 2)

### Filtering based on custom filters

Plots of QC metrics after custom filtering. `discard_qc` indicates whether a given cell was discarded by dataset-sensitive filtering.

cowplot::plot_grid(plotlist = drake::readd(sce_custom_filter_genes_plotlist, path = drake_cache_dir), ncol = 2)

# Gene annotation

- Used annotation package: `r config_main$ANNOTATION_PKG` (v`r sessioninfo::package_info(config_main$ANNOTATION_PKG, dependencies = FALSE)$loadedversion`)
- If a single ENSEMBL ID has multiple symbols, gene descriptions, or ENTREZ IDs, they are collapsed by comma (`,`).
- ENSEMBL ID is used as a symbol for ENSEMBL IDs with unknown symbols.
- ENSEMBL ID is appended to symbols having multiple ENSEMBL IDs (e.g. TBCE has both ENSG00000285053 and ENSG00000284770 ENSEMBL IDs assigned -> its symbol is changed to TBCE_ENSG00000285053 and TBCE_ENSG00000284770).
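
The three symbol rules above can be sketched in base R as follows. This is an illustration under assumptions, not the pipeline's own code: `ENSG_X1`, `ENSG_X2`, `SYM1` and `SYM2` are made-up placeholders, and `collapse_symbols()` is a hypothetical helper. Only the TBCE IDs come from the text.

```r
# Hypothetical annotation table: one row per ENSEMBL ID / symbol pair.
annot <- data.frame(
  ENSEMBL = c("ENSG_X1", "ENSG_X1", "ENSG_X2", "ENSG00000285053", "ENSG00000284770"),
  SYMBOL = c("SYM1", "SYM2", NA, "TBCE", "TBCE"),
  stringsAsFactors = FALSE
)

# Rule 1: collapse multiple symbols of one ENSEMBL ID by comma.
collapse_symbols <- function(x) {
  x <- unique(x[!is.na(x)])
  if (length(x) == 0) NA_character_ else paste(x, collapse = ",")
}
symbols <- vapply(split(annot$SYMBOL, annot$ENSEMBL), collapse_symbols, character(1))

# Rule 2: use the ENSEMBL ID where the symbol is unknown.
unknown <- is.na(symbols)
symbols[unknown] <- names(symbols)[unknown]

# Rule 3: append the ENSEMBL ID to symbols shared by several ENSEMBL IDs.
shared <- symbols %in% symbols[duplicated(symbols)]
symbols[shared] <- paste(symbols[shared], names(symbols)[shared], sep = "_")

print(symbols)
```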

drake::readd(gene_annotation, path = drake_cache_dir) %>%
  head() %>%
  scdrake::render_bootstrap_table()

Show input parameters

Main config

print(config_main)

Input QC config

print(cfg)

bioinfocz/scdrake documentation built on Sept. 19, 2024, 4:43 p.m.
